Saving Staff Time Without Compromising Quality

Evidence-based task distribution between humans and LLMs on formal editing of Medical Student Performance Evaluations

This abstract was part of the 2026 UCSF AI and Education Symposium

Collaborators: 
Nick Wadsworth, LJ Moore-McClelland, Sha’Kuana Ona, Polo Black Golde

Abstract:

Each year, multiple staff members in the School of Medicine (SOM) painstakingly edit student performance narratives to strict style, consistency, and accuracy standards, producing the Medical Student Performance Evaluation (MSPE), a formal summary document critical to a student’s residency applications. LLMs offer an opportunity to save time and effort in this process, but the quality bar that any time savings must clear in this setting is particularly high. In a small quality improvement study, we evaluated LLM and human editing on several measures in a novel rubric to determine which tasks are appropriate for LLM assistance. We measured inter-rater agreement on the rubric, then compared human and LLM editing quality in terms of averages and effect sizes. Based on these data, we are designing a hybrid workflow in which LLMs are assigned the editing tasks on which they proved reliable and high performing, followed by a faster round of human review. In this show-and-tell we’ll share our analytic methods and results, our workflow design, the technical architecture underpinning both the study and the eventual production process, and an estimate of time and budget savings to the institution. We hope this can serve as an exemplar of rigorous LLM validation for an application with an exacting standard of quality. We also hope to showcase an example of using LLMs to automate repetitive human effort in service of the education mission.
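
As an illustration of the two analytic steps named above (inter-rater agreement, then effect sizes), here is a minimal Python sketch. The abstract does not say which statistics were used, so Cohen's kappa and Cohen's d are assumptions chosen as common defaults. The function names and all scores are hypothetical and are not study data.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical scores.

    Assumption: two raters score the same items on the same rubric scale.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def cohens_d(group_1, group_2):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(group_1), len(group_2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = (
        (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical rubric scores (1-3) from two raters on six edited narratives.
    rater_a = [3, 2, 3, 1, 2, 3]
    rater_b = [3, 2, 2, 1, 2, 3]
    print("kappa:", round(cohens_kappa(rater_a, rater_b), 3))

    # Hypothetical quality scores for human-edited vs. LLM-edited narratives.
    human_scores = [4, 5, 4, 5]
    llm_scores = [3, 4, 3, 4]
    print("d:", round(cohens_d(human_scores, llm_scores), 3))
```

In a design like the one described, agreement would typically be checked first, since a comparison of human and LLM scores is only as trustworthy as the rubric's consistency across raters. Ordinal rubrics or more than two raters might call for a weighted kappa or another agreement statistic instead.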

Contacts

Nick Wadsworth, [email protected]

Sha'Kuana Ona, [email protected]

LJ Moore-McClelland, [email protected]

Polo Black Golde, [email protected]