Developing and Validating a Theory-Based, LLM-Graded Reflection Assessment Tool

Introduction

Reflection is a critical metacognitive process through which learners evaluate their performance, interpret direct and indirect feedback, and plan for future improvement¹. In health professions education, reflection has been shown to have positive effects on learning² and clinical skills^3–5 and to provide emotional benefits^3,6. Despite the recognized importance of reflection in medical education⁷, reflective ability is rarely assessed with theoretical rigor or in ways that scale to routine educational use. This project develops and validates a theory-based reflection rubric explicitly designed for large language model (LLM) grading.

Methods

During the 2023–2024 academic year, 119 third-year medical students participated in a reflective exercise run by the School of Medicine. The students were given dedicated curricular time to review the clinical feedback they had received and to respond to a series of short-answer questions designed to support reflection on their performance.

To assess these student responses for varying levels of reflective ability, the reflection assessment tool drew on Self-Regulated Learning (SRL) theory, specifically Zimmerman’s Cyclical Model of SRL¹. SRL is a broad and well-studied framework that describes how learners use their skills and abilities to actively manage their own learning processes^8(p23),9. It has been found to be positively associated with markers of academic achievement^10–12 and clinical skills^11,13,14, and negatively associated with symptoms of depression¹⁵. By operationalizing SRL theory into clearly defined rubric elements, the assessment was intentionally designed to support reliable grading by an LLM. Rubric development followed design principles that have been found useful in emerging LLM assessment studies^16–18.

Once the rubric was developed, a random sample of 55 student responses was graded by the project lead (PL), a second human rater (SHR), and Versa using the Chat GPT-5 framework (V). Versa scoring was done through prompt engineering, with a finalized prompt that used the created rubric as well as iterative LLM-assisted prompt engineering strategies¹⁹.

Results

Pairwise weighted Cohen’s kappas were calculated for all four items, comparing each pair of raters (PL vs V, PL vs SHR, SHR vs V). Across items, the LLM grader demonstrated substantial agreement with the project lead (Cohen’s kappa 0.71–0.75), the SHR demonstrated moderate to almost perfect agreement with the project lead (Cohen’s kappa 0.56–0.87), and the LLM grader demonstrated moderate to substantial agreement with the SHR (Cohen’s kappa 0.50–0.80). Overall, the LLM grader’s ratings were consistently aligned with the rubric, and its consistency was comparable to, or on certain items slightly better than, that of the second human rater.
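
For readers who want to reproduce this kind of agreement analysis, the following is a minimal sketch of a pairwise weighted Cohen’s kappa calculation, not the project’s actual analysis code. Several details are assumptions: the abstract does not state whether linear or quadratic weights were used (the sketch defaults to linear), the 0–3 ordinal score scale is hypothetical, and the toy ratings are invented purely for illustration. The function names are also hypothetical. The interpretation labels follow the widely used Landis and Koch bands, which match the wording of the results above.

```python
from itertools import combinations


def weighted_kappa(a, b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters scoring the same responses.

    `categories` is the ordered list of possible ordinal scores.
    `weights` is "linear" or "quadratic" (the scheme is an assumption;
    the abstract does not say which one was used).
    """
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and equal in length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint proportions of (rater A score, rater B score).
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weights == "linear" else 2

    def disagreement(i, j):
        # 0 on the diagonal, 1 for the most distant pair of categories.
        return (abs(i - j) / (k - 1)) ** power

    cells = [(i, j) for i in range(k) for j in range(k)]
    d_obs = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    d_exp = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    if d_exp == 0:
        return float("nan")  # undefined when both raters use one category
    return 1 - d_obs / d_exp


def landis_koch(kappa):
    """Map a kappa value to the conventional Landis and Koch label."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def pairwise_kappas(ratings, categories, weights="linear"):
    """Kappa for every pair of raters in a {rater_name: scores} dict."""
    return {
        (r1, r2): weighted_kappa(ratings[r1], ratings[r2], categories, weights)
        for r1, r2 in combinations(ratings, 2)
    }


if __name__ == "__main__":
    # Invented toy scores for one rubric item on a hypothetical 0-3 scale.
    toy = {
        "PL": [0, 1, 2, 3, 2, 1, 3, 0],
        "SHR": [0, 1, 2, 2, 2, 1, 3, 1],
        "V": [0, 1, 3, 3, 2, 1, 3, 0],
    }
    for pair, value in pairwise_kappas(toy, [0, 1, 2, 3]).items():
        print(pair, round(value, 2), landis_koch(value))
```

In practice this would be run once per rubric item, giving three pairwise kappas per item. Linear weights penalize a disagreement in proportion to its distance on the scale, while quadratic weights penalize large disagreements more heavily, so the choice should be reported alongside the results.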

Conclusion

This theory-based reflection rubric can be reliably graded by an LLM. The LLM demonstrated scoring consistency comparable to that of human raters, which supports scalable, rigorous assessment of reflective ability and lends further support to the LLM grading rubric and prompt development strategies used in its creation.

Contact

Nicole Thomason

[email protected]