Rater-based Assessment: Are We Doing It All Wrong?

By Kyle John Wilby, BSP, ACPR, PharmD

Setting the scene
Three assessors watched an OSCE performance and evaluated the student’s communication skills on a generic rubric with three descriptors and a five-point scale. Scores from the three assessors were 2, 3, and 5. The student therefore failed, was borderline, or aced the station based simply on who evaluated them.
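The scenario can be sketched in a few lines of Python. This is only an illustration: the assessor labels and the score-to-decision cut-offs (2 or below fails, 3 is borderline, 4 or above passes) are assumptions chosen to match the outcomes described above, not an actual OSCE standard.

```python
from statistics import mean

# Scores from the three assessors in the scenario (five-point scale).
# The assessor labels are hypothetical.
scores = {"Assessor A": 2, "Assessor B": 3, "Assessor C": 5}


def outcome(score: int) -> str:
    """Map a score to a decision using assumed, illustrative cut-offs."""
    if score <= 2:
        return "fail"
    if score == 3:
        return "borderline"
    return "pass"


# The decision each assessor's score would lead to on its own.
decisions = {name: outcome(score) for name, score in scores.items()}

# Simple summaries of the disagreement.
spread = max(scores.values()) - min(scores.values())
average = mean(scores.values())

for name, score in scores.items():
    print(f"{name}: score {score} -> {decisions[name]}")
print(f"Range: {spread} of a possible 4 points; mean: {average:.2f}")
```

Under these assumed cut-offs, the same performance yields three different decisions, and the mean alone hides why the assessors disagreed.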

What do we do now?
This situation is likely not uncommon. The literature persistently shows that assessors vary in how they interpret and judge performance and that what they focus on when observing a student is not necessarily aligned with the program’s competency framework.1 Despite being potentially problematic for pass-fail decision making, this is actually quite intuitive, as humans experience and perceive stimuli in different ways. Take sushi, for example. In any given group of people, there will be some who love sushi and enjoy the taste experience, but there will also be those who think it is appalling. Neither view is wrong (i.e., error); each is an individualized preference formed through years of tasting different types of food. The same reasoning can explain variation in communication preferences, which, in our example, is evident in the variation in scores.

Can’t we just train better?
Assessor training must be the answer, right? Well, in the example above, all three assessors were “expert” assessors who had evaluated OSCEs multiple times and received a significant amount of training. Would more training actually help? The literature seems to tell us no.1 Plus, how can we train someone to interpret communication skills in ways that do not match their own communication preferences? Furthermore, will all the patients the student eventually encounters have the same preferences as the ones we train with? Eye contact, gestures, and voice tone can all be perceived as favorable or offensive, depending on the specific communication context.2 For example, some cultures find direct eye contact inappropriate, even though we typically train students (and assessors) that this is an example of good communication. Also, people differ considerably in whether they value what was said or how it was said. Instead of attempting to ignore these differences and standardize or calibrate assessors’ judgements to reflect a program’s perception of what constitutes good communication, shouldn’t we really seek to understand how actions are interpreted differently by others? This, however, would threaten what seems to be our never-ending pursuit of reliability coefficients as close to 1.0 as possible.

What is the answer?
Instead of chasing perfect reliability in assessment and standardizing assessors to think and act the same way, we must embrace assessor variability, capture it, and use it to inform assessment decisions, as well as to provide rich feedback to students. We should, perhaps, be striving for saturation of assessment data in order to identify patterns or red flags in student performance that need remediation prior to program completion. Saturation, typically associated with qualitative research, simply means gathering data until no new patterns or ‘themes’ emerge. As such, we must develop new assessment methods that account for variability and allow chief examiners, program directors, or competency committees to make decisions based on multiple data sources. Work has already been done to investigate the use of qualitative assessment methods (such as narrative), which are able to capture rich performance information, including variability in assessor judgements, and have been shown to promote credibility in decision-making.1,3 Narrative may also flag behavioral problems early in communication training, which may assist remedial efforts before students enter practice-based experiences.3 We should continue to explore, refine, and perfect these options in order to achieve authentic and accurate assessment.

Moving forward…
Embracing assessor variability and developing qualitative assessment tools may be difficult for some pharmacy educators, but we must collectively begin to think about assessment differently. In the example above, receiving three different numerical scores was troubling and did not provide clear guidance as to whether the student communicated effectively with the patient. Capturing assessors’ reasoning behind the scores using narrative, however, would have provided the insight necessary to better understand student performance and allowed for a defensible justification of any pass-fail decision. It would also have helped identify any problematic behaviors recurring across multiple cases within the OSCE to support decision-making.

This debate will likely continue to create sparks about how we overcome (or embrace) variability in rater-based assessment, but the pursuit of accurate and authentic assessment should be our goal. Reliability has traditionally been regarded as a gold standard marker of good assessment, but we are now identifying limitations with this approach, as we can have very reliable scores that are actually completely inaccurate. Pursuing alternative approaches to assessment may require thinking ‘out-of-the-box’, but shouldn’t we act in the best interest of our students and, ultimately, our patients?


1. Eva KW. Cognitive influences on complex performance assessment: Lessons from the interplay between medicine and psychology. J Appl Res Mem Cogn. 2018;7:177-188.

2. Bonaccio S, O’Reilly J, O’Sullivan SL, Chiocchio F. Nonverbal behavior and communication in the workplace: A review and an agenda for research. J Manage. 2016;42(5):1044-1074.

3. Ginsburg S, van der Vleuten CPM, Eva KW. The hidden value of narrative comments for assessment: A quantitative reliability analysis of qualitative data. Acad Med. 2017;92(11):1617-1621.

Kyle John Wilby is an Associate Professor at the School of Pharmacy, University of Otago in Dunedin, New Zealand. Educational scholarship interests include assessment, including assessor cognition and cultural influences on judgement. In his free time, Kyle enjoys running up hills, barbequing, and living in new countries. You can follow him on Twitter @KJ_Otago.

Pulses is a scholarly blog supported by Currents in Pharmacy Teaching and Learning

1 Comment

  1. Well said, Dr. Wilby!
    Regarding rater training, if you do not have it in your files, might I recommend Newble et al. The selection and training of examiners for clinical examinations. Med Educ. 1980 (It is an RCT of rater training).

    Regarding quantitative reliability and qualitative saturation, I would be cautious.

    First, I completely agree with you that targeting a quantitative reliability of 1 is problematic (and likely counterproductive). However, quantitative reliability is more complex. Importantly, if the stakes of an assessment are high, you should want (ethically) to be fair, and so reliable, with your assessment (see the Standards for Educational and Psychological Testing). Furthermore, as numerous generalizability theory analyses show, inter-rater issues pale in comparison to other sources of variability in examination scores (as in citation above). Some indices of reliability (internal consistency, inter-rater reliability, etc.) are too simplistic for a performance assessment with more moving parts (items, raters, stations… by the way, station-related variation is MUCH larger than inter-rater variation, though simple reliability indices do not measure it).

    Second, in lower-stakes testing, I would not throw out rigor. Quantitative reliability needs to be REPLACED with qualitative validation strategies. Saturation is not itself a validation strategy, but it can provide evidence of thoroughness & confirmability. See Driessen et al. The use of qualitative criteria for portfolio assessment as an alternative to reliability evaluation. Med Educ. 2005. (…note the sheer number of validation strategies used for this higher-stakes assessment that was a barrier to program progression)

    Third, I call a middle ground a “mixed approach” to creating rater rubrics [Peeters. Measuring rater judgments within learning assessments–Part 2, Curr Pharm Teach Learn. 2015]. The substantial variation you observed in your blog post is all too common for analytic rubrics, but combining them with holistic ratings can help reduce that variation. If you are observing substantial variation with holistic ratings, you need to re-evaluate more than just a rubric.

    Once again, thank you, Dr. Wilby. You discussed important insights (and shortcomings with simplistic quantitative reliability). Even more, your experience and insight on assessing students’ cultural competence performance is helpful!

