The problems with running human evals
And how you can debug them

Disclaimer: The opinions stated here are my own, not necessarily those of my employer.

In my previous essay, I ranted about why benchmarks are pretty much useless for judging model use in actual products (BEFORE Satya Nadella, if I might add) and why you need custom evals. This essay is a rant about what can go wrong when you implement custom evals, especially when you use human raters.

Running evaluations (evals for short) is likely the single most important step in building an AI product. Evals tell us whether model use is actually valuable, safe, and aligned with user needs, not just good at gaming metrics. Human evaluations, in particular, catch the gaps that automated tests miss, like nuance, context and judgment.

There are multiple steps that go into running these evals:

[Image caption: Product defines the eval; eval results redefine the product]

Evals always start with product goals. Once the goals are clear, they get translated into a template that is compatible with the pool of raters, who perform a set of rating tasks based on the instructions provided to them. The results obtained from this activity are then analyzed and translated into changes to the product itself: improving the prompt being used, modifying the UX and setting better expectations with users, etc.

But what could possibly go wrong in this entire process? Here are the main things to watch out for:

#1 Your eval results are ambiguous

Result ambiguity can come in different forms. The lack of agreement among raters is the most common one. This is a well-studied problem called Inter-Rater Reliability (IRR), with its roots in psychology from the 1950s. Imagine that three raters get a rating task where they have to choose between "Yes" and "No".

[Image caption: I would've said "Yes" FWIW]

If two of the three raters said "Yes" and one said "No", how would you judge the quality of the response? While it's tempting to conclude that their level of agreement is 66.66%, it's actually 33.33%: only one of the three rater pairs agreed with each other!

In practice, IRR is measured with more versatile metrics like Krippendorff's alpha, which corrects for "fluke" agreement and handles different rater configurations. An alpha value of at least 0.67 is considered a lower bound for tentative conclusions; 0.8 is considered safe. In large-scale tech products, the threshold varies depending on the severity of what's being evaluated: a safety evaluation that identifies hate speech will have a higher alpha threshold than, say, a code formatting evaluation.
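To make the arithmetic concrete, here is a minimal sketch of pairwise percent agreement for the three-rater example above. The rating data is made up for illustration, and in practice you would reach for an off-the-shelf implementation of Krippendorff's alpha from a statistics library rather than hand-rolling the chance correction.

```python
from itertools import combinations

def pairwise_agreement(ratings: list[str]) -> float:
    """Fraction of rater pairs that gave the same label on one task."""
    pairs = list(combinations(ratings, 2))
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / len(pairs)

def mean_pairwise_agreement(tasks: list[list[str]]) -> float:
    """Average pairwise agreement across all rating tasks in an eval."""
    return sum(pairwise_agreement(t) for t in tasks) / len(tasks)

# The example from the essay: two "Yes" votes and one "No" vote.
print(pairwise_agreement(["Yes", "Yes", "No"]))   # 0.333..., not 0.666...

# A toy eval with a few tasks (made-up data).
eval_tasks = [
    ["Yes", "Yes", "No"],
    ["No", "No", "No"],
    ["Yes", "No", "No"],
]
print(mean_pairwise_agreement(eval_tasks))
```

Raw percent agreement is easy to read but doesn't correct for agreement that happens by chance, which is exactly what Krippendorff's alpha adds; the 0.67 and 0.8 thresholds above refer to alpha, not to raw agreement.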
Another type of ambiguity shows up when there are contradictory results within the same eval. Here's a bad example to illustrate:

[Image caption: Raters providing contradictory answers]

Contradictory ratings on the same task are hard to reconcile. If the eval result contains multiple contradictions, it's no longer a reliable indicator of quality. This problem usually arises from poor eval design (like asking about "too long" AND "too short") or unclear instructions given to raters.

#2 Your eval results don't align with product outcomes

In the previous problem, the raters didn't agree with each other. In this problem, they don't agree with the actual users of the product. Here's an example:

[Image caption: ChatGPT's "joke"]

Unlike the previous case, let's say all the raters thought this joke was funny (alpha = 1). Using this result, you decide to launch a new "joke" feature, only to find out that users completely hate it and don't find it funny at all! This is a classic case of rater preferences not aligning with user preferences. Fixing this problem usually costs much more (more time, more users exposed) than fixing disagreement among raters.

A lot of evals that try to assess subjective qualities like humor, interestingness and usefulness tend to suffer from this issue. Evaluating personalized content is another minefield filled with rater-user disagreement.

How to debug these issues

None of these problems calls for skipping evals in any form. Remember that even imperfect evals are still much better than living in the dark without understanding your product's quality.

The first (strategically lazy) question to ask is whether this is even your problem to solve. If it can be delegated to a partner team or vendor whose sole job is to improve IRR and eval quality, that's the first thing you should consider. However, if you're a PM or tech lead working with an eval partner, you still need to steer them towards the right outcomes without shying away from the details.

Then it's time to scrutinize the entire eval. Start with whether the evaluation is aligned with your product goals at a high level. Then look at the instructions and every single task, making sure they aren't leaving any room for error. Here are some questions to ask:

- Does every rating task correspond to product goals?
- Are the rating instructions unambiguous? Were sufficient examples provided to raters?
- Will raters accurately represent user preferences? Have they been given the appropriate context to rate responses accurately?

After making changes, dry runs within your team (or using another LLM) are a great way to test evals before sending them to the rater pool. Think of it as a QA test for your evals, which is very useful in the early stages.

Automating the evaluation is another approach that's increasingly being used by product teams. Some rating tasks can be automated with simple rules. For example, verifying the model's ability to solve a simple math or coding problem can be done by directly checking the solution in the output.

Other rating tasks might require a different model to accurately judge the work. For example, verifying the model's ability to prove a mathematical theorem will need a different LLM that's prompted to read the proof and provide a rating. Most state-of-the-art LLMs are reasonably good at these types of rating tasks, given the right context and prompting. Depending on your product needs, a mix of human raters and automated ratings, with human annotators still in the loop, is likely to yield balanced results. A sketch of both flavors of automation follows below.
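Here is a minimal sketch of the two flavors of automation, not a production grader. The rubric in the judge prompt, the `call_llm` hook and the score parsing are all assumptions you would replace with your own model client and eval template.

```python
import re
from typing import Callable

def check_exact_answer(model_output: str, expected: str) -> bool:
    """Rule-based check: pass if the expected answer appears in the output.
    Works for tasks with a single verifiable answer (e.g. '42')."""
    return expected.strip().lower() in model_output.strip().lower()

# Hypothetical hook: plug in whichever LLM client your team uses.
# It should take a prompt string and return the judge model's reply as text.
LLMCall = Callable[[str], str]

JUDGE_PROMPT = """You are grading a mathematical proof.
Problem: {problem}
Candidate proof: {proof}
Rate the proof from 1 (incorrect) to 5 (correct and rigorous).
Reply with a single line: SCORE: <number> followed by one sentence of justification."""

def llm_judge(problem: str, proof: str, call_llm: LLMCall) -> int | None:
    """LLM-as-judge check: ask a separate model to rate the work.
    Returns the parsed score, or None if the reply didn't follow the format."""
    reply = call_llm(JUDGE_PROMPT.format(problem=problem, proof=proof))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    return int(match.group(1)) if match else None

# Usage sketch (made-up data):
# print(check_exact_answer("The answer is 42.", "42"))           # True
# score = llm_judge(problem_text, candidate_proof, my_llm_call)  # e.g. 4
```

The judge's own ratings need the same scrutiny as human ones: spot-check a sample against human annotators before trusting it, since a judge model can be confidently wrong.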
The last "fix" for issues with human evals is twofold: (a) get started and (b) iterate quickly. None of these issues will surface until you have tried running your eval at least once. The nature of the issues and their underlying causes is likely to be very specific to your product's context, and you'll never know until you have personally gone through the motions. So get started right now! (And then rant about it when things go wrong.)