Using LLMs (large language models) as judges is an exciting topic in AI research and application. When used as judges, LLMs aren’t generating answers; they act as evaluators, judging whether another model’s response is correct, helpful, or high quality for a given prompt. This evaluation method is gaining traction because it offers a scalable, efficient way to assess LLM outputs, especially in scenarios where human evaluation would be too slow, inconsistent, or expensive.
Let’s set more context around what this concept is and how it is typically used.
Two Main Evaluation Methods for LLMs
Typically, LLMs are evaluated in two formats:
1. Using Automatic Benchmarks
This includes benchmarks like MMLU and BBH. In this format, the evaluation is based on either multiple-choice questions or tasks such as text summarization, where:
- A reference answer is provided.
- The output from the LLM is compared against the reference.
- The evaluation checks whether the LLM’s response is correct or not.
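To make the mechanics concrete, here is a minimal sketch of how a multiple-choice benchmark score might be computed. The question format, the `ask_model` callable, and the answer-extraction regex are illustrative assumptions, not any benchmark’s official harness.

```python
# Minimal sketch of automatic benchmark scoring (MMLU-style multiple choice).
# The data format and `ask_model` callable are assumptions for illustration.
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone A-D letter out of the model's response."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return match.group(1) if match else None

def score_benchmark(questions: list[dict], ask_model) -> float:
    """Compare each model answer against the reference and return accuracy."""
    correct = 0
    for q in questions:
        prediction = extract_choice(ask_model(q["prompt"]))
        if prediction == q["reference"]:  # reference answer, e.g. "C"
            correct += 1
    return correct / len(questions)
```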
This approach has its uses, particularly in structured settings. However, as LLMs have been developing rapidly, there is some risk of test set contamination.
2. Using Human Evaluators
The idea behind human evaluation, which is typically done through A/B tests, is that:
- Two LLMs are compared: one is the model you want to test, and the other serves as a reference.
- A set of prompts is sent to each of the models.
- The responses are collected and then presented to different human evaluators.
- Evaluators are asked to decide whether they prefer one model’s response over the other or if the responses are about the same.
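As a rough sketch of how such A/B votes might be aggregated (the vote format is an assumption for illustration):

```python
# Sketch of tallying A/B preferences from human evaluators.
# Each vote is "A", "B", or "tie"; this format is assumed for illustration.
from collections import Counter

def tally_votes(votes: list[str]) -> dict:
    counts = Counter(votes)
    total = len(votes)
    return {
        "model_A_win_rate": counts["A"] / total,
        "model_B_win_rate": counts["B"] / total,
        "tie_rate": counts["tie"] / total,
    }

print(tally_votes(["A", "B", "A", "tie", "A", "B"]))
```

In practice the responses are anonymized and their order randomized, so evaluators cannot tell which model produced which answer.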
Right now, this is seen as the gold standard in LLM evaluation.
Challenges with Human Evaluation
There are two problems with human evaluation:
- It is very costly to set up.
- It is not very scalable: you need a lot of evaluators to see a statistical signal, especially when the models are of roughly equal quality.
The Concept of LLMs as a Judge
To help automate the process, evaluations can be done using an LLM as the judge.
- Instead of human evaluators, another LLM is tasked with judging responses.
- A rubric or set of criteria is defined to guide the evaluation.
- This automation serves as a proxy for human evaluation, making the process cheaper.
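Here is a minimal sketch of what such a judge call can look like. The rubric wording and the `call_judge_model` placeholder are assumptions for illustration, not a specific API.

```python
# Minimal sketch of an LLM-as-a-judge call. `call_judge_model` is a placeholder
# for whatever chat-completion client you use; the rubric wording is illustrative.
JUDGE_RUBRIC = """You are an impartial judge. Compare the two responses on:
1. Correctness  2. Helpfulness  3. Clarity
Answer with exactly one of: "A", "B", or "tie", then a one-sentence justification."""

def judge_pair(prompt: str, response_a: str, response_b: str, call_judge_model) -> str:
    judge_prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"[User prompt]\n{prompt}\n\n"
        f"[Response A]\n{response_a}\n\n"
        f"[Response B]\n{response_b}\n\n"
        "Which response is better?"
    )
    return call_judge_model(judge_prompt)
```

In practice it is common to run each comparison a second time with the response order swapped and average the verdicts, since judge models can exhibit position bias.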
Efficiency and Practical Considerations in Human Evaluation
Using an LLM as a judge is primarily done for efficiency. For context, let’s look at how many prompts are typically run and how many people are needed when the evaluation is done purely by humans.
Human Evaluation Requirements
- Typically, you need at least 20 to 30 participants to begin to see a statistical signal.
- In terms of the number of prompts, this can vary. Typically, the more prompts you have in your evaluation set, the better it is, but there’s a trade-off: human evaluators begin to get fatigued.
The Challenge of Fatigue in Human Evaluation
If you have too many prompts and you’re showing all of them to human evaluators, they begin to get fatigued when choosing which model is better.
For example, consider a prompt with a long context. It’s going to be fatiguing for the human to:
- Read through the entire prompt,
- Read the LLM responses, and
- Choose the best one.
We have to be intentional when choosing what prompts we actually want to include in our evaluation set so that we get fair, unbiased judgment from the humans.
To illustrate, the order of magnitude is hundreds of prompts. Popular benchmarks can range anywhere from 500 to 800, even up to 1,500 prompts. Given that, it’s not really scalable for humans to do the evaluation, and we’re trying to automate it with LLMs.
Accuracy Concerns with LLM Judges
People often think that LLMs have a tendency to hallucinate and aren’t very accurate.
That poses the question of whether accuracy is a concern when it comes to leveraging an LLM as a judge to help automate the evaluation process.
The answer is that it depends on which audience is trying to evaluate whether the LLM is “accurate” or not.
- For a general layperson, or a human evaluator without deep expertise, the LLM will typically have more in-depth domain knowledge about a particular topic.
- For example, if we’re asking GPT-4 to evaluate “how to write a dynamic programming algorithm,” a typical human evaluator may not even know what that is. From that perspective, LLMs are very accurate.
- An expert in the field will usually have much more knowledge than the LLM and will be able to critique its output.
- For example: “GPT-4 gave a pretty naïve algorithm; there’s a better way to do it.”
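As a hypothetical illustration of that kind of critique (not an actual GPT-4 answer), compare a correct-but-naïve recursive solution with a dynamic-programming one:

```python
# Hypothetical illustration of the kind of gap an expert might flag:
# a correct-but-naive recursive solution versus a dynamic-programming one.

def fib_naive(n: int) -> int:
    """Exponential time: recomputes the same subproblems over and over."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n: int) -> int:
    """Linear time: builds each subproblem's result once, bottom-up."""
    if n < 2:
        return n
    prev, curr = 0, 1
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr
    return curr
```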
Generally, GPT-4 and the LLMs used in these LLM-as-a-judge frameworks tend to give pretty good reasoning.
Common LLM Models Used as Judges
GPT-4 is the model typically used as a judge in LLM-as-a-judge frameworks. As a large foundation model, it is well suited to evaluating core chat quality.
For example, it can be used to figure out whether the style of one model is better than another, which may often be an ill-defined problem in chat settings.
This approach can also be applied to specific domains. For example, if the focus is on the medical domain, GPT-4 can be prompted to act as a medical expert, and it will tailor its responses to fulfill that kind of request.
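A minimal sketch of that domain tailoring, with illustrative wording only:

```python
# Sketch of tailoring the judge to a domain by prompting it to act as an expert.
# The wording is an illustrative assumption, not a standard prompt.
def make_domain_judge_rubric(domain: str) -> str:
    return (
        f"You are an experienced {domain} expert acting as an impartial judge. "
        "Evaluate the two responses for factual accuracy, safety, and clarity "
        f"from a {domain} practitioner's point of view. "
        'Answer with exactly one of: "A", "B", or "tie".'
    )

medical_rubric = make_domain_judge_rubric("medical")
```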
There is ongoing debate about whether it makes sense to fine-tune judges instead of using GPT-4. Right now, this remains an open research question.
Some papers have explored creating models designed exclusively for judging. Given how much research has gone into them, they are actually pretty decent, but they are not at the level of the current GPT-4, with the exception of the recent Llama 405B.
Two Main Parameters for Judging an LLM Judge
When evaluating the performance of a judge model, there are two main criteria:
1. Statistical Confidence in Rankings
When ranking a list of models using a particular judge, it should clearly separate out the performance of all models with some level of confidence.
For example, if one model has an average performance of 40% ±2% and another of 35% ±2%, it can be said with confidence that the first model is better than the second model.
An LLM-as-a-judge model should be able to give that kind of guarantee.
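A minimal sketch of that separation check, using a normal-approximation confidence interval over per-prompt scores (the 95% z-value and the data shape are illustrative assumptions):

```python
# Sketch of the separation check: compute a 95% confidence interval on each
# model's average score and see whether the intervals overlap. The scores are
# illustrative per-prompt win indicators (1 = judged better, 0 = not).
import math

def mean_and_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width

def clearly_separated(scores_a: list[float], scores_b: list[float]) -> bool:
    mean_a, ci_a = mean_and_ci(scores_a)
    mean_b, ci_b = mean_and_ci(scores_b)
    # Separated if the intervals do not overlap, e.g. 40% ± 2% vs 35% ± 2%.
    return abs(mean_a - mean_b) > (ci_a + ci_b)
```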
2. Alignment with Human Preferences
The judge should also correspond to human judgment about which model is better. Even if a judge model can separate out the performance of different models, if its evaluations do not correlate with human preferences, then it is not providing the signal that is wanted.
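One simple way to measure this, sketched below with an assumed data format, is the agreement rate between the judge’s pairwise verdicts and human verdicts on the same comparisons:

```python
# Sketch of checking judge/human alignment as agreement on pairwise verdicts
# ("A", "B", or "tie"). The data format is an assumption for illustration.
def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```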
Establishing Ground Truth for Evaluations
To achieve alignment with human preference, a ground truth needs to be established.
There are frameworks out there that can do this, like Chatbot Arena. It is a large, online-hosted platform featuring different closed- and open-source models.
It runs a large-scale experiment: two anonymized models are shown side by side, users type any prompt they like, and users then choose which model’s response is better. It is mostly a community-driven platform.
When using an LLM as a judge, its evaluations should correlate with those rankings, since we’ve taken them as a source of ground truth, at least for now.
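Chatbot Arena aggregates these pairwise votes into Elo-style ratings; the sketch below shows a simplified online Elo update as a rough stand-in for that aggregation (the real pipeline is more involved):

```python
# Minimal online Elo update over pairwise votes, as a rough stand-in for how a
# platform like Chatbot Arena turns "which model is better" votes into a
# leaderboard. The K-factor and starting rating are illustrative.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str,
               outcome: float, k: float = 32.0) -> None:
    """outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie."""
    ratings.setdefault(model_a, 1000.0)
    ratings.setdefault(model_b, 1000.0)
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - exp_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - exp_a))
```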
How SambaNova Uses LLM as a Judge
At SambaNova, we are benchmarking lots of different models and trying to help our customers think about the landscape. For example, we use LLM-as-a-judge in our Composition of Experts (CoE) development.
What is CoE?
- Composition of Experts (CoE) is an approach that stitches together multiple small models with a router.
- The goal is to get the performance of a much larger model, but at the inference cost of a very small model.
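A toy sketch of the idea follows; the keyword router and expert names are placeholders, not SambaNova’s actual implementation:

```python
# Toy sketch of Composition of Experts: a router picks a small expert model per
# prompt. The keyword router and expert names are placeholders for illustration.
EXPERTS = {
    "coding": "code-expert-7b",
    "math": "math-expert-7b",
    "finance": "finance-expert-7b",
    "general": "general-chat-7b",
}

def route(prompt: str) -> str:
    """Very rough keyword routing; a real router would itself be a trained model."""
    lowered = prompt.lower()
    if any(w in lowered for w in ("def ", "compile", "bug", "python")):
        return EXPERTS["coding"]
    if any(w in lowered for w in ("integral", "prove", "equation")):
        return EXPERTS["math"]
    if any(w in lowered for w in ("stock", "interest rate", "portfolio")):
        return EXPERTS["finance"]
    return EXPERTS["general"]

def answer(prompt: str, call_model) -> str:
    """Dispatch the prompt to the routed expert via a user-supplied client."""
    return call_model(route(prompt), prompt)
```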
Challenges in CoE Development
The main challenge is finding out which models are good at specific tasks. This is not always trivial.
While traditional benchmarks can be used and automated to choose models based on their performance, they raise the concern of test contamination.
Instead, what we do is use LLM-as-a-judge and leave it to the LLM to figure out which expert is best in a particular domain (whether it be mathematics, coding, finance, or legal).
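A sketch of how those judge verdicts might be tallied to pick an expert per domain (the data shape and model names are assumptions for illustration):

```python
# Sketch of picking the best expert per domain from judge verdicts.
# Each record is (domain, winning_model); the shape is an assumption.
from collections import Counter, defaultdict

def pick_experts(judge_wins: list[tuple[str, str]]) -> dict:
    wins_by_domain = defaultdict(Counter)
    for domain, winner in judge_wins:
        wins_by_domain[domain][winner] += 1
    return {domain: counts.most_common(1)[0][0]
            for domain, counts in wins_by_domain.items()}

print(pick_experts([("math", "math-expert-7b"), ("math", "general-chat-7b"),
                    ("math", "math-expert-7b"), ("legal", "legal-expert-7b")]))
```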
Broader Impact
What LLM-as-a-judge has helped us do is select which expert was good for each domain and then build the CoE.
This information is incredibly helpful to our customers and to developers globally in understanding all the nuance emerging in this space as we actively grow the use cases being deployed.