Robust evaluation systems for language models are critical for developing and improving existing technologies. However, given the difficulty of judging a model's output for a specific task and the prohibitive cost of human judges, practitioners have begun to use Large Language Models (LLMs) as a proxy for human judgment. In practice, LLMs act as strong judges but also exhibit patterns of failure and bias when compared to human judges. Given the importance of understanding the failure points of modern LLM evaluation systems, we need a rigorous examination of where LLM-as-a-Judge can satisfactorily judge the evaluated model's performance and where it fails to perform as well as humans.
To this end, we conducted this study, including the expensive and tedious human annotation it requires. We sampled completions from several commonly used models across a variety of tasks, from finance question answering to French translation, and had both humans and LLM-as-a-Judge judge those completions. We then manually analyzed how and where LLM-as-a-Judge's judgments differ from human judgments. In total, we collected over $1300$ human annotations on over $500$ prompt-completion pairs. Most importantly, we observe $4$ consistent patterns in the quality of LLM-as-a-Judge judgments versus human judgments that evaluation system designers will need to heed, which we detail as observations below.
We arrived at these conclusions through two separate sets of experiments. Our first set of experiments included prompts from niche domains in English, such as financial and legal question answering. Our second set included prompts from generic language tasks, such as summarization in different languages. We utilized a set of $30$ human judges and LLM-as-a-Judge to annotate model performance across both sets of experiments. For each observation, we curated several examples to provide empirical and tangible evidence of LLM-as-a-Judge's biases and strengths compared to humans.
LLM-as-a-Judge is a simple method for evaluating other language models. For certain types of questions, such as multiple choice, evaluating a language model comes down to the accuracy of reporting the correct choice. However, more open-ended tasks, such as instruction following or question answering, have no single exact correct answer. For these, we use LLM-as-a-Judge: we provide an advanced LLM with the question, a potential correct answer, and the model's completion, and we ask it to evaluate the completion.
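To make this concrete, the sketch below shows one way such a judging prompt can be assembled. The template wording and the call_judge_model helper are illustrative assumptions, not the exact prompt used in our experiments.

JUDGE_TEMPLATE = """You are an impartial judge. Using the reference answer as a guide,
evaluate the correctness and quality of the assistant's answer to the question below.
Give a verdict of "correct", "partially correct", or "incorrect" with a brief justification.

[Question]
{question}

[Reference Answer]
{reference}

[Assistant's Answer]
{completion}"""

def build_judge_prompt(question: str, reference: str, completion: str) -> str:
    # Fill the template with the question, a potential correct answer, and the model completion.
    return JUDGE_TEMPLATE.format(question=question, reference=reference, completion=completion)

# judgment = call_judge_model(build_judge_prompt(q, ref, ans))  # hypothetical call to the judge LLM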
We meticulously crafted our study, starting with the careful selection of tasks. Our task set includes a wide range of domains such as Dialogue Summarization, Edit and Insert, Natural Language Inference, OpenBookQA, Summarization, Text Classification, Writing, Coding, Common Sense, Counterfactual Reasoning Estimation, Fill-in-the-middle, Generic Questions, Trivia, Math, Multi-Hop QA, and Roleplay. We sourced prompts for these tasks from reputable datasets such as HotPotQA \citep{yang2018hotpotqa}, the PILE \citep{gao2020pile}, SelfInstruct \citep{wang2023self}, and Vicuna \citep{vicuna2023}, and also crafted some prompts ourselves. To ensure a comprehensive evaluation, we translated these prompts into Hungarian, Japanese, Hindi, Arabic, and Turkish.
To generate a set of diverse completions for our tasks, we used models that could realistically be deployed in various generation settings. We began with several base models, including Llama 2 7B and Llama 3 8B, and fine-tuned them on the large, high-quality, and diverse Dolly \citep{DatabricksBlog2023DollyV2} and OpenAssistant \citep{kopf2024openassistant} datasets.
We analyze LLM-as-a-Judge by comparing its judgments against manual annotations from an internal set of $30$ human judges, ranging from students to software engineers. The judges were not shown one another's judgments, to prevent bias. To generate LLM-as-a-Judge judgments, we use the FastChat API \citep{zheng2023judging}, which prompts LLM-as-a-Judge with a standardized judging prompt and fixed generation settings.
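For the pairwise comparisons reported below, a minimal sketch of such a judging call is given here. The prompt only paraphrases the style of FastChat-like pairwise judging prompts, and call_judge_model is a hypothetical stand-in for whatever backend serves the judge model.

PAIRWISE_TEMPLATE = """Please act as an impartial judge and compare the responses of two
assistants to the user question below. Answer "A" if Assistant A is better, "B" if
Assistant B is better, or "tie" if they are comparable, then briefly explain why.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}"""

# Generation settings are held fixed across all judgments for consistency.
GENERATION_SETTINGS = {"temperature": 0.0, "max_tokens": 512}

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    prompt = PAIRWISE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    # call_judge_model is a hypothetical wrapper around the judge LLM's API.
    return call_judge_model(prompt, **GENERATION_SETTINGS)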
Observation 1. When the domain of the task is niche and the human judge is not well-versed in the domain, the human judge often defaults to picking randomly, but LLM-as-a-Judge makes more consistent choices.
In cases where the task involves specialized knowledge, such as coding, non-expert humans tend not to make meaningful judgments. They often judge randomly in these settings, resulting in ties. LLM-as-a-Judge, however, tends to be more reasonable and to make better-informed decisions than non-expert humans.
For example, when LLM-as-a-Judge and humans are asked to judge two models' solutions to a dynamic programming problem, humans tend to have neither the patience nor the expertise to provide consistent judgments. We see this with the sample prompt:
Prompt: Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.
Model 1
def longest_common_subsequence(input_str1, input_str2):
    # Traverse the input strings
    # Initialize two pointers to the start and end of the subsequence
    start_ptr, end_ptr = 0, 0
    # Loop through the elements of the input strings
    for element in input_str1:
        # If the element is the start of a subsequence, update the pointers
        if element <= end_ptr or element >= start_ptr:
            start_ptr += 1
        else:
            end_ptr += 1
        # If the element is the end of a subsequence, update the pointers
        if element >= end_ptr:
            end_ptr -= 1
    # Find the length of the longest common subsequence
    max_len = len(subsequence)
    if max_len <= len(subsequence):
        raise ValueError("Unable to find the longest common subsequence")
    # Return the length of the longest common subsequence
    return max_len
Model 2
def longest_common_subsequence(str1, str2):
    # Find the indices of the common characters
    common_indices = [i for i in range(len(str1) - 1) if str1[i] != str2[i]]
    # Compute the length of the longest common subsequence
    length = 0
    # Iterate over the input strings
    for index in common_indices:
        if index != len(str1):
            # Update the length of the subsequence
            length += 1
        else:
            # If the subsequence is found, return its length
            return length
    # Return the longest common subsequence
    return longest_common_subsequence(str1, str2)
In this setting, LLM-as-a-Judge recognizes that both model answers have incorrect logic and denotes this as a tie; our human judges are not able to catch this and end up picking random winners.
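For reference only, and not part of either judged completion, a standard dynamic-programming solution to this prompt (returning the LCS length, as both model answers attempt to do) is sketched below; comparing against it makes clear why a tie on correctness is the reasonable verdict.

def longest_common_subsequence(str1: str, str2: str) -> int:
    """Return the length of the longest common subsequence of str1 and str2."""
    m, n = len(str1), len(str2)
    # dp[i][j] holds the LCS length of str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Example: longest_common_subsequence("ABCBDAB", "BDCABA") == 4 (e.g., "BCBA")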
Observation 2. When the task necessitates long completions, human judgments become noisier. However, LLM-as-a-Judge can better attend to longer answers and give a more complete judgment.
Simply put, human judges tend to have neither the patience nor the attention to properly digest and evaluate long completions, and they give random judgments in such cases. LLM-as-a-Judge reads long completions more accurately and gives more consistent judgments. For example, we found that when the prompt asks the models to summarize long conversations, LLM-as-a-Judge detects errors and inconsistencies in the summaries more accurately than humans do.
Observation 3. LLM-as-a-Judge tends to lack the knowledge to notice nuanced details in niche domains, while expert humans can more consistently give judgments in their domains of expertise.
While LLM-as-a-Judge can outperform non-expert humans in niche domains and judge answers holistically, expert humans pay closer attention to individual details. LLM-as-a-Judge tends to miss these details and gives more varied judgments, whereas expert humans give more consistent judgments with consistent reasoning.
In the following example, a prompt asks for treatment options for hepatomegaly with diffuse parenchymal disease. Among the options given by the model, corticosteroids are listed. However, corticosteroids are not an appropriate treatment for hepatomegaly. A human judge with medical expertise can quickly identify this mistake and mark the response as incorrect, whereas LLM-as-a-Judge does not identify it, missing the individual detail because it lacks detailed domain knowledge.
Prompt: Diagnosed with hepatomegaly with diffuse parenchymal disease. Treatment advise?
Model
[Text shortened for brevity] … Corticosteroids (e.g., prednisone) to reduce inflammation and prevent liver damage …
LLM-as-a-Judge marks the entire completion as correct, but the human judge with medical expertise catches this error.
Observation 4. Across many domains and tasks, LLM-as-a-Judge is a poor judge in low-resource non-English languages, missing important details. However, humans who know that language are much more consistent judges.
In our observations, the reasoning capabilities of LLM-as-a-Judge in low-resource non-English languages (where low-resource is defined by the percentage of the language's data seen during pre-training) tend to degrade, whereas humans maintain their reasoning capabilities across languages, leading to better human judgments on such tasks. For example, when the prompt asks in Chinese, "How can Goldbach's conjecture be proved or disproved?", the model completion claims that the mathematician Michael Atiyah announced a proof of the conjecture. However, Michael Atiyah had nothing to do with proving Goldbach's conjecture. The human judge identifies this mistake and marks the response as incorrect, but LLM-as-a-Judge does not pay attention to this detail and mistakenly marks the whole response as correct.
Prompt: 如何证明、或证否哥德巴赫猜想?
Model Completion
尽管有许多重要的进展,但哥德巴赫猜想的证明仍然是一个开放的问题。在2013年,数学家迈克尔·阿特尔(Michael Atiyah)宣称他已经发现了哥德巴赫猜想的证明,但他的证明在数学界引起了广泛的争议
[Translation: Despite many important advances, a proof of Goldbach's conjecture remains an open problem. In 2013, the mathematician Michael Atiyah claimed to have discovered a proof of Goldbach's conjecture, but his proof sparked widespread controversy in the mathematical community.]
We systematically studied where and when human and LLM-as-a-Judge judgments differ the most. Our observations hold across several tasks, domains, and languages. We find that in certain situations, LLM-as-a-Judge is a very reasonable evaluation system, especially on generic English tasks. However, in several niche domains and languages, human judges are far superior, and more careful evaluation system design is necessary. We hope these observations will inform the design of future language evaluation systems.