The evaluation of Large Language Models (LLMs) is critical for driving innovation in AI, yet as models become more sophisticated, conventional benchmarks such as MMLU are increasingly inadequate for assessing the diverse, nuanced ways LLMs are used in the real world. The gold standard for evaluating LLM performance remains human preference studies, typically conducted as controlled experiments in which participants rank responses from a test model and a reference model against a predefined rubric. Unfortunately, this style of evaluation is slow and expensive to run. As a result, the LLM community has shown that LLM-as-a-Judge [2] can replace human evaluators by using closed-source LLMs (such as GPT-4o [4] and Claude 3.5 [3]) to judge other models.
While LLM-as-a-Judge offers an attractive alternative to human evaluation, closed-source LLMs impose limitations on this evaluation framework. These include:
- Lack of transparency into the judge model itself.
- Limited reproducibility, due to restricted access and changing content policies.
- The cost of paid API calls for every judgment.
We expand the potential of using LLMs as judges by demonstrating that open-weight models, such as Llama 3.1 405B Instruct [6], align more closely with human preferences than GPT-4o on Chatbot Arena [9]. Furthermore, we show that Llama 3.1 405B can effectively replace GPT-4o in two of the most prominent LLM-as-a-Judge benchmarks: Alpaca-Eval [7] and MT-Bench [2]. These findings broaden the applicability of open-source models in evaluation frameworks. All evaluations were run using SambaNova's recently announced free Cloud API.
Our study evaluates GPT-4o and the Llama 3.1 Instruct models as judges on the human preference dataset from the LMSYS Chatbot Arena [8], which consists of prompts and paired model responses with human preference labels collected on Chatbot Arena [9].
| Model | Agreement Rate with Human Judges |
|---|---|
| Llama 3.1 405B Instruct | 72.8% |
| GPT-4o | 71.9% |
| Llama 3.1 70B Instruct | 71.7% |
| Llama 3.1 8B Instruct | 65.8% |
To obtain these results, we take each prompt and its pair of model responses from the dataset, apply the template from the appendix, and query both Llama 3.1 405B and GPT-4o as judges. We find that Llama 3.1 405B agrees with human preferences at least as well as GPT-4o. Prompts that either the human annotator or the judge model rates as a tie are excluded from the agreement calculation.
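As a rough sketch of this procedure (not the actual evaluation code), the snippet below computes the agreement rate from judge verdicts and human labels while skipping ties on either side; `judge_preference` and the field names are hypothetical placeholders.

```python
# Minimal sketch of the agreement-rate computation described above.
# `judge_preference(prompt, response_a, response_b)` is a hypothetical helper
# that fills the judge template and returns "A", "B", or "tie".

def agreement_rate(examples, judge_preference):
    """examples: iterable of dicts with keys 'prompt', 'response_a',
    'response_b', and 'human_label' ("A", "B", or "tie")."""
    agree, counted = 0, 0
    for ex in examples:
        verdict = judge_preference(ex["prompt"], ex["response_a"], ex["response_b"])
        # Prompts that either the human or the judge rates as a tie are skipped.
        if ex["human_label"] == "tie" or verdict == "tie":
            continue
        counted += 1
        agree += int(verdict == ex["human_label"])
    return agree / counted if counted else float("nan")
```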
We categorized the prompts in the human preference dataset into domains using a domain classifier. As illustrated in Figure 1, the agreement rate remains consistent across domains, from simple prompts to more complex areas such as mathematics and coding.
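As one hedged illustration of this categorization step, prompts can be labeled by querying an LLM against a fixed category list; the `llm_complete` helper, the model behind it, and the category list below are assumptions rather than the classifier actually used in this study.

```python
# Illustrative LLM-based domain labeling; the category list and the
# `llm_complete(prompt) -> str` helper are assumptions for this sketch.
DOMAINS = ["math", "coding", "writing", "reasoning", "general chat"]

def classify_domain(prompt_text: str, llm_complete) -> str:
    instruction = (
        "Classify the following user prompt into exactly one of these domains: "
        + ", ".join(DOMAINS)
        + ". Reply with the domain name only.\n\n"
        + prompt_text
    )
    label = llm_complete(instruction).strip().lower()
    return label if label in DOMAINS else "general chat"
```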
To test whether our findings on replacing GPT-4 with Llama 3.1 405B as a judge generalize, we conduct experiments similar to our human preference study on two popular benchmarks: Alpaca-Eval [7] and MT-Bench [2].
We first test our hypothesis on Alpaca-Eval, a popular chat benchmark that is used as a proxy for performance on Chatbot Arena. Alpaca-Eval includes a set of roughly 2.5K human annotations (~650 instructions, each with 4 human annotations) on pairs of model completions. For each pair of completions, we ask the LLM judge which one is preferred, and we assess human agreement by measuring how frequently the judge's choice matches the majority of the human annotations (see the sketch below). We evaluate Llama 3.1 8B, 70B, and 405B as judge models to understand how well they agree with human preferences.
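Since each instruction carries roughly four human annotations, the judge's choice is compared against their majority vote; a minimal sketch of that aggregation step follows (the field values are illustrative, not Alpaca-Eval's exact schema).

```python
# Majority vote over the ~4 human annotations attached to one instruction.
from collections import Counter

def human_majority(annotations):
    """annotations: list of human preferences, e.g. ["1", "2", "1", "1"]."""
    return Counter(annotations).most_common(1)[0][0]

# human_majority(["1", "2", "1", "1"]) -> "1"
```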
In Table 2, we show the human agreement rate, Spearman correlation, and Pearson correlation for GPT-4 Turbo, Claude, humans, and the different Llama 3.1 variants queried using SambaNova's Cloud API. The agreement rate for the 70B and 405B judges hovers around ~68%, with a Spearman correlation of 0.933 for both and Pearson correlations of 0.877 (70B) and 0.805 (405B). These numbers are competitive: GPT-4 Turbo achieves ~71% agreement, a Spearman correlation of 0.95, and a Pearson correlation of 0.944, close to the numbers achieved by Llama 3.1 70B and 405B. This suggests that Llama 3.1 70B/405B can serve as a drop-in replacement for GPT-4 and other closed-source judges.
| Model | Human Agreement Rate (%) | Spearman Correlation | Pearson Correlation |
|---|---|---|---|
| alpaca_eval_gpt4_fn | 70.98 | 0.95 | 0.944 |
| Llama 3.1 405B Instruct | 67.87 | 0.933 | 0.805 |
| Llama 3.1 70B Instruct | 68.25 | 0.933 | 0.877 |
| Llama 3.1 8B Instruct | 60.34 | 0.783 | 0.815 |
| Humans | 65.7 | 1.0 | 1.0 |
| Claude | 65.31 | 0.933 | 0.902 |
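For reference, rank correlations like those in Table 2 can be computed from two aligned lists of per-model scores with scipy; the snippet below is a generic sketch rather than the exact evaluation script, and the example numbers are made up.

```python
# Spearman and Pearson correlation between two judges' per-model scores
# (e.g. win rates from alpaca_eval_gpt4_fn vs. a Llama 3.1 judge).
from scipy.stats import pearsonr, spearmanr

def judge_correlations(reference_scores, candidate_scores):
    spearman, _ = spearmanr(reference_scores, candidate_scores)
    pearson, _ = pearsonr(reference_scores, candidate_scores)
    return spearman, pearson

# Example with made-up win rates for three models:
# judge_correlations([50.0, 44.7, 30.2], [48.9, 45.1, 29.8])
```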
MT-Bench is a benchmark that tests models' multi-turn conversation capability. To evaluate the efficacy of Llama 3.1 8B, 70B, and 405B as judges, we regenerate the MT-Bench leaderboard using SambaNova's Fast API with these models as judges and compare their Spearman and Pearson correlations with the rankings obtained by GPT-4. For each model completion saved in MT-Bench, we ask the Llama judge for a rating on a scale from 1 to 10 and then measure the correlation between the Llama judge's ratings and GPT-4's ratings. As shown in Table 3, Llama 3.1 405B's Spearman/Pearson correlation with respect to GPT-4's ranking is 0.9816/0.9950, which further demonstrates that it can be an effective replacement for GPT-4 in the LLM-as-a-Judge framework.
| Model | Spearman Correlation | Pearson Correlation |
|---|---|---|
| Llama 3.1 8B Instruct | 0.9655 | 0.9865 |
| Llama 3.1 70B Instruct | 0.9757 | 0.9946 |
| Llama 3.1 405B Instruct | 0.9816 | 0.9950 |
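Turning judge completions into the per-model scores behind Table 3 requires parsing each 1-10 rating out of free-form judge output; below is a minimal sketch assuming the double-bracket rating convention that MT-Bench-style grading prompts typically request (e.g. "Rating: [[7]]").

```python
# Parse a 1-10 rating from a judge completion, assuming the judge was asked
# to answer with a double-bracketed verdict such as "Rating: [[7]]".
import re

def parse_rating(judge_output: str):
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    if match is None:
        return None  # the judge did not follow the requested format
    rating = float(match.group(1))
    return rating if 1.0 <= rating <= 10.0 else None
```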
In conclusion, the results on Chatbot Arena human preferences, Alpaca-Eval, and MT-Bench demonstrate that the open-weight Llama 3.1 405B Instruct model is a viable alternative to closed-source models such as GPT-4 and Claude for LLM-as-a-Judge evaluations. With competitive agreement rates and strong correlations with human judgments, Llama 3.1 405B proves both transparent and effective, avoiding many of the limitations of closed-source systems, such as the lack of transparency and reproducibility that comes with restricted access and content policies.
By adopting Llama 3.1 405B for free on SambaNova Cloud [1], developers and researchers gain an open, reliable, and cost-effective judge capable of producing consistent and credible judgments. As open-source LLMs continue to advance, Llama 3.1 405B could pave the way for a more democratized and accessible approach to evaluating AI models, supporting a more sustainable future for LLM research and development.
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better, as well as answering in the desired language of the user. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses.
Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Your evaluation should only focus on the correctness of the response.
After providing your explanation, output your final verdict by strictly following this format:
[[A]] if assistant A is better
[[B]] if assistant B is better
[[C]] for a tie
Prompt Template:
[User Question]
{INSERT QUESTION}
[The Start of Assistant A's Answer]
{INSERT ANSWER A}
[The End of Assistant A's Answer]
[The Start of Assistant B's Answer]
{INSERT ANSWER B}
[The End of Assistant B's Answer]
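As an illustration of how this template can be wired up, the sketch below fills it in and queries a judge through an OpenAI-compatible endpoint, then maps the [[A]]/[[B]]/[[C]] verdict back to a label. The base URL and model identifier are assumptions based on SambaNova Cloud's public documentation, not code released with this study.

```python
# Query a judge model with the pairwise template via an OpenAI-compatible API.
# The base URL and model name below are assumptions; adjust for your provider.
from openai import OpenAI

JUDGE_SYSTEM_PROMPT = "Please act as an impartial judge..."  # full instructions shown above
PROMPT_TEMPLATE = """[User Question]
{question}

[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]"""

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie' from the judge's [[A]]/[[B]]/[[C]] verdict."""
    completion = client.chat.completions.create(
        model="Meta-Llama-3.1-405B-Instruct",  # assumed model identifier
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": PROMPT_TEMPLATE.format(
                question=question, answer_a=answer_a, answer_b=answer_b)},
        ],
        temperature=0.0,
    )
    text = completion.choices[0].message.content
    if "[[A]]" in text:
        return "A"
    if "[[B]]" in text:
        return "B"
    return "tie"
```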
Here are some examples of Llama 3.1 405B judging complex prompts, illustrating how it reasons about the nuances behind user instructions when making a judgment.
Prompt: Can you give an example of a deep learning model for predicting traffic during hurricane evacuation with reference?
-
Response A:
Yes, here is an example using the popular Convolutional Neural Network model Inception v3 to generate traffic evacuation forecasts for storm Florence using satellite and weather radar data. The model was trained on data from 2017 and was tested on data from September 14-16, 2018. The results were published in the open-source Model Forecast website.
Forecast on 14 September 2018 at 15 UTC:
Traffic congestion is expected along the I-95 corridor and in the Charleston SC area due to the large number of people evacuating the area ahead of Hurricane Florence.
Forecast on 15 September 2018 at 15 UTC:
Hurricane Florence is expected to cause severe flooding and traffic congestion in the Carolinas, with some localized areas seeing up to 20 inches of rain. The huge influx of people evacuating the area ahead of the storm could also lead to severe congestion on the I-95 corridor and in the Charleston SC area.
The model accuracy on 14 September was 73.5% and on 15 September was 79.2%.
-
Response B:
There are several deep learning models that can be used for predicting traffic during hurricane evacuation, but here's an example of a convolutional neural network (CNN) model that can be used for this purpose:
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Conv2D, MaxPool2D, Flatten, Dropout
# define the model architecture
model = Sequential()
# define the input shape
model.add(Embedding(input_dim=3, output_dim=64, input_length=10)) # input shape: (batch_size, 3, 10, 3)
model.add(Conv2D(32, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Conv2D(64, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Conv2D(128, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Conv2D(256, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Dense(512, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
model.add(Dense(256, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
model.add(Dense(128, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
model.add(Dense(64, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
```
-
Llama 3.1 405B's Judgment:
After evaluating the responses, Response A is more relevant and detailed, providing a real-world example with specific forecasts and data, which makes it more appropriate for the user's request. Response B, while technically sound, lacks the context and detail needed for the prompt. Therefore, Response A is the better response.
Verdict: [[A]]
Prompt: Answer the following statements with "Agree" or "Disagree" only. Your answers should be returned in list form, in the same order they appear below, with the question number and answer only. Do not give an explanation. Do not skip any question. You must answer every question with agree or disagree, you can't be neutral or undecided.
Freedom of business is the best practical way a society can prosper.
The harder you work, the more you progress up the social ladder.
Economic inequality is too high in the world.
Charity is a better way of helping those in need than social welfare.
Land should not be a commodity to be bought and sold.
The current welfare system should be expanded to further combat inequality.
All industry and the bank should be nationalised.
Class is the primary division of society.
Taxation of the wealthy is a bad idea, society would be better off without it.
A government that provides for everyone is an inherently good idea.
Wages are always fair, as employers know best what a worker's labor is worth.
Communism is an ideal that can never work in practice.
It is "human nature" to be greedy.
"Exploitation" is an outdated term, as the struggles of 1800s capitalism don't exist anymore.
Organizations and corporations cannot be trusted and need to be regulated by the government.
-
Response A
Disagree
Disagree
Agree
Disagree
Agree
Agree
Disagree
Disagree
Disagree
Agree
Disagree
Disagree
Disagree
Disagree
Agree
-
Response B
Agree
Agree
Agree
Disagree
Disagree
Agree
Disagree
Agree
Disagree
Disagree
Disagree
Agree
Disagree
Disagree
Agree
-
Llama 3.1 405B's Judgment:
To evaluate the quality of the responses provided by the two AI assistants, I will compare their answers based on the user's instructions. The user asked for "Agree" or "Disagree" answers to 15 statements, without explanations or skipped questions. Both assistants provided answers in the required format.
However, to determine which assistant answered the user's question better, I need to consider the relevance and accuracy of their responses. Since the statements cover various topics related to economics, politics, and social issues, I will not evaluate the correctness of their opinions, but rather their ability to provide consistent and coherent answers.
After reviewing the answers, I noticed that both assistants provided a mix of "Agree" and "Disagree" responses, indicating that they attempted to evaluate each statement individually. However, without knowing the user's perspective or the context behind these statements, it's challenging to determine which assistant's answers are more accurate.
Given the lack of additional information or context, I will focus on the consistency and coherence of their responses. Both assistants provided answers that seem to reflect a consistent perspective, with some exceptions. For example, Assistant A disagreed with statements 1 and 2, which could be seen as related to conservative or libertarian ideologies, while Assistant B agreed with these statements. Similarly, Assistant A agreed with statements 3 and 6, which could be seen as related to progressive or liberal ideologies, while Assistant B had mixed responses to these statements.
Considering the above factors, I conclude that both assistants provided responses that meet the user's requirements, but their answers reflect different perspectives and biases. Since the user did not provide any context or guidance on what perspective they were looking for, I will not favor one assistant over the other.
Final Verdict: [[C]] (Tie)