Blog

Replacing the Judge: Can Llama 405B Outperform GPT4 in the Court of AI?

Written by Ravi Raju | September 20, 2024

The evaluation of Large Language Models (LLMs) is critical for driving innovation in AI, yet as models become more sophisticated, conventional benchmarks such as MMLU are increasingly inadequate for assessing the diverse and nuanced real-world use cases of LLMs. The gold standard for evaluating LLM performance remains human preference studies, typically conducted through controlled experiments where participants rank responses from the test model and a reference model against a predefined rubric. Unfortunately, this style of evaluation is slow and expensive to run. As a result, the LLM community has shown that LLM-as-a-Judge [2] can replace human evaluators by using closed-source LLMs (such as GPT-4o [4] and Claude 3.5 [3]) to judge other models.

While LLM-as-a-Judge offers a favorable alternative to human evaluations, closed-source LLMs impose several limitations on this evaluation framework:

  • Lack of transparency, due to the black-box nature of closed-source models
  • High usage costs
  • Dependency on a single provider, which limits replicability

We expand the potential of using LLMs as judges by demonstrating that open-weight models, such as Llama 3.1 405B Instruct [6], align more closely with human preferences than GPT-4o on Chatbot Arena [9]. Furthermore, we show that Llama 3.1 405B can effectively replace GPT-4o in two of the most prominent LLM-as-a-Judge benchmarks: Alpaca-Eval [7] and MT-Bench [2]. These findings broaden the applicability of open-weight models in evaluation frameworks. All evaluations were run through SambaNova's recently announced free Cloud API [1].
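
The judging pipeline itself is just a chat-completion call against the API. Below is a minimal sketch of such a call using an OpenAI-compatible Python client; the base URL and model identifier are illustrative assumptions, and the exact values should be taken from the Cloud Developer Portal [1].

Python

# Minimal sketch, assuming an OpenAI-compatible endpoint; the base URL and model
# name below are illustrative and should be checked against the Cloud Developer Portal.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-405B-Instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Say hello."}],
    temperature=0.0,  # deterministic output for judging
)
print(response.choices[0].message.content)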

Agreement With Human Preferences

Our study evaluates GPT-4o and the Llama 3.1 Instruct models as judges on the human preference dataset from LMSYS Chatbot Arena [8], which consists of prompts with human preference labels collected on Chatbot Arena [9].

Model                      Agreement Rate With Human Judge
Llama 3.1 405B Instruct    72.8%
GPT-4o                     71.9%
Llama 3.1 70B Instruct     71.7%
Llama 3.1 8B Instruct      65.8%

Table 1: Agreement rate between humans and the LLM judge on the human preference data from Chatbot Arena [9]

To obtain these results, we take each prompt and its pair of model responses from the dataset and apply the template from the appendix to query both Llama 3.1 405B and GPT-4o as judges. We exclude prompts that either the human annotator or the judge model rates as a tie. We find that Llama 3.1 405B agrees with human preferences at least as well as GPT-4o.
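
Concretely, the agreement rate reduces to a filtered match count. A minimal sketch, assuming judge verdicts and human labels have already been normalized to "A", "B", or "tie":

Python

# Sketch of the agreement-rate computation: count how often the judge's verdict
# matches the human label, skipping any prompt where either side declares a tie.
def agreement_rate(judge_verdicts, human_labels):
    """Both lists contain 'A', 'B', or 'tie' for each prompt."""
    kept = [
        (j, h)
        for j, h in zip(judge_verdicts, human_labels)
        if j != "tie" and h != "tie"
    ]
    return sum(j == h for j, h in kept) / len(kept)

# Three usable comparisons, two agreements -> ~0.67
print(agreement_rate(["A", "B", "tie", "A"], ["A", "A", "B", "A"]))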

We categorized the prompts in the human preference dataset into various domains using a domain classifier. As illustrated in Figure 1, the agreement rate remains consistent across domains, from simple chat prompts to more complex areas such as mathematics and coding.

Figure 1: Agreement rate between the LLM judge and human preferences by domain on the human preference dataset from Chatbot Arena [8, 9]
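
The domain classifier itself is not described in detail here; one lightweight way to bucket prompts is to ask an LLM to pick a label from a fixed taxonomy, as in the illustrative sketch below (the taxonomy and model identifier are assumptions, not the classifier actually used for Figure 1).

Python

# Illustrative only: the taxonomy and model identifier are assumptions. `client` is
# an OpenAI-compatible client as constructed in the earlier sketch.
DOMAINS = ["math", "coding", "writing", "reasoning", "general chat"]  # assumed taxonomy

def classify_domain(client, prompt_text):
    instruction = (
        "Classify the following user prompt into exactly one of these domains: "
        + ", ".join(DOMAINS)
        + ". Reply with the domain name only.\n\nPrompt:\n"
        + prompt_text
    )
    completion = client.chat.completions.create(
        model="Meta-Llama-3.1-405B-Instruct",  # assumed model identifier
        messages=[{"role": "user", "content": instruction}],
        temperature=0.0,
    )
    label = completion.choices[0].message.content.strip().lower()
    return label if label in DOMAINS else "general chat"  # fall back on unparseable replies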

Alpaca-Eval and MT-bench evaluation

To generalize our findings on replacing GPT-4 with Llama 3.1 405B as an LLM judge, we conduct experiments similar to our human preference study on several popular benchmarks, namely Alpaca-Eval [7] and MT-Bench [2].

Alpaca-Eval results

We test our hypothesis on Alpaca-Eval, a popular chat benchmark that is used as a proxy for performance on Chatbot Arena. Alpaca-Eval includes a set of 2.5K human annotations (~650 instructions, each with 4 human annotations) on pairs of model completions. For each pair of completions, we ask the LLM judge to determine which one is preferred, and we assess human agreement by measuring how frequently the judge's choice aligns with the majority of human preferences (a sketch of this computation follows below). We evaluate Llama 3.1 8B, 70B, and 405B as judge models to understand how well they agree with human preferences.
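
A rough sketch of the majority-vote agreement check, where the field names and the handling of evenly split votes are assumptions:

Python

from collections import Counter

# Sketch of the majority-vote agreement check. Each example carries the human
# annotations for one instruction plus the judge's pick; field names are illustrative.
def majority_agreement(examples):
    """examples: list of dicts with 'human_prefs' (e.g. ['A', 'A', 'B', 'A'])
    and 'judge_pref' ('A' or 'B')."""
    agreements, counted = 0, 0
    for ex in examples:
        votes = Counter(ex["human_prefs"])
        (top, top_count), = votes.most_common(1)
        # Assumption: skip instructions where the human vote is evenly split.
        if list(votes.values()).count(top_count) > 1:
            continue
        counted += 1
        agreements += ex["judge_pref"] == top
    return agreements / counted

examples = [
    {"human_prefs": ["A", "A", "B", "A"], "judge_pref": "A"},
    {"human_prefs": ["B", "B", "A", "B"], "judge_pref": "A"},
]
print(majority_agreement(examples))  # 0.5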

In Table 2, we show the human agreement rate, Spearman correlation, and Pearson correlation for GPT-4 Turbo, Claude, humans, and the different Llama 3.1 variants queried through SambaNova's Cloud API. The agreement rate for the 70B and 405B judges hovers around ~68%, with Spearman correlations of 0.933 and Pearson correlations of 0.877 and 0.805, respectively. These numbers are competitive: GPT-4 Turbo achieves ~71% agreement, a Spearman correlation of 0.95, and a Pearson correlation of 0.944, which is very close to what we achieve with Llama 3.1 70B and 405B. This suggests that Llama 3.1 70B/405B can be a drop-in replacement for GPT-4 and other closed-source judge models.

Model                      Human Agreement Rate (%)   Spearman Correlation   Pearson Correlation
alpaca_eval_gpt4_fn        70.98                      0.95                   0.944
Llama 3.1 405B Instruct    67.87                      0.933                  0.805
Llama 3.1 70B Instruct     68.25                      0.933                  0.877
Llama 3.1 8B Instruct      60.34                      0.783                  0.815
Humans                     65.7                       1.0                    1.0
Claude                     65.31                      0.933                  0.902

Table 2: Alpaca-Eval human agreement rate, Spearman correlation, and Pearson correlation [7]
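
For reference, the rank and score correlations reported in Tables 2 and 3 can be computed with scipy; the score lists below are made-up placeholders, not data from the study.

Python

# Sketch of the two correlation metrics: Spearman compares rank order, Pearson
# compares the raw scores. The score lists are illustrative placeholders.
from scipy.stats import spearmanr, pearsonr

judge_scores = [82.1, 75.4, 63.0, 55.2, 48.9]      # e.g. win rates from the Llama judge
reference_scores = [80.5, 77.0, 60.1, 57.3, 45.0]  # e.g. win rates from the reference judge

rho, _ = spearmanr(judge_scores, reference_scores)
r, _ = pearsonr(judge_scores, reference_scores)
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")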

MT-Bench results

MT-Bench is a benchmark that tests models' multi-turn conversation capability. To evaluate the efficacy of Llama 3.1 8B, 70B, and 405B as judges, we regenerate the MT-Bench leaderboard using SambaNova's Cloud API with these models as the judge and compare their Spearman and Pearson correlations with the rankings obtained by GPT-4. For each model completion saved in MT-Bench, we use the Llama model to provide a rating on a scale from 1 to 10, and we then measure the correlation between the Llama model's ratings and GPT-4's ratings. As the results in Table 3 show, Llama 3.1 405B's Spearman and Pearson correlations with respect to GPT-4's ranking are 0.9816 and 0.9950 respectively, which further demonstrates that it can be an effective replacement for GPT-4 in the LLM-as-a-Judge framework.
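
A minimal sketch of the single-answer rating step, assuming the judge is asked to emit its score in a [[rating]] tag (the wording here is illustrative, not the exact MT-Bench judge template):

Python

import re

# Illustrative rating prompt; not the exact MT-Bench template.
RATING_PROMPT = (
    "Please act as an impartial judge and rate the quality of the assistant's "
    "response to the user question on a scale of 1 to 10. "
    "Output your rating in the format [[rating]].\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}"
)

def parse_rating(judge_reply):
    """Pull the 1-10 score out of the judge's reply, or return None if absent."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_reply)
    return float(match.group(1)) if match else None

print(parse_rating("The answer is concise and correct. Rating: [[8]]"))  # 8.0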

Model                      Spearman Correlation   Pearson Correlation
Llama 3.1 8B Instruct      0.9655                 0.9865
Llama 3.1 70B Instruct     0.9757                 0.9946
Llama 3.1 405B Instruct    0.9816                 0.9950

Table 3: MT-Bench Spearman and Pearson correlations with GPT-4's ratings

Conclusions

In conclusion, the results on human preferences from Chatbot Arena, Alpaca-Eval, and MT-Bench demonstrate that the open-weight Llama 3.1 405B Instruct model is a viable alternative to closed-source models like GPT-4 and Claude for LLM-as-a-Judge evaluations. With competitive agreement rates and strong correlations with human judgments, Llama 3.1 405B proves to be both transparent and effective, eliminating many of the limitations posed by closed-source systems, such as the lack of transparency and reproducibility that comes with restricted access and content policies.

By adopting Llama 3.1 405B for free on SambaNova Cloud [1], developers and researchers can benefit from an open, reliable, and cost-effective model capable of producing consistent and credible judgments. As open-weight LLMs continue to advance, Llama 3.1 405B could pave the way for a more democratized and accessible approach to evaluating AI models, ensuring a more sustainable future for LLM research and development.

References

  1. Cloud Developer Portal
  2. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  3. Claude 3.5 Sonnet Announcement
  4. GPT-4o Announcement
  5. Gemini-1.5-Pro Announcement
  6. Introducing Llama 3.1: Our most capable models to date
  7. Alpaca-Eval github repo
  8. lmsys/lmsys-arena-human-preference-55k
  9. Chatbot Arena
  10. Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Appendix

Prompt Template For LLM As A Judge


Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better, as well as answering in the desired language of the user. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses.

Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Your evaluation should only focus on the correctness of the response.

After providing your explanation, output your final verdict by strictly following this format:

[[A]] if assistant A is better
[[B]] if assistant B is better
[[C]] for a tie

Prompt Template:

[User Question]
{INSERT QUESTION}

[The Start of Assistant A's Answer]
{INSERT ANSWER A}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{INSERT ANSWER B}
[The End of Assistant B's Answer]
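
Putting the template to work is a matter of filling in the placeholders and parsing the [[A]]/[[B]]/[[C]] verdict out of the reply. A minimal sketch, assuming the judge instructions above are passed as the system message, the same OpenAI-compatible client as in the earlier sketch, and an illustrative model identifier:

Python

import re
from openai import OpenAI

USER_TEMPLATE = """[User Question]
{question}

[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]"""

def judge_pair(client, system_prompt, question, answer_a, answer_b):
    """system_prompt is the judge instruction above; returns 'A', 'B', 'C', or None."""
    reply = client.chat.completions.create(
        model="Meta-Llama-3.1-405B-Instruct",  # assumed model identifier
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": USER_TEMPLATE.format(
                question=question, answer_a=answer_a, answer_b=answer_b)},
        ],
        temperature=0.0,
    ).choices[0].message.content
    match = re.search(r"\[\[(A|B|C)\]\]", reply)
    return match.group(1) if match else None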

Examples of Llama 3.1 As A Judge

Here are some examples of Llama 3.1 405B judging complex prompts, showing how it reasons about and understands the nuances behind user instructions when making a judgment.

Example 1


Prompt: Can you give an example of a deep learning model for predicting traffic during hurricane evacuation with reference?

-

Response A:
Yes, here is an example using the popular Convolutional Neural Network model Inception v3 to generate traffic evacuation forecasts for storm Florence using satellite and weather radar data. The model was trained on data from 2017 and was tested on data from September 14-16, 2018. The results were published in the open-source Model Forecast website.
Forecast on 14 September 2018 at 15 UTC:
Traffic congestion is expected along the I-95 corridor and in the Charleston SC area due to the large number of people evacuating the area ahead of Hurricane Florence.
Forecast on 15 September 2018 at 15 UTC:
Hurricane Florence is expected to cause severe flooding and traffic congestion in the Carolinas, with some localized areas seeing up to 20 inches of rain. The huge influx of people evacuating the area ahead of the storm could also lead to severe congestion on the I-95 corridor and in the Charleston SC area.
The model accuracy on 14 September was 73.5% and on 15 September was 79.2%.
-
Response B:
There are several deep learning models that can be used for predicting traffic during hurricane evacuation, but here's an example of a convolutional neural network (CNN) model that can be used for this purpose:
python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Conv2D, MaxPool2D, Flatten, Dropout

# define the model architecture
model = Sequential()

# define the input shape
model.add(Embedding(input_dim=3, output_dim=64, input_length=10)) # input shape: (batch_size, 3, 10, 3)
model.add(Conv2D(32, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Conv2D(64, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Conv2D(128, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Conv2D(256, (3, 3), activation='relu')) # convolutional layer
model.add(MaxPool2D((2, 2))) # max pooling layer
model.add(Dense(512, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
model.add(Dense(256, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
model.add(Dense(128, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer
model.add(Dense(64, activation='relu')) # dense layer
model.add(Dropout(0.2)) # dropout layer

-
Llama 405B's Judgment:
After evaluating the responses, Response A is more relevant and detailed, providing a real-world example with specific forecasts and data, which makes it more appropriate for the user's request. Response B, while technically sound, lacks the context and detail needed for the prompt. Therefore, Response A is the better response.
Verdict: [[A]]

Example 2


Prompt: Answer the following statements with "Agree" or "Disagree" only. Your answers should be returned in list form, in the same order they appear below, with the question number and answer only. Do not give an explanation. Do not skip any question. You must answer every question with agree or disagree, you can't be neutral or undecided.
Freedom of business is the best practical way a society can prosper.
The harder you work, the more you progress up the social ladder.
Economic inequality is too high in the world.
Charity is a better way of helping those in need than social welfare.
Land should not be a commodity to be bought and sold.
The current welfare system should be expanded to further combat inequality.
All industry and the bank should be nationalised.
Class is the primary division of society.
Taxation of the wealthy is a bad idea, society would be better off without it.
A government that provides for everyone is an inherently good idea.
Wages are always fair, as employers know best what a worker's labor is worth.
Communism is an ideal that can never work in practice.
It is "human nature" to be greedy.
"Exploitation" is an outdated term, as the struggles of 1800s capitalism don't exist anymore.
Organizations and corporations cannot be trusted and need to be regulated by the government.
-
Response A
Disagree
Disagree
Agree
Disagree
Agree
Agree
Disagree
Disagree
Disagree
Agree
Disagree
Disagree
Disagree
Disagree
Agree
-
Response B
Agree
Agree
Agree
Disagree
Disagree
Agree
Disagree
Agree
Disagree
Disagree
Disagree
Agree
Disagree
Disagree
Agree
-
Llama 405B's Judgment
To evaluate the quality of the responses provided by the two AI assistants, I will compare their answers based on the user's instructions. The user asked for "Agree" or "Disagree" answers to 15 statements, without explanations or skipped questions. Both assistants provided answers in the required format.

However, to determine which assistant answered the user's question better, I need to consider the relevance and accuracy of their responses. Since the statements cover various topics related to economics, politics, and social issues, I will not evaluate the correctness of their opinions, but rather their ability to provide consistent and coherent answers.
After reviewing the answers, I noticed that both assistants provided a mix of "Agree" and "Disagree" responses, indicating that they attempted to evaluate each statement individually. However, without knowing the user's perspective or the context behind these statements, it's challenging to determine which assistant's answers are more accurate.
Given the lack of additional information or context, I will focus on the consistency and coherence of their responses. Both assistants provided answers that seem to reflect a consistent perspective, with some exceptions. For example, Assistant A disagreed with statements 1 and 2, which could be seen as related to conservative or libertarian ideologies, while Assistant B agreed with these statements. Similarly, Assistant A agreed with statements 3 and 6, which could be seen as related to progressive or liberal ideologies, while Assistant B had mixed responses to these statements.
Considering the above factors, I conclude that both assistants provided responses that meet the user's requirements, but their answers reflect different perspectives and biases. Since the user did not provide any context or guidance on what perspective they were looking for, I will not favor one assistant over the other.
Final Verdict: [[C]] (Tie)