
LLM-Judge for Multilingual Document Question Answering

Posted by SambaNova Systems on April 2, 2025

Document Visual Question Answering (DocVQA) is a core capability of Vision Language Models (VLMs) and is essential for enterprise deployments of VLMs. It is generally more nuanced than simple classification or summarization tasks, and as such is typically measured with text-similarity metrics such as BLEU [10], ANLS [11], or exact match (EM). On the namesake DocVQA dataset [1], modern VLMs such as Claude 3.5 Sonnet, Llama 3.2 90B, and Qwen-2.5 72B perform at or above human level, at 94.4 ANLS.

However, enterprises around the world process documents in more than just English, and the performance of modern VLMs on this task in non-English languages is significantly less well understood. To understand the needs of our customers, we evaluated frontier closed- and open-source VLMs on the Japanese version of this task with JDocQA [2], a dataset curated by researchers at the Nara Institute of Science and Technology, RIKEN, and ATR in Japan.

blog-llm-judge-1
All scores collected using lmms-eval. Values shown are ANLS, multiplied by 100.

It’s apparent that many of these models perform significantly worse on the Japanese version of this task than on the English one, despite many of them claiming strong multilingual performance. However, there is more nuance to this discrepancy than meets the eye. In this blog post, we’ll dive deeper into why these models perform so much worse in Japanese, why we should evaluate with LLM-Judge instead of ANLS, and which models worked for us as an LLM-Judge.

DocVQA vs JDocQA

Let’s take a look at an example from each dataset to understand why the difference in performance is so drastic. Take this sample from the DocVQA validation split:

Question: What is the name of the company?
Answers: [“itc limited”, “ITC Limited”]

blog-llm-judge-2

The answer to the question appears in big bold letters at the top, as well as in the header of the document. This example is representative of the rest of the dataset: DocVQA is meant to be an extractive VQA task, where the ground truth answers are present, verbatim, in the text (see [1], page 3, Section 3.1, "Questions and Answers"). So this task is really a test of the model’s OCR and instruction-following abilities.

In contrast, let’s look at an example from the JDocQA training split:

Question: 12月24日の選挙当日に外出の予定が入っており、投票所へ行けないのですがどうしたらいいでしょうか?解答は自由に記述してください。
(I have plans to be out on the day of the election on December 24th, and I can't go to the polling station. What should I do? Please write your answer freely.)
Answer: 投票日に投票所へ行けない場合は、市内2か所の期日前投票所にて投票することができます。
(If you are unable to make it to the polling station on the election day, you can vote at one of two early voting stations in the city.)

blog-llm-judge-3

In this particular example, answering the question requires careful reading and interpretation of the entire text of the document, and the ground truth answer does not appear verbatim anywhere in the document text. This open-ended type of VQA problem accounts for 5,827 examples, over 50% of the dataset [2]. The other roughly 50% consists of Yes/No, Factoid, and Numerical questions, which may contain the verbatim ground truth answer in the document.

So we can see that, in one sense, JDocQA is a harder dataset than DocVQA, even disregarding the language barrier. But even if VLMs got every open-ended question in JDocQA wrong, shouldn’t they still score well on the other roughly 50% of the dataset?

This discrepancy can be further explained by our choice of evaluation metric. In the next section, we’ll discuss ANLS and how it fails to capture the nuances of ground truth answers in JDocQA.

ANLS

Average Normalized Levenshtein Similarity (ANLS) is the standard evaluation metric for document visual question-answering (VQA) tasks, proposed by Biten et al. [11] and adopted by Mathew et al. in the DocVQA paper [1]. It is typically reported as a number between 0 and 1, where 0 represents no similarity to a ground-truth answer and 1 represents a perfect match. Intermediate values measure how close, in terms of edit distance, a proposed answer is to the ground truth.

As for why ANLS might be hiding model performance, let’s look at an example completion from the JDocQA validation set:

Question: 富士市の教育ローン(富士市勤労者教育資金貸付制度)の利率、融資限度額、貸し付け期間はどうなっていますか?
What are the interest rate, loan limit, and loan period for Fuji City's education loan (Fuji City Workers' Education Fund Loan System)?

Ground truth answer: 富士市の教育ローン(富士市勤労者教育資金貸付制度)の利率は年1.70%、融資限度額は300万円、貸付の期間は5年以内で、勤労者やその家族の、高校・大学などへの進学時や在学中の教育費用を、市が静岡県労働金庫と協調して貸し付けます。
Fuji City's education loan (Fuji City Workers' Education Fund Loan System) has an interest rate of 1.70% per year, a loan limit of 3 million yen, and a loan period of up to 5 years. The city cooperates with the Shizuoka Prefectural Labor Bank to provide loans to workers and their families for the educational expenses of entering high school, university, etc. and while they are in school.

blog-llm-judge-4

Llama 3.2 90B Instruct provides the following answer:

富士市の教育ローン(富士市勤労者教育資金貸付制度)の利率、融資限度額、貸し付け期間は以下の通りです。

*   利率:年1.70%
*   融資限度額:300万円
*   貸し付け期間:5年以内

この制度は、富士市に在住する学生や在学中の家族を対象に、教育費用の資金援助を目的とした制度です。

The interest rate, loan limit, and loan period for Fuji City's education loan (Fuji City Workers' Education Fund Loan System) are as follows.

* Interest rate: 1.70% per year
* Loan limit: 3 million yen
* Loan period: up to 5 years

This system is intended to provide financial assistance for educational expenses for students living in Fuji City and their families currently enrolled in school.

This seems like a reasonable and correct answer when compared to the ground truth: it mentions all three points of a 1.70% interest rate, a loan limit of 3 million yen, and a loan period of up to 5 years. If asked for a binary decision on whether the answer is correct, a human would surely give a 1 instead of a 0. However, this answer only gets an ANLS score of 0.3945 (39.45 when multiplied by 100, for comparison to the chart above). Since ANLS is typically thresholded at 0.5, it ends up counting as a 0!

When we look through the rest of the dataset, we notice that not only do correct answers get a low ANLS score, there are also wrong answers that obtain a high ANLS score:

Question: 各地区まちづくりセンターでの出張受付の表によると、2月18日(木)では、浮島で行われますか。
According to the on-site registration table at each district urban development center, will the event be held at Ukishima on Thursday, February 18th?
Ground truth answer: いいえ。
No

blog-llm-judge-5

Llama 3.2 90B Instruct answer: はい。
Yes.

Here, we can see the model is clearly getting a yes-or-no question wrong. Despite this, the answer nets an ANLS score of 0.5, which is above the correctness threshold! This is despite the question being labeled a "yes-or-no" question, which should be easy to grade with automatic metrics. So what can we do? How can we approach human-level judgments of correctness? In the next section, we’ll discuss how LLM-Judge can be used to solve this issue.

Aside: ANLS explanations

Why does the second, incorrect example get a higher ANLS score? Let’s see how ANLS is defined:

blog-llm-judge-6
Equation 1, from Section 3.4, page 5 of [1]

The outer summation is over the N questions in a dataset, so let’s focus on the per-answer score s between the ground truth answer aij and a candidate answer oqi. NL is the normalized Levenshtein distance [12], or edit distance, between the two strings: essentially the ratio between the minimum number of edits needed to turn aij into oqi and the maximum length of the two strings.
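Transcribed into LaTeX (our rendering of the equation shown in the figure, following the definitions above, with threshold τ = 0.5):

```latex
\mathrm{ANLS} = \frac{1}{N} \sum_{i=1}^{N} \max_{j} \, s\left(a_{ij}, o_{q_i}\right),
\qquad
s\left(a_{ij}, o_{q_i}\right) =
\begin{cases}
1 - \mathrm{NL}\left(a_{ij}, o_{q_i}\right) & \text{if } \mathrm{NL}\left(a_{ij}, o_{q_i}\right) < \tau \\
0 & \text{if } \mathrm{NL}\left(a_{ij}, o_{q_i}\right) \ge \tau
\end{cases}
```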

blog-llm-judge-7

For the second example, we have
blog-llm-judge-8

These answers are short enough to make it clear that the Levenshtein distance is 2. The max length is len(aij) = 4, so the ANLS for this single instance is
blog-llm-judge-9

The calculation for the first example is similar, although the Levenshtein distance and max length are less obvious:

blog-llm-judge-10

We can interpret it this way: the edit distance from the predicted answer to the ground truth, as a ratio of the string length, was smaller for the yes-or-no example than for the “富士市の教育” example.
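The arithmetic above can be checked in a few lines of Python. This is a plain dynamic-programming Levenshtein distance for illustration, not the lmms-eval implementation:

```python
# Sketch of the per-answer ANLS arithmetic: edit distance, normalized
# by the longer string's length, subtracted from 1 (before the 0.5
# threshold is applied).

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

def normalized_similarity(truth: str, pred: str) -> float:
    """1 - NL: normalized Levenshtein similarity before thresholding."""
    if not truth and not pred:
        return 1.0
    return 1.0 - levenshtein(truth, pred) / max(len(truth), len(pred))

# Yes-or-no example: edit distance 2 over max length 4.
print(levenshtein("いいえ。", "はい。"))            # 2
print(normalized_similarity("いいえ。", "はい。"))  # 0.5
```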

LLM-Judge as a metric

LLM-Judge was popularized in 2023 as a method to evaluate alignment with human preferences [3, 5, 6, 7]. In the open-ended task of selecting human preferences among chat completions, Zheng et al. found that, while the method is still subject to potential bias and error depending on the LLM, LLM-Judge matched 80% of human annotations while being more nuanced and flexible than BLEU score.

In the MT-Bench paper [3], the authors use LLM-Judge for pairwise preference ranking ("[A] is better than [B]") or Likert-scale rating ("rate this response on a scale of 1-10"). Since then, LLM-Judge has been widely adopted for a variety of use cases, including evaluating model summaries [4], detecting gender bias [8], and evaluating safety refusals [9].

We propose using LLM-Judge as a proxy for a simple question: given a question and answer pair, does the provided response actually answer the question? Compared to metrics like ANLS, EM, and BLEU, this captures the varied ways a model might correctly answer a question without much n-gram overlap (e.g. 1万 vs 10000, or “page 1” vs “page one”). Conversely, false positives are also possible: “page 3” gets a high ANLS/BLEU score when “page 1” is the answer. We use the following prompt template:

I am going to give you a question, the answer to the question, and model's answer to the question. You are to tell me if the model is correct. Respond [[1]] if correct and [[0]] if incorrect. Then give me an explanation of your judgement. Here is the question:

{question}

 Here is the answer to the question:

{ground_truth_answer}

 Here is the model completion:

{model_answer}

 Judgement:

Using this prompt template, we extract the [[0]] or [[1]] from the LLM-Judge response to mark a model’s response as correct or incorrect. We calculate the mean score on JDocQA for GPT-4o, Claude 3.7, Llama 3.2 90B Instruct, and Qwen 2.5 VL 72B, using either GPT-4o or Llama 3.1 405B Instruct as the LLM-Judge.
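The extraction-and-averaging step can be sketched as follows. The regex and helper names here are illustrative assumptions, not the exact code behind our numbers:

```python
import re

# Hypothetical scoring helper: pull the first [[0]] / [[1]] verdict out
# of each LLM-Judge response and average over the parseable ones.

VERDICT = re.compile(r"\[\[([01])\]\]")

def extract_verdict(judgement: str):
    """Return 0 or 1 from a judge response, or None if no verdict found."""
    match = VERDICT.search(judgement)
    return int(match.group(1)) if match else None

def judge_score(judgements):
    """Mean correctness over responses containing a parseable verdict."""
    verdicts = [v for v in map(extract_verdict, judgements) if v is not None]
    return sum(verdicts) / len(verdicts)

responses = [
    "[[1]]\n\nExplanation: the model states the same three facts.",
    "The answer is incorrect, so I rate [[0]].",
]
print(judge_score(responses))  # 0.5
```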

blog-llm-judge-11

We can see that ANLS significantly underestimates the performance of frontier models on the Japanese question answering task. In fact, frontier models get about 45% of the questions correct, significantly higher than the roughly 20% that ANLS would suggest.

Comparing to DocVQA

So are modern VLMs really underperforming on document VQA in non-English languages? We can’t fairly compare the ANLS score on DocVQA to the LLM-Judge score on JDocQA, so let’s calculate the LLM-Judge score for each of the above models on DocVQA as well.

blog-llm-judge-12

blog-llm-judge-13

Depending on the LLM-Judge used, DocVQA scores for English are either around the same or up to 10 points higher when reported with LLM-Judge as the metric. Compared against performance on JDocQA using the same metric, VLMs are generally 2x better at the simpler English task.

Conclusions

In summary: we cannot blindly compare the raw performance of frontier VLMs on English DocVQA and JDocQA, but even after accounting for the extra complexity of the Japanese task and switching our metric, it’s clear that modern VLMs still have a way to go when processing documents in other languages. However, when we compare frontier-model performance on even the English task against where it stood a year ago, the signs are encouraging: the field is improving rapidly, and we should have production-ready models for more than just English documents soon. Choosing the right evaluation dataset and evaluation metric will go a long way toward getting there even faster.

Aside: GPT-4o vs the field

In many of our internal training experiments to break new ground in this field, we choose to use Llama 3.1 405B Instruct on the SambaNova Cloud over GPT-4o as an LLM-Judge to evaluate new checkpoints. This is for a variety of reasons:

  • Speed: when using 405B Instruct on SambaNova Cloud, running LLM-Judge on about 800 examples of JDocQA takes about 30 minutes, compared to 50 minutes using GPT-4o on Azure
  • Reliability: depending on the dataset, GPT-4o may refuse to answer questions due to OpenAI content restrictions, and may even fail to follow the [[1]]/[[0]] format of the LLM-Judge prompt. 405B, as an open-source model, has no such content restrictions and never failed to provide a correctly formatted output
  • Judgment quality: GPT-4o tended to be overly harsh as a judge, which is reflected in its lower scores compared to 405B. See the examples below where GPT-4o marks an answer as incorrect, despite 405B (and human common sense) saying otherwise
blog-llm-judge-14

Appendix: Comparisons to Human Judgments

While it is infeasible to compare human judgments against ANLS, the GPT-4o Judge, and the 405B Judge for every model and every example, we show these comparisons for the first few examples from the JDocQA subset. While all three metrics have around the same Spearman’s correlation with the human judgments (0.41, 0.41, and 0.36 respectively), it’s worth noting that with such a small sample, the p-values are very high as well (0.24, 0.24, and 0.31 respectively).
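A Spearman check like the one above can be reproduced with a small self-contained helper (average ranks handle the heavy ties that binary 0/1 scores produce). The two score lists below are illustrative, not our actual ten samples:

```python
# Spearman's rank correlation from scratch: rank both score lists with
# average ranks (so ties are handled), then take the Pearson correlation
# of the ranks.
from math import sqrt

def average_ranks(xs):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the average ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)

human = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # illustrative human 0/1 ratings
judge = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]  # illustrative judge verdicts
print(round(spearman(human, judge), 2))  # 0.65
```

In production we would reach for scipy.stats.spearmanr, which also returns the p-value; the hand-rolled version just makes the tie handling explicit.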

A few important observations:

  • For 6/10 of the ratings where a human gave a 0, all other metrics reported 0 as well, suggesting high precision for wrong answers
  • For 4/10 of the ratings where a human gave a 1, all other metrics reported a 0, except for the first instance, where the 405B Judge also reported a 1
blog-llm-judge-table-1

 

Ground Truth Answer
久留米シティプラザの会館に向けて、久留米市は平成25年度から多くの市民の方々に、ボランティアスタッフとして、例えば芸術活動をしている社会人や、まちづくりに興味がある学生など、約100人がイベントの準備や当日の運営や、展示の企画や制作に携わりました。

Since fiscal 2013, in preparation for the Kurume City Plaza hall, Kurume City has engaged many citizens as volunteer staff, about 100 people including working adults engaged in artistic activities and students interested in urban development, to help with event preparations, on-site management, and exhibition planning and production.

Llama 3.1 11B Response
久留米シティプラザの魅力に迫ると書かれた表題の資料によると、以下のような取り組みが行われていることがわかります。
...
*   **市民がスタッフとして活躍**
    *   久留米市は、門前に創てての機運を感めようと、平成25年から多くの市民の皆さんに、ボランティアスタッフとしてプレ開業などに参加してもらっています。文化・芸術活動をしている社会人や、まちづくりに興味がある学生など、これまでに約100人がイベントの準備や当日の準備、会場案内の他、パソコンアートロードでの展示の企画などに携わりました。
...
(response truncated for length)

According to the document titled "Approaching the charm of Kurume City Plaza," the following efforts are being made.
...
* **Citizens working as staff**
* In order to create momentum in front of the gate, Kurume City has had many citizens participate in pre-opening events as volunteer staff since fiscal 2013. About 100 people, including working adults engaged in cultural and artistic activities and students interested in urban development, have been involved in preparing for events, preparing for the day, providing venue information, and planning exhibitions at PC Art Road.
...

Human Score: 1

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's completion contains several segments that overlap with the given answer, but it is not fully coherent and repeats some information. Additionally, there are errors and unfinished sentences within the completion. The model's answer also introduces elements (like "門前に創てての機運" and "パソコンアートロードでの展示") that do not match the specifics of the originally provided answer. Therefore, the model's response is not an accurate or reliable representation of the original answer.

405B Judge response
The answer is correct, so I rate [[1]].

 Explanation: The model's answer provides similar information to the correct answer, mentioning that approximately 100 people, including socialites involved in artistic activities and students interested in city development, participated as volunteer staff in event preparation, operation, and exhibition planning for the Kurume City Plaza. Although the model's answer is more detailed and includes additional information, the core content matches the correct answer.

blog-llm-judge-table-2

 

Ground Truth Answer
中学生以上200円、小学生100円です。

It's 200 yen for junior high school students and above, and 100 yen for elementary school students.

 

Llama 3.1 11B Response
300円です。

It's 300 yen.

 

Human Score: 0

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's answer "300円です" is incorrect. It incorrectly combines the fares for different age groups. The correct breakdown as provided is "中学生以上200円、小学生100円です," which means 200 yen for middle school students and older, and 100 yen for elementary school students. Instead of combining them, the answer should provide the separate fares for each category as requested.

405B Judge response
The answer is incorrect, so I rate [[0]].

 Explanation: The model answers 300円, but according to the correct answer, the fare in 北野町 (Kitano town) is 200円 for middle school students and above, and 100円 for elementary school students. There is no mention of a 300円 fare.

blog-llm-judge-table-3

 

Ground Truth Answer
市内産の農林水産物やそれらの加工品を食材として扱う飲食店などを「伊勢市地産地消の店」に認定し、伊勢市産物の消費および需要の拡大を推進する取り組みを行っています。2015年2月1日現在で「伊勢市地産地消の店」認定店は64店舗あります。

Restaurants and other establishments that use agricultural, forestry, and fishery products produced in the city as ingredients are certified as "Ise City Local Production and Consumption Stores" to promote the consumption and demand for Ise City products. As of February 1, 2015, there are 64 certified "Ise City Local Production and Consumption Stores."

 

Llama 3.1 11B Response
伊勢市では、地産地消の取り組みを推進するために、以下のような取り組みを行っています。

*   地産地消の取り組みを推進するための組織を設置する
*   市内産の農林水産物やそれらの加工品を食材として扱う飲食店などを「伊勢市地産地消の店」に認定しています。
*   この取り組みにより、伊勢市産品の消費をより多くすることを目指しています。
*   伊勢市地産地消の店の認定を希望する事業所は、農林水産課へ問い合わせてください。

Ise City is making the following efforts to promote local production and consumption.

* Establish an organization to promote local production and consumption.
* Restaurants and other establishments that use agricultural, forestry, and fishery products produced in the city as ingredients are certified as "Ise City Local Production and Consumption Stores."
* Through this effort, we aim to increase the consumption of Ise City products.
* If you would like to be certified as an Ise City Local Production and Consumption Store, please contact the Agriculture, Forestry, and Fisheries Division.

 

Human Score: 1

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: While the model's completion includes some relevant details that are consistent with the provided answer, it adds extra information that was not mentioned in the given answer (e.g., the establishment of an organization to promote local consumption). Additionally, the model's answer refers to contacting the agricultural, forestry, and fisheries department, a detail absent in the original answer. Therefore, although partially correct, the completion is not entirely accurate according to the provided information.

405B Judge response
The answer is partially correct, but I rate [[0]].

 Explanation: The model's answer captures the essence of the correct answer, mentioning the "伊勢市地産地消の店" certification for stores using local products. However, it does not provide the specific details mentioned in the correct answer, such as the number of certified stores (64 as of 2015) and the exact goal of the initiative (expanding consumption and demand for Ise City products). The model's answer also includes additional information not present in the correct answer, such as the establishment of an organization and the contact information for businesses interested in certification.

blog-llm-judge-table-4

 

Ground Truth Answer
伊勢市です。

Ise City.

 

Llama 3.1 11B Response
伊勢市社会福祉協議会です。

Ise City Social Welfare Council.

 

Human Score: 0

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's completion includes additional information ("社会福祉協議会") that is not present in the provided answer ("伊勢市です"). Therefore, it is not an exact match to the required answer.

405B Judge response
The answer is incorrect, so I rate [[0]].

 Explanation: The model answers 伊勢市社会福祉協議会 (Ise City Social Welfare Council), but the correct answer is 伊勢市 (Ise City).

blog-llm-judge-table-5

 

Ground Truth Answer
36.4トン

36.4 tons

 

Llama 3.1 11B Response
約42.8トンです。

About 42.8 tons.

 

Human Score: 0

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's completion should provide the specific quantity of 36.4 tons as stated in the answer. The provided completion of "約42.8トンです" ("about 42.8 tons") does not match the correct quantity of 36.4 tons and is therefore incorrect.

405B Judge response
The answer is incorrect, so I rate [[0]].

 Explanation: The model answers approximately 42.8 tons, but the correct answer is 36.4 tons, indicating a discrepancy in the model's calculation or estimation.

blog-llm-judge-table-6

 

Ground Truth Answer
無意識のうちに教育の場で、男子には経済的な自立や、女子には家庭的なことを求めがちであったり、将来就きたい職業でも男女差がある状況のため、社会全体が男女共同参画の意識を持つ必要があると考えられます。

In education, we tend to unconsciously expect boys to become economically independent and girls to become domestically involved, and there are also gender differences in the careers that boys and girls want to have in the future, so I think society as a whole needs to be aware of gender equality.

 

Llama 3.1 11B Response
はい、残っています。

Yes, it still exists.

 

Human Score: 1

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's answer "はい、残っています" (Yes, it still exists) is too brief and does not provide a detailed explanation to support the opinion, as required by the question which asks for a freeform, descriptive response. The given answer includes specific reasons and suggests a solution, thus meeting the requirement of the question more adequately.

405B Judge response
The answer is partially correct, but I rate [[0]].

 Explanation: The model simply answers "yes, it remains" (はい、残っています), but the question asks for a free description of the respondent's thoughts on the matter. The correct answer provides a more detailed explanation of why and how gender discrimination still exists, whereas the model's response lacks depth and insight.

blog-llm-judge-table-7

 

Ground Truth Answer
地域全体で見回りを行うこと、建物の回りに燃えやすいものを置かないよう気を付けること、可燃物や資源物を出す曜日を守ることです。

 

Llama 3.1 11B Response
くるめ市では、放火の被害を減らすために、以下のような取り組みが行われました。

*   電気器具による火災を防ぐポイント
    *   電気コードは柔らかくてきれいなものを1つつに束ねたり、良い電気コードを折り曲げるととがりして使わないでおこう。電気コードが熱を持ちやすいので、火災の原因になることがあるからです。

(response truncated for length)

The following efforts have been made in Kurume City to reduce the damage caused by arson.

* Tips for preventing fires caused by electrical appliances

* Bundle soft, clean electrical cords together, and do not bend good electrical cords into sharp points. Electric cords tend to heat up easily and can cause fires.

 

Human Score: 0

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's completion does not match the provided answer. The provided answer specifically mentions measures such as conducting patrols by the entire community, taking care not to place flammable items around buildings, and following the schedule for disposing of combustible and recyclable materials. The model's completion instead discusses various points related to preventing fires caused by electrical equipment, which is unrelated to the given answer.

405B Judge response
The answer is incorrect, so I rate [[0]].

 Explanation: The model's answer does not match the correct answer. The correct answer mentions three specific measures to reduce arson damage in Kurume City: conducting patrols throughout the region, being mindful not to place flammable objects around buildings, and adhering to designated days for disposing of combustible materials and recyclables. The model's answer, on the other hand, discusses general fire prevention tips related to electrical appliances, which are not relevant to the specific question about Kurume City's measures to reduce arson damage.

blog-llm-judge-table-8

 

Ground Truth Answer
挑発的な服装をしていたため性暴力に合った、抵抗が足りなかったなどと被害者をさらに傷つけることがあり、そのため被害を周囲に話せなかったり、孤立することがあります。

Victims may be further hurt by being told that they were sexually assaulted because they were dressed provocatively, or that they did not resist enough, which may prevent them from talking about the assault to those around them or lead to isolation.

 

Llama 3.1 11B Response
性暴力被害者に対して人権侵害が起きているということがありますが、具体的にどのような偏見が起きていますか。以下にその例を挙げます。

*   **性別役割の固定観念**:男性は強くて力が強い、女性は弱くて従順でなければならないという固定観念が存在します。被害者が男性であれば、強くて力が強いはずなのに、被害を受けたのは自分の力不足のせいだと考えることがあります。被害者が女性であれば、従順でなければならないという固定観念から、被害を受けたのは自分のせいだと考えることがあります。

(response truncated for length)

It is said that human rights are being violated against victims of sexual violence, but what kind of prejudice is occurring specifically? Below are some examples.

* **Gender role stereotypes**: There is a stereotype that men should be strong and powerful, and women should be weak and submissive. If the victim is male, they may think that the assault was due to their own lack of strength, even though they should be strong and powerful. If the victim is female, they may think that the assault was their own fault, due to the stereotype that they should be submissive.

 

Human Score: 0 (The text specifically calls out the two examples in the ground truth answer)

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model's completion does not directly match the answer provided. While the model provides relevant examples of biases and stereotypes that affect sexual violence victims, the answer specifically mentions how victims are further hurt by accusations of provocative clothing or not resisting enough, leading to reluctance to speak out and isolation. The model's completion is more general and includes additional types of biases, which could be considered correct in a broader context, but it does not directly address the particular aspects mentioned in the answer.

405B Judge response
The answer is correct, so I rate [[1]].

 Explanation: The model's answer provides specific examples of biases that occur against victims of sexual violence, such as victim-blaming, fixed gender roles, and gender-based discrimination. These biases are similar to the ones mentioned in the correct answer, which states that victims are often blamed for their clothing or lack of resistance, leading to further harm and isolation. The model's answer provides a more detailed and comprehensive explanation of the biases, but the core idea is the same as the correct answer.

blog-llm-judge-table-9

 

Ground Truth Answer
はい

Yes

 

Llama 3.1 11B Response
はい

Yes

 

Human Score: 1

ANLS: 1

GPT-4o Judge response
[[1]]

Explanation: The model's completion matches the provided answer exactly. Both the model completion and the provided answer are "はい," which means "yes." This indicates that the model correctly identified that real estate consultations can only be done on the 6th floor of the main building of Kurume City Hall.

405B Judge response
The answer is correct, so I rate [[1]].

Explanation: The model's answer matches the given answer, both stating "はい" (hai), which means "yes" in Japanese.

blog-llm-judge-table-10

 

Ground Truth Answer
キッズ(U-6)サッカーフェスティバル2008に参加するためには、温水プール・陸上競技場・市立富士体育館で配布されている申し込み用紙に必要事項を記入して、直接またはFAXで温水プールにご連絡ください。

To participate in the Kids (U-6) Soccer Festival 2008, please fill out the application form distributed at the heated swimming pool, athletics stadium, or Fuji City Gymnasium, and contact the heated swimming pool directly or by fax.

 

Llama 3.1 11B Response
JFA・りぷす富士が行っているキッズ(U-6)サッカーフェスティバル2008に参加するには、以下のような手続きが必要になります。

1.  申込方法:参加費用を含む申込書を、指定された期間内に郵送する。
2.  申込期間:12月7日から21日まで。
3.  申込先:りぷす富士(座間富士フール)36-2131

(response truncated for length)

To participate in the Kids (U-6) Soccer Festival 2008 organized by JFA and Repus Fuji, you will need to follow the procedures below.

1. How to apply: Post the application form, including the participation fee, by post within the specified period.
2. Application period: December 7th to 21st.
3. Application: Rips Fuji (Zama Fuji Fool) 36-2131

 

Human Score: 0

ANLS: 0

GPT-4o Judge response
[[0]]

Explanation: The model completion, while detailed and providing information about the procedures, includes significantly different information and specifics compared to the given answer. The correct answer mentioned a specific method of obtaining and submitting the application form for the soccer festival (温水プールで申し込み用紙を記入して、直接またはFAXで送る). The model completion specifies different steps, such as mailing the application fee, the periods, the venue, and the conditions that were not mentioned in the given answer. This divergence in information indicates that the model completion does not accurately address the details provided in the original answer.

405B Judge response
The answer is incorrect, so I rate [[0]].

 Explanation: The model's answer provides a detailed but incorrect procedure for participating in the JFA・りぷす富士キッズ(U-6)サッカーフェスティバル2008. The correct answer states that participants need to fill out an application form available at specific locations (温水プール・陸上競技場・市立富士体育館) and submit it directly or via FAX to the 温水プール. The model's answer provides a different set of procedures, including mailing an application form with a participation fee, which is not mentioned in the correct answer.

References

[1] Mathew, Minesh, Dimosthenis Karatzas, and C. V. Jawahar. "DocVQA: A dataset for VQA on document images." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021.

[2] Onami, Eri, et al. "JDocQA: Japanese document question answering dataset for generative language models." arXiv preprint arXiv:2403.19454 (2024).

[3] Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems 36 (2023): 46595-46623.

[4] Gao, Mingqi, et al. "Human-like summarization evaluation with ChatGPT." arXiv preprint arXiv:2304.02554 (2023).

[5] Li, Junlong, et al. "Generative judge for evaluating alignment." arXiv preprint arXiv:2310.05470 (2023).

[6] Zhu, Lianghui, Xinggang Wang, and Xinlong Wang. "JudgeLM: Fine-tuned large language models are scalable judges." arXiv preprint arXiv:2310.17631 (2023).

[7] Liu, Yuxuan, et al. "Calibrating LLM-based evaluator." arXiv preprint arXiv:2309.13308 (2023).

[8] Kumar, Shachi H., et al. "Decoding biases: Automated methods and LLM judges for gender bias detection in language models." arXiv preprint arXiv:2408.03907 (2024).

[9] Xie, Tinghao, et al. "SORRY-Bench: Systematically evaluating large language model safety refusal behaviors." arXiv preprint arXiv:2406.14598 (2024).

[10] Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002.

[11] Biten, Ali Furkan, et al. "Scene text visual question answering." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[12] https://en.wikipedia.org/wiki/Levenshtein_distance
