How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
Main Authors:
Format: Journal Article
Language: English
Published: 16-02-2024
Subjects:
Online Access: Get full text
Summary: Work on instruction-tuned Large Language Models (LLMs) has used automatic methods based on text overlap and LLM judgments as cost-effective alternatives to human evaluation. In this paper, we perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks. In evaluating how well automatic methods align with human evaluations, correlation metrics are the most commonly employed method despite their inherent limitations when dealing with ties and different scales. To address these shortcomings, we use Pairwise Accuracy as an alternative to standard correlation measures. We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent. Specifically, the simple ROUGE-L metric correlates very well with human ratings for short-answer English tasks but is unreliable in free-form generation tasks and cross-lingual scenarios. The effectiveness of the more advanced method of using GPT-4 as a judge diminishes significantly if reference answers are not included in the prompt, which is the scenario where this method has the potential to provide the most value compared to other metrics. Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
DOI: 10.48550/arxiv.2402.10770
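
The record does not reproduce the paper's exact formulation, but as a rough illustration of the Pairwise Accuracy idea named in the abstract, the sketch below measures how often an automatic metric and human ratings order pairs of outputs the same way. The function name, the tie-handling convention, and the example scores are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of item pairs on which an automatic metric and human
    ratings agree about the ordering. Here ties count as agreement only
    when both sources tie on that pair (one possible convention; the
    paper's exact tie handling may differ)."""
    assert len(metric_scores) == len(human_scores)
    agree = total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        # Sign of the pairwise difference: -1, 0, or +1.
        m = (metric_scores[i] > metric_scores[j]) - (metric_scores[i] < metric_scores[j])
        h = (human_scores[i] > human_scores[j]) - (human_scores[i] < human_scores[j])
        agree += int(m == h)
        total += 1
    return agree / total if total else 0.0

# Hypothetical example: ROUGE-L scores vs. 1-5 human ratings for four outputs.
print(pairwise_accuracy([0.42, 0.31, 0.55, 0.10], [4, 3, 5, 2]))  # -> 1.0
```

Unlike rank correlation coefficients, a pair-level agreement rate of this kind is unaffected by the two scores living on different scales, which is the limitation of correlation metrics that the abstract highlights.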