#evaluation #multi-metric #llm
MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate text models’ multitask accuracy across a broad range of subjects. It comprises 57 tasks of four-option multiple-choice questions covering diverse fields such as elementary mathematics, US history, computer science, and law. The goal is to assess a model’s world knowledge and problem-solving ability in these varied domains.
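To make the scoring protocol concrete, here is a minimal sketch of computing accuracy on one MMLU subject. It assumes the `cais/mmlu` copy of the dataset on the Hugging Face Hub (which exposes each subject as a config with `question`, `choices`, and `answer` fields) and a hypothetical `model.predict_choice` helper that returns the index (0–3) of the model’s chosen answer; a real harness would replace that helper with its own prompting and answer-extraction logic.

```python
# Minimal sketch: accuracy on one MMLU subject, assuming the
# "cais/mmlu" dataset on the Hugging Face Hub and a hypothetical
# model.predict_choice(question, choices) -> int helper.
from datasets import load_dataset

def mmlu_accuracy(model, subject: str = "high_school_computer_science") -> float:
    test = load_dataset("cais/mmlu", subject, split="test")
    correct = 0
    for example in test:
        # Each example has a question, four answer choices, and the
        # integer index of the correct choice.
        pred = model.predict_choice(example["question"], example["choices"])
        correct += int(pred == example["answer"])
    return correct / len(test)
```

The overall MMLU score is then the accuracy aggregated over all 57 subjects (or the `all` config), so per-subject loops like this are typically run across every task and averaged.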
Achieving a high score on MMLU requires extensive world knowledge and problem-solving ability across a wide array of subjects.
Smaller models typically struggle on this benchmark, often scoring only modestly above the 25% random-guess baseline for four-option multiple-choice questions.