#evaluation #multi-metric #llm
MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate text models’ multitask accuracy across a broad range of subjects. It comprises 57 tasks of four-option multiple-choice questions covering diverse fields such as elementary mathematics, US history, computer science, and law. The goal is to assess a model’s world knowledge and problem-solving ability in these varied domains.
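To make the scoring protocol concrete, here is a minimal sketch of computing accuracy on one MMLU subject. It assumes the `cais/mmlu` copy of the dataset on the Hugging Face Hub (which exposes each subject as a config with `question`, `choices`, and `answer` fields) and a hypothetical `model.predict_choice` helper that returns the index (0–3) of the model’s chosen answer; a real harness would replace that helper with its own prompting and answer-extraction logic.

```python
# Minimal sketch: accuracy on one MMLU subject, assuming the
# "cais/mmlu" dataset on the Hugging Face Hub and a hypothetical
# model.predict_choice(question, choices) -> int helper.
from datasets import load_dataset

def mmlu_accuracy(model, subject: str = "high_school_computer_science") -> float:
    test = load_dataset("cais/mmlu", subject, split="test")
    correct = 0
    for example in test:
        # Each example has a question, four answer choices, and the
        # integer index of the correct choice.
        pred = model.predict_choice(example["question"], example["choices"])
        correct += int(pred == example["answer"])
    return correct / len(test)
```

The overall MMLU score is then the accuracy aggregated over all 57 subjects (or the `all` config), so per-subject loops like this are typically run across every task and averaged.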
Achieving a high score on MMLU requires extensive world knowledge and problem-solving ability across a wide array of subjects.
Smaller models typically struggle on this benchmark, often scoring only modestly above the 25% random-guess baseline for four-option multiple-choice questions.