Perplexity (PPL) is a measure of how predictable a piece of text is to a model. It is one of the most common metrics for evaluating the performance of language models and is often reported alongside task-specific metrics for well-defined LLM tasks such as question answering.
It reflects how well the model is able to predict the next word in a sequence of words. LLMs generate text autoregressively, i.e., word by word: at each step, the model selects the next word from a K-sized set of weighted candidates.
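As a rough illustration of that selection step, here is a minimal sketch in PyTorch, using made-up logits over a toy vocabulary (none of the words or scores come from a real model):

```python
import torch

# Hypothetical next-word logits over a tiny vocabulary
# (the words and scores are illustrative, not from a real model).
vocab = ["the", "cat", "mat", "sat", "dog"]
logits = torch.tensor([3.0, 2.0, 1.5, 0.5, 0.1])

# Keep only the K most probable candidates and renormalize their weights.
k = 3
top_logits, top_indices = torch.topk(logits, k)
probs = torch.softmax(top_logits, dim=-1)

# Sample the next word from the K weighted options.
choice = torch.multinomial(probs, num_samples=1)
next_word = vocab[top_indices[choice].item()]
print(next_word)
```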
Perplexity is based on the concept of entropy, which measures the amount of randomness or uncertainty in a system. In the context of language modeling, it is a way to quantify the performance of the model:
- a lower perplexity score indicates that the model is more accurate: the generated sentence is more likely to sound natural to human ears,
- a higher perplexity score indicates that the model is less accurate.
Basically, the lower the perplexity, the better the model is at predicting the next sample from the distribution. This indicates better generalization and performance.
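Concretely, for a sequence of N tokens $x_1, \dots, x_N$ and a model that assigns probabilities $p_\theta$, perplexity is the exponential of the average negative log-likelihood (i.e., the exponentiated cross-entropy):

$$\mathrm{PPL}(x_1, \dots, x_N) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$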
Perplexity (and loss more generally) is only a weak proxy for a model's ability to generate quality text. For this reason, one usually also calculates a BLEU or ROUGE score (depending on the generative task).
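As a minimal sketch, assuming the Hugging Face transformers library and a small causal LM such as GPT-2 (the model name and example text here are placeholders), perplexity can be computed by exponentiating the mean cross-entropy loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and text; any causal LM checkpoint works similarly.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input IDs as labels makes the model return the mean
    # cross-entropy loss over its next-token predictions.
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity is the exponentiated mean cross-entropy.
perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```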