⚠️ TODO: complete and explain


Response (End-to-end) evaluation metrics

Hallucinations (binary)

Correctness (binary)

Relevance (binary)
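Because each of the three response metrics above is a yes/no label per response, a test set is usually summarized as a rate. The following is a minimal sketch of that aggregation only. The `Judgment` class and `summarize_judgments` function are hypothetical names chosen for illustration, and how the labels are produced (human review, an LLM judge, or something else) is outside what this sketch assumes.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Binary labels for one response (hypothetical structure for illustration)."""
    hallucinated: bool
    correct: bool
    relevant: bool


def summarize_judgments(judgments):
    """Turn per-response binary labels into rates over the whole test set."""
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")
    return {
        # bool is a subclass of int, so sum() counts the True labels
        "hallucination_rate": sum(j.hallucinated for j in judgments) / n,
        "correctness_rate": sum(j.correct for j in judgments) / n,
        "relevance_rate": sum(j.relevant for j in judgments) / n,
    }


if __name__ == "__main__":
    labels = [
        Judgment(hallucinated=False, correct=True, relevant=True),
        Judgment(hallucinated=True, correct=False, relevant=True),
        Judgment(hallucinated=False, correct=True, relevant=False),
        Judgment(hallucinated=False, correct=True, relevant=True),
    ]
    print(summarize_judgments(labels))
```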


Retrieval evaluation metrics

Precision@K

Recall@K

MRR (Mean Reciprocal Rank)

NDCG (Normalized Discounted Cumulative Gain)
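The four retrieval metrics listed above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not a reference implementation: all function names are hypothetical, document IDs are plain strings, Precision@K divides by K even when fewer than K results are returned (conventions differ on this), and NDCG uses the linear-gain form with a log2 discount.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at 0-based position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked, gains_by_doc, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    actual = [gains_by_doc.get(doc, 0.0) for doc in ranked]
    ideal = sorted(gains_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(actual, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]          # system output, best first
    relevant = {"d1", "d2", "d9"}              # ground-truth relevant set
    graded = {"d1": 3.0, "d2": 1.0, "d9": 2.0}  # graded relevance for NDCG
    print(precision_at_k(ranked, relevant, 3))
    print(recall_at_k(ranked, relevant, 4))
    print(reciprocal_rank(ranked, relevant))
    print(ndcg_at_k(ranked, graded, 4))
```

Precision@K and Recall@K ignore the order within the top K, MRR only looks at the first relevant hit, and NDCG is the only one of the four that uses graded relevance and position together.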


Base model NLP evaluation metrics

Perplexity

ROUGE score

Used to evaluate Summarization and Translation

BLEU score

Used to evaluate Translation

METEOR score

Used to evaluate Translation

CIDEr score

Used to evaluate Captioning
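To make the simpler metrics in this list concrete, here is a minimal sketch of perplexity, ROUGE-N recall, and the clipped n-gram precision that BLEU is built on. It makes several assumptions: the function names are hypothetical, text is whitespace-tokenized, there is a single reference, and log-probabilities are natural logs. It deliberately leaves out the rest of BLEU (the geometric mean over n-gram orders and the brevity penalty), and METEOR and CIDEr are not covered.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """exp of the average negative log-probability (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_overlap(candidate, reference, n):
    """Matching n-grams, each counted at most as often as it occurs in both texts."""
    # Counter & Counter keeps the minimum count per n-gram
    return sum((ngrams(candidate, n) & ngrams(reference, n)).values())


def rouge_n_recall(candidate, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    total = sum(ngrams(reference, n).values())
    return clipped_overlap(candidate, reference, n) / total if total else 0.0


def bleu_modified_precision(candidate, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the candidate."""
    total = sum(ngrams(candidate, n).values())
    return clipped_overlap(candidate, reference, n) / total if total else 0.0


if __name__ == "__main__":
    candidate = "the cat sat on the mat".split()
    reference = "the cat is on the mat".split()
    print(rouge_n_recall(candidate, reference, n=1))
    print(bleu_modified_precision(candidate, reference, n=2))
    # A model that assigns probability 0.25 to every token has perplexity of about 4
    print(perplexity([math.log(0.25)] * 4))
```

The two overlap metrics differ only in the denominator: ROUGE-N recall asks how much of the reference was covered, while BLEU's precision asks how much of the candidate is supported by the reference.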