⚠️ TODO: complete and explain
Response (End-to-end) evaluation metrics
Hallucinations (binary)
Correctness (binary)
Relevance (binary)
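Because each of these response-level checks is a yes/no judgment per response, they are usually reported as rates over an evaluation set. A minimal sketch, assuming the judgments have already been produced by a human or an LLM judge; the `judgments` data and the `rate` helper are hypothetical and only illustrate the aggregation:

```python
# Hypothetical per-response judgments: 1 = yes, 0 = no.
judgments = [
    {"hallucination": 0, "correct": 1, "relevant": 1},
    {"hallucination": 1, "correct": 0, "relevant": 1},
    {"hallucination": 0, "correct": 1, "relevant": 0},
    {"hallucination": 0, "correct": 1, "relevant": 1},
]


def rate(records, key):
    """Fraction of records whose binary label `key` is 1."""
    return sum(r[key] for r in records) / len(records)


rates = {k: rate(judgments, k) for k in ("hallucination", "correct", "relevant")}
print(rates)
```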
Retrieval evaluation metrics
Precision@K
Recall@K
MRR
NDCG
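The four retrieval metrics above can be sketched in a few lines each. This is an illustrative sketch using the common textbook definitions, not a reference implementation: document IDs, function names, and the choice of linear gain with a log2 discount for NDCG are assumptions.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mrr(runs):
    """Mean reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with linear gain; `gains` maps doc ID to graded relevance."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: three relevant documents, one retrieved in the top 3.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
```

Precision@K and Recall@K ignore ordering inside the top K, while MRR and NDCG reward placing relevant documents earlier; NDCG additionally supports graded (non-binary) relevance.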
Base model NLP evaluation metrics
Perplexity
ROUGE score
Used to evaluate Summarization, Translation
BLEU score
Used to evaluate Translation
METEOR score
Used to evaluate Translation
CIDEr score
Used to evaluate Captioning
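To make a few of the metrics above concrete, here is a simplified sketch of perplexity and of the n-gram overlap that underlies ROUGE-N and BLEU. It is not a full implementation of either score: real BLEU combines clipped precisions for several n-gram orders with a brevity penalty, and ROUGE has further variants such as ROUGE-L. The function names and whitespace tokenization are assumptions, and METEOR and CIDEr are not covered.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Clipped n-gram matches plus total n-gram counts on each side."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return matched, sum(cand.values()), sum(ref.values())


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams / n-grams in the reference."""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified precision: matched n-grams / n-grams in the candidate."""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


# Hypothetical example sentences.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(rouge_n_recall(candidate, reference), clipped_precision(candidate, reference))

# A model assigning probability 0.25 to every token has perplexity close to 4.
print(perplexity([math.log(0.25)] * 4))
```

ROUGE is recall-oriented (how much of the reference is covered), which suits summarization, while BLEU is precision-oriented (how much of the output is supported by the reference), which suits translation.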