Technical Metrics for GenAI for Text


Evaluating Text Generation: A Look Beyond the BLEU Metric

While evaluating the quality of machine-generated text is crucial, there is no single perfect metric. This article explores the main metrics used to evaluate Generative AI (GenAI) for text, going beyond the popular BLEU score.

BLEU

BLEU (BiLingual Evaluation Understudy): BLEU measures n-gram (sequence of n words) overlap between the generated text and a set of reference translations. It works well for comparing machine translation (MT) systems but has notable shortcomings:

  • Shortcoming 1: Sensitivity to word order: Reordering words can significantly change the BLEU score, even when the meaning stays essentially the same.
  • Shortcoming 2: Focus on precision, not recall: BLEU only measures how many n-grams of the generated text appear in the reference (precision); apart from a brevity penalty, it does not reward covering all of the reference content (recall), so a fluent but incomplete output can still score well.
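As a concrete illustration, here is a minimal sketch of a sentence-level BLEU computation using NLTK's implementation. It assumes the `nltk` package is installed; the tokens and the smoothing choice are illustrative only.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical tokenized reference(s) and system output.
references = [["the", "cat", "sat", "on", "the", "mat"]]  # one or more tokenized references
candidate = ["the", "cat", "is", "on", "the", "mat"]      # tokenized generated text

# Smoothing avoids a zero score when some higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```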
GLEU

GLEU (Google-BLEU): GLEU is a sentence-level variant of BLEU introduced alongside Google's neural machine translation system. For each sentence it computes n-gram precision and n-gram recall against the reference and takes the minimum of the two, which makes it better behaved than plain BLEU when scoring individual sentences.
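A minimal sketch using NLTK's sentence-level GLEU implementation (assuming `nltk` is installed; the example tokens are made up):

```python
from nltk.translate.gleu_score import sentence_gleu

# Hypothetical tokenized reference(s) and system output.
references = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# sentence_gleu takes the minimum of n-gram precision and recall (n = 1..4 by default).
score = sentence_gleu(references, candidate)
print(f"GLEU: {score:.3f}")
```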

METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering): METEOR builds on the ideas behind BLEU but matches words using exact forms, stems (words reduced to their root form), and synonyms, balances precision with recall, and applies a penalty for fragmented word order, giving a more nuanced evaluation.
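A minimal sketch using NLTK's METEOR implementation. It assumes `nltk` is installed and the WordNet data can be downloaded; recent NLTK versions expect pre-tokenized input, and the example tokens are illustrative.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

# WordNet is needed for the synonym-matching stage.
nltk.download("wordnet", quiet=True)

# Hypothetical tokenized reference and system output.
reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "is", "sitting", "on", "the", "mat"]

score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```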

Perplexity

Perplexity: Perplexity measures how well a language model predicts the next word in a sequence; formally, it is the exponential of the average negative log-likelihood per token. Lower perplexity indicates better prediction capability. However, perplexity evaluates the model rather than a specific output, and it doesn't directly assess semantic coherence or fluency.
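Since perplexity is just the exponential of the average negative log-likelihood per token, it can be illustrated with a toy calculation. The probabilities below are made up; a real evaluation would take them from a language model.

```python
import math

# Made-up probabilities a language model might assign to each successive token.
token_probs = [0.20, 0.05, 0.40, 0.10, 0.30]

# Perplexity = exp(average negative log-likelihood per token).
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

print(f"Perplexity: {perplexity:.2f}")  # lower is better
```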

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): The ROUGE family of metrics (ROUGE-N, ROUGE-L, ROUGE-S, ROUGE-W) is recall-oriented: it measures how much of the reference text is covered by the generated text, for example the fraction of reference n-grams that also appear in the generated text (ROUGE-N) or the longest common subsequence between the two (ROUGE-L).
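A minimal sketch using the `rouge-score` package (Google's implementation, assumed to be installed; the example sentences are illustrative):

```python
from rouge_score import rouge_scorer

# Score a hypothetical generated sentence against a reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat sat on the mat",      # reference text
    prediction="the cat is on the mat",   # generated text
)

for name, result in scores.items():
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```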

Other Metrics:

CIDEr (Consensus-based Image Description Evaluation): CIDEr was designed for image captioning. It represents the candidate caption and each reference caption as TF-IDF-weighted n-gram vectors and scores the candidate by its average cosine similarity to the references, so n-grams that the reference "consensus" agrees on carry more weight than generic ones.
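The official CIDEr implementation ships with the COCO caption evaluation toolkit; the following is only a simplified, self-contained sketch of the core idea (TF-IDF-weighted n-gram vectors compared by cosine similarity), not the official metric, and all names and inputs in it are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    # Term frequency of each n-gram, weighted by inverse document frequency
    # computed over the whole corpus of reference sets.
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_docs / (1.0 + doc_freq.get(g, 0)))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, references, corpus_refs, max_n=4):
    """Average TF-IDF-weighted n-gram cosine similarity to the references.

    Simplified sketch: the real CIDEr(-D) adds length penalties and clipping.
    """
    score = 0.0
    for n in range(1, max_n + 1):
        doc_freq = Counter()
        for refs in corpus_refs:  # one set of reference captions per item
            doc_freq.update({g for r in refs for g in ngrams(r, n)})
        cand_vec = tfidf_vector(candidate, n, doc_freq, len(corpus_refs))
        sims = [cosine(cand_vec, tfidf_vector(r, n, doc_freq, len(corpus_refs)))
                for r in references]
        score += sum(sims) / len(sims)
    return score / max_n
```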

MoverScore: MoverScore assesses semantic similarity between generated and reference text by combining contextualized word embeddings (e.g., from BERT) with an optimal-transport (Word Mover's Distance) formulation. It focuses on capturing the overall meaning rather than exact word overlap.

BERTScore: BERTScore leverages contextual embeddings from pre-trained BERT-style models: each token in the generated text is matched to the most similar token in the reference (and vice versa) by cosine similarity, yielding precision, recall, and F1 scores based on these soft matches.
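A minimal sketch using the `bert-score` package (assumed to be installed; the first call downloads a pre-trained model, and the example sentences are illustrative):

```python
from bert_score import score

candidates = ["the cat is on the mat"]   # generated texts
references = ["the cat sat on the mat"]  # reference texts

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```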



Choosing the Right Metric:

The right metric depends on the use case and the task. Here are some considerations:

  • Machine Translation (MT): BLEU, GLEU, METEOR, and ROUGE are commonly used.
  • Text Summarization: ROUGE metrics are often preferred.
  • Open-Ended or Creative Text Generation: Metrics like MoverScore or BERTScore may be more suitable, as they assess semantic similarity rather than exact word overlap.