Ben Humphreys

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.

July 2, 2011 at 2:54pm
More on Evaluation

I’ve been on a bit of an evaluation metrics binge recently. My reasoning is that they are ubiquitous in MT research and often seem to be used as the final word on whether a new approach is deemed ‘successful’ or not. So I wanted to understand how they work and whether they are trustworthy.

As with my previous post, this doesn’t make any new points; it’s more of a summary of miscellaneous thoughts.

The Basics

Probably the most salient point I came across when reading about why evaluation is so hard is that there is a huge (possibly infinite) number of valid translations for any given sentence. Translators can use synonyms, different turns of phrase, and more or less ‘poetic license’. However, that’s not to say that automatic evaluation is impossible.

Evaluation systems generally use multiple (the magic number is 4) human-produced “gold standard” translations and compare MT output against them. I won’t go into the details of how each metric works. For more information, check out the Wikipedia pages on the three metrics you most commonly see: BLEU, METEOR and NIST.
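As a rough illustration of what “comparing MT output against references” can mean, here is a minimal sketch of clipped n-gram precision, the idea at the core of BLEU. It is not a full BLEU implementation (there is no brevity penalty and no combination of n-gram orders), and the function names and example sentences are my own assumptions, not taken from any particular toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a list of all n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also appear in a reference.

    Each n-gram's count is clipped to the maximum number of times it
    occurs in any single reference, so repeating a correct word does
    not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Toy references and candidates, invented for illustration.
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(clipped_precision("the cat sat on the mat".split(), references))
print(clipped_precision("the the the the the the".split(), references))
```

With these toy sentences the first candidate matches 5 of its 6 unigrams, while the degenerate second candidate only gets credit for 2 of its 6, because “the” appears at most twice in any one reference.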

Evaluating Evaluation

To state the obvious, the point of evaluation metrics is to give a score to translations. But the quality of an evaluation metric in itself is how well its scores correlate with human judgement. Prescriptive linguists may point to grammatical correctness as the gold standard, but the final target of MT systems is humans, not linguists :)

The correlation between metric scores and human evaluator scores is often given as Pearson’s coefficient, a value between −1.0 and 1.0 with zero being no correlation at all.
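To make the coefficient concrete, here is a minimal sketch that computes Pearson’s coefficient between a list of metric scores and a list of human scores for the same translations. The scores are made up purely for illustration, and I am assuming one score per translation from each source; papers differ on whether they correlate per sentence or per system.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five translations, purely for illustration.
metric_scores = [0.21, 0.35, 0.40, 0.52, 0.61]
human_scores = [2.0, 3.5, 3.0, 4.0, 4.5]
print(round(pearson(metric_scores, human_scores), 3))
```

Note that the two lists do not need to be on the same scale: the coefficient only measures how linearly one moves with the other.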

To give this value some context, human evaluation, while subjective, tends to have a correlation of around 0.7. Mutton 2007 evaluate multiple translation systems and find that the correlation between them and humans is 0.3 to 0.4 (see also Argawal 2008). Automatic evaluation still has room for improvement.

TODO: Find more detailed information on how Pearson’s coefficient is used.

Evaluation Methodologies

One problem with evaluation is that the methods used to evaluate text will naturally give better scores to text produced by systems that use similar methodology.

To put this in more concrete terms, imagine you’re trying to measure the fluency of some text. The golden-hammer approach is to learn some n-grams from a large corpus and use that information to see whether the produced text has likely n-grams or not. If the text was produced by something that uses n-grams, even if the training corpora are different, it will get a better score than text produced by, say, a hierarchical MT system.
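The n-gram approach to fluency described above could be sketched roughly as follows: count bigrams in a corpus, then score a sentence by its average smoothed bigram log-probability. The tiny corpus, the add-one smoothing, and the function names are all assumptions of mine for illustration; a real system would use a much larger corpus and better smoothing.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams in tokenised sentences (with boundary markers)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def fluency_score(tokens, unigrams, bigrams):
    """Average log-probability per bigram, using add-one smoothing."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    return log_prob / (len(padded) - 1)


# Toy training corpus, invented for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
unigrams, bigrams = train_bigram_model(corpus)
print(fluency_score("the cat sat on the rug".split(), unigrams, bigrams))
print(fluency_score("rug the on sat cat the".split(), unigrams, bigrams))
```

The first sentence reuses bigrams seen in the corpus and scores higher than the scrambled second one. This also shows the bias: any generator that itself prefers frequent n-grams will tend to look good to a scorer like this.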

Mutton 2007 showed this by using various syntactic parsers to evaluate text fluency. They found that the correlation between metrics and humans changed significantly when the evaluation method differed from the generation method (e.g. hierarchical vs n-gram).

Fluency and Adequacy

As I mentioned in my previous post, translation quality can be broken down into fluency and adequacy.

All three of the most popular evaluation metrics produce scores which do not differentiate between fluency and adequacy. Indeed, the gold-standard references which are being used mix fluency and adequacy.

There are evaluation metrics being developed that can independently evaluate fluency and/or adequacy (Banchs 2011, Lo 2011, Mutton 2007) (thanks to Graham for these).

To promote fluency of output, MT systems already have a language model \(P(e)\) in the classic \(P(e)P(f|e)\). In addition, some MT systems use evaluation (particularly BLEU score) to direct learning methods. I would be interested in seeing whether fluency-only and adequacy-only metrics could be used in the training process to produce better output.
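For reference, the standard noisy-channel formulation picks the translation \(\hat{e}\) of a source sentence \(f\) by applying Bayes’ rule and dropping the denominator, which does not depend on \(e\). This is textbook background rather than anything specific to the papers above:

\[
\hat{e} = \arg\max_{e} P(e|f) = \arg\max_{e} \frac{P(e)\,P(f|e)}{P(f)} = \arg\max_{e} P(e)\,P(f|e)
\]

Here \(P(e)\) is the language model, which rewards fluent output, and \(P(f|e)\) is the translation model, which rewards output that accounts for the source sentence.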

On a related note, in reading about log-linear systems, it was suggested that ungrammatical sentences could be penalised by including a parameter that evaluates whether a sentence has a verb or not. I’ve yet to look into the details of a real system where this has been used. It seems that these kinds of problems could be picked up with simple chunking without the need for an expensive syntactic parse, but maybe the hardest part would be promoting grammatical sentences without affecting the adequacy of the translation.
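As a rough sketch of the idea, a log-linear model scores a candidate as a weighted sum of feature values, so a “has a verb” check is just one more feature with its own weight. Everything here is hypothetical: the tag names, the weights, the log-probabilities, and the assumption that candidates arrive already POS-tagged. It is not taken from any real decoder.

```python
# Assumed coarse POS tags that count as verbs; a real tagset would differ.
VERB_TAGS = {"VERB", "AUX"}


def has_verb(tagged_tokens):
    """Binary feature: 1.0 if any (word, tag) pair carries a verb tag."""
    return 1.0 if any(tag in VERB_TAGS for _, tag in tagged_tokens) else 0.0


def log_linear_score(features, weights):
    """Weighted sum of feature values (the log of the unnormalised model score)."""
    return sum(weights[name] * value for name, value in features.items())


def score_candidate(tagged_tokens, lm_logprob, tm_logprob, weights):
    """Combine language model, translation model and verb features."""
    features = {
        "lm": lm_logprob,
        "tm": tm_logprob,
        "has_verb": has_verb(tagged_tokens),
    }
    return log_linear_score(features, weights)


# Hypothetical weights; in practice these would be tuned, e.g. against BLEU.
weights = {"lm": 1.0, "tm": 1.0, "has_verb": 0.5}
with_verb = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
without_verb = [("the", "DET"), ("cat", "NOUN"), ("sleep", "NOUN")]
print(score_candidate(with_verb, -4.0, -3.0, weights))     # -6.5
print(score_candidate(without_verb, -4.0, -3.0, weights))  # -7.0
```

With identical language and translation model scores, the candidate containing a verb wins by exactly the feature weight. How large that weight should be is the hard part mentioned above: too large and the system will prefer verb-bearing output even when it is a worse translation.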

TODO: Find models that use such general syntactic information.

Final Thoughts

Evaluation models that separately score fluency and adequacy seem to better represent the way humans evaluate sentences. I’m curious to see how treating them independently can be used to improve the fluency of MT output.

Each evaluation metric seems to have its own flaws and none of them is perfect. Newer, improved algorithms are being created, but part of the problem with their wider adoption seems to be standardisation. BLEU, NIST and other scores are still the de facto standard, and while other metrics exist they don’t have the wide use of the main three.

I’ve still got a lot more reading to do.
