December 26, 2011 at 8:40pm
Segmentation and Evaluation
This is just a short post as it’s too long to put on Twitter. Today I tried segmenting NTCIR-7 English–Japanese MT data by various methods to see how each one affected the correlation of BLEU and RIBES scores with human judgements.
Character-level BLEU was tried in BLEU in characters (Denoual 2005), which showed that for English, BLEU on the character level correlates with word-level BLEU. However, Japanese is a very different language, so I’m not sure how applicable that result is. I was curious how naïve segmentation of Japanese would compare against JUMAN, as well as how removing particles or limiting the output to only the kanji would affect correlation.
| Segmentation | BLEU Fluency | BLEU Adeq | RIBES Fluency | RIBES Adeq |
|---|---|---|---|---|
| juman | 0.384 | 0.296 | 0.537 | 0.618 |
| juman_noparticles | 0.347 | 0.257 | 0.527 | 0.609 |
| juman_nopunc | 0.363 | 0.273 | 0.507 | 0.585 |
| kanakanji | 0.409 | 0.330 | 0.544 | 0.632 |
| kytea | 0.383 | 0.325 | 0.540 | 0.625 |
| only_kanji | 0.309 | 0.202 | 0.507 | 0.555 |
| only_kanji_1gram | 0.309 | 0.203 | 0.505 | 0.563 |
| 1gram | 0.351 | 0.274 | 0.527 | 0.635 |
| 2gram | 0.203 | 0.130 | 0.363 | 0.415 |
| 3gram | 0.206 | 0.148 | 0.275 | 0.243 |
| 4gram | 0.172 | 0.094 | 0.130 | 0.090 |
| 5gram | 0.170 | 0.140 | 0.141 | 0.127 |
The weird thing I noticed was that 1-grams have almost as good a correlation with human evaluators as Japanese segmented using JUMAN.
For BLEU my Kana/Kanji naïve segmentation method got the highest correlation. For this I segment letters into kanji/kana/punctuation groups, as in the example below:
- Ref: その 結果 , 第 2 固形感光性樹脂膜 120 を 除去 する 工程 を 簡略化 できるので , 液状感光性樹脂膜 118 を 使用 して バンプ 電極 122 構造 を 形成 しても , 半導体装置 の 製造 コスト の 上昇 を 抑制 できる 。
- Candidate: これにより 、 液状感光性樹脂膜 の 構造 を 用 いることにより 、 バンプ 電極 124 が 形成 されているにもかかわらず 、118、 第 2 の 固体 の 除去工程 を 簡略化 することができる 感光性樹脂膜 120、 したがって 、 半導体装置 の 製造 コスト の 上昇 も 抑 えることができる 。
In this case I assume that the short particles, numbers and punctuation are matching up. There’s no way huge chunks like “されているにもかかわらず” will match up. This method even breaks up verbs incorrectly as in “抑 える”.
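A minimal sketch of this kind of script-based segmentation in Python. It assumes that hiragana, katakana and kanji are separate classes and that everything else (digits, punctuation, Latin letters) falls into one catch-all class, which is what the example above suggests. The function names and the Unicode ranges are illustrative choices, not the exact script used for the table.

```python
import itertools


def char_class(ch):
    """Return a coarse script class for one character (assumed ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    return "other"                 # digits, punctuation, Latin, ...


def segment(text):
    """Split text into runs of characters that share a script class."""
    tokens = []
    # Existing whitespace is kept as a hard boundary.
    for chunk in text.split():
        for _, group in itertools.groupby(chunk, key=char_class):
            tokens.append("".join(group))
    return tokens


print(segment("半導体装置の製造コストの上昇も抑えることができる。"))
```

For the end of the candidate sentence this prints `['半導体装置', 'の', '製造', 'コスト', 'の', '上昇', 'も', '抑', 'えることができる', '。']`, including the incorrectly broken verb “抑 える”.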
My problem is that I’m trying to work out whether this even means anything. In the case of RIBES, a high score means that characters in the reference and candidate translations are in roughly the same positions. However as the “words” are now single characters, maybe the chance that there are hiragana in similar positions is high enough to give false high scores? But in that case wouldn’t the correlation be fairly poor?
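To make the “roughly the same positions” idea concrete, here is a heavily simplified rank-order score, loosely in the spirit of RIBES but not the reference implementation. It only aligns tokens that occur exactly once on both sides, and the name `rank_order_score` and the default `alpha=0.25` are assumptions of this sketch.

```python
def rank_order_score(candidate, reference, alpha=0.25):
    """Normalized Kendall's tau over uniquely aligned tokens,
    scaled by a unigram-precision penalty."""
    cand_once = {w for w in candidate if candidate.count(w) == 1}
    ref_once = {w for w in reference if reference.count(w) == 1}
    shared = cand_once & ref_once

    # Reference positions of the aligned tokens, in candidate order.
    ranks = [reference.index(w) for w in candidate if w in shared]
    if len(ranks) < 2:
        return 0.0

    pairs = len(ranks) * (len(ranks) - 1) // 2
    concordant = sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] < ranks[j]
    )
    nkt = concordant / pairs                 # 1.0 = same order, 0.0 = reversed
    precision = len(ranks) / len(candidate)  # share of candidate tokens aligned
    return nkt * precision ** alpha
```

With single characters as “words”, common hiragana repeat within a sentence and so drop out of the unique alignment in this sketch. How the real metric treats such repeated tokens is exactly the part I would need to check.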
I tried varying the max length of n-grams that BLEU considered, but it didn’t have much effect on the correlation.
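Varying the maximum n-gram length corresponds to the `max_n` parameter in a sketch like the one below. This is a simplified sentence-level BLEU with a single reference and no smoothing, not the scoring script used for the numbers above. The names `bleu`, `ngram_counts` and `max_n` are made up for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

With character-level tokens, `max_n=1` only checks which characters appear, while larger values start to reward short matching character sequences.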
BLEU correlating better with fluency than with adequacy, and RIBES better with adequacy than with fluency, supports what I’ve read recently about their relative strengths.
I need to think about this more.
July 2, 2011 at 2:54pm
More on Evaluation
I’ve been on a bit of an evaluation metrics binge recently. My reasoning is that they are ubiquitous in MT research and often seem to be used as the final word on whether a new approach is deemed ‘successful’ or not. So I wanted to understand how they work and whether they are trustworthy.
As with my previous post, this doesn’t make any new points, it’s more of a summary of miscellaneous thoughts.
The Basics
Probably the most salient point that I came across when explaining why evaluation is so hard is that there are a huge number (possibly infinite) of valid translations for any given sentence. Translators can use synonyms, different turns of phrase, more or less ‘poetic license’. However that’s not to say that automatic evaluation is impossible.
Evaluation systems generally use multiple (the magic number is 4) human-produced “gold standard” translations and compare MT output against them. I won’t go into the details of how each metric works. For more information, check out the Wikipedia pages on the three metrics you most commonly see: BLEU, METEOR and NIST.
Evaluating Evaluation
To state the obvious somewhat, the point of evaluation metrics is to give a score to translations. But the quality of an evaluation metric itself is how well its scores correlate with human judgement. Prescriptive linguists may point to grammatical correctness as the gold standard, but the final target of MT systems is humans, not linguists :)
The correlation between metric scores and human evaluator scores is often given as Pearson’s coefficient, a value between −1.0 and 1.0 with zero being no correlation at all.
To give this value some context, human evaluation, while subjective, tends to have a correlation of around 0.7. Mutton 2007 evaluate multiple translation systems and find the correlation between them and humans is 0.3 to 0.4 (see also Agarwal 2008). Automatic evaluation still has room for improvement.
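As a reminder of what the number measures, Pearson’s coefficient between a list of metric scores and a list of human scores can be computed as in the sketch below. The function name and the sample values in the usage comment are hypothetical, purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical usage: one metric score and one human score per sentence.
# pearson(metric_scores, human_scores)
```

A value near 1.0 means the metric ranks and spaces translations much like the human judges do, and a value near zero means its scores say little about what the judges thought.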
TODO: Find more detailed information on how Pearson’s coefficient is used.
Evaluation Methodologies
One problem with evaluation is that the methods used to evaluate text will naturally give better scores to text produced by systems that use similar methodology.
To put this in more concrete terms, imagine you’re trying to measure fluency of some text. The golden-hammer approach is to learn some n-grams from a large corpus and use the information to see if the produced text has likely n-grams or not. If the text was produced by something that uses n-grams, even if the training corpora are different, it will get a better score than text produced by say, a hierarchical MT system.
Mutton 2007 showed this by using various syntactic parsers to evaluate text fluency. They found that the correlation between metrics and humans changed significantly when the evaluation method differed from the generation method (e.g. hierarchical vs n-gram).
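The n-gram approach to fluency described above could look roughly like the following. It is a toy bigram model with add-one smoothing, which is an assumption of this sketch rather than anything taken from the papers, and the function names are made up.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Collect context and bigram counts from tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
        vocab.update(tokens)
    return contexts, bigrams, len(vocab)


def fluency_score(tokens, contexts, bigrams, vocab_size):
    """Average log-probability per bigram; higher means more 'fluent'."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing so unseen bigrams get a small probability.
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log(prob)
    return total / (len(padded) - 1)
```

Output from a system that itself picks likely n-grams will tend to score well here regardless of whether it is a good translation, which is the bias being described.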
Fluency and Adequacy
As I mentioned in my previous post, translation quality can be broken down into fluency and adequacy.
All three of the most popular evaluation metrics produce scores which do not differentiate between fluency and adequacy. Indeed, the gold-standard references which are being used mix fluency and adequacy.
There are evaluation metrics being developed that can independently evaluate fluency and/or adequacy (Banchs 2011, Lo 2011, Mutton 2007) (thanks to Graham for these).
To promote fluency of output, MT systems already have a language model \(P(e)\) in the classic \(P(e)P(f|e)\). In addition, some MT systems use evaluation metrics (particularly BLEU) to direct learning. I would be interested in seeing whether fluency-only and adequacy-only metrics could be used in the training process to produce better output.
On a related note, in reading about log-linear systems, it was suggested that ungrammatical sentences could be penalised by including a parameter that evaluates whether a sentence has a verb or not. I’ve yet to look into the details of a real system where this has been used. It seems that these kinds of problems could be detected with simple chunking, without the need for an expensive syntactic parse, but maybe the hardest part would be promoting grammatical sentences without affecting the adequacy of the translation.
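For reference, the classic noisy-channel decomposition follows from Bayes’ rule, with \(f\) the source sentence and \(e\) a candidate translation. The symbol \(\hat{e}\) for the chosen translation is just notation for this note.

\[
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)} = \arg\max_{e} P(e)\,P(f \mid e)
\]

\(P(f)\) can be dropped because it does not depend on \(e\). The language model \(P(e)\) is the part that rewards fluency, while the translation model \(P(f \mid e)\) is mostly responsible for adequacy.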
TODO: Find models that use such general syntactic information.
Final Thoughts
Evaluation models that separately score fluency and adequacy seem to better represent the way humans can evaluate sentences. I’m curious to see how treating them independently can be used to improve fluent output.
Each evaluation metric seems to have its own flaws and none of them is perfect. Newer, improved algorithms are being created, but part of the problem with their wider use seems to be standardisation. BLEU, NIST and other scores are still the de facto standard, and while other metrics exist they don’t have the wider use of the main three.
I’ve still got a lot more reading to do.