Segmentation and Evaluation
This is just a short post as it’s too long to put on Twitter. Today I tried segmenting the NTCIR-7 English–Japanese MT data by various methods to see whether it affected how well BLEU and RIBES scores correlate with human judgements.
Character-level BLEU was tried in “BLEU in characters” (Denoual and Lepage 2005), which showed that for English, character-level BLEU correlates with word-level BLEU. However, Japanese is a very different language, so I’m not sure how applicable that result is. I was curious how naïve segmentation of Japanese would compare against JUMAN, and how removing particles or keeping only the kanji would affect the correlation.
Segmentation         BLEU                 RIBES
                     Fluency   Adequacy   Fluency   Adequacy
juman                0.384     0.296      0.537     0.618
juman_noparticles    0.347     0.257      0.527     0.609
juman_nopunc         0.363     0.273      0.507     0.585
kanakanji            0.409     0.330      0.544     0.632
kytea                0.383     0.325      0.540     0.625
only_kanji           0.309     0.202      0.507     0.555
only_kanji_1gram     0.309     0.203      0.505     0.563
1gram                0.351     0.274      0.527     0.635
2gram                0.203     0.130      0.363     0.415
3gram                0.206     0.148      0.275     0.243
4gram                0.172     0.094      0.130     0.090
5gram                0.170     0.140      0.141     0.127
The weird thing I noticed was that single characters (1-grams) correlate with human evaluators almost as well as Japanese segmented using JUMAN.
For BLEU, my naïve kana/kanji segmentation method got the highest correlation. For this I split the text into runs of kanji, kana and punctuation, as in the example below:
- Ref: その 結果 , 第 2 固形感光性樹脂膜 120 を 除去 する 工程 を 簡略化 できるので , 液状感光性樹脂膜 118 を 使用 して バンプ 電極 122 構造 を 形成 しても , 半導体装置 の 製造 コスト の 上昇 を 抑制 できる 。
- Candidate: これにより 、 液状感光性樹脂膜 の 構造 を 用 いることにより 、 バンプ 電極 124 が 形成 されているにもかかわらず 、118、 第 2 の 固体 の 除去工程 を 簡略化 することができる 感光性樹脂膜 120、 したがって 、 半導体装置 の 製造 コスト の 上昇 も 抑 えることができる 。
In this case I assume that the short particles, numbers and punctuation are matching up. There’s no way huge chunks like “されているにもかかわらず” will match up. This method even breaks up verbs incorrectly as in “抑 える”.
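Here is a minimal sketch of that kind of grouping in Python. The names `script_class` and `segment` are hypothetical and the character classes are assumptions, so it won't reproduce the example above exactly (for instance, it splits digits from neighbouring punctuation, and it keeps hiragana and katakana as separate classes).

```python
import unicodedata


def script_class(ch):
    """Return a coarse script label for one character (assumed classes)."""
    if ch.isspace():
        return "space"
    code = ord(ch)
    # CJK Unified Ideographs, plus the iteration mark used inside kanji words
    if 0x4E00 <= code <= 0x9FFF or ch == "々":
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    # The katakana block also contains the long-vowel mark (U+30FC)
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if ch.isdigit():
        return "digit"
    if unicodedata.category(ch).startswith("P"):
        return "punct"
    return "other"


def segment(text):
    """Split text wherever the script class changes; whitespace is dropped."""
    tokens = []
    prev = None
    for ch in text:
        cls = script_class(ch)
        if cls == "space":
            prev = None  # force a new token after whitespace
            continue
        if cls == prev:
            tokens[-1] += ch
        else:
            tokens.append(ch)
        prev = cls
    return tokens
```

Calling `segment("上昇を抑えることができる。")` gives `['上昇', 'を', '抑', 'えることができる', '。']`, which shows the same incorrect verb split as “抑 える” above.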
My problem is that I’m trying to work out whether this even means anything. In the case of RIBES, a high score means that the words shared by the reference and candidate translations appear in roughly the same order. However, as the “words” are now single characters, maybe the chance that there are hiragana in similar positions is high enough to give falsely high scores? But in that case wouldn’t the correlation be fairly poor?
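To make the word-order idea concrete, here is a simplified RIBES-like score in Python. It is only a sketch: the name `simple_ribes` is hypothetical, it aligns only tokens that occur exactly once on both sides (real RIBES uses surrounding context to align repeated words), and the `alpha` and `beta` values are what I understand the usual defaults to be.

```python
import math
from collections import Counter
from itertools import combinations


def simple_ribes(reference, candidate, alpha=0.25, beta=0.10):
    """Simplified RIBES: order correlation x precision^alpha x brevity^beta."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Reference position of every token that is unique in the reference
    ref_pos = {t: i for i, t in enumerate(reference) if ref_counts[t] == 1}
    # Reference ranks of uniquely aligned tokens, in candidate order
    ranks = [ref_pos[t] for t in candidate
             if cand_counts[t] == 1 and t in ref_pos]
    if len(ranks) < 2:
        return 0.0  # not enough aligned tokens to measure order
    pairs = list(combinations(ranks, 2))
    # Normalized Kendall's tau: fraction of pairs kept in reference order
    nkt = sum(1 for a, b in pairs if a < b) / len(pairs)
    precision = len(ranks) / len(candidate)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return nkt * precision ** alpha * brevity ** beta
```

With single characters as tokens, common hiragana repeat within a sentence, so in this simplified version most of them would be dropped from the alignment rather than inflate the order score. Whether the real alignment behaves the same way is something I'd still need to check.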
I tried varying the max length of n-grams that BLEU considered, but it didn’t have much effect on the correlation.
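For reference, this is roughly what varying the maximum n-gram length looks like. It is a minimal single-reference, sentence-level BLEU without smoothing, written as an illustration rather than the scoring script used for the numbers above; the function names and the `max_n` parameter are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Unsmoothed sentence BLEU using n-grams of length 1..max_n."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped matches: each n-gram counts at most as often as in the reference
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if matched == 0:
            return 0.0  # no smoothing: any empty order zeroes the score
        log_precision += math.log(matched / sum(cand.values())) / max_n
    # Brevity penalty only applies when the candidate is not longer than the reference
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision)
```

The tokens can be whatever the segmentation produced, so `bleu(list(ref), list(cand), max_n=2)` scores single characters with unigrams and bigrams only.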
BLEU correlating better with fluency than with adequacy, and RIBES better with adequacy than with fluency, supports what I’ve read recently about their relative strengths.
I need to think about this more.
benhumphreys posted this