Ben Humphreys

Waiting for Yasaka Jinja. I don’t know why.

  • 1 month ago

Using MongoDB for Research - Don’t

  1. Parse data using your programming language of choice. Wonderful.
  2. Insert data into MongoDB in an easy-to-understand hierarchical structure.
  3. Write other scripts to compare, process and analyse the data. Joy.
  4. Add more data to the database.
  5. See BSONElement exception. Curse, search the internet for why.
  6. Give up, run mysterious db.repairDatabase(). Hope data is OK. Fail. Reload data.
  7. Run some tools, add more data to the DB.
  8. 15 minutes later, GOTO 5

Here’s to you, Invalid BSONObj size: -286331154
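As an aside, that number is less arbitrary than it looks. A BSON document starts with its total size stored as a little-endian signed 32-bit integer, and -286331154 is exactly what that field reads as when all four bytes are 0xEE. The sketch below only shows the arithmetic. Reading it as a sign that the size came from filler bytes rather than from a real document is an assumption, not something I have confirmed.

```python
import struct

# A BSON document begins with its total size as a little-endian signed int32.
# Assumption for illustration: the four size bytes were all 0xEE.
size_field = bytes([0xEE] * 4)

(size,) = struct.unpack("<i", size_field)
print(size)                     # -286331154
print(hex(size & 0xFFFFFFFF))   # 0xeeeeeeee
```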

    • #programming
    • #mongodb
  • 1 month ago

Segmentation and Evaluation

This is just a short post as it’s too long to put on Twitter. Today I tried segmenting the NTCIR-7 English–Japanese MT data by various methods to see how the segmentation affects the correlation of BLEU and RIBES scores with human fluency and adequacy judgements.

Character-level BLEU was tried in BLEU in characters (Denoual and Lepage, 2005), which showed that for English, BLEU on the character level correlates with word-level BLEU. However, Japanese is a very different language, so I’m not sure how applicable that result is. I was curious how naïve segmentation of Japanese would compare against JUMAN, as well as how removing particles or limiting the output to only the kanji would affect correlation.

                    BLEU              RIBES
                    Fluency  Adeq     Fluency  Adeq
juman               0.384    0.296    0.537    0.618
juman_noparticles   0.347    0.257    0.527    0.609
juman_nopunc        0.363    0.273    0.507    0.585
kanakanji           0.409    0.330    0.544    0.632
kytea               0.383    0.325    0.540    0.625
only_kanji          0.309    0.202    0.507    0.555
only_kanji_1gram    0.309    0.203    0.505    0.563
1gram               0.351    0.274    0.527    0.635
2gram               0.203    0.130    0.363    0.415
3gram               0.206    0.148    0.275    0.243
4gram               0.172    0.094    0.130    0.090
5gram               0.170    0.140    0.141    0.127

The weird thing I noticed was that 1-grams (single characters) have almost as good a correlation with human evaluators as Japanese segmented using JUMAN.

For BLEU, my naïve kana/kanji segmentation method got the highest correlation. For this I split the text into runs of kanji, kana and punctuation, as in the example below:

  • Ref: その 結果 , 第 2 固形感光性樹脂膜 120 を 除去 する 工程 を 簡略化 できるので , 液状感光性樹脂膜 118 を 使用 して バンプ 電極 122 構造 を 形成 しても , 半導体装置 の 製造 コスト の 上昇 を 抑制 できる 。
  • Candidate: これにより 、 液状感光性樹脂膜 の 構造 を 用 いることにより 、 バンプ 電極 124 が 形成 されているにもかかわらず 、118、 第 2 の 固体 の 除去工程 を 簡略化 することができる 感光性樹脂膜 120、 したがって 、 半導体装置 の 製造 コスト の 上昇 も 抑 えることができる 。

In this case I assume that the short particles, numbers and punctuation are matching up. There’s no way huge chunks like “されているにもかかわらず” will match up. This method even breaks up verbs incorrectly, as in “抑 える”.
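The segmentation above can be sketched roughly as follows. This is a reconstruction from the example output rather than the script I actually ran: the function names and the exact Unicode ranges are assumptions, hiragana and katakana are treated as separate classes (the example splits “コスト の”), and digits and punctuation share one “other” class (the example keeps “、118、” together).

```python
from itertools import groupby


def script_class(ch):
    """Classify one character by script, using Unicode block ranges."""
    if ch.isspace():
        return "space"
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= code <= 0x309F:   # Hiragana
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana
        return "katakana"
    return "other"                 # digits, punctuation, Latin letters, ...


def naive_segment(text):
    """Split text into maximal runs of same-class characters, dropping whitespace."""
    return ["".join(run)
            for cls, run in groupby(text, key=script_class)
            if cls != "space"]


print(naive_segment("製造コストの上昇も抑えることができる。"))
# ['製造', 'コスト', 'の', '上昇', 'も', '抑', 'えることができる', '。']
```

Because the split depends only on script changes, it reproduces the incorrect verb split “抑 える” from the example.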

My problem is that I’m trying to work out whether this even means anything. In the case of RIBES, a high score means that characters in the reference and candidate translations are in roughly the same positions. However, as the “words” are now single characters, maybe the chance of hiragana landing in similar positions is high enough to give falsely high scores? But in that case, wouldn’t the correlation be fairly poor?
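To make “roughly the same positions” concrete, here is a minimal sketch of the rank-correlation idea behind RIBES. It is not the RIBES implementation: it only looks at tokens that occur exactly once in both sentences, and it leaves out the precision and brevity terms. The function name is hypothetical.

```python
from itertools import combinations


def normalized_kendall_tau(reference, candidate):
    """Order agreement in [0, 1] over tokens occurring exactly once in both sequences."""
    shared = [t for t in candidate
              if candidate.count(t) == 1 and reference.count(t) == 1]
    # Positions in the reference, listed in candidate order.
    ranks = [reference.index(t) for t in shared]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 0.0
    # A pair is concordant when both sentences put the two tokens in the same order.
    concordant = sum(1 for a, b in pairs if a < b)
    tau = (2 * concordant - len(pairs)) / len(pairs)
    return (tau + 1) / 2


print(normalized_kendall_tau("a b c d".split(), "a b c d".split()))  # 1.0
print(normalized_kendall_tau("a b c d".split(), "d c b a".split()))  # 0.0
```

With single characters as tokens, repeated hiragana drop out of this simplified version entirely, so it cannot answer the question above on its own. It only shows what the ordering score measures.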

I tried varying the max length of n-grams that BLEU considered, but it didn’t have much effect on the correlation.
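For reference, this is roughly what varying the maximum n-gram length means. The sketch is a single-reference, sentence-level BLEU without smoothing, whereas BLEU is normally computed over a whole corpus, so treat it as an illustration of the max_n parameter rather than the scorer I used. The names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference, candidate, max_n=4):
    """Single-reference BLEU with uniform weights over n = 1..max_n."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


ref, cand = list("abcd"), list("abdc")
for max_n in (1, 2, 3):
    print(max_n, sentence_bleu(ref, cand, max_n=max_n))
```

The tokens can be words, characters or the output of any of the segmenters in the table; only the token lists change.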

BLEU correlating better with fluency than with adequacy, and RIBES better with adequacy than with fluency, supports what I’ve read recently about their relative strengths.

I need to think about this more.

    • #mt
    • #evaluation
    • #phd
  • 1 month ago
Christmas lunch! Mmm stuffing.

  • 1 month ago
Made Filipino food for lunch :D

  • 1 month ago
Japanese kids book. On important business.

  • 1 month ago
Akōshi, Japan

  • 1 month ago

TeX/PDF → HTML

I haven’t managed to get the idea of publishing papers in HTML out of my head. I’m convinced now that 99% of the work is in decent conversion to HTML. The presentation aspect is tricky but can be done with copious amounts of CSS and JavaScript.

Back to conversion. It seems there are two possible ways to tackle it, each with its own strengths and difficulties:

  • TeX → HTML, or
  • PDF → HTML

TeX → HTML

This seemed like the most logical choice. TeX is already a form of markup language: it has headings, emphasis, references, everything that would be needed in an HTML-based paper.

However, it’s not that simple. The trouble stems from the fact that TeX can be re-programmed: new commands can be created and macros written. As Donald Knuth himself said, “Only TeX can parse TeX.”

Parsing TeX with TeX

If only TeX can parse TeX, that got me wondering about creating a package for TeX that emits HTML in addition to its usual output.

For example, \section could be overridden to also produce an <h1> tag. That way, even if people have wrapped \section in their own macros, the overridden version still gets called.

ConTeXt provides something like this via the command \setupbackend[export=yes,xhtml=yes]. It was designed for producing ePub books, which are zipped XHTML files. Consequently, the output looks closer to XML than HTML and is not really the clean HTML5 document I had in mind.

The major downside to this approach is the difficulty in using it. Users would have to have a particular install of LaTeX, change their headers, run LaTeX, then run another tool in order to convert the XHTML to HTML5.

I had been hoping for a standalone tool that worked on TeX files or PDF, and was simple to install. Something like gem install my_converter and my_converter mypaper.tex. Which brings us to the alternative to using TeX itself.

Parsing TeX with Ruby/Python/Perl

You could parse TeX with any programming language, but the problem is the same. It’s possible to parse LaTeX using a grammar-based parser generator like Treetop (indeed, someone has already written a basic parser), but it’s nearly impossible to handle any custom commands that people may have defined in their TeX documents.
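A toy illustration of the custom-command problem, using regular expressions rather than a real grammar. The two rules and the \mysec macro are made up for the example. A converter that only knows fixed command names handles the plain document, but leaves the call to the user-defined macro untouched and mangles its definition.

```python
import re

# Hypothetical, deliberately naive rules: each maps one fixed LaTeX command to an HTML tag.
RULES = [
    (re.compile(r"\\section\{([^{}]*)\}"), r"<h1>\1</h1>"),
    (re.compile(r"\\emph\{([^{}]*)\}"), r"<em>\1</em>"),
]


def naive_tex_to_html(tex):
    """Apply each rewrite rule in turn; knows nothing about macro definitions."""
    for pattern, replacement in RULES:
        tex = pattern.sub(replacement, tex)
    return tex


plain = r"\section{Results} This is \emph{important}."
custom = r"\newcommand{\mysec}[1]{\section{#1}} \mysec{Results}"

print(naive_tex_to_html(plain))
# <h1>Results</h1> This is <em>important</em>.
print(naive_tex_to_html(custom))
# \newcommand{\mysec}[1]{<h1>#1</h1>} \mysec{Results}
```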

The crucial question is how much people use custom commands. If they’re only used in extremely long or weird TeX documents, then a tool that does not support them would still be worth having.

Having gone nuts over evaluation methods for machine translation, and being a TDD believer, my first reaction is to create a testing/evaluation method to see how well a simple parser would cover, say, 100 random documents from random fields on arXiv.org. That in itself is another week or two’s solid work.
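A first, very rough version of that evaluation could simply count how many documents define their own commands at all. The sketch below assumes the TeX sources have already been downloaded and read into strings. The function names and the list of definition commands are my own guesses at what should count as “custom”, and it does not skip definitions that appear inside comments.

```python
import re

# Assumption: these are the definition commands worth counting as "custom".
DEFINITION = re.compile(
    r"\\(?:newcommand|renewcommand|providecommand|def|newenvironment)\b"
)


def defines_custom_commands(tex_source):
    """True if the source contains at least one command or environment definition."""
    return DEFINITION.search(tex_source) is not None


def fraction_with_custom_commands(documents):
    """documents: iterable of TeX source strings, e.g. a sample of arXiv papers."""
    documents = list(documents)
    if not documents:
        return 0.0
    return sum(defines_custom_commands(d) for d in documents) / len(documents)


sample = [
    r"\documentclass{article}\begin{document}Hi\end{document}",
    r"\newcommand{\R}{\mathbb{R}} Let $x \in \R$.",
    r"\def\x{1} Value: \x",
    r"\definecolor{red}{rgb}{1,0,0}",
]
print(fraction_with_custom_commands(sample))  # 0.5
```

This only measures whether definitions are present, which is a proxy. Whether a simple parser would actually cover a document depends on where and how those commands are used.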

Other Existing Tools

There are some existing tools that I mentioned in my previous post, but they all seem to be flawed in some fundamental way. The two most promising tools I’ve seen are Pandoc and tex4ht.

Pandoc can produce HTML and LaTeX from Markdown, which is an interesting idea that I might cover another time. But its TeX to HTML conversion is rather simplistic and messy. The HTML output is odd and it does not support very much TeX syntax.

tex4ht, on the other hand, is generally praised as the most comprehensive tool for the job. I’ve not got it working yet so I can’t say. I’m hoping it doesn’t do what a lot of TeX to HTML converters do and try to reproduce the page format exactly.

PDF → HTML

Assuming that the objective is to make as many papers available in HTML as possible, the ideal solution would convert existing PDFs to HTML.

Search engines are already scraping PDF documents for indexing, and can reproduce them in HTML with absolute positioning and faking the right paper size.

However, the trick would be not to recreate the PDF formatting, but to parse the document and create a semantically marked-up HTML equivalent. That is, using heading tags, <table>s and so on to reproduce the content in a device-independent way.

Features

Assuming you can get from PDF to some kind of plain text or absolutely-positioned HTML, there are a bunch of features that you could use to guess what kind of information is in the text.

  • Order: title, authors, abstract, content
  • Content
    • Keywords like “abstract” start a section
    • Headings are likely to have heading numbers (2.1 etc.)
    • Capitalisation
    • Punctuation: headings are less likely to have periods/full stops
  • Typography
    • Font size: work out the list of font sizes; the one with the most text is body, and the others are various heading sizes. Larger means heading
    • Font weight: bold indicates a heading, or emphasis when it appears within body text
  • Sentence length: full sentences (8+ words) ending in full stops indicate body text. Others will be headings or tables
    • With really bad converters/documents, each line is a separate div, and it is hard to tell what is body text and what is not
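The font-size feature from the list above could be sketched like this. It assumes some PDF text extractor has already produced (text, font size) pairs; that input format and the function name are hypothetical, and treating anything smaller than body text as “other” is my own addition.

```python
from collections import Counter


def classify_by_font_size(spans):
    """Label each (text, font_size) span as 'heading', 'body' or 'other'.

    The size carrying the most characters is taken to be body text. Anything
    larger is a heading; anything smaller (footnotes, captions) is 'other'.
    """
    if not spans:
        return []
    chars_per_size = Counter()
    for text, size in spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]

    labelled = []
    for text, size in spans:
        if size > body_size:
            label = "heading"
        elif size == body_size:
            label = "body"
        else:
            label = "other"
        labelled.append((label, text))
    return labelled


spans = [
    ("1 Introduction", 14),
    ("This is the body text of the paper, which is long.", 10),
    ("More body text follows here.", 10),
    ("1", 7),
]
for label, text in classify_by_font_size(spans):
    print(label, "|", text)
```

On its own this would be fooled by documents that set headings in bold at body size, which is where the font weight and sentence length features would have to come in.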

Difficulties

However the most important parts of papers are also the most difficult to retrieve.

  • Tables — In TeX these are easy to convert, in PDF they become a mess of numbers
  • Bullet-point lists — Often the bullet-point character becomes mixed with the text
  • Headers/footers — Need to be identified and ignored
  • Mathematics — Often impossible to recover

Without supporting the above points, any conversion tool would be useless.

What next?

Despite all the difficulties mentioned above, I still think that a tool would be worth having, but the main problem for me is time. I need to focus on my research and cannot dedicate enough time to producing something of worth. I might play with Treetop this weekend and see if I can make something interesting.

    • #pdf
    • #research
    • #programming
  • 2 months ago

G30 at Bonn University

From the 6th to the 10th of December I made a flying visit to Bonn, Germany as part of Kyoto University’s G30 student recruitment drive. G30 is an initiative by the Japanese government to attract more foreign students to Japan, with the aim of having 300,000 foreign students by 2020. The first stages of the initiative involved recruiting more non-Japanese professors, converting certificates and documentation to English and changing course content itself into English. A number of departments within Kyoto University have completed these first three stages and are now trying to attract more foreign students to their new English-language courses.

While G30 tries to address the language barrier associated with studying in Japan, it is often criticised for not addressing the other issue with studying in Japan — the cost. G30 does not offer scholarships, but it does offer partial and full exemption from tuition fees. Living costs in Japan are high but without tuition fees on top, the cost is bearable.

Anyway, back to the trip.

Japan Fair

All day Wednesday I took part in Study Japan!, a fair put together by 10 or so Japanese universities, aimed at pulling in more German students for the G30 initiative I described above. Most of the big universities were there: Waseda University, Kyoto University, Tokyo Metropolitan University etc. We each had a stand and a bunch of documents to give out to prospective students. I was there to tell them what student life was like at Kyoto University and in Kyoto in general. Some of the questions that I tried to answer in my “notes for G30” post did come up, but most of the time I just answered general questions. Surprisingly, only one person asked about the nuclear situation in Japan. That, and the fact that most of the students who came were in their first year, makes me think that they were not yet serious about studying in Japan.

Most of the students that came were from the humanities department. They were taking Japanese or Asian Studies as their major, with a few people from the management or economics department. We were there to represent the Informatics Department of Kyoto University so it was unfortunate that we didn’t have the right booklets to support them. I think that science and technology majors probably did not come because they assume that Japanese would be required to study in Japan. Clearly Japan needs to work a little more on letting people know that Japanese is not a requirement.

After learning more about G30 I’m a lot less skeptical about it. In the foreign community in Japan, it’s generally seen as a “cash grab” by the Japanese government to attract more fees-paying students without thinking about the practicalities of studying in Japan. However the tuition exemption offer makes it a much more tempting prospect.

Food

I’ve had little experience of German food, beyond the snacks I had at Oktoberfest in Tokyo a few years ago. The food I had in Germany was generally delicious. On the first night we had the German equivalent of mulled wine, called Glühwein, and German sausages. The only thing I couldn’t quite get used to was the amount of salt in everything. It was a bit too much after a while.

Language

While there, everyone thought I was German. Being friendly, people would casually say things to me, but I would have to stop them half-way through or at the end of their long sentence and apologise that I couldn’t speak German.

I really felt that I was apologising every time I said this. Being in Germany, it seemed so rude that I was not able to understand the simplest things. I’ve tried studying German before but all I could remember was danke and auf wiedersehen. People were very nice about it and never seemed to get annoyed in the way I’ve heard Parisians do with non-French speakers. And of course everybody spoke English extremely well.

If it was French, I think I would be able to guess what the other person was asking me from the cognates that exist in English. But with German I found it impossible to guess what they were saying. I’ve heard that German is supposed to be close to English but it seems so much further than French.

The irony is that in Japan, people mostly assume that I cannot speak Japanese when I can. But in Germany they assume I can speak the language, but I can’t. Oh cruel irony.

People & Schadenfreude

It’s been many years since I’ve travelled outside of Asia, but I was struck by how much I felt I was ‘on the same wavelength’ as the German people I met. When something amusing or odd happened, I often met eyes with other people around, and we exchanged knowing looks that said “You’re seeing this too, and thinking the same, right?”

One perfect example of this happened on the last day as we tried to take the ICE express train to Frankfurt airport. There had been a suicide on the line at around 10am, and none of the trains on the line were moving by the time we tried to get our 11am train. Jumping in front of trains is a rare enough occurrence in Germany that they do not have a quick response to it. In Japan it happens far more often, almost every day, and so Japanese train companies are extremely efficient at clearing the line and getting trains running again.

Anyway, we were sitting on the train, waiting for it to start moving. Every so often, a rather stressed-sounding German train official spoke through the train’s PA system to give us updates. With every update he seemed to get more and more stressed, his voice rising in volume and pitch, and the PA system would start to crackle and cut out as he got louder. Everyone on the train found this hilarious. Nobody could do anything, so it seemed everyone was resigned to waiting and laughing at the ridiculousness of our situation.

The train staff had tried to connect another set of cars to ours, but a software malfunction and other complications were making this poor young train conductor more and more stressed. Later on he began asking people to get off the train, as it was exceeding the legal limit for passengers it could carry. He was almost screaming “Please get off the train, there is another coming in 3 minutes, please get that one. We cannot leave until you do. Please.” Passengers were laughing at this poor guy. It was pretty funny. I guess it’s somewhat like schadenfreude.

Overall I really enjoyed my trip. But the guilt I felt at not being able to speak the language has really brought home how important it is to learn the language of the country you’re planning on visiting. And how it would be completely impossible for me to live somewhere without knowing the language.

    • #germany
    • #japan
    • #japanese
    • #g30
    • #university
    • #trips
  • 2 months ago

About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.

Me, Elsewhere

  • @benhumphreys on Twitter
  • benhumphreys on github