December 2011
19 posts
2 tags
Using MongoDB for Research - Don't
Parse data using your programming language of choice. Wonderful.
Insert data into MongoDB in an easy-to-understand hierarchical structure.
Write other scripts to compare, process and analyse the data. Joy.
Add more data to the database.
See BSONElement exception. Curse, search the internet for why.
Give up, run mysterious db.repairDatabase(). Hope data is OK. Fail. Reload data.
Run some...
3 tags
Segmentation and Evaluation
This is just a short post as it’s too long to put on Twitter. Today I tried segmenting NTCIR-7 English–Japanese MT data by various methods to see whether the segmentation affected the BLEU and RIBES scores.
Character-level BLEU was tried in “BLEU in characters” (Denoual 2005), which showed that, for English, BLEU on the character level correlates with word-level BLEU....
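As a rough illustration of what changing the segmentation means for BLEU, here is a minimal sketch that scores the same sentence pair once on words and once on characters. It is not the script used for the experiment: the function names (`bleu`, `segment_words`, `segment_chars`) and the example sentences are made up for illustration, it handles a single reference with no smoothing, and it does not cover RIBES.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU for one reference, without smoothing."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # unsmoothed BLEU is zero if any order has no match
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: penalise hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


def segment_words(sentence):
    """Whitespace segmentation (hypothetical helper)."""
    return sentence.split()


def segment_chars(sentence):
    """Character segmentation, dropping whitespace (hypothetical helper)."""
    return [c for c in sentence if not c.isspace()]


# Made-up example pair, not NTCIR-7 data.
reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"

word_score = bleu(segment_words(hypothesis), segment_words(reference))
char_score = bleu(segment_chars(hypothesis), segment_chars(reference))
print(f"word-level BLEU: {word_score:.3f}")
print(f"char-level BLEU: {char_score:.3f}")
```

The only thing that changes between the two scores is the segmenter, which is the point: any other segmentation method can be swapped in as another function that returns a list of tokens.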
3 tags
TeX/PDF → HTML
I haven’t managed to get the idea of publishing papers in HTML out of my head.
I’m convinced now that 99% of the work is in decent conversion to HTML. The
presentation aspect is tricky but can be done with copious amounts of CSS and
JavaScript.
Back to conversion. It seems there are two possible ways to tackle it, each with
its own strengths and difficulties:
TeX → HTML, or
PDF...
6 tags
G30 at Bonn University
From the 6th to the 10th of December I made a flying visit to Bonn, Germany as
part of Kyoto University’s G30 student recruitment drive. G30 is an initiative
by the Japanese government to attract more foreign students to Japan, with the
aim of having 300,000 foreign students by 2020. The first stages of the
initiative involved recruiting more non-Japanese professors,...
2 tags
Mixing Kana and Kanji and MT
Writing in a mix of Kanji and Kana makes text a lot easier to process for machines as well as humans.
Found this while messing with Google Translate and Japanese “no”.
“かんこくのでんきせいひんのかかく” → “Dress belongings or writing full-bodied electric kettle”
“韓国の電気製品の価格” → “South Korean electronics prices”
5 tags
Dear Science — Let’s stop using PDF — Part 2
I’ve thought more about how to implement what I put forward in the first part of Dear Science — Let’s stop using PDF, and I believe the problem can be broken down into two parts:
Generating HTML — converting LaTeX to HTML
Presentation — presenting text and figures in a resolution-independent way
Generating HTML
This is probably the harder of the two tasks.
Researchers are...
4 tags
Dear Science — Let's stop using PDF
It’s 2011, it’s the future. The Earth is doomed. I’m making a Space Ark. For Space. There’s no room for printed material on my Space Ark. “A4” is just an abstract concept for when we used dead trees to store our information. For when we collated facts like so many dead butterflies and bound them in books to sit on shelves and gather dust.
It’s 2011 and...
November 2011
46 posts