February 2012
3 posts
2 tags
I haven’t been drunk in 3 years... and I’ve been... →
January 2012
17 posts
3 tags
Fast Ruby Parsers →
This thread mentions a bunch of Ruby parsers, with a focus on ones that are fast for small languages. It’s a great summary, though I’m posting it mostly for my own future reference.
5 tags
Parsing Wiki Text (Part 2)
More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup, congratulations.
This should save you from insanity and oblivion.
mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use.
It can parse the zipped XML dumps as I’ve shown in previous...
2 tags
Dealing with Unicode CSV files in Python
Today I found out that Python’s csv module cannot read UTF-8 files (see the warning at the top of the page).
There are two Stackoverflow questions on the topic (here and here) but the solutions don’t seem to work for Python 2.6.
The code examples on the official page borrow from StackOverflow (or the other way around), and don’t seem to help much.
The more I learn about...
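For comparison, here is a minimal sketch of reading a UTF-8 CSV file. It assumes Python 3 rather than the Python 2.6 discussed above: in Python 3 the csv module works on text, so the file only needs to be opened with an explicit encoding. The function name `read_utf8_csv` is made up for illustration.

```python
import csv


def read_utf8_csv(path):
    """Return all rows of a UTF-8 encoded CSV file as lists of str."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))
```

This does not help on Python 2.6 itself, where the csv module reads byte strings and each cell has to be decoded by hand.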
2 tags
Learn Languages Through Games on Steam
If you like playing PC games, and want to learn a language at the same time, try this. Set your Steam language to the foreign language you’re trying to learn, and you can play most of your library in that language. It’s under Preferences > Interface then choose your preferred language from the first drop-down.
Sometimes Steam will need to re-download the data to get the localised...
4 tags
Parsing Edict XML with Perl and XML::LibXML
Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is an XML dump of English-only data.
It might help someone sometime, so I’ve posted a short Perl snippet of how to parse the format.
A single entry looks like:
4 tags
Print Wiki Parse Tree with `mwlib`
It’s hard to see how mwlib treats Wiki markup. You can use this small snippet to output the tree of an article and see what you’re dealing with. It’s recursive, so a huge document could cause problems, although I haven’t hit any yet.
Output is something like:
Section
Node
Text
Node
Paragraph
Text
Paragraph
...
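A sketch of this kind of recursive printer, not the post's actual snippet. It assumes each node keeps its children in a list attribute called `children`; the `Node` class here is a hypothetical stand-in so the example runs without mwlib installed.

```python
class Node:
    """Hypothetical stand-in for a parse-tree node (not mwlib's own class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def format_tree(node, depth=0):
    """Return one line per node, indented two spaces per nesting level."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(format_tree(child, depth + 1))
    return lines


# Small hand-built tree to show the output shape.
tree = Node("Section", [
    Node("Node", [Node("Text")]),
    Node("Paragraph", [Node("Text")]),
])
print("\n".join(format_tree(tree)))
```

The recursion depth follows how deeply the tree is nested rather than how long the article is, so a long but flat document should not be a problem for an approach like this.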
2 tags
Python Regex Global Replace
Today’s fun Python gotcha: I couldn’t find out how to do a global regular expression replace. In Perl it would look like:
$my_str =~ s/foo/bar/g;
In Python it’s:
import re
re.sub('foo', 'bar', my_text)
It turns out I wasn’t reading the sub() documentation properly. The amazing @puresock saved the day and pointed out this from the documentation:
The optional...
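A short sketch of how `re.sub()` behaves with and without its `count` argument. The sample string and variable names are made up for illustration.

```python
import re

my_text = "foo foo foo"

# With no count argument every match is replaced, like Perl's s///g.
replaced_all = re.sub("foo", "bar", my_text)

# count=1 stops after the first match, like Perl's s/// without the g.
replaced_once = re.sub("foo", "bar", my_text, count=1)

print(replaced_all)
print(replaced_once)
```

The signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so flags such as `re.IGNORECASE` are best passed by keyword; given as the fourth positional argument they would be read as `count`.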
5 tags
Parsing Wiki Text (Part 1)
I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it.
This is just a quick introduction to what I did today, pieced together from mailing list posts
and various other posts online.
Wikimedia Markup
First some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, wikimedia markup,...
2 tags
Installing Python Junk without Hassle
Note to self: how to install Python stuff with minimum fuss. Looks like virtualenv.py is sort of like RVM for Python.
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install ...
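On Python 3 the standard library's `venv` module covers similar ground without downloading a script. This is a different tool from the virtualenv.py approach above, shown only as a sketch; the helper name `make_env` is made up for illustration.

```python
import venv
from pathlib import Path


def make_env(path, with_pip=True):
    """Create a virtual environment at path and return it as a Path."""
    # with_pip=True asks venv to bootstrap pip into the new environment.
    venv.create(path, with_pip=with_pip)
    return Path(path)
```

Calling `make_env("my_new_env")` should leave a directory that is activated the same way as above, with `. my_new_env/bin/activate` on a Unix-like shell.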
2 tags
Japanese Equivalent of The Onion
I love The Onion. For a long time I wished there was an equivalent satirical site in Japanese. Someone just told me about the Kyoko Shimbun, a Japanese site full of made-up amusing stories.
For example they have a story on McDonald’s Japan releasing the McDonal-don, a bowl of rice (don) topped with a burger.
Another story covers naturally drying Baumkuchen cakes in the sun.
Discovering that pi is...
3 tags
Papers Incorrectly Identifies Duplicates
For those of you who use Papers like me, a quick warning: I’ve found it often incorrectly identifies non-duplicate papers as duplicates. I’m not sure why it does this, maybe because part of the download URL is the same.
Just double-check whether it really is a duplicate. It’s happened to me so often that I no longer check and just hit “Ignore” every time....
4 tags
Learn French with Bouletcorp
Bouletcorp is a superb French comic blog, with a combination of mind-expanding topics, humour and great artwork. The French is quite challenging and idiomatic, which makes it great to study but tricky to look up. But luckily most of the comics are translated into English. So open up both versions in your browser and get reading.
The image is taken from the French and English comic called...
December 2011
19 posts
2 tags
Using MongoDB for Research - Don't
Parse data using your programming language of choice. Wonderful.
Insert data into MongoDB in an easy-to-understand hierarchical structure.
Write other scripts to compare, process and analyse the data. Joy.
Add more data to the database.
See BSONElement exception. Curse, search the internet for why.
Give up, run mysterious db.repairDatabase(). Hope data is OK. Fail. Reload data.
Run some...
3 tags
Segmentation and Evaluation
This is just a short post as it’s too long to put on Twitter. Today I tried segmenting NTCIR-7 English–Japanese MT data by various methods and seeing if it affected their BLEU and RIBES scores.
Character-level BLEU was tried in BLEU in characters (Denoual 2005), which showed that, for English, BLEU on the character level correlates with word-level BLEU....
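To show what changes between word-level and character-level scoring, here is a sketch of clipped n-gram precision, the count at the core of BLEU, applied to either tokenisation. It is not full BLEU (no brevity penalty, no averaging over n-gram orders) and not the scoring script used for these experiments; the function names and example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


hyp, ref = "the cat sat", "the cat sits"
word_p = modified_precision(hyp.split(), ref.split(), 1)  # word unigrams
char_p = modified_precision(list(hyp), list(ref), 2)      # character bigrams
```

Splitting into characters needs no segmenter at all, which is the appeal for languages without word delimiters such as Japanese.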
3 tags
TeX/PDF → HTML
I haven’t managed to get the idea of publishing papers in HTML out of my head.
I’m convinced now that 99% of the work is in decent conversion to HTML. The
presentation aspect is tricky but can be done with copious amounts of CSS and
Javascript.
Back to conversion. It seems there are two possible ways to tackle it, each with
its own strengths and difficulties:
TeX → HTML, or
PDF...
6 tags
G30 at Bonn University
From the 6th to the 10th of December I made a flying visit to Bonn, Germany as
part of Kyoto University’s G30 student recruitment drive. G30 is an initiative
by the Japanese government to attract more foreign students to Japan, with the
aim of having 300,000 foreign students by 2020. The first stages of the
initiative involved recruiting more non-Japanese professors,...
2 tags
Mixing Kana and Kanji and MT
Writing in a mix of Kanji and Kana makes text a lot easier to read, for machines as well as humans.
Found this while messing with Google Translate and Japanese “no”.
“かんこくのでんきせいひんのかかく” → “Dress belongings or writing full-bodied electric kettle”
“韓国の電気製品の価格” → “South Korean electronics prices”
5 tags
Dear Science — Let’s stop using PDF — Part 2
I’ve thought more about how to implement what I put forward in the first part of Dear Science — Let’s stop using PDF, and I believe the problem can be broken down into two parts:
Generating HTML — converting LaTeX to HTML
Presentation — presenting text and figures in a resolution-independent way
Generating HTML
This is probably the harder of the two tasks.
Researchers are...
4 tags
Dear Science — Let's stop using PDF
It’s 2011, it’s the future. The Earth is doomed. I’m making a Space Ark. For Space. There’s no room for printed material on my Space Ark. “A4” is just an abstract concept for when we used dead trees to store our information. For when we collated facts like so many dead butterflies and bound them in books to sit on shelves and gather dust.
It’s 2011 and...
November 2011
46 posts
1 tag
Remapping Semicolon to Backspace
A few weeks ago I went on a key rebinding spree, and tried to improve my efficiency by moving the keys I use most. I’ve had capslock bound to escape for nearly 2 years now and I can’t recommend it enough.
There are two tools by the same guy. They’re both kind of minimal, but work well enough:
PCKeyboardHack for rebinding capslock
KeyRemap4MacBook for rebinding just about...
4 tags
MongoImport Bug?
Update: Quick hack solution - add a line with an empty JSON object at the start of the file. { }
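The hack can be scripted. This is a sketch that assumes the import file holds one JSON object per line; the helper name `prepend_empty_object` is made up for illustration.

```python
def prepend_empty_object(src_path, dst_path):
    """Copy src_path to dst_path with an empty JSON object as the first line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("{ }\n")
        # Copy the original lines through unchanged.
        for line in src:
            dst.write(line)
```

Writing to a second file keeps the original data untouched in case the workaround does not help.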
I’ve come across some odd behaviour when using mongoimport. I’m not sure if it’s an issue with my setup, or a bug in Mongo, but here’s what I’ve found.
mongoimport version 2.0.0
mongod db version v2.0.0, pdfile version 4.5
The data I’m trying to import is a...