February 2012
3 posts
2 tags
Feb 7th
1 note
I haven’t been drunk in 3 years... and I’ve been... →
Feb 5th
333 notes
January 2012
17 posts
3 tags
Fast Ruby Parsers →
This thread mentions a bunch of Ruby parsers, with a focus on being fast for small languages. This is more for my own future reference than anything, but it’s a great summary.
Jan 31st
5 tags
Parsing Wiki Text (Part 2)
More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup, congratulations. This should save you from insanity and oblivion. mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use. It can parse the zipped XML dumps as I’ve shown in previous...
Jan 31st
2 tags
Dealing with Unicode CSV files in Python
Today I found out that Python’s csv module cannot read UTF-8 files (see the warning at the top of the page). There are two Stack Overflow questions on the topic (here and here) but the solutions don’t seem to work for Python 2.6. The code examples on the official page borrow from Stack Overflow (or the other way around), and don’t seem to help much. The more I learn about...
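For comparison, here is a minimal sketch of reading UTF-8 CSV data in Python 3, where the csv module works on text rather than bytes. This is not a fix for Python 2.6, which is what the post is about. The sample data and variable names are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: UTF-8 bytes, as they would come from a file on disk.
raw_bytes = "name,city\nJosé,München\n太郎,京都\n".encode("utf-8")

# Decode the bytes to text first; the csv module then parses str rows.
# newline="" lets the csv module handle line endings itself.
text_stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8", newline="")
rows = list(csv.reader(text_stream))

print(rows)
```

With a real file, the same idea would be open(path, encoding="utf-8", newline="") passed straight to csv.reader.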
Jan 30th
1 tag
Jan 28th
2 tags
Jan 25th
1 note
2 tags
Learn Languages Through Games on Steam
If you like playing PC games, and want to learn a language at the same time, try this. Set your Steam language to the foreign language you’re trying to learn, and you can play most of your library in that language. It’s under Preferences > Interface then choose your preferred language from the first drop-down. Sometimes Steam will need to re-download the data to get the localised...
Jan 25th
4 tags
Parsing Edict XML with Perl and XML::LibXML
Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is an XML dump of the English-only data. It might help someone sometime, so I’ve posted a short Perl snippet showing how to parse the format. A single entry looks like:
Jan 24th
4 tags
Print Wiki Parse Tree with `mwlib`
It’s hard to see how mwlib treats Wiki markup. You can use this small snippet to output the tree of an article and see what you’re dealing with. It’s recursive, so if you have a huge document it’s possible you’ll run into problems, although I haven’t hit any yet. Output is something like: Section Node Text Node Paragraph Text Paragraph ...
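The general idea of a recursive tree dump can be sketched as below. This does not use mwlib: the Node class, its name and children attributes, and the sample tree are all hypothetical stand-ins, so treat it as an illustration of the recursion only.

```python
class Node:
    """Hypothetical stand-in for a parse-tree node; not mwlib's real class."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def format_tree(node, depth=0):
    """Return one indented line per node, visiting children recursively."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(format_tree(child, depth + 1))
    return lines


# Made-up tree, loosely shaped like an article with a section and a paragraph.
tree = Node("Article", [
    Node("Section", [Node("Node"), Node("Text")]),
    Node("Paragraph", [Node("Text")]),
])

tree_lines = format_tree(tree)
print("\n".join(tree_lines))
```

Because each level of nesting adds a stack frame, a very deeply nested document could hit Python’s recursion limit, which is the kind of problem the post warns about.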
Jan 19th
2 tags
Python Regex Global Replace
Today’s fun Python gotcha: I couldn’t find out how to do a global regular expression replace. In Perl it would look like:

    $my_str =~ s/foo/bar/g;

In Python it’s:

    import re
    re.sub('foo', 'bar', my_text)

It turns out I wasn’t reading the sub() documentation properly. The amazing @puresock saved the day and pointed out this from the documentation: The optional...
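A small runnable sketch of the difference, using a made-up sample string: re.sub replaces every match unless told otherwise, and the count keyword argument limits how many replacements are made.

```python
import re

# Hypothetical sample text for illustration.
my_text = "foo and foo again"

# By default re.sub replaces every non-overlapping match (like Perl's /g).
replaced_all = re.sub("foo", "bar", my_text)

# count=1 stops after the first match (like Perl's s/// without /g).
replaced_first = re.sub("foo", "bar", my_text, count=1)

print(replaced_all)
print(replaced_first)
```

Passing count by keyword is the safer habit, since the fourth positional argument of re.sub is count, not flags.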
Jan 17th
Jan 16th
1 note
5 tags
Parsing Wiki Text (Part 1)
I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it. This is just a quick introduction to what I did today, pieced together from mailing list posts and various other posts online. Wikimedia Markup First some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, wikimedia markup,...
Jan 15th
2 tags
Jan 15th
59 notes
2 tags
Installing Python Junk without Hassle
Note to self, how to install Python stuff with minimum fuss. Looks like virtualenv.py is sort of like RVM for Python.

    $ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
    $ python virtualenv.py my_new_env
    $ . my_new_env/bin/activate
    (my_new_env)$ pip install ...
Jan 15th
Jan 14th
2 tags
Japanese Equivalent of The Onion
I love The Onion. For a long time I wished there was an equivalent satirical site in Japanese. Someone just told me about the Kyoko Shimbun, a Japanese site full of made-up amusing stories. For example, they have a story on McDonald’s Japan releasing the McDonal-don, a bowl of rice (don) topped with a burger. Another story is about naturally drying baumkuchen cakes in the sun. Discovering that pi is...
Jan 10th
3 tags
Papers Incorrectly Identifies Duplicates
For those of you who use Papers like me, a quick warning - I’ve found it often incorrectly identifies non-duplicate papers as duplicates. I’m not sure why it does this, maybe because part of the download URL is the same. Just double-check whether it really is a duplicate. It’s happened to me so often that I no longer check and just hit “Ignore” every time....
Jan 10th
4 tags
Learn French with Bouletcorp
Bouletcorp is a superb French comic blog, with a combination of mind-expanding topics, humour and great artwork. The French is quite challenging and idiomatic, which makes it great to study but tricky to look up. But luckily most of the comics are translated into English. So open up both versions in your browser and get reading. The image is taken from the French and English comic called...
Jan 7th
22 notes
December 2011
19 posts
Dec 31st
Dec 31st
Dec 31st
2 tags
Using MongoDB for Research - Don't
Parse data using your programming language of choice. Wonderful. Insert data into MongoDB in an easy-to-understand hierarchical structure. Write other scripts to compare, process and analyse the data. Joy. Add more data to the database. See BSONElement exception. Curse, search the internet for why. Give up, run mysterious db.repairDatabase(). Hope data is OK. Fail. Reload data. Run some...
Dec 26th
3 tags
Segmentation and Evaluation
This is just a short post as it’s too long to put on Twitter. Today I tried segmenting NTCIR-7 English–Japanese MT data by various methods to see whether it affected the BLEU and RIBES scores. Using BLEU on the character level was tried in BLEU in characters (Denoual 2005), in which they showed that, for English, BLEU on the character level correlates with word-level BLEU....
Dec 25th
9 notes
Dec 24th
Dec 23rd
Dec 21st
Dec 20th
3 tags
TeX/PDF → HTML
I haven’t managed to get the idea of publishing papers in HTML out of my head. I’m convinced now that 99% of the work is in decent conversion to HTML. The presentation aspect is tricky but can be done with copious amounts of CSS and JavaScript. Back to conversion. It seems there are two possible ways to tackle it, each with its strengths and difficulties: TeX → HTML, or PDF...
Dec 13th
6 tags
G30 at Bonn University
From the 6th to the 10th of December I made a flying visit to Bonn, Germany as part of Kyoto University’s G30 student recruitment drive. G30 is an initiative by the Japanese government to attract more foreign students to Japan, with the aim of having 300,000 foreign students by 2020. The first stages of the initiative involved recruiting more non-Japanese professors,...
Dec 11th
2 tags
Mixing Kana and Kanji and MT
Writing in a mix of Kanji and Kana makes text a lot easier to read, for machines as well as humans. Found this while messing with Google Translate and Japanese “no”. “かんこくのでんきせいひんのかかく” → “Dress belongings or writing full-bodied electric kettle” “韓国の電気製品の価格” → “South Korean electronics prices”
Dec 11th
Dec 10th
Dec 10th
Dec 10th
Dec 10th
2 tags
Dec 9th
3 notes
5 tags
Dear Science — Let’s stop using PDF — Part 2
I’ve thought more about how to implement what I put forward in the first part of Dear Science — Let’s stop using PDF, and I believe the problem can be broken down into two parts: Generating HTML — converting LaTeX to HTML Presentation — presenting text and figures in a resolution-independent way Generating HTML This is probably the harder of the two tasks. Researchers are...
Dec 4th
33 notes
4 tags
Dear Science — Let's stop using PDF
It’s 2011, it’s the future. The Earth is doomed. I’m making a Space Ark. For Space. There’s no room for printed material on my Space Ark. “A4” is just an abstract concept for when we used dead trees to store our information. For when we collated facts like so many dead butterflies and bound them in books to sit on shelves and gather dust. It’s 2011 and...
Dec 2nd
27 notes
Dec 1st
November 2011
46 posts
Nov 30th
1 tag
Remapping Semicolon to Backspace
A few weeks ago I went on a key rebinding spree, and tried to improve my efficiency by moving around keys I use most. I’ve had capslock bound to escape for nearly 2 years now and I can’t recommend it enough. There are two tools by the same guy. They’re both kind of minimal, but work well enough: PCKeyboardHack for rebinding capslock KeyRemap4MacBook for rebinding just about...
Nov 28th
1 note
Nov 28th
Nov 28th
4 tags
MongoImport Bug?
Update: Quick hack solution - add a line with an empty JSON object at the start of the file. { } I’ve come across some odd behaviour when using mongoimport. I’m not sure if it’s an issue with my setup, or a bug in Mongo, but here’s what I’ve found. mongoimport version 2.0.0 mongod db version v2.0.0, pdfile version 4.5 The data I’m trying to import is a...
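The quick hack from the update can be sketched in Python as below. It only prepares the file and does not run mongoimport. The file names, the sample documents, and the assumption that the input holds one JSON document per line are all made up for illustration.

```python
import os
import tempfile


def prepend_empty_object(src_path, dst_path):
    """Copy src_path to dst_path with an empty JSON object as the first line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("{ }\n")
        for line in src:
            dst.write(line)


# Hypothetical sample data: assumed to be one JSON document per line.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "data.json")
dst_path = os.path.join(work_dir, "data_fixed.json")
with open(src_path, "w", encoding="utf-8") as f:
    f.write('{"word": "foo"}\n{"word": "bar"}\n')

prepend_empty_object(src_path, dst_path)

with open(dst_path, encoding="utf-8") as f:
    fixed_lines = f.read().splitlines()
print(fixed_lines)
```

Writing to a separate output file keeps the original data untouched in case the workaround turns out not to apply.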
Nov 28th
Nov 26th
Nov 25th
Nov 25th
Nov 23rd
Nov 22nd
Nov 21st