Ben Humphreys


Someone disconnects completely from technology for 90 days. I’d love to be able to do this, but it’s kind of impossible.

    • #video
    • #tech
  • 6 days ago

I haven’t been drunk in 3 years... and I’ve been partying way more than you.

(via timriley)

Source: dariusmonsef

  • 1 week ago > dariusmonsef

Fast Ruby Parsers

This thread mentions a bunch of Ruby parsers, with a focus on being fast for small languages. It's a great summary, and I'm posting it more for my own future reference than anything.

    • #ruby
    • #programming
    • #wiki
  • 1 week ago

Parsing Wiki Text (Part 2)

More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup. Congratulations, this should save you from insanity and oblivion.

mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use. It can parse the zipped XML dumps, as I’ve shown in previous tutorials. However, you have to provide it with an article name, and today we’re feeling more John Wayne.

How can we feed mwlib raw Wiki markup and get the same hierarchical deliciousness as output? Here’s how:

A more involved example:

    • #phd
    • #programming
    • #wikipedia
    • #wiki
    • #python
  • 1 week ago

Dealing with Unicode CSV files in Python

Today I found out that Python’s csv module cannot read UTF-8 files (see the warning at the top of the page).

There are two Stack Overflow questions on the topic (here and here) but the solutions don’t seem to work for Python 2.6.

The code examples on the official page borrow from Stack Overflow (or the other way around), and don’t seem to help much.
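
For what it's worth, here is a minimal sketch of one way around this, assuming Python 3 is an option (it is not what the Python 2.6 setup above uses). In Python 3 the csv module works on text rather than bytes, so opening the file with an explicit encoding is enough. The function name `read_utf8_csv` is made up for illustration.

```python
import csv


def read_utf8_csv(path):
    """Return every row of a UTF-8 encoded CSV file as a list of strings."""
    # newline="" leaves line-ending handling to the csv module, which matters
    # for quoted fields that contain line breaks.
    with open(path, encoding="utf-8", newline="") as handle:
        return [row for row in csv.reader(handle)]
```

Each cell comes back as an ordinary Unicode string, so no per-cell decoding step is needed.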

The more I learn about Python the more I appreciate Ruby.

    • #programming
    • #python
  • 1 week ago

Star Wars Uncut: Director’s Cut: Wonderfully cute and funny clips by hundreds of fans, stuck together to make the entire first film (via TokyoJoe).

Hand-made outfits, stop-motion segments, entire clips rendered on ’80s Macintoshes, Star Trek uniforms. Also My Little Pony at 0:37.

Source: tokyojoe0985

    • #video
    • #geekery
  • 2 weeks ago > tokyojoe0985

Learn Languages Through Games on Steam

If you like playing PC games, and want to learn a language at the same time, try this. Set your Steam language to the foreign language you’re trying to learn, and you can play most of your library in that language. It’s under Preferences > Interface then choose your preferred language from the first drop-down.

Sometimes Steam will need to re-download the data to get the localised content, but once that’s all done, you should be able to play without any trouble. For example, both Portal games, Civilization, Team Fortress 2 and Skyrim all work in French. Check the languages section of a game’s Steam store page to see which languages are available.

Available languages are listed on the right-hand side of the game's details page

The only problem with learning from games is you might learn a lot of weird vocabulary. Just think of it as gaming with a bonus, rather than actual language practice.

    • #games
    • #language
  • 2 weeks ago

Parsing Edict XML with Perl and XML::LibXML

Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is the XML dump of English-only data.

It might help someone sometime, so I’ve posted a short Perl snippet showing how to parse the format.

A single entry looks like:
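
As a rough illustration only, here is a hand-written entry in the JMdict layout, together with a minimal Python sketch (not the Perl snippet mentioned above) that reads it with the standard library. The element names `entry`, `keb`, `reb` and `gloss` follow the JMdict format. The sample data, the sequence number and the `parse_entries` helper are made up for this example, and the sample leaves out the DTD and part-of-speech fields.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample in the JMdict layout (not copied from the dump).
SAMPLE = u"""<JMdict>
  <entry>
    <ent_seq>1</ent_seq>
    <k_ele><keb>猫</keb></k_ele>
    <r_ele><reb>ねこ</reb></r_ele>
    <sense><gloss>cat</gloss></sense>
  </entry>
</JMdict>"""


def parse_entries(xml_text):
    """Collect kanji forms, readings and English glosses for each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "kanji": [e.text for e in entry.iter("keb")],
            "readings": [e.text for e in entry.iter("reb")],
            "glosses": [e.text for e in entry.iter("gloss")],
        })
    return entries


if __name__ == "__main__":
    for item in parse_entries(SAMPLE):
        print(item)
```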

    • #programming
    • #phd
    • #japanese
    • #perl
  • 2 weeks ago

Print Wiki Parse Tree with `mwlib`

It’s hard to see how mwlib treats Wiki markup. You can use this small snippet to output the tree of an article and see what you’re dealing with. It’s recursive, so if you have a huge document you may run into problems, although I haven’t hit any yet.
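
A minimal sketch of such a tree printer, assuming each node keeps its sub-nodes in a `children` list (as mwlib's parser nodes appear to). This version uses an explicit stack rather than recursion, so deep documents are less of a worry. The `tree_lines` helper and the stand-in node classes are hypothetical and exist only to make the example runnable without mwlib.

```python
def tree_lines(root, indent="  "):
    """Return one line per node: its class name, indented by depth."""
    lines = []
    stack = [(root, 0)]  # explicit stack instead of recursion
    while stack:
        node, depth = stack.pop()
        lines.append(indent * depth + type(node).__name__)
        children = getattr(node, "children", None) or []
        # Push in reverse so children are visited left to right.
        for child in reversed(children):
            stack.append((child, depth + 1))
    return lines


# Stand-in node classes for demonstration only (these are not mwlib's own).
class Node(object):
    def __init__(self, *children):
        self.children = list(children)


class Section(Node):
    pass


class Paragraph(Node):
    pass


class Text(Node):
    pass


if __name__ == "__main__":
    tree = Section(Node(Text()), Node(Paragraph(Text())))
    print("\n".join(tree_lines(tree)))
```

With a real parse tree, the root node returned by the parser would take the place of `tree`.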

Output is something like:

Section
  Node
    Text
  Node
    Paragraph
      Text
    Paragraph
      Style
        Node
          Node
            ArticleLink
            Text
            ArticleLink
            Text
...

There’s probably a function within mwlib for doing this, but I couldn’t find it.

    • #programming
    • #python
    • #mwlib
    • #wiki
  • 3 weeks ago

About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.
(Japanese version)

Me, Elsewhere

  • @benhumphreys on Twitter
  • benhumphreys on github
