Ben Humphreys

Parsing Wiki Text (Part 1)

I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it.

This is a quick introduction to what I did today, pieced together from mailing list posts and various other posts online.

MediaWiki Markup

First, some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, MediaWiki markup, wiki-something. Strictly speaking it has no formal grammar; it’s defined by how MediaWiki’s parser turns it into HTML.

From what I’ve read online, parsing it naïvely with regular expressions is not a great idea. There are too many edge cases and odd ways of writing things for that to be viable. However, I have a feeling that a regex-only approach would be a lot faster than the grammar-like parsers that are available. Maybe I’ll do a short benchmark another time.
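To make the edge-case problem concrete, here is a minimal sketch of one such case: nested templates. The sample text and both function names are made up for illustration. The naive pattern stops at the first closing braces it sees, while a small depth counter (still nowhere near a full parser) copes with the nesting.

```python
import re

# Naive pattern: non-greedy match from "{{" to the first "}}" it finds.
NAIVE_TEMPLATE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def strip_templates_naive(text):
    """Remove templates with a single regex; breaks on nested templates."""
    return NAIVE_TEMPLATE.sub("", text)


def strip_templates_balanced(text):
    """Remove templates by tracking how deeply the braces are nested."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Only keep characters that sit outside every template.
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical sample: a template nested inside another template.
sample = "Cheese {{infobox|name={{lang|fr|fromage}}|type=food}} is tasty."

print(repr(strip_templates_naive(sample)))
print(repr(strip_templates_balanced(sample)))
```

The naive version leaves the tail of the outer template behind in the text. Templates are only one of many constructs that nest like this, which is why a proper parser is the safer bet.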

For now, rather than roll my own, I decided to use the Python library mwlib to parse the Wiki markup.

Install

Let’s install mwlib then. Unfortunately the documentation for mwlib is somewhere between non-existent and useless.

If you’re on a shared setup, I found the easiest way to install is using virtualenv.py:

$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py wiki
$ . wiki/bin/activate
(wiki)$ pip install mwlib

Setup

You then need to download the dump(s) of the data you want to deal with. Find yours in the full list of available dumps. It’s likely you’ll just want the articles, but refer to the documentation for more details of what’s on offer. We want the articles-only dump of Simple English Wikipedia:

$ wget http://dumps.wikimedia.org/simplewiki/20120112/simplewiki-20120112-pages-articles.xml.bz2

mwlib uses a cdb database to speed up access. Build it with:

$ mw-buildcdb --input=some-dump.xml.bz2 --output=some_output_dir

This will take a couple of minutes, depending on the size of your dump. Coffee time. Once this is done, some_output_dir should contain three files that we’ll access from mwlib later on:

  • wikiidx.cdb - the articles index file.
  • wikidata.bin - the wikitext for the articles.
  • wikiconf.txt - the config file for mwlib.
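If you want to confirm the build worked before moving on, a quick check along these lines should do. It only looks for the three file names listed above and uses nothing from mwlib; the function name and the some_output_dir path are placeholders.

```python
import os

# File names taken from the list above.
EXPECTED_FILES = ("wikiidx.cdb", "wikidata.bin", "wikiconf.txt")


def missing_cdb_files(output_dir):
    """Return the expected files that are not present in output_dir."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(output_dir, name))
    ]


if __name__ == "__main__":
    # "some_output_dir" is the placeholder directory used with mw-buildcdb.
    missing = missing_cdb_files("some_output_dir")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All three files are in place.")
```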

Parsing

We’re going to parse the Simple English Wikipedia page on Cheese, mainly because I’m hungry. Check out the raw wiki syntax on its edit page.

Run the parsing script with Python, and you should get some wonderful output:

...
Inline link: Super Adams Smelly Cheese
Inline link: Emmental cheese
Inline link: String Cheese
Inline link: Marble cheese
Inline link: Wikimedia Commons
Inline link: media
Linked Japanese article: ja:チーズ
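The output separates ordinary inline links from links to the same article in another language. As a rough stand-in that does not use mwlib at all, the sketch below makes the same distinction with a plain regex over raw wikitext. The language table, the function name and the sample text are all made up for illustration, and the regex has exactly the weaknesses discussed earlier, so treat it as a picture of the idea rather than how mwlib does it.

```python
import re

# Matches [[target]] and [[target|label]]; deliberately ignores nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

# Hypothetical, tiny language table; a real wiki knows hundreds of prefixes.
LANGUAGES = {"ja": "Japanese", "fr": "French", "de": "German"}


def describe_links(wikitext):
    """Label each [[...]] link as an inline link or a language link."""
    lines = []
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        prefix, sep, _ = target.partition(":")
        if sep and prefix in LANGUAGES:
            lines.append("Linked %s article: %s" % (LANGUAGES[prefix], target))
        else:
            lines.append("Inline link: %s" % target)
    return lines


# Hypothetical snippet of wikitext, not the real Cheese article.
sample = "[[Emmental cheese]] and [[String cheese|string]] are cheeses.\n[[ja:チーズ]]"

for line in describe_links(sample):
    print(line)
```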

Next

I’ll post more as I get further. For now, you should have a good craving for cheese.


About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.
