Parsing Wiki Text (Part 1)
I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it.
This is just a quick introduction to what I did today, pieced together from mailing list posts and various other posts online.
MediaWiki Markup
First, some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, Wikimedia markup, wiki-something. Strictly speaking it has no formal grammar; it is more or less defined by whatever the MediaWiki software turns into HTML.
From what I’ve read online, parsing it naïvely using regular expressions is not a great idea. There are too many edge cases and odd ways of writing the same thing for that to be viable. However, I have a feeling that a regex-only approach would be a lot faster than the grammar-like parsers that are available. Maybe I’ll do a short benchmark another time.
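To give a feel for those edge cases, here is a minimal sketch of the naïve approach using only Python's standard library. The pattern, the helper name naive_links and the sample strings are all made up for illustration; none of it comes from mwlib or from a real article.

```python
import re

# Naive pattern: "[[", a target with no "]" or "|", an optional "|label", then "]]".
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def naive_links(wikitext):
    """Return the link targets the naive regex finds in wikitext."""
    return [m.group(1).strip() for m in LINK_RE.finditer(wikitext)]


# Made-up sample markup, not taken from a real page.
simple = "Cheese is made from [[milk]] and comes in [[Emmental cheese|many kinds]]."
nested = "[[File:Cheese.jpg|thumb|A [[cheese]] platter]]"
nowiki = "<nowiki>[[not a link]]</nowiki>"

print(naive_links(simple))  # ['milk', 'Emmental cheese']
print(naive_links(nested))  # ['File:Cheese.jpg']  (the inner [[cheese]] link is lost)
print(naive_links(nowiki))  # ['not a link']       (should not count as a link at all)
```

The first case works, but the nested image caption and the nowiki block both give wrong answers, and patching the pattern for each new case is exactly the road I'd rather not go down.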
For now, rather than roll my own, I decided to use the Python library mwlib to parse the Wiki markup.
Install
Let’s install mwlib then. Unfortunately, the documentation for mwlib is somewhere between non-existent and useless.
If you’re on a shared setup, the easiest way I found to install it is with virtualenv.py:
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py wiki
$ . wiki/bin/activate
(wiki)$ pip install mwlib
Setup
You then need to download the dump(s) of the data you want to deal with. Find it in the full list of available dumps. It’s likely you’ll just want the articles, but refer to the documentation for more details of what’s on offer. Here we want the articles-only dump of Simple English Wikipedia:
$ wget http://dumps.wikimedia.org/simplewiki/20120112/simplewiki-20120112-pages-articles.xml.bz2
mwlib uses a cdb database format to speed up access. Build that database from the dump with:
$ mw-buildcdb --input=some-dump.xml.bz2 --output=some_output_dir
This will take a couple of minutes, depending on the size of your dump. Coffee time.
Once this is done, your some_output_dir should contain three files that we’ll access from mwlib later on:
wikiidx.cdb - the articles index file
wikidata.bin - the wikitext for the articles
wikiconf.txt - config file for mwlib
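Before moving on, it can be worth checking that the build really produced those files. This is a small sketch, assuming only the three file names listed above; the helper name missing_cdb_files is made up, and the check uses just the standard library, not mwlib.

```python
import os

# File names as listed above; the helper below is hypothetical, not part of mwlib.
EXPECTED_FILES = ("wikiidx.cdb", "wikidata.bin", "wikiconf.txt")


def missing_cdb_files(output_dir):
    """Return the expected files that are absent from output_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(output_dir, name))
    ]


if __name__ == "__main__":
    # "some_output_dir" is the placeholder directory used in the command above.
    missing = missing_cdb_files("some_output_dir")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All three files are present.")
```

If anything is reported missing, the build probably did not finish, so rerun mw-buildcdb before going further.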
Parsing
We’re going to parse Simple English wiki’s page on Cheese, mainly because I’m hungry. Check out the raw Wiki syntax on the edit page.
Run this with Python, and you should get some wonderful output.
...
Inline link: Super Adams Smelly Cheese
Inline link: Emmental cheese
Inline link: String Cheese
Inline link: Marble cheese
Inline link: Wikimedia Commons
Inline link: media
Linked Japanese article: ja:チーズ
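The output has two kinds of lines: ordinary inline links and links to the same article in another language. As a rough illustration of that distinction, and not of how mwlib itself does it, here is a sketch that sorts link targets by their prefix. The function name describe_link and the tiny language table are assumptions made up for this example.

```python
# Hypothetical, tiny table of language prefixes; a real one would be much longer.
LANGUAGE_NAMES = {"ja": "Japanese", "de": "German", "fr": "French"}


def describe_link(target):
    """Format a link target in the style of the output lines above."""
    prefix, sep, _rest = target.partition(":")
    # A target such as "ja:..." is treated as a link to another language's article.
    if sep and prefix in LANGUAGE_NAMES:
        return "Linked %s article: %s" % (LANGUAGE_NAMES[prefix], target)
    return "Inline link: %s" % target


for target in ["Emmental cheese", "String Cheese", "ja:チーズ"]:
    print(describe_link(target))
```

A parser that builds a proper tree can make this call from the node type rather than from string prefixes, which is one more reason to lean on a library here.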
Next
I’ll post more as I get further. For now, you should have a good craving for cheese.
benhumphreys posted this