Fast Ruby Parsers
This thread mentions a bunch of Ruby parsers, with a focus on speed for small languages. It’s a great summary, posted here mostly for my own future reference.
Parsing Wiki Text (Part 2)
More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup. Congratulations, this should save you from insanity and oblivion.
mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use.
It can parse the zipped XML dumps as I’ve shown in previous tutorials.
However, you have to provide it with an article name, and today we’re feeling more John Wayne.
How can we feed mwlib raw Wiki markup and get the same hierarchical deliciousness as output? Here’s how:
A more involved example:
Dealing with Unicode CSV files in Python
Today I found out that Python’s csv module cannot read UTF-8 files (see the warning at the top of the page).
There are two Stack Overflow questions on the topic (here and here) but the solutions don’t seem to work for Python 2.6.
The code examples on the official page borrow from Stack Overflow (or the other way around), and don’t seem to help much.
The more I learn about Python the more I appreciate Ruby.
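For what it’s worth, the limitation is specific to Python 2. A minimal sketch, assuming Python 3 rather than the 2.6 mentioned above: open the file in text mode with an explicit encoding and the csv module handles Unicode directly. The helper name read_utf8_csv is made up for illustration.

```python
import csv


def read_utf8_csv(path):
    """Return all rows of a UTF-8 encoded CSV file as lists of str."""
    # newline="" lets the csv module handle line endings itself,
    # as the csv documentation recommends.
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))
```

Each row comes back as a list of ordinary str values, so non-ASCII cells such as チーズ need no extra decoding step.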
Parsing Edict XML with Perl and XML::LibXML
Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is the XML dump of English-only data.
It might help someone sometime, so I’ve posted a short Perl snippet of how to parse the format.
A single entry looks like:
Print Wiki Parse Tree with `mwlib`
It’s hard to see how mwlib treats Wiki markup. You can use this small snippet to output the tree of an article and see what you’re dealing with. It’s recursive, so a huge document could cause problems, although I haven’t hit any yet.
Output is something like:
Section
Node
Text
Node
Paragraph
Text
Paragraph
Style
Node
Node
ArticleLink
Text
ArticleLink
Text
...
There’s probably a function within mwlib for doing this, but I couldn’t find it.
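A minimal sketch of such a recursive walk, assuming only that each node keeps its children in a list attribute named children. The classes below are stand-ins for illustration, not mwlib’s own, and format_tree is a made-up name.

```python
class Node:
    """Stand-in for a parse-tree node; only `children` is assumed."""

    def __init__(self, *children):
        self.children = list(children)


class Section(Node):
    pass


class Paragraph(Node):
    pass


class Text(Node):
    pass


def format_tree(node, depth=0):
    """Return one line per node: its class name, indented by depth."""
    lines = ["  " * depth + type(node).__name__]
    # Recurse into the children; nodes without the attribute are leaves.
    for child in getattr(node, "children", []):
        lines.extend(format_tree(child, depth + 1))
    return lines


# A tiny hand-built tree standing in for a parsed article.
tree = Section(Node(Text()), Paragraph(Text()))
print("\n".join(format_tree(tree)))
```

Indenting by depth makes the nesting visible, which is the whole point when you are trying to work out what a parser did with your markup.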
Python Regex Global Replace
Today’s fun Python gotcha: I couldn’t find out how to do a global regular expression replace. In Perl it would look like:
$my_str =~ s/foo/bar/g;
In Python it’s:
import re
my_text = re.sub('foo', 'bar', my_text)
It turns out I wasn’t reading the sub() documentation properly. The amazing @puresock saved the day and pointed out this from the documentation:
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so
sub('x*', '-', 'abc') returns '-a-b-c-'.
tl;dr Python’s regex-replace function sub() defaults to global. So you don’t need a global flag.
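A small sketch of the difference the count argument makes. The sample string is made up; note that re.sub() returns a new string rather than changing the original.

```python
import re

my_text = "foo foo foo"

# No count given: every occurrence is replaced (like Perl's s/foo/bar/g).
replaced_all = re.sub("foo", "bar", my_text)

# count=1: only the first occurrence is replaced (like Perl's s/foo/bar/).
replaced_first = re.sub("foo", "bar", my_text, count=1)

print(replaced_all)    # bar bar bar
print(replaced_first)  # bar foo foo
```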
Parsing Wiki Text (Part 1)
I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it.
This is just a quick introduction on what I did today pieced together from mailing list posts and various other posts online.
Wikimedia Markup
First, some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, wikimedia markup, wiki something. By some weird definitions it’s not a real grammar, merely something that can be translated into HTML easily.
From what I’ve read online, parsing it naïvely using regular expressions is not a great idea. There are too many edge cases and weird ways of writing things for that to be viable. However, I have a feeling that using only regular expressions would be a lot faster than the grammar-like parsers that are available. Maybe I’ll do a short benchmark another time.
For now, rather than roll my own, I decided to use the Python library mwlib to parse the Wiki markup.
Install
Let’s install mwlib then. Unfortunately the documentation for mwlib is somewhere between non-existent and useless.
If you’re on a shared setup, I found the easiest way to install is using virtualenv.py:
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py wiki
$ . wiki/bin/activate
(wiki)$ pip install mwlib
Setup
You then need to download the dump(s) of the data you want to deal with. Find it from the full list of dumps available. It’s likely you’ll just want the Articles, but refer to the documentation for more details of what’s on offer. We want the articles-only dump of Simple English:
wget http://dumps.wikimedia.org/simplewiki/20120112/simplewiki-20120112-pages-articles.xml.bz2
mwlib uses a db format to speed up access. Set up that DB with:
mw-buildcdb --input=some-dump.xml.bz2 --output=some_output_dir
This will take a couple of minutes, depending on the size of your dump. Coffee Time
Once this is done, your some_output_dir should contain 3 files that we’ll access from mwlib later on:
- wikiidx.cdb - the articles index file
- wikidata.bin - the wikitext for the articles
- wikiconf.txt - config file for mwlib
Parsing
We’re going to parse Simple English wiki’s page on Cheese, mainly because I’m hungry. Check out the raw Wiki syntax on the edit page.
Run this with Python, and you should get some wonderful output.
...
Inline link: Super Adams Smelly Cheese
Inline link: Emmental cheese
Inline link: String Cheese
Inline link: Marble cheese
Inline link: Wikimedia Commons
Inline link: media
Linked Japanese article: ja:チーズ
Next
I’ll post more as I get further. For now, you should have a good craving for cheese.
Just found out my girlfriend posted this to Programmer Ryan Gosling. So happy :D
Even mentions all-important tea. happy sigh
Source: programmerryangosling
Installing Python Junk without Hassle
Note to self: how to install Python stuff with minimum fuss. Looks like virtualenv.py is sort of like RVM for Python.
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install ...
Using MongoDB for Research - Don’t
1. Parse data using your programming language of choice. Wonderful.
2. Insert data into MongoDB in an easy-to-understand hierarchical structure.
3. Write other scripts to compare, process and analyse the data. Joy.
4. Add more data to the database.
5. See BSONElement exception. Curse, search the internet for why.
6. Give up, run mysterious db.repairDatabase(). Hope data is OK. Fail. Reload data.
7. Run some tools, add more data to the DB.
8. 15 minutes later, GOTO 5
Here’s to you, Invalid BSONObj size: -286331154