Fast Ruby Parsers
This thread mentions a bunch of Ruby parsers, with a focus on being fast for small languages. It’s a great summary, and I’m noting it here mostly for my own future reference.
Parsing Wiki Text (Part 2)
More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup. Congratulations. This should save you from insanity and oblivion.
mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use.
It can parse the zipped XML dumps as I’ve shown in previous tutorials.
However, you have to provide it with an article name, and today we’re feeling more John Wayne.
How can we feed mwlib raw Wiki markup and get the same hierarchical deliciousness as output? Here’s how:
A more involved example:
Print Wiki Parse Tree with `mwlib`
It’s hard to see how mwlib treats Wiki markup. You can use this small snippet to output the tree of an article and see what you’re dealing with. It’s recursive, so a huge document could run into Python’s recursion limit, although I haven’t hit any problems yet.
Output is something like:
Section
Node
Text
Node
Paragraph
Text
Paragraph
Style
Node
Node
ArticleLink
Text
ArticleLink
Text
...
There’s probably a function within mwlib for doing this, but I couldn’t find it.
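For reference, here is a minimal sketch of that kind of tree printer. It does not import mwlib: the `Node`, `Section`, `Paragraph` and `Text` classes are hypothetical stand-ins, and the only thing assumed about a real parse-tree node is that it exposes its sub-nodes in a `children` list. The second function walks the same tree with an explicit stack, which is one way around the recursion worry mentioned above.

```python
class Node:
    """Hypothetical stand-in for a parse-tree node; only `children` is assumed."""

    def __init__(self, *children):
        self.children = list(children)


# Hypothetical node types, named after the ones in the output above.
class Section(Node):
    pass


class Paragraph(Node):
    pass


class Text(Node):
    pass


def tree_lines(node, depth=0):
    """Recursively yield one indented line per node, labelled with its class name."""
    yield "  " * depth + type(node).__name__
    for child in getattr(node, "children", []):
        yield from tree_lines(child, depth + 1)


def tree_lines_iterative(root):
    """Same traversal with an explicit stack, so tree depth is not tied to the recursion limit."""
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        yield "  " * depth + type(node).__name__
        # Push children in reverse so they are popped in document order.
        for child in reversed(getattr(node, "children", [])):
            stack.append((child, depth + 1))


tree = Section(Node(Text()), Paragraph(Text()))
print("\n".join(tree_lines(tree)))
```

To try it on a real parse tree, pass the root node to `tree_lines` in place of the toy `tree`, provided its nodes keep their sub-nodes in `children`.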