Ben Humphreys


Parsing Wiki Text (Part 2)

More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup? Congratulations. This should save you from insanity and oblivion.

mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use. It can parse the zipped XML dumps, as I’ve shown in previous tutorials. However, for that you have to provide it with an article name, and today we’re feeling more John Wayne.

How can we feed mwlib raw Wiki markup and get the same hierarchical deliciousness as output? Here’s how:
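The snippet that belongs here did not survive, so what follows is a minimal sketch rather than the original code. It assumes your installed mwlib exposes `mwlib.uparser.simpleparse`, a debugging helper that parses a raw string against a dummy wiki database and prints the resulting tree. The `parse_raw` wrapper, its `parser` argument and the `SAMPLE` string are hypothetical names made up for this illustration.

```python
def parse_raw(markup, parser=None):
    """Parse raw wiki markup and return the root node of the parse tree.

    Assumption: when no parser is passed in, mwlib is installed and provides
    mwlib.uparser.simpleparse, which also prints the tree to stdout.
    """
    if parser is None:
        # Imported lazily because mwlib is a third-party package.
        from mwlib.uparser import simpleparse
        parser = simpleparse
    return parser(markup)


# A small piece of raw markup: two headings, a list and an internal link.
SAMPLE = "=h1=\n*item 1\n*item 2\n==h2==\nsome [[Link|caption]] there\n"
```

With mwlib installed, `parse_raw(SAMPLE)` should dump the tree and hand back its root. The `parser` argument is only there so the wrapper can be exercised without mwlib.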

A more involved example:
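The original example is missing here too. As a stand-in, this is a sketch of one thing you might do with the tree: walk it depth-first and print an indented outline of node types. It assumes only that each node keeps its sub-nodes in a `children` attribute, which is how I remember mwlib's nodes working; `walk` and `outline` are hypothetical helpers, not part of mwlib.

```python
def walk(node, depth=0):
    """Yield (depth, node) pairs for a parse tree in depth-first order."""
    yield depth, node
    # Assumption: sub-nodes live in a `children` list; leaves may lack it.
    for child in getattr(node, "children", None) or []:
        for pair in walk(child, depth + 1):
            yield pair


def outline(root):
    """Return an indented outline of node class names, one per line."""
    return "\n".join("  " * depth + type(node).__name__
                     for depth, node in walk(root))
```

Pass `outline` the root node that mwlib's parser gives you and print the result to see the shape of the article at a glance.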

    • #phd
    • #programming
    • #wikipedia
    • #wiki
    • #python
  • 3 months ago


About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.
