Ben Humphreys


Fast Ruby Parsers

This thread mentions a bunch of Ruby parsers, with a focus on being fast for small languages. This is more for my own future reference than anything; it’s a great summary.

    • #ruby
    • #programming
    • #wiki
  • 3 weeks ago

Parsing Wiki Text (Part 2)

More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup? Congratulations. This should save you from insanity and oblivion.

mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use. It can parse the zipped XML dumps, as I’ve shown in previous tutorials. However, you have to provide it with an article name, and today we’re feeling more John Wayne.

How can we feed mwlib raw Wiki markup and get the same hierarchical deliciousness as output? Here’s how:
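One way this can look, as a minimal sketch rather than a definitive recipe. It assumes mwlib is installed and that `parseString` in `mwlib.uparser` and `DummyDB` in `mwlib.dummydb` are available in your version; the wrapper name `parse_markup` and the default title are made up for illustration.

```python
def parse_markup(markup, title="Untitled"):
    """Parse a raw wiki markup string into an mwlib node tree (sketch)."""
    # Imported lazily so this file still loads where mwlib is not installed.
    from mwlib import dummydb
    from mwlib.uparser import parseString

    # Assumption: DummyDB stands in for a real wiki database, so nothing is
    # looked up in an actual dump; only the markup passed in is parsed.
    return parseString(title=title, raw=markup, wikidb=dummydb.DummyDB())


# Usage (needs mwlib installed):
#   tree = parse_markup("== Heading ==\nSome [[Link|caption]] text.\n")
#   print(tree.children)
```

The title is only a label for the resulting article node, so any placeholder string should do when there is no real article behind the markup.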

A more involved example:

    • #phd
    • #programming
    • #wikipedia
    • #wiki
    • #python
  • 3 weeks ago

Print Wiki Parse Tree with `mwlib`

It’s hard to see how mwlib treats Wiki markup. You can use this small snippet to output the tree of an article and see what you’re dealing with. It’s recursive, so a huge document could cause problems, although I haven’t hit any yet.

Output is something like:

Section
  Node
    Text
  Node
    Paragraph
      Text
    Paragraph
      Style
        Node
          Node
            ArticleLink
            Text
            ArticleLink
            Text
...

There’s probably a function within mwlib for doing this, but I couldn’t find it.
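A minimal sketch of such a recursive printer, assuming only that every node keeps its sub-nodes in a `children` list, as mwlib's parse nodes do. The function name `format_tree` and the stand-in node classes are hypothetical, there so the example runs without mwlib installed.

```python
def format_tree(node, depth=0, indent="  "):
    """Return one line per node: its class name, indented by tree depth."""
    lines = [indent * depth + type(node).__name__]
    # Nodes without a `children` attribute are treated as leaves.
    for child in getattr(node, "children", []):
        lines.extend(format_tree(child, depth + 1, indent))
    return lines


# Stand-in node classes (hypothetical), mimicking the shape of a parse tree.
class Node:
    def __init__(self, *children):
        self.children = list(children)


class Section(Node):
    pass


class Paragraph(Node):
    pass


class Text(Node):
    pass


tree = Section(Node(Text()), Node(Paragraph(Text())))
print("\n".join(format_tree(tree)))
```

The recursion goes one level deeper per level of nesting, not per node, so a long but shallowly nested article should stay well inside Python's default recursion limit.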

    • #programming
    • #python
    • #mwlib
    • #wiki
  • 1 month ago

About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.
(Japanese version)

Me, Elsewhere

  • @benhumphreys on Twitter
  • benhumphreys on github