Ben Humphreys

Python Regex Global Replace

Today’s fun Python gotcha: I couldn’t find out how to do a global regular expression replace. In Perl it would look like:

$my_str =~ s/foo/bar/g;

In Python it’s:

import re
my_str = re.sub('foo', 'bar', my_str)

It turns out I wasn’t reading the sub() documentation properly. The amazing @puresock saved the day and pointed out this from the documentation:

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

tl;dr Python’s regex-replace function sub() defaults to global. So you don’t need a global flag.
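
A minimal sketch of the difference, using a made-up sample string: leaving count out replaces every match, while count=1 mimics Perl's s/foo/bar/ without the g flag. Note that sub() returns a new string rather than changing the original in place.

```python
import re

# Sample text invented for this illustration.
text = "foo one, foo two, foo three"

# No count given: every match is replaced, like Perl's s/foo/bar/g.
replace_all = re.sub("foo", "bar", text)

# count=1 replaces only the first match, like Perl's s/foo/bar/ without /g.
replace_first = re.sub("foo", "bar", text, count=1)

print(replace_all)    # bar one, bar two, bar three
print(replace_first)  # bar one, foo two, foo three
```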

    • #programming
    • #python
  • 3 weeks ago

Kamogawa river, Kyoto

  • 3 weeks ago

Parsing Wiki Text (Part 1)

I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it.

This is just a quick introduction to what I did today, pieced together from mailing list posts and various other posts online.

Wikimedia Markup

First, some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, wikimedia markup, wiki something. Oddly, by some definitions it isn’t a real grammar at all, merely something that can be translated into HTML easily.

From what I’ve read online, parsing it naïvely using regular expressions is not a great idea. There are too many edge cases and odd ways of writing things for that to be viable. However, I have a feeling that a regex-only approach would be a lot faster than the grammar-like parsers that are available. Maybe I’ll do a short benchmark another time.
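
To give a feel for the kind of edge case involved, here is a small sketch of a naive regex tripping over a nested template. The wikitext sample is invented for illustration, and the pattern is just one plausible first attempt, not something taken from a real parser.

```python
import re

# Made-up sample: a template that contains another template.
wikitext = "{{Infobox|name={{lang|fr|Fromage}}|type=food}} Cheese is a dairy food."

# Naive attempt: grab whatever sits between "{{" and the next "}}".
naive_template = re.compile(r"\{\{(.*?)\}\}")

matches = naive_template.findall(wikitext)
print(matches)  # ['Infobox|name={{lang|fr|Fromage']
```

The match stops at the inner template's closing braces, so the outer template is cut in half and "|type=food" is lost. Python's re module has no way to count nesting depth, which is one reason a regex-only parser gets fiddly fast.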

For now, rather than roll my own, I decided to use the Python library mwlib to parse the Wiki markup.

Install

Let’s install mwlib then. Unfortunately the documentation for mwlib is somewhere between non-existent and useless.

If you’re on a shared setup, I found the easiest way to install is using virtualenv.py:

$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py wiki
$ . wiki/bin/activate
(wiki)$ pip install mwlib

Setup

You then need to download the dump(s) of the data you want to deal with. Find it in the full list of available dumps. It’s likely you’ll just want the articles, but refer to the documentation for more details of what’s on offer. We want the articles-only dump of Simple English Wikipedia:

wget http://dumps.wikimedia.org/simplewiki/20120112/simplewiki-20120112-pages-articles.xml.bz2
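
If you want to peek inside the dump before going further, the standard library is enough. This is only a sketch: iter_titles is a hypothetical helper name, it is not part of mwlib, and it assumes the dump follows the usual MediaWiki XML export layout with page and title elements.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_titles(source):
    """Yield page titles from a bz2-compressed MediaWiki XML dump.

    `source` may be a file path or a binary file object.
    """
    with bz2.open(source, "rb") as dump:
        for _event, elem in ET.iterparse(dump):
            # Tags carry an XML namespace prefix in braces, so compare
            # only the local part of the tag name.
            local_name = elem.tag.rsplit("}", 1)[-1]
            if local_name == "title":
                yield elem.text
            elif local_name == "page":
                # Drop the finished page's contents to keep memory use down.
                elem.clear()
```

Calling iter_titles("simplewiki-20120112-pages-articles.xml.bz2") and taking the first few items with itertools.islice should be a quick sanity check that the download is intact.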

mwlib uses a cdb database format to speed up access. Build that database with:

mw-buildcdb --input=some-dump.xml.bz2 --output=some_output_dir

This will take a couple of minutes, depending on the size of your dump. Coffee time. Once this is done, your some_output_dir should contain three files that we’ll access from mwlib later on:

  • wikiidx.cdb - the articles index file.
  • wikidata.bin - the wikitext for the articles.
  • wikiconf.txt - config file for mwlib
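
A quick way to confirm the build worked is to check for those three files. This sketch only uses the standard library; missing_cdb_files is a hypothetical helper name, not an mwlib function, and the file names are the ones listed above.

```python
from pathlib import Path

# File names as listed above for the mw-buildcdb output directory.
EXPECTED_FILES = ("wikiidx.cdb", "wikidata.bin", "wikiconf.txt")


def missing_cdb_files(output_dir):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from missing_cdb_files("some_output_dir") means all three files are in place.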

Parsing

We’re going to parse Simple English Wikipedia’s page on Cheese, mainly because I’m hungry. Check out the raw wiki syntax on the edit page.

Run this with Python, and you should get some wonderful output.

...
Inline link: Super Adams Smelly Cheese
Inline link: Emmental cheese
Inline link: String Cheese
Inline link: Marble cheese
Inline link: Wikimedia Commons
Inline link: media
Linked Japanese article: ja:チーズ

Next

I’ll post more as I get further. For now, you should have a good craving for cheese.

    • #programming
    • #nlp
    • #wikipedia
    • #phd
    • #python
  • 4 weeks ago

Just found out my girlfriend posted this to Programmer Ryan Gosling. So happy :D

Even mentions all-important tea. happy sigh

Source: programmerryangosling

    • #programming
    • #submission
  • 4 weeks ago > programmerryangosling

Installing Python Junk without Hassle

Note to self: how to install Python stuff with minimum fuss. Looks like virtualenv.py is sort of like RVM for Python.

$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install ...
    • #programming
    • #python
  • 4 weeks ago

RECYCLE OR DIE

  • 4 weeks ago

Japanese Equivalent of The Onion

I love The Onion. For a long time I wished there was an equivalent satirical site in Japanese. Someone just told me about the Kyoko Shimbun, a Japanese site full of made-up amusing stories.

For example, they have a story on McDonald’s Japan releasing the McDonal-don, a bowl of rice (don) topped with a burger.

Another story is about naturally drying baumkuchen cakes in the sun.

And another is about the discovery that pi is only 10 digits long, with all the calculations until now having been caused by a bug in the program running them. The quote from the researcher at the end is great.

The stories are short enough, and have a good enough variety of vocabulary, to be a pretty good way to practice Japanese, I think.

    • #japanese
    • #language
  • 1 month ago

Papers Incorrectly Identifies Duplicates

For those of you who use Papers like me, a quick warning - I’ve found it often incorrectly identifies non-duplicate papers as duplicates. I’m not sure why it does this, maybe because part of the download URL is the same.

Just double-check whether it really is a duplicate. It’s happened to me so often that I no longer check and just hit “Ignore” every time. It’s safer than losing a bunch of papers I just imported.

    • #papers
    • #phd
    • #research
  • 1 month ago

Learn French with Bouletcorp

Bouletcorp is a superb French comic blog, with a combination of mind-expanding topics, humour and great artwork. The French is quite challenging and idiomatic, which makes it great to study but tricky to look up. But luckily most of the comics are translated into English. So open up both versions in your browser and get reading.

The image is taken from the French and English comic called Kitchen Darwinism.

    • #français
    • #french
    • #comics
    • #language
  • 1 month ago

2012

  • 1 month ago

About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.

Me, Elsewhere

  • @benhumphreys on Twitter
  • benhumphreys on github