Parsing Wiki Text (Part 2)
More on parsing Wikipedia markup. So you’ve abandoned the idea of using regular expressions to parse the markup, congratulations. This should save you from insanity and oblivion.
mwlib is a Python library for parsing Wiki markup. It seems semi-official. Whatever, it’s what we’re going to use.
It can parse the zipped XML dumps as I’ve shown in previous tutorials.
However you have to provide it with an article name, and today we’re feeling more John Wayne.
How can we feed mwlib raw Wiki markup and get the same hierarchical deliciousness as output? Here’s how:
A more involved example:
Parsing Edict XML with Perl and XML::LibXML
Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is the XML dump of English-only data.
It might help someone sometime, so I’ve posted a short Perl snippet of how to parse the format.
A single entry looks like:
Parsing Wiki Text (Part 1)
I’ll be dealing with Wiki text a lot in the coming months, so I thought I’d post what I learn, as I learn it.
This is just a quick introduction on what I did today pieced together from mailing list posts and various other posts online.
Wikimedia Markup
First some points on the markup used on Wikipedia. There seem to be a few names for this syntax: wikitext, Wikimedia markup, wiki something. By some weird definitions it’s not a real grammar, merely something that can be translated into HTML easily.
From what I’ve read online, parsing it naïvely using regular expressions is not a great idea. There are too many edge cases, and weird ways in which things can be written for it to be viable. However I have a feeling that using regular expressions only would be a lot faster than the grammar-like parsers that are available. Maybe I’ll do a short benchmark another time.
For now, rather than roll my own, I decided to use the Python library mwlib
to parse the Wiki markup.
Install
Let’s install mwlib then. Unfortunately the documentation for mwlib is somewhere between non-existent and useless.
If you’re on a shared setup, I found the easiest way to install is using
virtualenv.py:
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py wiki
$ . wiki/bin/activate
(wiki)$ pip install mwlib
Setup
You then need to download the dump(s) of the data you want to deal with. Find it from the full list of dumps available. It’s likely you’ll just want the Articles, but refer to the documentation for more details of what’s on offer. We want the articles-only dump of Simple English:
wget http://dumps.wikimedia.org/simplewiki/20120112/simplewiki-20120112-pages-articles.xml.bz2
mwlib uses a db format to speed up access. Set up that DB with:
mw-buildcdb --input=some-dump.xml.bz2 --output=some_output_dir
This will take a couple of minutes, depending on the size of your dump. Coffee time.
Once this is done, your some_output_dir should contain 3 files that we’ll
access from mwlib later on.
- wikiidx.cdb - the articles index file
- wikidata.bin - the wikitext for the articles
- wikiconf.txt - config file for mwlib
Parsing
We’re going to parse Simple English wiki’s page on Cheese, mainly because I’m hungry. Check out the raw Wiki syntax on the edit page.
Run this with Python, and you should get some wonderful output.
...
Inline link: Super Adams Smelly Cheese
Inline link: Emmental cheese
Inline link: String Cheese
Inline link: Marble cheese
Inline link: Wikimedia Commons
Inline link: media
Linked Japanese article: ja:チーズ
Next
I’ll post more as I get further. For now, you should have a good craving for cheese.
Papers Incorrectly Identifies Duplicates
For those of you who use Papers like me, a quick warning - I’ve found it often incorrectly identifies non-duplicate papers as duplicates. I’m not sure why it does this, maybe because part of the download URL is the same.
Just double-check whether it really is a duplicate. It’s happened to me so often that I no longer check and just hit “Ignore” every time. It’s safer than losing a bunch of papers I just imported.
Segmentation and Evaluation
This is just a short post as it’s too long to put on Twitter. Today I tried segmenting NTCIR-7 English–Japanese MT data by various methods and seeing if it affected their BLEU and RIBES scores.
Using BLEU on the character level was tried in BLEU in characters (Denoual and Lepage 2005), which showed that for English, character-level BLEU correlates with word-level BLEU. However Japanese is a very different language, so I’m not sure the result is that applicable. I was curious how naïve segmentation of Japanese would compare against JUMAN, as well as how removing particles or limiting the output to only the kanji would affect correlation.
                     BLEU               RIBES
Method               Fluency   Adeq     Fluency   Adeq
juman                0.384     0.296    0.537     0.618
juman_noparticles    0.347     0.257    0.527     0.609
juman_nopunc         0.363     0.273    0.507     0.585
kanakanji            0.409     0.330    0.544     0.632
kytea                0.383     0.325    0.540     0.625
only_kanji           0.309     0.202    0.507     0.555
only_kanji_1gram     0.309     0.203    0.505     0.563
1gram                0.351     0.274    0.527     0.635
2gram                0.203     0.130    0.363     0.415
3gram                0.206     0.148    0.275     0.243
4gram                0.172     0.094    0.130     0.090
5gram                0.170     0.140    0.141     0.127
The weird thing I noticed was that 1-grams have almost as good a correlation with human evaluators as Japanese segmented using JUMAN.
For BLEU my Kana/Kanji naïve segmentation method got the highest correlation. For this I segment letters into kanji/kana/punctuation groups, as in the example below:
- Ref: その 結果 , 第 2 固形感光性樹脂膜 120 を 除去 する 工程 を 簡略化 できるので , 液状感光性樹脂膜 118 を 使用 して バンプ 電極 122 構造 を 形成 しても , 半導体装置 の 製造 コスト の 上昇 を 抑制 できる 。
- Candidate: これにより 、 液状感光性樹脂膜 の 構造 を 用 いることにより 、 バンプ 電極 124 が 形成 されているにもかかわらず 、118、 第 2 の 固体 の 除去工程 を 簡略化 することができる 感光性樹脂膜 120、 したがって 、 半導体装置 の 製造 コスト の 上昇 も 抑 えることができる 。
In this case I assume that the short particles, numbers and punctuation are matching up. There’s no way huge chunks like “されているにもかかわらず” will match up. This method even breaks up verbs incorrectly as in “抑 える”.
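The kana/kanji grouping described above can be sketched roughly as follows. This is a reconstruction from the example, not the script actually used for the experiment: the function names are made up, the Unicode ranges are the common blocks for kanji, hiragana and katakana, and lumping digits and punctuation into a single "other" class is an assumption based on tokens like “、118、” in the candidate sentence.

```python
def char_class(ch):
    """Return a coarse script class for one character (assumed classes)."""
    code = ord(ch)
    if ch.isspace():
        return "space"
    if 0x4E00 <= code <= 0x9FFF or ch == "々":
        return "kanji"
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    # Digits, punctuation, Latin letters and anything else share one class.
    return "other"


def segment(text):
    """Split text wherever the script class changes; whitespace also splits."""
    tokens = []
    current = ""
    current_class = None
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            if current:
                tokens.append(current)
            current, current_class = "", None
            continue
        if current and cls != current_class:
            tokens.append(current)
            current = ""
        current += ch
        current_class = cls
    if current:
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    print(" ".join(segment("上昇も抑えることができる。")))
```

Run on “上昇も抑えることができる。” this gives “上昇 も 抑 えることができる 。”, which reproduces the incorrect verb split mentioned above.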
My problem is that I’m trying to work out whether this even means anything. In the case of RIBES, a high score means that characters in the reference and candidate translations are in roughly the same positions. However as the “words” are now single characters, maybe the chance that there are hiragana in similar positions is high enough to give false high scores? But in that case wouldn’t the correlation be fairly poor?
I tried varying the max length of n-grams that BLEU considered, but it didn’t have much effect on the correlation.
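For reference, here is a minimal sketch of what "varying the max length of n-grams" means. It is a plain sentence-level BLEU with a single reference and no smoothing, written for illustration only; it is not the scorer used for the numbers above, and the function names and the `max_n` parameter are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, using n-grams up to max_n."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty for candidates that are not longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    cand = ["a", "b", "c"]
    ref = ["a", "b", "d"]
    for n in (1, 2):
        print(n, round(bleu(cand, ref, max_n=n), 3))
```

With character-level "words", `max_n` effectively controls how long a run of matching characters has to be before it is rewarded.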
BLEU correlating better with fluency than adequacy, and RIBES better with adequacy than fluency, supports what I’ve read recently about their relative strengths.
I need to think about this more.
Dear Science — Let’s stop using PDF — Part 2
I’ve thought more about how to implement what I put forward in the first part of Dear Science — Let’s stop using PDF, and I believe the problem can be broken down into two parts:
- Generating HTML — converting LaTeX to HTML
- Presentation — presenting text and figures in a resolution-independent way
Generating HTML
This is probably the harder of the two tasks.
Researchers are concerned with their research, and want to write and publish papers with the least amount of hassle. There are very few researchers who would go out of their way to publish their papers in HTML. This means that the tools for publishing their data must be ridiculously easy to use.
Requirements
- Generate perfect HTML, with no need to fiddle with broken tags afterwards.
- Generate straight from LaTeX & bib files.
- Ideally, the tool would work straight from PDF, but that seems impossible.
- Should be runnable online, to show users what output the tool can provide.
- Should also have a command-line tool that is runnable on pretty much anything — so no cutting-edge Ruby or Python requirements. Something Mac, Linux, Windows compatible.
For the LaTeX parsing itself, there is a separate set of requirements:
- Must convert LaTeX tabular elements to HTML <table> content.
- Must deal with internal references intelligently. How they are processed and output depends somewhat on the way they will be presented by the HTML (inline, with links, or something else).
- Preserve both inline and block mathematics to allow MathJax to parse them correctly.
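To give an idea of the first requirement, here is a rough sketch of a tabular-to-table conversion. It only handles the simplest case: cells separated by `&`, rows ended by `\\`, and `\hline` thrown away. Column specs containing braces (like `p{3cm}`), `\multicolumn`, escaped `\&` and nested environments are all ignored, and the function name is made up for this example.

```python
import html
import re


def tabular_to_html(latex):
    """Convert the first simple tabular environment in `latex` to an HTML table."""
    match = re.search(
        r"\\begin\{tabular\}\{[^}]*\}(.*?)\\end\{tabular\}", latex, re.DOTALL
    )
    if match is None:
        raise ValueError("no tabular environment found")
    # Horizontal rules carry no data, so drop them before splitting rows.
    body = match.group(1).replace("\\hline", "")
    rows = []
    for raw_row in body.split("\\\\"):
        if not raw_row.strip():
            continue
        cells = [html.escape(cell.strip()) for cell in raw_row.split("&")]
        rows.append("<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    sample = r"""\begin{tabular}{lr}
\hline
method & score \\
juman & 0.384 \\
\hline
\end{tabular}"""
    print(tabular_to_html(sample))
```

Even this toy version shows why a real parser is needed: anything beyond plain cells breaks the split-on-`&` approach immediately.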
Existing Tools
People have tried to tackle this problem before, often trying to support the whole of LaTeX markup and render different environments correctly.
I had hoped it would be possible to support the bare minimum of LaTeX, but given the number of ways in which LaTeX can be extended and the variety of TeX documents researchers produce, this seems difficult.
That being said, there are no existing tools that do the job adequately. Here are some existing tools (via Dropbear):
- Hyperlatex - Emacs macros, unextendable, useless
- TTH - Too much focus on equations that MathJax will do for us.
- HeVeA - Written in Objective Caml. Citation stuff looks pretty good though. Probably best out of these.
- LaTeX2HTML - Died 2001
- LaTeXML - Quite well made. Especially the tabular stuff
Pandoc deserves a special mention. It aims to convert between LaTeX, HTML, Markdown and a host of other formats and does a fairly good job of doing so. However its LaTeX parsing and HTML conversion are pretty poor:
- Dies on \begin tags for unknown reasons
- Sort of works with bib files, but not with bib entries that are at the end of the tex file. In this case it silently kills \cite tags.
I’d like to think it’s possible to extend Pandoc but I don’t know if it’s worth it. It’s written in Haskell.
Presentation
This is wholly different to the generation problem discussed above. It’s a typography and data presentation task.
Requirements
In a sentence, the main requirement is for the article to look great on any screen resolution — desktop, tablet, phone. The resolution flexibility is the hardest part.
- Lines of text cannot be too long (8~10 words). This is not an issue on tablet/phone, but on desktop there is the issue of whether to use multiple columns or have large amounts of whitespace.
- Tables and figures — Must look good at any resolution.
- References — I have thought about this and some JS-enhanced click-to-see-details functionality might be a good idea.
- Must be viewable with CSS and Javascript disabled. This means no putting information in HTML5 data- attributes.
Technical Details
I have thought about this a little and I propose the final output should be a single HTML file that links to a single CSS file and a single JS file, both stored somewhere like GitHub.
This lets us make improvements without researchers having to update their files. It also allows users to download copies of the CSS/JS if they want to, or to change the appearance of their paper by simply linking to other files.
As stated in the requirements, it’s vital that the HTML file make sense and be easy to read without CSS or JS. This is for accessibility (screen-readers) and web scraping.
I admit I haven’t really kept up with the features of HTML5, but it seems there are semantic tags that will make the data make more sense. For example figures and their captions can be marked up with <figure> and <figcaption>, the paper title and abstract make sense inside the new <header> tag.
Future
This project is huge. Writing a competent LaTeX parser alone could take a while. And HTML formatting that works with the variety of figure sizes is a problem.
For now I have a sample paper that I’ve converted into HTML and I’m playing with layout at various resolutions. When it’s more presentable I’ll publish it here.
In the meantime, I’d like to invite other people’s input on the idea:
- Is it worth it?
- Would you be willing to use a tool and publish your articles in HTML?
- Would you be interested in helping to produce this?
- Contributing to the parser?
- Getting involved in the HTML/CSS/JS?
Dear Science — Let’s stop using PDF
It’s 2011, it’s the future. The Earth is doomed. I’m making a Space Ark. For Space. There’s no room for printed material on my Space Ark. “A4” is just an abstract concept for when we used dead trees to store our information. For when we collated facts like so many dead butterflies and bound them in books to sit on shelves and gather dust.
It’s 2011 and we’re still using PDF to publish academic papers. Dear God Why. While the web is evolving, constantly finding new ways to present text, information and graphics, scientific papers are stuck in a world unchanged since the 17th century.
Viewing Documents
Ignoring the added functionality that writing for the web provides, the viewing experience alone is reason enough to bury PDF.
As newspapers began their first forays into publishing on touchscreen devices, they were ripped to shreds by most interface designers for simply providing their printed content as-is on the scrollable screen. Users had to scroll back up when changing columns or changing between stories, and it was impossible to link stories to friends, reference parts of the newspaper, or quickly navigate to sections of the newspaper.
All these issues are present when viewing papers on PCs or mobile devices, yet it seems to be accepted as the norm. Tools have been developed for searching, mining, linking and generally enhancing academic papers, but the method of viewing papers in digital format hasn’t changed in the last 10 years.
Linking Documents
This is not hard. Linking documents is useful. Imagine the web without links. Better still, look at what you have to do to look up a referenced paper within a digital format academic paper.
- See reference in text, e.g. (Smith et al. 2009)
- Scroll/page to end of document
- Work out if references are in last name order, or something else
- Visually find the last name of the author (or ctrl+f Smith)
- Guess which “Smith 2009” is the right one if there’s more than one
- Highlight title of paper with cursor
- Copy to clipboard
- Alt-tab to browser
- Paste into search box
- From results, find one that looks relevant
- If PDF not available on search page, click through, find PDF on next page… etc.
Compared with the experience on the web:
- Hover over - see full paper details in popup using simple Javascript
- ???
- Click text to open referenced PDF
It would even be possible to link to specific sections or paragraphs of the papers to which you are referring. Or even specific parts of results tables. So your reader doesn’t have to search the entire paper to find which result you mean. Wouldn’t that be revolutionary.
What to do?
In short, publish papers in HTML, using Javascript and CSS. This isn’t a revolutionary idea.
The ideas I’ve put below are just off the top of my head. The web is overflowing with competing alternative ways to display information, be it text or graphical. And that’s the point, we should be embracing new ideas. As printed media is dying, we should move away from the absurd situation we’re in, where our primary medium is a printed format that has been repurposed for screen use.
Technical Considerations
Much as it pains me, printed media is still here to stay for a while, so an ideal solution would be able to produce both LaTeX and HTML. Pandoc (there are probably other equivalents) does the majority of this already, creating LaTeX and HTML from a single Markdown input.
To make HTML-based publishing a viable alternative to the existing PDF format, it must provide more features than PDF. And for it to be accepted, it has to be as easy as the current method if not easier.
Text
LaTeX and HTML are already roughly equivalent, with different size headings, bold formatting. Personally I prefer Markdown for formatting that’s easier to read and visually parse than HTML or LaTeX.
Formatting the raw text could be done with a combination of CSS and Javascript modification. There are JS libraries that can resize/reformat text. Journals could still have their own stylesheets that they provide or host for authors, in the same way they often provide LaTeX templates.
We no longer consume information purely through A4 (or letter if you’re American) paper. PDFs are completely oblivious to this, making you scroll, zoom, strain your eyes. “Responsive” is the name given to modern websites that offer reformatted content depending on the size of the viewing screen. There are hundreds of examples.
I want to clarify that I am not talking about publishing academic content on a website (thanks @gneubig). The idea would be to provide existing paper content, and only that content, in a web-standard, flexible-viewport format.
Equations
This is easy. MathJax is superb.
Tabular Data
Again, this is relatively easy. HTML already has good support for tabular data, and there are many JS libraries that can be used to add automatic formatting or allow reordering of the data.
Charts
There are a million Javascript libraries for this, but the problem is not many of them are that great. Google charts is decent enough, but my main worry for this is learning the interface may be a barrier to users.
In my mind, good chart support is probably the hardest part of this idea. In particular, making the interface simple to use, but also easy to enhance with useful interactions.
Making charts interactive:
- Could allow for selecting different groups of data to visually compare on the same graph
- No more information available by hovering
Interactive Figures, Runnable Code
You are no longer tied to static images, free your mind. It’s totally possible to make more complex interactive figures in JavaScript to illustrate your point. Not just graphs but scripts that require user input, and show them how your research really works in practice.
For example, a JS implementation of your algorithm that lets viewers change values in text boxes, seeing the effect on the output and a corresponding chart.
Videos
Yes! You can add videos to webpages. No more being restricted to static content. Outside of computer science I think this would be much more useful, but it is good to have the ability to add other media.
Possible Problems
There are some problems that I want to think about more:
- By introducing external sources, it may be harder to archive pages so they are still usable in 10 years’ time.
- Similarly, linking to many resources makes working offline more difficult. Ubiquitous internet access is not quite here yet.
Conclusion
After writing all this, I’m going to have to put my money where my mouth is and write some sort of template.
- Markdown
- Responsive layout
- Interactive charts
- Intelligent bibliography
MongoImport Bug?
Update: Quick hack solution - add a line with an empty JSON object at the start of the file. { }
I’ve come across some odd behaviour when using mongoimport. I’m not sure if it’s an issue with my setup, or a bug in Mongo, but here’s what I’ve found.
- mongoimport version 2.0.0
- mongod db version v2.0.0, pdfile version 4.5
The data I’m trying to import is a small subset of NTCIR data. I hope it’s OK to post here in case there’s an issue in the data itself that’s causing the bug.
The first line never gets imported. I’ve tried varying the order, inserting a blank line in the header and it’s always the first line.
Even more confusingly, mongoimport claims it’s working correctly, saying it imported 3 objects:
Then checking the data in the collection returns only 2 results:
Does anyone have any ideas of what this could be?
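The quick hack from the update at the top of this post can be scripted. This is just a sketch: the function name and file paths are made up, and it assumes the input is a text file with one JSON object per line.

```python
def prepend_empty_object(src_path, dst_path):
    """Copy src_path to dst_path with an empty JSON object as the first line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # Sacrificial first line, so the real first record is no longer first.
        dst.write("{ }\n")
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    import sys

    # Usage: python prepend.py data.json data_fixed.json
    prepend_empty_object(sys.argv[1], sys.argv[2])
```

Then point mongoimport at the new file instead of the original.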
Papers vs Mendeley
I’ve been using Papers for a few years, but I decided to try Mendeley again as Papers seems so slow in releasing new features. Here are some points I considered and which I thought was better.
- Search - Neither - Papers has a huge number of search engines, none of which return results that I can actually download. I almost always end up going to Google Scholar, and choosing the links with PDFs. Papers seems to miss these and go for ones behind paywalls. Mendeley desktop doesn’t have a search feature. Their website does, but it seems broken at the moment.
- Interface - Papers - More flexible, Mendeley is Java and has some issues being non-native code.
- Speed - Papers - Mendeley is written in Java, and seems to use their overly slow website for some stuff.
- Price - Mendeley - Free. You even have to pay separately for Papers on the desktop and iOS.
- Backup - Both - I back up Papers using Dropbox, Mendeley seems to offer an equivalent service.
- Organisation - Papers - Has automatic folders, more organisation/tagging options.
- Metadata - Mendeley - Papers is a pain for managing metadata, Mendeley seems to be better at guessing/importing details.
- PDF Annotation - Mendeley - Papers desktop still doesn’t have the ability to add highlights/annotation to PDFs. (Update: See Mek’s comment below, Papers 2.1 due out early December should have this feature)
- Social Stuff - Mendeley - Livfe on Papers is useless, Mendeley’s alternative seems usable. There’s even a Machine Learning for NLP group.
Conclusion
I’m sticking with Papers. But if you haven’t chosen yet I’d give Mendeley a try. The annotation, search and social features seem better. The features are in the roadmap for Papers 2.1 but who knows when that will be out.
Notes on Studying at Kyoto University
I will be attending the Study Japan! Fair at Bonn University on the 7th of December, and talking to students about life at Kyoto University.
In order to prepare, I’ve written notes on things I might be asked by students. I’m publishing them in case they are useful to people.
Bear in mind this information is correct only as far as I know, please double-check by reading the relevant materials yourselves.
I will continue updating this as I find more information and prepare for the event.
If you have any questions, feel free to email me at benhumphreys@gmail.com
Department
- Graduate School of Informatics at Kyoto has a large number of research areas
Labs
- List of groups/labs
- I am focussing on language processing
- There are three laboratories that deal directly with language processing:
- Kurohashi-Kawahara lab in general uses a more grammatical, tree-based approach to language analysis
- Kawahara lab focusses more on using mathematical models
- Okuno & Ogata lab deals with speech processing and robotics
My Research
- I am a first-year doctoral student and thus I have not completely decided on my research area
- I am currently researching evaluation of Japanese machine translation
- I am trying to create an algorithm that can give a score to a Japanese sentence produced by a machine translation system, by comparing it to a reference translation produced by a human
Lab Daily Life
- Each laboratory is different, but in general it’s like this:
- Weekly group meetings where some people in the group present their research progress
- There may be smaller group meetings which focus on a specific research area
- For example within Kurohashi-Kawahara lab, we have a general meeting with everyone, then split up into machine translation and information retrieval groups
- There are some meetings for introducing research papers - students take it in turns each week to present a paper that they think is interesting
- There are study groups that go through parts of textbooks and cover more difficult topics
- Some one-to-one meetings with your supervisors
Japanese Language
- The level of Japanese required varies between laboratories
- Most laboratory professors speak English well enough that communication with them will not be a problem
- Some laboratories
- Technically Japanese is not required to join laboratories, but Japanese is often used within the lab
- People may discuss some things in Japanese as after all it is their native language and easier for them
- Presentations are typically in Japanese, but use slides that are written in English
- There are usually a few non-Japanese researchers in each laboratory
Master’s
- 2 years
- Classes are required
PhD
- 3 years
- Not required to take classes
- Possible to take classes within your department
- Courses are the same as those offered to Master’s students
Admission Procedures
Life in Japan
General
- People speak a little English, they are very helpful and will try to help
- Paperwork is often only available in Japanese
- Kyoto University foreign students division will help you with any paperwork you may have to fill in (creating a bank account, phone contract etc)
Safety
- Other than the two mentioned below, Japan is incredibly safe, you can walk around at any time of the night in any area and feel safe
- Bear in mind that the news outside of Japan often reports only the most extreme parts of the disaster, and while it was a horrific event, the majority of Japan is not directly affected
Nuclear
- Kyoto is 500km away from the Fukushima power plants
- In Kyoto the earthquake was less than magnitude 2, and most people did not feel it
- Kyoto is unaffected by the radiation directly
- Even Tokyo, which is much closer to the reactor is not very affected
- The major risk while living in Kyoto is from food which could have come from the affected area
- Most food is labelled with its origin, so it’s possible to avoid some things that are from Fukushima
- However not all food is labelled, and when eating out it is not clear where the food is coming from
- I think the risk is relatively low
Earthquakes
- Earthquakes do happen in Japan, it’s a part of life
- The vast majority of the time they are so small you barely notice
- When you do notice, they are not strong enough to do any real damage
- Even the ones that are scary, and may knock things off shelves are not strong enough to hurt you
- It’s simple to take precautions to avoid getting hurt:
- Make sure you don’t put heavy objects on high shelves, brace bookshelves against the ceiling so they do not fall over
- Have emergency rations (water, tinned food) in your apartment/dorm
- You get used to it :)
Life in Kyoto
- City is relatively small when compared to Tokyo or Osaka
- Can cycle from university, dorms to the ‘city centre’ within 20 minutes, or take the city bus
- However it is still a city, large enough for everything you might need
- Osaka is less than an hour away by train, costing ~800 yen return
- Tokyo is 2.5 hours by bullet train (17,000 yen one-way), or ~6000 yen for an overnight ~7 hour bus
Life in Dormitories
- Dormitory details
- All students coming from abroad are able to apply to live in dormitories
- I’d recommend that you live in dormitories at least for the first year
- It’s a lot cheaper - 10,000 yen a month compared to 50,000+ yen a month for an apartment
- You’ll make friends more easily
- It’s easier to apply for dormitories while living outside Japan
- Students have their own rooms, people have little events, the dorms are fairly clean
- There are some dorms that are relatively far away from university - a 40 minute train ride
- Other dormitories (e.g. Shugakuin) are close enough to cycle to university
Kyoto University
The Campus
- Kyoto University is split into a few different campuses
- Yoshida is the ‘main’ campus, there are also campuses in Katsura and Uji
- Your location depends on your subject
- As far as I know, informatics is primarily in Yoshida
- There are free shuttle busses between campuses, taking between 20~40 minutes depending on traffic
Student Activities
- There are a variety of clubs and circles - clubs are more serious, circles more casual
- There are two festivals in the year where clubs and circles set up stalls and try to recruit new members
Study
Japanese Lessons
- Foreign students have to spend 6 months taking Japanese lessons
- Lessons are divided roughly into 3 levels, ranging from complete beginner to someone who knows all joyo kanji.
- You are placed based on an initial placement test (but it’s possible to change later)
- There are a wide variety of subjects available - Kanji, conversation, reading, grammar composition, research presentations, listening to news
- You are free to choose your subjects
- You’re required to take between 8 and 10 subjects
- Classes take up a couple of hours a day
- The classes have between 5 and 20 people in them (mostly around 10)
- Homework isn’t too strenuous - but you get out what you put in
- After the first 6 months, it’s possible to continue taking Japanese classes, as many as you like
General Classes
- See section on PhD and Masters in Research
Other
MEXT Scholarship
- All tuition fees are paid
- Monthly stipend of ~150,000 yen
- Application is very competitive
- Write a research plan
- Look for other documents online dealing specifically with recommendations on this