Ben Humphreys

TeX/PDF → HTML

I haven’t managed to get the idea of publishing papers in HTML out of my head. I’m now convinced that 99% of the work is in a decent conversion to HTML. The presentation aspect is tricky, but it can be done with copious amounts of CSS and JavaScript.

Back to conversion. There seem to be two possible ways to tackle it, each with its own strengths and difficulties:

  • TeX → HTML, or
  • PDF → HTML

TeX → HTML

This seemed like the most logical choice. TeX is already a form of markup language: it has headings, emphasis, references, everything that would be needed in an HTML-based paper.

However, it’s not that simple. The trouble all stems from the fact that TeX can be reprogrammed: new commands can be created and macros written. As the saying goes, one often attributed to Donald Knuth himself: “Only TeX can parse TeX.”

Parsing TeX with TeX

If only TeX can parse TeX, it got me wondering about creating a TeX package that outputs HTML alongside the usual typeset output.

For example, \section could be overridden to also produce an <h1> tag. That way, if people have wrapped sections in their own macros, the override will still get called.

ConTeXt provides something like this with the command \setupbackend[export=yes,xhtml=yes]. It was designed for producing ePub books, which are zipped XHTML files. Consequently the output looks closer to XML than HTML, and is not really the clean HTML5 document I had in mind.

The major downside to this approach is the difficulty in using it. Users would have to have a particular install of LaTeX, change their headers, run LaTeX, then run another tool in order to convert the XHTML to HTML5.

I had been hoping for a standalone tool that worked on TeX files or PDFs and was simple to install, something like gem install my_converter followed by my_converter mypaper.tex. That brings us to the alternative to using TeX itself.

Parsing TeX with Ruby/Python/Perl

You could parse TeX with any programming language, but the problem is the same. It’s possible to parse LaTeX using a grammar-based parser generator like Treetop (indeed, someone has already written a basic parser), but it’s nearly impossible to handle any custom commands that people may have defined in their TeX documents.
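
To make the problem concrete, here is a minimal sketch in Python of the kind of simple converter I mean. It is only an illustration: the rule table and the function name naive_tex_to_html are made up for this example, and a real tool would need a proper parser rather than regular expressions.

```python
import html
import re

# Hypothetical, deliberately tiny rule set: each entry maps one standard
# LaTeX command (with a brace-free argument) to an HTML template.
RULES = [
    (re.compile(r"\\section\{([^{}]*)\}"), r"<h1>\1</h1>"),
    (re.compile(r"\\subsection\{([^{}]*)\}"), r"<h2>\1</h2>"),
    (re.compile(r"\\emph\{([^{}]*)\}"), r"<em>\1</em>"),
    (re.compile(r"\\textbf\{([^{}]*)\}"), r"<strong>\1</strong>"),
]


def naive_tex_to_html(tex):
    """Rewrite the commands listed in RULES; leave everything else alone."""
    # Escape &, < and > first so that the tags added below stay intact.
    out = html.escape(tex, quote=False)
    for pattern, template in RULES:
        out = pattern.sub(template, out)
    return out
```

Given \section{Intro} Some \emph{text}, this returns <h1>Intro</h1> Some <em>text</em>. Given a document that defines \newcommand{\mysec}[1]{\section{#1}} and then calls \mysec{Intro}, the call comes out untouched and the body of the definition gets mangled instead, which is exactly the custom-command problem.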

The crucial question is how much do people use custom commands? If they’re only used in extremely long or weird TeX documents, then a tool that does not support them would still be worth having.

Having gone nuts over evaluation methods for machine translation, and being a TDD believer, my first reaction is to create a testing/evaluation method to see how well a simple parser would cover, say, 100 random documents from random fields on arXiv.org. That in itself is another week or two of solid work.
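
A cheaper first measurement might be to count how often documents define and then actually use their own macros. The following is a rough sketch in Python under that assumption. The function name custom_command_usage is hypothetical, and the regular expression only catches the common definition forms (\newcommand, \renewcommand, \providecommand, \def and its variants), so it would undercount in documents that use anything more exotic.

```python
import re
from collections import Counter

# Matches the common ways of defining a macro and captures the new name.
DEFINITION = re.compile(
    r"\\(?:newcommand|renewcommand|providecommand)\*?\s*\{?\s*(\\[A-Za-z@]+)"
    r"|\\(?:def|gdef|edef|xdef)\s*(\\[A-Za-z@]+)"
)


def custom_command_usage(tex):
    """Return {macro name: number of uses outside its own definitions}."""
    defined = Counter(m.group(1) or m.group(2) for m in DEFINITION.finditer(tex))
    usage = {}
    for name, n_definitions in defined.items():
        # The lookahead stops \foo from also matching \foobar.
        occurrences = len(re.findall(re.escape(name) + r"(?![A-Za-z@])", tex))
        usage[name] = occurrences - n_definitions
    return usage
```

Running this over the source of each sampled paper would give a rough idea of how many documents a macro-blind parser could handle: a paper whose result is empty, or all zeros, does not depend on custom commands at all.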

Other Existing Tools

There are some existing tools that I mentioned in my previous post, but they all seem to be flawed in some fundamental way. The two most promising tools I’ve seen are Pandoc and tex4ht.

Pandoc can produce HTML and LaTeX from Markdown, which is an interesting idea that I might cover another time. But its TeX to HTML conversion is rather simplistic and messy. The HTML output is odd and it does not support very much TeX syntax.

tex4ht, on the other hand, is generally praised as the most comprehensive tool for the job. I haven’t got it working yet, so I can’t say. I’m hoping it doesn’t do what a lot of TeX to HTML converters do and try to reproduce the page format exactly.

PDF → HTML

Assuming that the objective is to make as many papers available in HTML as possible, the ideal solution would convert existing PDFs to HTML.

Search engines already scrape PDF documents for indexing, and can reproduce them in HTML by using absolute positioning and faking the right paper size.

However, the trick would be not to recreate the PDF formatting, but to parse the document and create a semantically marked-up HTML equivalent. That is, using heading tags, <table>s and so on to reproduce the content in a device-independent way.

Features

Assuming you can get from PDF to some kind of plain text or absolutely-positioned HTML, there are a bunch of features that you could use to guess what kind of information is in the text.

  • Order: title, authors, abstract, content
  • Content
    • Keywords like “abstract” start a section
    • Headings are likely to have heading numbers (2.1 etc.)
    • Capitalisation
    • Punctuation: headings are less likely to end in periods/full stops
  • Typography
    • Font size: work out the list of font sizes. The one with the most text is the body, and the others are the various heading sizes. Larger means heading
    • Font weight: bold indicates a heading, or emphasis when it appears within text
  • Sentence length: full sentences (8+ words) ending in full stops indicate body text. Anything else will be headings or tables
    • In really bad converters/documents, each line is a separate div, so it is hard to tell what is body text and what is not
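
Several of these features can be combined into a simple score. The sketch below is in Python and assumes some PDF extractor has already produced lines with a font size and a bold flag. The Line class, the weights and the threshold are all invented for illustration and would need tuning against real papers.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted line of text (hypothetical extractor output)."""
    text: str
    font_size: float
    bold: bool = False


# Heading numbers such as "2", "2.1" or "3.4.1." followed by some text.
NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")
SECTION_KEYWORDS = {"abstract", "introduction", "references"}


def body_font_size(lines):
    """Assume the font size that carries the most characters is body text."""
    chars_per_size = Counter()
    for line in lines:
        chars_per_size[line.font_size] += len(line.text)
    return chars_per_size.most_common(1)[0][0]


def looks_like_heading(line, body_size):
    """Score a line with the features above; the weights are guesses."""
    text = line.text.strip()
    if not text:
        return False
    score = 0
    if line.font_size > body_size:
        score += 2  # larger than body text
    if line.bold:
        score += 1
    if NUMBERED.match(text):
        score += 1  # starts with a heading number
    if text.lower() in SECTION_KEYWORDS:
        score += 2
    if text.endswith("."):
        score -= 1  # headings rarely end in a full stop
    if len(text.split()) >= 8:
        score -= 1  # long lines are more likely body text
    return score >= 2
```

With these weights, a bold 12pt line reading 2.1 Related work in a 10pt document counts as a heading, while a full 10pt sentence ending in a full stop does not.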

Difficulties

However, the most important parts of papers are also the most difficult to retrieve.

  • Tables: in TeX these are easy to convert, but in PDF they become a mess of numbers
  • Bullet-point lists: often the bullet-point character becomes mixed with the text
  • Headers/footers: need to be identified and ignored
  • Mathematics: often impossible to recover

Without supporting the above points, any conversion tool would be useless.
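
Of the four, headers and footers are probably the most tractable, because they repeat. As a sketch only, and assuming the text has already been split into pages and lines, something like the following Python could drop lines that recur at the top or bottom of most pages. The function names and the default thresholds are my own guesses, not taken from any existing tool.

```python
import re
from collections import Counter


def normalise(line):
    """Lower-case and mask digits so "Page 3" and "Page 4" compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_running_headers(pages, edge_lines=2, min_fraction=0.6):
    """Remove lines that repeat at the edges of most pages.

    pages is a list of pages, each page a list of text lines.
    """
    counts = Counter()
    for page in pages:
        edges = page[:edge_lines] + page[-edge_lines:]
        # A set, so a line is counted at most once per page.
        counts.update({normalise(l) for l in edges if l.strip()})

    threshold = max(2, min_fraction * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = []
        for i, line in enumerate(page):
            at_edge = i < edge_lines or i >= len(page) - edge_lines
            if at_edge and normalise(line) in repeated:
                continue  # running header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

Only lines near the page edges are candidates, so a sentence that happens to repeat in the middle of the body text is left alone.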

What next?

Despite all the difficulties mentioned above, I still think that a tool would be worth having, but the main problem for me is time. I need to focus on my research and cannot dedicate enough time to producing something of worth. I might play with Treetop this weekend and see if I can make something interesting.

