Dear Science — Let’s stop using PDF — Part 2
I’ve thought more about how to implement what I put forward in the first part of Dear Science — Let’s stop using PDF, and I believe the problem can be broken down into two parts:
- Generating HTML — converting LaTeX to HTML
- Presentation — presenting text and figures in a resolution-independent way
Generating HTML
This is probably the harder of the two tasks.
Researchers are concerned with their research, and want to write and publish papers with the least amount of hassle. There are very few researchers who would go out of their way to publish their papers in HTML. This means that the tools for publishing their data must be ridiculously easy to use.
Requirements
- Generate perfect HTML, with no need to fiddle with broken tags afterwards.
- Generate straight from LaTeX & bib files.
- Ideally, the tool would work straight from PDF, but that seems impossible.
- Should be runnable online, to show users what output the tool can provide.
- Should also have a command-line tool that is runnable on pretty much anything — so no cutting-edge Ruby or Python requirements. Something Mac, Linux, Windows compatible.
For the LaTeX parsing itself, there is a separate set of requirements:
- Must convert LaTeX tabular elements to HTML <table> content.
- Must deal with internal references intelligently. How they are processed and output depends somewhat on the way they will be presented by the HTML (inline, with links, or something else).
- Preserve both inline and block mathematics to allow MathJax to parse them correctly.
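To make the tabular requirement concrete, here is a minimal sketch in Python, using only the standard library. The function name `tabular_to_html` and the `header` flag are my own inventions for illustration, and it assumes a very simple table: a flat column spec with no nested braces, no `\multicolumn`, and rows separated by `\\`. A real parser would need to handle far more than this.

```python
import html
import re


def tabular_to_html(latex, header=True):
    """Convert one simple LaTeX tabular environment to an HTML table.

    Hypothetical helper: assumes a flat column spec and no \\multicolumn.
    """
    match = re.search(
        r"\\begin\{tabular\}\{[^}]*\}(.*?)\\end\{tabular\}", latex, re.DOTALL
    )
    if match is None:
        raise ValueError("no tabular environment found")
    body = match.group(1)
    # Rule commands carry no content, so drop them.
    body = re.sub(r"\\(hline|toprule|midrule|bottomrule)\b", "", body)

    rows = []
    for raw_row in body.split(r"\\"):  # a double backslash ends a row
        if not raw_row.strip():
            continue
        # Split on '&' unless it is escaped as '\&'.
        cells = re.split(r"(?<!\\)&", raw_row)
        rows.append(
            [html.escape(cell.strip().replace(r"\&", "&")) for cell in cells]
        )

    lines = ["<table>"]
    for index, cells in enumerate(rows):
        # Assumption: the first row is a header row unless told otherwise.
        tag = "th" if header and index == 0 else "td"
        row_html = "".join("<{0}>{1}</{0}>".format(tag, cell) for cell in cells)
        lines.append("  <tr>" + row_html + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = r"""\begin{tabular}{lr}
\hline
Method & Error \\
\hline
A \& B & 0.1 \\
\hline
\end{tabular}"""
    print(tabular_to_html(sample))
```

Cell contents are escaped but otherwise left alone, so inline math such as `$x^2$` would pass through untouched for MathJax to pick up.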
Existing Tools
People have tried to tackle this problem before, often trying to support the whole of LaTeX markup and render different environments correctly.
I had hoped it would be possible to support the bare minimum of LaTeX, but given the number of ways in which LaTeX can be extended and the variety of TeX documents researchers produce, this seems difficult.
That being said, no existing tool does the job adequately. Here are some existing tools (via Dropbear):
- Hyperlatex - Emacs macros, unextendable, useless
- TTH - Too much focus on equations that MathJax will do for us.
- HeVeA - Written in Objective Caml. Citation stuff looks pretty good though. Probably the best of these.
- LaTeX2HTML - Died 2001
- LaTeXML - Quite well made. Especially the tabular stuff
Pandoc deserves a special mention. It aims to convert between LaTeX, HTML, Markdown and a host of other formats and does a fairly good job of doing so. However, its LaTeX parsing and HTML conversion are pretty poor:
- Dies on \begin tags for unknown reasons
- Sort of works with bib files, but not with bib entries that are at the end of the tex file. In this case it silently kills \cite tags.
I’d like to think it’s possible to extend Pandoc but I don’t know if it’s worth it. It’s written in Haskell.
Presentation
This is wholly different to the generation problem discussed above. It’s a typography and data presentation task.
Requirements
In a sentence, the main requirement is for the article to look great on any screen resolution — desktop, tablet, phone. The resolution flexibility is the hardest part.
- Lines of text cannot be too long (8~10 words). This is not an issue on tablet/phone, but on desktop there is the issue of whether to use multiple columns or have large amounts of whitespace.
- Tables and figures — Must look good at any resolution.
- References — I have thought about this and some JS-enhanced click-to-see-details functionality might be a good idea.
- Must be viewable with CSS and JavaScript disabled. This means no putting information in HTML5 data- attributes.
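One way to satisfy the last two requirements together is to emit each citation as an ordinary link to an entry in a reference list, so the details stay reachable with JavaScript off and a script can enhance the links later. The sketch below is only an illustration: `link_citations`, the `ref-` id prefix and the `[?]` marker for unknown keys are all names and conventions I made up, and it assumes the bibliography has already been parsed into a dictionary of key to formatted text.

```python
import html
import re


def link_citations(text, references):
    """Replace cite commands with numbered links and build a reference list.

    `references` maps a citation key to its already formatted entry text.
    Returns a (body, reference_list_html) pair.
    """
    order = []  # keys in order of first citation

    def replace(match):
        links = []
        for key in (part.strip() for part in match.group(1).split(",")):
            if key not in references:
                # Keep unknown keys visible instead of dropping them silently.
                links.append("[?]")
                continue
            if key not in order:
                order.append(key)
            number = order.index(key) + 1
            links.append(
                '<a href="#ref-{0}">[{1}]</a>'.format(html.escape(key), number)
            )
        return "".join(links)

    body = re.sub(r"\\cite\{([^}]*)\}", replace, text)
    items = [
        '  <li id="ref-{0}">{1}</li>'.format(
            html.escape(key), html.escape(references[key])
        )
        for key in order
    ]
    reference_list = "\n".join(["<ol>"] + items + ["</ol>"])
    return body, reference_list


if __name__ == "__main__":
    refs = {"knuth84": "Knuth, The TeXbook", "lamport94": "Lamport, LaTeX"}
    body, ref_list = link_citations(r"See \cite{knuth84, lamport94}.", refs)
    print(body)
    print(ref_list)
```

Because the reference text lives in the list itself rather than in an attribute, a click-to-see-details script would only need to read the linked list item.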
Technical Details
I have thought about this a little and I propose the final output should be a single HTML file that links to a single CSS file and a single JS file, both stored somewhere like GitHub.
This lets us make improvements without researchers having to update their files. It also allows users to download copies of the CSS/JS if they want to, or to change the appearance of their paper by simply linking to other files.
As stated in the requirements, it’s vital that the HTML file make sense and be easy to read without CSS or JS. This is for accessibility (screen readers) and web scraping.
I admit I haven’t really kept up with the features of HTML5, but it seems there are semantic tags that will give the content more meaning. For example, figures and their captions can be marked up with <figure> and <figcaption>, and the paper title and abstract make sense inside the new <header> tag.
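As a rough sketch of what the generator might emit for these elements, the Python below builds the markup as plain strings. The function names, the argument lists and the choice of <h1> and <p> inside the header are my own assumptions, not a settled design.

```python
import html


def render_header(title, abstract):
    """Wrap the paper title and abstract in a <header> element."""
    return (
        "<header>\n"
        "  <h1>{0}</h1>\n"
        "  <p>{1}</p>\n"
        "</header>"
    ).format(html.escape(title), html.escape(abstract))


def render_figure(figure_id, src, alt, caption):
    """Wrap an image and its caption in <figure> and <figcaption>."""
    return (
        '<figure id="{0}">\n'
        '  <img src="{1}" alt="{2}">\n'
        "  <figcaption>{3}</figcaption>\n"
        "</figure>"
    ).format(
        html.escape(figure_id),
        html.escape(src),
        html.escape(alt),
        html.escape(caption),
    )


if __name__ == "__main__":
    print(render_header("A Sample Paper", "We study something."))
    print(render_figure("fig-1", "fig1.png", "Line plot", "Error over time"))
```

The output reads sensibly with no stylesheet at all, which is the property the requirements ask for.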
Future
This project is huge. Writing a competent LaTeX parser alone could take a while. And HTML formatting that works with the variety of figure sizes is a problem.
For now I have a sample paper that I’ve converted into HTML and I’m playing with layout at various resolutions. When it’s more presentable I’ll publish it here.
In the meantime, I’d like to invite other people’s input on the idea:
- Is it worth it?
- Would you be willing to use a tool and publish your articles in HTML?
- Would you be interested in helping to produce this?
- Contributing to the parser?
- Getting involved in the HTML/CSS/JS?