docbook…what a piece of junk!

lukepieceofjunkIn the past week, I have focused on porting the Spring Data Evans release train to asciidoctor. It has been joyous and torturous. The joy is moving away from docbook and towards asciidoctor. The torture is dealing with all of docbook’s idiosyncrasies and the wonderment at how someone could build such a tool in the first place.

First of all, XML is a highly nested data structure. The nesting is used to convey semantics, but XML was originally designed as a machine-to-machine communication medium. It was indicendtially written using ASCII so that devs could debug the communications more easily along the way. Using it to create documentation systems is ridiculous.

docbook-whitespaceIf you look at this fragment of docbook, you can see all the wasted space based on standard XML formatting. Reading and maintaining this is an eyesore and the source of carpal tunnel.

asciidoctorLooking at asciidoctor is much more pleasing. There is no cognitive overhead of angled brackets. No extra indentation. And the results are easier to visualize. The semantics are so much clearer.

But that isn’t the only thing that has been a hassle. To do this migration, I wrote a SAX parser that properly pushes and pops as elements are started and finished. In essence, as you start a new element, a new state needs to be created an pushed onto a stack. When that element finishes, you pop that state off the stack, and properly append it to the previous state, i.e. the enclosing XML token. I had this working smoothly. Along the way, if you hit any characters between the XML entities, they need to be appended to the top of the stack’s “chunks”.

But docbook has so many awkward situations, like <section> <title>blah</title> <para> <section> <title>blah</title>…., that would lead to a level 0 header followed by a level 2 header. Where is level 1? Nowhere to be found and up to me to clean up.

Also docbook has SO MANY different tokens that really expresses the same thing. For every different element type, I created an AST node to represent it. But when it came to <code>blah</code>, <interfacename>blah</interfacename>, and <literal>blah</literal>, I created only one, Monospaced. These all basically get converted to the same thing: `blah`, i.e. some text wrapped in back ticks.

Other stuff is simply impossible. I did my best to handle tables, but too many quirky things can go wrong. So I did my best and then hand edited after the fact. Callouts are worse. Asciidoctor makes callouts awesome. But getting over the hump of docbook’s format is such that I don’t even try to convert. I just stuff the raw AST content in the output and warn “DO THIS PART BY HAND”.

Given all this, porting the whole documentation of Spring Data to asciidoctor required me to invest three days of effort purely at building this script. I actually rewrote the script after Day 2 because I had made a fundamental flaw. But I met my objective of migrating all of the Evans train by this past Monday.

But the pay off of writing the script properly has been great. I hadn’t written a SAX parser like this for ten years. It brought back keen memories. Thankfully, using Spring Boot CLI + Groovy made it a pleasant, scriptastic experience.


Leave a Reply

Your email address will not be published. Required fields are marked *