Writing the fastest XML parser for JavaScript


Motivation

While building some reusable map layers for a side-project using Danish map data, I had to parse some large XML files containing polygon representations of various regions of Denmark and convert those to JavaScript objects. My first instinct is: "There is probably a library for that". So I find a random library to parse the XML, convert it all to polygons and show them on the map. But there is a noticeable hang when the layer is supposed to appear. I load up the Chrome Performance tab and notice that it's spending a couple hundred milliseconds parsing the XML.

If this were a work project, I would obviously just keep looking for a faster library. And there are actually some pretty fast ones out there! More on that later. My competitive side, however, thinks "I can do better than that". So I begin writing my own parser.

First Iteration

No Regular Expressions

My first intuition is to avoid using regular expressions. They may perform well when they're very simple, but once you do anything that causes backtracking, you're losing the linear-time performance. So for this parser, I decide to do simple string matching.
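
To make the difference concrete, here is a minimal sketch of what "simple string matching" means in practice; readTagName is a hypothetical helper for illustration, not code from the parser itself:

function readTagName(xml, i) {
  // i points at the first character after '<'
  const start = i;
  // scan until whitespace, '/' or '>' ends the name
  while (i < xml.length && xml[i] !== ' ' && xml[i] !== '/' && xml[i] !== '>') {
    i++;
  }
  return xml.slice(start, i); // e.g. "polygon" when called on '<polygon id="1">' with i = 1
}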

No Auxiliary Functions

Avoiding auxiliary functions makes the code harder to read, but it also saves the code from the overhead of creating new call stacks.

Instead, the parser is written as a single-pass for-loop. No backtracking, no recursion, no revisiting sections or finding the end of some tag and re-reading the attributes. Generally, you can count on iteration being as fast as, or faster than, even shallow recursion in JavaScript.
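
As a rough sketch of that shape (heavily simplified: no attributes, comments, CDATA or self-closing tags, and none of the later optimizations), a single forward pass with an explicit stack instead of recursion might look like this:

function parseSimple(xml) {
  const root = { name: null, children: [] };
  const stack = [root];                 // explicit stack instead of recursion
  for (let i = 0; i < xml.length; i++) {
    if (xml[i] === '<') {
      if (xml[i + 1] === '/') {
        while (xml[i] !== '>') i++;     // closing tag: skip to '>'...
        stack.pop();                    // ...and pop the current node
      } else {
        const start = ++i;              // opening tag: read the name
        while (xml[i] !== '>' && xml[i] !== ' ') i++;
        const node = { name: xml.slice(start, i), children: [] };
        while (xml[i] !== '>') i++;     // ignore attributes in this sketch
        stack[stack.length - 1].children.push(node);
        stack.push(node);
      }
    } else {
      const start = i;                  // text content runs until the next '<'
      while (i < xml.length && xml[i] !== '<') i++;
      const text = xml.slice(start, i).trim();
      if (text) stack[stack.length - 1].children.push(text);
      i--;                              // compensate for the loop's i++
    }
  }
  return root.children;
}

The important part is the shape: one forward pass, one explicit stack, no function call per node.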

Avoiding Garbage Collection

After writing the parser from these simple principles, a very obvious problem pops up. It does alright, matching the random library on speed, but it is spending half its time doing garbage collection!

So the parser is churning through memory somewhere, and a little investigation with the performance profiler shows that the temporary strings I build when parsing the text content of an XML node are the culprit. And the data set has some quite large text contents.

Instead of building a temporary string, copying it into the final object and leaving it in memory until the garbage collector comes around, I track two numeric pointers: the start index and the end index of the content. Once the start and end index of a node's text content are found, I slice it from the source string directly into the returned object. No more garbage collection churn, and the parser now beats the random library easily.
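
In code, the difference looks roughly like this; a sketch of the idea rather than the parser's actual text-handling:

// Before: accumulating a temporary string character by character,
// which leaves a trail of intermediate strings for the GC to sweep up.
function readTextSlow(xml, i, node) {
  let text = '';
  while (i < xml.length && xml[i] !== '<') {
    text += xml[i];
    i++;
  }
  node.text = text;
  return i;
}

// After: track the start index, find the end index, slice exactly once.
function readTextFast(xml, i, node) {
  const start = i;
  while (i < xml.length && xml[i] !== '<') i++;
  node.text = xml.slice(start, i);    // one allocation, straight into the result
  return i;                           // new parser position
}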

Second Iteration

When doing side-projects, it is easy to fall down rabbit-holes of spin-off projects, and that's what this turned out to be as well. Although the parser does what it needs to do for my map project, a new goal emerges. It's easy to out-perform a multi-purpose library that isn't necessarily meant for speed, but it's hard to be the fastest of them all.

Looking around for other libraries, the following turn up: fast-xml-parser, xml2js, xmlbuilder2 and a blog post about txml. The latter is the fastest library I've found so far. I start benchmarking using benchmark.js and some code and assets stolen directly from fast-xml-parser. I easily beat the established libraries, but txml wins by a significant margin. I study the blog post for insights, and two key things differ from my approach.

indexOf, startsWith instead of while-loops

When parsing the XML documents, the parser moves forward until it hits the symbol that terminates whatever it is currently reading:

while(xml[i] !== '>') {
  i++;
}

For me, this was the easiest way to reason about what was happening, and it also gave a more granular view of things when debugging. I assumed it was the fastest way to do it, too. But surprisingly, the built-in indexOf function performs much better:

i = xml.indexOf('>', i);

The two code snippets are functionally equivalent, but changing everything to use indexOf with a start index allows me to match the performance of txml. But there is more.
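
The same goes for startsWith: instead of slicing out a substring (or comparing character by character) just to check what construct comes next, a start index lets the comparison run in place. A small illustrative helper, not taken from the parser:

function readCData(xml, i) {
  // startsWith with a position argument checks in place, no substring is created
  if (!xml.startsWith('<![CDATA[', i)) return null;
  // indexOf with a start index jumps straight to the terminator,
  // replacing a character-by-character while-loop
  const end = xml.indexOf(']]>', i + 9);          // 9 === '<![CDATA['.length
  return { value: xml.slice(i + 9, end), next: end + 3 };
}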

Char code comparisons

When comparing the symbol at the parser's current position with some other symbol, I simply do an indexed lookup on the string:

if (xml[i] === '/') { ... }

The engine will read the character in the xml-string at index i, convert both characters in the comparison to char codes and perform a numeric comparison. An alternative way to do it is:

if (xml.charCodeAt(i) === 47) { ... }

Here, the left-hand side does much the same work: it reads the character in the xml string at index i and converts it to its char code before comparing it to the number 47 (the char code for "/"). By saving the char code conversion on the right-hand side, the performance gain is significant enough to push the parser past txml in the benchmarks.
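
In practice that means sprinkling the hot comparisons with numeric constants; something along these lines (the constant and function names are mine, not the parser's):

const LT = 60;     // '<'
const GT = 62;     // '>'
const SLASH = 47;  // '/'

function isClosingTag(xml, i) {
  // two pure number comparisons; no single-character strings are created
  return xml.charCodeAt(i) === LT && xml.charCodeAt(i + 1) === SLASH;
}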

Final benchmarks

The final benchmarks are done on an AMD Ryzen 9 9900X 4.4 GHz CPU, 2x32GB DDR5-6000 RAM and with Node.js as the runtime. As mentioned above, I use benchmark.js as the benchmark tool. The tables show how many times each library can parse a document per second:
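
For reference, a benchmark.js suite for this kind of comparison looks roughly like the sketch below; the file path and the two parse functions are placeholders to be wired up to the actual libraries:

const Benchmark = require('benchmark');
const fs = require('fs');

const xml = fs.readFileSync('./assets/sample.xml', 'utf8');   // illustrative path

// Placeholder parse functions; hook these up to the libraries under test.
const parseWithRapidXmlToJs = (s) => { /* ... */ };
const parseWithTxml = (s) => { /* ... */ };

new Benchmark.Suite()
  .add('rapid-xml-to-js', () => parseWithRapidXmlToJs(xml))
  .add('txml', () => parseWithTxml(xml))
  .on('cycle', (event) => console.log(String(event.target)))  // e.g. "txml x 158,576 ops/sec"
  .run();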

ptest.xml

Tiny XML file with attributes, CDATA and nesting

Library Requests/second
fast-xml-parser 76 833
fast-xml-parser - preserve order 81 940
xmlbuilder2 33 847
xml2js 36 126
txml 456 226
rapid-xml-to-js 593 815

sample.xml

Small sample XML file at 1.5KB

Library Requests/second
fast-xml-parser 20 676
fast-xml-parser - preserve order 22 492
xmlbuilder2 15 123
xml2js 16 361
txml 158 576
rapid-xml-to-js 201 213

large.xml

Large 98MB XML file

Library Requests/second
fast-xml-parser 0.2789
fast-xml-parser - preserve order 0.3115
xmlbuilder2 0.1648
xml2js 0.2233
txml 1.3508
rapid-xml-to-js 1.8392

You may find the source code here, where you can also run the benchmarks yourself.