Here is an interesting article by David Mertz on XML processing in Python. David looks at the lightweight parsers that turn XML into Python data structures, and at their advantages in speed and flexibility over the standard DOM.
It compares and benchmarks several XML modules, including:
- Fredrik Lundh's ElementTree
- David Mertz's gnosis.xml.objectify
- Python's default xml.dom.minidom, which ships with the standard library (PyXML also bundles a version of it)
- 4Suite's cDomlette
When I worked on transformation and template substitution for OpenOffice.org documents, I initially used the 4DOM implementation in PyXML. It is a full DOM Level 2 compliant implementation for Python, and it comes with standard tools like XPath. Using XPath to select elements in a DOM document is pretty cool, and there are a lot of DOM-related documents and references on the net to read. However, pretty soon I had to face the deficiency of the standard DOM implementation in Python. It is bloated. Too bloated.
It is slow to initialise. Slow to parse a document. And it takes so much memory that it sometimes has problems parsing large OpenOffice.org documents. And while XPath is "okay" as a query language for selecting elements, there are just not enough functions for manipulating DOM objects, which I did a lot in my project. For example, moving a sub-tree from one DOM document to another requires lots of cloning, and it is bloody expensive when you need to move 50 pages across. My project also involved creating a template language inside OOo, and simulating loops by copying/cloning element trees again and again proved to be very expensive in PyXML.
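To make the cloning cost concrete, here is a minimal sketch of moving a sub-tree between two DOM documents. It uses the standard library's xml.dom.minidom rather than 4DOM (an assumption on my part, since minidom is what most readers will have at hand), and the tiny documents are made up for illustration. The DOM way is to call importNode, which builds a deep copy owned by the target document.

```python
from xml.dom import minidom

# Two made-up documents standing in for a source and a target file.
src = minidom.parseString(
    "<doc><section><p>page 1</p><p>page 2</p></section></doc>"
)
dst = minidom.parseString("<doc/>")

section = src.documentElement.firstChild

# importNode(node, deep) returns a deep copy that belongs to dst;
# every element and text node under <section> is duplicated.
copy = dst.importNode(section, True)
dst.documentElement.appendChild(copy)

# The original is still attached to src, so a "move" needs a second step.
src.documentElement.removeChild(section)

result_xml = dst.documentElement.toxml()
```

With two paragraphs the copy is cheap, but the work grows with the size of the sub-tree, which is where a 50-page move starts to hurt.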
In the end, I wrote my own object tree library using Python's expat parser, and bingo! Problem solved. It can now parse OOo documents much faster, and do lots of in-memory manipulation without cloning objects. Moreover, it is more "Python-like" in comparison to DOM.
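I can't show the real library, but the general idea can be sketched in a few lines. Everything below is hypothetical: the Node class and the parse function are names invented for this example, not the code I wrote for work. It only shows how expat callbacks can build a tree of plain Python objects, and how a sub-tree then moves between trees without any cloning.

```python
import xml.parsers.expat


class Node:
    """Hypothetical plain-Python tree node (not the real library)."""

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs      # dict of attribute name -> value
        self.children = []      # child Node objects, in document order
        self.text = ""          # simplification: all character data, concatenated


def parse(xml_text):
    """Build a Node tree from an XML string using expat callbacks."""
    root = None
    stack = []  # currently open elements, innermost last

    def start(tag, attrs):
        nonlocal root
        node = Node(tag, attrs)
        if stack:
            stack[-1].children.append(node)
        else:
            root = node
        stack.append(node)

    def end(tag):
        stack.pop()

    def chars(data):
        if stack:
            stack[-1].text += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return root


src = parse("<doc><section><p>page 1</p></section></doc>")
dst = parse("<doc/>")

# Moving a sub-tree is just list manipulation: no copy is made.
section = src.children[0]
src.children.remove(section)
dst.children.append(section)
```

Because nodes carry no owner-document pointer in this sketch, the same object can simply be re-parented, which is the kind of operation that was so costly with DOM.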
I guess those who wrote ElementTree and XML_Objectify must have faced the same situation as I did, and instead of putting up with the performance (or lack thereof) of DOM in PyXML, they wrote their own lightweight replacements that are much faster and more user-friendly. I might actually benchmark the set of modules I wrote against these libraries, and see how they compare.
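A rough sketch of what such a benchmark could look like, using only modules that ship with current Python: xml.dom.minidom and xml.etree.ElementTree (the standard-library version of ElementTree). The sample document, the bench helper and the repeat count are all assumptions made for illustration, and parse time alone says nothing about memory use or manipulation cost.

```python
import timeit
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Synthetic sample: 1000 small paragraphs. A real test should use real documents.
SAMPLE = "<doc>" + "<p a='1'>text</p>" * 1000 + "</doc>"


def bench(parse, number=20):
    """Return the total seconds taken to parse SAMPLE `number` times."""
    return timeit.timeit(lambda: parse(SAMPLE), number=number)


timings = {
    "minidom": bench(minidom.parseString),
    "ElementTree": bench(ET.fromstring),
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f}s")
```

The numbers will vary by machine and Python version, so treat them as a relative comparison only.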
Too bad the Python modules I wrote were for work, so they won't be open sourced...