Mar 05, 2008

Until now, parsing big XML files in Python (60 MB or so is already big for me) was a bit painful. I used minidom and sometimes SAX.

The problem with minidom is that it loads the whole XML file into memory. Unless you have a 16 GB machine, go get a coffee, because you won't be able to do anything else until the CPU finishes processing the file. With SAX, you have to handle every element start and end event yourself. Quite crappy.
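
To make the SAX complaint concrete, here is a rough sketch of the bookkeeping a handler needs just to turn records into dicts. It uses the standard library's xml.sax; the element names (coords, coord, lat, lon) and the sample data are made up for illustration, since the real file layout isn't shown here.

```python
import xml.sax


class CoordHandler(xml.sax.ContentHandler):
    """Collects each <coord> record as a dict of child tag -> text.

    The element name "coord" is a hypothetical example, not a real schema.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # dict for the record being read, if any
        self._tag = None      # name of the child element being read, if any
        self._text = []       # text chunks seen inside that child

    def startElement(self, name, attrs):
        if name == "coord":
            self._current = {}
        elif self._current is not None:
            self._tag = name
            self._text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks
        if self._tag is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "coord":
            self.records.append(self._current)
            self._current = None
        elif name == self._tag:
            self._current[name] = "".join(self._text)
            self._tag = None


# Made-up sample data, only for illustration
SAMPLE = (
    b"<coords>"
    b"<coord><lat>43.1</lat><lon>-2.5</lon></coord>"
    b"<coord><lat>43.2</lat><lon>-2.6</lon></coord>"
    b"</coords>"
)

handler = CoordHandler()
xml.sax.parseString(SAMPLE, handler)
sax_records = handler.records
print(sax_records)
```

Three callbacks and three pieces of state for a two-level document: that is the "detecting every element start and end" work mentioned above.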

Today I learned a better solution from Erral: use the lxml library. Here is an example that shows how to convert an XML file into a list of dicts:

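from lxml import etree

# Parse the whole file and take the root element
coords = etree.parse("/path/to/your/xml/file").getroot()
coords_list = []
for coord in coords:
    # Build one dict per child of the root: tag name -> text content
    this = {}
    for child in coord:
        this[child.tag] = child.text
    # Append once per record, after all of its children are collected
    coords_list.append(this)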

Quite straightforward, isn't it?
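
If you want to try the idea without a file at hand, here is a small self-contained sketch that runs the same conversion on an in-memory string. The sample XML and the element names (coords, coord, lat, lon) are invented for illustration, and the helper name records_from_root is hypothetical. The fallback import to the standard library's xml.etree.ElementTree is only there so the snippet still runs where lxml is not installed; it is a stand-in, not lxml itself.

```python
try:
    from lxml import etree
except ImportError:
    # Stand-in with a compatible API for this small example
    import xml.etree.ElementTree as etree

# Made-up sample data, only for illustration
SAMPLE_XML = (
    "<coords>"
    "<coord><lat>43.1</lat><lon>-2.5</lon></coord>"
    "<coord><lat>43.2</lat><lon>-2.6</lon></coord>"
    "</coords>"
)


def records_from_root(root):
    """Return one dict per child of root, mapping grandchild tag -> text."""
    return [{child.tag: child.text for child in record} for record in root]


root = etree.fromstring(SAMPLE_XML)
coords_list = records_from_root(root)
print(coords_list)
# [{'lat': '43.1', 'lon': '-2.5'}, {'lat': '43.2', 'lon': '-2.6'}]
```

Note that every value comes back as a string, so numeric fields such as the hypothetical lat and lon would still need a float() conversion.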

iBro
Apr 15, 2009 05:29 PM
Hmm, I was expecting to encounter some custom bare Python code.

How about parsing a stream of unknown size on the fly, reading it in blocks? (not necessarily XML)
