
How to parse big XML files in Python

Gari Araolaza Mar 05, 2008

Parsing big XML files in Python (some 60 MB is already big for me) was a bit painful until now. I used to import minidom and sometimes sax.

The problem with minidom is that it loads the whole XML file into memory. Unless you have a 16 GB machine, go get a coffee, because you won't be able to do anything else until the CPU finishes processing the file. If you try to do it with SAX, you have to detect every element start and end yourself. Quite crappy.
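To give an idea of the bookkeeping SAX asks for, here is a minimal sketch using the standard library's xml.sax module. The element names (coords, coord, lat, lon) and the CoordHandler class are made up for illustration, and the handler assumes a flat layout of root, then records, then fields.

```python
import xml.sax


class CoordHandler(xml.sax.ContentHandler):
    """Collect the children of each second-level element into a dict."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per record
        self.current = None  # dict for the record being read
        self.tag = None      # name of the field being read
        self.depth = 0       # nesting level, tracked by hand

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2:      # a new record starts
            self.current = {}
        elif self.depth == 3:    # a field inside the record starts
            self.tag = name
            self.current[name] = ""

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.tag is not None:
            self.current[self.tag] += content

    def endElement(self, name):
        if self.depth == 3:      # the field ends
            self.tag = None
        elif self.depth == 2:    # the record ends: store it
            self.rows.append(self.current)
            self.current = None
        self.depth -= 1


# Hypothetical sample document, only for illustration.
sample = b"<coords><coord><lat>43.1</lat><lon>-2.5</lon></coord></coords>"
handler = CoordHandler()
xml.sax.parseString(sample, handler)
print(handler.rows)
```

All of the state (depth, current record, current field) has to be tracked by hand, which is the painful part.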

Today I learned a better solution from Erral: use the lxml library. Here is an example that shows how to convert an XML file into a list of dicts:

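from lxml import etree
# Parse the file and take the root element; each child of the root is one record.
coords = etree.parse("/path/to/your/xml/file").getroot()
coords_list = []
for coord in coords:
    # Map each child's tag to its text content.
    this = {}
    for child in coord:
        this[child.tag] = child.text
    # Append once per record, after all its children have been read.
    coords_list.append(this)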

Quite straightforward, isn't it? It's already in Kelpi: XML to list of dict parsing.
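Note that etree.parse still builds the whole tree in memory. For files that are too big for that, one possible variation (not what the snippet above does) is iterparse, which hands over each element as soon as its closing tag is read. The sketch below is only an illustration: the iter_records helper and the sample element names (coords, coord, lat, lon) are hypothetical, and it falls back to the standard library's xml.etree.ElementTree, which offers a compatible iterparse, if lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:  # the standard library offers a compatible iterparse
    import xml.etree.ElementTree as etree


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element, one record at a time."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            # Drop the children we have already processed to save memory.
            elem.clear()


# Hypothetical sample data; a real call would pass a file path or file object.
sample = io.BytesIO(
    b"<coords>"
    b"<coord><lat>43.1</lat><lon>-2.5</lon></coord>"
    b"<coord><lat>43.3</lat><lon>-1.9</lon></coord>"
    b"</coords>"
)
coords_list = list(iter_records(sample, "coord"))
print(coords_list)
```

Because iter_records is a generator, you can also loop over it directly instead of building a list, so only one record needs to be held at a time. The emptied elements stay attached to the root, so memory use still grows slowly on very large files.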

iBro
May 25, 2011 11:04 AM
Hmm, I was expecting to encounter some custom bare Python code.

How about immediate parsing of a stream of unknown size, read in blocks? (not necessarily XML)
