Parsing XML in Python with etree.ElementTree (2024)

The xml module in the standard library provides tools for working with XML documents.

The ElementTree module in the xml.etree package offers an intuitive way of parsing and representing XML data.

ElementTree objects represent XML data as a tree structure in which the hierarchy follows the nesting of the XML elements.

Basic parsing example

Suppose we have an XML file called articles.xml with the following content:

<?xml version = '1.0' encoding = 'UTF-8'?>
<articlelist>
  <article>
    <author country = 'India'>John Doe</author>
    <datepublished>2024/04/05</datepublished>
    <title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
    <content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium optio, eaque rerum! Provident similique accusantium nemo autem. </content>
  </article>
  <article>
    <author country = 'Finland'>Mary Smith</author>
    <datepublished>2024/04/07</datepublished>
    <title>Perspiciatis minima nesciunt dolorem</title>
    <content>Perspiciatis minima nesciunt dolorem! Officiis iure rerum voluptates a cumque velit quibusdam sed amet tempora. Sit laborum ab, eius fugit doloribus tenetur fugiat, temporibus enim commodi iusto libero magni deleniti quod quam consequuntur! Commodi minima excepturi repudiandae velit hic maxime doloremque.</content>
  </article>
</articlelist>

We can parse the document by passing the opened file object to the ElementTree.parse() function, as shown below:

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

print(tree)

<xml.etree.ElementTree.ElementTree object at 0x000001B03DB770E0>

As shown in the above example, the ElementTree.parse() helper function creates an ElementTree instance from the given file object.

The ElementTree object represents the structure of the XML document as a tree, where each node in the tree represents the corresponding element in the document.
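
The top-level element of a parsed tree is available through getroot(). The sketch below is only an illustration: it reads a shortened, made-up sample from an in-memory io.StringIO object (an assumption, so the snippet runs without articles.xml on disk).

```python
import io
from xml.etree import ElementTree

# Shortened stand-in for articles.xml (assumed sample data).
sample = io.StringIO(
    "<articlelist>"
    "<article><author country='India'>John Doe</author></article>"
    "<article><author country='Finland'>Mary Smith</author></article>"
    "</articlelist>"
)

tree = ElementTree.parse(sample)   # parse() accepts a filename or a file object
root = tree.getroot()              # the top-level Element of the document

print(root.tag)    # articlelist
print(len(root))   # 2, the number of direct children
```

Calling len() on an element counts only its direct children, which here are the two article elements.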

Traversing an ElementTree

The tree.iter() method returns an iterator that yields the nodes of the parsed tree in document order (depth-first), starting from the root. By default it yields every node in the tree.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

for node in tree.iter():
    print(node.tag)

articlelist
article
author
datepublished
title
content
article
author
datepublished
title
content

You can pass a tag as an argument to the tree.iter() method so that it will only iterate over the elements with that tag.

tree.iter(tag = None)

For example, to get only the nodes with the author tag, we can pass 'author' as the tag argument, as shown below:

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

for node in tree.iter('author'):
    print(node.text)

John Doe
Mary Smith
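
Individual elements also have an iter() method, which walks only the subtree below that element and yields the element itself first. A minimal sketch, assuming a shortened sample document built from a string:

```python
from xml.etree import ElementTree

# Shortened sample (assumed data), parsed straight from a string.
root = ElementTree.XML(
    "<articlelist>"
    "<article><author>John Doe</author><title>First</title></article>"
    "</articlelist>"
)

# Iterating from the root visits every element, parents before children.
print([node.tag for node in root.iter()])
# ['articlelist', 'article', 'author', 'title']

# Iterating from a child element is limited to that element's subtree.
first_article = root[0]
print([node.tag for node in first_article.iter()])
# ['article', 'author', 'title']
```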

Search for Nodes

Parsed trees provide some useful methods for searching for nodes with certain characteristics. They let you find nodes with a given tag, or nodes that appear at a certain depth of the parse tree.

The two basic methods for searching are find() and findall().

Find single node - tree.find()

The tree.find() method returns the first node that matches the search string. It returns None if there is no matching node.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

n = tree.find('.//author')
print(n.text)

John Doe
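
Because find() returns None when nothing matches, it is worth checking the result before reading attributes such as text. A minimal sketch, assuming a shortened sample and a hypothetical editor tag that does not occur in it:

```python
from xml.etree import ElementTree

root = ElementTree.XML(
    "<articlelist><article><author>John Doe</author></article></articlelist>"
)

author = root.find(".//author")
editor = root.find(".//editor")   # hypothetical tag, absent from the sample

# Compare with None explicitly before touching .text.
print(author.text if author is not None else "no author")   # John Doe
print(editor.text if editor is not None else "no editor")   # no editor
```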

Note how the search string is formatted. The leading . refers to the element the search starts from (here, the root), a single / steps down one level to the direct children, and // matches descendants at any depth. So in the above case, './/author' finds the first author element anywhere below the root, whatever level it sits at. An article element is a direct child of the root, so to look for one we can use a single slash, i.e. './article'.

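from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

# The first article element directly below the root
article = tree.find('./article')

# Iterating over an element yields its direct children
for i in article:
    print(i.tag, ':', i.text)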

author : John Doe
datepublished : 2024/04/05
title : Lorem ipsum dolor sit amet consectetur adipisicing elit
content : Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia,
molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum
numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium
optio, eaque rerum! Provident similique accusantium nemo autem.
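
The difference between the single-slash and double-slash forms is easiest to see side by side. A small sketch, assuming a shortened sample document:

```python
from xml.etree import ElementTree

root = ElementTree.XML(
    "<articlelist>"
    "<article><author>John Doe</author></article>"
    "<article><author>Mary Smith</author></article>"
    "</articlelist>"
)

direct = root.find("./author")           # direct children of the root only
nested = root.find("./article/author")   # explicit path, one level per step
anywhere = root.find(".//author")        # any depth below the root

print(direct)          # None: author is not a direct child of articlelist
print(nested.text)     # John Doe
print(anywhere.text)   # John Doe
```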

If you only want the text value, you can use the findtext() method instead of find().

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

print(tree.findtext('./article/title'))

Lorem ipsum dolor sit amet consectetur adipisicing elit

The './article/title' search string looks for a title element nested directly inside an article element. This is especially useful if the XML document contains elements with the same tag in different places.

Find all matching elements - tree.findall()

The tree.findall() method returns a list of all matching nodes for the given search string.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

nodes = tree.findall('.//author')
print(nodes)

for n in nodes:
    print(n.text)

[<Element 'author' at 0x0000011425175FD0>, <Element 'author' at 0x0000011425176160>]
John Doe
Mary Smith
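
Since findall() returns a plain list, it combines well with list comprehensions. The search string can also carry an attribute predicate of the form [@name='value']. A minimal sketch, assuming a shortened sample and a hypothetical editor tag that does not occur in it:

```python
from xml.etree import ElementTree

root = ElementTree.XML(
    "<articlelist>"
    "<article><author country='India'>John Doe</author></article>"
    "<article><author country='Finland'>Mary Smith</author></article>"
    "</articlelist>"
)

# Collect the text of every author element, at any depth.
names = [node.text for node in root.findall(".//author")]
print(names)   # ['John Doe', 'Mary Smith']

# An attribute predicate narrows the match to authors from one country.
finnish = [node.text for node in root.findall(".//author[@country='Finland']")]
print(finnish)   # ['Mary Smith']

# No match gives an empty list rather than None.
missing = root.findall(".//editor")
print(missing)   # []
```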

A deeper look at nodes

The Element objects returned by methods like tree.iter(), tree.find(), etc. each represent a single node in the XML parse tree.

Element objects have some useful attributes and methods for accessing and manipulating information about the represented XML element. We have already used some of the attributes, such as text and tag.

The attrib dictionary of an Element object stores the attributes of the represented xml element.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

# First author element at any depth
n = tree.find('.//author')

print(n.attrib)
print(n.text)
print(n.attrib.get('country'))

{'country': 'India'}
John Doe
India
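
Elements also offer a get() method that reads a single attribute and accepts a default for when the attribute is missing. A small sketch, assuming a made-up sample in which the second author has no country attribute:

```python
from xml.etree import ElementTree

# Assumed sample: the second author deliberately lacks a country attribute.
root = ElementTree.XML(
    "<articlelist>"
    "<author country='India'>John Doe</author>"
    "<author>Mary Smith</author>"
    "</articlelist>"
)

countries = {}
for author in root.iter("author"):
    # get() returns the default instead of raising when the attribute is absent.
    countries[author.text] = author.get("country", "unknown")

print(countries)   # {'John Doe': 'India', 'Mary Smith': 'unknown'}
```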

You can use the tail attribute to get the text that comes after the closing tag of a given node.
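
The following sketch shows how text and tail divide up mixed content. The p and b tags are an assumed example and are not part of articles.xml:

```python
from xml.etree import ElementTree

# Mixed content: text before, inside and after a child element.
para = ElementTree.XML("<p>Written by <b>John Doe</b> in 2024.</p>")
bold = para.find("b")

print(repr(para.text))   # 'Written by '
print(repr(bold.text))   # 'John Doe'
print(repr(bold.tail))   # ' in 2024.'
```

The text that follows the closing b tag is stored on the b element itself, not on its parent.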

Parsing Strings

If the XML data is in the form of a string, we can parse it using the XML() function, which takes the XML string as an argument, parses it and returns an Element object representing it.

from xml.etree import ElementTree

xml_data = '''
<article>
  <author country = 'India'>John Doe</author>
  <datepublished>2024/04/05</datepublished>
  <title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
  <content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium optio, eaque rerum! Provident similique accusantium nemo autem. </content>
</article>'''

# XML() returns the root Element, so the paths are relative to <article>
article = ElementTree.XML(xml_data)

print(article.findtext('./author'))
print(article.findtext('./datepublished'))
print(article.findtext('./title'))

John Doe
2024/04/05
Lorem ipsum dolor sit amet consectetur adipisicing elit

Note that unlike parse() which returns an ElementTree instance, the return value of XML() is an Element object.
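
The difference in return types can be checked directly. In this minimal sketch the short xml_data string is an assumed sample, and io.StringIO stands in for a file on disk:

```python
import io
from xml.etree import ElementTree

xml_data = "<article><title>Lorem ipsum</title></article>"

element = ElementTree.XML(xml_data)               # returns an Element
tree = ElementTree.parse(io.StringIO(xml_data))   # returns an ElementTree

print(isinstance(element, ElementTree.Element))    # True
print(isinstance(tree, ElementTree.ElementTree))   # True
print(tree.getroot().tag)                          # article

# fromstring() is another way to parse a string into an Element.
same = ElementTree.fromstring(xml_data)
print(same.findtext("title"))                      # Lorem ipsum
```

An ElementTree wraps the whole document, and its getroot() method returns the kind of Element that XML() gives you directly.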

Author: Horacio Brakus JD
