XML Basics with Python

Introduction to XML

XML stands for Extensible Markup Language. It's a text-based markup language derived from Standard Generalized Markup Language (SGML). XML tags identify the data and are used to store and organize the data, rather than specifying how to display it. It is designed to be self-descriptive and simple, yet robust and powerful.

XML is a way to structure data for sharing between applications, making it an important part of web services, including SOAP, RSS, and RESTful APIs.

XML vs HTML

You might have heard of HTML (Hyper Text Markup Language), another markup language used primarily for creating web pages. While they share some similarities, they also have some crucial differences:

  • XML was designed to store and transport data, while HTML was designed to display data.

  • In XML, tags are not predefined; you must define your own tags, whereas in HTML, tags are predefined.

  • XML is case sensitive, whereas HTML is not.

XML Syntax

XML has a simple syntax and uses a set of rules for encoding documents in a format that is both human-readable and machine-readable.

Example:

Let's look at an XML document that contains information about a book:

<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>The Lord of the Rings</title>
  <author>J.R.R. Tolkien</author>
  <year>1954</year>
</book>

This XML document includes the following components:

  • Prolog: <?xml version="1.0" encoding="UTF-8"?> - This is the XML prolog, it defines the XML version and the character encoding used in the document.

  • Root Element: <book>...</book> - This is the root element of the XML document, and it contains all other elements.

  • Child Elements: <title>, <author>, and <year> - These are child elements of the <book> element.

XML Elements

XML elements form the building blocks of an XML document. They are defined by a start tag and an end tag with the content inserted in between.

Example:

In the XML document above, <title>The Lord of the Rings</title> is an element. Here, <title> is the start tag, The Lord of the Rings is the content, and </title> is the end tag.

XML Attributes

XML elements can have attributes, which provide additional information about the element.

Example:

xmlCopy code<book category="fantasy">
  <title lang="en">The Lord of the Rings</title>
  <author>J.R.R. Tolkien</author>
  <year>1954</year>
</book>

Here, the book element has an attribute category with a value of fantasy, and the title element has an attribute lang with a value of en.

Advanced XML Concepts

XML Document Structure

XML documents form a tree-like structure that starts at "the root" and branches to "the leaves". This structure is called the XML tree. Elements can nest other elements, forming parent-child relationships. For example:

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
  </book>
</catalog>

In this example, catalog is the root element, book is a child element of catalog, and author, title, genre, price, publish_date, and description are child elements of book.

XML Validation

It's crucial for an XML document to be properly formatted or "well-formed". Beyond being well-formed, an XML document can also be valid. A valid XML document is one that conforms to a specific Document Type Definition (DTD) or XML Schema Definition (XSD).

<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

In this example, we use a DTD to specify the structure of the note element. This ensures the note element always contains to, from, heading, and body elements in this specific order.

XML Namespaces

XML namespaces are used to avoid element name conflicts. When using prefixes in XML, a namespace for the prefix must be defined. The namespace can be defined by an xmlns attribute in the start tag of an element.

<root xmlns:h="https://www.w3.org/TR/html4/" xmlns:f="https://www.w3schools.com/furniture">
  <h:table>
    <h:tr>
      <h:td>Apples</h:td>
      <h:td>Bananas</h:td>
    </h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>

In this example, we define two XML namespaces, one for HTML and one for furniture elements. This ensures there is no name collision between the two table elements.

XPath and XQuery

XPath and XQuery are languages used to select nodes from an XML document. XPath stands for XML Path Language, while XQuery stands for XML Query Language. They both provide ways to navigate through elements and attributes in an XML document.

XPath

XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. Here are some useful XPath expressions:

  • / Selects from the root node

  • nodename Selects nodes in the document with the name "nodename"

  • // Selects nodes in the document from the current node that match the selection no matter where they are

  • . Selects the current node

  • .. Selects the parent of the current node

  • @ Selects attributes

Here is an example of using XPath in Python with the lxml library:

from lxml import etree

# parse XML
tree = etree.parse('books.xml')

# select all 'book' nodes
books = tree.xpath('//book')
for book in books:
    print(etree.tostring(book))

# select 'title' of first 'book'
title = tree.xpath('//book[1]/title/text()')[0]
print('Title: ', title)

XQuery

XQuery is used to extract data from XML documents. It can also be used for computations, such as arithmetic and string manipulations. It is a fully featured and powerful language, but it can be complex for beginners. Python's support for XQuery is limited, but you can use it in other contexts, such as in databases that support XML, like eXist-db.

Here is an example of an XQuery:

for $x in doc("books.xml")/catalog/book
where $x/price > 30
return $x/title

This query will return the titles of all books in the books.xml document where the price is greater than 30.

Working with XML in Python

Python XML Parsing

Python provides a number of modules for parsing XML, including the built-in xml.etree.ElementTree module (commonly shortened to ElementTree or ET), and the third-party lxml and xmltodict modules.

Here's how to parse an XML document using ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')
root = tree.getroot()

In this example, the parse function parses the XML file and returns an ElementTree object. We then use the getroot method to get the root Element object of the XML document.

Once you've parsed an XML document, you can navigate through its elements to read or modify its data.

for child in root:
    print(child.tag, child.attrib)

for elem in root.iter('description'):
    print(elem.text)

In the first example, we iterate over the child elements of the root element, printing each element's tag name and attribute dictionary. In the second example, we use the iter method to find all 'description' elements at any depth within the root element.

Modifying XML with ElementTree

ElementTree allows you to modify XML by adding, modifying, or deleting elements and attributes. After making changes, you can write the modified XML tree back to a file.

for price in root.iter('price'):
    price.text = '9.99'

tree.write('items.xml')

In this example, we iterate over all 'price' elements and change their text to '9.99'. We then write the modified XML tree back to the file 'items.xml'.

Parsing XML with xmltodict

xmltodict is a Python module that makes working with XML feel more like working with JSON (and therefore with Python dictionaries). It can parse an XML document into a dictionary, and serialize a dictionary into an XML document.

import xmltodict

with open('items.xml') as fd:
    doc = xmltodict.parse(fd.read())

print(doc['root']['book']['title'])

In this example, we open the XML file and read its contents, then pass the contents to xmltodict.parse, which returns a Python dictionary. We can then access XML elements as we would dictionary keys.

XPath with Python lxml

XPath is a language for navigating XML documents and selecting elements. It's more powerful and flexible than the basic navigation methods provided by ElementTree. lxml is a Python XML processing library that is compatible with ElementTree but also supports XPath and other XML technologies.

To use XPath with lxml, first parse your XML document using lxml.etree instead of xml.etree.ElementTree:

from lxml import etree

tree = etree.parse('items.xml')
root = tree.getroot()

Now you can use the xpath method to execute XPath expressions:

# select all 'book' elements
books = root.xpath('//book')

# select 'book' elements with a 'category' attribute of 'cooking'
cooking_books = root.xpath('//book[@category="cooking"]')

# select the 'title' element of the first 'book' element
title = root.xpath('//book[1]/title')[0]

In the first example, //book selects all 'book' elements in the document. In the second example, //book[@category="cooking"] selects 'book' elements that have a 'category' attribute with a value of 'cooking'. In the third example, //book[1]/title selects the 'title' element of the first 'book' element.

Generating XML with ElementTree and lxml

Both ElementTree and lxml allow you to generate new XML documents from scratch.

Here's how to create an XML document using ElementTree:

root = ET.Element('root')

book = ET.SubElement(root, 'book')
book.set('category', 'cooking')

title = ET.SubElement(book, 'title')
title.text = 'Everyday Italian'

ET.ElementTree(root).write('items.xml')

This example creates a root 'root' element, then creates a 'book' subelement with a 'category' attribute, and a 'title' subelement with a text value. Finally, it writes the XML document to a file.

Here's the equivalent code using lxml:

root = etree.Element('root')

book = etree.SubElement(root, 'book')
book.set('category', 'cooking')

title = etree.SubElement(book, 'title')
title.text = 'Everyday Italian'

tree = etree.ElementTree(root)
tree.write('items.xml', pretty_print=True)

This code does the same thing, but with the added benefit of the pretty_print option, which formats the XML with indentation for easier reading.

Working with XML Namespaces

XML namespaces are used to avoid name conflicts in XML documents. They are declared using the xmlns attribute in the start tag of an element. Here is an example:

<root xmlns:h="http://www.w3.org/TR/html4/" xmlns:f="http://www.w3schools.com/furniture">
  <h:table>
    <h:tr>
      <h:td>Apples</h:td>
      <h:td>Bananas</h:td>
    </h:tr>
  </h:table>
  <f:table>
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>

In this XML document, two namespaces are defined: 'http://www.w3.org/TR/html4/' and 'http://www.w3schools.com/furniture'. They are assigned the prefixes 'h' and 'f', respectively. These prefixes are used to qualify the names of elements in these namespaces.

Here is how to parse this XML document with lxml and access elements in these namespaces:

from lxml import etree

# define namespace map
nsmap = {'h': 'http://www.w3.org/TR/html4/',
         'f': 'http://www.w3schools.com/furniture'}

# parse XML
tree = etree.parse('namespaces.xml')
root = tree.getroot()

# select 'table' elements in the 'http://www.w3.org/TR/html4/' namespace
html_tables = root.xpath('//h:table', namespaces=nsmap)

# select 'table' elements in the 'http://www.w3schools.com/furniture' namespace
furniture_tables = root.xpath('//f:table', namespaces=nsmap)

The xpath method's namespaces argument is used to provide a mapping from namespace prefixes to namespace URIs. This mapping is used to resolve the namespace prefixes in the XPath expression.

JSON Conversions

JSON (JavaScript Object Notation) is a lightweight data-interchange format that's easy for humans to read and write and easy for machines to parse and generate. Given its compatibility with many languages and its simpler structure, JSON has become the de facto standard for server-client data exchange, replacing XML in many instances.

However, there are still numerous occasions when data stored as XML needs to be processed in a JSON-friendly environment or vice versa. Let's see how we can achieve this in Python.

We'll use the xmltodict module to convert XML data to JSON. This module essentially makes working with XML feel like you are working with JSON, as it maps XML data to Python's dictionary structure, which can then easily be converted to JSON.

Here's an example:

import json
import xmltodict

def convert(xml_str):
    data_dict = xmltodict.parse(xml_str)
    json_data = json.dumps(data_dict)
    return json_data

xml = """
<bookstore>
<book category="COOKING">
  <title lang="en">Everyday Italian</title> 
  <author>Giada De Laurentiis</author> 
  <year>2005</year> 
  <price>30.00</price> 
</book>
<book category="CHILDREN">
  <title lang="en">Harry Potter</title> 
  <author>J K. Rowling</author> 
  <year>2005</year> 
  <price>29.99</price> 
</book>
</bookstore>
"""

print(convert(xml))

Just as we converted XML to JSON in the previous section, we may also need to convert JSON to XML. We'll continue to use the xmltodict module for this conversion, as it provides a method unparse() for the reverse conversion.

Let's take a look at how to convert JSON to XML:

import json
import xmltodict

def convert(json_str):
    data_dict = json.loads(json_str)
    xml_data = xmltodict.unparse(data_dict)
    return xml_data

json_data = """
{
    "bookstore": {
        "book": [
            {
                "@category": "COOKING",
                "title": {
                    "@lang": "en",
                    "#text": "Everyday Italian"
                },
                "author": "Giada De Laurentiis",
                "year": "2005",
                "price": "30.00"
            },
            {
                "@category": "CHILDREN",
                "title": {
                    "@lang": "en",
                    "#text": "Harry Potter"
                },
                "author": "J K. Rowling",
                "year": "2005",
                "price": "29.99"
            }
        ]
    }
}
"""

print(convert(json_data))

In this example, we're first converting the JSON string to a Python dictionary using json.loads(), then converting the dictionary to an XML string with xmltodict.unparse().

Last updated