XML Basics with Python
Introduction to XML
XML stands for Extensible Markup Language. It's a text-based markup language derived from Standard Generalized Markup Language (SGML). XML tags identify the data and are used to store and organize the data, rather than specifying how to display it. It is designed to be self-descriptive and simple, yet robust and powerful.
XML is a way to structure data for sharing between applications, making it an important part of web services, including SOAP, RSS, and RESTful APIs.
XML vs HTML
You might have heard of HTML (Hyper Text Markup Language), another markup language used primarily for creating web pages. While they share some similarities, they also have some crucial differences:
XML was designed to store and transport data, while HTML was designed to display data.
In XML, tags are not predefined; you must define your own tags, whereas in HTML, tags are predefined.
XML is case sensitive, whereas HTML is not.
XML Syntax
XML has a simple syntax and uses a set of rules for encoding documents in a format that is both human-readable and machine-readable.
Example:
Let's look at an XML document that contains information about a book:
This XML document includes the following components:
Prolog:
<?xml version="1.0" encoding="UTF-8"?>
- This is the XML prolog, it defines the XML version and the character encoding used in the document.Root Element:
<book>...</book>
- This is the root element of the XML document, and it contains all other elements.Child Elements:
<title>
,<author>
, and<year>
- These are child elements of the<book>
element.
XML Elements
XML elements form the building blocks of an XML document. They are defined by a start tag and an end tag with the content inserted in between.
Example:
In the XML document above, <title>The Lord of the Rings</title>
is an element. Here, <title>
is the start tag, The Lord of the Rings
is the content, and </title>
is the end tag.
XML Attributes
XML elements can have attributes, which provide additional information about the element.
Example:
Here, the book
element has an attribute category
with a value of fantasy
, and the title
element has an attribute lang
with a value of en
.
Advanced XML Concepts
XML Document Structure
XML documents form a tree-like structure that starts at "the root" and branches to "the leaves". This structure is called the XML tree. Elements can nest other elements, forming parent-child relationships. For example:
In this example, catalog
is the root element, book
is a child element of catalog
, and author
, title
, genre
, price
, publish_date
, and description
are child elements of book
.
XML Validation
It's crucial for an XML document to be properly formatted or "well-formed". Beyond being well-formed, an XML document can also be valid. A valid XML document is one that conforms to a specific Document Type Definition (DTD) or XML Schema Definition (XSD).
In this example, we use a DTD to specify the structure of the note
element. This ensures the note
element always contains to
, from
, heading
, and body
elements in this specific order.
XML Namespaces
XML namespaces are used to avoid element name conflicts. When using prefixes in XML, a namespace for the prefix must be defined. The namespace can be defined by an xmlns attribute in the start tag of an element.
In this example, we define two XML namespaces, one for HTML and one for furniture elements. This ensures there is no name collision between the two table
elements.
XPath and XQuery
XPath and XQuery are languages used to select nodes from an XML document. XPath stands for XML Path Language, while XQuery stands for XML Query Language. They both provide ways to navigate through elements and attributes in an XML document.
XPath
XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. Here are some useful XPath expressions:
/
Selects from the root nodenodename
Selects nodes in the document with the name "nodename"//
Selects nodes in the document from the current node that match the selection no matter where they are.
Selects the current node..
Selects the parent of the current node@
Selects attributes
Here is an example of using XPath in Python with the lxml
library:
XQuery
XQuery is used to extract data from XML documents. It can also be used for computations, such as arithmetic and string manipulations. It is a fully featured and powerful language, but it can be complex for beginners. Python's support for XQuery is limited, but you can use it in other contexts, such as in databases that support XML, like eXist-db.
Here is an example of an XQuery:
This query will return the titles of all books in the books.xml
document where the price is greater than 30.
Working with XML in Python
Python XML Parsing
Python provides a number of modules for parsing XML, including the built-in xml.etree.ElementTree
module (commonly shortened to ElementTree
or ET
), and the third-party lxml
and xmltodict
modules.
Here's how to parse an XML document using ElementTree
:
In this example, the parse
function parses the XML file and returns an ElementTree
object. We then use the getroot
method to get the root Element
object of the XML document.
Navigating XML with ElementTree
Once you've parsed an XML document, you can navigate through its elements to read or modify its data.
In the first example, we iterate over the child elements of the root element, printing each element's tag name and attribute dictionary. In the second example, we use the iter
method to find all 'description' elements at any depth within the root element.
Modifying XML with ElementTree
ElementTree
allows you to modify XML by adding, modifying, or deleting elements and attributes. After making changes, you can write the modified XML tree back to a file.
In this example, we iterate over all 'price' elements and change their text to '9.99'. We then write the modified XML tree back to the file 'items.xml'.
Parsing XML with xmltodict
xmltodict
is a Python module that makes working with XML feel more like working with JSON (and therefore with Python dictionaries). It can parse an XML document into a dictionary, and serialize a dictionary into an XML document.
In this example, we open the XML file and read its contents, then pass the contents to xmltodict.parse
, which returns a Python dictionary. We can then access XML elements as we would dictionary keys.
XPath with Python lxml
XPath is a language for navigating XML documents and selecting elements. It's more powerful and flexible than the basic navigation methods provided by ElementTree. lxml
is a Python XML processing library that is compatible with ElementTree but also supports XPath and other XML technologies.
To use XPath with lxml
, first parse your XML document using lxml.etree
instead of xml.etree.ElementTree
:
Now you can use the xpath
method to execute XPath expressions:
In the first example, //book
selects all 'book' elements in the document. In the second example, //book[@category="cooking"]
selects 'book' elements that have a 'category' attribute with a value of 'cooking'. In the third example, //book[1]/title
selects the 'title' element of the first 'book' element.
Generating XML with ElementTree and lxml
Both ElementTree and lxml allow you to generate new XML documents from scratch.
Here's how to create an XML document using ElementTree:
This example creates a root 'root' element, then creates a 'book' subelement with a 'category' attribute, and a 'title' subelement with a text value. Finally, it writes the XML document to a file.
Here's the equivalent code using lxml:
This code does the same thing, but with the added benefit of the pretty_print
option, which formats the XML with indentation for easier reading.
Working with XML Namespaces
XML namespaces are used to avoid name conflicts in XML documents. They are declared using the xmlns
attribute in the start tag of an element. Here is an example:
In this XML document, two namespaces are defined: 'http://www.w3.org/TR/html4/' and 'http://www.w3schools.com/furniture'. They are assigned the prefixes 'h' and 'f', respectively. These prefixes are used to qualify the names of elements in these namespaces.
Here is how to parse this XML document with lxml and access elements in these namespaces:
The xpath
method's namespaces
argument is used to provide a mapping from namespace prefixes to namespace URIs. This mapping is used to resolve the namespace prefixes in the XPath expression.
JSON Conversions
JSON (JavaScript Object Notation) is a lightweight data-interchange format that's easy for humans to read and write and easy for machines to parse and generate. Given its compatibility with many languages and its simpler structure, JSON has become the de facto standard for server-client data exchange, replacing XML in many instances.
However, there are still numerous occasions when data stored as XML needs to be processed in a JSON-friendly environment or vice versa. Let's see how we can achieve this in Python.
We'll use the xmltodict
module to convert XML data to JSON. This module essentially makes working with XML feel like you are working with JSON, as it maps XML data to Python's dictionary structure, which can then easily be converted to JSON.
Here's an example:
Just as we converted XML to JSON in the previous section, we may also need to convert JSON to XML. We'll continue to use the xmltodict
module for this conversion, as it provides a method unparse()
for the reverse conversion.
Let's take a look at how to convert JSON to XML:
In this example, we're first converting the JSON string to a Python dictionary using json.loads()
, then converting the dictionary to an XML string with xmltodict.unparse()
.
Last updated