Working with XML Data

Using the .NET XML Functionality

If you know how to work with XML in .NET languages such as C#, then you already know how to perform the same tasks in IronPython because you can import the System.Xml assembly to gain full access to this functionality. Because the XML capabilities of the .NET Framework are so well defined, using the System.Xml assembly may be all you need to perform tasks within your application. The main issue to consider is how you plan to use your application later. For example, if you plan to move your application to another platform, then using the .NET Framework solution won’t work. In addition, you need to consider data type translation in IronPython. The .NET data types that you use normally are translated into their IronPython counterparts, which could prove confusing for some developers. With these factors in mind, the following sections provide an overview of XML support in the .NET Framework from the IronPython perspective.

Considering the System.Xml Namespace

The System.Xml namespace provides access to the various classes used to interact with XML data. You use these classes to read, write, interpret, edit, build, and otherwise manage XML data. For example, you might use the XmlDeclaration class to begin building an XML data file from scratch when needed. All of these classes depend heavily on standards to ensure the file you create using IronPython is readable by other languages and applications. In fact, the System.Xml namespace supports the following standards and specifications:

  • XML 1.0 (including Document Type Definition, DTD, support): http://www.w3.org/TR/1998/REC-xml-19980210
  • XML Namespaces (both stream level and Document Object Model, DOM): http://www.w3.org/TR/REC-xml-names/
  • XSD Schemas: http://www.w3.org/2001/XMLSchema
  • XPath expressions: http://www.w3.org/TR/xpath
  • XSLT transformations: http://www.w3.org/TR/xslt
  • DOM Level 1 Core: http://www.w3.org/TR/REC-DOM-Level-1/
  • DOM Level 2 Core: http://www.w3.org/TR/DOM-Level-2/

Developing a Basic .NET XML Application

A .NET XML application will follow most of the same principles you use when working with a static language such as C# or Visual Basic.NET. In fact, you might not notice much difference at all except for the obvious structural requirements of a Python application. Consequently, you should find it easy to move your XML code over to IronPython because you really don’t have anything new to worry about. Listing 13-1 shows a simple XML application that creates an XML document, saves it to disk, reads it from disk, and then displays the content onscreen.

DOM-Only Support in the .NET Framework

It’s important to note that the .NET Framework supports DOM and not Simple API for XML (SAX). However, if you want SAX support, you can use the Python modules instead (see the “Working with xml.sax” section of this chapter). XML files include both data and context. In order to reconstruct the original dataset described by an XML file, you need a parser to read the text and then convert it to a usable object. DOM and SAX represent two different methods for interacting with XML documents without forcing the developer to create a parser. If you want more information about the DOM versus SAX approach to parsing XML, check out the information at http://developerlife.com/tutorials/?p=28 and http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html. Here’s a summary of the DOM features.

  • Object-based.
  • Object model is created automatically.
  • Element sequencing is preserved.
  • High memory usage.
  • Slow initial data retrieval.
  • Best for complex data structures.
  • In-memory document updates are supported.

SAX takes a completely different approach than DOM. Here’s a summary of the SAX features.

  • Event-based.
  • Object model is created by the application.
  • Element sequencing is ignored in favor of single events.
  • Low memory usage.
  • Fast initial data retrieval.
  • Best for simple data structures.
  • No document updates.
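To make the contrast concrete, here’s a minimal sketch that reads the same data both ways. It uses the Python standard library modules rather than System.Xml, because the .NET Framework offers only the DOM approach. The SAMPLE string and the Collector class are names invented for this illustration, and the sketch assumes the pyexpat support that the Python parsers rely on is available on your system.

```python
# Contrast the DOM and SAX styles using the standard xml modules.
import xml.dom.minidom
import xml.sax

# Hypothetical sample data used only for this illustration.
SAMPLE = '<root><Message>Hello</Message><Message>Goodbye</Message></root>'

# DOM: the parser builds the complete object model in memory.
doc = xml.dom.minidom.parseString(SAMPLE)
dom_messages = [node.firstChild.data
                for node in doc.getElementsByTagName('Message')]

# SAX: the application builds whatever model it needs from events.
class Collector(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.messages = []
        self.in_message = False

    def startElement(self, name, attrs):
        if name == 'Message':
            self.in_message = True
            self.messages.append('')

    def endElement(self, name):
        if name == 'Message':
            self.in_message = False

    def characters(self, content):
        # Text can arrive in pieces, so append to the current entry.
        if self.in_message:
            self.messages[-1] += content

collector = Collector()
xml.sax.parseString(SAMPLE.encode('utf-8'), collector)
sax_messages = collector.messages

print(dom_messages)
print(sax_messages)
```

Both approaches recover the same two messages. The DOM version holds the whole document in memory and lets you revisit any node, while the SAX version keeps only the list that the handler chooses to build.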

Listing 13-1: Reading and writing an XML document

[code]
# Import clr to add references.
import clr

# Add the required reference.
clr.AddReference('System.Xml')

# Import the System.Xml classes.
from System.Xml import *

# This function creates the document and writes it to disk.
def CreateDocument():

    # Create a document.
    Doc = XmlDocument()

    # Add the XML Declaration.
    Declaration = Doc.CreateXmlDeclaration('1.0', 'utf-8', 'yes')
    Doc.AppendChild(Declaration)

    # Create the root node.
    Root = Doc.CreateNode(XmlNodeType.Element, 'root', None)

    # Add child elements to the root.
    MsgNode = Doc.CreateNode(XmlNodeType.Element, 'Message', None)
    MsgNode.InnerXml = 'Hello'
    Root.AppendChild(MsgNode)
    MsgNode = Doc.CreateNode(XmlNodeType.Element, 'Message', None)
    MsgNode.InnerXml = 'Goodbye'
    Root.AppendChild(MsgNode)

    # Add the root node to the document.
    Doc.AppendChild(Root)

    # Save the document to disk.
    Doc.Save('Test.XML')

# This function reads the document from disk and displays its content.
def DisplayDocument():

    # Create a document.
    XMLDoc = XmlDocument()

    # Load the XML data.
    XMLDoc.Load('Test.XML')

    # Process the document.
    for Nodes in XMLDoc:
        if type(Nodes) == XmlElement:
            for MsgNodes in Nodes:
                print 'Message:', MsgNodes.InnerXml

# Interact with an XML document.
CreateDocument()
DisplayDocument()

# Pause after the debug session.
raw_input('\nPress any key to continue...')
[/code]

The code begins by importing clr, which the application uses to add the required reference to System.Xml using the clr.AddReference() method. The code then imports the System.Xml classes.

The example relies on two functions to keep the code simple: CreateDocument(), which creates and saves the document to disk, and DisplayDocument(), which reads the document from disk and displays the content on screen. The example calls each of these functions in turn.

The CreateDocument() function begins by creating an XmlDocument object, Doc. As with any .NET application, Doc doesn’t contain anything when you create it. The first task is to create the XML declaration using Doc.CreateXmlDeclaration() so that the result is a well-formed XML document. Calling Doc.AppendChild() adds the declaration to the document.

Now it’s time to create some content. All XML documents have a root node, which is Root for this example. The code creates Root using Doc.CreateNode() with an XmlNodeType.Element type and 'root' for a name. The example doesn’t work with XML namespaces, so the third argument is set to None.

The most efficient way to create an XML document from scratch is to add all the child nodes to Root before you add Root to the document. The code creates MsgNode using the same technique as for Root. It adds content to MsgNode using the MsgNode.InnerXml property and then adds the node to Root using Root.AppendChild(). The example provides two 'Message' nodes.

At this point, the code adds Root to the document using Doc.AppendChild(). It then saves the document to disk using Doc.Save(). Figure 13-1 shows the typical output from this example when viewed in Notepad (you can use any text editor to view the output because the Doc.Save() method includes spaces and line feeds).

Figure 13-1: The XML document output looks much as you might expect.

The DisplayDocument() function begins by creating a document, XMLDoc, using the XmlDocument class constructor. It then loads the previously created XML document using XMLDoc.Load(). At this point, XMLDoc contains everything the code created earlier and you can easily explore it using the IronPython console.

If you’ve worked with XML documents using C# or Visual Basic.NET, you know that these languages sometimes make it hard to get to the data you really want. IronPython makes things very easy. All you need is a for loop, as shown in the code. Simple if statements make it easy to locate nodes of a particular type, XmlElement in this case.

By the time the code reaches the second for loop, it’s working with the 'Message' elements. The code simply prints the MsgNodes.InnerXml property value to the screen, as shown in Figure 13-2. By now you can see that IronPython makes it incredibly simple to work with XML documents using the .NET Framework approach.

Figure 13-2: The example outputs the message content in the XML document.

Loading and Viewing the XMLUtil Module

The example in this section assumes that you’ve loaded the XMLUtil module from the IronPython Tutorial directory. The following steps show you how to load this module manually so you can see the content.

  1. Open the IronPython console.
  2. Type import sys and press Enter. This command imports the sys module so that you can add the required directory to it.
  3. Type sys.path.append('C:/Program Files/IronPython 2.6/Tutorial') and press Enter (make sure you change the path information to match the location of your IronPython installation). The XMLUtil.py module exists in the Tutorial directory. Using this module is fine for experimentation, but be sure you copy the XMLUtil.py module to another location for other uses.
  4. Type print sys.path and press Enter. You should see the new path added to the list.
  5. Type import XMLUtil and press Enter, and then type dir(XMLUtil) and press Enter. You see the list of methods available in XMLUtil (as shown in Figure 13-3), which includes the Walk() method used in the example.
Figure 13-3: The Walk() method makes viewing XML data easier.

Using the XMLUtil Module

As previously mentioned, the XMLUtil.py file isn’t anything so advanced that you couldn’t put it together yourself, but it’s an interesting module to work with and use. Listing 13-2 shows a short example of how you could use this module in an application.

Listing 13-2: Walking an XML document using XMLUtil

[code]
# Add the path required to import xmlutil.
import sys
sys.path.append('C:/Program Files/IronPython 2.6/Tutorial')

# Import xmlutil to access the Walk() function.
import xmlutil

# Import clr to add references.
import clr

# Add the required reference.
clr.AddReference('System.Xml')

# Import the System.Xml classes.
from System.Xml import *

# Create a document.
XMLDoc = XmlDocument()

# Load the XML data.
XMLDoc.Load('Test.XML')

# Walk the file contents.
print 'Contents of Test.XML'
for Node in xmlutil.Walk(XMLDoc):
    print '\nName:', Node.Name
    print 'Value:', Node.Value
    print 'InnerXml:', Node.InnerXml

# Pause after the debug session.
raw_input('\nPress any key to continue...')
[/code]

The example begins by importing sys, appending the Tutorial folder path, and importing XMLUtil. The code then imports clr, adds a reference to System.Xml, and imports the System.Xml classes. There isn’t anything new about any of this code.

The example makes use of the Test.XML file created in the “Developing a Basic .NET XML Application” section of this chapter. It creates an XmlDocument object, XMLDoc, and loads Test.XML into it using the XMLDoc.Load() method. At this point, you have an XML document that you can walk (go from node to node and examine). The XMLUtil.Walk() method can walk any sort of XML document, so you should try it out with other files after you’ve worked with the example for a while.

The next step is to call on XMLUtil.Walk() to walk the XML document for you. The example shows output from the Name, Value, and InnerXml properties. However, you have access to all the properties provided for the various XML data types that the .NET Framework provides. Consequently, you can use XMLUtil.Walk() to display any information needed, or to manage that information. Just because the example displays properties doesn’t mean you have any limitation on how you interact with the output of XMLUtil.Walk(). Figure 13-4 shows the output of this example.

Figure 13-4: Screen shows the output of the Walk() method for Test.XML.

The XMLUtil.Walk() function is important because it demonstrates a Python generator (described later in this section, once you have the required background). Most languages don’t provide support for generators, so they require a little explanation. The issue at the center of this whole discussion is the variant list. You know that an application will need to process some number of items during run time, but you have no idea of how long this list is or whether the list will exist at all. A producer function is one that outputs values one at a time in response to a request. The producer keeps processing items until it runs out, so the length of the list is no longer a concern (even if the list contains no items at all). Most languages rely on a callback, an address to the requestor, to provide a place to send the producer output. The problem with using a callback is that the code must provide some means of retaining state information to remember previous values. In some cases, using callbacks leads to unnatural, convoluted coding techniques that are hard to write, harder to understand, and nearly impossible to update later.

Developers have a number of alternatives they can use. For example, the developer could simply use a very large list. However, lists require that the developer know what values should appear in the list during design time, and lists can consume large quantities of memory, making them a less than helpful solution in many cases. Another solution is to use an iterator to perform the task. Using an iterator makes it easier to get out of a loop when the processing is finished and eliminates the memory requirements. However, using an iterator shifts the burden of maintaining state information to the producer, complicating an already difficult programming task because the producer may not know anything about the caller. There are other solutions, as well, such as running the requestor and producer on separate threads so that each object can maintain state information without worrying about the potential corruption that occurs when running the code on a single thread. Unfortunately, multithreaded applications can run slowly and require a platform that fully supports multithreading, making your application less portable. In short, most languages don’t provide a good solution to the problem of working with data of variant length.

A generator creates a situation where the producer continuously outputs individual results as in a loop, maintaining its state locally. The requestor actually views the function as a type of iterator, even though the producer isn’t coded to provide an iterator. To accomplish this task, Python provides the yield statement shown in Figure 13-5. The yield statement returns an intermediate result from the producer to the requestor, while the producer continues to process a list of items.
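A minimal sketch may help show what the yield statement buys you. The produce_messages() function and its input lists are hypothetical names made up for this illustration; they aren’t part of XMLUtil. Notice that the producer keeps its own running count between requests, and that an empty input simply produces nothing.

```python
def produce_messages(items):
    """Producer: hand back one result per request, keeping state locally."""
    count = 0                      # State survives between yields.
    for item in items:
        count += 1
        yield '%d: %s' % (count, item)

# The requestor treats the producer like any other iterable.
results = list(produce_messages(['Hello', 'Goodbye']))
empty = list(produce_messages([]))

print(results)
print(empty)
```

The requestor never needs to know how many items exist ahead of time, and no callback or separate thread is required to remember where the producer left off.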

The code in Figure 13-5 begins with the definition of a function named Walk(). This function accepts some kind of XML as input. The first yield statement sends the entire xml input back to the requestor (the example application shown in Listing 13-2). Consequently, you see #document as the Name and the entire XML document as the InnerXml.

The second call to Walk() moves past the first yield statement. Because the second item doesn’t meet the hasattr(xml, "Attributes") requirement, the code moves on to the loop statement at the bottom of the code listing shown in Figure 13-5. The effect of this loop is to obtain the child elements of the entire document. So the second call to Walk() ends with yield c, which returns the XML declaration element. As a result, you see xml for the Name, version="1.0" encoding="utf-8" standalone="yes" for the Value, and nothing for the InnerXml. This second call ends processing of the XML declaration.

Figure 13-5: The XMLUtil.Walk() function is interesting because it provides a generator.
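If you don’t have the figure handy, the following sketch shows the same general shape using xml.dom.minidom from the Python standard library. It is not the XMLUtil source; the walk() function and the sample document are assumptions made for this illustration, and the sketch assumes the standard Python XML parser is available.

```python
import xml.dom.minidom

def walk(node):
    """Yield the node, then its attributes (if any), then its descendants."""
    yield node
    attrs = getattr(node, 'attributes', None)
    if attrs is not None:
        for index in range(attrs.length):
            yield attrs.item(index)
    for child in node.childNodes:
        # Re-yield each partial result from the nested generator.
        for descendant in walk(child):
            yield descendant

# Hypothetical sample document used only for this illustration.
doc = xml.dom.minidom.parseString(
    '<root><Message id="1">Hello</Message></root>')
names = [node.nodeName for node in walk(doc)]
print(names)
```

As with the walk described in the text, the first value handed back is the document itself, followed by each element, attribute, and text node in document order.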

The third call to Walk() begins processing of the root node. It’s interesting to trace through the code in the debugger because you see the for loops in XMLUtil.Walk() used to trace through each element of the input xml as if it were using recursion or perhaps some type of iteration, but the fact is that the code merely combines the for loop with a yield statement to feed each partial result back to the requestor. Using the Python debugger is actually a bit more helpful in this case than using the Visual Studio debugger because the Visual Studio debugger won’t show you the value of xml, child, or c so that you can see the changing values. The example code for this book includes XMLUtilDemo2.py for the purpose of using the Python debugger. Follow these steps to load the debugger so you can trace through the example yourself.

  1. Open the IronPython console.
  2. Type import sys and press Enter. This command imports the sys module so that you can add the required directory to it.
  3. Type sys.path.append('C:/Program Files/IronPython 2.6/Tutorial') and press Enter (make sure you change the path information to match the location of your IronPython installation).
  4. Type import XMLUtil and press Enter to import the support file (important if you want to see how the generator works).
  5. Type import XMLUtilDemo2 and press Enter to import the source code file.
  6. Type import pdb and press Enter to import the debugger.
  7. Type pdb.run('XMLUtilDemo2.main()') and press Enter to start the debugger. At this point, you can single step through the code to see how everything works.

Using the Python Modules

At one point, the Python modules were stable and straightforward to use, but later versions are less stable and, when it comes to IronPython, may be missing required elements completely. Consequently, you might see tutorials such as the one at http://www.boddie.org.uk/python/XML_intro.html and wonder why they don’t work. These tutorials are based on earlier versions of Python and don’t account for the missing CPython elements in IronPython. The following sections describe how to overcome these problems in your application when you use the Python approach to XML file management in IronPython.

Working with xml.dom.minidom

The xml.dom.minidom module is designed to help you work with XML using the DOM approach. However, this module is far from complete in IronPython, partly due to the CPython support required in standard Python. The actual document support is complete, so you won’t have a problem building, editing, and managing XML documents. It’s the write and read support that are lacking.

Fortunately, you can overcome write issues by using a different approach to outputting the document to disk (or other media). Standard Python development practice is to use the xml.dom.ext.PrettyPrint() method, which simply doesn’t exist in IronPython. You get around the problem by performing the task in two steps, rather than one, as shown in Listing 13-3.

The reading problem isn’t as easy to solve. Standard Python development practice is to use the xml.dom.minidom.parse() method. This method does exist in IronPython, but it outputs an error stating:

[code]
ImportError: No module named pyexpat
[/code]

This module actually is missing. In order to fix this problem, you must download the pyexpat.py file from https://fepy.svn.sourceforge.net/svnroot/fepy/trunk/lib/. Place this file in your Program Files\IronPython 2.6\Lib folder, not the Program Files\IronPython 2.6\Lib\xml\dom folder as you might think. As shown in Listing 13-3, the standard Python techniques work just fine now.

Listing 13-3: Managing XML documents using the Python approach

[code]
# Import the required XML support.
import xml.dom.minidom

def CreateDocument():

    # Create an XML document.
    Doc = xml.dom.minidom.Document()

    # Create the root node.
    Root = Doc.createElement('root')

    # Add the message nodes.
    MsgNode = Doc.createElement('Message')
    Message = Doc.createTextNode('Hello')
    MsgNode.appendChild(Message)
    Root.appendChild(MsgNode)
    MsgNode = Doc.createElement('Message')
    Message = Doc.createTextNode('Goodbye')
    MsgNode.appendChild(Message)
    Root.appendChild(MsgNode)

    # Append the root node to the document.
    Doc.appendChild(Root)

    # Create the output document.
    MyFile = open('Test2.XML', 'w')

    # Write the output.
    MyFile.write(Doc.toprettyxml(encoding='utf-8'))

    # Close the document.
    MyFile.close()

def DisplayDocument():

    # Read the existing XML document.
    XMLDoc = xml.dom.minidom.parse('Test2.XML')

    # Print the message node content.
    for ThisChild in XMLDoc.getElementsByTagName('Message'):
        print 'Message:', ThisChild.firstChild.toxml().strip('\n\t')

CreateDocument()
DisplayDocument()

# Pause after the debug session.
raw_input('\nPress any key to continue...')
[/code]

The first thing you should notice is that the code for this example is much shorter than its .NET counterpart, even though the result is essentially the same. Despite the problems with the Python libraries, you can write concise code for manipulating XML using Python.

The code begins by importing the only module it needs, xml.dom.minidom. It then calls CreateDocument() and DisplayDocument() in turn, just as the .NET example does. In fact, the output from this example is precisely the same. You see the same output shown in Figure 13-2 when you run this example.

The CreateDocument() function begins by creating an XML document, Doc, using xml.dom.minidom.Document(). The XML document automatically contains the XML declaration, so unlike the .NET version of the code, you don’t need to add it manually. So the first processing task is to create the root node using Doc.createElement('root').

As with the .NET example, this example creates two MsgNode elements that contain different messages. The technique used is different from the .NET example. Instead of setting an InnerXml property, the code creates an actual text node using Doc.createTextNode(). However, the result is the same, as shown in Figure 13-6. The last step is to add Root to Doc using Doc.appendChild().
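The following sketch isolates the node-building step so you can inspect the result in memory before writing anything to disk. It also shows one practical consequence of using a text node: markup characters in the text are escaped rather than parsed. The variable names and the 'a < b' sample string are assumptions chosen for this illustration.

```python
import xml.dom.minidom

# Build the document entirely in memory.
doc = xml.dom.minidom.Document()
root = doc.createElement('root')
for text in ('Hello', 'Goodbye'):
    msg = doc.createElement('Message')
    msg.appendChild(doc.createTextNode(text))
    root.appendChild(msg)
doc.appendChild(root)

# Serialize just the root element, without the XML declaration.
compact = doc.documentElement.toxml()

# A text node escapes markup characters when it is serialized.
escaped = doc.createTextNode('a < b').toxml()

print(compact)
print(escaped)
```

Using createTextNode() means the message content is always treated as data, which is usually what you want when the text comes from a user or another program.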

A big difference between IronPython and Python is how you write the XML to a file. As previously mentioned, you can’t use the xml.dom.ext.PrettyPrint() method. In this case, the code creates a file, MyFile, using open(). The arguments define the filename and the mode, where 'w' signifies write. In order to write the text to a file, you use a two-step process. First, the code creates formatted XML by calling Doc.toprettyxml(). The function accepts an optional encoding argument, but there isn’t any way to define the resulting XML document as stand-alone using the standalone="yes" attribute (see Figure 13-1). Second, the code writes the data to the file buffer using MyFile.write().

Figure 13-6: The Python output is similar, but not precisely the same as the .NET output.

Calling MyFile.write() doesn’t write the data to disk. In order to clear the file buffer, you must call MyFile.close(). Theoretically, IronPython will call MyFile.close() when the application ends, but there isn’t any guarantee of this behavior, so you must specifically call MyFile.close() to ensure there isn’t any data loss.
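One way to make sure the file is always closed is to let a with block do it for you. The following is a minimal sketch, not the listing’s own technique: it assumes a Python version that supports the with statement, writes to a temporary directory (an assumption made so the sketch doesn’t touch your working folder), and opens the file in binary mode because toprettyxml() returns encoded data when you supply an encoding.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical document and output location for this illustration.
doc = xml.dom.minidom.parseString('<root><Message>Hello</Message></root>')
path = os.path.join(tempfile.mkdtemp(), 'Test2.XML')

# The with block closes (and therefore flushes) the file on exit,
# even if write() raises an exception.
with open(path, 'wb') as my_file:
    my_file.write(doc.toprettyxml(encoding='utf-8'))

# Read the file back to confirm the data reached the disk.
reloaded = xml.dom.minidom.parse(path)
message = reloaded.getElementsByTagName('Message')[0].firstChild.data.strip()
print(message)
```

The explicit close() call in Listing 13-3 achieves the same result; the with block simply removes the chance of forgetting it.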

The DisplayDocument() function comes next. Reading an XML document from disk and placing it in a variable is almost too easy when using IronPython. All you need to do is make a single call to xml.dom.minidom.parse(). That’s it! The document is immediately ready for use.

The second step is to display the same output shown in Figure 13-2. Again, all you need in IronPython is a simple for loop, rather than the somewhat lengthy .NET code. In this case, you ask IronPython to retrieve the nodes you want using XMLDoc.getElementsByTagName(). The output is a list that you can process one element at a time. The print statement calls on a complex-looking call sequence.

[code]
ThisChild.firstChild.toxml().strip('\n\t')
[/code]

However, if you take this call sequence apart, it really isn’t all that hard to understand. Every iteration of the loop places one of the MsgNode elements in ThisChild. The first (and only) child of MsgNode is the Message text node, so you can retrieve it using the firstChild property. The firstChild property contains a DOM Text node object, so you convert it to XML using the toxml() method. Unfortunately, the resulting string contains control characters, so you remove them using the strip('\n\t') method. The result is a simple value output.
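Here’s the same call sequence split into separate steps. The sample document, with its newline and tab characters around the text, is an assumption that imitates pretty-printed output; the variable names are made up for this illustration.

```python
import xml.dom.minidom

# Hypothetical input that mimics pretty-printed message content.
doc = xml.dom.minidom.parseString(
    '<root><Message>\n\t\tHello\n\t</Message></root>')

this_child = doc.getElementsByTagName('Message')[0]
text_node = this_child.firstChild      # The DOM Text node.
raw = text_node.toxml()                # Text, including control characters.
clean = raw.strip('\n\t')              # Remove leading and trailing \n and \t.

print(repr(raw))
print(clean)
```

Keep in mind that strip() only removes the listed characters from the two ends of the string, so any newlines or tabs inside the message itself are left alone.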

Working with xml.sax

It’s important to remember that SAX is an event-driven method of working with XML. An application looks at only a small piece of the document at any one time. Consequently, SAX can be a good method for processing larger documents that you can’t read into memory at one time. A SAX application normally relies on three constructs:

  • One or more sources as input
  • A parser (normally, only one is used)
  • One or more handlers to respond to input events

There are many different Python SAX modules. Each of these modules provides different implementations of the three constructs. The default SAX implementation provides just four handlers. These handlers are implemented as classes that you use to interact with the events generated by the input file.

  • ContentHandler: Provides the main SAX interface for handling document events. Most applications use this interface as a minimum because it provides the basic support required for any document. The example shows how to use this handler, which is provided as part of the xml.sax module.
  • DTDHandler: Manages all of the Document Type Definition (DTD) events.
  • EntityResolver: Resolves external entities such as files referenced by processing instructions.
  • ErrorHandler: Reports any errors or warnings that the parser encounters when it processes the XML. Provided as part of the xml.sax module.

Now that you have a little better idea of what to expect, it’s time to look at an actual example. Listing 13-4 shows a simple SAX implementation that includes all of the constructs you normally need. Of course, you can easily add to this example to make it do considerably more than it does now.

Listing 13-4: Parsing an XML document using SAX

[code]
# Import the required module.
import xml.sax

# Create a handler based on the default ContentHandler class.
class MessageHandler(xml.sax.ContentHandler):

    # Contains the message text.
    Message = ''

    # Determines when the content is a message.
    IsMessage = False

    # Check for the kind of element before processing it.
    def startElement(self, name, attrs):
        if name == 'Message':
            self.IsMessage = True
            self.Message = ''
        else:
            self.IsMessage = False

    # If this is the right kind of element, display the message for it.
    def endElement(self, name):
        if name == 'Message':
            print 'Message:', self.Message.strip('\n\t')

    # Add each of the characters of the message to the Message variable.
    def characters(self, ch):
        if self.IsMessage:
            self.Message += ch

# Create a parser.
Parser = xml.sax.make_parser()

# Create a handler for the parser and tell the parser to use it.
Handler = MessageHandler()
Parser.setContentHandler(Handler)

# Open a source and parse it using the parser with the custom handler.
Parser.parse(open('Test2.XML'))

# Pause after the debug session.
raw_input('\nPress any key to continue...')
[/code]

The code begins by importing the required xml.sax module. You don’t need anything fancy to create a basic SAX handler. Remember that SAX processes the file sequentially and generates events based on the content that the parser sees. Consequently, the code may seem a little odd for someone who is used to working with complete elements, but SAX gives you fine control over the processing cycle, including locating errors within the file.

The centerpiece of this example is the MessageHandler class. This class includes a variable to hold the message (Message), an indicator of whether an element is a message (IsMessage), and the three methods described in the following list.

  • startElement(): The parser calls this method at the beginning of an element.
  • endElement(): The parser calls this method at the end of an element.
  • characters(): The parser calls this method with the character data it reads from the source. A single run of text may arrive in one call or be split across several calls.

For this example, the startElement() method checks the element name. If the element is a 'Message' element, then the code sets IsMessage to True and clears Message of any existing content. This is a preparatory step.

When the characters() method sees that IsMessage is True, it appends the character data it receives to Message. Remember that this data can arrive in pieces, so you can’t assume much about the content except that the flow is from the beginning of the file to the end of it. In other words, you won’t receive characters out of order.

The endElement() method checks the element name again. When the element name is 'Message', the code prints the content of Message. Because Message contains all of the characters from the source, you must use strip('\n\t') to remove any control characters. The output from this example is the same as shown in Figure 13-2.

Now that you understand the handler, it’s time to see how you put it to work. The main part of the code begins by creating a parser, Parser, using xml.sax.make_parser(). Remember that the parser simply generates events based on the input characters it sees. The handler performs the actual interpretation of those characters.

The next step is to create an instance of MessageHandler named Handler. The code uses Parser.setContentHandler() to assign the handler to Parser. Otherwise, Parser won’t know which handler to use to process the XML characters.

In order to process the XML file, the code still requires a source, which is the third construct. The open('Test2.XML') call opens Test2.XML as a source and passes this source to Parser through the Parser.parse() method. It’s the call to the Parser.parse() method that actually begins the process of generating events.
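The same three constructs (source, parser, and handler) also make it possible to find out where a document goes wrong. The following sketch is an illustration only: the CountingHandler class, the parse_bytes() function, and the sample inputs are hypothetical names, the source is an in-memory byte stream instead of a disk file, and the sketch assumes the standard Python XML parser is available.

```python
import io
import xml.sax

class CountingHandler(xml.sax.ContentHandler):
    """Count the elements the parser reports."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.element_count = 0

    def startElement(self, name, attrs):
        self.element_count += 1

def parse_bytes(data):
    """Return (element count, error line), where error line is None on success."""
    handler = CountingHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    try:
        # Any object with a read() method can act as the source.
        parser.parse(io.BytesIO(data))
    except xml.sax.SAXParseException as error:
        return handler.element_count, error.getLineNumber()
    return handler.element_count, None

good = parse_bytes(b'<root><Message>Hello</Message></root>')
bad = parse_bytes(b'<root>\n<Message>Hello</root>')

print(good)
print(bad)
```

Because the handler receives events as the parser reads the source, it has already counted the elements that precede the error by the time the exception reports the line where parsing stopped.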