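Python offers solid support for processing XML files. The two main approaches are SAX and DOM. DOM is simpler to use but performs poorly on large inputs, because it loads the whole document into memory; for XML files larger than about 2 MB, SAX is the recommended choice.
DOM (Document Object Model) is a language-independent API for reading XML files. It parses the XML file into a tree structure that the program then navigates.
The DOM Level 1 Specification was published in 1998, and Python supports it through xml.dom.minidom. To show how minidom works, create an XML file named books.xml with the following content: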
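To make the tree idea concrete, here is a minimal sketch that parses a tiny inline document and prints one line per node, indented by its depth in the tree. The sample string and the helper name outline are assumptions made for this illustration only.

```python
import xml.dom.minidom

# Hypothetical one-book document, used only for this illustration.
SAMPLE = "<catalog><book><title>Python Core Book</title></book></catalog>"

def outline(node, depth=0, lines=None):
    """Collect one line per element or text node, indented by tree depth."""
    if lines is None:
        lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        lines.append("  " * depth + repr(node.data))
    # Text nodes have an empty childNodes list, so the recursion stops there.
    for child in node.childNodes:
        outline(child, depth + 1, lines)
    return lines

doc = xml.dom.minidom.parseString(SAMPLE)
tree_lines = outline(doc.documentElement)
print("\n".join(tree_lines))
```

Each level of nesting in the XML becomes one level of indentation in the output, with the title text as the deepest node.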
<catalog>
  <book isbn="1-56592-724-9">
    <title>Python Core Book</title>
    <author>Eric</author>
    <author>Daniel</author>
    <author>Mickey</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Python Programming</title>
    <author>Norman</author>
  </book>
  <!-- imagine more entries here... -->
</catalog>
The following program reads the XML file and prints each book's ISBN, title, and authors in turn:
import xml.dom.minidom
from xml.dom.minidom import Node

# Parse the whole file into an in-memory DOM tree.
doc = xml.dom.minidom.parse("books.xml")

for node in doc.getElementsByTagName("book"):
    isbn = node.getAttribute("isbn")
    print("ISBN: " + isbn)
    for node2 in node.getElementsByTagName("title"):
        # Join the text nodes directly under <title>.
        title = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                title += node3.data
        print("TITLE: " + title)
    for node2 in node.getElementsByTagName("author"):
        # Same approach for each <author> element.
        author = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                author += node3.data
        print("AUTHOR: " + author)
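The text-gathering loop appears twice in the program above, so it can be factored into a helper. The sketch below is one possible way to do that, not the original author's code: the names get_text and read_books are hypothetical, and the inline XML is a trimmed sample modelled on books.xml so that the example runs without a file.

```python
import xml.dom.minidom

# Trimmed inline sample modelled on books.xml (an assumption for this sketch).
BOOKS_XML = """<catalog>
<book isbn="1-56592-724-9">
<title>Python Core Book</title>
<author>Eric</author>
<author>Daniel</author>
<author>Mickey</author>
</book>
<book isbn="1-56592-051-1">
<title>Python Programming</title>
<author>Norman</author>
</book>
</catalog>"""

def get_text(element):
    """Join the data of the element's direct text children."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )

def read_books(doc):
    """Return {isbn: (title, [authors])} for every <book> in the document."""
    books = {}
    for book in doc.getElementsByTagName("book"):
        titles = [get_text(t) for t in book.getElementsByTagName("title")]
        authors = [get_text(a) for a in book.getElementsByTagName("author")]
        books[book.getAttribute("isbn")] = (titles[0] if titles else "", authors)
    return books

books = read_books(xml.dom.minidom.parseString(BOOKS_XML))
for isbn, (title, authors) in books.items():
    print(isbn, title, ", ".join(authors))
```

Returning a dictionary instead of printing inside the loop keeps the parsing separate from the output, which makes the result easier to reuse.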
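SAX works quite differently from DOM. Instead of building a tree, a SAX parser scans the XML file from start to finish and reports each element to a handler as it goes, so it still performs well on files larger than about 2 MB. The drawback is that the code is more complex.
We use the same books.xml file again. First define a BookHandler class that inherits from xml.sax.handler.ContentHandler and builds a dictionary, mapping, from each book's ISBN to its title.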
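To show what this event-driven style looks like, here is a minimal sketch of a handler that does nothing but record the callbacks in the order the parser fires them. The class name EventLogger and the inline one-book document are assumptions for this illustration.

```python
import xml.sax
import xml.sax.handler

class EventLogger(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each callback as it fires."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # The parser may deliver text in several chunks; skip pure whitespace.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

logger = EventLogger()
xml.sax.parseString(b"<book><title>Python Core Book</title></book>", logger)
for event in logger.events:
    print(event)
```

The start and end events arrive in document order, and no tree is ever built. Any state the program needs, such as which element it is currently inside, has to be tracked by the handler itself.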
import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.inTitle = False
        self.mapping = {}

    def startElement(self, name, attributes):
        if name == "book":
            # Start a fresh text buffer and remember this book's ISBN.
            self.buffer = ""
            self.isbn = attributes["isbn"]
        elif name == "title":
            self.inTitle = True

    def characters(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.inTitle:
            self.buffer += data

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
            self.mapping[self.isbn] = self.buffer
Then, assuming the class above is saved as bookhandler.py, parse the XML file and print the dictionary.
import xml.sax
import bookhandler
import pprint
parser = xml.sax.make_parser()
handler = bookhandler.BookHandler()
parser.setContentHandler(handler)
parser.parse("books.xml")
pprint.pprint(handler.mapping)
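For trying the SAX approach without creating books.xml or a separate module, the following self-contained sketch parses an inline byte string with xml.sax.parseString. TitleHandler is a hypothetical variant of the handler written for this example, and the XML is a trimmed sample modelled on books.xml.

```python
import pprint
import xml.sax
import xml.sax.handler

# Trimmed inline sample modelled on books.xml (an assumption for this sketch).
BOOKS_XML = b"""<catalog>
<book isbn="1-56592-724-9"><title>Python Core Book</title><author>Eric</author></book>
<book isbn="1-56592-051-1"><title>Python Programming</title><author>Norman</author></book>
</catalog>"""

class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical variant of BookHandler: maps each ISBN to its title."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.in_title = False
        self.isbn = None
        self.chunks = []
        self.mapping = {}

    def startElement(self, name, attrs):
        if name == "book":
            self.isbn = attrs["isbn"]
        elif name == "title":
            # Reset the chunk list at every <title>, not at every <book>.
            self.in_title = True
            self.chunks = []

    def characters(self, content):
        if self.in_title:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
            self.mapping[self.isbn] = "".join(self.chunks)

handler = TitleHandler()
xml.sax.parseString(BOOKS_XML, handler)
pprint.pprint(handler.mapping)
```

The resulting dictionary has one entry per book, keyed by ISBN. Author text is ignored because the handler only collects characters while it is inside a title element.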