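Python has solid built-in support for processing XML files. The two main approaches are SAX and DOM. DOM is simpler to use, but it performs poorly because it loads the whole document into memory; for XML files larger than 2 MB, SAX is the recommended approach.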
DOM (Document Object Model) is a language-independent API for reading XML files. It parses an XML file into a tree structure that the program then navigates.
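As a rough illustration of that tree structure, the sketch below parses a small in-memory document with xml.dom.minidom.parseString and walks its element nodes recursively. The sample XML and the helper name element_outline are made up for this illustration and are not part of any library.

```python
import xml.dom.minidom
from xml.dom.minidom import Node

# Hypothetical sample document, used only for this illustration.
SAMPLE = "<catalog><book isbn='1'><title>A</title><author>B</author></book></catalog>"


def element_outline(node, depth=0):
    """Return one indented line per element node, in document order."""
    lines = []
    if node.nodeType == Node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    # Text nodes have no children, so the recursion stops at the leaves.
    for child in node.childNodes:
        lines.extend(element_outline(child, depth))
    return lines


doc = xml.dom.minidom.parseString(SAMPLE)
outline = element_outline(doc)
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the DOM tree.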
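The DOM Level 1 Specification was published in 1998, and Python supports this standard through xml.dom.minidom. To demonstrate how minidom works, create an XML file named books.xml with the following content: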
    <catalog>
      <book isbn="1-56592-724-9">
        <title>Python Core Book</title>
        <author>Eric</author>
        <author>Daniel</author>
        <author>Mickey</author>
      </book>
      <book isbn="1-56592-051-1">
        <title>Python Programing</title>
        <author>Norman</author>
      </book>
      <!-- imagine more entries here... -->
    </catalog>
The following program reads the XML file and prints the ISBN, title, and authors of each book in turn:
    import xml.dom.minidom
    from xml.dom.minidom import Node

    # Parse the whole file into an in-memory DOM tree.
    doc = xml.dom.minidom.parse("books.xml")

    for node in doc.getElementsByTagName("book"):
        isbn = node.getAttribute("isbn")
        print("ISBN: " + isbn)
        for node2 in node.getElementsByTagName("title"):
            # An element's text is held in its child text nodes.
            title = ""
            for node3 in node2.childNodes:
                if node3.nodeType == Node.TEXT_NODE:
                    title += node3.data
            print("TITLE: " + title)
        for node2 in node.getElementsByTagName("author"):
            author = ""
            for node3 in node2.childNodes:
                if node3.nodeType == Node.TEXT_NODE:
                    author += node3.data
            print("AUTHOR: " + author)
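Notes: getElementsByTagName returns every descendant element with the given tag name, and getAttribute returns an attribute value as a string. An element's text is stored in its child text nodes (nodeType == Node.TEXT_NODE), so the program concatenates their data values to build each title and author name.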
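The text-collecting loop appears twice in the DOM program, so it can be factored out. The following is a minimal sketch under a few assumptions: the helper names get_text and read_books are invented here, and the XML is an in-memory copy of one books.xml entry so that the example runs without a file.

```python
import xml.dom.minidom
from xml.dom.minidom import Node

# In-memory copy of one books.xml entry, so the sketch runs without a file.
SAMPLE = """<catalog>
  <book isbn="1-56592-724-9">
    <title>Python Core Book</title>
    <author>Eric</author>
    <author>Daniel</author>
    <author>Mickey</author>
  </book>
</catalog>"""


def get_text(element):
    """Concatenate the data of the element's direct text-node children."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == Node.TEXT_NODE
    )


def read_books(doc):
    """Return a list of (isbn, title, authors) tuples, one per <book>."""
    books = []
    for book in doc.getElementsByTagName("book"):
        titles = [get_text(t) for t in book.getElementsByTagName("title")]
        authors = [get_text(a) for a in book.getElementsByTagName("author")]
        title = titles[0] if titles else ""
        books.append((book.getAttribute("isbn"), title, authors))
    return books


books = read_books(xml.dom.minidom.parseString(SAMPLE))
for isbn, title, authors in books:
    print("ISBN: " + isbn)
    print("TITLE: " + title)
    print("AUTHORS: " + ", ".join(authors))
```

Returning plain tuples keeps the parsing separate from the printing, which makes the result easier to reuse or check.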
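SAX works very differently from DOM. Instead of building a tree, a SAX parser reads through the XML file sequentially and reports what it finds as it goes, so it still performs well on XML files larger than 2 MB. The drawback is that it is more complicated to use.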
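To make the sequential, event-driven style concrete, here is a small sketch that records the callbacks a SAX parser fires for a tiny in-memory document. The class name EventRecorder and the sample XML are assumptions made for this illustration.

```python
import xml.sax
import xml.sax.handler


class EventRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records each callback as the parser fires it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Whitespace-only chunks are skipped; other text is recorded as delivered.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


recorder = EventRecorder()
xml.sax.parseString(b"<book isbn='1'><title>A</title></book>", recorder)
for event in recorder.events:
    print(event)
```

The events arrive in document order, and the handler has to keep any state it needs itself, which is where the extra complexity of SAX comes from.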
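Take the same books.xml file as the example. First define a BookHandler class that inherits from xml.sax.handler.ContentHandler and builds a dictionary, mapping, that maps each book's ISBN to its title. Save the class in a module named bookhandler.py, since the script that uses it imports it under that name.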
    import xml.sax.handler

    class BookHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
            self.inTitle = 0
            self.mapping = {}

        def startElement(self, name, attributes):
            if name == "book":
                # A new book starts: reset the title buffer and remember its ISBN.
                self.buffer = ""
                self.isbn = attributes["isbn"]
            elif name == "title":
                self.inTitle = 1

        def characters(self, data):
            # Collect text only while inside a <title> element.
            if self.inTitle:
                self.buffer += data

        def endElement(self, name):
            if name == "title":
                self.inTitle = 0
                self.mapping[self.isbn] = self.buffer
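Notes: the parser calls startElement and endElement at each opening and closing tag, and characters for the text in between. Because characters may be called several times for a single run of text, the handler accumulates the pieces in self.buffer and stores the result in mapping when the closing title tag is reached.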
Then parse the XML file and print the dictionary.
    import xml.sax
    import bookhandler
    import pprint

    parser = xml.sax.make_parser()
    handler = bookhandler.BookHandler()
    parser.setContentHandler(handler)
    parser.parse("books.xml")
    pprint.pprint(handler.mapping)
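For trying the SAX approach without creating books.xml or a separate module, the following single-file variant may be handy. It is only a sketch: the class name TitleHandler is invented here, the XML is an in-memory copy of the two sample books, and xml.sax.parseString stands in for the make_parser and parse calls used above.

```python
import xml.sax
import xml.sax.handler

# In-memory copy of the two sample books, so no file is needed.
SAMPLE = b"""<catalog>
  <book isbn="1-56592-724-9">
    <title>Python Core Book</title>
    <author>Eric</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Python Programing</title>
    <author>Norman</author>
  </book>
</catalog>"""


class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that maps each book's ISBN to its title."""

    def __init__(self):
        super().__init__()
        self.mapping = {}
        self.isbn = None
        self.in_title = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "book":
            self.isbn = attrs["isbn"]
        elif name == "title":
            self.in_title = True
            self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect them in a list.
        if self.in_title:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.in_title = False
            self.mapping[self.isbn] = "".join(self.chunks)


handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.mapping)
```

Collecting the pieces in a list and joining them once at the closing tag is a design choice of this sketch; appending to a string, as BookHandler does, gives the same result.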