Python XML

Posted by: JiaLiang Liao 15 years, 10 months ago

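Python has strong built-in support for processing XML files. The two main approaches are SAX and DOM. DOM is simpler to use but performs worse; for XML files larger than about 2 MB, SAX is the recommended choice.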

DOM

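DOM (Document Object Model) is a language-independent API for reading XML files. It converts the whole XML file into a tree structure, which the program then reads.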
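To make the tree structure concrete, here is a minimal sketch that parses a small document from a string and lists each node with indentation. The one-book SAMPLE string and the describe() helper are made up for illustration; only parseString, nodeType, nodeName, data and childNodes come from xml.dom.minidom.

```python
from xml.dom.minidom import parseString

# Hypothetical one-book document, used only for illustration.
SAMPLE = (
    '<catalog><book isbn="1-56592-724-9">'
    "<title>Python Core Book</title>"
    "</book></catalog>"
)


def describe(node, depth=0):
    """Return one indented line per node in the DOM tree rooted at node."""
    if node.nodeType == node.TEXT_NODE:
        label = "text: " + repr(node.data)
    else:
        label = node.nodeName
    lines = ["  " * depth + label]
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines


doc = parseString(SAMPLE)
tree_lines = describe(doc.documentElement)
print("\n".join(tree_lines))
```

Note that the title text appears as a child node of the title element, not as a property of it.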

The DOM Level 1 Specification was published in 1998, and Python supports this standard through xml.dom.minidom. To show how minidom works, create an XML file named books.xml with the following content:

<catalog>
  <book isbn="1-56592-724-9">
    <title>Python Core Book</title>
    <author>Eric</author>
    <author>Daniel</author>
    <author>Mickey</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Python Programming</title>
    <author>Norman</author>
  </book>
  <!-- imagine more entries here... -->
</catalog>

The following program reads the XML file and prints the ISBN, title, and authors of each book:

import xml.dom.minidom
from xml.dom.minidom import Node

# Parse the whole file into an in-memory DOM tree.
doc = xml.dom.minidom.parse("books.xml")

for node in doc.getElementsByTagName("book"):
    isbn = node.getAttribute("isbn")
    print("ISBN: " + isbn)

    for node2 in node.getElementsByTagName("title"):
        title = ""
        # The text inside an element is stored as child text nodes.
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                title += node3.data
        print("TITLE: " + title)

    for node2 in node.getElementsByTagName("author"):
        author = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                author += node3.data
        print("AUTHOR: " + author)

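Notes:

getElementsByTagName("book") walks the tree and returns every book element; the loop binds each one to the variable node in turn.

node.getAttribute("isbn") reads an attribute of the node.

node.getElementsByTagName("title") continues the search among the descendants of that node.

node2.childNodes: the text inside an element is also treated as child nodes, so the text is read from the children whose type is TEXT_NODE.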
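The same calls can be wrapped in small helpers that return a dictionary instead of printing. This is only a sketch: get_text() and read_books() are hypothetical names, and the document is parsed from an in-memory string (a shortened copy of books.xml) so that the example runs without any file on disk.

```python
from xml.dom.minidom import parseString

# Shortened copy of books.xml, kept in memory for the example.
SAMPLE = """<catalog>
  <book isbn="1-56592-724-9">
    <title>Python Core Book</title>
    <author>Eric</author>
    <author>Daniel</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Python Programming</title>
    <author>Norman</author>
  </book>
</catalog>"""


def get_text(element):
    """Concatenate the direct text-node children of an element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def read_books(xml_text):
    """Return {isbn: {"title": str, "authors": [str, ...]}}."""
    doc = parseString(xml_text)
    books = {}
    for book in doc.getElementsByTagName("book"):
        titles = book.getElementsByTagName("title")
        books[book.getAttribute("isbn")] = {
            "title": get_text(titles[0]) if titles else "",
            "authors": [
                get_text(author)
                for author in book.getElementsByTagName("author")
            ],
        }
    return books


books = read_books(SAMPLE)
for isbn, info in books.items():
    print(isbn, info["title"], info["authors"])
```

Factoring the text extraction into one helper avoids repeating the inner childNodes loop for every element type.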

SAX

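SAX works very differently from DOM. Instead of building a tree in memory, it reads the XML file sequentially and reports each element to your code as it is encountered, so it still performs well on XML files larger than 2 MB. The drawback is that it is more complex to use.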
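To see what this sequential processing looks like, the sketch below records the callbacks that the parser fires for a tiny document. EventLogger is a hypothetical class written for this illustration, and the one-book input string is made up; xml.sax.parseString and the ContentHandler methods are part of the standard library.

```python
import xml.sax
import xml.sax.handler


class EventLogger(xml.sax.handler.ContentHandler):
    """Record each SAX callback in the order the parser fires it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Ignore whitespace-only chunks; the parser may deliver the text
        # of one element in several chunks.
        if content.strip():
            self.events.append(("text", content))


logger = EventLogger()
xml.sax.parseString(
    b'<catalog><book isbn="1"><title>Python Core Book</title></book></catalog>',
    logger,
)
for event in logger.events:
    print(event)
```

Nothing is kept in memory except what the handler chooses to store, which is why this style scales to large files.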

We use the books.xml file from above again. First, define a BookHandler class that inherits from xml.sax.handler.ContentHandler and builds a dictionary, mapping, that maps each book's ISBN to its title.

import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.inTitle = False   # True while inside a <title> element
        self.mapping = {}      # ISBN -> title

    def startElement(self, name, attributes):
        if name == "book":
            self.buffer = ""
            self.isbn = attributes["isbn"]
        elif name == "title":
            self.inTitle = True

    def characters(self, data):
        # May be called several times for the text of one element.
        if self.inTitle:
            self.buffer += data

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
            self.mapping[self.isbn] = self.buffer

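Notes:

As SAX walks through the XML file, the startElement() method is called each time an element starts.

self.isbn = attributes["isbn"]: when the element name is book, the isbn attribute of the element is read.

The inTitle variable records whether the element currently being read is a title.

The characters() method receives the text contained between an element's tags.

The endElement() method is called when an element ends.

Then parse the XML file and print the dictionary. The script below imports the class from a module named bookhandler, so it assumes the BookHandler code above is saved as bookhandler.py.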

import xml.sax
import bookhandler
import pprint

parser = xml.sax.make_parser()
handler = bookhandler.BookHandler()
parser.setContentHandler(handler)
parser.parse("books.xml")
pprint.pprint(handler.mapping)
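For experimenting without separate files, the handler and the driver can be combined into one script that parses an in-memory document. The sketch below is a variant rather than the code above: CatalogHandler is a hypothetical name, it also collects authors, and it assumes every title and author element appears inside a book element that has an isbn attribute.

```python
import pprint
import xml.sax
import xml.sax.handler

# Shortened copy of books.xml, kept in memory for the example.
SAMPLE = b"""<catalog>
  <book isbn="1-56592-724-9">
    <title>Python Core Book</title>
    <author>Eric</author>
    <author>Daniel</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Python Programming</title>
    <author>Norman</author>
  </book>
</catalog>"""


class CatalogHandler(xml.sax.handler.ContentHandler):
    """Collect {isbn: {"title": str, "authors": [str, ...]}}."""

    def __init__(self):
        super().__init__()
        self.books = {}
        self._isbn = None    # ISBN of the book currently being read
        self._field = None   # "title" or "author" while inside one
        self._chunks = []    # text chunks of the current field

    def startElement(self, name, attrs):
        if name == "book":
            self._isbn = attrs["isbn"]
            self.books[self._isbn] = {"title": "", "authors": []}
        elif name in ("title", "author"):
            self._field = name
            self._chunks = []

    def characters(self, content):
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._field:
            text = "".join(self._chunks)
            if name == "title":
                self.books[self._isbn]["title"] = text
            else:
                self.books[self._isbn]["authors"].append(text)
            self._field = None


handler = CatalogHandler()
xml.sax.parseString(SAMPLE, handler)
pprint.pprint(handler.books)
```

Collecting the chunks in a list and joining them in endElement() keeps the result correct even if the parser splits one element's text across several characters() calls.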
