Python XML parsing
What is XML?
XML stands for eXtensible Markup Language. You can learn more about it in this site's XML Tutorial.
XML is designed to transmit and store data.
XML is a set of rules for defining semantic tags. These tags divide a document into parts and identify those parts.
It is also a meta-markup language: it defines the syntax used to define other domain-specific, semantic, structured markup languages.
Parsing XML with Python
The two common XML programming interfaces are DOM and SAX. They process XML files in different ways, so they suit different situations.
Python offers three ways to parse XML: SAX, DOM, and ElementTree:
1. SAX (Simple API for XML)
The Python standard library includes a SAX parser. SAX uses an event-driven model: while parsing the XML, it fires events one at a time and calls user-defined callback functions to handle them.
2. DOM (Document Object Model)
DOM parses the XML data into a tree in memory; you work with the XML by operating on that tree.
3. ElementTree (element tree)
ElementTree works like a lightweight DOM with a convenient, friendly API. It is easy to code against, fast, and uses little memory.
Note: DOM has to map the XML data into a tree in memory, so it is slower and uses more memory. SAX reads the XML file as a stream, which is faster and uses less memory, but requires the user to implement callback functions (a handler).
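For comparison with the SAX and DOM approaches, here is a minimal sketch of the ElementTree style. It assumes a shortened inline document shaped like a small movie collection rather than a file on disk, and the variable names (XML_TEXT, summaries) are illustrative only.

```python
import xml.etree.ElementTree as ET

# A shortened inline sample (an assumption, used to keep the sketch self-contained).
XML_TEXT = """
<collection shelf="New Arrivals">
  <movie title="Enemy Behind">
    <type>War, Thriller</type>
    <year>2003</year>
  </movie>
  <movie title="Ishtar">
    <type>Comedy</type>
  </movie>
</collection>
"""

# Parse the whole string into an element tree and get its root element.
root = ET.fromstring(XML_TEXT)
print("Shelf:", root.get("shelf"))

summaries = []
for movie in root.findall("movie"):
    title = movie.get("title")                        # attribute lookup
    movie_type = movie.findtext("type")               # text of the <type> child
    year = movie.findtext("year", default="unknown")  # fallback when <year> is missing
    summaries.append((title, movie_type, year))
    print(title, "|", movie_type, "|", year)
```

Attributes are read with get() and child text with findtext(), so no callbacks or explicit node traversal are needed for simple lookups.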
This section uses the following sample XML document, movies.xml:
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>
Parsing XML with SAX in Python
SAX is an event-driven API.
Parsing an XML document with SAX involves two parts: the parser and the event handler.
The parser reads the XML document and sends events to the event handler, such as element-start and element-end events.
The event handler responds to those events and processes the XML data passed to it.
SAX is well suited to the following situations:
- 1. Processing large files.
- 2. Needing only part of a file's contents, or only specific information from the file.
- 3. Building your own object model.
To process XML with SAX in Python, first import the parse function from xml.sax and the ContentHandler class from xml.sax.handler.
ContentHandler class methods
characters(content) method
When it is called:
From the start of a line up to the first tag, if there are characters, content holds that string.
From one tag up to the next tag, if there are characters, content holds that string.
From a tag up to the line terminator, if there are characters, content holds that string.
The tag may be either a start tag or an end tag.
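Because the text of a single element may arrive in several characters() calls (for example, split at a line break), a handler is usually safer if it accumulates the pieces instead of keeping only the last one. The sketch below illustrates the idea; the TextCollector class and the inline sample are hypothetical and not part of the original example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers the full text of every <description> element."""

    def __init__(self):
        super().__init__()
        self.inside = False      # True while the parser is inside <description>
        self.pieces = []         # text chunks received for the current element
        self.descriptions = []   # completed, joined texts

    def startElement(self, name, attrs):
        if name == "description":
            self.inside = True
            self.pieces = []

    def characters(self, content):
        # May be called more than once per element, so append rather than assign.
        if self.inside:
            self.pieces.append(content)

    def endElement(self, name):
        if name == "description":
            self.descriptions.append("".join(self.pieces).strip())
            self.inside = False


collector = TextCollector()
xml.sax.parseString(
    b"<movie><description>Vash the\nStampede!</description></movie>",
    collector,
)
print(collector.descriptions)
```

Joining the chunks in endElement() gives the same result no matter how the parser chose to split the text.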
startDocument() method
Called when the document starts.
endDocument() method
Called when the parser reaches the end of the document.
startElement(name, attrs) method
Called when an XML start tag is encountered. name is the tag name, and attrs is a dictionary-like object holding the tag's attribute values.
endElement(name) method
Called when an XML end tag is encountered.
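To make the order of these callbacks concrete, the following sketch records every event it receives while parsing a tiny inline document. EventTracer is a hypothetical name chosen for illustration.

```python
import xml.sax


class EventTracer(xml.sax.ContentHandler):
    """Hypothetical handler that records each callback in the order it arrives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        # Copy the attribute names and values into a plain dict for display.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))


tracer = EventTracer()
xml.sax.parseString(b'<movie title="Ishtar"><format>VHS</format></movie>', tracer)
for event in tracer.events:
    print(event)
```

The start and end events nest in document order, which is what lets a handler keep track of where it currently is in the tree.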
make_parser method
The following function creates a new parser object and returns it.
xml.sax.make_parser([parser_list])
Parameter description:
- parser_list - optional; a list of parsers to try
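A parser object returned by make_parser() can also be driven by hand. The sketch below assumes the default expat-based parser, which supports incremental feed() and close() calls, and feeds it a document in arbitrary pieces as if it arrived from a stream. MovieCounter and the chunk boundaries are illustrative assumptions.

```python
import xml.sax


class MovieCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <movie> start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "movie":
            self.count += 1


parser = xml.sax.make_parser()   # no parser_list given: use the default parser
parser.setFeature(xml.sax.handler.feature_namespaces, False)
counter = MovieCounter()
parser.setContentHandler(counter)

# The chunk boundaries deliberately fall in the middle of tags.
chunks = [
    b'<collection><movie title="Tri',
    b'gun"/><movie title="Ish',
    b'tar"/></collection>',
]
for chunk in chunks:
    parser.feed(chunk)
parser.close()                   # signals the end of the input

print("movies seen:", counter.count)
```

Only the current chunk is held at any time, which is the property that makes SAX attractive for large inputs.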
parse method
The following function creates a SAX parser and parses an XML document:
xml.sax.parse(xmlfile, contenthandler[, errorhandler])
Parameter description:
- xmlfile - the XML file name (a file-like object is also accepted)
- contenthandler - must be a ContentHandler object
- errorhandler - if specified, must be a SAX ErrorHandler object
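As a hedged illustration of xml.sax.parse(), the sketch below parses two in-memory byte streams instead of files on disk. One is well formed and one is deliberately broken. It relies on the default error handler, which raises SAXParseException for malformed input. TitleCollector and the sample data are assumptions made for the example.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the title attribute of each <movie>."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "movie":
            self.titles.append(attrs["title"])


good = io.BytesIO(b'<collection><movie title="Trigun"/></collection>')
# In the broken document the <movie> element is never closed.
bad = io.BytesIO(b'<collection><movie title="Trigun"></collection>')

handler = TitleCollector()
xml.sax.parse(good, handler)
print(handler.titles)

error_message = None
try:
    xml.sax.parse(bad, TitleCollector())
except xml.sax.SAXParseException as exc:
    error_message = exc.getMessage()
    print("parse failed at line", exc.getLineNumber(), "-", error_message)
```

Passing your own ErrorHandler as the third argument would let you decide how such errors are reported instead of catching the exception.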
parseString method
The parseString function creates an XML parser and parses an XML string:
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
Parameter description:
- xmlstring - the XML string
- contenthandler - must be a ContentHandler object
- errorhandler - if specified, must be a SAX ErrorHandler object
Example: parsing XML with SAX in Python
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import xml.sax


class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Handle the element-start event
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Handle the element-end event
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Handle the character-data event
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Override the default ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse("movies.xml")
Running the code above produces the following output:
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
For the complete SAX API documentation, see the Python SAX APIs.
Parsing XML with xml.dom
The Document Object Model (DOM) is a W3C-recommended standard programming interface for processing Extensible Markup Language documents.
When a DOM parser parses an XML document, it reads the entire document at once and stores all of its elements in a tree structure in memory. You can then use the functions that DOM provides to read or modify the document's content and structure, and write the modified content back to an XML file.
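The read, modify, and write-back cycle described above can be sketched with xml.dom.minidom as follows. The inline document, the new shelf value, and the added rating element are assumptions chosen only for illustration.

```python
from xml.dom.minidom import parseString

# Parse a small inline document into an in-memory DOM tree.
dom = parseString('<collection shelf="New Arrivals"><movie title="Ishtar"/></collection>')
collection = dom.documentElement

# Modify an existing attribute on the root element.
collection.setAttribute("shelf", "Classics")

# Add a new <rating> element with text content under the first <movie>.
movie = collection.getElementsByTagName("movie")[0]
rating = dom.createElement("rating")
rating.appendChild(dom.createTextNode("PG"))
movie.appendChild(rating)

# Serialize the modified tree back to XML text.
updated_xml = collection.toxml()
print(updated_xml)
```

The resulting string can then be saved with ordinary file I/O to complete the round trip.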
In Python, xml.dom.minidom can be used to parse an XML document, as in the following example:
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # Renamed from type/format to avoid shadowing Python built-ins
    movie_type = movie.getElementsByTagName("type")[0]
    print("Type: %s" % movie_type.childNodes[0].data)
    movie_format = movie.getElementsByTagName("format")[0]
    print("Format: %s" % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName("rating")[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName("description")[0]
    print("Description: %s" % description.childNodes[0].data)
The program above produces the following output:
Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
For the complete DOM API documentation, see the Python DOM APIs.