python - Parse XML dump of a MediaWiki wiki
I am trying to parse an XML dump of Wiktionary, but I must be missing something, since I don't get any output.
This is a similar but shorter XML file:
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">
      <page>
        <title>bigoto</title>
        <ns>0</ns>
        <id>24840</id>
        <revision>
          <id>1171207</id>
          <parentid>743817</parentid>
          <timestamp>2011-12-18T19:26:42Z</timestamp>
          <contributor>
            <username>gnubotmarcoo</username>
            <id>14353</id>
          </contributor>
          <minor />
          <comment>[[wikizionario:bot|bot]]: sostituisco template {{[[template:in|in]]}}</comment>
          <text xml:space="preserve">== wikimarkups ==</text>
          <sha1>gji6wqnsy6vi1ro8887t3bikh7nb3fr</sha1>
          <model>wikitext</model>
          <format>text/x-wiki</format>
        </revision>
      </page>
    </mediawiki>
I am interested in parsing the content of the <title> element if the <ns> element equals 0.
This is my script:

    import xml.etree.ElementTree as ET

    tree = ET.parse('test.xml')
    root = tree.getroot()

    # Print the title of every page whose <ns> is 0
    for page in root.findall('page'):
        ns = int(page.find('ns').text)
        word = page.find('title').text
        if ns == 0:
            print(word)
I recommend using BeautifulSoup if you can, because it's so easy to use.
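    from bs4 import BeautifulSoup as bs

    # given your markup as the variable 'html'
    # (the "xml" parser requires lxml to be installed)
    soup = bs(html, "xml")
    pages = soup.find_all('page')
    for page in pages:
        # <ns> holds text, so compare against the string '0'
        if page.ns.text == '0':
            print(page.title.text)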
As far as I can tell here, there's no need to use int to convert the <ns> tag to an integer just to compare it against == 0. Comparing against the string '0' works just as well. It's even easier in this case, since you wouldn't have to deal with the conversion at all.