python - Parse XML dump of a MediaWiki wiki


I am trying to parse an XML dump of Wiktionary, but I must be missing something, since I don't get any output.

This is a similar but shorter XML file:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">
      <page>
        <title>bigoto</title>
        <ns>0</ns>
        <id>24840</id>
        <revision>
          <id>1171207</id>
          <parentid>743817</parentid>
          <timestamp>2011-12-18T19:26:42Z</timestamp>
          <contributor>
            <username>gnubotmarcoo</username>
            <id>14353</id>
          </contributor>
          <minor />
          <comment>[[wikizionario:bot|bot]]: sostituisco template {{[[template:in|in]]}}</comment>
          <text xml:space="preserve">== wikimarkups ==</text>
          <sha1>gji6wqnsy6vi1ro8887t3bikh7nb3fr</sha1>
          <model>wikitext</model>
          <format>text/x-wiki</format>
        </revision>
      </page>
    </mediawiki>

I am interested in parsing the content of the <title> element only if the <ns> element equals 0.

This is my script:

    import xml.etree.ElementTree as et

    tree = et.parse('test.xml')
    root = tree.getroot()

    for page in root.findall('page'):
        ns = int(page.find('ns').text)
        word = page.find('title').text
        if ns == 0:
            print(word)

I recommend using BeautifulSoup if you can, because it's very easy to use.

    from bs4 import BeautifulSoup as bs

    # given your XML as the variable 'html'
    soup = bs(html, "xml")
    pages = soup.find_all('page')
    for page in pages:
        # <ns> holds text, so compare against the string '0'
        if page.ns.text == '0':
            print(page.title.text)

As far as I can tell here, there is no need to use int to convert the <ns> tag to an integer and compare against == 0. Comparing against the string '0' works just as well, and is even easier in this case, since you don't have to deal with the conversion at all.
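As for why the ElementTree script prints nothing: the <mediawiki> root declares a default namespace (the xmlns attribute), so ElementTree stores every element under a qualified name such as {http://www.mediawiki.org/xml/export-0.8/}page, and root.findall('page') matches nothing. If you would rather stay with the standard library, a minimal sketch along these lines should work. The "mw" prefix, the helper name main_namespace_titles, and the second <page> in the sample (ns 10) are made up for illustration; only the namespace URI and the first page come from the file in the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the sample <mediawiki> root.
# The "mw" prefix is an arbitrary local alias, not something the dump defines.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.8/"}

# Trimmed-down sample; the second page is invented to show the filter working.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="it">
  <page>
    <title>bigoto</title>
    <ns>0</ns>
    <id>24840</id>
  </page>
  <page>
    <title>Template:in</title>
    <ns>10</ns>
    <id>1</id>
  </page>
</mediawiki>"""


def main_namespace_titles(xml_text):
    """Return the titles of all pages whose <ns> is 0 (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    titles = []
    # Every lookup has to be namespace-qualified, not only the first one.
    for page in root.findall("mw:page", NS):
        if page.findtext("mw:ns", namespaces=NS) == "0":
            titles.append(page.findtext("mw:title", namespaces=NS))
    return titles


print(main_namespace_titles(SAMPLE))  # → ['bigoto']
```

For a file on disk, ET.parse('test.xml').getroot() can replace ET.fromstring(...). If the dump uses a different export version, the URI in NS has to match whatever the root element declares.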

