python - Parse XML dump of a MediaWiki wiki -

April 15, 2013

I am trying to parse an XML dump of Wiktionary, but I must be missing something, since I don't get any output.

This is a similar but shorter XML file:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">
  <page>
    <title>bigoto</title>
    <ns>0</ns>
    <id>24840</id>
    <revision>
      <id>1171207</id>
      <parentid>743817</parentid>
      <timestamp>2011-12-18T19:26:42Z</timestamp>
      <contributor>
        <username>gnubotmarcoo</username>
        <id>14353</id>
      </contributor>
      <minor />
      <comment>[[wikizionario:bot|bot]]: sostituisco template {{[[template:in|in]]}}</comment>
      <text xml:space="preserve">== wikimarkups ==</text>
      <sha1>gji6wqnsy6vi1ro8887t3bikh7nb3fr</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
</mediawiki>

I am interested in extracting the content of the <title> element whenever the <ns> element equals 0.

This is my script:

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

for page in root.findall('page'):
    ns = int(page.find('ns').text)
    word = page.find('title').text
    if ns == 0:
        print(word)

I recommend using BeautifulSoup if you can, because it's very easy to use.

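from bs4 import BeautifulSoup as BS

# Read the dump into the variable 'html' (file name taken from the question).
with open('test.xml') as f:
    html = f.read()

# The "xml" parser requires the lxml package to be installed.
soup = BS(html, "xml")
pages = soup.find_all('page')
for page in pages:
    # <ns> holds text, so compare against the string '0'.
    if page.ns.text == '0':
        print(page.title.text)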

As far as I can tell, there is no need to use int to convert the <ns> text to an integer and compare it against 0. Comparing against the string '0' works just as well, and is even easier in this case, since you don't have to deal with the conversion at all.
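
If you would rather stay with the standard library, the reason the ElementTree script prints nothing appears to be the default namespace declared on the <mediawiki> root: every child element is really named {http://www.mediawiki.org/xml/export-0.8/}page, {…}ns and so on, so an unqualified findall('page') matches nothing. Below is a minimal sketch of a namespace-aware version. The "mw" prefix, the helper name main_namespace_titles, the trimmed inline sample, and its second page (Template:in with ns 10) are my own assumptions for illustration; they are not part of the original dump.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the sample <mediawiki> root.
# The "mw" prefix is an arbitrary local alias (an assumption of this sketch).
NS = {"mw": "http://www.mediawiki.org/xml/export-0.8/"}

# Trimmed copy of the sample, inlined so the sketch runs on its own.
# The second page is hypothetical, added only to show the filter at work.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="it">
  <page>
    <title>bigoto</title>
    <ns>0</ns>
    <id>24840</id>
  </page>
  <page>
    <title>Template:in</title>
    <ns>10</ns>
    <id>1</id>
  </page>
</mediawiki>"""


def main_namespace_titles(root):
    """Return the titles of all pages whose <ns> text is '0'."""
    titles = []
    for page in root.findall("mw:page", NS):
        # Compare as a string, as suggested above; no int() conversion needed.
        if page.findtext("mw:ns", namespaces=NS) == "0":
            titles.append(page.findtext("mw:title", namespaces=NS))
    return titles


root = ET.fromstring(SAMPLE)
print(root.findall("page"))         # [] : the unqualified name matches nothing
print(main_namespace_titles(root))  # ['bigoto']
```

To run this against the real file, replace ET.fromstring(SAMPLE) with ET.parse('test.xml').getroot(). If your dump declares a different export version in its xmlns attribute, the URI in NS has to be changed to match.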
