Perceived mismatch between hand-rolled re and XPath selector in scrapy shell

May 15, 2011

I've opened the Scrapy shell on the URL I want, and I'm trying to select all instances of p tags such that:

<div class="foo"><p>blah</p></div>

But there seems to be a mismatch, and I can't get all instances of the tags.

In [12]: len(hxs.re("<div class=\"foo"))
Out[12]: 13

In [13]: len(hxs.select('//div[contains(@class, "foo")]'))
Out[13]: 1

And in fact, I can't get a full account of the p tags with XPath at all...

In [14]: len(hxs.select('//p'))
Out[14]: 6

What am I missing? I thought line [14] would give me all instances of p tags in the document.

The HTML I was trying to select was embedded in a script block, so it wasn't considered valid HTML by XPath. This seems to be a common issue for new Scrapy users when the page has AJAX/JavaScript content, detectable by a hashtag in the URI: http://example.com/content1#slide1

All of the content resides in the HTML code, but the browser needs to run the JavaScript to populate whatever content the hashtag points to into the DOM itself, which is what XPath/BS4 look for.
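To make the mismatch concrete, here is a minimal sketch that reproduces it outside Scrapy, using only the Python standard library. The page content, the `TagCounter` class, and the `count_tags` helper are all made up for illustration. The point is just that a parser treats everything inside a script block as raw text, while a regex over the raw source does not care.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one real div, plus two more hidden inside a script block.
PAGE = """
<html><body>
<div class="foo"><p>visible</p></div>
<script type="text/template">
<div class="foo"><p>hidden one</p></div>
<div class="foo"><p>hidden two</p></div>
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts start tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(markup):
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


# The regex scans the raw source, so it also matches inside the script block.
regex_hits = len(re.findall(r'<div class="foo', PAGE))

# The parser treats script content as plain text and reports no tags from it.
parsed = count_tags(PAGE)

print("regex:", regex_hits)
print("parser divs:", parsed.get("div", 0), "parser ps:", parsed.get("p", 0))
```

With this sample page the regex finds three matches while the parser reports a single div and a single p, which is the same kind of gap as the 13 versus 1 in the shell session above.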

It will, however, be pullable with regular expressions, if you're bold (hacky) enough. I'm considering other alternatives... like making a new XML DOM out of the contents of the script block.
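A rough sketch of that second idea, again with only the standard library: use a regex just to cut out the script bodies, then hand each body to a fresh parser as if it were its own document. The sample page, `ParagraphText`, and `paragraphs_in_scripts` are hypothetical names for illustration, and this assumes the script body really contains plain HTML markup rather than JavaScript string literals that would need unescaping first.

```python
import re
from html.parser import HTMLParser

# Hypothetical page with markup hidden inside a script block.
PAGE = """
<html><body>
<div class="foo"><p>visible</p></div>
<script type="text/template">
<div class="foo"><p>hidden one</p></div>
<div class="foo"><p>hidden two</p></div>
</script>
</body></html>
"""


class ParagraphText(HTMLParser):
    """Collects the text of every p element in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.texts[-1] += data


def paragraphs_in_scripts(page):
    # The regex only slices out the raw script bodies; parsing is left
    # to a real parser, one fresh instance per body.
    bodies = re.findall(r"<script\b[^>]*>(.*?)</script>", page,
                        flags=re.S | re.I)
    texts = []
    for body in bodies:
        parser = ParagraphText()
        parser.feed(body)
        parser.close()
        texts.extend(parser.texts)
    return texts


hidden = paragraphs_in_scripts(PAGE)
print(hidden)
```

Inside Scrapy the same extracted string could presumably be handed to a new selector built from text instead of a hand-written parser, so the usual XPath queries would work on it.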
