Perceived mismatch between hand-rolled re and XPath selector in scrapy shell
I've opened a scrapy shell on the URL I want, and I'm trying to select all instances of p tags such that:
<div class="foo"><p>blah</p></div>
But there seems to be a mismatch, and I can't get all instances of the tags.
In [12]: len(hxs.re("<div class=\"foo"))
Out[12]: 13
In [13]: len(hxs.select('//div[contains(@class, "foo")]'))
Out[13]: 1
And in fact, I can't get a full account of the p tags with XPath at all...
In [14]: len(hxs.select('//p'))
Out[14]: 6
What am I missing? I thought line [14] would give me all instances of p tags in the document.
The HTML I was trying to select was embedded in a script block, so XPath didn't treat it as valid HTML. This seems to be a common issue for new Scrapy users when the page has AJAX/JavaScript content, detectable by a hashtag in the URI: http://example.com/content1#slide1
All of the content resides in the HTML code, but the browser needs to run JavaScript to populate whatever content the hashtag points to into the DOM itself, and the DOM is what XPath/bs4 work on.
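To make the mismatch concrete, here is a minimal sketch using only the Python standard library. The PAGE string is a made-up stand-in for the real page, and html.parser stands in for Scrapy's selector (an assumption, not what the shell session above ran). The point is only that markup inside a script element reaches the parser as plain text, so a raw regex sees it while an element-based count does not.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one real <div class="foo"> in the body,
# plus two more that only exist as text inside a <script> block.
PAGE = """
<html><body>
<div class="foo"><p>visible</p></div>
<script type="text/template">
<div class="foo"><p>slide 1</p></div>
<div class="foo"><p>slide 2</p></div>
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts the start tags the parser recognises as real elements."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(PAGE)
counter.close()

# The regex scans the raw source, script contents included.
regex_hits = len(re.findall(r'<div class="foo"', PAGE))
# The parser only reports elements outside the script block.
parsed_divs = counter.counts.get("div", 0)
parsed_ps = counter.counts.get("p", 0)

print("regex:", regex_hits, "parsed divs:", parsed_divs, "parsed p:", parsed_ps)
```

The regex count comes out higher than the parsed counts, which is the same shape of discrepancy as lines [12] and [13] in the question.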
It will, however, be pullable with regular expressions, if you're bold (hacky) enough. I'm considering other alternatives, such as making a new XML DOM out of the contents of the script block.
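A rough sketch of the "new DOM out of the script block" idea, again with the standard library only. The PAGE string and the class and function names (ScriptCollector, ParagraphText, paragraphs_in_scripts) are hypothetical and just for illustration. It assumes the script block holds plain HTML markup; if the markup is wrapped in JavaScript strings or JSON, it would need unwrapping first.

```python
from html.parser import HTMLParser

# Hypothetical page: the slide markup lives inside a <script> block.
PAGE = """
<html><body>
<div class="foo"><p>visible</p></div>
<script type="text/template">
<div class="foo"><p>slide 1</p></div>
<div class="foo"><p>slide 2</p></div>
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collects the raw text found inside each <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.blocks[-1] += data


class ParagraphText(HTMLParser):
    """Collects the text of every <p> element it is fed."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs_in_scripts(html):
    """Re-parse each script block as its own document and return its <p> texts."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for block in collector.blocks:
        inner = ParagraphText()
        inner.feed(block)
        inner.close()
        found.extend(inner.paragraphs)
    return found


slides = paragraphs_in_scripts(PAGE)
print(slides)
```

In Scrapy itself the equivalent would presumably be to pull the script text out with an XPath such as //script/text() and build a fresh selector from that string. That depends on how the real page stores its content, which the post doesn't show.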