How do I check if a page is HTML5-based in Python?

May 15, 2011

I'm trying to parse various pages on the web using the lxml module, like this:

    def dom(self):
        return lxml.html.fromstring(self.content)

But it seems I have to switch from lxml.html to lxml.html.html5parser in the case of HTML5 pages.

http://lxml.de/html5parser.html

So how can I determine whether a page is HTML5-based? Do I have to check the doctype char by char before I parse it?

Edit: I made a simple regexp to deal with the problem. It seems to work, but I'm still looking for neater ways. Note that this solution breaks sourceline.

    import re

    import lxml.html
    from lxml.html import html5parser

    def dom(self):
        content = self.content
        if self._is_html5():
            # Parse with the HTML5 parser, then serialize back to plain HTML
            elm = html5parser.fromstring(content)
            content = lxml.html.tostring(elm, method='html')
        return lxml.html.fromstring(content)

    def _is_html5(self):
        return bool(re.match(r'^<!doctype html>', self.content, re.I))
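The regexp above only matches when the doctype is the very first thing in the content. As a rough sketch of a more tolerant check, the helper below skips a leading byte-order mark, whitespace, and comments, and also accepts the `about:legacy-compat` form of the HTML5 doctype. The name `is_html5` and the pattern are made up for illustration and are not part of lxml. Bytes input is assumed to be in an ASCII-compatible encoding such as UTF-8.

```python
import re

# Illustrative pattern, not an lxml API: matches the HTML5 doctype near
# the start of a document.
_HTML5_DOCTYPE = re.compile(
    r"""\s*(?:<!--.*?-->\s*)*        # leading whitespace and comments
        <!doctype\s+html             # doctype keyword and root element name
        (?:\s+system\s+(["'])about:legacy-compat\1)?   # optional legacy string
        \s*>                         # nothing else allowed before '>'
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def is_html5(content):
    """Return True if content appears to start with an HTML5 doctype."""
    if isinstance(content, bytes):
        # Assumption: the prolog is ASCII-compatible (e.g. UTF-8 or Latin-1);
        # only the first bytes are needed to see the doctype.
        content = content[:1024].decode("utf-8", "replace")
    # Drop a byte-order mark if one is present.
    content = content.lstrip("\ufeff")
    return _HTML5_DOCTYPE.match(content) is not None
```

Older doctypes that carry a PUBLIC identifier, such as the HTML 4.01 ones, are rejected because the pattern allows nothing but the optional legacy string between `html` and `>`.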

You don't have to switch to html5parser only for HTML5 files. You can, and should, use html5parser for all HTML files. Browsers use an HTML5-compatible parser for all HTML files, regardless of version.
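In practice that means dropping the doctype check and always going through html5parser. A minimal sketch, assuming the html5lib package is installed (lxml.html.html5parser depends on it) and that the content is already decoded text. The function name `parse_any_html` is hypothetical.

```python
def parse_any_html(content):
    """Parse any HTML document (str) with the HTML5 parsing algorithm."""
    # Imported lazily so this module still loads when html5lib is missing;
    # lxml.html.html5parser needs the separate html5lib package.
    from lxml.html import html5parser

    # By default html5parser puts elements in the XHTML namespace, so tags
    # look like '{http://www.w3.org/1999/xhtml}html'. Turning that off
    # keeps plain tag names such as 'html' and 'body'.
    parser = html5parser.HTMLParser(namespaceHTMLElements=False)
    return html5parser.document_fromstring(content, parser=parser)
```

The same call works for an HTML 4.01 or doctype-less page, so no version sniffing is needed. Whether the resulting elements support everything you rely on from lxml.html.fromstring is worth checking for your own use.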
