How do I check if a page is HTML5-based in Python?
I'm trying to parse various pages on the web using the lxml module, like:

    def dom(self):
        return lxml.html.fromstring(self.content)

But it seems I have to switch from lxml.html to lxml.html.html5parser in the case of HTML5 pages.
http://lxml.de/html5parser.html
So how can I determine if a page is HTML5-based? Do I have to check the doctype char by char before I parse it?
Edit: I made a simple regexp to deal with the problem. It seems to work, but yeah, I'm still looking for neater ways. Note that this solution breaks the sourceline method.
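    import re

    import lxml.html
    from lxml.html import html5parser

    def dom(self):
        content = self.content
        if self._is_html5():
            # Parse with the HTML5 parser, then serialize back to markup
            # so the result can be re-parsed into lxml.html elements.
            elm = html5parser.fromstring(content)
            content = lxml.html.tostring(elm, method='html')
        return lxml.html.fromstring(content)

    def _is_html5(self):
        # The HTML5 doctype is the short form "<!DOCTYPE html>", case-insensitive.
        return bool(re.match(r'^<!doctype html>', self.content, re.I))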
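For what it's worth, the check above fails if the page starts with a byte order mark or blank lines, or if the content arrives as bytes. A slightly more tolerant version might look like the sketch below. The function name `is_html5`, the `encoding` parameter, and the 1024-byte sniffing window are assumptions made for illustration, not anything lxml provides.

```python
import re

# The short HTML5 doctype, optionally preceded by a BOM and whitespace.
# The optional group accepts the legacy form
# <!DOCTYPE html SYSTEM "about:legacy-compat">.
_HTML5_DOCTYPE = re.compile(
    r'^\ufeff?\s*<!doctype\s+html\s*'
    r'(?:system\s+(["\'])about:legacy-compat\1\s*)?>',
    re.IGNORECASE,
)


def is_html5(content, encoding="utf-8"):
    """Return True if content starts with the HTML5 doctype.

    content may be str or bytes. For bytes, only the first 1024 bytes
    are decoded (an arbitrary window chosen for this sketch).
    """
    if isinstance(content, bytes):
        content = content[:1024].decode(encoding, errors="replace")
    return bool(_HTML5_DOCTYPE.match(content))
```

This only sniffs the doctype. It does not handle comments placed before the doctype, and older doctypes such as HTML 4.01 are reported as not HTML5.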
You don't have to switch to html5parser only for HTML5 files. You can, and should, use html5parser for all HTML files. Browsers use an HTML5-compatible parser for all HTML files regardless of version.
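As a rough sketch of that approach: skip the doctype check and send every page through html5parser. The helper name `parse_html` is made up for this example. Note that lxml.html.html5parser depends on the separate html5lib package, so the import is guarded here.

```python
try:
    # html5parser ships with lxml but needs the html5lib package installed.
    from lxml.html import html5parser
except ImportError:
    html5parser = None


def parse_html(content):
    """Parse any HTML document, HTML5 or older, with the HTML5 parser.

    Returns the root element of the parsed document.
    """
    if html5parser is None:
        raise RuntimeError("lxml and html5lib are required for html5parser")
    return html5parser.document_fromstring(content)
```

With this, an HTML 4.01 page and an HTML5 page go through the same code path, so no `_is_html5` check is needed. The returned elements may not be the same element classes that lxml.html.fromstring produces, so code relying on lxml.html-specific methods may need adjusting.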