How do I check if a page is HTML5-based in Python?


I'm trying to parse various pages on the web using the lxml module, like this:

    def dom(self):
        return lxml.html.fromstring(self.content)

But it seems I have to switch from lxml.html to lxml.html.html5parser in the case of HTML5 pages.

http://lxml.de/html5parser.html

So how can I determine whether a page is HTML5-based? Do I have to check the doctype character by character before I parse it?


Edit: I made a simple regexp to deal with the problem. It seems to work, but I'm still looking for neater ways. Note that this solution breaks sourceline.

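    import re

    import lxml.html
    from lxml.html import html5parser

    def dom(self):
        content = self.content
        if self._is_html5():
            # Parse with the HTML5 parser, then serialize so the result
            # can be re-parsed into regular lxml.html elements.
            elm = html5parser.fromstring(content)
            content = lxml.html.tostring(elm, method='html')
        return lxml.html.fromstring(content)

    def _is_html5(self):
        # Case-insensitive match on the short HTML5 doctype at the very start.
        return bool(re.match(r'^<!doctype html>', self.content, re.I))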
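The regexp above only matches when the doctype is the very first thing in the content and is spaced exactly as written. As a rough sketch of a more tolerant check, the helper below (the name `is_html5` is hypothetical, not part of lxml) also accepts a byte-order mark, leading whitespace or comments, extra whitespace inside the doctype, and bytes input. It assumes that only the short `<!DOCTYPE html>` form should count as HTML5.

```python
import re

# Hypothetical helper, not part of lxml. Assumption: only the short
# "<!DOCTYPE html>" form is treated as an HTML5 doctype.
_HTML5_DOCTYPE = re.compile(
    r'\s*(?:<!--.*?-->\s*)*<!doctype\s+html\s*>',
    re.IGNORECASE | re.DOTALL,
)

def is_html5(content):
    """Return True if the markup starts with the short HTML5 doctype."""
    # Only the beginning of the document matters for this check.
    head = content[:1024]
    if isinstance(head, bytes):
        # Dropping non-ASCII bytes also discards a UTF-8 byte-order mark.
        head = head.decode('ascii', errors='ignore')
    # Remove a byte-order mark that survived decoding to str.
    head = head.lstrip('\ufeff')
    return _HTML5_DOCTYPE.match(head) is not None
```

In the class above, `_is_html5` could simply return `is_html5(self.content)`.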

You don't have to switch to html5parser only for HTML5 files. You can, and should, use html5parser for all HTML files. Browsers use an HTML5-compatible parser for all HTML files, regardless of version.

