unicode - Avoid non-printable character in HTML file written by Python
I'm trying to convert SPSS syntax files to readable HTML. It's working, except that a (single) non-printable character is inserted into the HTML file. It doesn't seem to have an ASCII code and looks like a tiny dot, and it's causing trouble.
It occurs (only) in the second line of the HTML file, corresponding to the first line of the original file. This hints at which line(s) of Python cause the problem (please see the comments).
The code that seems to cause it is:
rfil = open(fil, "r")     # rfil = read file, original syntax
wfil = open(txtfil, "w")  # wfil = write file, HTML output
# line below causes problem??
wfil.write("<ol class='code'>\n<li>")
cnt = 0
for line in rfil:
    if cnt == 0:
        # line below causes problem??
        wfil.write(line.rstrip("\n").replace("'", '&#39;').replace('"', '&quot;'))
    elif len(line) > 1:
        wfil.write("</li>\n<li>" + line.strip("\n").replace("'", '&#39;').replace('"', '&quot;'))
    else:
        wfil.write("<br /><br />")
    cnt += 1
wfil.write("</li>\n</ol>")
wfil.close()
rfil.close()
[Screenshot of the result]
The input file seems to begin with a byte order mark (BOM), which indicates UTF-8 encoding. You can decode the file into Unicode strings by opening it with
import codecs
rfil = codecs.open(fil, "r", "utf_8_sig")
The utf_8_sig encoding skips the BOM at the beginning of the file.
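A quick way to see the difference between the two codecs is to decode a few bytes by hand. This is only a minimal sketch: the `raw` value is a made-up sample, not taken from the original file.

```python
import codecs

# Hypothetical sample: a UTF-8 BOM followed by the start of a syntax line.
raw = codecs.BOM_UTF8 + b"GET FILE."

with_bom = raw.decode("utf_8")         # keeps the BOM as the character U+FEFF
without_bom = raw.decode("utf_8_sig")  # drops a leading BOM if one is present

print(repr(with_bom[0]))
print(without_bom)
```

The stray U+FEFF kept by plain `utf_8` is most likely the "tiny dot" that ends up in the HTML output.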
Some programs recognize the BOM, others don't. To write the file out without a BOM, use
wfil = codecs.open(txtfil, "w", "utf_8")
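Putting both pieces together, the conversion could look roughly like the sketch below. It assumes Python 3, where the built-in `open` accepts an `encoding` argument with the same codec names, so it uses that in place of `codecs.open`. The names `syntax_to_html` and `escape_quotes` are hypothetical, and the loop is a simplified version of the one in the question (every line becomes one list item).

```python
def escape_quotes(text):
    # Escape quotes as HTML entities (assumed to be the intent of the
    # replace() calls in the question).
    return text.replace("'", "&#39;").replace('"', "&quot;")


def syntax_to_html(src_path, dst_path):
    # utf_8_sig drops a leading BOM while reading, if there is one.
    with open(src_path, "r", encoding="utf_8_sig") as rfil:
        lines = [line.rstrip("\n") for line in rfil]

    # Plain utf_8 never writes a BOM.
    with open(dst_path, "w", encoding="utf_8") as wfil:
        wfil.write("<ol class='code'>\n")
        for line in lines:
            wfil.write("<li>" + escape_quotes(line) + "</li>\n")
        wfil.write("</ol>")
```

Called as `syntax_to_html(fil, txtfil)`, this should produce an HTML file that starts directly with `<ol class='code'>`, with no BOM in front of it or inside the first list item.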