encoding - How to remove non-printable/invisible characters in Ruby?
Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so I must make my program receive them well instead of trying to change the source of the problem.
For example, they can have a zero width no-break space in the middle of the string. While parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. While everything seems correct, inspecting it with irb shows:
"he is a \uFEFFman of god".codepoints => [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100]
I believe I know what a BOM is, and I even handle it nicely. However, sometimes I have such characters in the middle of the file, so it is not a BOM.
My current approach is to remove all the characters I found evil in a very smelly fashion:
text = (text.codepoints - codepoints_blacklist).pack("U*")
The closest I got was following this post, which led me to the :print: option on regexps. No good for me:
"\uFEFFm".scan(/[[:print:]]/).join.codepoints => [65279, 109]
So the question is: how can I remove all non-printable characters from a string in Ruby?
Ruby can help you convert from one multi-byte character set to another. Check these search results, plus read up on Ruby String's encode method.
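As a rough illustration of what encode does, here is a minimal sketch that transcodes a Latin-1 string to UTF-8. The sample bytes and variable names are made up for the example; only String#encode and String#force_encoding come from Ruby itself.

```ruby
# Bytes for "café" as they would appear in an ISO-8859-1 (Latin-1) file.
# The sample data is an assumption made up for this sketch.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)

# String#encode transcodes the characters into the target encoding.
utf8 = latin1.encode(Encoding::UTF_8)

puts utf8.encoding     # UTF-8
puts utf8.bytes.length # 5, because "é" takes two bytes in UTF-8
```

Note that force_encoding only relabels the bytes, while encode actually rewrites them.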
Also, Ruby's Iconv is your friend.
Finally, James Grey wrote a series of articles which cover this in good detail.
One of the things you can do using those tools is to tell them to transcode to a visually similar character, or to ignore them completely.
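A minimal sketch of both options using String#encode, assuming the target is plain ASCII. The sample text and the look-alike table are invented for illustration; undef:, replace: and fallback: are standard encode options.

```ruby
# Sample input with an accented letter and curly quotes (made up for this sketch).
text = "caf\u00E9 \u201Cquoted\u201D"

# Option 1: ignore characters the target encoding cannot represent
# by replacing them with an empty string.
ignored = text.encode("US-ASCII", undef: :replace, replace: "")

# Option 2: transcode to visually similar characters through a
# hypothetical look-alike table passed as the fallback.
LOOKALIKES = { "\u00E9" => "e", "\u201C" => '"', "\u201D" => '"' }
similar = text.encode("US-ASCII", fallback: LOOKALIKES)

puts ignored
puts similar
```

This only affects characters that do not exist in the target encoding. A character that is valid in both encodings passes through unchanged.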
Dealing with alternate character sets is one of the most... irritating things I've ever had to do, because files can contain anything, but be marked as text. You might not expect it, and then your code dies or starts throwing errors, because people are so ingenious when coming up with ways to insert alternate characters into content.
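For the specific case in the question, where the text stays in UTF-8 and U+FEFF is a perfectly valid character, transcoding alone will not drop it. One possible sketch is to strip characters by Unicode category instead. The helper name strip_invisible and the choice of categories (Cf for format characters, Cc for control characters, keeping newline, carriage return and tab) are assumptions, not something from the original posts.

```ruby
# Hypothetical pattern: format (Cf) and control (Cc) characters,
# except newline, carriage return and tab, which are kept.
INVISIBLE = /[\p{Cf}\p{Cc}&&[^\n\r\t]]/

# Hypothetical helper that returns a copy of text without those characters.
def strip_invisible(text)
  text.gsub(INVISIBLE, "")
end

dirty = "he is a \uFEFFman of god"
clean = strip_invisible(dirty)

puts clean.codepoints.include?(0xFEFF) # false
```

Be aware that the Cf category also contains zero width joiners and bidirectional marks, which some scripts and emoji sequences rely on, so a narrower list may be safer for multilingual input.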