python - Find the length of a sentence with English words and Chinese characters
The sentence may include non-English characters, e.g. Chinese:
你好,hello world
The expected length is 5
(2 Chinese characters, 2 English words, and 1 comma).
You can use the fact that most Chinese characters are located in the Unicode range 0x4e00 - 0x9fcc:
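    # Python 3 version of the snippet (str is already Unicode, so no decode step is needed)
    import re

    s = '你好 hello, world'

    # First find all 'normal' (ASCII) words and punctuation marks.
    # '[\x21-\x2f]' covers the ASCII punctuation from '!' to '/';
    # change it to ',' if you only need to match commas.
    # re.ASCII keeps \w from also matching the Chinese characters.
    count = len(re.findall(r'\w+|[\x21-\x2f]', s, re.ASCII))

    # Then count every Chinese character individually.
    for ch in s:
        # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
        if 0x4e00 <= ord(ch) <= 0x9fcc:
            count += 1

    print(count)  # 5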
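The same count can also be obtained with a single regular expression, which avoids the separate loop over characters. This is only a minimal sketch, not the answer's original code: the names `sentence_length` and `TOKEN` are hypothetical, and it assumes Python 3, ASCII-only English words, and the same 0x4e00 - 0x9fcc range mentioned above.

```python
import re

# One alternative per token kind (names here are illustrative only):
#   [\u4e00-\u9fcc]  a single Chinese character in the range mentioned above
#   [A-Za-z0-9_]+    a run of ASCII word characters, counted as one English word
#   [\x21-\x2f]      a single ASCII punctuation mark from '!' to '/'
TOKEN = re.compile(r'[\u4e00-\u9fcc]|[A-Za-z0-9_]+|[\x21-\x2f]')


def sentence_length(sentence):
    """Count Chinese characters, English words and punctuation marks."""
    return len(TOKEN.findall(sentence))


print(sentence_length('你好,hello world'))   # 5
print(sentence_length('你好 hello, world'))  # 5
```

Because each Chinese character is its own alternative in the pattern, it is counted once per character, while a run of ASCII letters is counted once per word. If other scripts or full-width punctuation need to be counted, further ranges would have to be added to the pattern.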