python - Find the length of a sentence with English words and Chinese characters


The sentence may include non-English characters, e.g. Chinese:

你好,hello world 

The expected length is 5 (2 Chinese characters, 2 English words, and 1 comma).

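You can use the fact that Chinese characters are located in the Unicode range 0x4e00 - 0x9fcc. Count the ASCII words and punctuation marks with a regular expression, then add one for each character that falls in that range.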

# -*- coding: utf-8 -*-
import re

s = '你好 hello, world'
s = s.decode('utf-8')  # Python 2: decode the byte string to unicode

# First find all 'normal' words and punctuation.
# '[\x21-\x2f]' covers most ASCII punctuation; change it to ',' if you
# only need to match commas.
count = len(re.findall(r'\w+|[\x21-\x2f]', s))

for ch in s:
    # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print count
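The snippet above is written for Python 2. In Python 3, strings are already Unicode and `\w` matches Chinese characters by default, so the same idea needs a small adjustment. The following is a minimal sketch under those assumptions: the function name `count_units` is hypothetical, and `re.ASCII` is used here to keep `\w` limited to ASCII so that Chinese characters are not also counted as words.

```python
import re

# ASCII words, or single ASCII punctuation marks in the range 0x21-0x2f.
# re.ASCII stops \w from matching Chinese characters in Python 3.
ASCII_TOKEN = re.compile(r'\w+|[\x21-\x2f]', re.ASCII)


def count_units(s):
    """Count ASCII words, ASCII punctuation marks, and Chinese characters.

    Hypothetical helper: only the range 0x4e00-0x9fcc is treated as Chinese.
    """
    count = len(ASCII_TOKEN.findall(s))
    # Each character in the range counts as one unit on its own.
    count += sum(1 for ch in s if 0x4e00 <= ord(ch) <= 0x9fcc)
    return count


print(count_units('你好,hello world'))  # 5
```

Both the sentence from the question and the variant used in the answer give 5 here: two Chinese characters, two English words, and one comma.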
