3

I need to count the number of words in a UTF-8 string, i.e. I need to write a Python function that takes "एक बार,एक कौआ, बहुत प्यासा, था" as input and returns 7 (the number of words).

I tried the regular expression "\b" as shown below, but the results are inconsistent.

import re
wordCntExp = re.compile(ur'\b', re.UNICODE)
sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
# count the \b matches and halve them (two boundaries per word)
print len(wordCntExp.findall(sen.decode('utf-8'))) >> 1
12

Any interpretation of the above result, or any other approach to solving this problem, would be appreciated.


3 Answers

4

Try using:

import re
words = re.split(ur"[\s,]+", sen.decode('utf-8'), flags=re.UNICODE)
count = len(words)

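This splits the words on whitespace and commas. You can add any other characters that should not count as part of a word to the character class in the first argument.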
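A minimal sketch of the same idea, assuming Python 3 (where `str` is already Unicode, so no `decode` call or `ur` prefix is needed). The function name `count_words` and the `separators` parameter are made up for illustration. It also drops the empty strings that `re.split` returns when the text starts or ends with a separator, which would otherwise inflate the count.

```python
import re

def count_words(sentence, separators=r"[\s,]+"):
    # Split on runs of whitespace and commas (extend the class as needed).
    tokens = re.split(separators, sentence)
    # re.split yields '' at the edges when the text starts or ends
    # with a separator, so keep only the non-empty tokens.
    return len([t for t in tokens if t])

sen = "एक बार,एक कौआ, बहुत प्यासा, था"
print(count_words(sen))
```

For other punctuation, something like `count_words(sen, r"[\s,.;!?]+")` should work the same way.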

Inspired by the Python re documentation.

answered 2013-07-16T08:48:14.380
0

Using the regex module:

>>> import regex
>>> sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
>>> regex.findall(ur'\w+', sen.decode('utf-8'))
[u'\u090f\u0915', u'\u092c\u093e\u0930', u'\u090f\u0915', u'\u0915\u094c\u0906', u'\u092c\u0939\u0941\u0924', u'\u092a\u094d\u092f\u093e\u0938\u093e', u'\u0925\u093e']
>>> len(regex.findall(ur'\w+', sen.decode('utf-8')))
7
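A possible reading of the 12 in the question (my interpretation, not something stated in this thread): the standard `re` module does not treat Devanagari vowel signs and the virama (Unicode categories Mc and Mn) as `\w` characters, so `\b` also matches inside words, and cutting the sample at those marks leaves 12 runs of letters. If installing the third-party `regex` module is not an option, one standard-library sketch is to classify characters by Unicode category yourself. The code below assumes Python 3, and the names `is_word_char` and `count_words` are hypothetical helpers, not part of any library.

```python
import unicodedata

def is_word_char(ch):
    # Treat letters (L*), combining marks (M*) and numbers (N*) as
    # word characters, so vowel signs and the virama stay inside a word.
    return unicodedata.category(ch)[0] in "LMN"

def count_words(text):
    count = 0
    in_word = False
    for ch in text:
        if is_word_char(ch):
            if not in_word:
                count += 1  # a new word starts here
            in_word = True
        else:
            in_word = False
    return count

sen = "एक बार,एक कौआ, बहुत प्यासा, था"
print(count_words(sen))
```

This ignores finer points such as apostrophes or hyphens inside words, so treat it as a starting point rather than a full word segmenter.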
answered 2013-07-16T08:58:27.890
0

I know nothing about the structure of your language, but can't you simply count the spaces?

>>> len(sen.split()) + 1
7

Note the + 1 because there are n - 1 spaces. [Edited to split on whitespace of arbitrary length - thanks @Martijn Pieters]
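One caveat worth checking before relying on this: `split()` already returns tokens rather than spaces, and for the sample it returns 6 of them because "बार,एक" contains no space, so the + 1 happens to compensate for that one missing space. A small sketch, assuming Python 3; the helper name `count_by_whitespace` and the second test string `spaced` are made up for illustration.

```python
def count_by_whitespace(sentence):
    # Number of whitespace-separated tokens, with no + 1 correction.
    return len(sentence.split())

sen = "एक बार,एक कौआ, बहुत प्यासा, था"      # no space after the first comma
spaced = "एक बार, एक कौआ, बहुत प्यासा, था"  # a space after every comma

print(count_by_whitespace(sen))
print(count_by_whitespace(spaced))
```

With a space after every comma the plain token count is already 7, so adding 1 would overcount; whitespace splitting is only reliable when every word boundary includes a space.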

answered 2013-07-16T08:45:11.043