python - Finding the length of a sentence with English words and Chinese characters

Question

The sentence may contain non-English characters, for example Chinese:

你好,hello world

The expected length is 5 (2 Chinese characters, 2 English words, 1 comma).

score 2 · Accepted Answer

You can use the fact that most Chinese characters fall in the Unicode range 0x4e00 - 0x9fcc.

# -*- coding: utf-8 -*-
import re

s = '你好 hello, world'

# First find all 'normal' (ASCII) words and punctuation.
# '[\x21-\x2f]' includes most ASCII punctuation; change it to ',' if you only need to match a comma.
# re.ASCII stops \w from matching Chinese characters, which are counted separately below.
count = len(re.findall(r'\w+|[\x21-\x2f]', s, re.ASCII))

# Then count every Chinese character as one unit.
for ch in s:
    # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print(count)
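The two steps above can also be folded into a single pattern. This is only a sketch of the same idea, not a different method: the names `TOKEN_RE`, `tokenize` and `count_tokens` are made up for illustration, and it assumes that only the 0x4e00 - 0x9fcc range and ASCII punctuation matter (a full-width comma such as '，' would not be counted).

```python
import re

# One alternative per kind of token:
#   [\u4e00-\u9fcc]  a single Chinese character
#   [A-Za-z0-9_]+    a run of ASCII word characters
#   [\x21-\x2f]      a single ASCII punctuation character
TOKEN_RE = re.compile(r'[\u4e00-\u9fcc]|[A-Za-z0-9_]+|[\x21-\x2f]')


def tokenize(text):
    """Return the list of tokens found in text (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def count_tokens(text):
    """Return the number of tokens found in text (hypothetical helper)."""
    return len(tokenize(text))


print(tokenize('你好,hello world'))
print(count_tokens('你好,hello world'))
```

Returning the token list as well as the count makes it easier to see what was actually matched when a result looks off.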

score 1 · Accepted Answer

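If you're happy to treat each Chinese character as a separate word, even though that's not always the case, you can do something similar by checking the Unicode character property of each character with the unicodedata module.

For example, if you run this code on your sample text: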

# -*- coding: utf-8 -*-

import unicodedata

s = "你好,hello world"
for c in s:
  # print the two-letter Unicode general category of each character
  print(unicodedata.category(c))

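You'll see that the Chinese characters are reported as Lo (Letter, other), unlike Latin characters, which are usually reported as Ll or Lu.

Knowing that, you could treat anything that is Lo as a single word, even when it isn't separated by whitespace/punctuation.

Now, this almost certainly won't work in every case for every language, but it may be good enough for your needs.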
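To illustrate that caveat: Lo is not specific to Chinese. Japanese kana, for instance, are also Lo. If that matters for your text, one possible refinement is to look at the character name as well. A rough sketch, where `is_cjk_ideograph` is a hypothetical helper and matching on the name prefix is my own assumption about what should count as "Chinese":

```python
import unicodedata


def is_cjk_ideograph(ch):
    """Hypothetical helper: True if ch is named as a CJK unified ideograph."""
    # name() gets a default so that unnamed code points do not raise ValueError
    return unicodedata.name(ch, '').startswith('CJK UNIFIED IDEOGRAPH')


for ch in '你あa':
    # prints the character, its general category, and the name-based check
    print(ch, unicodedata.category(ch), is_cjk_ideograph(ch))
```

Both '你' and 'あ' come back as Lo, but only '你' passes the name check.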

Update

Here is a more complete example of how to do this:

# -*- coding: utf-8 -*-

import unicodedata

s = "你好,hello world"

wordcount = 0
start = True             # True when the next ordinary character starts a new word
for c in s:
  cat = unicodedata.category(c)
  if cat == 'Lo':        # Letter, other
    wordcount += 1       # each letter counted as a word
    start = True
  elif cat[0] == 'P':    # Some kind of punctuation
    wordcount += 1       # each punctuation mark counted as a word
    start = True
  elif cat[0] == 'Z':    # Some kind of separator
    start = True
  else:                  # Everything else
    if start:
      wordcount += 1     # Only count at the start
    start = False

print(wordcount)

score 0 · Accepted Answer

The logic here is flawed:

你好
,

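These are all characters, not words. For the Chinese characters, you will probably need to do something with a regular expression.

The problem here is that a Chinese character can be part of a word or a word on its own.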

大好

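As far as a regular expression is concerned, is that one word or two? Each character is a word on its own, but together they also form a word.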

hello world

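If you count this by splitting on whitespace you get 2 words, but then your regex for Chinese probably won't work.

I think the only way you can make this work for "words" is to count the Chinese and the English separately.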
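Something along these lines, as a rough sketch rather than a finished solution. `count_separately` is a made-up name, and it assumes an "English word" is a run of ASCII letters and a "Chinese" unit is one character in 0x4e00 - 0x9fcc:

```python
import re


def count_separately(text):
    """Hypothetical helper: return (chinese_characters, english_words)."""
    # Each Chinese character is counted on its own; no word segmentation is done.
    chinese = re.findall(r'[\u4e00-\u9fcc]', text)
    # A run of ASCII letters counts as one English word.
    english = re.findall(r'[A-Za-z]+', text)
    return len(chinese), len(english)


print(count_separately('你好,hello world'))
print(count_separately('大好'))
```

Note that 大好 still comes out as 2 characters, not 1 word. Deciding whether those characters form a single word needs real word segmentation, which a regex cannot do.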
