python - Real word count in NLTK

Question

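The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, Counting Vocabulary says that the following gives a word count: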

text = nltk.Text(tokens)
len(text)

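However, it doesn't: it gives a count of words and punctuation marks. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is: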

word_average_length =(len(string_of_text)/len(text))

However, this will be off because:

len(string_of_text) is a character count, including spaces
len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
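To make the discrepancy concrete, here is a minimal sketch with hand-written sample data. The string and the token list are hypothetical stand-ins for string_of_text and text, not output from NLTK, and the "contains a letter or digit" test is just one possible definition of a word.

```python
# Hypothetical stand-ins for the variables in the question.
string_of_text = "Hello, world!"
text = ["Hello", ",", "world", "!"]  # tokens, punctuation included

# Naive average: all characters (spaces and punctuation too) per token.
naive_average = len(string_of_text) / len(text)

# Keep only tokens that contain at least one letter or digit.
words = [t for t in text if any(ch.isalnum() for ch in t)]
true_average = sum(len(w) for w in words) / len(words)

print(naive_average)  # 3.25
print(true_average)   # 5.0
```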

Am I missing something here? This must be a very common NLP task...

score 21 · Accepted Answer

Tokenization with nltk

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It icludes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'icludes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
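Once the tokens contain only words, the word count and the average word length follow directly. The sketch below uses the standard library's re.findall with the same \w+ pattern, on the assumption that it yields the same tokens as RegexpTokenizer(r'\w+') for this input; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample sentence.
text = "This is my text. It includes commas, question marks? and other stuff."

# Every maximal run of word characters becomes one token; punctuation is dropped.
tokens = re.findall(r"\w+", text)

word_count = len(tokens)
average_length = sum(len(t) for t in tokens) / word_count

print(word_count)      # 12
print(average_length)  # 4.5
```

Note that \w+ also splits on punctuation inside a token, which is why "U.S." comes out as 'U' and 'S' in the output above. The same would happen to contractions such as "don't".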

score 15 · Accepted Answer

Removing Punctuation

Use a regular expression to filter out the punctuation

import re
from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

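Or you could make use of the counts you already did to prevent some re-computation. This multiplies the length of each word by the number of times we saw it, then sums all of that up.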

>>> float(sum(len(w)*c for w,c in counts.items())) / len(filtered)  # .items() on Python 3 (.iteritems() was Python 2 only)
3.75
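The two steps above can be wrapped into one helper. This is only a sketch for Python 3: the function name word_stats and the empty-input behaviour are assumptions made here, not part of the answer.

```python
import re
from collections import Counter

# Same pattern as above: a token counts as a word if it has a letter or digit.
NON_PUNCT = re.compile(r".*[A-Za-z0-9].*")

def word_stats(tokens):
    """Return (word count, average word length) for a list of tokens."""
    words = [t for t in tokens if NON_PUNCT.match(t)]
    if not words:
        return 0, 0.0  # assumed convention for input with no words
    counts = Counter(words)
    # Weight each distinct word's length by how often it occurred.
    total_chars = sum(len(w) * c for w, c in counts.items())
    return len(words), total_chars / len(words)

count, average = word_stats(['this', 'is', 'a', 'sentence', '.'])
print(count, average)  # 4 3.75
```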

score 2 · Accepted Answer

Removing Punctuation (with no regex)

Use the same solution as dhg, but test that a given token is alphanumeric instead of using a regex pattern.

from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})

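Benefits:

Works better with non-English languages, since "À".isalnum() is True while bool(nonPunct.match("à")) is False (an "à" is not a punctuation mark, at least in French).
No need to use the re package.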
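The two filters also differ in the other direction, which may matter depending on the tokenizer. The sketch below compares them on a few hand-picked tokens, chosen here for illustration: isalnum() accepts "à" but rejects any token that contains punctuation, such as "don't" or "U.S.", while the regex keeps those and rejects "à".

```python
import re

non_punct = re.compile(r".*[A-Za-z0-9].*")  # pattern from the regex answer

# Hypothetical tokens chosen to show where the two filters disagree.
samples = ["à", "don't", "U.S.", "."]

# Map each token to (kept by regex, kept by isalnum).
results = {t: (bool(non_punct.match(t)), t.isalnum()) for t in samples}

for token, (by_regex, by_isalnum) in results.items():
    print(token, by_regex, by_isalnum)
# à False True
# don't True False
# U.S. True False
# . False False
```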
