
I'm trying to split a paragraph into words. I have the lovely nltk.tokenize.word_tokenize(sent) at hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what happens if you use it on a whole paragraph instead, i.e. up to 5 sentences? I've tried it myself on a few short paragraphs and it seems to work, but that's hardly conclusive proof.


2 Answers


nltk.tokenize.word_tokenize(text) is just a thin wrapper function that calls the tokenize method of an instance of the TreebankWordTokenizer class, which apparently uses simple regular expressions to parse a sentence.
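
For a quick sanity check, here is a minimal sketch (hedged: in newer NLTK releases word_tokenize also runs sentence splitting first, so the two calls only coincide for single-sentence input):

>>> from nltk.tokenize import word_tokenize
>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> word_tokenize("Hello, world.") == TreebankWordTokenizer().tokenize("Hello, world.")
True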

The documentation for that class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of the string, are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.

The underlying tokenize method itself is very simple:

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()

Basically, what the method commonly does is tokenize a period as a separate token if it falls at the end of the string:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']

Any period that falls in the middle of the string is tokenized as part of the word, under the assumption that it is an abbreviation:

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as that behaviour is acceptable to you, you should be fine.
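
If it is not, a common workaround (a rough sketch, assuming the punkt model used by nltk.sent_tokenize is available) is to segment the paragraph into sentences first and feed the tokenizer one sentence at a time:

>>> import nltk
>>> text = "Hello, world. How are you?"
>>> [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
[['Hello', ',', 'world', '.'], ['How', 'are', 'you', '?']]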

Answered on 2013-10-15T04:46:20.000

Try this sort of trick:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
# Pad every punctuation character with spaces
# (iterate over a set so each character is handled only once).
>>> for ch in set(sent):
...     if ch in punct:
...             sent = sent.replace(ch, " " + ch + " ")
# Collapse the extra spaces introduced by the padding.
>>> sent = " ".join(sent.split())

Then, most likely, the following is also what you need in order to count the frequencies =)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print(i, fdist[i])
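
Since FreqDist in NLTK 3 is a subclass of collections.Counter, you can also ask for the most frequent tokens directly (a small sketch assuming NLTK 3; older releases expose roughly the same information through fdist.items()):

>>> fdist.most_common(5)    # five most frequent tokens with their counts
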
Answered on 2013-10-15T14:40:37.123