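python - How can I take a few paragraphs of text, check whether any sentences contain pronouns, and then select all of those sentences to form a new paragraph?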

Question

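Should I use NLTK or regular expressions to split the text?
How do I pick out pronouns (he/she)? I want to select every sentence that contains a pronoun.

This is part of a larger project, and I am new to Python. Could you point me to any useful code?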

score 1 · Accepted Answer

NLTK is your best option. Given a string of sentences as input, you can get a list of the sentences that contain pronouns by doing the following:

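from nltk import pos_tag, sent_tokenize, word_tokenize

# Requires NLTK's tokenizer and tagger data (installable with nltk.download).
paragraph = "This is a sentence with no pronouns. Take it or leave it."
# Keep each sentence whose set of part-of-speech tags includes 'PRP' (personal pronoun).
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])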

Output:

['Take it or leave it.']

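Basically, we split the string into a list of sentences, split each sentence into a list of words, and convert each sentence's word list into a set of part-of-speech tags. Using a set collapses repeated tags, so each sentence is tested for 'PRP' once and appears in the result at most once, no matter how many pronouns it contains.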
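To make the tag-set step concrete, and to cover the last part of the question (forming a new paragraph), here is a minimal sketch. It does not call NLTK: the `(word, tag)` pairs are hand-written to mimic the shape of `pos_tag` output, so treat them as illustrative rather than real tagger results. The `has_pronoun` helper and the `PRONOUN_TAGS` name are hypothetical. Adding `'PRP$'` (the Penn Treebank tag for possessive pronouns such as "his") is an extension beyond the answer above, which checks only `'PRP'`.

```python
# Hand-written (word, tag) pairs in the shape pos_tag returns.
# Illustrative only: these are NOT actual tagger output.
tagged_sentences = [
    ("This is a sentence with no pronouns.", [
        ("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("sentence", "NN"),
        ("with", "IN"), ("no", "DT"), ("pronouns", "NNS"), (".", "."),
    ]),
    ("Take it or leave it.", [
        ("Take", "VB"), ("it", "PRP"), ("or", "CC"),
        ("leave", "VB"), ("it", "PRP"), (".", "."),
    ]),
]

# Assumption: also count possessive pronouns (PRP$), not just personal ones (PRP).
PRONOUN_TAGS = {"PRP", "PRP$"}


def has_pronoun(tagged_words):
    """Return True if any tag in the sentence is a pronoun tag."""
    tags = {tag for _, tag in tagged_words}  # repeated tags collapse to one entry
    return bool(tags & PRONOUN_TAGS)


# Keep the matching sentences in their original order and join them.
selected = [sentence for sentence, tagged in tagged_sentences if has_pronoun(tagged)]
new_paragraph = " ".join(selected)
print(new_paragraph)
```

In a real run, each tag list would come from `pos_tag(word_tokenize(sentence))` instead of being typed by hand; the joining step stays the same.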

score 1 · Accepted Answer

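I am working on an NLP project with similar requirements. I suggest you use NLTK, since it makes things very simple and gives you a lot of flexibility. Because you need to collect every sentence that contains a pronoun, you can split the text into sentences and store them in a list. You can then iterate over the list and look for the sentences that contain pronouns. Also make sure to record each matching sentence's index in the list, or alternatively build a new list.

Sample code: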

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []

for sentence in sentences:
    # Tag the whole sentence at once so the tagger can use the surrounding words as context.
    tagged_words = pos_tag(word_tokenize(sentence))
    for word, tag in tagged_words:
        if tag == 'PRP':
            # One pronoun is enough; stop so the sentence is added only once.
            sentences_with_pronouns.append(sentence)
            break

print(sentences_with_pronouns)

Output:

['she also loves to play chess with him']
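The answer also suggests recording each matching sentence's index. A rough sketch of that idea with `enumerate` follows. To keep it runnable without NLTK data, it uses a hypothetical `contains_pronoun` helper built on a regular expression over a small, hand-picked list of English personal pronouns. That word list is an assumption and is cruder than part-of-speech tagging: for example, it cannot tell "her" the object pronoun from "her" the possessive. It also touches on the regex option raised in the question.

```python
import re

# Assumption: a short fixed list of personal pronouns stands in for the tagger.
PRONOUN_RE = re.compile(
    r"\b(?:i|you|he|she|it|we|they|me|him|her|us|them)\b",
    re.IGNORECASE,
)


def contains_pronoun(sentence):
    """Return True if the sentence contains one of the listed pronouns as a whole word."""
    return PRONOUN_RE.search(sentence) is not None


sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']

# Record (index, sentence) pairs so each match can be traced back to the original list.
indexed_matches = [(i, s) for i, s in enumerate(sentences) if contains_pronoun(s)]
indices = [i for i, _ in indexed_matches]

# Join the matching sentences, in their original order, into a new paragraph.
new_paragraph = " ".join(s for _, s in indexed_matches)

print(indices)
print(new_paragraph)
```

Swapping `contains_pronoun` for a check based on `pos_tag` would keep the index bookkeeping unchanged.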

P.S. Also check out the pylucene and whoosh libraries; they are very useful Python packages for indexing and searching text.
