python - Splitting sentences with nltk while preserving quotes

Question

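I am using nltk to split text into sentence units. However, I need sentences that contain a quotation to be extracted as a single unit. Right now every sentence, even one inside a quotation, is extracted as a separate piece.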

Here is an example of what I am trying to extract as a single unit:

"This is a sentence. This is also a sentence," said the cat.

Right now I have this code:

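import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

text = 'This is a sentence. This is also a sentence," said the cat.'

# Print each detected sentence, separated by a divider line
print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))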

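This works well, but I want to keep a sentence that contains a quotation together, even if the quotation itself contains multiple sentences.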

The code above produces:

This is a sentence.
-----
This is also a sentence," said the cat.

I am trying to extract the whole text as a single unit:

"This is a sentence. This is also a sentence," said the cat.

Is there a simple way to do this with nltk, or should I use a regular expression? I was impressed by how easy it was to get started with nltk, but now I'm stuck.

score 2 · Accepted Answer

If I understand the problem correctly, this regular expression should do it:

import re

text = '"This is a sentence. This is also a sentence," said the cat.'

# findall returns one tuple of the two capture groups per match
for grp in re.findall(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
    print("".join(grp))

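It is two patterns combined with an OR. The first matches a plain quoted sentence, where the closing period sits inside the quotes. The second matches either an ordinary sentence or a sentence that starts with a quotation and ends with a period. If you have more complex sentences, it may need further tuning.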
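One thing to watch for with the pattern above: when the first alternative matches, neither capture group takes part, so `findall` returns a tuple of empty strings and the loop prints an empty line. A minimal sketch of one way around that, assuming the same two alternatives are kept, is to make the group non-capturing and read the whole match with `finditer`. The names `SENTENCE_RE` and `split_keeping_quotes` and the longer sample text are my own, for illustration only.

```python
import re

# Same two alternatives as the pattern above, but with no capture groups,
# so the full match is always available through group(0).
SENTENCE_RE = re.compile(r'"[^"]*\."|(?:"[^"]*")*[^".]*\.')


def split_keeping_quotes(text):
    """Return sentence units, keeping a leading quotation with its sentence."""
    # strip() drops the space left between consecutive matches.
    return [m.group(0).strip() for m in SENTENCE_RE.finditer(text)]


sample = ('"This is a sentence. This is also a sentence," said the cat. '
          '"Stop." The dog left.')
units = split_keeping_quotes(sample)
for unit in units:
    print(unit)
```

This still only handles straight double quotes and sentences ending in a period, so abbreviations, question marks, and nested quotes would need further work.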

score 0 · Accepted Answer

Just change your print line to:

print(' '.join(tokenizer.tokenize(text, realign_boundaries=True)))

This joins the sentences with a space instead of \n-----\n.
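Note that joining everything with a space gives back the whole text as one string, so sentences outside the quotation get merged too. A possible refinement, sketched here under assumptions, is to rejoin the tokenizer output only while a double quote is still open. The helper `merge_quoted` is hypothetical and not part of nltk, and the `pieces` list is hand-written to stand in for what the tokenizer might return when the text includes the opening quote.

```python
def merge_quoted(sentences, quote='"'):
    """Rejoin sentence pieces so that a quotation is never split across units."""
    merged = []
    buffer = []
    open_quote = False
    for sentence in sentences:
        buffer.append(sentence)
        # An odd number of quote marks flips the open/closed state.
        if sentence.count(quote) % 2 == 1:
            open_quote = not open_quote
        if not open_quote:
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:
        # Unbalanced quote: keep the leftover text rather than drop it.
        merged.append(" ".join(buffer))
    return merged


# Hand-written stand-in for tokenizer output (an assumption, not real nltk output).
pieces = [
    '"This is a sentence.',
    'This is also a sentence," said the cat.',
    'The dog left.',
]
units = merge_quoted(pieces)
for unit in units:
    print(unit)
```

This only counts straight double quotes, so curly quotes or apostrophe-style quoting would need their own handling.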
