python - Open a file and read sentences

Question

I want to open a file and extract its sentences. The sentences in the file run across line breaks, like this:

"He said, 'I'll pay you five pounds a week if I can have it on my own
terms.'  I'm a poor woman, sir, and Mr. Warren earns little, and the
money meant much to me.  He took out a ten-pound note, and he held it
out to me then and there.

Currently I'm using this code:

text = ' '.join(file_to_open.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

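readlines cuts through the sentences. Is there a good way to solve this and get only the sentences? (Without NLTK.)

Thanks for your attention.

The current problem: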

file_to_read = 'test.txt'

with open(file_to_read) as f:
    text = f.read()

import re
word_list = ['Mrs.', 'Mr.']     

for i in word_list:
    text = re.sub(i, i[:-1], text)

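What I get (in the test case) is that Mrs. is changed to Mr. while Mr. stays Mr. I tried several other things, but nothing seemed to work. The answer is probably simple, but I'm missing it.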

score 3 · Accepted Answer

Your regex works on the text above if you do this:

import re

with open(filename) as f:
    text = f.read()  # read the whole file at once instead of line by line

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The only problem is that the regex also splits on the dot in "Mr." from your text above, so you need to fix or change that.
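To make the problem concrete, here is a small sketch that runs the same pattern on a shortened sample string. The sample is made up for illustration; only the regex comes from the question.

```python
import re

# Shortened, made-up sample based on the text in the question.
sample = "I'm a poor woman, sir, and Mr. Warren earns little. The money meant much to me."

parts = re.split(r' *[\.\?!][\'"\)\]]* *', sample)

# The dot in "Mr." is treated as a sentence end, so the first
# sentence is cut in two right after "Mr".
for part in parts:
    print(repr(part))
```

The final dot also leaves an empty string at the end of the list, so you will probably want to filter out empty pieces as well.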

One solution to this, though not a perfect one, is to strip the dot wherever it follows Mr (or Mrs):

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # no for loop needed for this, like there was before

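This matches an 'M' followed by at least 1 and at most 2 word characters (\w{1,2}), followed by a dot. The parenthesized part of the pattern is grouped and captured, and is referenced in the replacement as '\1' (group 1, since you can have more parenthesized groups). So essentially, Mr. or Mrs. is matched, but only the Mr or Mrs part is captured, and the whole match is then replaced by the captured part, which excludes the dot.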
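As for why the for loop in the question misbehaves: the strings in word_list are used as regex patterns, and an unescaped '.' matches any character, so the pattern 'Mr.' also matches the 'Mrs' left behind by the first substitution. A minimal sketch comparing the three variants, using a made-up sample sentence (re.escape is a standard library function that makes the dot literal):

```python
import re

sample = "Mrs. Warren spoke first. Mr. Warren earns little."

# Variant from the question: the '.' is a wildcard, so 'Mr.' also matches 'Mrs'.
unescaped = sample
for title in ['Mrs.', 'Mr.']:
    unescaped = re.sub(title, title[:-1], unescaped)

# Same loop, but re.escape() turns the dot into a literal dot.
escaped = sample
for title in ['Mrs.', 'Mr.']:
    escaped = re.sub(re.escape(title), title[:-1], escaped)

# Single substitution with a capture group, no loop needed.
grouped = re.sub(r'(M\w{1,2})\.', r'\1', sample)

print(unescaped)  # Mr Warren spoke first. Mr Warren earns little.
print(escaped)    # Mrs Warren spoke first. Mr Warren earns little.
print(grouped)    # Mrs Warren spoke first. Mr Warren earns little.
```

Note that the grouped pattern is a heuristic: it strips the dot after any capital 'M' followed by one or two word characters, so it can also hit things other than titles.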

Then:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.
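Putting the steps together, a rough sketch might look like the following. The function name split_sentences is hypothetical, and collapsing whitespace and dropping empty pieces are additions of this sketch, not part of the answer above.

```python
import re

def split_sentences(text):
    """Split text into sentences using the regex from the question."""
    # Collapse newlines and runs of spaces so line breaks no longer
    # appear inside sentences (an extra step assumed for this sketch).
    text = ' '.join(text.split())
    # Strip the dot after Mr/Mrs-like titles (imperfect heuristic).
    text = re.sub(r'(M\w{1,2})\.', r'\1', text)
    parts = re.split(r' *[\.\?!][\'"\)\]]* *', text)
    # The final punctuation mark leaves an empty string; drop it.
    return [part for part in parts if part]

sample = (
    "I'm a poor woman, sir, and Mr. Warren earns little, and the\n"
    "money meant much to me.  He took out a ten-pound note, and he held it\n"
    "out to me then and there."
)
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

For a real file, you would pass in the result of f.read() instead of the sample string.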

score 1 · Accepted Answer

You may want to try the text-sentence tokenizer module.

From their example code:

>>> from text_sentence import Tokenizer
>>> t = Tokenizer()
>>> list(t.tokenize("This is first sentence. This is second one!And this is third, is it?"))
[T('this'/sent_start), T('is'), T('first'), T('sentence'), T('.'/sent_end),
 T('this'/sent_start), T('is'), T('second'), T('one'), T('!'/sent_end),
 T('and'/sent_start), T('this'), T('is'), T('third'), T(','/inner_sep),
 T('is'), T('it'), T('?'/sent_end)]

I've never actually tried it, though; I prefer to use NLTK/punkt.
