python - Using sent_tokenize on a specific region of a file in Python with NLTK?

Question

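I have a file with thousands of sentences, and I want to find the sentences that contain a specific character/word.

Originally, I tokenized the entire file (using sent_tokenize) and then iterated through the sentences to find the word. However, this is too slow. Since I can quickly find the index of the word, can I use that to my advantage? Is there a way to tokenize only the region around a word (i.e. figure out which sentence contains the word)?

Thanks.

Edit: I'm working in Python and using the NLTK library.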

score 2 · Accepted Answer

What platform are you using? On unix/linux/macOS/cygwin, you can do the following:

sed 's/[.?!]/\n/g' < myfile | grep 'myword'

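It will show only the lines that contain your word (and sed will do a very rough split into sentences). If you want a solution in a specific language, you should say what you're using!

Edit for Python:

The following will work. It only calls the tokenizer on lines where a regular expression matches your word (the regex test is a very fast operation). This means you only tokenize the lines that contain the word you want: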

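import re
import os.path

from nltk.tokenize import sent_tokenize

myword = 'using'
fname = os.path.abspath('path/to/my/file')
# Match the word only as a whole word; escape it in case it contains regex metacharacters
pattern = re.compile(r'\b' + re.escape(myword) + r'\b')

try:
    with open(fname) as f:
        # Keep only the lines that contain the word (cheap regex test)
        matching_lines = [line for line in f if pattern.search(line)]
    for match in matching_lines:
        # do something with the matching lines, e.g. tokenize just those
        sents = sent_tokenize(match)
except IOError:
    print("Can't open file " + fname)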
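
Note that the loop above tokenizes the matching lines but still leaves you with every sentence on those lines. Here is a rough sketch of narrowing that down to just the sentences that contain the word. The names `sentences_with_word` and `naive_split` are made up for illustration, and `naive_split` is only a stand-in so the example runs without NLTK; in practice you would pass `sent_tokenize` as the `tokenize` argument.

```python
import re


def naive_split(text):
    # Stand-in splitter (an assumption, not NLTK): break after '.', '!' or '?'
    # when it is followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def sentences_with_word(lines, word, tokenize=naive_split):
    """Yield the sentences from `lines` that contain `word` as a whole word."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    for line in lines:
        if pattern.search(line):            # cheap test on the raw line first
            for sent in tokenize(line):     # costly step, matching lines only
                if pattern.search(sent):
                    yield sent


lines = [
    "I like cats. We are using NLTK here!\n",
    "Nothing to see.\n",
    "Reusing is fine. Try using it.\n",
]
found = list(sentences_with_word(lines, "using"))
```

One limitation to keep in mind: this works line by line, so a sentence that spans several lines will be cut at the line breaks.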

score 0 · Accepted Answer

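Here's an idea that might speed up the search. You create an additional list in which you store the running total of word counts for each sentence in your big text. Using a generator function that I learned from Alex Martelli, try something like: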

def running_sum(a):
  tot = 0
  for item in a:
    tot += item
    yield tot

from nltk.tokenize import sent_tokenize

sen_list = sent_tokenize(bigtext)
wc = [len(s.split()) for s in sen_list]
runningwc = list(running_sum(wc))  # cumulative word count at the end of each sentence (running total for the whole text)

word_index = ...  # placeholder: some number that you get from the word index (assumed 0-based)

for index, w in enumerate(runningwc):
    if w > word_index:
        sentnumber = index  # the first sentence whose running total exceeds word_index contains the word
        break

print(sen_list[sentnumber])
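
If you need to look up many words, the linear scan over the running totals can be replaced with a binary search. A minimal sketch, assuming the word index is 0-based and counts whitespace-separated words the same way `s.split()` does; the helper names `build_running_counts` and `sentence_index` are made up for illustration:

```python
from bisect import bisect_right
from itertools import accumulate


def build_running_counts(sentences):
    """Cumulative word count at the end of each sentence (built once)."""
    return list(accumulate(len(s.split()) for s in sentences))


def sentence_index(running, word_index):
    """Index of the sentence that holds the word at 0-based `word_index`.

    This is the first sentence whose running total exceeds `word_index`.
    """
    return bisect_right(running, word_index)


sentences = ["I like cats.", "We are using NLTK here!", "Is it fast?"]
running = build_running_counts(sentences)   # word counts 3, 5, 3
hit = sentences[sentence_index(running, 5)]  # word 5 is "using"
```

Building the list still requires tokenizing the whole text once, so this only pays off when the same text is searched repeatedly. A `word_index` past the end of the text gives an index equal to `len(sentences)`, which you would need to check for.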

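Hope this idea helps.

Update: If sent_tokenize is what's slow, then you can try avoiding it altogether. Use the known index to find the word in your big text.

Now move forward and backward, character by character, to detect the end and the start of the sentence. Something like "[.!?] " (a period, exclamation mark or question mark, followed by a space) would mark the start and end of a sentence. You will only be searching in the vicinity of your target word, so it should be much faster than sent_tokenize.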
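
A rough sketch of that character-by-character scan is below. It assumes a sentence ends at '.', '!' or '?' followed by whitespace (or the end of the text), so it will split wrongly on abbreviations such as "Dr. Smith", which sent_tokenize is designed to handle. The names `is_boundary` and `sentence_around` are made up for illustration.

```python
SENTENCE_END = ".!?"


def is_boundary(text, i):
    """True if text[i] is an end mark followed by whitespace or end of text."""
    return text[i] in SENTENCE_END and (i + 1 == len(text) or text[i + 1].isspace())


def sentence_around(text, char_index):
    """Return the sentence of `text` that contains position `char_index`."""
    # Walk backward until the character just before `start` closes a sentence.
    start = char_index
    while start > 0 and not is_boundary(text, start - 1):
        start -= 1
    # Walk forward until we reach the mark that closes this sentence.
    end = char_index
    while end < len(text) and not is_boundary(text, end):
        end += 1
    return text[start:end + 1].strip()


text = "I like cats. We are using NLTK here! Is it fast?"
result = sentence_around(text, text.index("using"))
```

Only the characters between the two neighbouring boundaries are touched, so the cost depends on the sentence length rather than on the size of the file.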
