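Yep, still playing around with Python.

I decided to try out Gensim, a tool for finding the topics for a chosen word and its context.

So I was wondering how to find a word in a piece of text and extract 20 words along with it (for example, the 10 words before that specific word and the 10 words after it), then save it together with other such extracts so that Gensim can be run on them.

What seems hard to me is finding a way to extract the 10 words before and after the chosen word once it is found. I have played with nltk before, and just by tokenizing the text into words or sentences it was easy to get hold of the sentences. Still, getting the words or sentences before and after that specific sentence seems hard for me to figure out.

For those who are confused (it is 1 am here, so I may be confusing), I'll show an example: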

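As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"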

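If we say the word is Snow-White, then I would want to extract this part:

her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will

That is, 10 words before and 10 words after Snow-White.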

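If this can be done in nltk and is easier, it would also be cool to get the sentences before and after the sentence in which Snow-White appears.

I mean, I would be happy with either of these two solutions if someone can help me.

If Gensim can do this too... and that is easier, then I would be happy with that as well. So any of the 3 ways will do... I just want to try and see how this can be done, because atm my head is blank.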

3 Answers

The process is called Keyword in Context (KWIC).

The first step is to split your input into words. There are many ways to do that using the regular expressions module, for example re.split or re.findall.
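As a minimal sketch of this splitting step, assuming the same word pattern that the kwic() function in this answer uses (letters, apostrophes and hyphens), re.findall returns the words as a list. The `sample` string is just a fragment of the passage from the question, used here for illustration.

```python
import re

# Illustrative fragment of the passage quoted in the question.
sample = 'she was so angry to hear that Snow-White was yet living. "But now," thought she'

# A "word" is a run of letters, apostrophes and hyphens, so "Snow-White"
# stays in one piece and the punctuation around words is dropped.
words = re.findall(r"[A-Za-z'\-]+", sample)
print(words)
```

Note that this pattern only covers ASCII letters; text with accented characters or digits would need a different pattern.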

Once you have located a particular word, you can use slicing to get the ten words before it and the ten words after it.
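A small sketch of the slicing step, assuming the text has already been split into a list of words. The helper name `context_window` is made up for this example; it handles only the first occurrence and clamps the start index so that a match near the beginning does not produce a negative slice index.

```python
def context_window(words, target, width=10):
    """Return up to `width` words on each side of the first `target`."""
    try:
        i = words.index(target)
    except ValueError:
        return []  # target does not occur in the list
    # A negative start would be read as "count from the end", so clamp at 0.
    start = max(i - width, 0)
    # Slicing past the end of a list is safe; Python just stops at the end.
    return words[start:i + width + 1]


words = ("all her blood rushed to her heart for she was so angry to hear that "
         "Snow-White was yet living But now thought she to herself will I make").split()
result = context_window(words, "Snow-White")
print(" ".join(result))
```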

To build an index for all of the words, a deque with maxlen is a convenient way to implement the sliding window.
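One possible way to sketch the deque idea (not the code this answer goes on to show, which uses itertools): keep a deque of 2*width+1 words and check its middle element after every append. The name `kwic_deque` and the None padding are choices made for this illustration.

```python
from collections import deque


def kwic_deque(words, target, width=10):
    """Yield the window of words around every occurrence of `target`."""
    size = 2 * width + 1
    # With maxlen set, appending to a full deque drops the oldest word.
    window = deque(maxlen=size)
    # Pad both ends with None so words near the edges can reach the centre.
    padded = [None] * width + list(words) + [None] * width
    for word in padded:
        window.append(word)
        if len(window) == size and window[width] == target:
            # Strip the padding before handing the context back.
            yield [w for w in window if w is not None]


words = "a b c d e f g".split()
print(list(kwic_deque(words, "d", width=2)))
```

Because the window slides over the whole list, this finds every occurrence in a single pass rather than only the first one.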

Here is one way to do it efficiently using itertools:

from re import finditer
from itertools import tee, islice, izip, chain, repeat

def kwic(text, tgtword, width=10):
    'Find all occurrences of tgtword and show the surrounding context'
    matches = (mo.span() for mo in finditer(r"[A-Za-z\'\-]+", text))
    padded = chain(repeat((0,0), width), matches, repeat((-1,-1), width))
    t1, t2, t3 = tee((padded), 3)
    t2 = islice(t2, width, None)
    t3 = islice(t3, 2*width, None)
    for (start, _), (i, j), (_, stop) in izip(t1, t2, t3):
        if text[i: j] == tgtword:
            context = text[start: stop]
            yield context

print list(kwic(text, 'Snow-White'))
answered 2012-05-11T23:20:20.547

text = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""
spl = text.split()

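def ans(word):
    for ind, x in enumerate(spl):
        # Compare with the surrounding punctuation stripped off
        if x.strip(",'\".!") == word:
            # max() keeps the start from going negative near the beginning of the text
            return " ".join(spl[max(ind - 10, 0):ind + 11])
    return None  # word not found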


>>> ans('Snow-White')
her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will
answered 2012-05-11T23:05:46.340

Just wanted to update Raymond Hettinger's excellent answer for Python 3:

All you have to do is change izip to zip.

from re import finditer
from itertools import chain, islice, repeat, tee

def kwic(text, tgtword, width=20):
    'Find all occurrences of tgtword and show the surrounding context'
    matches = (mo.span() for mo in finditer(r"[A-Za-z\'\-]+", text))
    padded = chain(repeat((0,0), width), matches, repeat((-1,-1), width))
    t1, t2, t3 = tee((padded), 3)
    t2 = islice(t2, width, None)
    t3 = islice(t3, 2*width, None)
    for (start, _), (i, j), (_, stop) in zip(t1, t2, t3):
        if text[i: j] == tgtword:
            context = text[start: stop]
            yield context

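Also, for completeness, both NLTK and Textacy have this functionality built in; however, neither is as effective as Raymond's answer, because both use a character count as the window rather than tokens.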

NLTK

import nltk

test = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!"
"""

tokens = nltk.word_tokenize(test)
text = nltk.Text(tokens)
text.concordance('Snow-White', width=100)

Displaying 1 of 1 matches:
er heart , for she was so angry to hear that Snow-White was yet living . `` But now , '' thought she

Textacy

from textacy.text_utils import KWIC


test = """
As soon as it had finished, all her blood rushed to her heart, for she was so angry to hear that Snow-White was yet living. "But now," thought she to herself, "will I make something which shall destroy her completely." Thus saying, she made a poisoned comb by arts which she understood, and then, disguising herself, she took the form of an old widow. She went over the seven hills to the house of the seven Dwarfs, and[15] knocking at the door, called out, "Good wares to sell to-day!" 
"""

snow_white = KWIC(test, "Snow-White", window_width=50)
print(next(snow_white, ["Finished"]))

d to her heart, for she was so angry to hear that  Snow-White  was yet living. "But now," thought she to herself
['Finished']
answered 2021-03-23T14:06:30.530