
For this project, I am using the wikipedia, spacy, and textacy.extract modules.

I use the wikipedia module to fetch the page for the subject I set. It returns the page's content as a string.

Then I use textacy.extract.semistructured_statements() to filter out the facts. It takes two required arguments: the first is the document and the second is the entity.
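
For reference, each statement it yields unpacks into an entity, a cue verb, and a fact. This is only a minimal sketch against the textacy 0.x API used here, assuming `doc` is an already-parsed spaCy document:

import textacy.extract

# Sketch: each yielded statement is an (entity, cue, fact) triple
for entity, cue, fact in textacy.extract.semistructured_statements(doc, "Ubuntu"):
    print(entity, "|", cue, "|", fact)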

For testing purposes, I tried setting the subject to Ubuntu and to Bill Gates.


import wikipedia
import spacy
import textacy

#The subject we are looking for
subject = 'Bill Gates'

#The Wikipedia page: search for the subject and grab the first result's content
wikiResults = wikipedia.search(subject)
wikiPage = wikipedia.page(wikiResults[0]).content

#spaCy: parse the page text into a document
nlp = spacy.load("en_core_web_sm")
document = nlp(wikiPage)

#textacy.extract: pull (entity, verb, fact) statements about the subject
statements = textacy.extract.semistructured_statements(document, subject)

for statement in statements:
    subject, verb, fact = statement
    print(fact)

When I run the program, I get multiple results back when searching for Ubuntu, but not for Bill Gates. Why is that? And how can I improve my code to extract more facts from the Wikipedia page?


Edit: here are the final results

Ubuntu: (screenshot of the output)

Bill Gates: (screenshot of the output)

2 Answers

You need to process the document with different cues to extract the common verbs used to describe the subject, and you also need to split the string if you are searching for a multi-word name. For example, for Bill Gates you will want to search for the combination of "Bill", "Gates", and "Bill Gates", and you need to extract with the different base cue verbs used to describe the person/thing of interest.
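
For instance, one quick way to build those name variants (just a sketch; the full script below hard-codes the same list):

subject = "Bill Gates"
# Individual name tokens plus the full name -> ['Bill', 'Gates', 'Bill Gates']
search_terms = subject.split(" ") + [subject]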

So, for example, searching for "Gates":

statements = textacy.extract.semistructured_statements(document, "Gates", cue='have', max_n_words=200)

will give you more results, such as:

* entity: Gates , cue: had , fact: primary responsibility for Microsoft's product strategy from the company's founding in 1975 until 2006
* entity: Gates , cue: is , fact: notorious for not being reachable by phone and for not returning phone calls
* entity: Gates , cue: was , fact: the second wealthiest person behind Carlos Slim, but regained the top position in 2013, according to the Bloomberg Billionaires List
* entity: Bill , cue: were , fact: the second-most generous philanthropists in America, having given over $28 billion to charity
* entity: Gates , cue: was , fact: seven years old
* entity: Gates , cue: was , fact: the guest on BBC Radio 4's Desert Island Discs on January 31, 2016, in which he talks about his relationships with his father and Steve Jobs, meeting Melinda Ann French, the start of Microsoft and some of his habits (for example reading The Economist "from cover to cover every week
* entity: Gates , cue: was , fact: the world's highest-earning billionaire in 2013, as his net worth increased by US$15.8 billion to US$78.5 billion

Note that the verb can appear in negated form, as in the second result!

I also noticed that using a max_n_words larger than the default of 20 words leads to more interesting statements.
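
To see the effect, here is a minimal comparison sketch (it assumes the same Wikipedia page and spaCy model as the full script below; only max_n_words differs between the two calls):

import wikipedia
import spacy
import textacy

# Build the document the same way as in the full script
page = wikipedia.page(wikipedia.search("Bill Gates")[0]).content
document = spacy.load("en_core_web_sm")(page)

# Compare the default word limit (20) with a larger one for the "be" cue
short = list(textacy.extract.semistructured_statements(document, "Gates", cue="be"))
longer = list(textacy.extract.semistructured_statements(document, "Gates", cue="be", max_n_words=200))
print(len(short), "statements with the default limit vs.", len(longer), "with max_n_words=200")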

Here is my full script:

import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'

#The Wikipedia Page
wikiResults = wikipedia.search(subject)
#print("wikiResults:", wikiResults)
wikiPage = wikipedia.page(wikiResults[0]).content
print("\n\nwikiPage:", wikiPage, "'\n")
nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()
# Try several name variants and cue verbs, keeping only unique statements
for word in ["Gates", "Bill", "Bill Gates"]:
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

print("found", len(uniqueStatements), "statements.")
for statement in uniqueStatements:
    entity, cue, fact = statement
    print("* entity:",entity, ", cue:", cue, ", fact:", fact)

Varying the subject and the cue verbs gets me 23 results instead of 1.

answered 2020-04-22T19:06:11.300

Thanks to Gabriel M. for pointing me in the right direction.

I added ["It","he","she","they"], which I saw in the neuralcoref module examples.

The code below will do the job for you:

import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'

#The Wikipedia Page
wikiResults = wikipedia.search(subject)

wikiPage = wikipedia.page(wikiResults[0]).content

nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()

for word in ["It","he","she","they"]+subject.split(' '):    
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statments = textacy.extract.semistructured_statements(document, word, cue = cue,  max_n_words = 200, )
        for statement in statments:
            uniqueStatements.add(statement)

for statement in uniqueStatements:
    entity, cue, fact = statement
    print(entity, cue, fact)

answered 2020-04-21T21:11:04.370