You need to process the document with several different cue verbs to pick up the verbs commonly used to describe the subject, and if the subject is multiple words you also need to split the string. For example, for Bill Gates you would search the combinations "Bill", "Gates", and "Bill Gates", and extract statements with the different cue verbs used to describe the person/thing of interest.
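A minimal sketch of how that term list could be built (the helper name `search_terms` is my own, not part of textacy):

```python
def search_terms(subject):
    """Return each individual word of the subject plus,
    for multi-word subjects, the full phrase itself."""
    words = subject.split()
    terms = list(words)
    if len(words) > 1:
        terms.append(subject)
    return terms

# search_terms("Bill Gates") yields ["Bill", "Gates", "Bill Gates"]
# search_terms("Gates") yields just ["Gates"]
```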
So, for example, searching for "Gates":
statements = textacy.extract.semistructured_statements(document, "Gates", cue='have', max_n_words=200)
gives you more results, such as:
* entity: Gates , cue: had , fact: primary responsibility for Microsoft's product strategy from the company's founding in 1975 until 2006
* entity: Gates , cue: is , fact: notorious for not being reachable by phone and for not returning phone calls
* entity: Gates , cue: was , fact: the second wealthiest person behind Carlos Slim, but regained the top position in 2013, according to the Bloomberg Billionaires List
* entity: Bill , cue: were , fact: the second-most generous philanthropists in America, having given over $28 billion to charity
* entity: Gates , cue: was , fact: seven years old
* entity: Gates , cue: was , fact: the guest on BBC Radio 4's Desert Island Discs on January 31, 2016, in which he talks about his relationships with his father and Steve Jobs, meeting Melinda Ann French, the start of Microsoft and some of his habits (for example reading The Economist "from cover to cover every week
* entity: Gates , cue: was , fact: the world's highest-earning billionaire in 2013, as his net worth increased by US$15.8 billion to US$78.5 billion
Note that the verb can be negated, as in result 2!
I also noticed that setting max_n_words higher than the default of 20 produces more interesting statements.
Here is my full script:
import wikipedia
import spacy
import textacy
import en_core_web_sm
subject = 'Bill Gates'
#The Wikipedia Page
wikiResults = wikipedia.search(subject)
#print("wikiResults:", wikiResults)
wikiPage = wikipedia.page(wikiResults[0]).content
print("\n\nwikiPage:", wikiPage, "\n")
nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()
for word in ["Gates", "Bill", "Bill Gates"]:
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

print("found", len(uniqueStatements), "statements.")
for statement in uniqueStatements:
    entity, cue, fact = statement
    print("* entity:", entity, ", cue:", cue, ", fact:", fact)
Varying the subject and the cue verbs gets me 23 results instead of 1.