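I want to use PyTextRank for keyword extraction. How can I feed 5 million documents (each consisting of a few paragraphs) to the package?
This is the example I saw in the official tutorial.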
import spacy
import pytextrank  # importing it registers the "textrank" pipeline component
from icecream import ic

nlp = spacy.load("en_core_web_sm")  # model name assumed here; any English spaCy model should work
nlp.add_pipe("textrank")

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
    ic(phrase.rank, phrase.count, phrase.text)
    ic(phrase.chunks)
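Is my only option to concatenate the millions of documents into a single string and pass it to nlp(text)? I don't think I can use nlp.pipe(texts), because I want to build a network by counting words/phrases across all documents.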
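To make concrete what I mean by a network built from counts across all documents, here is a minimal sketch in plain Python. It is not PyTextRank code: tokenize and build_cooccurrence are hypothetical helpers I made up for illustration, and linking two terms whenever they appear in the same document is a simplifying assumption, not necessarily how PyTextRank builds its graph. The point is only that the counts accumulate over a stream of documents, one at a time, without joining them into one string.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def tokenize(doc: str) -> List[str]:
    # Hypothetical stand-in for real phrase extraction:
    # split on whitespace, strip surrounding punctuation, lowercase.
    words = (w.strip(".,;:!?()\"'").lower() for w in doc.split())
    return [w for w in words if w]


def build_cooccurrence(docs: Iterable[str]) -> Tuple[Counter, Counter]:
    """Stream over documents, accumulating corpus-wide node and edge counts."""
    node_counts: Counter = Counter()  # term -> number of documents containing it
    edge_counts: Counter = Counter()  # (term_a, term_b) -> number of documents containing both
    for doc in docs:
        # Deduplicate and sort so each pair has one canonical orientation.
        terms = sorted(set(tokenize(doc)))
        node_counts.update(terms)
        edge_counts.update(combinations(terms, 2))
    return node_counts, edge_counts


docs = [
    "linear constraints over natural numbers",
    "linear Diophantine equations",
    "natural numbers and linear equations",
]
nodes, edges = build_cooccurrence(iter(docs))  # any iterator works, e.g. a file reader
print(nodes.most_common(3))
print(edges.most_common(3))
```

Only the two counters are kept in memory here, never the documents themselves. What I cannot tell is whether PyTextRank can be fed in a comparable streaming way while still ranking phrases over the whole corpus.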