spacy - Feeding large texts to PyTextRank

Question

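I want to use PyTextRank for keyword extraction. How can I feed 5 million documents (each consisting of a few paragraphs) to the package?

This is the example I saw in the official tutorial.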

import spacy
import pytextrank  # importing registers the "textrank" pipeline component
from icecream import ic
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
doc = nlp(text)
for phrase in doc._.phrases:
    ic(phrase.rank, phrase.count, phrase.text)
    ic(phrase.chunks)

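Is my only option to concatenate the millions of documents into a single string and pass it to nlp(text)? I don't think I can use nlp.pipe(texts), because I want to build a network by counting words/phrases across all documents.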

score 3 · Accepted Answer

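No. Instead, it would almost certainly be better to run these tasks in parallel. Many pytextrank use cases rely on Spark, Dask, Ray, etc. to parallelize running documents through a spaCy pipeline with pytextrank to extract entities. For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45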
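As a rough illustration of the per-document, parallel approach, here is a minimal sketch that uses Python's standard-library ProcessPoolExecutor in place of Ray, Spark, or Dask. The helper names (init_worker, extract_batch, batched, extract_all), the en_core_web_sm model, the batch size, and the top-10 cutoff are assumptions made for this example and are not part of pytextrank.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

_NLP = None  # one spaCy pipeline per worker process (hypothetical module-level cache)


def init_worker():
    """Load spaCy + pytextrank once in each worker process."""
    global _NLP
    import spacy
    import pytextrank  # noqa: F401  (importing registers the "textrank" component)

    _NLP = spacy.load("en_core_web_sm")  # assumed model; use whichever you have installed
    _NLP.add_pipe("textrank")


def extract_batch(batch):
    """Take a list of (doc_id, text) pairs; return (doc_id, [(phrase, rank), ...]) pairs."""
    doc_ids, texts = zip(*batch)
    results = []
    for doc_id, doc in zip(doc_ids, _NLP.pipe(texts)):
        # doc._.phrases is ordered by rank; keeping the top 10 is an arbitrary cutoff
        top = [(p.text, p.rank) for p in doc._.phrases[:10]]
        results.append((doc_id, top))
    return results


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def extract_all(documents, workers=4, batch_size=1000):
    """Yield (doc_id, phrases) for every document, spreading batches over processes."""
    with ProcessPoolExecutor(max_workers=workers, initializer=init_worker) as pool:
        # Note: Executor.map submits every batch up front, so for a very large
        # corpus you may prefer to call extract_all on one slice of it at a time.
        for results in pool.map(extract_batch, batched(documents, batch_size)):
            yield from results
```

Call extract_all(documents) from under an `if __name__ == "__main__":` guard, where documents is an iterable of (doc_id, text) pairs. Each result carries its doc_id, so the extracted phrases stay tied to the document they came from.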

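One question: how are you associating the extracted entities with the documents? Are they being collected into a dataset, or perhaps a database or key/value store?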

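However those results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project, kglab https://github.com/DerwenAI/kglab , was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/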
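To make the co-occurrence idea concrete, here is a minimal standard-library sketch. It is not the kglab API. It assumes the phrases have already been extracted and collected as (doc_id, phrases) pairs; the function name build_cooccurrence and the lowercasing step are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(doc_phrases):
    """Count phrases (nodes) and phrase pairs sharing a document (edges).

    doc_phrases: iterable of (doc_id, list_of_phrase_strings) pairs.
    Returns (node_counts, edge_weights), both Counters. Edge keys are
    alphabetically ordered (phrase_a, phrase_b) tuples.
    """
    node_counts = Counter()
    edge_weights = Counter()
    for _doc_id, phrases in doc_phrases:
        # De-duplicate within a document and normalize case (an assumption)
        unique = sorted({p.lower() for p in phrases})
        node_counts.update(unique)
        # Every pair of phrases in the same document adds 1 to that edge
        edge_weights.update(combinations(unique, 2))
    return node_counts, edge_weights


sample = [
    ("d1", ["linear constraints", "natural numbers"]),
    ("d2", ["Natural numbers", "minimal set", "linear constraints"]),
]
nodes, edges = build_cooccurrence(sample)
```

Because the counters are updated one document at a time, the corpus never has to be concatenated into a single string. The resulting edge weights can be loaded into whatever graph library you prefer.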

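FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank, and there are several videos online (under Graph Data Science) from previous conference tutorials. We also hold monthly public office hours through https://www.knowledgegraph.tech/ ; message me @pacoid on Tw for details.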
