nlp - How do I perform entity linking against a local knowledge graph?

Question

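I am building my own knowledge base from scratch, using online articles.

I am trying to map the entities in the SPO triples I scraped (the subject and possibly the object) to my own entity records, which consist of listed companies that I scraped from other websites.

I have looked at most of the libraries, and their methods focus on mapping entities to large knowledge bases such as Wikipedia or YAGO, but I am not sure how to apply those techniques to my own knowledge base.

So far I have found the NEL Python package, which claims to be able to do this, but I do not really understand its documentation, and it only focuses on Wikipedia data dumps.

Is there any technique or library that would let me do this?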

score 0 · Accepted Answer

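I assume you have something similar to the Wikidata knowledge base, that is, a huge list of concepts with aliases.

It can be represented more or less as follows: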

C1 new york
C1 nyc
C1 big apple

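Linking a span of a sentence to the KB above is easy for single words: you only need to set up an index that maps each single-word concept to its identifier.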
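A minimal sketch of such an index, assuming the KB is held in memory as a dictionary from concept identifier to aliases. The names `kb`, `alias_index` and `link_single_words` are illustrative and not part of any library. Only single-token aliases can be found this way, which is exactly the limitation discussed next.

```python
# Hypothetical toy KB: concept identifier -> list of aliases.
kb = {
    "C1": ["new york", "nyc", "big apple"],
}

# Invert the KB into an alias -> identifier index for constant-time lookups.
alias_index = {}
for concept_id, aliases in kb.items():
    for alias in aliases:
        alias_index[alias.lower()] = concept_id


def link_single_words(sentence):
    """Return (token, concept_id) pairs for tokens that are themselves aliases."""
    tokens = sentence.lower().split()
    return [(tok, alias_index[tok]) for tok in tokens if tok in alias_index]


print(link_single_words("I love NYC in spring"))
print(link_single_words("new york is the big apple"))  # multi-word aliases are missed
```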

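The hard part is linking multi-word or phrasal concepts such as "new york" or "big apple".

To achieve this, I use an algorithm that splits a sentence into all possible segments. I call these "spans". I then try to match each individual span (a group of words) against a concept in the database, whether that concept is a single word or several words.

For example, here are all the spans of a simple sentence. Each line is a list that stores lists of strings: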

[['new'], ['york'], ['is'], ['the'], ['big'], ['apple']]
[['new'], ['york'], ['is'], ['the'], ['big', 'apple']]
[['new'], ['york'], ['is'], ['the', 'big'], ['apple']]
[['new'], ['york'], ['is'], ['the', 'big', 'apple']]
[['new'], ['york'], ['is', 'the'], ['big'], ['apple']]
[['new'], ['york'], ['is', 'the'], ['big', 'apple']]
[['new'], ['york'], ['is', 'the', 'big'], ['apple']]
[['new'], ['york'], ['is', 'the', 'big', 'apple']]
[['new'], ['york', 'is'], ['the'], ['big'], ['apple']]
[['new'], ['york', 'is'], ['the'], ['big', 'apple']]
[['new'], ['york', 'is'], ['the', 'big'], ['apple']]
[['new'], ['york', 'is'], ['the', 'big', 'apple']]
[['new'], ['york', 'is', 'the'], ['big'], ['apple']]
[['new'], ['york', 'is', 'the'], ['big', 'apple']]
[['new'], ['york', 'is', 'the', 'big'], ['apple']]
[['new'], ['york', 'is', 'the', 'big', 'apple']]
[['new', 'york'], ['is'], ['the'], ['big'], ['apple']]
[['new', 'york'], ['is'], ['the'], ['big', 'apple']]
[['new', 'york'], ['is'], ['the', 'big'], ['apple']]
[['new', 'york'], ['is'], ['the', 'big', 'apple']]
[['new', 'york'], ['is', 'the'], ['big'], ['apple']]
[['new', 'york'], ['is', 'the'], ['big', 'apple']]
[['new', 'york'], ['is', 'the', 'big'], ['apple']]
[['new', 'york'], ['is', 'the', 'big', 'apple']]
[['new', 'york', 'is'], ['the'], ['big'], ['apple']]
[['new', 'york', 'is'], ['the'], ['big', 'apple']]
[['new', 'york', 'is'], ['the', 'big'], ['apple']]
[['new', 'york', 'is'], ['the', 'big', 'apple']]
[['new', 'york', 'is', 'the'], ['big'], ['apple']]
[['new', 'york', 'is', 'the'], ['big', 'apple']]
[['new', 'york', 'is', 'the', 'big'], ['apple']]
[['new', 'york', 'is', 'the', 'big', 'apple']]
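The listing above has 32 rows because a sentence of n words has n - 1 gaps between words, and each gap is either a cut or not, giving 2^(n-1) segmentations. The sketch below is an alternative, non-recursive way to enumerate them by choosing cut points; the function name `segmentations` is an assumption made for illustration, and the rows come out in a different order than in the listing.

```python
from itertools import combinations


def segmentations(tokens):
    """Yield every way to cut `tokens` into contiguous, non-empty spans."""
    n = len(tokens)
    if n == 0:
        return
    # Choose k of the n - 1 gaps between words as cut points.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tokens[a:b] for a, b in zip(bounds, bounds[1:])]


tokens = "new york is the big apple".split()
all_segs = list(segmentations(tokens))
print(len(all_segs), 2 ** (len(tokens) - 1))
```

Because the count doubles with every extra word, exhaustive enumeration is only practical for short sentences or short windows of a longer text.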

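Each sublist may or may not map to a concept. To find the best mapping, you can score each of the lines above by the number of concepts it matches.

With the example knowledge base, these are the two highest-scoring span lists from those above: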

2  ~  [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
2  ~  [['new', 'york'], ['is', 'the'], ['big', 'apple']]

So it guesses that "new york" is a concept and that "big apple" is also a concept.

Here is the full code:

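# Tokenized input sentence (named `tokens` to avoid shadowing the built-in `input`).
tokens = 'new york is the big apple'.split()


def spans(lst):
    """Yield every segmentation of `lst` into contiguous, non-empty sublists."""
    if not lst:
        return
    # Take the first `index` words as one span, then segment the remainder.
    for index in range(1, len(lst)):
        for span in spans(lst[index:]):
            yield [lst[:index]] + span
    # The whole list taken as a single span.
    yield [lst]


# Toy knowledge base: each entry is one concept, stored as a list of words.
knowledgebase = [
    ['new', 'york'],
    ['big', 'apple'],
]

out = []
scores = []

for span in spans(tokens):
    # Score a segmentation by the number of its spans that match a concept.
    score = sum(1 for candidate in span if candidate in knowledgebase)
    out.append(span)
    scores.append(score)

# Ascending sort: the highest-scoring segmentations are printed last.
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])

for winner in leaderboard:
    print(winner[1], ' ~ ', winner[0])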

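This has to be improved to associate the lists that match a concept with their concept identifier, and to find a way to spell-check everything (against the knowledge base).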
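One possible way to approach both improvements is sketched below, under stated assumptions: the alias table `ALIASES` and the helpers `lookup` and `link` are hypothetical, and approximate matching via the standard library's `difflib.get_close_matches` stands in for real spell-checking. The 0.85 cutoff is an arbitrary choice that would need tuning on real company names.

```python
from difflib import get_close_matches

# Hypothetical alias table: surface form -> concept identifier.
ALIASES = {"new york": "C1", "nyc": "C1", "big apple": "C1"}


def spans(tokens):
    """Yield every segmentation of `tokens` into contiguous, non-empty spans."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        for rest in spans(tokens[i:]):
            yield [tokens[:i]] + rest


def lookup(span, cutoff=0.85):
    """Return the concept id for a span, tolerating small misspellings."""
    text = " ".join(span)
    if text in ALIASES:
        return ALIASES[text]
    close = get_close_matches(text, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


def link(sentence):
    """Pick the segmentation with the most linked spans; return its links."""
    tokens = sentence.lower().split()
    best, best_score = [], -1
    for seg in spans(tokens):
        linked = [(" ".join(s), lookup(s)) for s in seg]
        score = sum(cid is not None for _, cid in linked)
        if score > best_score:
            best, best_score = linked, score
    return [(text, cid) for text, cid in best if cid is not None]


print(link("new york is the big apple"))
print(link("new yrok is the big apple"))  # misspelled alias
```

Ties between segmentations are broken by enumeration order here; a real system would probably also want to prefer longer or more exact matches.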
