
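I am building my own knowledge base from scratch, using articles found online.

I am trying to map the entities in my scraped SPO triples (the subject, and possibly the object) to my own entity records, which consist of listed companies that I scraped from another website.

I have looked at most of the available libraries. Their methods focus on mapping entities to large knowledge bases such as Wikipedia and YAGO, and I am not sure how to apply those techniques to my own knowledge base.

So far I have found the NEL Python package, which claims to be able to do this, but I do not really understand its documentation, and it only deals with Wikipedia data dumps.

Are there any techniques or libraries that would let me do this?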


1 Answer


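I assume you have something similar to the Wikidata knowledge base: a large list of concepts, each with aliases.

It can be represented more or less as follows: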

C1 new york
C1 nyc
C1 big apple

Linking a span of a sentence to the KB above is easy for single words: you only need to set up an index that maps single-word concepts to identifiers.
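A minimal sketch of such an index, assuming the knowledge base is available as text lines of the form `<identifier> <alias>`, as in the listing above. The names `KB_LINES`, `build_alias_index` and `link_single_words` are hypothetical, and the input is assumed to be already lowercased and split on whitespace.

```python
# Hypothetical KB dump: one "<identifier> <alias>" pair per line.
KB_LINES = [
    "C1 new york",
    "C1 nyc",
    "C1 big apple",
]


def build_alias_index(lines):
    """Map each alias, stored as a tuple of lowercase words, to its identifier."""
    index = {}
    for line in lines:
        uid, alias = line.split(maxsplit=1)
        index[tuple(alias.lower().split())] = uid
    return index


def link_single_words(tokens, index):
    """Return (token, identifier) pairs for tokens that are single-word aliases."""
    return [(t, index[(t,)]) for t in tokens if (t,) in index]


index = build_alias_index(KB_LINES)
print(link_single_words("i love nyc".split(), index))
```

Storing aliases as tuples of words keeps single-word and multi-word aliases in the same dictionary, so the same index can later be queried with multi-word spans.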

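The hard part is linking multi-word or phrase concepts such as "new york" or "big apple".

To achieve that, I use an algorithm that splits a sentence into all possible segments. I call these "spans". It then tries to match each individual span (a single word or a group of words) with a concept from the database (single-word or multi-word).

For example, here are all the spans of a simple sentence. Each line is one way of splitting the sentence, stored as a list of lists of strings: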

[['new'], ['york'], ['is'], ['the'], ['big'], ['apple']]
[['new'], ['york'], ['is'], ['the'], ['big', 'apple']]
[['new'], ['york'], ['is'], ['the', 'big'], ['apple']]
[['new'], ['york'], ['is'], ['the', 'big', 'apple']]
[['new'], ['york'], ['is', 'the'], ['big'], ['apple']]
[['new'], ['york'], ['is', 'the'], ['big', 'apple']]
[['new'], ['york'], ['is', 'the', 'big'], ['apple']]
[['new'], ['york'], ['is', 'the', 'big', 'apple']]
[['new'], ['york', 'is'], ['the'], ['big'], ['apple']]
[['new'], ['york', 'is'], ['the'], ['big', 'apple']]
[['new'], ['york', 'is'], ['the', 'big'], ['apple']]
[['new'], ['york', 'is'], ['the', 'big', 'apple']]
[['new'], ['york', 'is', 'the'], ['big'], ['apple']]
[['new'], ['york', 'is', 'the'], ['big', 'apple']]
[['new'], ['york', 'is', 'the', 'big'], ['apple']]
[['new'], ['york', 'is', 'the', 'big', 'apple']]
[['new', 'york'], ['is'], ['the'], ['big'], ['apple']]
[['new', 'york'], ['is'], ['the'], ['big', 'apple']]
[['new', 'york'], ['is'], ['the', 'big'], ['apple']]
[['new', 'york'], ['is'], ['the', 'big', 'apple']]
[['new', 'york'], ['is', 'the'], ['big'], ['apple']]
[['new', 'york'], ['is', 'the'], ['big', 'apple']]
[['new', 'york'], ['is', 'the', 'big'], ['apple']]
[['new', 'york'], ['is', 'the', 'big', 'apple']]
[['new', 'york', 'is'], ['the'], ['big'], ['apple']]
[['new', 'york', 'is'], ['the'], ['big', 'apple']]
[['new', 'york', 'is'], ['the', 'big'], ['apple']]
[['new', 'york', 'is'], ['the', 'big', 'apple']]
[['new', 'york', 'is', 'the'], ['big'], ['apple']]
[['new', 'york', 'is', 'the'], ['big', 'apple']]
[['new', 'york', 'is', 'the', 'big'], ['apple']]
[['new', 'york', 'is', 'the', 'big', 'apple']]
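The listing has 32 lines because a sentence of n words has n - 1 gaps between words, and each gap either is a cut or is not, which gives 2^(n-1) ways to split it. As a cross-check, here is a sketch that enumerates the same splits by choosing cut positions with `itertools.combinations`. The function name `segmentations` is hypothetical, and the results come out in a different order than in the listing.

```python
from itertools import combinations


def segmentations(words):
    """Yield every split of words into contiguous, non-empty sublists."""
    n = len(words)
    if n == 0:
        return
    gaps = range(1, n)  # positions between words where a cut may be placed
    for k in range(n):  # k = number of cuts, from 0 to n - 1
        for cuts in combinations(gaps, k):
            bounds = (0, *cuts, n)
            yield [words[a:b] for a, b in zip(bounds, bounds[1:])]


words = "new york is the big apple".split()
all_segmentations = list(segmentations(words))
print(len(all_segmentations) == 2 ** (len(words) - 1))
```

The count doubles with every extra word, so exhaustive enumeration is only practical for short sentences or short phrases.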

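Each sublist may or may not map to a concept. To find the best mapping, you can score each of the lines above by the number of concepts it matches.

Here are the two span lists from above that get the best score against the example knowledge base: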

2  ~  [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
2  ~  [['new', 'york'], ['is', 'the'], ['big', 'apple']]

So it guessed that "new york" is a concept and that "big apple" is also a concept.

Here is the full code:

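# Named `tokens` rather than `input` to avoid shadowing the built-in.
tokens = 'new york is the big apple'.split()


def spans(lst):
    """Yield every way to split lst into contiguous, non-empty sublists."""
    if not lst:
        return
    # Cut after the first `index` words, then split the remainder recursively.
    for index in range(1, len(lst)):
        for rest in spans(lst[index:]):
            yield [lst[:index]] + rest
    # Finally, the whole list as a single span.
    yield [lst]

knowledgebase = [
    ['new', 'york'],
    ['big', 'apple'],
]

out = []
scores = []

for span in spans(tokens):
    # Score a split by how many of its sublists match a concept exactly.
    score = sum(1 for candidate in span if candidate in knowledgebase)
    out.append(span)
    scores.append(score)

# Ascending sort: the best-scoring splits are printed last.
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])

for segmentation, score in leaderboard:
    print(score, ' ~ ', segmentation)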

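This still needs to be improved in two ways: the lists that match a concept should be associated with the identifier of that concept, and there should be some way to spellcheck everything against the knowledge base.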
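One possible way to approach both improvements, as a rough sketch rather than a finished linker: store the knowledge base as a dictionary from alias to identifier, so that each match carries its identifier, and fall back to `difflib.get_close_matches` from the standard library for misspelled spans. The names `ALIASES`, `link_span` and `best_linking`, the identifier `C1`, and the 0.8 similarity cutoff are assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical alias table: alias text -> concept identifier.
ALIASES = {
    "new york": "C1",
    "nyc": "C1",
    "big apple": "C1",
}


def spans(lst):
    """Yield every split of lst into contiguous, non-empty sublists."""
    for index in range(1, len(lst)):
        for rest in spans(lst[index:]):
            yield [lst[:index]] + rest
    if lst:
        yield [lst]


def link_span(words, cutoff=0.8):
    """Return the identifier for a span, tolerating small typos, or None."""
    text = " ".join(words)
    if text in ALIASES:
        return ALIASES[text]
    # Fuzzy fallback: closest alias whose similarity ratio is at least cutoff.
    close = get_close_matches(text, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


def best_linking(tokens):
    """Return the split with the most linked spans, as (span, id) pairs."""
    best, best_score = [], -1
    for segmentation in spans(tokens):
        linked = [(span, link_span(span)) for span in segmentation]
        score = sum(uid is not None for _, uid in linked)
        if score > best_score:
            best, best_score = linked, score
    return best


# "yrok" is a deliberate typo to exercise the fuzzy fallback.
for span, uid in best_linking("new yrok is the big apple".split()):
    print(uid, span)
```

The cutoff is a trade-off: a lower value tolerates more typos but also lets spans with extra words match an alias, so it would need tuning against the real company names.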

Answered 2019-09-30T11:06:14.950