python - Checking a single record with the Python Dedupe package

Question

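I am using the Dedupe Python package to check incoming records for duplicates. I have trained it on approximately 500000 records from a CSV file, and with Dedupe I grouped those 500000 records into clusters. I am now trying to use the settings_file produced at the end of training to deduplicate a new record (data in the code). A code snippet is shared below.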

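import dedupe
from unidecode import unidecode
import os

# settings_file is the path to the settings file written at the end of training
deduper = None
if os.path.exists(settings_file):
    with open(settings_file, 'rb') as sf:
        # Load the trained model from the settings file without retraining
        deduper = dedupe.StaticDedupe(sf)

# data holds only the single new record (shown below); 0 is the threshold
clustered_dupes = deduper.match(data, 0)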

Here data is the single new record that I have to check for duplicates. It looks like this:

{1:{'SequenceID': 6855406, 'ApplicationID': 7065902, 'CustomerID': 6153222, 'Name': 'X', 'col1': '-42332423', 'col2': '0', 'col3': '0', 'col4': '0', 'col5': '24G0859681', 'col6': '0', 'col7': 'xyz12345', 'col8': 'xyz', 'col9': '1234', 'col10': 'xyz10'}}

This raises an error:

No records have been blocked together. Is the data you are trying to match like the data you trained on?

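How can I use the clustered data to check whether a new record is a duplicate? Can this be done the way we would score a new sample with any other ML model? I have looked at multiple sources but did not find a solution to this problem.

Most sources talk about training, not about how to use the clustered data to check a single record.

Is there another way to do this?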

Some links I referred to: link1 link2 link3

Any help is appreciated.

score 0 · Accepted Answer

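You need to pass the data you originally trained on together with the new record as the input to clustering, using the pre-trained settings. A single record on its own has nothing to be blocked with, which is why the error is raised.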
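A minimal sketch of that idea, not a definitive implementation: merge the new record into the existing records, cluster the combined set, and report which existing ids end up in the same cluster as the new record. The names `find_duplicates_of` and `toy_cluster` are hypothetical. `toy_cluster` is only a stand-in that groups records on an exact `col7` match so the sketch runs without a trained model. It assumes the clustering call returns `(record_ids, scores)` pairs, which is the shape `match` returns in dedupe 1.x as far as I know.

```python
def find_duplicates_of(new_id, new_record, known_records, cluster_fn):
    """Cluster known_records plus the new record and return the ids of
    existing records that land in the same cluster as new_id.

    cluster_fn is any callable that takes {record_id: record} and returns
    an iterable of (record_ids, scores) pairs (assumed output shape).
    """
    if new_id in known_records:
        raise ValueError("new_id collides with an existing record id")

    # Copy so the caller's dictionary is not modified.
    combined = dict(known_records)
    combined[new_id] = new_record

    for record_ids, _scores in cluster_fn(combined):
        if new_id in record_ids:
            return [rid for rid in record_ids if rid != new_id]
    return []  # the new record was not clustered with anything


def toy_cluster(records):
    """Stand-in for the real clustering call: exact match on 'col7'."""
    groups = {}
    for rid, rec in records.items():
        groups.setdefault(rec["col7"], []).append(rid)
    # Only groups with more than one member count as duplicate clusters.
    return [(tuple(ids), (1.0,) * len(ids))
            for ids in groups.values() if len(ids) > 1]
```

With the real package, `cluster_fn` would presumably wrap the loaded model, for example `lambda records: deduper.match(records, 0.5)`, where the 0.5 threshold is an assumption to tune. In dedupe 2.x the equivalent call is, I believe, `partition` rather than `match`. Re-clustering all 500000 records for every incoming record may be slow, so this is best treated as a starting point.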
