python-dedupe - How to understand Dedupe library?

翻译自：https://stackoverflow.com/questions/48661871 2018-02-07T10:47:14.577

283 次

1

Two questions:

How to interpret the 'confidence score' when there is cluster with 3 rows and 3 confidence scores (0.98, 0.45, 0.45). Where this confidence scores come from? From logistic regression or somehow from hierarchical clustering?
10 000 of my 16 millions is labeled as duplicates, should I put this all as trening data? or only 10 positive and 10 negative will be enough? what number will be better for quality and time of execution?

1 回答 1

1

置信度得分1 - square root of the average squared distance介于记录和集群中的其他记录之间，distance其中1 - predicted probability that a pair of records are coreferent

有关详细信息，请参阅https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.cluster

于 2020-03-09T21:26:00.597 回答