0

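I have a system where information can come from various sources. I want to make sure I don't add exact (or extremely similar) pieces of information. Here is an example:

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

In this case I want some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:

  1. When adding text to the database, check it against the existing values in the database
  2. If a very similar value is found, do not add it
  3. If the value is considered different enough, add it

So we end up with distinct information in the database rather than duplicates, while still allowing a small amount of leeway.

Can anyone tell me how I might attempt this in Python?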

4 Answers

2

Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.

This nifty routine takes two strings and computes a similarity index in the range [0, 1].

Quick demo

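>>> import difflib, itertools
>>> # st was not defined in the original answer; this order matches the output below
>>> st = ['One day a man walked over the hill and saw the sun',
...       'One week a woman looked over a hill and saw the sun',
...       'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     print("Similarity Index {}".format(round(difflib.SequenceMatcher(None, a, b).ratio(), 12)))
...     print('-' * 80)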


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
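
The add-only-if-different logic from the question can be sketched on top of this ratio. This is only a minimal sketch under stated assumptions: the helper names `is_near_duplicate` and `add_if_new`, the plain list standing in for the database, and the 0.9 threshold are all made up for illustration, and the threshold would need tuning against real data.

```python
import difflib

def is_near_duplicate(candidate, existing, threshold=0.9):
    """Return True if candidate is at least `threshold` similar to any stored text."""
    for text in existing:
        ratio = difflib.SequenceMatcher(None, candidate, text).ratio()
        if ratio >= threshold:
            return True
    return False

def add_if_new(candidate, store, threshold=0.9):
    """Append candidate to store unless a near-duplicate is already there."""
    if is_near_duplicate(candidate, store, threshold):
        return False
    store.append(candidate)
    return True

# Hypothetical stand-in for the database: a plain Python list.
store = []
texts = [
    'One day a man walked over the hill and saw the sun',
    'One day a man walked over a hill and saw the sun',
    'One week a woman looked over a hill and saw the sun',
]
for text in texts:
    added = add_if_new(text, store)
    print(added, text)
```

With the three sample sentences and this threshold, the second sentence is rejected (ratio about 0.96 against the first) and the third is kept (about 0.83). Note that each insert compares against every stored text, so the cost grows linearly with the size of the store.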
answered 2013-08-22T11:35:48.803
1

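There are several Python libraries that can help you. Take a look at this question:

Levenshtein distance is a commonly used algorithm. I have found the NYSIIS algorithm really useful, especially if you want to save a representation of the string in the database.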
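
To make the Levenshtein idea concrete, here is a small pure-Python sketch of the classic dynamic-programming edit distance. It is not the API of any particular library; the function names `levenshtein` and `similarity` are hypothetical, and dividing by the length of the longer string is just one common way (assumed here) to scale the distance into [0, 1].

```python
def levenshtein(a, b):
    """Return the number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance into [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))
```

A threshold on `similarity` could then play the same role as a threshold on any other similarity index.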

The link will give you a good overview:

answered 2013-08-22T11:26:22.930
1

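A primitive way of doing this... but you could loop through the strings, comparing each word with the word at the same position in the other string, and then get the ratio of matches to failures: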

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

So in this example you can see that 11/12 words match. You could then set a pass/fail level.
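
The same idea could be wrapped in a function with a pass/fail level, roughly as below. This is a sketch only: `word_match_ratio` and the 0.9 cutoff are hypothetical, and dividing by the longer word count (so that extra trailing words count as failures, which plain `zip` would silently drop) is an assumption of this sketch rather than something the snippet above does.

```python
def word_match_ratio(a, b):
    """Fraction of word positions where a and b have the same word."""
    words_a = a.split()
    words_b = b.split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    # zip stops at the shorter list; unmatched trailing words count as misses
    matches = sum(x == y for x, y in zip(words_a, words_b))
    return matches / longest

aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'
ratio = word_match_ratio(aa, bb)
print(ratio)         # 11 matches out of 12 positions
print(ratio >= 0.9)  # hypothetical pass/fail level
```

Because the comparison is positional, a single inserted or deleted word shifts every following word out of alignment, so the ratio can drop sharply for texts that are otherwise nearly identical.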

answered 2013-08-22T11:27:19.757
0

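In Python or any other language, hashing is the simplest way to remove duplicates.

You can maintain a hash table of the entries that have already been added. When you add another one, just check whether its hash already exists.

Use hashlib

Adding a hashlib usage example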

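import hashlib

# update() requires bytes in Python 3, hence the b"" literals
m1 = hashlib.md5()
m1.update(b" the spammish repetition")
print(m1.hexdigest())

# different input, so a different digest
m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())

# same input as m1, so the same digest
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())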

Output

d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
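
The "hash table of what has been added" could look roughly like the sketch below. The names `fingerprint` and `add_if_unseen`, the set and list standing in for the database, and the lowercase/whitespace normalization step are all assumptions made for illustration. Keep in mind that a hash only catches texts that are identical after normalization; near-duplicates such as "the hill" versus "a hill" still produce different digests.

```python
import hashlib

def fingerprint(text):
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def add_if_unseen(text, seen_hashes, store):
    """Store text only if its fingerprint has not been seen before."""
    digest = fingerprint(text)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    store.append(text)
    return True

seen_hashes = set()  # hypothetical stand-in for an indexed hash column
store = []           # hypothetical stand-in for the database table

first = add_if_unseen("One day a man walked over the hill", seen_hashes, store)
second = add_if_unseen("one day a man  walked over the hill", seen_hashes, store)
third = add_if_unseen("One day a man walked over a hill", seen_hashes, store)
print(first, second, third)  # True False True
```

The membership test on the set is a constant-time lookup on average, which is the main attraction of this approach over pairwise comparison.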
answered 2013-08-22T11:28:19.537