php - How to check whether one text is contained in another?

Question

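I'm developing a document system, and each time a new document is created, the system must detect and discard duplicates against a database of about 500,000 records.

Currently I use a search engine to retrieve the 20 most similar documents and compare them with the new document we are trying to create. The problem is that I have to check whether the new document is similar (easy with similar_text) or whether it is contained in another text, all while taking into account that the text may have been partially changed by the user (this is the hard part). How can I do this?

For example: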

<?php

$new = "the wild lion";

$candidates = array(
  'the dangerous lion lives in Africa', // $new is contained in this one, but with 'wild' changed to 'dangerous'; it has to be detected as a duplicate
  'rhinoceros are native to Africa and three to southern Asia.'
);

foreach ( $candidates as $candidate ) {
  if (is_duplicate($new, $candidate)) { // hypothetical function: true when $candidate is similar to $new, or $new is contained in it
       //Duplicated!!
  }
}

Of course, in my system the documents are longer than 3 words :)

score 1 · Accepted Answer

This is the temporary solution I'm using:

function contained($text1, $text2, $factor = 0.9) {
    //Split into words
    $pattern= '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);

    //Set long and short text
    if (count($words1) > count($words2)) {
        $long = $words1;
        $short = $words2;
    } else {
        $long = $words2;
        $short = $words1;
    }

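    //An empty text has no words to match; bail out to avoid dividing by zero
    if (count($short) === 0) {
        return false;
    }

    //Count the number of words of the short text that also are in the long
    $count = 0;
    foreach ($short as $word) {
        //Strict comparison, so numeric-looking words such as "10" and "1e1" do not match
        if (in_array($word, $long, true)) {
            $count++;
        }
    }

    return ($count / count($short)) > $factor;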
}
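
To choose a sensible $factor it can help to look at the ratio itself rather than only the boolean result. The sketch below is not part of the solution above: containment_ratio is a hypothetical helper that reuses the same splitting pattern and returns the fraction of the shorter text's words found in the longer one. It assumes the mbstring extension is available, as the original function does.

```php
<?php
// Hypothetical helper, not part of the answer: returns the fraction of the
// shorter text's words that also occur in the longer text (0.0 to 1.0).
function containment_ratio(string $text1, string $text2): float
{
    // Same word-splitting pattern as the answer's function
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);

    if (count($words1) > count($words2)) {
        [$long, $short] = [$words1, $words2];
    } else {
        [$long, $short] = [$words2, $words1];
    }
    if (count($short) === 0) {
        return 0.0; // nothing to compare
    }

    // Flip the long text into a lookup table so each membership test is cheap
    $lookup = array_flip($long);
    $hits = 0;
    foreach ($short as $word) {
        if (isset($lookup[$word])) {
            $hits++;
        }
    }
    return $hits / count($short);
}

// The example from the question
$new = 'the wild lion';
$candidates = [
    'the dangerous lion lives in Africa',
    'rhinoceros are native to Africa and three to southern Asia.',
];
foreach ($candidates as $candidate) {
    printf("%.2f  %s\n", containment_ratio($new, $candidate), $candidate);
}
```

For the question's example the first candidate scores 2/3 (about 0.67) and the second scores 0, so the default factor of 0.9 would not flag the "dangerous lion" sentence as a duplicate. With texts this short a single changed word costs a lot; on longer documents the same edit moves the ratio far less.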

score 0 · Accepted Answer

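Some ideas that you could pursue or investigate further:

Index the documents and then search for similar documents. Open-source indexing/search systems such as Solr, Sphinx or Zend Search Lucene could come in handy here.
You could use the sim hashing algorithm or shingling. Briefly, the simhash algorithm lets you compute similar hash values for similar documents. You could then store this value with each document and check how similar various documents are.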
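
As a rough illustration of the simhash idea, here is a minimal sketch. Everything in it is an assumption made for the example: the function names simhash32 and hamming_distance are made up, crc32 is used as the per-word hash only because it is built in, every word gets the same weight, and the fingerprint is only 32 bits wide. A real deployment would likely use a wider hash and might hash shingles (runs of consecutive words) instead of single words.

```php
<?php
// Illustrative 32-bit simhash. crc32 as the per-word hash is an assumption
// made for this sketch, not something the answer prescribes.
function simhash32(string $text): int
{
    // Split on anything that is not a letter or a digit
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // One signed counter per bit position
    $counters = array_fill(0, 32, 0);
    foreach ($words as $word) {
        $hash = crc32($word);
        for ($bit = 0; $bit < 32; $bit++) {
            // A set bit votes +1 for that position, a clear bit votes -1
            $counters[$bit] += (($hash >> $bit) & 1) ? 1 : -1;
        }
    }

    // The fingerprint keeps the bits whose counter ended up positive
    $fingerprint = 0;
    for ($bit = 0; $bit < 32; $bit++) {
        if ($counters[$bit] > 0) {
            $fingerprint |= (1 << $bit);
        }
    }
    return $fingerprint;
}

// Number of differing bits among the low 32 bits of two fingerprints
function hamming_distance(int $a, int $b): int
{
    $xor = $a ^ $b;
    $distance = 0;
    for ($bit = 0; $bit < 32; $bit++) {
        $distance += ($xor >> $bit) & 1;
    }
    return $distance;
}

$a = simhash32('the wild lion lives in Africa');
$b = simhash32('the dangerous lion lives in Africa');
echo hamming_distance($a, $b), " bits differ\n";
```

The fingerprint can be stored in an integer column next to each document. Documents whose fingerprints differ in only a few bits are candidates for a closer comparison; the cut-off has to be tuned on real data.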

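Other algorithms that you might find helpful to get some ideas from:

1. Levenshtein distance

2. Bayesian filtering - SO question on Bayesian filtering. The first link in this list item points to the Bayesian spam filtering article on Wiki, but the algorithm can be adapted to what you are trying to do.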
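
To make the Levenshtein suggestion concrete for whole documents, here is a hedged sketch that computes the edit distance over words instead of characters. PHP's built-in levenshtein() works on bytes and, before PHP 8.0, returned -1 for strings longer than 255 characters, so it is awkward for full texts. The names tokenize and word_edit_distance are hypothetical. The $anywhere flag is an addition for this example: it lets the first argument match any stretch of the second (approximate substring matching), which is one way to model "contained in, with small changes".

```php
<?php
// Hypothetical helper: lower-case the text and split it into words
function tokenize(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Word-level edit distance between $pattern and $text.
// With $anywhere = true the pattern may start and end anywhere inside $text,
// so the result is the cost of the best approximate occurrence.
function word_edit_distance(array $pattern, array $text, bool $anywhere = false): int
{
    $pattern = array_values($pattern);
    $text = array_values($text);
    $n = count($text);

    // Row 0: skipping a prefix of $text is free only in "anywhere" mode
    $prev = $anywhere ? array_fill(0, $n + 1, 0) : range(0, $n);

    foreach ($pattern as $i => $patternWord) {
        $curr = [$i + 1]; // deleting the first $i + 1 pattern words
        foreach ($text as $j => $textWord) {
            $cost = ($patternWord === $textWord) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1,  // drop a word from the pattern
                $curr[$j] + 1,      // insert a word from the text
                $prev[$j] + $cost   // match or substitute
            );
        }
        $prev = $curr;
    }

    // In "anywhere" mode the match may end at any position in $text
    return $anywhere ? min($prev) : $prev[$n];
}

$new = tokenize('the wild lion');
$candidate = tokenize('the dangerous lion lives in Africa');
echo word_edit_distance($new, $candidate), "\n";       // whole-text distance
echo word_edit_distance($new, $candidate, true), "\n"; // best approximate occurrence
```

For the question's example the whole-text distance is 4 (one substitution plus three extra words), while the "anywhere" distance is 1, because 'the dangerous lion' differs from 'the wild lion' by a single word. Dividing that distance by the number of words in the new document gives a ratio that can be compared against a threshold. Note that the running time grows with the product of the two word counts, so this is only practical on the short list of candidates returned by the search engine.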
