c++ - 抄袭检测-风选算法-指纹冲突

Question

我在大文本文件中编写剽窃检测应用程序。在阅读了很多关于它的文章后，我决定使用Winnowing 算法（使用 Karp-Rabin 滚动哈希函数），但我遇到了一些问题。

数据：

我有两个简单的文本文件——第一个更大，第二个只是第一个的一个段落。

使用的算法：

这是我用来从所有哈希中选择我的指纹的算法。

void winnow(int w /*window size*/) {
    // circular buffer implementing window of size w
    hash_t h[w];
    for (int i=0; i<w; ++i) h[i] = INT_MAX;
    int r = 0; // window right end
    int min = 0; // index of minimum hash
    // At the end of each iteration, min holds the
    // position of the rightmost minimal hash in the
    // current window. record(x) is called only the
    // first time an instance of x is selected as the
    // rightmost minimal hash of a window.
    while (true) {
        r = (r + 1) % w; // shift the window by one
        h[r] = next_hash(); // and add one new hash, if hash = -1, then it's end of file
        if(h[r] == -1)
            break;
        if (min == r) {
            // The previous minimum is no longer in this
            // window. Scan h leftward starting from r
            // for the rightmost minimal hash. Note min
            // starts with the index of the rightmost
            // hash.
            for(int i=(r-1)%w; i!=r; i=(i-1+w)%w)
                if (h[i] < h[min]) min = i;
                    record(h[min], global_pos(min, r, w));
        } else {
            // Otherwise, the previous minimum is still in
            // this window. Compare against the new value
            // and update min if necessary.
            if (h[r] <= h[min]) { // (*)
                min = r;
                record(h[min], global_pos(min, r, w));
            }
        }
    }
}

接下来，为了检测两个文件中是否有相同的文本，我只需比较两个文本的指纹以检查我们是否有匹配项。因此，为了检测抄袭，算法必须采用在文本中完全相同的位置开始的哈希，例如：

Text1: A do run |t^his my check your.

文本 2：我的 bla lol |t^his my dasd 鸡。

为了获得正确的哈希值，它具有相同的值（这也意味着我们有相同的文本），算法应该从我用“|”指向的地方获取指纹或“^”（我假设我们需要 5 个字符来计算哈希，没有空格）。它不能从 '|' 中获取哈希值在文本 1 中和文本 2 中的 '^' 因为这两个哈希值不同并且不会检测到抄袭。

问题：

要检测这一段是否是从文本编号 1 复制的，我必须在两个文本的某个位置有两个相同的指纹。问题是算法选择不适合彼此的指纹，我的意思是他们只是错过了，即使在更大的文本中也是如此。

问题：

你有什么想法我可以改进这个算法（这实际上降低了羚牛指纹的正确算法），它会更有可能发现抄袭？

我的想法：

我考虑过运行 winnow 函数几次，对于不同的窗口大小（这将导致采用不同的哈希值），但对于该程序必须在其上工作的大文本（如 2MB 只是文本），这将花费太多时间。

score 2 · Accepted Answer

If you have running window over which you are computing the hash, you can update the hash value in constant time when window moves. The method is called Rabin fingerprint (see also). This should allow you to compute all fingerprints of size X in O(n) running time (n is a size of an input document). I guess the paper you cite is some advanced extension of this method and when implemented correctly it should also give you a similar running time. The key is to update the hash not recompute it.

c++ - 抄袭检测-风选算法-指纹冲突

1 回答 1

Related

Reference