
I am planning to build a project similar to a similarity checker for my final year. In the project, I plan to check the percentage of similarity between submitted assignments, i.e. offline.

For example:

  1. When the first student submits an assignment, it is not checked against any other assignment.

  2. When the second student submits an assignment, it is checked against the first one.

  3. When the third student submits an assignment, it is checked against the first and second submissions.

  4. Likewise, if there are 35 students, the 36th submitted assignment is checked against the other 35 submissions.
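A minimal sketch of that submission flow could look like this (similarity() is a placeholder for whichever comparison algorithm ends up being chosen, which is what this question is about):

def similarity(assignment1, assignment2):
    # placeholder: return a similarity percentage between two texts;
    # choosing/implementing this algorithm is the subject of the question
    raise NotImplementedError

submissions = []

def onSubmit(newAssignment):
    # compare the new assignment against every earlier submission
    reports = [ (index, similarity(newAssignment, earlier))
                for index, earlier in enumerate(submissions) ]
    submissions.append(newAssignment)
    return reports  # empty for the very first submission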

Now, the question is how to compare two assignments. In this case, the comparison is the similarity between the text of the documents. I want a result that looks like this:

Similarity checking between documents

I only want to display the percentage of similar sentences, and which sentences they are.

What I have done:

I researched different algorithms such as tf-idf and cosine similarity, but I am not able to correctly interpret the results these algorithms give.
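A minimal sketch of that kind of approach (assuming scikit-learn is available) looks like this; note that it yields a single overall score per document pair, not the matching sentences:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first assignment text ...", "second assignment text ..."]

# Each document becomes a tf-idf weighted word-frequency vector.
tfidf = TfidfVectorizer().fit_transform(documents)

# Pairwise cosine similarity between all document vectors (values in [0, 1]).
scores = cosine_similarity(tfidf)
print(scores[0][1])  # overall similarity of document 0 vs document 1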

So, I want to know which algorithm is best in this situation, and I would like to know how this is done. Are there any references (websites, blogs) that could help me?


1 Answer


It depends on how the comparison results are returned by the algorithm you use.

For example, the function below compares a list of document contents and returns a dictionary of document pairs mapped to the list of word sequences they have in common. It does not differentiate word sequences that are contained within each other, because there may be a different number of occurrences of the longer and shorter word sequences that overlap.

import re
from itertools import combinations

def wordList(document): return re.findall(r"\w+",document.lower())

def compareDocs(documents, minSize=2, maxSize=25):
    result  = dict() # { (documentIndex,documentIndex) : [CommonExpressions] }
    def tallyDuplicates(expressionDocs):
        # transpose {expression: doc indexes} into {(doc pair): [expressions]}
        for expression,docIndexes in expressionDocs.items():
            for docIndex,otherDoc in combinations(docIndexes,2):
                result.setdefault((docIndex,otherDoc),[]).append(expression)

    documentWords    = [ wordList(document) for document in documents ]
    wordCounts       = [ len(words) for words in documentWords ]
    # map each expression to the (document, word position) pairs where it starts
    expressionRanges = dict()
    for docIndex,words in enumerate(documentWords):
        for wordIndex,word in enumerate(words):
            expressionRanges.setdefault(word,[]).append((docIndex,wordIndex))

    size = 1
    while size == 1 or expressionDocs and size <= maxSize:
        nextExpressions   = dict()
        expressionDocs    = dict()
        # extend every surviving expression by one word to the right
        for expression,starts in expressionRanges.items():
            for docIndex,startIndex in starts:
                endIndex = startIndex+size
                if endIndex >= wordCounts[docIndex]: continue
                extended = " ".join([expression,documentWords[docIndex][endIndex]])
                expressionDocs.setdefault(extended,set()).add(docIndex)
                nextExpressions.setdefault(extended,[]).append( (docIndex,startIndex) )
        # keep only expressions that appear in more than one document
        expressionDocs   = { expression:docIndexes for expression,docIndexes in expressionDocs.items() if len(docIndexes) > 1 }
        expressionRanges = { expression:ranges for expression,ranges in nextExpressions.items() if expression in expressionDocs }
        if size >= minSize: tallyDuplicates(expressionDocs)
        size += 1

    return result

Based on these comparison results, you will need to analyze the content of each document pair to count the words that are covered by the common expressions (word sequences). Given that an expression covers several words, each expression will account for more than one word in the similarity ratio: words-in-matching-expressions / words-in-document.
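For example (with made-up numbers): if a 200-word document shares "copy and paste" (3 words) twice and "is quite an easy concept" (5 words) once with another document, the common expressions cover 2*3 + 1*5 = 11 of its words, giving a similarity of 100*11/200 = 5.5% for that document.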

[EDIT] I placed the result analysis in its own function and added an HTML output to highlight the expressions in the document texts:

def analyzeComparison(doc1,doc2,expressions):
    words1  = wordList(doc1)
    words2  = wordList(doc2)
    normalizedDoc1 = " ".join(words1)
    normalizedDoc2 = " ".join(words2)
    # process longer expressions first so their words are not
    # counted again by shorter expressions contained in them
    expressions.sort(key=len,reverse=True)
    matches = []
    for expression in expressions:
        count1 = len(re.findall(expression,normalizedDoc1))
        count2 = len(re.findall(expression,normalizedDoc2))
        commonCount = min(count1,count2)
        if commonCount == 0: continue
        # mask matched expressions with a placeholder to avoid double counting
        expressionId = "<#"+str(len(matches))+"#>"
        normalizedDoc1 = re.sub(expression,expressionId,normalizedDoc1,count=commonCount)
        normalizedDoc2 = re.sub(expression,expressionId,normalizedDoc2,count=commonCount)
        matches.append((expression,commonCount))
    commonWords = sum( count*len(expr.split(" ")) for expr,count in matches)
    percent1 = 100*commonWords/len(words1)
    percent2 = 100*commonWords/len(words2)
    # re-inject the masked expressions with highlighting tags (HTML)
    for index,match in enumerate(matches):
        expressionId = "<#"+str(index)+"#>"
        expressionHighlight = "<span style='background-color:yellow'>"+match[0]+"</span>"
        normalizedDoc1 = re.sub(expressionId,expressionHighlight,normalizedDoc1)
        normalizedDoc2 = re.sub(expressionId,expressionHighlight,normalizedDoc2)
    return (percent1,percent2,matches,normalizedDoc1,normalizedDoc2)

For example: if you have the following 3 documents (you would normally read them from files):

doc1 = """
Plagiarism, one of the main scourges of the academic life, is quite an easy concept, but, nonetheless, harmful. In short, to plagiarize means to steal someone else’s idea or part of work and use it as your own. But why exactly it is considered to be so bad and immoral? And it is really considered immoral and a serious offence. In case it is discovered, it may lead to very unpleasant consequences; the higher the position of the offender is, the more unpleasant they are.
copy and paste
There are two major kinds of harm plagiarism causes. First, it is something as simple as stealing and lying – you just steal someone else’s work and trick somebody into believing it was you who had written it, which is as immoral as any other kind of theft is. It means that somebody had actually spent time and effort in order to create something, while you did nothing but ripping it off and submitting it.
copy and paste function
Second, it is a crime you commit against yourself. If you study at an educational institution, there are certain tasks copy and paste you are given in order to ensure that you learn something. When you resort to plagiarism, you undo all these efforts for, instead of actually doing something and understanding it in process, you use someone else’s work and the certain amount of experience that you were supposed to get just misses you.
"""
doc2 = """
Plagiarism has always been a problem in schools. However, with the invention of the internet,copy and paste  it has made plagiarism even more of a challenge. Plagiarism.org, “estimates that nearly 30 percent of all students may be plagiarizing on all their written assignments and that the use of the Internet has made plagiarism much worse.” [1] The act of plagiarism can be defined as, “To steal and pass off (the ideas or words of another) as one’s own, to use (another’s production) without crediting the source, to commit literary theft, to present as new and original as idea or product derived from an existing source”2. Plagiarism has become such a concern for colleges that almost all the sites on this topic are sponsored by schools. The three main topics with plagiarism are the copy and paste function, “paper mills” and the ways that can be used to prevent students from doing this. 
it is quite an easy concept
The first major concern with the internet would be the copy and paste function. Wittenberg copy and paste function lists that “Widespread availability of the internet and increased access to full text databases has made cut and paste plagiarism very easy”.3 While the function is actually very nice to have, people are using it the wrong way. Instead of just using it to copy quotes from websites, than pasting it to their word document and giving it the proper credit, people are passing it off as their own. This is where the problem occurs.
"""

doc3 = """
Plagiarism has always been a problem in schools. However, it is something as simple as stealing and lying
it is a crime you. some other text
"""

You would first call compareDocs() on the list of document contents and, for each document pair (returned by the function), you would use analyzeComparison() to obtain the percentages, counts and highlights:

documents   = [doc1,doc2,doc3]
comparisons = compareDocs( documents )
for documentPair,expressions in comparisons.items():
    docIndex1,docIndex2 = documentPair
    doc1 = documents[docIndex1]
    doc2 = documents[docIndex2]        
    pct1,pct2,matches,doc1,doc2 = analyzeComparison(doc1,doc2,expressions)

    # print result on console ...
    print(int(pct1//1)," % of document #",docIndex1," is same as document #", docIndex2)
    print(int(pct2//1)," % of document #",docIndex2," is same as document #", docIndex1)
    print("Common expressions are:")
    for expression,count in matches:
        print( "    ",expression,"(",count,"times )")
    print("")

    # output comparison result to an HTML file...
    htmlPage = "<html><body><table border='1'>"
    htmlPage += "<tr><th>#" + str(docIndex1) + ": Source " + str(int(pct1//1)) + "% duplicate</th>"
    htmlPage += "<th>#" + str(docIndex2) + ": Target  " + str(int(pct2//1)) + "% duplicate</th></tr>"
    htmlPage += "<tr><td width='50%' valign='top'>" + doc1 + "</td><td valign='top'>" + doc2 + "</td></tr>"
    htmlPage +="</table></body></html>"        
    fileName = str(docIndex1)+"-"+str(docIndex2)+".html"
    with open(fileName,"w") as f: f.write(htmlPage)       

This prints the following information and creates a bunch of HTML files that look similar to your desired result:

3.0  % of document # 1  is same as document # 2
34.0  % of document # 2  is same as document # 1
Common expressions are:
     plagiarism has always been a problem in schools however ( 1 times )

6.0  % of document # 0  is same as document # 1
5.0  % of document # 1  is same as document # 0
Common expressions are:
     is quite an easy concept ( 1 times )
     copy and paste function ( 1 times )
     copy and paste ( 2 times )

5.0  % of document # 0  is same as document # 2
53.0  % of document # 2  is same as document # 0
Common expressions are:
     it is something as simple as stealing and lying ( 1 times )
     it is a crime you ( 1 times )

To summarize, the whole process works as follows:

1) Run a comparison function to identify the expressions (word sequences) that each pair of documents has in common.

  • The compareDocs function does this in one call, given a list of document texts.
  • If you use a different comparison algorithm, it may be designed to only perform a comparison between two documents or, in the case of a classifier, it may merely return a list of words/word frequencies for a single document.
  • Depending on the algorithm's inputs and outputs, you will need to wrap more or less of the logic in your own code to obtain the desired result.
  • What you should be looking for at this stage is a list of common expressions (word sequences) between each pair of documents.
  • If you are working with an algorithm that only extracts word frequencies (e.g. tf-idf), you will face a high-complexity problem of cross-matching word frequencies between document pairs.

    For example, a classifier may return the frequencies "cut"=25 times, "and"=97 times, "paste"=31 times for a given document. This tells you nothing about the expression "cut and paste" actually being present in the document, or how many times it occurs there. The document could be talking about toothpaste and never have those three words in sequence. Comparing documents based only on word frequencies will find a high correlation between essays on the same subject, but that doesn't mean there is plagiarism (see the sketch below).

    Furthermore, even if your classifier manages to return all expressions of two or more words, each document will produce close to w*2^n expressions, where w is the number of words in the document and n is the maximum length of the expressions in words (a maximum that you will have to decide). That will easily reach several million expressions per document, which you then need to match against the millions in the other documents. This may not be an issue if you have Google's resources, but it is for the rest of us.
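To make that word-frequency pitfall concrete, here is a small sketch (the sample texts are made up for illustration): two texts containing exactly the same words in a different order get a perfect frequency-based cosine score even though they share almost no word sequences.

from collections import Counter
import math

def freqCosine(textA, textB):
    # cosine similarity over raw word counts: no sequence information at all
    ca, cb = Counter(textA.lower().split()), Counter(textB.lower().split())
    dot = sum(ca[w]*cb[w] for w in ca)
    normA = math.sqrt(sum(v*v for v in ca.values()))
    normB = math.sqrt(sum(v*v for v in cb.values()))
    return dot/(normA*normB)

docA = "green tea is hot and black coffee is cold"
docB = "black tea is cold and green coffee is hot"  # same words, different order
print(freqCosine(docA, docB))  # ~1.0 although the longest shared sequence is only 2 words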

2) To measure the percentage of similarity between documents, you will need to find the common expressions on both sides and measure how many of each document's words are covered by them.

  • Identifying the positions of the expressions is a simple text search process.
  • You will, however, need to avoid counting any given word more than once, because the denominator of the percentage is the number of words in the document (and you don't want to overestimate or exceed 100%).
  • This can be achieved by processing the longer expressions first and removing them from the text (or masking them) so that their words are not counted again by subsequent (shorter) expressions, as the snippet after this list illustrates.
  • The analyzeComparison() function masks the expressions found in the text by replacing them with a placeholder that is later used to re-inject the text with highlighting tags (HTML).
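A toy illustration of that longest-first masking (the sample string is made up):

import re

normalized = "copy and paste is easy so copy and paste and paste again"

# counting the shorter expression first would overlap the longer one
print(len(re.findall("and paste", normalized)))  # 3

# masking the longer expression first removes the overlap
masked = re.sub("copy and paste", "<#0#>", normalized)
print(len(re.findall("and paste", masked)))      # 1 -- only the standalone occurrence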

3) Use the document comparison analysis in your own program. This depends on how you want to present the information and on whether you need to store the results (up to you). You could, for example, decide on a similarity threshold and only output the document pairs that are suspicious. That threshold could be based on the percentages, the number of common expressions, the maximum or average length of the common expressions, etc.
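As a minimal sketch of such a filter (the threshold value is arbitrary; this reuses the documents and comparisons variables from the example above):

SUSPICION_THRESHOLD = 20  # percent; an arbitrary cutoff, tune it to your needs

for (index1, index2), expressions in comparisons.items():
    pct1, pct2, matches, _, _ = analyzeComparison(documents[index1], documents[index2], expressions)
    if max(pct1, pct2) >= SUSPICION_THRESHOLD:
        print(f"suspicious pair: #{index1} vs #{index2} ({max(pct1, pct2):.0f}% overlap)")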

[EDIT2] How compareDocs works ...

  • The function creates a dictionary of expressions, mapping them to the position of their first word in each document. This is stored in the expressionRanges variable.

    • For example: { "copy and paste": [ (0,57), (1,7), (1,32) ] ... }
    • This means that the 3-word expression "copy and paste" is found in document #0 at position 57 (the position of the word "copy"), and in document #1 at positions 7 and 32.
  • The expression dictionary (expressionRanges) starts out with single-word expressions, which it uses to build 2-word expressions, then 3-word expressions, and so on.

  • Before proceeding to the next expression size, the expression dictionary is cleaned up by removing all expressions that are found in only one document.

    • size 1 ==> { "copy": [ (0,57),(0,72),(1,7),(1,32),(1,92) ] ... }
    • cleanup ...
    • size 2 ==> { "copy and": [ (0,57),(1,7),(1,32),(1,92) ] ... }
    • cleanup ...
    • size 3 ==> { "copy and paste": [ (0,57),(1,7),(1,32) ] ... }
  • This cleanup is achieved by building a separate dictionary (expressionDocs) that maps expressions to a set of the document indexes that contain them. Expressions that end up with only one document in their set are removed from both dictionaries.

  • The expressionDocs dictionary is also used to produce the function's output. Expressions that appear in more than one document are mapped to document pairs (combinations of 2), forming a dictionary containing { (document pair): [list of expressions] }, which is the result of the function.
  • The tallyDuplicates sub-function performs this transposition from {expression: [list of document indexes]} to {(document pair): [list of expressions]} by adding the expression to every combination of 2 of the document indexes, as the standalone snippet below illustrates.
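A standalone illustration of that transposition (with made-up data):

from itertools import combinations

# made-up input: which documents contain each expression
expressionDocs = { "copy and paste": {0, 1, 2}, "an easy concept": {0, 1} }

result = {}
for expression, docIndexes in expressionDocs.items():
    for pair in combinations(sorted(docIndexes), 2):
        result.setdefault(pair, []).append(expression)

print(result)
# {(0, 1): ['copy and paste', 'an easy concept'], (0, 2): ['copy and paste'], (1, 2): ['copy and paste']}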

The successive refinements of expressionRanges considerably reduce the number of word matches to perform. Each pass merely adds one word to each expression, and the dictionary is cleaned up right away before moving on to the next expression size. The expressionRanges dictionary starts out with as many entries as there are distinct words in the documents, but it rapidly shrinks to a much smaller size (unless the documents are practically identical).

One downside of this approach is that documents with a large number of very long matching expressions will cause the dictionary to grow instead of shrink, and the while loop will run much longer. The worst case would be two identical documents. This can be avoided by introducing a maximum expression size to make the loop stop earlier. For example, if you set a maximum size of 25, the function will only report a 25-word common expression and a 5-word common expression instead of a single 30-word expression, should there be one. This may be an acceptable compromise to avoid the very long processing times of nearly identical documents. As far as the percentage of similarity is concerned, the difference will be minimal.
