
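I have two text files containing a large number of pipe-delimited records. I need to compare the two files to check that their data match. File1 and File2 should contain the same records, but even when they do, the records are not necessarily on the same lines. For example, record1 might be on row 10 of file1, while the same record1 could appear on any row of file2. So I need to take row 1 of file1, go through all the records in file2, and see where a match occurs, and then do the same for every row of file1. I am more concerned with matching file1's rows against file2 than with matching file2 against file1, because file2 may contain multiple redundant records.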

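I tried researching this approach with a Python script. I came across the code snippet below; however, it compares the two files line by line and does not account for the fact that the lines may not be in the same order.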

Can someone suggest how to achieve this?

Code link: https://gist.github.com/insachin/c960cfeb1fef6454a8132a07cb9ebd5a

# Ask the user to enter the names of files to compare
fname1 = input("Enter the first filename: ")
fname2 = input("Enter the second filename: ")

# Open file for reading in text mode (default mode)
f1 = open(fname1)
f2 = open(fname2)

# Print confirmation
print("-----------------------------------")
print("Comparing files ", " > " + fname1, " < " + fname2, sep='\n')
print("-----------------------------------")

# Read the first line from the files
f1_line = f1.readline()
f2_line = f2.readline()

# Initialize counter for line number
line_no = 1

# Loop if either file1 or file2 has not reached EOF
while f1_line != '' or f2_line != '':

    # Strip trailing whitespace (including the newline)
    f1_line = f1_line.rstrip()
    f2_line = f2_line.rstrip()

    # Compare the lines from both files
    if f1_line != f2_line:

        # If a line does not exist on file2 then mark the output with + sign
        if f2_line == '' and f1_line != '':
            print(">+", "Line-%d" % line_no, f1_line)
        # otherwise output the line on file1 and mark it with > sign
        elif f1_line != '':
            print(">", "Line-%d" % line_no, f1_line)

        # If a line does not exist on file1 then mark the output with + sign
        if f1_line == '' and f2_line != '':
            print("<+", "Line-%d" % line_no, f2_line)
        # otherwise output the line on file2 and mark it with < sign
        elif f2_line != '':
            print("<", "Line-%d" % line_no, f2_line)

        # Print a blank line
        print()

    # Read the next line from the file
    f1_line = f1.readline()
    f2_line = f2.readline()

    # Increment line counter
    line_no += 1

f1.close()
f2.close()
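
For comparison, here is a minimal sketch of the order-independent check described in the question. It is not what the gist does; it is one possible approach, assuming each line is one complete record and that both files fit in memory. Each file is loaded into a `collections.Counter`, so row position is ignored and duplicate records are still counted. The names `load_records` and `compare_records` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def load_records(path):
    """Return a Counter mapping each non-empty line (record) to its count."""
    with open(path, encoding="utf-8") as handle:
        # Drop only the line ending so the pipe-delimited fields stay intact
        return Counter(line.rstrip("\r\n") for line in handle if line.strip())


def compare_records(path1, path2):
    """Compare two files as multisets of records, ignoring line order."""
    records1 = load_records(path1)
    records2 = load_records(path2)
    shared = records1.keys() & records2.keys()
    return {
        # Records that never appear in the other file
        "only_in_file1": sorted(records1.keys() - records2.keys()),
        "only_in_file2": sorted(records2.keys() - records1.keys()),
        # Records present in both files but a different number of times
        "count_mismatch": {
            rec: (records1[rec], records2[rec])
            for rec in shared
            if records1[rec] != records2[rec]
        },
    }
```

Calling `compare_records("file1.txt", "file2.txt")` (placeholder file names) would return the records missing from either side, plus any record whose count differs, which is where redundant rows in file2 would show up. Because each lookup is a hash lookup, this avoids rescanning file2 for every row of file1.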