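Here is one approach that I believe meets your requirements. It also lets you specify whether only the same difference is allowed on every line (which would treat your second file example as a non-match):
Update: this now accounts for the lines in the master file and the other files not necessarily being in the same order.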
from itertools import zip_longest


def get_min_diff(master_lines, to_check):
    """Find the master line that differs least from to_check.

    Returns (min_diff, diff, match_line): the smallest number of differing
    word positions, the words of to_check at those positions, and the
    index of the best-matching master line (all empty/None if
    master_lines is empty).
    """
    min_diff = None
    best_diff = []
    match_line = None
    for ln, ml in enumerate(master_lines):
        # keep the word from the checked line wherever it differs from ml;
        # zip_longest pads the shorter line with None
        diff = [w for m, w in zip_longest(ml, to_check) if w != m]
        n_diffs = len(diff)
        if min_diff is None or n_diffs < min_diff:
            min_diff = n_diffs
            best_diff = diff  # remember the diff of the best match, not the last one
            match_line = ln
    return min_diff, best_diff, match_line


def check_files(master, files):
    # get lines to compare against
    master_lines = []
    with open(master) as mstr:
        for line in mstr:
            master_lines.append(line.strip().split())
    matches = []
    for f in files:
        # per-file copy so matched master lines can be removed
        temp_master = list(master_lines)
        diff_sizes = set()
        diff_types = set()
        with open(f) as checkfile:
            for line in checkfile:
                to_check = line.strip().split()
                # find the remaining master line that differs least
                # from the current line
                min_diff, diff, match_index = get_min_diff(temp_master, to_check)
                if min_diff is None:
                    # no master lines left to match against, so this
                    # extra line can never be an acceptable match
                    diff_sizes.add(float("inf"))
                    continue
                if min_diff <= 1:  # acceptable number of differences
                    # remove corresponding line from master search space
                    # so we don't match the same master lines to multiple
                    # lines in a given test file
                    del temp_master[match_index]
                # if it only differs in one place, keep track of what
                # word was different for optional check later
                if min_diff == 1:
                    diff_types.add(diff[0])
                diff_sizes.add(min_diff)
        # if you want any file where the max number of differences
        # per line was 1 (the guard skips empty files)
        if diff_sizes and max(diff_sizes) == 1:
            # consider a match if there is only one difference per line
            matches.append(f)
        # if you instead want each file to only
        # be different by the same word on each line
        # if len(diff_types) == 1:
        #     matches.append(f)
    return matches
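To make the comparison step easier to follow, here is a minimal sketch of the position-by-position word comparison on a single pair of lines. `word_diff` is a hypothetical helper written only for illustration; it keeps the words from the checked line at each differing position.

```python
from itertools import zip_longest


def word_diff(master_line, checked_line):
    """Return the words of checked_line at positions where the lines differ."""
    master_words = master_line.split()
    checked_words = checked_line.split()
    # zip_longest pads the shorter line with None, so extra or missing
    # words are counted as differences too
    return [c for m, c in zip_longest(master_words, checked_words) if m != c]


# one word swapped: a single difference
print(word_diff("the image is of m type", "the image is of y type"))
# → ['y']

# a longer line: the swapped word plus every extra word counts
print(word_diff("the user is admin", "the user is bongo the clown"))
# → ['bongo', 'the', 'clown']
```

Because the comparison is positional, a word inserted in the middle of a line shifts everything after it and is counted as several differences, which is why the longer lines in a file like test_nomatch.txt exceed the one-difference limit.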
Based on the examples you provided, I made some test files to check against:
::::::::::::::
test1.txt
::::::::::::::
file contains y
the image is of y type
the user is admin
the address is y
::::::::::::::
test2.txt
::::::::::::::
file contains x
the image is of x type
the user is admin
the address is x
::::::::::::::
test3.txt
::::::::::::::
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
::::::::::::::
testmaster.txt
::::::::::::::
file contains m
the image is of m type
the user is admin
the address is m
::::::::::::::
test_nomatch.txt
::::::::::::::
file contains y and some other stuff
the image is of y type unlike the other
the user is bongo the clown
the address is redacted
::::::::::::::
test_scrambled.txt
::::::::::::::
the image is of y type
file contains y
the address is y
the user is admin
When run, the code above returns the correct files:
In: check_files('testmaster.txt', ['test1.txt', 'test2.txt', 'test3.txt', 'test_nomatch.txt', 'test_scrambled.txt'])
Out: ['test1.txt', 'test2.txt', 'test3.txt', 'test_scrambled.txt']
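The stricter option mentioned in the comments of `check_files` (every line must differ by the same word) can be tried without touching the file system. The following is only a sketch under assumptions: `differing_words` is a hypothetical in-memory helper that takes lists of strings instead of file names, and it collects the replacement words found in the checked lines.

```python
from itertools import zip_longest


def differing_words(master_lines, checked_lines):
    """Greedily pair each checked line with its closest unused master line.

    Returns the set of words that replaced a master word, or None if any
    line differs from its best match in more than one position.
    """
    remaining = [ml.split() for ml in master_lines]
    found = set()
    for line in checked_lines:
        words = line.split()
        best = None  # (index into remaining, diff) of the closest line
        for idx, ml in enumerate(remaining):
            diff = [c for m, c in zip_longest(ml, words) if m != c]
            if best is None or len(diff) < len(best[1]):
                best = (idx, diff)
        if best is None or len(best[1]) > 1:
            return None
        del remaining[best[0]]  # each master line is matched at most once
        found.update(best[1])
    return found


master = ["file contains m", "the image is of m type",
          "the user is admin", "the address is m"]
test1 = ["file contains y", "the image is of y type",
         "the user is admin", "the address is y"]
test3 = ["file contains xyz", "the image is of abc type",
         "the user is admin", "the address is pqrs"]

print(differing_words(master, test1))       # → {'y'}
print(len(differing_words(master, test3)))  # → 3
```

With this helper, the strict rule would accept a file only when the returned set has exactly one element, so the test1 lines would pass and the test3 lines would not.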