I have a master file that contains some text, for example:

file contains x
the image is of x type
the user is admin
the address is x

Then there are 200 other text files that contain text like this:

file contains xyz
the image is of abc type
the user is admin
the address is pqrs

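I need to match these files against the master. If a file contains exactly the same text as the master file, the result should be true. The x differs from file to file, i.e. the "x" in the master file can be anything in the other files and the result should still be true. What I have come up with is: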

arr=master.split('\n')
for file in files:
    a=[]
    file1=file.split('\n')
    i=0
    for line in arr:
        line_list=line.split()
        indx=line_list.index('x')
        line_list1=line_list[:indx]+line_list[indx+1:]
        st1=' '.join(line_list1)
        file1_list=file1[i].split()
        file1_list1=file1_list[:indx]+file1_list[indx+1:]
        st2=' '.join(file1_list1)
        if st1!=st2:
            a.append(line)
        i+=1

This is very inefficient. Is there a way to map the files against the master file and produce the differences found in the other files?

3 Answers


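Here is one approach that I think meets your requirements. It also lets you specify whether only the same difference is allowed on every line (which would treat your second file example as a non-match):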

Update: this now accounts for the lines in the master file and the other files not necessarily being in the same order.

from itertools import zip_longest

def get_min_diff(master_lines, to_check):
    """Find the master line closest to to_check.

    Returns (number of differing words, the differing words, master index).
    """
    min_diff = None
    best_diff = []
    match_line = None
    for ln, ml in enumerate(master_lines):
        # words of the checked line that differ from this master line
        diff = [w for m, w in zip_longest(ml, to_check) if w != m]
        n_diffs = len(diff)
        if min_diff is None or n_diffs < min_diff:
            min_diff = n_diffs
            best_diff = diff  # keep the diff that belongs to the best match
            match_line = ln

    return min_diff, best_diff, match_line

def check_files(master, files):
    # get lines to compare against
    master_lines = []
    with open(master) as mstr:
        for line in mstr:
            master_lines.append(line.strip().split())
    matches = []
    for f in files:
        temp_master = list(master_lines)
        diff_sizes = set()
        diff_types = set()
        with open(f) as checkfile:
            for line in checkfile:
                to_check = line.strip().split()
                # find the closest remaining master line and the places
                # where the current line differs from it
                min_diff, diff, match_index = get_min_diff(temp_master, to_check)
                if min_diff is None:
                    # no master lines left: the file has extra lines
                    min_diff = float('inf')
                elif min_diff <= 1:  # acceptable number of differences
                    # remove corresponding line from master search space
                    # so we don't match the same master lines to multiple
                    # lines in a given test file
                    del temp_master[match_index]
                    # if it only differs in one place, keep track of what
                    # word was different for optional check later
                    if min_diff == 1:
                        diff_types.add(diff[0])
                diff_sizes.add(min_diff)
        # if you want any file where the max number of differences
        # per line was at most 1 (0 means identical to the master)
        if diff_sizes and max(diff_sizes) <= 1:
            # consider a match if there is at most one difference per line
            matches.append(f)
        # if you instead want each file to only
        # be different by the same word on each line
        #if len(diff_types) == 1:
            #matches.append(f)
    return matches
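
To make the per-line comparison concrete, here is a minimal, self-contained sketch of the word-by-word diff on two sample lines. The helper name `word_diff` is made up for this illustration and is not part of the code above.

```python
from itertools import zip_longest

def word_diff(master_words, checked_words):
    """Return the (master, checked) word pairs that differ, position by position."""
    # zip_longest pads the shorter line with None, so extra words count as differences
    return [(m, w) for m, w in zip_longest(master_words, checked_words) if m != w]

master = "the image is of m type".split()

# one word differs, so this line would be accepted
print(word_diff(master, "the image is of abc type".split()))

# one word differs and three extra words follow, so this line would be rejected
print(word_diff(master, "the image is of y type unlike the other".split()))
```

Counting the length of the returned list gives the number of differences used as the acceptance threshold.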

Based on the examples you provided, I made some test files to check against:

::::::::::::::
test1.txt
::::::::::::::
file contains y
the image is of y type
the user is admin
the address is y
::::::::::::::
test2.txt
::::::::::::::
file contains x
the image is of x type
the user is admin
the address is x
::::::::::::::
test3.txt
::::::::::::::
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
::::::::::::::
testmaster.txt
::::::::::::::
file contains m
the image is of m type
the user is admin
the address is m
::::::::::::::
test_nomatch.txt
::::::::::::::
file contains y and some other stuff
the image is of y type unlike the other
the user is bongo the clown
the address is redacted
::::::::::::::
test_scrambled.txt
::::::::::::::
the image is of y type
file contains y
the address is y
the user is admin

When run, the code above returns the correct files:

In: check_files('testmaster.txt', ['test1.txt', 'test2.txt', 'test3.txt', 'test_nomatch.txt', 'test_scrambled.txt'])
Out: ['test1.txt', 'test2.txt', 'test3.txt', 'test_scrambled.txt']
answered 2017-05-22T17:53:36.620

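I know this isn't a real solution, but you can check whether the files are in the same format by doing something like:

if "the image is of" in var:
    ...  # to do: handle a line in the expected format

By checking the remaining lines in the same way for

"file contains"

"the user is"

"the address is"

you will be able to verify, to some extent, whether the file you are checking is valid.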

You can look at this link to read more about this "substring idea":

Does Python have a string 'contains' substring method?
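
A minimal sketch of that substring check, assuming the four fixed phrases from the question's master file and that every one of them must appear somewhere in the file. The names `EXPECTED_PHRASES` and `looks_valid` are hypothetical.

```python
# Fixed parts of the master file; an assumption taken from the question's example
EXPECTED_PHRASES = ("file contains", "the image is of", "the user is", "the address is")

def looks_valid(text):
    """Rough format check: True if every expected phrase occurs in the text."""
    return all(phrase in text for phrase in EXPECTED_PHRASES)

sample = (
    "file contains xyz\n"
    "the image is of abc type\n"
    "the user is admin\n"
    "the address is pqrs\n"
)
print(looks_valid(sample))
print(looks_valid("the user is bongo the clown\n"))
```

This only confirms that the phrases are present. It does not notice extra words or a changed fixed word such as "admin", so it is a coarse filter rather than a full comparison.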

answered 2017-05-22T17:06:43.987

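Is that "catch-all" unique? For example, if the key really is x, are you guaranteed that x does not appear anywhere else in the line? Or could the master file contain something like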

excluding x records and x axis values

If you do have a unique key...

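For each line of the master file, split on your key x. That gives you two pieces of the line, a front and a back. Then just check that the corresponding input line starts with the front piece (startswith) and ends with the back piece (endswith). Something like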

for line in arr:
    # partition also copes with master lines that do not contain the key
    front, _, back = line.partition(x_key)
    # grab next line in input file
    ...
    if (line_list1.startswith(front) and
            line_list1.endswith(back)):
        pass  # process matching line
    else:
        pass  # process non-matching line

See the documentation
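
The front/back check can be sketched as a pair of runnable functions. This assumes the key occurs at most once per master line and that both files list their lines in the same order; `line_matches` and `file_matches` are hypothetical names.

```python
def line_matches(master_line, candidate, x_key="x"):
    """True if candidate equals master_line with x_key replaced by some text."""
    front, sep, back = master_line.partition(x_key)
    if not sep:
        # no key on this master line, so it has to match exactly
        return candidate == master_line
    # the length check stops front and back from overlapping in a short candidate
    return (len(candidate) >= len(front) + len(back)
            and candidate.startswith(front)
            and candidate.endswith(back))

def file_matches(master_text, candidate_text, x_key="x"):
    """True if both texts have the same number of lines and every line pair matches."""
    master_lines = master_text.splitlines()
    candidate_lines = candidate_text.splitlines()
    if len(master_lines) != len(candidate_lines):
        return False
    return all(line_matches(m, c, x_key)
               for m, c in zip(master_lines, candidate_lines))

master = "file contains x\nthe image is of x type\nthe user is admin\nthe address is x"
good = "file contains xyz\nthe image is of abc type\nthe user is admin\nthe address is pqrs"
bad = "file contains y\nthe image is of y type\nthe user is bongo the clown\nthe address is y"
print(file_matches(master, good), file_matches(master, bad))
```

Note that the text standing in for the key may contain spaces here, so "file contains y and some other stuff" would also match "file contains x".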


Update based on the OP's comment

As long as x is unique within the line, you can easily adapt this. As you mentioned in your comment, you want something like

if len(line) == len(line_list1):
    if all(line[i] == line_list1[i] for i in range(len(line))):
        pass  # Found matching lines
    else:
        pass  # Advance to the next line
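
One way to adapt the comparison so that the key position is skipped is to compare word by word and let the key word in the master line match any single word. This is only a sketch under that assumption, and `words_match` is a hypothetical name.

```python
def words_match(master_line, candidate, x_key="x"):
    """Compare word by word; the x_key word in the master matches any single word."""
    master_words = master_line.split()
    candidate_words = candidate.split()
    if len(master_words) != len(candidate_words):
        return False
    return all(m == x_key or m == c
               for m, c in zip(master_words, candidate_words))

print(words_match("the image is of x type", "the image is of abc type"))
print(words_match("file contains x", "file contains y and some other stuff"))
```

Unlike the startswith/endswith check, this rejects lines where the key is replaced by several words, because the word counts no longer agree.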
answered 2017-05-22T17:32:50.870