python - Python - Files with the same name but different content

Question

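I have two folders, dir1 and dir2. I need to find the files in both folders (or their subfolders) that have the same name but different content.

Something like: so.1.0/p/q/search.c and so.1.1/p/q/search.c differ.

Any ideas?

This is how I collect the files I need: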

import os, sys, fnmatch, filecmp

folder1 = sys.argv[1]
folder2 = sys.argv[2]

filelist1 = []

filelist2 = []

for root, dirs, files in os.walk(folder1):
    for filename in fnmatch.filter(files, '*.c'):
        filelist1.append(os.path.join(root, filename))

for root, dirs, files in os.walk(folder1):
    for filename in fnmatch.filter(files, '*.h'):
        filelist1.append(os.path.join(root, filename))

for root, dirs, files in os.walk(folder2):
    for filename in fnmatch.filter(files, '*.c'):
        filelist2.append(os.path.join(root, filename))

for root, dirs, files in os.walk(folder2):
    for filename in fnmatch.filter(files, '*.h'):
        filelist2.append(os.path.join(root, filename))

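Now I want to compare the two file lists, pick out the entries with the same filename, and check whether their contents differ. What do you think?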

score 2 · Accepted Answer

Use os.walk() to generate the list of files in either directory (with paths relative to their root):

import os

def relative_files(path):
    """Generate filenames with pathnames relative to the initial path."""
    for root, dirnames, files in os.walk(path):
        relroot = os.path.relpath(root, path)
        for filename in files:
            yield os.path.join(relroot, filename)

Create a set of the paths from one root:

root_one = 'so.1.0'  # use an absolute path here
root_two = 'so.1.1'  # use an absolute path here
files_one = set(relative_files(root_one))

Then use a set intersection to find all pathnames that also exist in the other root:

from itertools import izip_longest

def different_files(root_one, root_two):
    """Yield files that differ between the two roots

    Generate pathnames relative to root_one and root_two that are present in both
    but have different contents.

    """
    files_one = set(relative_files(root_one))
    for same in files_one.intersection(relative_files(root_two)):
        # same is a relative path, so same file in different roots
        with open(os.path.join(root_one, same)) as f1, open(os.path.join(root_two, same)) as f2:
            if any(line1 != line2 for line1, line2 in izip_longest(f1, f2)):
                # lines don't match, so files don't match! 
                yield same

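itertools.izip_longest() loops over the two files efficiently, pairing up their lines; if one file is longer than the other, its remaining lines are paired with None, so the files are still detected as different. (On Python 3 this function is named itertools.zip_longest().)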

Demo:

$ mkdir -p /tmp/so.1.0/p/q
$ mkdir -p /tmp/so.1.1/p/q
$ echo 'file one' > /tmp/so.1.0/p/q/search.c
$ echo 'file two' > /tmp/so.1.1/p/q/search.c
$ echo 'file three' > /tmp/so.1.1/p/q/ignored.c
$ echo 'matching' > /tmp/so.1.0/p/q/same.c
$ echo 'matching' > /tmp/so.1.1/p/q/same.c

>>> for different in different_files('/tmp/so.1.0', '/tmp/so.1.1'):
...     print different
... 
p/q/search.c
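
The code above targets Python 2 (izip_longest, the print statement). A minimal sketch of the same approach for Python 3 follows. Two details are assumptions of this sketch rather than part of the answer: the files are opened in binary mode so that non-text files do not raise decoding errors, and the common paths are sorted so the output order is stable.

```python
import os
from itertools import zip_longest


def relative_files(path):
    """Generate filenames with pathnames relative to the initial path."""
    for root, dirnames, files in os.walk(path):
        relroot = os.path.relpath(root, path)
        for filename in files:
            yield os.path.join(relroot, filename)


def different_files(root_one, root_two):
    """Yield relative paths present in both roots whose contents differ."""
    files_one = set(relative_files(root_one))
    # sorted() is an addition in this sketch, for a deterministic order
    for same in sorted(files_one.intersection(relative_files(root_two))):
        path_one = os.path.join(root_one, same)
        path_two = os.path.join(root_two, same)
        # Binary mode (an assumption here) compares raw bytes and avoids
        # text-decoding errors on files that are not valid text.
        with open(path_one, 'rb') as f1, open(path_two, 'rb') as f2:
            # zip_longest pads the shorter file with None, so a length
            # mismatch also counts as a difference.
            if any(line1 != line2 for line1, line2 in zip_longest(f1, f2)):
                yield same
```

With the demo directories above, `list(different_files('/tmp/so.1.0', '/tmp/so.1.1'))` should contain only the search.c path.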

score 1 · Answer

As @Martijn answered, you can use os.walk() for the traversal:

import os
for root, dirs, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))  # full path of each file found

To check whether two files are the same, I would recommend filecmp:

>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst') 
True
>>> filecmp.cmp('undoc.rst', 'index.rst') 
False
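
A minimal sketch of how the traversal and filecmp.cmp() might be combined for the two-directory case in the question. The function name changed_files is hypothetical. shallow=False is passed so that the contents are actually compared; with the default shallow=True, files whose os.stat() signatures match are assumed equal without being read.

```python
import filecmp
import os


def changed_files(root_one, root_two):
    """Return relative paths found in both trees whose contents differ."""
    changed = []
    for root, dirs, files in os.walk(root_one):
        for name in files:
            first = os.path.join(root, name)
            # Path of this file relative to the top of the first tree
            rel = os.path.relpath(first, root_one)
            second = os.path.join(root_two, rel)
            # Skip files that have no counterpart in the second tree
            if not os.path.isfile(second):
                continue
            # shallow=False compares the contents instead of trusting
            # matching os.stat() signatures
            if not filecmp.cmp(first, second, shallow=False):
                changed.append(rel)
    return sorted(changed)
```

Files that exist in only one of the two trees are ignored, which matches what the question asks for.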

To see how the contents differ, take a look at difflib.
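
A minimal sketch of what that could look like, assuming both files are text. The helper name show_diff is hypothetical; difflib.unified_diff() is the standard-library function doing the work.

```python
import difflib


def show_diff(path_one, path_two):
    """Return a unified diff of two text files as a single string."""
    with open(path_one) as f1, open(path_two) as f2:
        lines_one = f1.readlines()
        lines_two = f2.readlines()
    # unified_diff yields the diff line by line; an empty result means
    # the two files have identical lines
    diff = difflib.unified_diff(
        lines_one, lines_two, fromfile=path_one, tofile=path_two)
    return ''.join(diff)
```

Calling `print(show_diff('so.1.0/p/q/search.c', 'so.1.1/p/q/search.c'))` would print the changed lines prefixed with `-` and `+`; an empty string means the files match.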
