python - Find duplicate filenames and keep only the newest file, using Python

Question

I have 20,000+ files like the ones below, all in the same directory:

8003825.pdf
8003825.tif
8006826.tif

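How can I find all duplicate filenames while ignoring the file extension?

Clarification: by duplicates I mean files that have the same filename when the file extension is ignored. I don't care whether the files are 100% identical (e.g. hashsize or something like that).

For example: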

"8003825" appears twice

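Then look at the metadata of each duplicate file and keep only the newest one.

Similar to this post:

Keep latest file and delete all other

I think I have to build a list of all the files and check whether a file already exists in it. If so, use os.stat to determine the modification date?

I'm a bit worried about loading all those filenames into memory, and I'm wondering whether there is a more Pythonic way of doing this...

Python 2.6, Windows 7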

score 7 · Accepted Answer

You can do this in O(n) time. Solutions based on sort have O(n*log(n)) complexity.

import os
from collections import namedtuple

directory = '.'  # placeholder: set this to the directory containing the files
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry',['date','file_name'])

for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cashed_file is None:
        newest_files[name] = Entry(this_file_date,file_name)
    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)

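newest_files is a dictionary keyed by filename without the extension; each value is a named tuple holding the file's full name and its modification date. If a newly encountered file already has an entry in the dictionary, its date is compared with the stored date and the entry is replaced if necessary.

In the end you have a dictionary containing the newest files.

You can then use this dictionary for a second pass. Note that a dictionary lookup is O(1), so looking up all n files in the dictionary is O(n) overall.

For example, if you want to keep only the newest file for each name and delete the others, you can do it like this: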

for file_name in os.listdir(directory):
    name,ext = os.path.splitext(file_name)
    cashed_file_name = newest_files.get(name).file_name
    if file_name != cashed_file_name: #it's not the newest with this name
        os.remove(file_name)
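
Because os.remove cannot be undone, it may help to list what would be deleted before running the loop above. The following is only a sketch, assuming Python 3; the function name older_duplicates is hypothetical and not part of the answer's code. It returns the files that are not the newest for their name, without deleting anything.

```python
import os


def older_duplicates(directory):
    """Return sorted names of files that are not the newest for their stem."""
    newest = {}  # stem -> (mtime, file_name) of the newest file seen so far
    older = []   # files that lost the comparison
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        stem = os.path.splitext(file_name)[0]
        mtime = os.path.getmtime(path)
        if stem not in newest:
            newest[stem] = (mtime, file_name)
        elif mtime > newest[stem][0]:
            older.append(newest[stem][1])  # previous holder is now outdated
            newest[stem] = (mtime, file_name)
        else:
            older.append(file_name)  # current file is older (or tied)
    return sorted(older)
```

Once the returned list looks right, each entry can be joined with the directory path and passed to os.remove.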

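As Blckknght suggested in the comments, you can even avoid the second pass and delete the older file as soon as a newer one is encountered, by adding os.remove calls to the else branch of the first loop:

    else:
        if this_file_date > cashed_file.date: #replace with the newer one
            newest_files[name] = Entry(this_file_date,file_name)
            os.remove(cashed_file.file_name) #this line added
        else:
            os.remove(file_name) #also added: the current file is the older one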

score 2 · Accepted Answer

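First, get a list of the filenames and sort it. This puts all duplicates next to each other.

Then strip the file extensions and compare each name with its neighbors; os.path.splitext() and itertools.groupby() may be useful here.

Once the duplicates are grouped, use os.stat() to pick the one you want to keep.

In the end your code might look something like this: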

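import os, itertools

base_directory = '.'  # placeholder: directory to scan

def stem(f):
    # filename without its extension
    return os.path.splitext(f)[0]

# sort by the same key used for grouping so that equal stems are adjacent
files = sorted(os.listdir(base_directory), key=stem)
for k, g in itertools.groupby(files, stem):
    dups = list(g)
    if len(dups) > 1:
        # figure out which file(s) to remove
        pass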

You don't need to worry about memory here; you're looking at something on the order of a couple of megabytes.
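
One way to fill in the "figure out which file(s) to remove" step is to keep the entry with the largest modification time in each group. This is a sketch under assumptions, not the answer's own code: it assumes Python 3, split_newest and stem are hypothetical helper names, and the list is sorted by stem (rather than by full name) so that files with the same stem are guaranteed to be adjacent.

```python
import itertools
import os


def stem(file_name):
    """Filename without its extension."""
    return os.path.splitext(file_name)[0]


def split_newest(directory):
    """Return (keep, remove) lists of file names, one kept file per stem."""
    keep, remove = [], []
    files = sorted(os.listdir(directory), key=stem)
    for _, group in itertools.groupby(files, key=stem):
        dups = list(group)
        # newest = largest st_mtime as reported by os.stat
        newest = max(
            dups,
            key=lambda f: os.stat(os.path.join(directory, f)).st_mtime,
        )
        keep.append(newest)
        remove.extend(f for f in dups if f != newest)
    return keep, remove
```

Nothing is deleted here; the remove list can be inspected first and then passed to os.remove one path at a time.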

score 0 · Accepted Answer

For a filename counter, you can use a defaultdict to store how many times each filename appears:

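import os
from collections import defaultdict

# file_names is assumed to be an iterable of file names or paths,
# e.g. the result of os.listdir() on the target directory
counter = defaultdict(int)
for file_name in file_names:
    name = os.path.splitext(os.path.basename(file_name))[0]
    counter[name] += 1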
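
Once the counts are collected, the duplicated names are the keys whose count is greater than one. A small sketch, assuming Python 3; duplicated_stems is a hypothetical name, and the sample names are the ones from the question:

```python
import os
from collections import defaultdict


def duplicated_stems(file_names):
    """Map each extension-less name that occurs more than once to its count."""
    counter = defaultdict(int)
    for file_name in file_names:
        stem = os.path.splitext(os.path.basename(file_name))[0]
        counter[stem] += 1
    return dict((stem, n) for stem, n in counter.items() if n > 1)


print(duplicated_stems(["8003825.pdf", "8003825.tif", "8006826.tif"]))
# prints {'8003825': 2}
```

This only identifies the duplicated names; choosing which file to keep still needs the modification times.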
