python - Python: delete old files that match an unknown pattern (tricky)

Question

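My server is filling up and I need to automate the removal of files. Files are usually added to my server daily, but sometimes there are pauses and they arrive only every two weeks or every month. They may stop coming in for months and then start again, and this is unpredictable.

My script needs to delete files that are older than 30 days, but always keep the 5 newest files of any file pattern it finds. This is the tricky part.

The only predictable thing about the files is that their names always contain a yyyymmddhhmmss timestamp somewhere, plus some pattern that repeats; the other parts of the file name are not always predictable. If a file doesn't have a timestamp, I don't want to delete it.

An example directory looks something like this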

20121118011335_team1-pathway_Truck_Report_Data_10342532.zip
20121119011335_team1-pathway_Truck_Report_Data_102345234.zip
20121120011335_team1-pathway_Truck_Report_Data_10642224.zip
20121121011335_team1-pathway_Truck_Report_Data_133464.zip
20121122011335_team1-pathway_Truck_Report_Data_126434344.zip
20121123011335_team1-pathway_Truck_Report_Data_12444656.zip
20121124011335_team1-pathway_Truck_Report_Data_1624444.zip
20121125011335_team1-pathway_Truck_Report_Data_3464433.zip
randomefilewithnodate.zip
20121119011335_team2-Paper_Size_Report_336655.zip
20121120011335_team2-Paper_Size_Report_336677.zip
20121121011335_team2-Paper_Size_Report_338877.zip
20121122011335_team2-Paper_Size_Report_226688.zip
20121123011335_team2-Paper_Size_Report_776688.zip
20121124011335_team2-Paper_Size_Report_223355.zip
20121125011335_team2-Paper_Size_Report_111111.zip

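In this case my script should only delete the oldest 3 files of the first pattern
20121118011335_team1-pathway_Truck_Report_Data_10342532.zip
20121119011335_team1-pathway_Truck_Report_Data_102345234.zip
20121120011335_team1-pathway_Truck_Report_Data_10642224.zip

and the oldest 2 files of the second pattern
20121119011335_team2-Paper_Size_Report_336655.zip
20121120011335_team2-Paper_Size_Report_336677.zip

That way it keeps the 5 newest files and doesn't touch the file with no date.

My problem is that I have no way of knowing what exactly will follow yyyymmddhhmmss_

So far I have come up with a regex that matches the timestamp if it exists, but I can't think of how to make my script smart enough to detect the rest of the file's pattern and keep the 5 newest files of each pattern.

Any ideas are welcome! The script below is not perfect; I can fix the little bugs myself.

Mainly, I really need help with the "keep the 5 newest files" part.

The bonus question is the epoch time part.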

import os
import re
import time

def myCleansingMethod(self, client):
    # Get rid of things older than 30 days
    # 30 days has this many seconds: 30 * 24 * 60 * 60
    numberOfSeconds = 2592000
    # establish what the epoch time of the oldest file I want to keep is
    oldestFileThatIWantToKeep = time.time() - numberOfSeconds
    # establish my working directory
    workingDirectory = "/home/files/%s" % (client)
    try:
        files = os.listdir(workingDirectory)
    except OSError:
        print("Could not find directory")
        return

    files.sort()
    for file in files:
        # define Full File Name (path + file)
        fullFileName = "%s/%s" % (workingDirectory, file)
        # make sure the file contains yyyymmddhhmmss
        match = re.search(r'[0-9]{4}(1[0-2]|0[1-9])(3[01]|[12][0-9]|0[1-9])([01]\d|2[0123])([0-5]\d){2}', file)
        if match:
            # get what was matched in the RegEx
            fileTime = match.group()
            # convert fileTime to epoch time
            fileTimeToEpoc = (fileTime + NOT SURE HOW TO DO THIS PART YET)

            if fileTimeToEpoc < oldestFileThatIWantToKeep AND (CODE THAT MAKES SURE THERE ARE AT LEAST 5 FILES OF THE SAME PATTERN PRESENT):
                print("Delete file: %s\t%s" % (fileTimeToEpoc, fullFileName))
                command = "rm -Rf %s" % fullFileName
                print(command)
                os.system(command)

score 3 · Accepted Answer

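This is a nice task, and I made heavy use of functional patterns, mainly from itertools. I like working with iterators because they scale, even for huge lists, and the functional ideas involved keep the code readable and maintainable.

First, import what we need from itertools and datetime: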

from itertools import groupby, chain
from datetime import datetime

Get your example filenames as a list:

filenames = """20121118011335_team1-pathway_Truck_Report_Data_10342532.zip
20121119011335_team1-pathway_Truck_Report_Data_102345234.zip
20121120011335_team1-pathway_Truck_Report_Data_10642224.zip
20121121011335_team1-pathway_Truck_Report_Data_133464.zip
20121122011335_team1-pathway_Truck_Report_Data_126434344.zip
20121123011335_team1-pathway_Truck_Report_Data_12444656.zip
20121124011335_team1-pathway_Truck_Report_Data_1624444.zip
20121125011335_team1-pathway_Truck_Report_Data_3464433.zip
randomefilewithnodate.zip
20121119011335_team2-Paper_Size_Report_336655.zip
20121120011335_team2-Paper_Size_Report_336677.zip
20121121011335_team2-Paper_Size_Report_338877.zip
20121122011335_team2-Paper_Size_Report_226688.zip
20121123011335_team2-Paper_Size_Report_776688.zip
20121124011335_team2-Paper_Size_Report_223355.zip
20121125011335_team2-Paper_Size_Report_111111.zip""".split("\n")

Some helper functions. The names should be self-explanatory.

def extract_date(s):
    return datetime.strptime(s.split("_")[0], "%Y%m%d%H%M%S")

def starts_with_date(s):
    try:
        extract_date(s)
        return True
    except Exception:
        return False

You may need to adapt the next function if it does not cover all cases; for your sample data, it does.

def get_name_root(s):
    return "".join(s.split(".")[0].split("_")[1:-1])

def find_files_to_delete_for_group(group):
    sorted_group = sorted(group, key=extract_date)
    return sorted_group[:-5]
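As a quick check of what these helpers return, here is a small sketch that runs two of them on names from the question. The variable names `truck` and `paper` are only for illustration.

```python
from datetime import datetime

def extract_date(s):
    return datetime.strptime(s.split("_")[0], "%Y%m%d%H%M%S")

def get_name_root(s):
    return "".join(s.split(".")[0].split("_")[1:-1])

truck = "20121118011335_team1-pathway_Truck_Report_Data_10342532.zip"
paper = "20121119011335_team2-Paper_Size_Report_336655.zip"

print(extract_date(truck))    # 2012-11-18 01:13:35
print(get_name_root(truck))   # team1-pathwayTruckReportData
print(get_name_root(paper))   # team2-PaperSizeReport
```

The root is everything between the timestamp and the last underscore-separated part, so this only works when the timestamp comes first and exactly one trailing part is random.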

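Now the whole routine can be done with some iteration. First, I filter the list of filenames: everything that does not start with a date (in your format) is filtered out. Then the rest is grouped by its "name root" (I couldn't think of a better name).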

# groupby only merges adjacent items, so sort by the same key first
fn_groups = groupby(
                sorted(
                    filter(
                        starts_with_date,
                        filenames),
                    key=get_name_root),
                get_name_root
            )
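One detail worth keeping in mind: itertools.groupby only merges adjacent items with equal keys, so the order of its input matters. The sketch below uses made-up names (not from the question) to show what happens when files of the same root are interleaved, and how sorting by the same key first avoids it.

```python
from itertools import groupby

def get_name_root(s):
    return "".join(s.split(".")[0].split("_")[1:-1])

# hypothetical, interleaved listing (not from the question)
names = [
    "20121118011335_a_x_1.zip",
    "20121118011335_b_y_2.zip",
    "20121119011335_a_x_3.zip",
]

unsorted_keys = [k for k, _ in groupby(names, get_name_root)]
sorted_keys = [k for k, _ in groupby(sorted(names, key=get_name_root), get_name_root)]

print(unsorted_keys)  # ['ax', 'by', 'ax']
print(sorted_keys)    # ['ax', 'by']
```

In the unsorted case the root 'ax' shows up as two separate groups, and each of them would be allowed to keep its own five newest files.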

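Now, for each group, I apply the filter function (see above) to find all filenames that are not among the five with the newest dates. The results for the individual groups are then chained, i.e. a single iterator is created from the multiple lists: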

fns_to_delete = chain(*[find_files_to_delete_for_group(g) for k, g in fn_groups])

Finally, to make the result easy to check, I convert the iterator to a list and print it:

print(list(fns_to_delete))

The output of this script is:

['20121118011335_team1-pathway_Truck_Report_Data_10342532.zip', '20121119011335_team1-pathway_Truck_Report_Data_102345234.zip', '20121120011335_team1-pathway_Truck_Report_Data_10642224.zip', '20121119011335_team2-Paper_Size_Report_336655.zip', '20121120011335_team2-Paper_Size_Report_336677.zip']

If anything is unclear, just ask.

Here is the whole script, for easy copy and paste:

from itertools import groupby, chain
from datetime import datetime

filenames = """20121118011335_team1-pathway_Truck_Report_Data_10342532.zip
20121119011335_team1-pathway_Truck_Report_Data_102345234.zip
20121120011335_team1-pathway_Truck_Report_Data_10642224.zip
20121121011335_team1-pathway_Truck_Report_Data_133464.zip
20121122011335_team1-pathway_Truck_Report_Data_126434344.zip
20121123011335_team1-pathway_Truck_Report_Data_12444656.zip
20121124011335_team1-pathway_Truck_Report_Data_1624444.zip
20121125011335_team1-pathway_Truck_Report_Data_3464433.zip
randomefilewithnodate.zip
20121119011335_team2-Paper_Size_Report_336655.zip
20121120011335_team2-Paper_Size_Report_336677.zip
20121121011335_team2-Paper_Size_Report_338877.zip
20121122011335_team2-Paper_Size_Report_226688.zip
20121123011335_team2-Paper_Size_Report_776688.zip
20121124011335_team2-Paper_Size_Report_223355.zip
20121125011335_team2-Paper_Size_Report_111111.zip""".split("\n")

def extract_date(s):
    return datetime.strptime(s.split("_")[0], "%Y%m%d%H%M%S")

def starts_with_date(s):
    try:
        extract_date(s)
        return True
    except Exception:
        return False

def get_name_root(s):
    return "".join(s.split(".")[0].split("_")[1:-1])

def find_files_to_delete_for_group(group):
    sorted_group = sorted(group, key=extract_date)
    return sorted_group[:-5]        

# groupby only merges adjacent items, so sort by the same key first
fn_groups = groupby(
                sorted(
                    filter(
                        starts_with_date,
                        filenames),
                    key=get_name_root),
                get_name_root
            )

fns_to_delete = chain(*[find_files_to_delete_for_group(g) for k, g in fn_groups])

print(list(fns_to_delete))
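The script above keeps the five newest files per group but does not look at the 30-day limit from the question. One possible way to add it is sketched below. The function `files_to_delete` and its parameters `now`, `keep` and `max_age_days` are hypothetical names introduced here; `now` is passed in explicitly so the cutoff can be tried against the 2012 sample names.

```python
from datetime import datetime, timedelta
from itertools import groupby

def extract_date(s):
    return datetime.strptime(s.split("_")[0], "%Y%m%d%H%M%S")

def starts_with_date(s):
    try:
        extract_date(s)
        return True
    except ValueError:
        return False

def get_name_root(s):
    return "".join(s.split(".")[0].split("_")[1:-1])

def files_to_delete(filenames, now, keep=5, max_age_days=30):
    """Names older than max_age_days, never touching the `keep` newest per root."""
    cutoff = now - timedelta(days=max_age_days)
    dated = sorted(filter(starts_with_date, filenames), key=get_name_root)
    doomed = []
    for _, group in groupby(dated, get_name_root):
        by_date = sorted(group, key=extract_date)
        # everything except the newest `keep` files of this root
        candidates = by_date[:max(len(by_date) - keep, 0)]
        doomed.extend(n for n in candidates if extract_date(n) < cutoff)
    return doomed

# seven made-up files of one pattern, plus one name without a date
names = ["201211%02d011335_team2-Paper_Size_Report_%d.zip" % (day, day)
         for day in range(19, 26)]
names.append("randomefilewithnodate.zip")

print(files_to_delete(names, now=datetime(2013, 1, 1)))
print(files_to_delete(names, now=datetime(2012, 12, 1)))  # []
```

With `now` set to 1 January 2013 the two oldest files are returned; with 1 December 2012 nothing is, because no file is older than 30 days yet.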

score 1 · Accepted Answer

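The difficult part of what you need to do isn't a coding problem, it's a definition problem, so it can't be solved just by writing better code :-)

Why is 20121125011335_team1-pathway_Truck_Report_Data_3464433.zip part of the same group as 20121118011335_team1-pathway_Truck_Report_Data_10342532.zip and 20121119011335_team1-pathway_Truck_Report_Data_102345234.zip? How have you (as a human) recognised that the significant common part is _team1-pathway_Truck_Report_Data_ and not _team1-pathway_Truck_Report_Data_1?

Answer that question (I suspect the answer will involve the words "underscore" and/or "digit") and you'll have a way forward.

"I only know that it will be various iterations of yyyymmddhhmmss_something_consistent_random_random or yyyymmddhhmmss_something_consistent_something_consistent_random_random.xyz"

If those are all the possible variations, then I'd say you need to look for common initial sequences delimited by underscores. This works because the random stuff is always at the end, so if you want the file extension to count as significant you'll have to treat it specially (for example by moving it to the front of the string you're comparing). If you find several files that have three "words" in common but not four, then you assume that the fourth block is "random" and the first three blocks are "consistent". You then sort all files of that type by date, take the five newest off the list, and delete whatever remains that is more than 30 days old.

The "obvious" way to find those common initial sequences is to sort the filenames in lexicographic order of their components other than the date. Files with a common initial sequence are then adjacent, so you can walk through the list, comparing each file against the current longest run of files with a prefix in common.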
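A minimal sketch of that comparison step, assuming every name has the form yyyymmddhhmmss_word_..._random.ext with the timestamp first. The helper names `words_without_date` and `common_words` are made up for illustration.

```python
def words_without_date(filename):
    """Drop the extension and the leading timestamp, return the '_' parts."""
    stem = filename.rsplit(".", 1)[0]
    return stem.split("_")[1:]

def common_words(a, b):
    """Longest run of leading '_'-separated words shared by two names."""
    shared = []
    for x, y in zip(words_without_date(a), words_without_date(b)):
        if x != y:
            break
        shared.append(x)
    return shared

a = "20121118011335_team1-pathway_Truck_Report_Data_10342532.zip"
b = "20121119011335_team1-pathway_Truck_Report_Data_102345234.zip"
c = "20121119011335_team2-Paper_Size_Report_336655.zip"

print(common_words(a, b))  # ['team1-pathway', 'Truck', 'Report', 'Data']
print(common_words(a, c))  # []
```

Sorting a listing with `sorted(names, key=words_without_date)` puts names that share leading words next to each other, which is what makes the single pass over the list possible.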

When coding this, make sure that you handle the following case correctly, if it can occur:

<some_date>_truck1_548372.zip
<some_date>_truck1_847284.zip
<some_date>_truck1_data_4948739.zip
<some_date>_truck1_data_9487203.zip

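That is, make sure you know whether in this case you are dealing with one group ("truck1") or two groups ("truck1" and "truck1_data"). This matters because you might want to exclude any truck1_data files from the requirement to keep 5 truck1 files.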
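To see the ambiguity concretely, one can count how many files share each leading word sequence. In this sketch the dates are invented stand-ins for <some_date>, and treating the last word as the "random" part is an assumption.

```python
from collections import Counter

names = [
    "20121118011335_truck1_548372.zip",
    "20121119011335_truck1_847284.zip",
    "20121120011335_truck1_data_4948739.zip",
    "20121121011335_truck1_data_9487203.zip",
]

def prefixes(filename):
    """All leading word sequences after the timestamp, ignoring the last word."""
    words = filename.rsplit(".", 1)[0].split("_")[1:-1]
    return [tuple(words[:n]) for n in range(1, len(words) + 1)]

counts = Counter(p for name in names for p in prefixes(name))

print(counts[("truck1",)])         # 4
print(counts[("truck1", "data")])  # 2
```

All four files match "truck1", but two of them also match the longer "truck1_data", so the code has to decide which of the two counts the keep-5 rule applies to.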

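Another approach:

Find all the files older than 30 days (for example <some_date>_truck1_57349.zip) and sort them from oldest to newest.
For each file, seek "permission" to delete it as follows:
- remove the date from the start of the filename
- search all files (not just those older than 30 days) for those which, ignoring their own dates, share a common initial underscore-delimited substring with this file (so here we find the truck1 files and the truck1_data files)
- having found those files, find the longest substring shared by at least two of them (truck1_data)
- if the target file doesn't share that substring, remove all the files with that common substring from the set and repeat the previous step (now we're down to just the truck1 files)
- once the target file shares the substring, count them. If there are at least 5, delete the target file.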
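The steps above can be sketched roughly as follows. This is one reading of them, not a definitive implementation: it assumes every name starts with a valid timestamp, compares whole underscore-separated words, and interprets "at least 5" as "at least 5 newer files in the same group", so that five always remain. The names `split_name`, `group_of` and `may_delete` are hypothetical.

```python
from datetime import datetime, timedelta

def split_name(filename):
    """Return (timestamp, words) for 'yyyymmddhhmmss_word_..._word.ext'."""
    stamp, _, rest = filename.rsplit(".", 1)[0].partition("_")
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), tuple(rest.split("_"))

def group_of(target, filenames):
    """Files judged to belong to the same group as target."""
    target_words = split_name(target)[1]
    # start from every file that shares at least the first word with the target
    pool = [f for f in filenames if split_name(f)[1][:1] == target_words[:1]]
    while True:
        # count every leading word sequence that occurs in the pool
        counts = {}
        for f in pool:
            words = split_name(f)[1]
            for n in range(1, len(words) + 1):
                counts[words[:n]] = counts.get(words[:n], 0) + 1
        shared = [p for p, c in counts.items() if c >= 2]
        if not shared:
            return [target]
        longest = max(shared, key=len)
        members = [f for f in pool if split_name(f)[1][:len(longest)] == longest]
        if target_words[:len(longest)] == longest:
            return members
        # the target is not part of that tighter group: set it aside and retry
        pool = [f for f in pool if f not in members]

def may_delete(target, filenames, now, keep=5, max_age_days=30):
    """True if target is old enough and `keep` newer files of its group remain."""
    stamp = split_name(target)[0]
    if stamp >= now - timedelta(days=max_age_days):
        return False
    newer = [f for f in group_of(target, filenames) if split_name(f)[0] > stamp]
    return len(newer) >= keep

# invented dates standing in for <some_date>
names = [
    "20121118011335_truck1_548372.zip",
    "20121119011335_truck1_847284.zip",
    "20121120011335_truck1_data_4948739.zip",
    "20121121011335_truck1_data_9487203.zip",
]
print(group_of(names[0], names))  # the two truck1 files
print(group_of(names[2], names))  # the two truck1_data files
```

For the first file the longer "truck1_data" group is found first, set aside because the target does not share it, and the search then settles on the plain truck1 files.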

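As written this is needlessly slow, but I think it illustrates the point simply. At the last step you can actually delete all but 5 of the remaining files and remove the other five from future consideration, because you have identified a group of files. Likewise, when you remove all the files whose common substring is longer than the one they share with the target file, you have identified a group, and you can process it as a group rather than throwing it back into the sea for future identification.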

score 1 · Accepted Answer

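"The only predictable thing/pattern about the files is that the files always contain a yyyymmddhhmmss timestamp somewhere and some pattern that repeats"

To allow yyyymmddhhmmss anywhere in the filename and to find the repeating patterns automatically, you could first remove yyyymmddhhmmss from the filenames and then use the longest prefix that repeats at least twice as the repeating pattern.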

import os
import re
from itertools import groupby
from os.path import commonprefix

# parse_date, pairwise and month_ago are defined in the utilities further down

def files_to_delete(topdir):
    for rootdir, dirs, files in os.walk(topdir):
        # find files with yyyymmddhhmmss
        files_with_date = []
        for filename in files:
            for m in re.finditer(r"(?:^|\D)(\d{14})(?:\D|$)", filename):
                date = parse_date(m.group(1))
                if date is not None: # found date in the filename
                    # strip date
                    no_date = filename[:m.start(1)] + filename[m.end(1):]
                    # add to candidates for removal
                    files_with_date.append((no_date, date, filename))
                    break

        # find repeating pattern
        files_with_date.sort() # sort by filename with a removed date
        # given ["team1-a", "team2-b", "team2-c"]
        # yield [["team1-a"], ["team2-b", "team2-c"]] where
        #    roots are "team1" and "team2"
        # reject [["team1-a", "team2-b", "team2-c"]] grouping (root "team")
        #     because the longer root "team2" occurs more than once
        # commonprefix takes a single list of strings
        roots = [commonprefix([a[0], b[0]]) for a, b in pairwise(files_with_date)]
        roots.sort(key=len, reverse=True) # longest roots first
        def longest_root(item):
            no_date = item[0]
            # fall back to the full name if no root matches (or the root is empty)
            return next((r for r in roots if no_date.startswith(r)), "") or no_date
        for common_root, group in groupby(files_with_date, key=longest_root):
            # strip 5 newest items (sort by date)
            for _, d, filename in sorted(group, key=lambda item: item[1])[:-5]:
                if d < month_ago: # older than 30 days
                    yield os.path.join(rootdir, filename)

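Note: ['team1-a', 'team2-b', 'team3-c', ...] are grouped together as [['team1-a', 'team2-b', 'team3-c', ...]] using 'team' as the repeating pattern, i.e. the above algorithm fails if the "repeating pattern" does not actually repeat in the file list.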
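This caveat is easy to reproduce with os.path.commonprefix, which compares character by character and takes a single list of strings. The two name lists below are only for illustration.

```python
from os.path import commonprefix

def pairwise_roots(names):
    """Common prefix of each pair of neighbours in an already sorted list."""
    return [commonprefix([a, b]) for a, b in zip(names, names[1:])]

repeating = ["team1-a", "team2-b", "team2-c"]
non_repeating = ["team1-a", "team2-b", "team3-c"]

print(pairwise_roots(repeating))      # ['team', 'team2-']
print(pairwise_roots(non_repeating))  # ['team', 'team']
```

In the second list the only root found is 'team', so all three names end up in one group even though no real pattern repeats.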

Utilities:

from datetime import datetime, timedelta
from itertools import tee

month_ago = datetime.utcnow() - timedelta(days=30)

def parse_date(yyyymmddhhmmss):
    try:
        return datetime.strptime(yyyymmddhhmmss, "%Y%m%d%H%M%S")
    except ValueError:
        return None

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b) # itertools.izip on Python 2

To delete a file, you could call os.remove(path) instead of os.system().
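A small sketch of what that could look like. The function name `remove_files` and its `dry_run` flag are hypothetical; the dry run is there because the grouping is heuristic and it is worth reviewing the list before anything is really deleted.

```python
import os

def remove_files(paths, dry_run=True):
    """Delete each path with os.remove; with dry_run only report what would go."""
    removed = []
    for path in paths:
        if dry_run:
            print("would delete:", path)
            continue
        try:
            os.remove(path)
            removed.append(path)
        except OSError as exc:
            print("could not delete %s: %s" % (path, exc))
    return removed
```

Unlike building an rm command string for os.system(), os.remove() receives the path as is, so names containing spaces or shell metacharacters need no quoting.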

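If you can change the naming scheme of the files in the future to be more deterministic, e.g. by putting [] around the pattern in the filenames, then you could extract the root as: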

root = re.match(r'[^[]*\[([^]]+)\]', filename).group(1)
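For illustration, here is that expression applied to a made-up filename that follows the suggested [] convention. No such names appear in the question, and the guard for a missing match is an addition.

```python
import re

# hypothetical name using the suggested [] convention
filename = "20121118011335_[team1-pathway_Truck_Report_Data]_10342532.zip"

match = re.match(r'[^[]*\[([^]]+)\]', filename)
root = match.group(1) if match else None

print(root)  # team1-pathway_Truck_Report_Data
```

Names without brackets give no match, so calling .group(1) directly on the result would raise an AttributeError for them.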

score 0 · Accepted Answer

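My suggestion:

Search through the files until you find one whose name contains 14 consecutive digits.
Check whether those digits can be a timestamp.
Search for all other files whose names contain the same remainder.
Sort them by date in descending order.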
Starting from the 6th file in the ordered list, delete all files that are older than 30 days.

score 0 · Accepted Answer

If the random part before .zip always consists of digits and underscores, you can use this regex

(\d{14})_(.*?)[_\d]+\.zip

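The first group in the match is the possible date, which you can check and parse with datetime.datetime.strptime. The second group is the constant part that you use for grouping.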
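A short sketch of how the two groups might be used. The wrapper `split_name` is a hypothetical helper, and anchoring the pattern at the start of the name with re.match is an assumption.

```python
import re
from datetime import datetime

pattern = re.compile(r"(\d{14})_(.*?)[_\d]+\.zip")

def split_name(filename):
    """Return (date, constant part), or None if the name does not fit."""
    m = pattern.match(filename)
    if m is None:
        return None
    try:
        date = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
    except ValueError:  # 14 digits that are not a real date
        return None
    return date, m.group(2)

print(split_name("20121118011335_team1-pathway_Truck_Report_Data_10342532.zip"))
print(split_name("randomefilewithnodate.zip"))  # None
```

One caveat: digits at the end of the constant part are absorbed by `[_\d]+`, so a name like 20121118011335_truck1_548372.zip yields "truck" rather than "truck1", and truck1 and truck2 files would land in the same group.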

score 0 · Accepted Answer

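This function converts a date-stamped filename to epoch time. It does not use a regex. It needs import time, and it assumes the name starts with the timestamp.

import time

def timestamp2epoch(name):
    x = []
    # year, month, day, hour, minute, second
    for v in (name[0:4], name[4:6], name[6:8], name[8:10], name[10:12], name[12:14]):
        x.append(int(v))
    # weekday and day-of-year are ignored by mktime; -1 lets it work out DST
    x.extend([0, 0, -1])
    # mktime needs a tuple (or struct_time), not a list; the result is local time
    return time.mktime(tuple(x))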
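An alternative sketch uses strptime, which also rejects digit strings that are not valid dates. The question does not say whether the stamps are local time or UTC, so both readings are shown; the function names are made up.

```python
import calendar
import time
from datetime import datetime

def stamp_to_epoch_utc(stamp):
    """Interpret a yyyymmddhhmmss string as UTC and return epoch seconds."""
    return calendar.timegm(datetime.strptime(stamp, "%Y%m%d%H%M%S").timetuple())

def stamp_to_epoch_local(stamp):
    """Interpret the same string in the machine's local time zone."""
    return time.mktime(time.strptime(stamp, "%Y%m%d%H%M%S"))

print(stamp_to_epoch_utc("20121118011335"))  # 1353201215
print(stamp_to_epoch_local("20121118011335"))
```

Either value can be compared directly with time.time() minus 30 days' worth of seconds, as in the question's script.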

score 0 · Accepted Answer

Just separate the problems: one function finds the patterns, the other finds the dates and deletes things.

import os
import subprocess

def find_patterns():
    filenames = next(os.walk("/path/to/my/data/dir"))[2]
    parsed_patterns = map(lambda file: "_".join(file.split("_")[1:-1]), filenames)
    # drop the empty pattern (names with fewer than three "_" parts),
    # otherwise it would match every file below
    unique_patterns = [p for p in set(parsed_patterns) if p]
    return unique_patterns

def delete_dates_after_past_five_days(unique_patterns):
    filenames = next(os.walk("/path/to/my/data/dir"))[2]
    for pattern in unique_patterns:
        matched_files = filter(lambda file: pattern in file, filenames)
        # sorts by date because the names start with the timestamp
        sorted_files = sorted(matched_files)
        if len(sorted_files) > 5:
            for file in sorted_files[:-5]:
                subprocess.call(["rm", os.path.join("/path/to/my/data/dir", file)])

unique_patterns = find_patterns()
delete_dates_after_past_five_days(unique_patterns)
