database - Folder search algorithm

Question

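I'm not sure whether this kind of question is usually asked here, or whether I'll get any answers, but I'm looking for a pseudo-code approach to generating DB linking records from a folder structure containing image files.

I have a set of folders, structured as follows: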

+-make_1/
  | +--model_1/
  |    +-default_version/
  |    |   +--1999
  |    |   +--2000
  |    |   |   +--image_01.jpg
  |    |   |   +--image_02.jpg
  |    |   |   +--image_03.jpg
  |    |   |   ...
  |    |   +--2001
  |    |   +--2002
  |    |   +--2003
  |    |   ...
  |    |   +--2009
  |    +--version_1/
  |    |   +--1999
  |    |   ...
  |    |   +--2009
  |    +--version_2/
  |    |   +--1999
  |    |   +--2000
  |    |   +--2001
  |    |   |   +--image_04.jpg
  |    |   |   +--image_05.jpg
  |    |   |   +--image_06.jpg
  |    |   |   ...
  |    |   +--2002
  |    |   +--2003
  |    |   |   +--image_07.jpg
  |    |   |   +--image_08.jpg
  |    |   |   +--image_09.jpg
  |    |   ...
  |    |   +--2009
  ...  ... ...

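In essence, this represents the possible images for vehicles, by year, starting in 1999.

Makes and models (e.g. make: Alfa Romeo, model: 145) come in various trims or versions. Each trim or version may be found in a number of vehicles that look the same but differ in, say, fuel type or engine capacity.

To save duplication, the folder structure above makes use of a default folder, and images appear for the default version from 2000 onwards. I need to produce the links table for each version, based on whether it has its own overriding images or makes use of the default version.

For example, version_1 has no image files, so I need to create links for the default images, starting in 2000 and continuing until 2009.

Version 2, on the other hand, starts out using the default images in 2000, but then uses two new sets, first for 2001-2002 and then for 2003-2009. The list of links required is therefore: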

version    start     end   file_name
=======    =====   =====   =========
version_1   2000    2009   image_01.jpg
version_1   2000    2009   image_02.jpg
version_1   2000    2009   image_03.jpg
...
version_2   2000    2001   image_01.jpg
version_2   2000    2001   image_02.jpg
version_2   2000    2001   image_03.jpg
version_2   2001    2003   image_04.jpg
version_2   2001    2003   image_05.jpg
version_2   2001    2003   image_06.jpg
version_2   2003    2009   image_07.jpg
version_2   2003    2009   image_08.jpg
version_2   2003    2009   image_09.jpg
...

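(Default is just that: a placeholder, and no links are required for it.)

At the moment I'm walking the folders, building arrays, and then trimming the fat at the end. I was just wondering whether there is a shortcut, using some kind of text-processing approach? There are about 45,000 folders, most of which are empty :-)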

score 1 · Accepted Answer

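Here's some Python pseudocode, pretty close to executable (it relies on `import os` and on a writerow function that does the actual writing, be it to an intermediate file, a DB, a CSV, whatever):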

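import os

def writerow(version, start, end, file_name):
    # stand-in: replace with the real write (intermediate file, DB, CSV, ...)
    print(version, start, end, file_name)

# first, collect all the data in a dict of dicts of lists
# first key is version, second key is year (as an int; the list may be empty)
# assumes every version folder contains year subfolders, as in the question
model_dir = os.path.join('make_1', 'model_1')
tree = dict()
for root, dirs, files in os.walk(model_dir):
    if root == model_dir:
        continue  # the model folder itself is not a version
    head, tail = os.path.split(root)
    if dirs:
        # here, tail is a version
        tree[tail] = dict()
    else:
        # here, tail is a year
        tree[os.path.basename(head)][int(tail)] = sorted(files)

# now specialcase default_version
default_version = tree.pop('default_version')
# determine range of years; rule is quite asymmetrical:
#   for min, only years with files in them count
min_year = min(y for y in default_version if default_version[y])
#   for max, all years count, even if empty
max_year = max(default_version)

# from here on, only the years that actually hold images matter
default_version = {y: f for y, f in default_version.items() if f}
tree = {v: {y: f for y, f in years.items() if f} for v, years in tree.items()}

for version, years in tree.items():
    changes = dict(years)           # copy, so tree itself stays intact
    changes[max_year + 1] = None    # sentinel: the year after the last one
    current_files = default_version[min_year]
    y = min_year
    while changes:
        next_change = min(changes)
        if y < next_change:
            for f in current_files:
                writerow(version, y, next_change - 1, f)
        y = next_change
        current_files = changes.pop(y)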
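
The answer leaves the actual writing done by writerow open. As a minimal sketch of one possibility, assuming the rows should end up as CSV: the `make_writerow` helper and the column names are illustrative choices, not part of the original answer, and an in-memory buffer stands in for a real output file.

```python
import csv
import io

def make_writerow(stream):
    """Return a writerow(version, start, end, file_name) bound to stream."""
    writer = csv.writer(stream)
    writer.writerow(['version', 'start', 'end', 'file_name'])  # header row

    def writerow(version, start, end, file_name):
        writer.writerow([version, start, end, file_name])

    return writerow

# io.StringIO stands in for something like open('links.csv', 'w', newline='')
buffer = io.StringIO()
writerow = make_writerow(buffer)
writerow('version_1', 2000, 2009, 'image_01.jpg')
csv_text = buffer.getvalue()
print(csv_text)
```

Binding the writer in a closure keeps the `writerow(version, start, end, file_name)` call signature used by the loops, so the output target can be swapped without touching them.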

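One ambiguity in the spec and example is whether default_version can change its set of files over the years. Here, I'm assuming that doesn't happen (only specific versions change that way; the default version always carries a single set of files).

If that's not the case: what happens if the default version changes in, say, 1999 and 2003, and version 1 changes in 2001 and 2005? Which files should version 1 use for 2003 and 2004: the new ones in the default version, or those it specified in 2001?

In the most complicated version of the spec (both default_version and a specific version can change, the most recent change takes precedence, and the specific version wins when both change in the same year), one needs to obtain the full sequence of "next change years" for each specific version by carefully "priority merging" the sequences of change years for the default and the specific version, instead of just using years (the sequence of changes in the specific version) as I do here. Each change year in the sequence must, of course, be associated with the appropriate set of files.

So, if the exact spec can be expressed, down to the corner cases, I can show how to do the needed merging by modifying this pseudocode. I'd rather not do the work until the exact specs are clarified because, if the specs are indeed simpler, the work would be unneeded! -)

Edit: as a new comment clarified, the exact specs are indeed the most complicated ones, so we do have to do the merging properly. The loop at the end of the simplistic answer above therefore changes to: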

for version, years_dict in tree.items():
    # have years_dict override default_version when coincident
    # (both hold only years with images, keyed by the year as an int)
    merged = {**default_version, **years_dict}
    current_files = merged.pop(min_year)
    merged[max_year + 1] = None
    y = min_year
    while merged:
        next_change = min(merged)
        for f in current_files:
            writerow(version, y, next_change - 1, f)
        y = next_change
        current_files = merged.pop(y)

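The key change is the line that builds merged: in Python, it creates a new dict (a dict is a generic mapping, typically called a hashmap in other languages) that is the union, or merge, of default_version and years_dict; when a key is present in both, the value from years_dict takes precedence. That meets the key condition for a year that is present in both (i.e., a year in which the files change in both).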
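
The precedence rule can be checked in isolation. The sketch below uses made-up file names and integer years as keys (an assumption about how the tree is keyed). It also shows why, on Python 3, the keyword-argument form of the merge only works when the keys are strings.

```python
# hypothetical data: both mappings change in 2003
default_version = {2000: ['image_01.jpg'], 2003: ['image_10.jpg']}
years_dict = {2001: ['image_04.jpg'], 2003: ['image_07.jpg']}

# later mappings win, so years_dict overrides default_version for 2003
merged = {**default_version, **years_dict}

# equivalent two-step spelling
merged_alt = dict(default_version)
merged_alt.update(years_dict)

# dict(a, **b) passes b as keyword arguments, which must be strings
try:
    dict(default_version, **years_dict)
    keyword_form_works = True
except TypeError:
    keyword_form_works = False

print(sorted(merged), merged[2003], keyword_form_works)
```

Neither input mapping is modified by the merge, so default_version can be reused for every version.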

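After that it's plain sailing: anydict.pop(somekey) returns the value corresponding to the key (and also removes it from anydict); min(anydict) returns the smallest key in the dictionary. Note the "sentinel" idiom at merged[max_year + 1] = None: it says that the year "one after the max one" is always deemed to be a change year (with a dummy placeholder value of None), so that the last set of rows is always written properly (with an end year of max_year + 1 - 1, that is, exactly max_year, as desired).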
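
A small trace of these three pieces (pop, min and the sentinel) on hypothetical data, collecting year ranges in a list instead of writing rows:

```python
# hypothetical change years for one version, already merged
merged = {2003: ['image_07.jpg'], 2000: ['image_01.jpg'], 2001: ['image_04.jpg']}
max_year = 2009

merged[max_year + 1] = None       # sentinel: one past the last real year

y = min(merged)                   # smallest key, here 2000
current_files = merged.pop(y)     # returns the value and removes the key

ranges = []
while merged:
    next_change = min(merged)
    ranges.append((y, next_change - 1, current_files))
    y = next_change
    current_files = merged.pop(y)

for start, end, files in ranges:
    print(start, end, files)
```

Without the sentinel the loop would stop after the 2001-2002 range and the last set of files would never be emitted.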

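This algorithm is not maximally efficient, just simplest! We call min(merged) over and over, which makes it O(N squared). I think we can afford that, because each merged should hold a few dozen change years at most, but a purist would wince. We can of course present an O(N log N) solution: just sort the years once and for all and walk that sequence to get the successive values of next_change. Just for completeness: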

default_version[max_year + 1] = None

for version, years_dict in tree.items():
    # years_dict overrides default_version when a year is in both
    merged = {**default_version, **years_dict}
    for next_change in sorted(merged):
        if next_change > min_year:
            for f in merged[y]:
                writerow(version, y, next_change - 1, f)
        y = next_change

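Here sorted gives a list with the keys of merged in sorted order, and I've switched to a for statement to walk that list from beginning to end (with an if statement to output nothing the first time through). The sentinel is now put in default_version (so it's outside the loop, for another slight optimization). It's funny to see that this optimized version (essentially because it works at a slightly higher level of abstraction) turns out to be smaller and simpler than the previous ones ;-).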
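
For checking the logic without touching the file system, the sorted-walk idea can be wrapped in a function and run on the question's example. This is a variation rather than the answer's own code: `link_rows` is a hypothetical name, it pairs consecutive change years with zip instead of tracking y, it starts at the earliest change year in the merged mapping rather than at min_year, and it assumes both mappings hold only years that contain images, keyed by integer year.

```python
def link_rows(default_version, years_dict, max_year):
    """Yield (start, end, file_name) tuples for one version; end is inclusive."""
    merged = {**default_version, **years_dict}  # specific version wins ties
    merged[max_year + 1] = None                 # sentinel closes the last range
    years = sorted(merged)
    # pair every change year with the change year that follows it
    for start, next_change in zip(years, years[1:]):
        for file_name in merged[start]:
            yield (start, next_change - 1, file_name)

# data taken from the folder listing in the question
default_version = {2000: ['image_01.jpg', 'image_02.jpg', 'image_03.jpg']}
version_2 = {
    2001: ['image_04.jpg', 'image_05.jpg', 'image_06.jpg'],
    2003: ['image_07.jpg', 'image_08.jpg', 'image_09.jpg'],
}

rows_v1 = list(link_rows(default_version, {}, 2009))
rows_v2 = list(link_rows(default_version, version_2, 2009))
for row in rows_v2:
    print('version_2', *row)
```

The end years are inclusive, following the `next_change - 1` convention of the loops above, so version_2 gets the ranges 2000-2000, 2001-2002 and 2003-2009.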
