python - Improving the efficiency of a Python os.walk + regex algorithm

Question

I'm using os.walk to select files from a specific folder that match a regular expression.

import os
import re

for dirpath, dirs, files in os.walk(str(basedir)):
    files[:] = [f for f in files if re.match(regex, os.path.join(dirpath, f))]
    print(dirpath, dirs, files)

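But this has to process every file and folder under basedir, which is quite time consuming. I'm looking for a way to use the same regular expression that I use for the files to filter out unwanted directories at each step of the walk. Or a way to match only part of the regular expression...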

For example, in a structure like

/data/2013/07/19/file.dat

and using, for example, the following regular expression

/data/(?P<year>2013)/(?P<month>07)/(?P<day>19)/(?P<filename>.*\.dat)

to find all .dat files without having to look into, for example, /data/2012.

score 1 · Accepted Answer

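For example, if you only want to process the files in /data/2013/07/19, just start os.walk() from the top directory /data/2013/07/19. This is similar to Tommi Komulainen's suggestion, but you don't need to modify your loop code.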
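A minimal sketch of this approach, assuming the target directory is known before the walk starts. The helper name `find_files` and its parameters are made up for illustration; the loop body is the one from the question, and only the starting directory changes.

```python
import os
import re


def find_files(top, regex):
    """Walk only the tree below `top` and return the full paths matching `regex`."""
    pattern = re.compile(regex)
    matches = []
    for dirpath, dirs, files in os.walk(top):
        for f in files:
            path = os.path.join(dirpath, f)
            # Same check as in the question: match the regex against the full path
            if pattern.match(path):
                matches.append(path)
    return matches
```

Calling `find_files("/data/2013/07/19", regex)` never touches /data/2012, because os.walk() only descends from the directory it is given.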

score 0 · Accepted Answer

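I stumbled on this question (it is clear what the problem is, even though no actual question is asked), and since nobody has answered it, I thought an answer might still be useful even this late.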

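You need to split the original RE into segments so that you can filter the intermediate directories inside the loop. Filter first, then match the files.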

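import os
import re

regex_parts = regex.split("/")
del regex_parts[0]  # The regex starts with "/", so [0] is "" and is not needed

for base, dirs, files in os.walk(root):
    # Depth of `base` below `root`: 0 for root itself, 1 for its children, ...
    rel = os.path.relpath(base, root)
    depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
    if depth < len(regex_parts) - 1:
        # Above the file level: keep only subdirectories matching this level's segment
        dirs[:] = [d for d in dirs if re.match(regex_parts[depth], d)]
        continue

    files[:] = [f for f in files if re.match(regex, os.path.join(base, f))]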

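Since you are matching files (the last part of the path), there is no reason to run the full match until you have filtered as far as you can. The len check is there so that directories that happen to match the last part are not filtered out. This could probably be made more efficient, but it worked for me (I ran into a similar problem today).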
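The segment-by-segment filtering described above could be packaged as a reusable generator, roughly as follows. This is a sketch under stated assumptions, not the answer's exact code: the name `walk_pruned` is hypothetical, the regex is assumed to start with "/" and to contain no "/" inside any segment, `root` is assumed to be the directory that the leading "/" refers to, and each segment is matched with `fullmatch` (Python 3.4+) so that a segment such as `2013` does not also accept a directory named `20130`.

```python
import os
import re


def walk_pruned(root, regex):
    """Yield file paths under `root`, pruning directories level by level.

    Assumption: `regex` starts with "/" and no segment contains "/" itself.
    """
    parts = [re.compile(p) for p in regex.split("/")[1:]]
    file_pattern = parts[-1]  # last segment describes the file name
    for base, dirs, files in os.walk(root):
        # Depth of `base` below `root`: 0 for root, 1 for its children, ...
        rel = os.path.relpath(base, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth < len(parts) - 1:
            # Keep only subdirectories that match this level's segment
            dirs[:] = [d for d in dirs if parts[depth].fullmatch(d)]
            continue
        # File level reached: do not descend further, match file names only
        dirs[:] = []
        for f in files:
            if file_pattern.fullmatch(f):
                yield os.path.join(base, f)
```

With the regex from the question and `root="/"`, the walk would enter /data, then only /data/2013, then 07, then 19, and would skip /data/2012 entirely. Matching each segment on its own means the named groups are not available on a single match object; if they are needed, the yielded path can be matched against the full regex afterwards.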
