
I'm using Python to parse a WordPress site downloaded via wget. All the HTML files are nested inside a complicated folder structure (thanks to WordPress and its long URLs), like site_dump/2010/03/11/post-title/index.html.

However, within the post-title directory there are other directories for the feed and for Google News-esque number-based indexes:

site_dump/2010/03/11/post-title/index.html  # I want this
site_dump/2010/03/11/post-title/feed/index.html  # Not these
site_dump/2010/03/11/post-title/115232/site.com/2010/03/11/post-title/index.html

I only want to access the index.html files that are at the 5th nested level (site_dump/2010/03/11/post-title/index.html), and not beyond. Right now I split the root variable by a slash (/) in the os.walk loop and only deal with the file if it is inside 5 levels of folders:

import os

for root, dirs, files in os.walk('site_dump'):
  nested_levels = root.split('/')
  if len(nested_levels) == 5:
    print(nested_levels)  # Eventually do stuff with the file here

However, this seems kind of inefficient, since os.walk is still traversing those really deep folders. Is there a way to limit how deep os.walk goes when traversing a directory tree?


1 Answer


You can modify dirs in place to stop os.walk from descending any further into the directory tree.

import os

for root, dirs, files in os.walk('site_dump'):
  nested_levels = root.split('/')
  if len(nested_levels) == 5:
    del dirs[:]  # Empty the list in place so os.walk does not descend further
    # Eventually do stuff with the file here

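del dirs[:] deletes the contents of the list rather than rebinding dirs to a new list. Modifying the list in place matters here, because os.walk keeps a reference to that same list object and reads it to decide which subdirectories to visit next.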
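To make the difference concrete, here is a minimal sketch that walks a small temporary tree twice, once rebinding dirs and once clearing it in place. The helper name visited_dirs and the a/b directory layout are made up for this illustration; they are not part of the question.

```python
import os
import tempfile

def visited_dirs(top, in_place):
    """Return the directories os.walk visits, as sorted paths relative to top."""
    visited = []
    for root, dirs, files in os.walk(top):
        visited.append(os.path.relpath(root, top))
        if in_place:
            del dirs[:]  # mutates the list object that os.walk itself holds
        else:
            dirs = []    # only rebinds the loop variable; os.walk never sees this
    return sorted(visited)

# Hypothetical layout for the demo: <tmp>/a/b
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "a", "b"))
    rebound = visited_dirs(top, in_place=False)
    cleared = visited_dirs(top, in_place=True)

print(rebound)  # every directory is still visited
print(cleared)  # only the top directory is visited
```

Slice assignment (dirs[:] = []) would behave the same way as del dirs[:], since it also changes the existing list instead of creating a new one.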

From the documentation, regarding topdown, an optional argument to os.walk that you omitted and that defaults to True:

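When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False is ineffective, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.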
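Putting the pruning behaviour together with the layout from the question, one possible sketch is a small generator that yields only the files sitting at an exact depth. The name files_at_depth and its parameters are hypothetical, and the depth is measured relative to the starting directory with os.path.relpath rather than by splitting on '/', which is an assumption made here so that it does not depend on the path separator. The temporary tree only imitates the site_dump structure for the demo.

```python
import os
import tempfile

def files_at_depth(top, target_depth, filename="index.html"):
    """Yield paths to `filename` located exactly target_depth directories below top."""
    for root, dirs, files in os.walk(top):  # topdown defaults to True
        rel = os.path.relpath(root, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth == target_depth:
            del dirs[:]  # prune: os.walk will not descend below this level
            if filename in files:
                yield os.path.join(root, filename)

# Demo tree imitating site_dump/2010/03/11/post-title/ with a nested feed/ folder.
with tempfile.TemporaryDirectory() as dump:
    post = os.path.join(dump, "2010", "03", "11", "post-title")
    os.makedirs(os.path.join(post, "feed"))
    for folder in (post, os.path.join(post, "feed")):
        with open(os.path.join(folder, "index.html"), "w") as fh:
            fh.write("<html></html>")
    found = [os.path.relpath(p, dump) for p in files_at_depth(dump, 4)]

print(found)
```

With the real dump this would be called as files_at_depth('site_dump', 4), because post-title sits four directories below site_dump. The feed/ folder and anything deeper are never entered, so the cost of walking them is avoided.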

Answered 2013-07-05T16:38:20.700