I'm using Python to parse a WordPress site downloaded via wget. All the HTML files are nested inside a complicated folder structure (thanks to WordPress and its long URLs), like site_dump/2010/03/11/post-title/index.html.
However, within the post-title directory there are other directories for the feed and for Google News-esque number-based indexes:
site_dump/2010/03/11/post-title/index.html # I want this
site_dump/2010/03/11/post-title/feed/index.html # Not these
site_dump/2010/03/11/post-title/115232/site.com/2010/03/11/post-title/index.html
I only want to access the index.html files that are at the 5th nested level (site_dump/2010/03/11/post-title/index.html), and not beyond. Right now I split the root variable by a slash (/) in the os.walk loop and only deal with the file if it is inside 5 levels of folders:
import os

for root, dirs, files in os.walk('site_dump'):
    nested_levels = root.split('/')
    if len(nested_levels) == 5:
        print(nested_levels)  # Eventually do stuff with the file here
However, this seems kind of inefficient, since os.walk is still traversing those really deep folders. Is there a way to limit how deep os.walk goes when traversing a directory tree?
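For reference, one possible direction is a minimal sketch that relies on the documented os.walk behaviour that, with the default topdown=True, emptying the dirs list in place stops the walk from descending any further. The function name find_post_indexes and its parameters post_depth and filename are hypothetical, made up for illustration; the depth of 4 assumes the post directories sit four levels below site_dump, as in the paths above.

```python
import os

def find_post_indexes(top, post_depth=4, filename='index.html'):
    """Collect `filename` from directories exactly `post_depth` levels below `top`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    matches = []
    for root, dirs, files in os.walk(top):  # topdown=True is the default
        # Depth of `root` relative to `top`: 0 for `top` itself.
        rel = os.path.relpath(root, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= post_depth:
            # Emptying dirs in place tells os.walk not to descend further,
            # so feed/ and the numeric index folders are never visited.
            dirs[:] = []
            if filename in files:
                matches.append(os.path.join(root, filename))
    return matches

if __name__ == '__main__':
    for path in find_post_indexes('site_dump'):
        print(path)  # Eventually do stuff with the file here
```

The assignment has to be dirs[:] = [] rather than dirs = [], because os.walk only sees changes made to the list object it handed out. Computing the depth with os.path.relpath and os.sep also avoids hard-coding the slash, which the split('/') version does.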