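As part of an assignment, I have to traverse a directory tree, and os.walk seems like the best choice. I'm running my Python script under Cygwin. The path of the tree I want to traverse is: /cygdrive/c/Users/Kamal/Documents/School/Spring2015/CS410/htmlfiles
In my code, here is the snippet with the os.walk() call: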
for dirName, subdirList, fileList in os.walk('/cygdrive/c/Users/Kamal/Documents/School/Spring2015/CS410/htmlfiles'):
However, when I execute the script, it gives me the following error:
$ python testparser.py
Traceback (most recent call last):
  File "testparser.py", line 1, in <module>
    Spring2015/CS410/htmlfiles/testparser.py
NameError: name 'Spring2015' is not defined
I'm confused about why it thinks 'Spring2015' is undefined. The directory clearly exists at the given path on my computer.
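One possible reading of the traceback, assuming (this is not confirmed by the code posted here) that line 1 of testparser.py really contains the bare text the traceback quotes: Python would parse `Spring2015/CS410/htmlfiles/testparser.py` as a chain of divisions between variables, so it fails on the first name it looks up, before os.walk is ever reached. A minimal sketch that reproduces that kind of failure; the helper name `error_from_stray_line` is hypothetical:

```python
def error_from_stray_line(source):
    """Run one line of text as Python code and return the NameError message, if any."""
    try:
        # Execute with a fresh globals dict, so none of the names in the line exist.
        exec(source, {})
    except NameError as err:
        return str(err)
    return None


# The same text that the traceback reports for line 1 of testparser.py.
message = error_from_stray_line("Spring2015/CS410/htmlfiles/testparser.py")
print(message)
```

If that is what happened, the directory on disk is irrelevant to the error: the line is treated as an expression, not as a path string.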
EDIT: Here is the whole code, since some people asked:
from bs4 import BeautifulSoup
import os
import shutil
cnt = 0
print "starting..."
for dirName, subdirList, fileList in os.walk('/cygdrive/c/Users/Kamal/Documents/School/Spring2015/CS410/htmlfiles'):
    for f in fileList:
        #print the path
        print "Processing " + os.path.abspath(f) + "...\n"
        #open the HTML file
        html = open(f)
        soup = BeautifulSoup(html)
        #Filter out unwanted stuff
        [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()
        visible_text_encoded = visible_text.encode('utf-8')
        visible_text_split = visible_text_encoded.split('\n')
        visible_text_filtered = filter(lambda l: l != '', visible_text_split)
        #Generate the name of the output text file
        outfile_name = 'chaya2_' + str(cnt) + '.txt'
        #Open the output file to write in
        outfile = open(outfile_name, "w")
        #Get the URL of the html file using its Path, write it to the first line
        outfile.write(os.path.relpath(f, '/cygdrive/c/Users/Kamal/Documents/School/') + ' \n')
        #Write the visible text to the output file
        for l in visible_text_filtered:
            outfile.write(l+'\n')
        #Done writing, move the output file to the appropriate directory
        shutil.move(os.path.abspath(outfile_name), '/cygdrive/c/Users/Kamal/Documents/School/Spring2015/CS410/txtFiles')
        #Rename the html file
        html_name = 'chaya2_' + str(cnt) + '.html'
        os.rename(f, html_name)
        #Move the html file to the appropriate directory
        shutil.move(os.path.abspath(html_name), '/cygdrive/c/Users/Kamal/Documents/School/Spring2015/CS410/htmlFilesAfter')
        print html_name + " converted to " + outfile_name + "\n"
        outfile.close()
        html.close()
        cnt+=1
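A side note on the loop, offered as a sketch and not as the cause of the NameError: os.walk yields bare file names in fileList, so calls such as open(f) or os.path.abspath(f) resolve them against the current working directory instead of the directory being walked. Joining each name with dirName gives a usable path. The helper name `full_paths` is hypothetical:

```python
import os


def full_paths(root):
    """Return the path of every file under root, each joined with its own directory."""
    paths = []
    for dir_name, subdir_list, file_list in os.walk(root):
        for f in file_list:
            # f is only the file name; dir_name is the directory that contains it.
            paths.append(os.path.join(dir_name, f))
    return sorted(paths)
```

Each returned path could then be passed to open() or shutil.move() regardless of where the script is started from.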