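New to Python and BeautifulSoup. I have a Python program that opens a file named "example.html", runs a BeautifulSoup operation on it, then a Bleach operation, and saves the result as the file "example-cleaned.html". So far it works for everything in "example.html".
I need to modify it so that it opens every file in the folder "/posts/", runs the program on it, and saves the result as "/posts-cleaned/X-cleaned.html", where X is the original file name.
Here is my code, minimized: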
from bs4 import BeautifulSoup
import bleach
import re
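# Read as bytes so BeautifulSoup can detect the encoding; name the parser explicitly.
with open("posts/example.html", "rb") as fin:
    text = BeautifulSoup(fin, "html.parser")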
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
for s in text(tag_black_list):
    s.decompose()
pretty = text.prettify()
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)
# Write the UTF-8 bytes in binary mode; the with block closes the file.
with open("posts/example-cleaned.html", "wb") as fout:
    fout.write(cleaned.encode("utf-8"))
print("Done")
I'd be glad to get help or pointers to existing solutions!
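
For reference, here is a rough sketch of the folder loop I have in mind, not a finished solution. It uses only the standard library. The names `clean_folder` and `cleaned_name` are made up for illustration, and `clean` stands for any function that takes the HTML text and returns the cleaned text (for example, one wrapping the BeautifulSoup and Bleach steps above). It assumes the posts are UTF-8 and end in ".html".

```python
import io
import os


def cleaned_name(filename):
    """Map 'example.html' to 'example-cleaned.html' (hypothetical helper)."""
    stem, ext = os.path.splitext(filename)
    return stem + "-cleaned" + ext


def clean_folder(src_dir, dst_dir, clean):
    """Run clean() on every .html file in src_dir and save the results in dst_dir.

    clean is a placeholder: any callable that takes the HTML text and
    returns the cleaned text. Returns the list of paths written.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    written = []
    for filename in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, filename)
        # Skip subfolders and anything that is not an HTML file.
        if not os.path.isfile(src_path) or not filename.lower().endswith(".html"):
            continue
        # Assumption: the input files are UTF-8 encoded.
        with io.open(src_path, "r", encoding="utf-8") as fin:
            html = fin.read()
        dst_path = os.path.join(dst_dir, cleaned_name(filename))
        with io.open(dst_path, "w", encoding="utf-8") as fout:
            fout.write(clean(html))
        written.append(dst_path)
    return written
```

With the cleaning steps wrapped in a function, say `my_clean(html)`, the call would look like `clean_folder("posts", "posts-cleaned", my_clean)`. Passing the cleaner in as an argument keeps the file handling separate from the BeautifulSoup and Bleach logic.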