python - Using BeautifulSoup/Python to parse every file in a directory and save each as a new file

Question

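New to Python and BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup operation on it, then runs a Bleach operation on it, and saves the result as "example-cleaned.html". So far it works for everything in "example.html".

I need to modify it so that it opens every file in the folder "/posts/", runs the program on it, and saves the result as "/posts-cleaned/X-cleaned.html", where X is the original file name.

Here is my code, minimized: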

from bs4 import BeautifulSoup
import bleach
import re

# Binary mode lets BeautifulSoup detect the file's encoding itself.
with open("posts/example.html", "rb") as fin:
    text = BeautifulSoup(fin, "html.parser")

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
for s in text(tag_black_list):
    s.decompose()
pretty = text.prettify()

# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
# strip must be the boolean True, not the string "TRUE".
cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

# Write the result as UTF-8 text; the with block closes the file.
with open("posts/example-cleaned.html", "w", encoding="utf-8") as fout:
    fout.write(cleaned)
print("Done")

Happy to receive help and pointers to existing solutions!

score 5 · Accepted Answer

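You can use os.listdir() to get a list of all the entries in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk().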
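As a rough illustration of the difference between the two calls, the sketch below collects file paths with each one. The helper names list_files_flat and list_files_recursive are made up for this example and do not appear in the question's code.

```python
import os


def list_files_flat(directory):
    """Return paths of regular files directly inside `directory`."""
    paths = []
    for name in os.listdir(directory):
        # listdir() yields bare names, so join them with the directory.
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            paths.append(path)
    return sorted(paths)


def list_files_recursive(directory):
    """Return paths of all regular files below `directory`, at any depth."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            # dirpath already includes `directory` plus any subfolders.
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

For a flat folder of posts the first function is enough; the second only matters if posts live in subfolders.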

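I would move all of the code that handles a single file into a function, and then write a second function that handles the whole directory. Something like this: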

import os

import bleach
from bs4 import BeautifulSoup

def clean_dir(directory):

    # os.listdir() returns bare names, so join each one with the directory
    # instead of relying on os.chdir().
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            clean_file(path)

def clean_file(filename):

    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p', 'div']
    attr_white_list = {'*': ['title']}

    # Binary mode lets BeautifulSoup detect the file's encoding itself.
    with open(filename, 'rb') as fhandle:
        text = BeautifulSoup(fhandle, 'html.parser')

    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    # this inserts -cleaned before the extension;
    # os.path.splitext also copes with names that have no '.'
    root, ext = os.path.splitext(filename)
    cleaned_filename = '{0}-cleaned{1}'.format(root, ext)

    with open(cleaned_filename, 'w', encoding='utf-8') as fout:
        fout.write(cleaned)

    print("Done")

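Then you just call clean_dir('/posts') or the like.

I appended "-cleaned" to the file names, but I think I prefer your idea of using an entirely new directory. That way you don't have to deal with conflicts if some -cleaned file already exists.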

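I also used the with statement to open the files here, since it closes them automatically, even when an exception is raised inside the block.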
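To make the point about with concrete, here is a small sketch showing that the file is closed when the block exits, even if an exception is raised inside it. Note that with does not swallow the exception; it still propagates and has to be caught separately. The function name write_then_fail and the simulated ValueError are illustrative only.

```python
def write_then_fail(path):
    """Write to `path`, raise inside the with block, return the file handle."""
    handle = None
    try:
        with open(path, "w", encoding="utf-8") as fout:
            handle = fout
            fout.write("partial output")
            raise ValueError("simulated failure while processing")
    except ValueError:
        # The exception propagated out of the with block to here,
        # and the file was closed on the way out.
        pass
    return handle
```

Checking the closed attribute of the returned handle shows whether the file was released despite the error.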

score 2 · Accepted Answer

Answering my own question, for anyone else who finds the Python documentation for os.listdir a bit unhelpful:

from bs4 import BeautifulSoup
import bleach
import re
import os, os.path

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

# The output folder has to exist before anything is written into it.
os.makedirs("posts-cleaned", exist_ok=True)

postlist = os.listdir("posts/")

for post in postlist:

    # HERE: you need to specify the directory again, the value of "post" is just the filename.
    # Binary mode lets BeautifulSoup detect the file's encoding itself.
    with open(os.path.join("posts", post), "rb") as fin:
        text = BeautifulSoup(fin, "html.parser")

    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    # Write the result as UTF-8 text; the with block closes the file.
    with open(os.path.join("posts-cleaned", post), "w", encoding="utf-8") as fout:
        fout.write(cleaned)

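I cheated and created a separate folder called "posts-cleaned/", because saving the files there was easier than splitting the file name, adding "cleaned", and joining it back together. If anyone wants to show me a good way to do that, even better.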
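One possible way to do the split-and-rejoin is os.path.splitext, which separates the extension from the rest of the name. The sketch below is only one option among several; cleaned_path and its out_dir parameter are hypothetical names introduced for this example.

```python
import os


def cleaned_path(filename, out_dir="posts-cleaned"):
    """Map e.g. 'posts/example.html' to 'posts-cleaned/example-cleaned.html'."""
    # basename() drops any leading directory such as 'posts/'.
    root, ext = os.path.splitext(os.path.basename(filename))
    # ext is '' when the name has no dot, so names like 'README' also work.
    return os.path.join(out_dir, root + "-cleaned" + ext)
```

Inside the loop, the output file could then be opened at cleaned_path(post) instead of building the path by hand.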
