python - A more elegant way to create a list of lines from BeautifulSoup.findAll

Question

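I'm writing a web parser with BeautifulSoup. I build a list of lines from the output of bs.findAll(text=True), splitting each text node into lines, and then apply my logic to those lines. html_payload is an arbitrary web page.

The code I have so far works, but it isn't very pretty, which makes me think there must be a better, more elegant way to write it.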

    data_to_parse = BeautifulSoup(html_payload)
    lines_to_parse = []

    d = data_to_parse.findAll(text=True)
    for line in d:
        for line2 in line.strip().split('\n'):
            if line2:
                lines_to_parse.append(line2)

    for line in lines_to_parse:
        pass # here's where I start analyzing results

Can anyone suggest a better way to do this?

score 1 · Accepted Answer

Just get all the text at once and split it into lines:

    data_to_parse = BeautifulSoup(html_payload)
    for line in data_to_parse.get_text().split("\n"):
        pass  # ... do something
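
One difference worth noting: unlike the loop in the question, splitting on "\n" this way keeps empty and whitespace-only lines. Here is a minimal sketch of a filter, assuming you want only non-blank lines. `nonblank_lines` is a hypothetical helper name, not part of BeautifulSoup, and the sample string stands in for the result of `data_to_parse.get_text()`. It strips every line individually, which is slightly stricter than the question's code.

```python
def nonblank_lines(text):
    """Yield each line of text, stripped, skipping lines that are empty after stripping."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


# Sample string standing in for data_to_parse.get_text() (an assumption for illustration)
sample_text = "\n  Title \n\nFirst paragraph\n   \nSecond paragraph\n"
lines_to_parse = list(nonblank_lines(sample_text))
```

With real input you would call `nonblank_lines(data_to_parse.get_text())` and iterate over the result directly.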

score 1 · Accepted Answer

You can use a list comprehension:

    lines_to_parse = [line2 for line in data_to_parse.findAll(text=True) for line2 in line.strip().split('\n') if line2]

Alternatively, you can combine the collection and analysis steps:

    d = data_to_parse.findAll(text=True)
    for line in d:
        for line2 in line.strip().split('\n'):
            if line2:
                pass  # analyze here
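
A middle ground between the two options is a generator, which keeps the analysis loop flat without building an intermediate list. This is only a sketch: `iter_lines` is a hypothetical helper name, and plain strings stand in for the text nodes that `data_to_parse.findAll(text=True)` would return.

```python
def iter_lines(text_nodes):
    """Yield non-empty lines from an iterable of strings, mirroring the loop in the question."""
    for node in text_nodes:
        for line in node.strip().split('\n'):
            if line:
                yield line


# Plain strings standing in for data_to_parse.findAll(text=True) (an assumption for illustration)
sample_nodes = ["\nHeading\n", "   ", "first\nsecond", "\n"]
for line in iter_lines(sample_nodes):
    print(line)  # analyze here
```

With real input the loop would read `for line in iter_lines(data_to_parse.findAll(text=True)):`.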

Or, bearing in mind that you aren't making heavy use of BeautifulSoup, xmltodict might help you collect the data into a list; take a look.

Hope that helps.
