python - BeautifulSoup4 performance

Question

This is a simple piece of code that looks up elements with a specific id. As an example, I took a large, random Wikipedia article.

Test code:

# coding: utf8

from bs4 import BeautifulSoup, Tag
import requests
import time
import sys

TAG_NAME = "li"


def find_with_index(index, id):
    if id and id in index:
        return index[id]
    return None

page_text = requests.get("https://en.wikipedia.org/wiki/United_States").text
page = BeautifulSoup(page_text, 'lxml')
if page:
    print("Page was downloaded and parsed")
else:
    print("Something wrong.")
    sys.exit()

all_ids = set()
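# Collect the id of every <li> element that has one.
# .descendants replaces the deprecated recursiveChildGenerator().
for child in page.descendants:
    if isinstance(child, Tag) and child.name == TAG_NAME and child.has_attr("id"):
        all_ids.add(child["id"])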

print(str(len(all_ids)) + " ids in total")

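# time.clock() was removed in Python 3.8; time.perf_counter() is the replacement.
# Variant 1: one page.find() call per id.
bs_find_start = time.perf_counter()
[page.find(TAG_NAME, {"id": id}) for id in all_ids]
bs_find_end = time.perf_counter()

# Variant 2: build a dict index once, then look every id up in it.
index_find_start = time.perf_counter()
simple_index = {li.attrs["id"]: li for li in page.find_all(TAG_NAME) if li.has_attr("id")}
[find_with_index(simple_index, id) for id in all_ids]
index_find_end = time.perf_counter()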

print("Spent on bs.find: " + str(bs_find_end - bs_find_start))
print("Spent on indexed find: " + str(index_find_end - index_find_start))

I get this output:

Spent on bs.find: 122.81345616345673
Spent on indexed find: 0.027779648046461602

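The problem: in terms of performance, this is an absolute disaster. Does this mean that BS has no internal index of any kind, and that it walks the entire DOM tree again for every lookup, no matter what I am searching for? Or am I missing how to perform lookups efficiently? With many lookups (100+) this can become a serious bottleneck, and the cause is far from obvious to track down.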
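
For reference, the indexed variant from the test code boils down to the minimal sketch below. It assumes a small inline HTML document and the standard-library "html.parser" backend so that it runs without network access or lxml. The helper name build_id_index is hypothetical and not part of BeautifulSoup. It only illustrates walking the tree once and answering every later lookup from a dict.

```python
from bs4 import BeautifulSoup

# Small inline document (an assumption for illustration) instead of a downloaded page.
html = "<ul>" + "".join(
    f'<li id="item-{i}">Item {i}</li>' for i in range(200)
) + "</ul>"
soup = BeautifulSoup(html, "html.parser")


def build_id_index(root, tag_name):
    """Hypothetical helper: walk the tree once and map each id to its element."""
    # id=True matches every tag_name element that has an id attribute at all.
    return {el["id"]: el for el in root.find_all(tag_name, id=True)}


index = build_id_index(soup, "li")

# Both lookups return the same Tag object; only the cost per lookup differs.
by_find = soup.find("li", id="item-150")   # searches the tree on every call
by_index = index.get("item-150")           # plain dict lookup
print(by_find is by_index)
print(index.get("missing"))                # None for an unknown id
```

The dict has to be rebuilt whenever the tree is modified, so this only pays off when many lookups run against an unchanged document.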
