这是一段简单的代码,用于查找具有特定 id 的元素。例如,我拿了大的随机 Wiki 文章。
测试代码:
# coding: utf8
from bs4 import BeautifulSoup, Tag
import requests
import time
import sys
TAG_NAME = "li"
def find_with_index(index, id):
if id and id in index:
return index[id]
return None
page_text = requests.get("https://en.wikipedia.org/wiki/United_States").text
page = BeautifulSoup(page_text, 'lxml')
if page:
print("Page was downloaded and parsed")
else:
print("Something wrong.")
sys.exit()
all_ids = set()
for child in page.recursiveChildGenerator():
if type(child) is Tag and child.has_attr("id") and child.name == TAG_NAME:
all_ids.add(child.attrs["id"])
print(str(len(all_ids)) + " ids in total")
bs_find_start = time.clock()
[page.find(TAG_NAME, {"id": id}) for id in all_ids]
bs_find_end = time.clock()
index_find_start = time.clock()
simple_index = {li.attrs["id"]: li for li in page.find_all(TAG_NAME) if li.has_attr("id")}
[find_with_index(simple_index, id) for id in all_ids]
index_find_end = time.clock()
print("Spent on bs.find: " + str(bs_find_end - bs_find_start))
print("Spent on indexed find: " + str(index_find_end - index_find_start))
我有这个输出:
Spent on bs.find: 122.81345616345673
Spent on indexed find: 0.027779648046461602
问题是:就性能而言,这绝对是一场灾难。这是否意味着 BS 内部没有任何类型的索引,并且无论我需要查找什么,都会一次又一次地遍历整个 DOM 树来执行查找操作?或者我不完全了解如何有效地执行查找操作?当有很多查找操作(100+)时,这可能是一个严重的瓶颈,我不能说找到问题是非常明显的。