python - Fast way to extract CSS style properties from HTML elements

Question

For machine learning purposes, I take an HTML page as input and want to extract all the style properties of all its DOM elements. Here is my first attempt:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

start = time.time()
driver = webdriver.PhantomJS()
driver.get('example page')
# Select only leaf nodes (elements with no child elements)
elements = driver.find_elements(By.XPATH, "//*[not(child::*)]")
l = {}
css_properties = ("line-height", "text-align", "font-size", "font-style")

for i in elements:
    if i.text:
        #print time.time() - end_dl
        if i.text not in l:
            l[i.text] = {}
        for el in css_properties:
            l[i.text][el] = str(i.value_of_css_property(el))
        # Set once per element rather than once per CSS property
        l[i.text]["text_length"] = len(i.text)

The problem is that this code takes too long (~8 s) to extract my features. Can anyone think of a faster way?

score 0 · Accepted Answer

Are you sure it is the parsing step that takes so long?
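One way to check is to time each phase separately. Below is a minimal sketch using only the standard library; the `timed` helper and the phase labels are hypothetical names, and the `time.sleep` calls are placeholders standing in for `driver.get(...)`, `find_elements(...)` and the feature loop from the question.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase label.
timings = {}


@contextmanager
def timed(label):
    """Add the duration of the enclosed block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


# Placeholder workloads: replace the sleeps with the real steps.
with timed("load"):
    time.sleep(0.02)      # e.g. driver.get(...)
with timed("extract"):
    time.sleep(0.01)      # e.g. the loop over elements

# Print the phases, slowest first.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.3f}s")
```

If most of the 8 s turns out to be page load or driver start-up, a different parser will not help much.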

If so, here are a few options:

Try BeautifulSoup4 to parse the DOM.
Deploy on a cloud server with faster hardware. Amazon EC2 and DigitalOcean both bill by the hour.
Deploy on a distributed system.
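For the BeautifulSoup4 option, here is a rough sketch of what the extraction could look like. It rests on a big assumption: BeautifulSoup only parses markup and does not apply stylesheets, so it can read inline `style` attributes but not the computed values that `value_of_css_property` returns. The helper names (`parse_inline_style`, `leaf_inline_styles`) and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

CSS_PROPERTIES = ("line-height", "text-align", "font-size", "font-style")


def parse_inline_style(style):
    """Split an inline style string such as 'a: b; c: d' into a dict."""
    declarations = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep:  # skip empty or malformed fragments
            declarations[name.strip().lower()] = value.strip()
    return declarations


def leaf_inline_styles(html):
    """Map the text of each leaf element to its inline CSS properties."""
    soup = BeautifulSoup(html, "html.parser")
    features = {}
    for tag in soup.find_all(True):
        # A leaf has no child elements (text nodes do not count).
        if tag.find(True) is not None:
            continue
        text = tag.get_text(strip=True)
        if not text:
            continue
        style = parse_inline_style(tag.get("style", ""))
        # Properties missing from the inline style are stored as "".
        row = {prop: style.get(prop, "") for prop in CSS_PROPERTIES}
        row["text_length"] = len(text)
        features[text] = row
    return features


sample = (
    '<div><p style="font-size: 12px; text-align: center">Hello</p>'
    "<span>World</span></div>"
)
features = leaf_inline_styles(sample)
```

If the pages style their text through external or embedded stylesheets rather than inline attributes, this returns mostly empty values and a real browser is still needed for the computed styles.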
