python - Finding links fast: regex vs. lxml

Question

I am trying to build a fast web crawler, and as a result, I need an efficient way to locate all the links on a page. What is the performance comparison between a fast XML/HTML parser like lxml and using regex matching?

score 7 · Accepted Answer

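The issue here isn't regex versus lxml. Regex simply isn't a solution. How would you restrict which elements the links come from? A more realistic example is malformed HTML. How would you extract the contents of the href attribute from this link?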

<A href = /text" data-href='foo>' >Test</a>

lxml parses it just fine, just as Chrome does, but good luck getting a regex to work. If you're curious about the actual speed difference, here's a quick test I ran.
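To make the regex side of that claim concrete, here is a minimal sketch using only the standard library. `LINK_REGEX` is the pattern from the benchmark in this answer; `LOOSE_REGEX` is a hypothetical "loosened" pattern made up for this illustration, not something from the original answer.

```python
import re

# Pattern from the benchmark: only matches double-quoted href values.
LINK_REGEX = re.compile(r'href="(.*?)"')

# Hypothetical loosened pattern (an assumption for illustration):
# optional whitespace around "=", optional quote, case-insensitive.
LOOSE_REGEX = re.compile(r'''href\s*=\s*["']?([^"'\s>]+)''', re.IGNORECASE)

# The malformed link from the answer.
malformed = """<A href = /text" data-href='foo>' >Test</a>"""

strict_matches = LINK_REGEX.findall(malformed)
loose_matches = LOOSE_REGEX.findall(malformed)

print(strict_matches)  # []
print(loose_matches)   # ['/text', 'foo']
```

The strict pattern finds nothing, because the value is not wrapped in double quotes. The loosened pattern does find something, but it also reports `foo`, which comes from the `data-href` attribute and is not a link at all. Each further patch to the pattern tends to trade one such failure for another.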

Setup:

import re
import lxml.html

def test_lxml(html):
    root = lxml.html.fromstring(html)
    #root.make_links_absolute('http://stackoverflow.com/')

    for href in root.xpath('//a/@href'):
        yield href

LINK_REGEX = re.compile(r'href="(.*?)"')

def test_regex(html):
    for href in LINK_REGEX.finditer(html):
        yield href.group(1)

Test HTML:

import requests

html = requests.get('http://stackoverflow.com/questions?pagesize=50').text

Results:

In [22]: %timeit list(test_lxml(html))
100 loops, best of 3: 9.05 ms per loop

In [23]: %timeit list(test_regex(html))
1000 loops, best of 3: 582 us per loop

In [24]: len(list(test_lxml(html)))
Out[24]: 412

In [25]: len(list(test_regex(html)))
Out[25]: 416

For comparison, here's how many links Chrome picks out:

> document.querySelectorAll('a[href]').length
413
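The answer does not break down why the counts differ, but one plausible contributor is that the regex matches any double-quoted `href`, regardless of which element it sits on, and misses single-quoted values. The sketch below illustrates this on a made-up sample page. To avoid a third-party dependency it uses the standard library's `html.parser` as a stand-in for lxml; `SAMPLE`, `AnchorCollector`, and `anchor_hrefs` are hypothetical names introduced here.

```python
import re
from html.parser import HTMLParser

LINK_REGEX = re.compile(r'href="(.*?)"')  # same pattern as in the benchmark

# Hypothetical sample page, made up for this illustration.
SAMPLE = """<html>
<head><link href="style.css" rel="stylesheet"></head>
<body>
<a href="/questions">Questions</a>
<a href='/tags'>Tags</a>
<a name="top">No href here</a>
</body>
</html>"""


class AnchorCollector(HTMLParser):
    """Collect href values from <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; skip everything that is not an anchor.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def anchor_hrefs(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


regex_hrefs = LINK_REGEX.findall(SAMPLE)
parsed_hrefs = anchor_hrefs(SAMPLE)

print(regex_hrefs)   # ['style.css', '/questions']
print(parsed_hrefs)  # ['/questions', '/tags']
```

Here the regex picks up the stylesheet URL from `<link>` and misses the single-quoted `/tags`, while the parser-based version returns only the hrefs of `<a>` elements. Both lists have two entries, so similar counts do not mean the same links were found.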

Also, just for the record, Scrapy is one of the best web scraping frameworks out there, and it uses lxml to parse HTML.

score -2 · Answer

You can use pyquery, a Python library that gives you jQuery-style functions.

Answered 2013-06-05T06:50:39.423
