python - What does the following Python code do? It looks like a list comprehension with parentheses.

Question

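I'm studying web crawlers written in Python and came across a very simple one. However, I don't understand the last few lines, highlighted in the following code: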

import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []

keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)

    a = (links.pop(0) for _ in range(len(links)))  # What does this do?

    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link

        if link not in crawled:
            tocrawl.append(link)

That line looks like some sort of list comprehension to me, but I'm not sure. I need an explanation.

score 9 · Accepted Answer

It is a generator expression. It simply empties the links list as you iterate over it.

They could have replaced this part

a = (links.pop(0) for _ in range(len(links)))  # What does this do?

for link in a:

with this:

while links:
    link = links.pop(0)

It would work the same way. But since popping from the end of a list is more efficient, this would be better than either of those:

links.reverse()
while links:
    link = links.pop()

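Of course, if you are fine with following the links in reverse order (I don't see why they need to be processed in order), then it would be even more efficient to skip reversing the links list and just pop items off the end.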
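To make the comparison concrete, here is a minimal sketch of the loops discussed above, wrapped in hypothetical helper functions (drain_front, drain_reversed and drain_deque are names invented for this illustration, not part of the crawler). The third variant uses collections.deque, which this answer does not mention; it is included only as one common alternative when items must be taken from the front.

```python
from collections import deque


def drain_front(links):
    """Pop from the front, as the crawler does. Each pop(0) shifts the rest of the list."""
    out = []
    while links:
        out.append(links.pop(0))
    return out


def drain_reversed(links):
    """Reverse once, then pop from the end. Yields the same order as drain_front."""
    links.reverse()
    out = []
    while links:
        out.append(links.pop())
    return out


def drain_deque(links):
    """Alternative not in the answer: copy into a deque and pop from its left end.

    Note that this leaves the original list untouched, unlike the two loops above.
    """
    queue = deque(links)
    out = []
    while queue:
        out.append(queue.popleft())
    return out
```

All three return the links in their original order. The first two also leave the list they were given empty, which matches what the generator expression does once it has been fully consumed.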

score 2 · Accepted Answer

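It creates a generator that pops objects off the links list.

Explanation:

In Python 2, which this code targets, range(len(links)) returns a list of the numbers from 0 up to, but not including, the length of the links list. So if links contains [ "www.yahoo.com", "www.google.com", "www.python.org" ], it will generate the list [0, 1, 2].

for _ in blah just loops over that list, discarding each value.

links.pop(0) removes the first item from links.

The whole expression returns a generator that pops items off the head of the links list.

Finally, a demonstration in the Python console: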

>>> links = [ "www.yahoo.com", "www.google.com", "www.python.org "]
>>> a = (links.pop(0) for _ in range(len(links)))
>>> a.next()
'www.yahoo.com'
>>> links
['www.google.com', 'www.python.org ']
>>> a.next()
'www.google.com'
>>> links
['www.python.org ']
>>> a.next()
'www.python.org '
>>> links
[]
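The session above uses the Python 2 method a.next(). As a sketch of the same demonstration under Python 3 (an assumption, since the crawler itself is Python 2 code), the built-in next() takes its place. The variable names first, remaining_after_first and rest are introduced here only for illustration.

```python
links = ["www.yahoo.com", "www.google.com", "www.python.org"]
a = (links.pop(0) for _ in range(len(links)))

# Python 3 spelling of a.next(): pops and returns the first link.
first = next(a)

# Snapshot of what is still in the list after one item was requested.
remaining_after_first = list(links)

# Consuming the rest of the generator pops every remaining link.
rest = list(a)

print(first)
print(remaining_after_first)
print(rest)
print(links)
```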

score 0 · Accepted Answer

a = (links.pop(0) for _ in range(len(links)))

could also be written as:

a = []
for _ in range(len(links)):
    a.append(links.pop(0))

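Edit:

The only difference is that with a generator the work is done lazily, so items are only popped from links as they are requested through a, rather than all at once. That is more efficient when dealing with a lot of data, and there is no way to do it without using more advanced Pythonic features.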
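A small sketch of that difference, written for Python 3 (an assumption; the names lazy, eager and eager_source are made up for this example). It also shows one detail worth knowing: range(len(links)) is evaluated when the generator expression is created, so the number of pops is fixed at that moment even if links grows later.

```python
links = ["a", "b", "c"]
lazy = (links.pop(0) for _ in range(len(links)))

# Creating the generator pops nothing; links is still intact.
before_any_request = list(links)

# Requesting one item pops exactly one link.
first = next(lazy)
after_one_request = list(links)

# The pop count was fixed at 3 when the generator was created,
# so a link appended now is not consumed by the generator.
links.append("d")
rest = list(lazy)
left_over = list(links)

# The eager loop from this answer, written as a list comprehension,
# empties its source immediately.
eager_source = ["a", "b", "c"]
eager = [eager_source.pop(0) for _ in range(len(eager_source))]
```

In the crawler the generator is consumed right away by the for loop, so the laziness makes no practical difference there.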
