我正在研究用 Python 制作的网络爬虫,我偶然发现了一个非常简单的爬虫。但是,我不明白最后几行,在以下代码中突出显示:
import sys
import re
import urllib2
import urlparse
tocrawl = [sys.argv[1]]
crawled = []
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')
while 1:
crawling = tocrawl.pop(0)
response = urllib2.urlopen(crawling)
msg = response.read()
keywordlist = keywordregex.findall(msg)
crawled.append(crawling)
links = linkregex.findall(msg)
url = urlparse.urlparse(crawling)
a = (links.pop(0) for _ in range(len(links))) //What does this do?
for link in a:
if link.startswith('/'):
link = 'http://' + url[1] + link
elif link.startswith('#'):
link = 'http://' + url[1] + url[2] + link
elif not link.startswith('http'):
link = 'http://' + url[1] + '/' + link
if link not in crawled:
tocrawl.append(link)
那条线对我来说看起来像是某种列表理解,但我不确定,我需要一个解释。