python-2.7 - My web crawler doesn't loop to get all links - using the foo function (Python)

Question

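I am building a web crawler. As a first step, I need to crawl a website and extract all of its links, but my code does not loop to extract them. I tried using append, but that produces a list of lists. I am now trying foo and getting an error. Any help would be appreciated. Thanks.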

from urllib2 import urlopen

import re

def get_all_urls(url):

    get_content = urlopen(url).read()
    url_list = []

    find_url = re.compile(r'a\s?href="(.*)">')
    url_list_temp = find_url.findall(get_content)
    for i in url_list_temp:
        url_temp = url_list_temp.pop()
        source = 'http://blablabla/'
        url = source + url_temp
        url_list.append(url)
    #print url_list
    return url_list


def web_crawler(seed):

    tocrawl = [seed]
    crawled = []

    i = 0

    while i < len(tocrawl):
        page = tocrawl.pop()
        if page not in crawled:
            #tocrawl.append(get_all_urls(page))
            foo = (get_all_urls(page))
            tocrawl = foo
            crawled.append(page)
        if not tocrawl:
            break
    print crawled
    return crawled

score 0 · Accepted Answer

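First, parsing HTML with regular expressions is a bad idea, for the reasons given in these questions:

Here: Python regular expressions for HTML parsing (BeautifulSoup)
Here: Python regular expression matching HTML
Here: regular expression python parsing an html page
and so on.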

You should use an HTML parser for this job. Python ships one in its standard library, HTMLParser, but you can also use BeautifulSoup or even lxml. I tend to prefer BeautifulSoup because of its nice API.
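To illustrate the standard-library route, here is a minimal sketch that collects href values without any regular expression. The names LinkCollector and extract_links are made up for this example, and the import fallback is only there because the module is called HTMLParser in Python 2.7 and html.parser in Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LinkCollector(HTMLParser):
    """Hypothetical helper: stores the href of every <a> tag it is fed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The returned values are whatever the page wrote in href, so relative links still need to be resolved against the page URL before they are crawled.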

Now, back to your problem: you are modifying the list you are iterating over:

for i in url_list_temp:
    url_temp = url_list_temp.pop()
    source = 'http://blablabla/'
    ...

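This is bad because it is, metaphorically, like sawing off the branch you are sitting on. Besides, you do not seem to need this removal at all, since there is no condition deciding whether a URL should be removed or not.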
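A small sketch of why this matters: popping from a list inside a for loop over that same list makes the loop stop early, so only part of the links get processed. The function names below are invented for the demonstration, and urljoin is used here as one possible replacement for the hard-coded source prefix, which is an assumption about what the question intends.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7


def pop_while_iterating(items):
    """Reproduce the pattern from the question on a copy of items."""
    items = list(items)
    seen = []
    for _ in items:
        # The list shrinks on every pass, so the loop ends before
        # every element has been visited.
        seen.append(items.pop())
    return seen


def absolute_urls(base, hrefs):
    """Build absolute URLs without mutating the input list."""
    return [urljoin(base, href) for href in hrefs]
```

With four input items, pop_while_iterating returns only two of them, whereas absolute_urls returns one URL per href.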

Finally, you get an error after using append because, as you said, it creates a list of lists. You should use extend instead.

>>> l1 = [1, 2, 3]
>>> l2 = [4, 5, 6]
>>> l1.append(l2)
>>> l1
[1, 2, 3, [4, 5, 6]]
>>> l1 = [1, 2, 3]
>>> l1.extend(l2)
>>> l1
[1, 2, 3, 4, 5, 6]
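Applied to the crawler loop from the question, extend might be used roughly as follows. This is a sketch under assumptions, not a drop-in fix: get_links stands in for the question's get_all_urls and is passed as a parameter, and FAKE_SITE with fake_get_links is a made-up in-memory site so the loop can be tried without network access.

```python
def web_crawler(seed, get_links):
    """Crawl from seed, using get_links(page) to list a page's links."""
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # extend adds each link individually; append would nest the list.
            tocrawl.extend(get_links(page))
            crawled.append(page)
    return crawled


# Hypothetical link graph used only for this demonstration.
FAKE_SITE = {
    'a': ['b', 'c'],
    'b': ['a', 'd'],
    'c': [],
    'd': ['c'],
}


def fake_get_links(page):
    return FAKE_SITE.get(page, [])
```

Calling web_crawler('a', fake_get_links) visits every page exactly once, even though the fake pages link back to each other. With the real get_all_urls the loop would look the same, with get_links replaced by that function.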

Note: see http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/ for more help on scraping with BeautifulSoup.
