
I recently had a look at the python-requests module and I would like to write a simple web crawler with it. Given a collection of start urls, I want to write a Python function that searches the page content of those start urls for other urls, then calls the same function again as a callback with the new urls as input, and so on. At first I thought event hooks would be the right tool for this purpose, but the documentation section on them is quite sparse. On another page I read that functions used for event hooks have to return the same object that was passed to them. So event hooks are apparently not feasible for this kind of task. Or did I simply misunderstand...

Here is some pseudocode of what I want to do (borrowed from a pseudo Scrapy spider):

import lxml.html

def parse(response):
    # Request(url, callback) is Scrapy-style pseudocode, not a real requests API
    for url in lxml.html.parse(response.url).xpath('//@href'):
        yield Request(url=url, callback=parse)

Can somebody give me a hint on how to do this with python-requests? Are event hooks the right tool, or do I need something different? (Note: Scrapy is not an option for me, for various reasons.) Thanks a lot!


1 Answer


Here is how I would do it:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, "html.parser")
    urls = [link.get('href') for link in soup.find_all('a') if link.get('href')]
    return urls


def print_url(r, *args, **kwargs):
    # 'response' hooks receive the completed Response object
    print(r.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, recursively finds all descendant urls
    """
    if not urls:
        return
    rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
    responses = grequests.map(rs)
    # grequests.map returns None for requests that failed, so filter those out
    url_lists = [get_urls_from_response(r) for r in responses if r is not None]
    urls = sum(url_lists, [])  # flatten list of lists into a list
    recursive_urls(urls)

I haven't tested the code, but the general idea is there.

Note that I use grequests instead of requests for better performance. grequests is basically gevent + requests, and in my experience it is much faster for this kind of task, because you retrieve the links asynchronously with gevent.


Edit: here is the same algorithm without using recursion:

import grequests
from bs4 import BeautifulSoup


def get_urls_from_response(r):
    soup = BeautifulSoup(r.text, "html.parser")
    urls = [link.get('href') for link in soup.find_all('a') if link.get('href')]
    return urls


def print_url(r, *args, **kwargs):
    # 'response' hooks receive the completed Response object
    print(r.url)


def recursive_urls(urls):
    """
    Given a list of starting urls, iteratively finds all descendant urls
    """
    while urls:
        rs = [grequests.get(url, hooks={'response': print_url}) for url in urls]
        responses = grequests.map(rs)
        # grequests.map returns None for requests that failed, so filter those out
        url_lists = [get_urls_from_response(r) for r in responses if r is not None]
        urls = sum(url_lists, [])  # flatten list of lists into a list

if __name__ == "__main__":
    recursive_urls(["INITIAL_URLS"])
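One caveat: the code above never remembers which urls it has already fetched, so it will re-crawl duplicates and can loop forever on cyclic link graphs. A minimal sketch of the same idea with a visited set and a page cap, using plain requests (the `crawl` and `extract_links` names and the `max_pages` parameter are illustrative, not part of any library):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def extract_links(html, base_url=""):
    """Return absolute hrefs found in an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a.get("href"))
            for a in soup.find_all("a") if a.get("href")]


def crawl(start_urls, max_pages=100):
    """Breadth-first crawl that skips already-visited urls."""
    seen = set(start_urls)
    queue = list(start_urls)
    while queue and max_pages > 0:
        url = queue.pop(0)
        max_pages -= 1
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        print(r.url)
        for href in extract_links(r.text, base_url=r.url):
            if href not in seen:
                seen.add(href)
                queue.append(href)
```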
Answered 2012-09-02T17:59:19.023