I have a script that parses a list containing several thousand URLs. My problem is that it takes a very long time to work through the whole list.

Each URL request takes about 4 seconds to load and parse the page. Is there any way to parse a large number of URLs quickly?

My code looks like this:

from bs4 import BeautifulSoup   
import requests                 

#read url-list
with open('urls.txt') as f:
    content = f.readlines()
# remove whitespace characters
content = [line.strip('\n') for line in content]

#LOOP through url list and get information
#(retry the whole list up to 5 times if a request fails, stop once it succeeds)
for i in range(5):
    try:
        for url in content:

            #get information
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")

            #just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except requests.RequestException:
        continue
    else:
        break

EDIT: How can I handle asynchronous requests with hooks in this example? I tried the following, as described in Asynchronous Requests with Python requests on this site:

from bs4 import BeautifulSoup   
import grequests

def parser(response):
    for url in urls:

        #get information
        link = requests.get(response)
        data = link.text
        soup = BeautifulSoup(data, "html5lib")

        #just example scraping
        name = soup.find_all('h1', {'class': 'name'})

#read urls.txt and store in list variable
with open('urls.txt') as f:
    urls= f.readlines()
# you may also want to remove whitespace characters 
urls = [line.strip('\n') for line in urls]

# A list to hold our things to do via async
async_list = []

for u in urls:
    # The "hooks = {..." part is where you define what you want to do
    # 
    # Note the lack of parentheses following do_something, this is
    # because the response will be used as the first argument automatically
    rs = grequests.get(u, hooks = {'response' : parser})

    # Add the task to our list of things to do via async
    async_list.append(rs)

# Do our list of things to do via async
grequests.map(async_list, size=5)

This does not work for me. I don't even get any errors in the console; it just runs for a long time until it stops.

1 Answer

In case someone is curious about this question: I decided to start my project again from scratch and use Scrapy instead of BeautifulSoup.

Scrapy is a full framework for web scraping. It has built-in support for handling thousands of requests at once, and you can throttle the request rate so that you scrape the target site in a "friendlier" way.
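
As a rough sketch (not the exact code from my project), a spider for this use case could look something like the block below. The spider name, the urls.txt file, the h1.name selector and the throttle values are just placeholders mirroring the example from the question:

import scrapy


class NameSpider(scrapy.Spider):
    name = "names"

    # throttle so the target site is scraped "more friendly"
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,     # requests handled in parallel
        "DOWNLOAD_DELAY": 0.25,        # pause between requests to the same domain
        "AUTOTHROTTLE_ENABLED": True,  # let Scrapy adapt the rate automatically
    }

    def start_requests(self):
        # read the URL list, one URL per line
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # just example scraping, same as soup.find_all('h1', {'class': 'name'})
        for heading in response.css("h1.name::text").extract():
            yield {"url": response.url, "name": heading}

You would run it with something like scrapy runspider name_spider.py -o names.json and tune CONCURRENT_REQUESTS / DOWNLOAD_DELAY to whatever the site tolerates.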

I hope this helps someone. For me it was the better choice for this project.

Answered 2017-09-22T10:39:42.777