
I'm trying to scrape this website. I managed to do it with urllib and BeautifulSoup, but urllib is too slow. I want asynchronous requests, because there are thousands of URLs. I found that grequests is a nice package for this.

Example:

import grequests
from bs4 import BeautifulSoup

page = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"

# The first results page has no offset; later pages are reached via /offset_10, /offset_20, ...
pages = [page]
for i in range(1, 1000):
    pages.append(page + "/offset_{}".format(i * 10))

rs = (grequests.get(item) for item in pages)
a = grequests.map(rs)  # a is a list of responses, in the same order as pages

The problem is that I don't know how to proceed from here and use BeautifulSoup, so that I get the HTML code of each page. I'd be glad to hear your ideas. Thanks!


1 Answer


Refer to the script below, and also check the source link. It will help you.

import grequests
from bs4 import BeautifulSoup

# links: an iterable of page URLs, e.g. the `pages` list built in the question
reqs = (grequests.get(link) for link in links)
# size=10 caps the number of concurrent requests; imap yields responses as they complete
resp = grequests.imap(reqs, size=10)

for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    # These selectors come from the example site in the linked blog post, not spitogatos.gr
    results = soup.find_all('a', attrs={'class': 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': 'pdpPriceMrp'})
    print(prices[0].text)
    discount = soup.find_all('div', attrs={'class': 'listingDiscnt'})
    print(discount[0].text)

Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
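
Applied to the URLs from the question, the same pattern might look like the minimal sketch below. The selectors for the actual listings on spitogatos.gr aren't shown in the question, so this only prints each page's title as a placeholder:

import grequests
from bs4 import BeautifulSoup

base = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
# The first results page has no offset; later pages use /offset_10, /offset_20, ...
pages = [base] + [base + "/offset_{}".format(i * 10) for i in range(1, 1000)]

reqs = (grequests.get(url) for url in pages)
# imap yields each response as soon as it completes; failed requests are skipped
for r in grequests.imap(reqs, size=10):
    soup = BeautifulSoup(r.text, 'lxml')  # r.text holds the full HTML of one results page
    print(r.url, soup.title.text if soup.title else '')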

answered 2020-12-25T07:45:38.207