python - Aiohttp async session requests

Question

I have been using a session with requests to scrape the protected pages of a website (www.cardsphere.com), like so:

import requests

payload = {
    'email': '<enter-email-here>',
    'password': '<enter-site-password-here>',
}

with requests.Session() as session:
    # Every call goes through the session so that the login cookies persist.
    session.get('<site-login-page>')
    session.post('<site-login-here>', data=payload)
    page1 = session.get('<site-protected-page1>')
    # save stuff from page1
    page2 = session.get('<site-protected-page2>')
    # save stuff from page2
    # ...
    pageN = session.get('<site-protected-pageN>')
    # save stuff from pageN

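Because there are a lot of pages, I would like to speed this up with aiohttp + asyncio, but I'm missing something. I have more or less managed to use it to scrape unprotected pages, like so: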

import asyncio
import aiohttp

async def get_cards(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            data = await resp.text()
            # <do-stuff-with-data>

urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    # ...
    'https://www.<urlN>.com',
]

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(get_cards(url) for url in urls)
    )
)

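This gives some results, but how do I do it for pages that require a login? I tried adding session.post(<login-url>, data=payload) inside the async function, but that obviously didn't work well: it just logs in again on every call. Is there a way to "set up" an aiohttp ClientSession before the loop function? I need to log in first and then, in the same session, fetch data from a bunch of protected links using asyncio + aiohttp.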

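I'm still fairly new to Python, and even more so to async, so I'm missing some key concept here. If anyone can point me in the right direction, I'd greatly appreciate it.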

score 2 · Accepted Answer

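This is the simplest approach I can come up with. The key change is to create a single ClientSession in main(), log in once, and pass that session to every get_cards call so they all share its cookies. Depending on what you do in <do-stuff-with-data>, you may run into other concurrency trouble, and down the rabbit hole you go... Just kidding. Wrapping your head around coroutines, promises, and tasks is a little complicated, but once you get it, it is as simple as sequential programming.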

import asyncio
import aiohttp


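async def get_cards(url, session, sem):
    # The semaphore caps the number of concurrent requests; the session is shared by all calls.
    async with sem, session.get(url) as resp:
        data = await resp.text()
        # <do-stuff-with-data>
        return data  # returned so that gather() in main() collects something useful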


urls = [
    'https://www.<url1>.com',
    'https://www.<url2>.com',
    'https://www.<urlN>.com'
]


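async def main():
    sem = asyncio.Semaphore(100)  # at most 100 requests in flight at once
    async with aiohttp.ClientSession() as session:
        # Log in once; the session's cookie jar keeps the auth cookies for later requests.
        # 'auth_url' and the form fields are placeholders for the site's real login details.
        await session.get('auth_url')
        await session.post('auth_url', data={'user': None, 'pass': None})
        tasks = [asyncio.create_task(get_cards(url, session, sem)) for url in urls]
        results = await asyncio.gather(*tasks)
        return results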


asyncio.run(main())
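
The login-once, shared-session pattern above can be tried without touching a real site. The sketch below is only an illustration under stated assumptions: FakeSession, get_page, and scrape are hypothetical names that are not part of aiohttp, and the "requests" are simulated with asyncio.sleep. It shows that the login runs a single time before any page task starts, and that the semaphore bounds how many fetches are in flight at once.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession; makes no network calls."""

    def __init__(self):
        self.logged_in = False
        self.login_calls = 0  # how many times login() was invoked
        self.active = 0       # simulated requests currently in flight
        self.peak = 0         # highest number of simultaneous requests seen

    async def login(self):
        await asyncio.sleep(0)  # pretend to POST the credentials
        self.login_calls += 1
        self.logged_in = True

    async def fetch(self, url):
        if not self.logged_in:
            raise PermissionError(f"not logged in: {url}")
        self.active += 1
        self.peak = max(self.peak, self.active)
        try:
            await asyncio.sleep(0.01)  # pretend to wait for the server
            return f"body of {url}"
        finally:
            self.active -= 1


async def get_page(url, session, sem):
    # Only `limit` coroutines can be inside this block at the same time.
    async with sem:
        return await session.fetch(url)


async def scrape(urls, limit):
    session = FakeSession()
    sem = asyncio.Semaphore(limit)
    await session.login()  # log in once, before any page task is started
    # gather() returns the results in the same order as the input URLs.
    pages = await asyncio.gather(*(get_page(u, session, sem) for u in urls))
    return pages, session


if __name__ == "__main__":
    pages, session = asyncio.run(scrape([f"page{i}" for i in range(10)], limit=3))
    print(len(pages), "pages,", session.login_calls, "login, peak", session.peak)
```

With a real aiohttp.ClientSession the structure stays the same: the session object created in main() is the one passed to every task, so the login happens once instead of once per URL.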
