为了为具有动态加载内容的页面创建爬虫,requests-html
提供模块以在 JS 执行后获取渲染页面。但是,当尝试在多线程实现中AsyncHTMLSession
调用该arender()
方法时,生成的 HTML 不会改变。
例如,在源代码中提供的 URL 中,表格 HTML 值默认为空,并且在脚本执行后,由arender()
预期将值插入标记的方法模拟,尽管在源代码中没有注意到可见的变化。
from pprint import pprint
#from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor
from requests_html import AsyncHTMLSession, HTML
async def fetch(session, url):
r = await session.get(url)
await r.html.arender()
return r.content
def parseWebpage(page):
print(page)
async def get_data_asynchronous():
urls = [
'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
]
with ThreadPoolExecutor(max_workers=20) as executor:
with AsyncHTMLSession() as session:
# Set any session parameters here before calling `fetch`
# Initialize the event loop
loop = asyncio.get_event_loop()
# Use list comprehension to create a list of
# tasks to complete. The executor will run the `fetch`
# function for each url in the urlslist
tasks = [
await loop.run_in_executor(
executor,
fetch,
*(session, url) # Allows us to pass in multiple arguments to `fetch`
)
for url in urls
]
# Initializes the tasks to run and awaits their results
for response in await asyncio.gather(*tasks):
parseWebpage(response)
def main():
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(get_data_asynchronous())
loop.run_until_complete(future)
main()