I'm scraping a list of similar web pages, and sometimes I get an error (shown at the end).
The code I'm using:
from requests_html import HTMLSession
import pyppdf.patch_pyppeteer
link = 'https://www.wildberries.ru/catalog/1588749/detail.aspx?targetUrl=BP'
# It's always a different link from the list, but here I simplified it.
session = HTMLSession()
resp = session.get(link)
resp.html.render()
Most pages load without errors, but a few fail. The error is raised either at resp = session.get(link) or at resp.html.render(). Here is the traceback:
Traceback (most recent call last):
  File "/Users/max/Dropbox/WORK/projects/wildberries_parser/parsers/catalog_parser_3.py", line 133, in <module>
    row = parse_item_page(link)
  File "/Users/max/Dropbox/WORK/projects/wildberries_parser/parsers/catalog_parser_3.py", line 36, in parse_item_page
    resp.html.render()
  File "/Users/max/opt/anaconda3/envs/wildberries_parser/lib/python3.6/site-packages/requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "/Users/max/opt/anaconda3/envs/wildberries_parser/lib/python3.6/asyncio/base_events.py", line 488, in run_until_complete
    return future.result()
  File "/Users/max/opt/anaconda3/envs/wildberries_parser/lib/python3.6/site-packages/requests_html.py", line 512, in _async_render
    await page.goto(url, options={'timeout': int(timeout * 1000)})
  File "/Users/max/opt/anaconda3/envs/wildberries_parser/lib/python3.6/site-packages/pyppeteer/page.py", line 856, in goto
    raise PageError(result)
pyppeteer.errors.PageError: net::ERR_NAME_NOT_RESOLVED at https://www.wildberries.ru/catalog/1588749/detail.aspx?targetUrl=BP
I can't make sense of it and haven't been able to figure it out on my own. Can you tell me what is going on here?
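Since net::ERR_NAME_NOT_RESOLVED means the host name could not be resolved, and the same code works for most links, the failure may be transient. As a stopgap I'm considering a retry wrapper along these lines. This is only a sketch under that assumption: fetch_with_retry and its parameters are hypothetical names of mine, not part of requests_html or pyppeteer.

```python
import time


def fetch_with_retry(fetch, link, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call fetch(link), retrying when one of the `retry_on` exceptions is raised.

    `fetch` is any callable that takes a link and returns a result
    (hypothetical helper; in my script it would wrap session.get + render).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(link)
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure.
                time.sleep(delay * attempt)
    # Every attempt failed: re-raise the last error so it is not hidden.
    raise last_error
```

In my parser, fetch would be a small function that calls session.get(link) followed by resp.html.render(), and retry_on would be limited to the errors I actually see, for example pyppeteer.errors.PageError from the traceback above. If the error persists across retries, the cause is probably not transient.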