javascript - How do I scrape a web page that loads elements with JavaScript as you scroll?

Question

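A friend asked me whether I could write a web scraping script to collect Pokemon data from a particular website.

I wrote the following code to render the JavaScript and grab a specific class in order to collect the data from the site ( https://www.smogon.com/dex/ss/pokemon/ ).

The problem is that the page loads more entries as you scroll down. Is there a way to scrape those as well? I'm new to web scraping, so I'm not entirely sure how all of this works.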

from requests_html import HTMLSession

def getPokemon(link):
    session = HTMLSession()
    r = session.get(link)
    r.html.render()
    for pokemon in r.html.find("div.PokemonAltRow"):
        print(pokemon)
    quit()

getPokemon('https://www.smogon.com/dex/ss/pokemon/')

score 3 · Accepted Answer

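The data is actually present in the page source, so no rendering or scrolling is needed. See view-source:https://www.smogon.com/dex/ss/pokemon/ (it is stored as a JavaScript variable inside a script tag).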

import requests
import re
import json


response = requests.get('https://www.smogon.com/dex/ss/pokemon/')

# Pull the JSON text assigned to the `dexSettings` JavaScript variable out of the page source.
data = "".join(re.findall(r'dexSettings = (\{.*\})', response.text))

# The regex only returns a string; json.loads() parses it into Python dicts and lists.
data = json.loads(data)

# The Pokemon entries sit under injectRpcs[1][1]['items'] in the parsed structure.
data = data.get('injectRpcs', [])[1][1].get('items', [])

for row in data:
    print(row.get('name', ''))
    print(row.get('description', ''))

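As a variation on the extraction step, here is a minimal sketch that avoids the greedy regex. A pattern like `\{.*\}` runs through to the last closing brace on the line, so it can capture too much if another object follows the one you want. `json.JSONDecoder().raw_decode` parses exactly one JSON value starting at a given index and ignores the rest. The helper name `extract_embedded_json` and the sample HTML are hypothetical, made up for illustration; only the `dexSettings = ` marker and the `injectRpcs[1][1]['items']` path come from the answer above.

```python
import json


def extract_embedded_json(html, marker):
    """Return the JSON value that immediately follows `marker` in `html`.

    Hypothetical helper: `marker` must end right where the JSON literal
    begins (for example "dexSettings = "), because raw_decode does not
    skip leading whitespace.
    """
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found in page source")
    # raw_decode parses one complete JSON value starting at the given index
    # and returns it together with the index where parsing stopped.
    value, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return value


# Made-up page source that mimics the structure described in the answer.
sample_html = (
    '<script>dexSettings = {"injectRpcs": [["a", {}], '
    '["b", {"items": [{"name": "Bulbasaur", "description": "placeholder"}]}]]};'
    ' var other = {"x": 1};</script>'
)

settings = extract_embedded_json(sample_html, "dexSettings = ")
items = settings["injectRpcs"][1][1]["items"]
names = [item["name"] for item in items]
print(names)
```

To use it against the real site, pass `response.text` from `requests.get(...)` instead of `sample_html`. If the spacing around the `=` in the live page differs from the marker, `raw_decode` will raise `json.JSONDecodeError`, so adjust the marker to match the source.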
