Always ask yourself - is there an easier way?
Yes, there is :)
If you want to scrape, first check whether you really have to pull the content out of the website's HTML, or whether there is an API that already delivers the information in a well-structured form.

Example of requesting the API:
import requests
import pandas as pd

url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=true&page=1&limit=50&sort=offensive.avgPoints%3Adesc&season=2021&seasontype=2"
headers = {"user-agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()

ranking = []
for i, player in enumerate(response.json()['athletes'], start=1):
    rank = i
    name = player['athlete']['displayName']
    team = player['athlete']['teamShortName']
    category = player['athlete']['position']['abbreviation']
    ranking.append({'rank': rank, 'name': name, 'team': team, 'category': category})

df = pd.DataFrame(ranking)
df
Output DataFrame:
rank  name                   team  category
   1  James Harden           HOU   SG
   2  Stephen Curry          GS    PG
   3  Bradley Beal           WSH   SG
   4  Trae Young             ATL   PG
   5  Kevin Durant           BKN   SF
   6  CJ McCollum            POR   SG
   7  Kyrie Irving           BKN   PG
   8  Jaylen Brown           BOS   SG
   9  Giannis Antetokounmpo  MIL   PF
  10  Jayson Tatum           BOS   PF
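The query string already carries page and limit parameters, so if you need more than the first 50 players you can loop over several pages. This is only a rough sketch, assuming later pages return the same JSON structure as page 1 (I have not checked every page/season combination):

import requests
import pandas as pd

base_url = "https://site.web.api.espn.com/apis/common/v3/sports/basketball/nba/statistics/byathlete"
headers = {"user-agent": "Mozilla/5.0"}

ranking = []
for page in range(1, 3):                       # first two pages, just as an example
    params = {
        "region": "us", "lang": "en", "contentorigin": "espn",
        "isqualified": "true", "page": page, "limit": 50,
        "sort": "offensive.avgPoints:desc", "season": 2021, "seasontype": 2,
    }
    response = requests.get(base_url, params=params, headers=headers)
    response.raise_for_status()
    # same fields as above, the rank just keeps counting across pages
    for i, player in enumerate(response.json()['athletes'], start=1 + (page - 1) * 50):
        ranking.append({
            'rank': i,
            'name': player['athlete']['displayName'],
            'team': player['athlete']['teamShortName'],
            'category': player['athlete']['position']['abbreviation'],
        })

df = pd.DataFrame(ranking)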
But to answer your question: you can also do this with BeautifulSoup, although in my opinion it is more error-prone:
from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []
url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'
source = requests.get(url)
soup = BeautifulSoup(source.text, 'lxml')

for i in soup.select('tr')[1:]:               # skip the header row
    if i.select_one('td'):
        rank = i.select_one('td').get_text()
    if i.select_one('div > a'):
        player = i.select_one('div > a').get_text()
    if i.select_one('div > span'):
        team = i.select_one('div > span').get_text()

    data.append({'rank': rank, 'player': player, 'team': team})

pd.DataFrame(data)
If you do not want to use CSS selectors, you can also do it like this:
for i in soup.find_all('tr')[1:]:             # same idea, but with find()/find_all()
    if i.find('td'):
        rank = i.find('td').get_text()
    if i.find('a'):
        player = i.find('a').get_text()
    if i.find('span'):
        team = i.find('span').get_text()

    data.append({'rank': rank, 'player': player, 'team': team})
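Either way you end up with the same list of dicts. If you want to clean it up and keep the result, a minimal sketch (the filename nba_ranking.csv is just an example):

import pandas as pd

df = pd.DataFrame(data)
df['rank'] = pd.to_numeric(df['rank'], errors='coerce')   # get_text() returns strings
df.to_csv('nba_ranking.csv', index=False)                  # example output filename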