
I am learning how to scrape websites with the Beautiful Soup 4 module. I am trying to scrape a cricket league table, and so far I have the following code.

# We want to scrape the cricinfo website for the league table
import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.espncricinfo.com/table/series/8048/season/2020/indian-premier-league")
# Name the parser explicitly so Beautiful Soup does not have to guess one
soup = bs(r.content, "html.parser")
# Collect every <h5> element on the page
headers = soup.find_all('h5')
print(headers)

This is the output I get when I run the code:

[<h5 class="header-title label ">Indian Premier League 2020</h5>,
 <h5 class="header-title label ">Mumbai Indians</h5>,
 <h5 class="header-title label ">Royal Challengers Bangalore</h5>,
 <h5 class="header-title label ">Delhi Capitals</h5>,
 <h5 class="header-title label ">Sunrisers Hyderabad</h5>,
 <h5 class="header-title label ">Kings XI Punjab</h5>,
 <h5 class="header-title label ">Rajasthan Royals</h5>,
 <h5 class="header-title label ">Kolkata Knight Riders</h5>,
 <h5 class="header-title label ">Chennai Super Kings</h5>,
 <h5 class="gray600">Standings are updated with the completion of each game</h5>]

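What I want to do now is take this a step further: get a list containing only the team names, without the first and last lines.

For example, I would like the final list to look like this:

teams = ['Mumbai Indians', 'Royal Challengers Bangalore', 'Delhi Capitals', 'Sunrisers Hyderabad', 'Kings XI Punjab', 'Rajasthan Royals', 'Kolkata Knight Riders', 'Chennai Super Kings']

Any help would be greatly appreciated. Thanks.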


1 Answer


You can use .string to get the text content of an HTML element. Try this:

teams = [header.string for header in headers]
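Note that the comprehension above keeps every h5 element, including the series title and the footer note. A minimal sketch of dropping those two entries follows. It assumes the page structure matches the output shown in the question and uses a shortened static copy of that HTML (the `html` variable is illustrative), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Shortened static sample based on the output in the question (assumption:
# the live page still has this structure).
html = """
<h5 class="header-title label ">Indian Premier League 2020</h5>
<h5 class="header-title label ">Mumbai Indians</h5>
<h5 class="header-title label ">Royal Challengers Bangalore</h5>
<h5 class="header-title label ">Delhi Capitals</h5>
<h5 class="gray600">Standings are updated with the completion of each game</h5>
"""

soup = BeautifulSoup(html, "html.parser")
headers = soup.find_all("h5")

# Option 1: take the text of every <h5>, then slice off the first (series
# title) and last (footer note) entries.
teams = [header.get_text(strip=True) for header in headers][1:-1]

# Option 2: match only the "header-title" class, which excludes the footer,
# then skip the series title at index 0.
titled = soup.find_all("h5", class_="header-title")
teams_by_class = [header.get_text(strip=True) for header in titled[1:]]

print(teams)
print(teams_by_class)
```

Slicing by position is the shortest route but breaks if the page adds another heading. Filtering by class is a little more robust, though it still relies on the series title coming first.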
Answered 2020-10-31T22:00:17.957