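I'm a beginner. I'm learning web scraping, so I decided to scrape some coronavirus data. I want to get each country's name and its reported cases, which are at index 0 and 1 of each list. How can I loop over the lists to get them? I've also read that I could use Selenium to automate collecting the data, so I'd appreciate some help with that too. Thanks.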

import requests
import bs4 as BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/'
page = requests.get(url)
page.raise_for_status()
soup = BeautifulSoup.BeautifulSoup(page.text,'html.parser')
table = soup.find('div', class_='main_table_countries_div')
data = table.find_all('tr')
row_list = list()
for tr in data:
   td = tr.find_all('td')
   row = [i.text for i in td]
   row_list.append(row)

for a in row_list:
   country_data = a
   print(country_data)

2 Answers

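You're very close. Beyond what you already have, the only thing left to do is extract the country name and the reported count from each row.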

row_list is a list with one entry per table row, so you can do this:

country = []
reported = []
for a in row_list:
    if len(a) > 1:
        country.append(a[0])
        reported.append(a[1])

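I added the len(a) > 1 check because I think the first row in row_list is empty. After the loop, country is the list of countries and reported holds the reported count for each country, in the same order.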

for c, r in zip(country ,reported):
    print("{}: {}".format(c, r))


USA: 159,689
Italy: 101,739
Spain: 85,195
Germany: 66,125
France: 44,550
Iran: 41,495
UK: 22,141
Switzerland: 15,760
Belgium: 11,899
Netherlands: 11,750
Turkey: 10,827
S. Korea: 9,661
Austria: 9,597
Canada: 7,297
Portugal: 6,408
...
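
If you need the counts as numbers rather than strings, one possible follow-up is to strip the thousands separators and build a dictionary. This is only a minimal sketch: sample_rows is a hypothetical stand-in for row_list (reusing a few values from the output above), and parse_count is an illustrative helper of my own, not part of requests or BeautifulSoup.

```python
def parse_count(text):
    """Convert a string such as '159,689' to an int; return None if it is not numeric."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned.isdecimal() else None

# Hypothetical stand-in for row_list: the first row is empty, like the scraped table.
sample_rows = [
    [],
    ["USA", "159,689"],
    ["Italy", "101,739"],
    ["Spain", "85,195"],
]

cases_by_country = {}
for row in sample_rows:
    if len(row) > 1:  # skip empty or too-short rows
        count = parse_count(row[1])
        if count is not None:
            cases_by_country[row[0].strip()] = count

print(cases_by_country)
```

With the real row_list in place of sample_rows, cells that are blank or non-numeric are simply skipped instead of raising a ValueError.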
answered 2020-03-30T20:41:35.647

One of the lists is empty, which causes an error when you try to index into it:

import requests
import bs4 as BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/'
page = requests.get(url)
page.raise_for_status()
soup = BeautifulSoup.BeautifulSoup(page.text,'html.parser')
table = soup.find('div', class_='main_table_countries_div')
data = table.find_all('tr')
row_list = list()
for tr in data:
   td = tr.find_all('td')
   row = [i.text for i in td]
   row_list.append(row)

# the first list is empty, so indexing into it (e.g. row_list[0][0]) raises an IndexError
print(row_list[0])
for a in row_list[1:]:
   country_data = a
   # then you can access them by index
   print(country_data[0])
   print(country_data[1])

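It's worth noting that you're reinventing the wheel. If you're doing this to learn, cheers. If not, take a look at the pandas library, which manages this kind of data in DataFrames.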
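
As a rough sketch of what that could look like, the rows can be loaded into a DataFrame once they have been scraped. Here sample_rows is a hypothetical stand-in for row_list, and the column names country and reported are my own choice, not something the site provides.

```python
import pandas as pd

# Hypothetical stand-in for row_list: the first row is empty, like the scraped table.
sample_rows = [
    [],
    ["USA", "159,689"],
    ["Italy", "101,739"],
    ["Spain", "85,195"],
]

# Keep only the first two cells of each non-empty row.
rows = [row[:2] for row in sample_rows if len(row) > 1]
df = pd.DataFrame(rows, columns=["country", "reported"])

# Remove thousands separators so the counts become integers.
df["reported"] = df["reported"].str.replace(",", "", regex=False).astype(int)

print(df.sort_values("reported", ascending=False))
```

pandas also has a read_html function that can parse HTML tables directly, although it needs an HTML parser such as lxml installed, and I haven't checked how well it copes with this particular page.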

answered 2020-03-30T20:46:57.700