
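I am using Beautiful Soup and requests to scrape the list of US presidents. I also want to scrape the dates, that is, the start and end of each presidential term, but for some reason I get a "list index out of range" error. Here is the link so you can see what I mean: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States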

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html , 'html.parser' )
containers = page_soup.find_all('table' , class_ = 'wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))
container = containers[0]
date =container.find_all('span' , attrs = {'class': 'date'})
#print(len(date))
#print(date[0].text)

for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)

3 Answers


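find_all can return an empty list, and indexing into an empty list with [0] is probably what raises your error.

You can sidestep that by collecting every date each table returns instead of indexing into the result: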

all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
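
If you still want only the first date of each table, as in the question's loop, the check can be made explicit. The following is a minimal sketch, not tested against the live page: SAMPLE_HTML and first_dates are hypothetical names, and the inline HTML is a made-up stand-in for the Wikipedia markup, which may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: two "wikitable" tables,
# only the first of which contains <span class="date"> elements.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><td><span class="date">April 30, 1789</span></td>
      <td><span class="date">March 4, 1797</span></td></tr>
</table>
<table class="wikitable">
  <tr><td>No date spans in this table</td></tr>
</table>
"""

def first_dates(html):
    """Return the first date of each table, or None where a table has none."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for table in soup.find_all("table", class_="wikitable"):
        spans = table.find_all("span", attrs={"class": "date"})
        # find_all returns an empty list when nothing matches, so check
        # before indexing instead of calling spans[0] unconditionally.
        results.append(spans[0].text if spans else None)
    return results

print(first_dates(SAMPLE_HTML))
```

The second table has no date spans, so it yields None where the original loop would have raised IndexError.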
Answered 2019-12-24T11:25:55.817

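Since the data sits in <table> tags, have you considered using pandas' .read_html()? It can use BeautifulSoup under the hood (by default it tries lxml first and falls back to BeautifulSoup). It does most of the work and puts the tables straight into dataframes, so the only work left is any manipulation or cleaning/filtering: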

import pandas as pd
import re

my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'

# Returns a list of dataframes
dfs = pd.read_html(my_url)

# Get the specific dataframe with the desired columns
# (.copy() avoids a SettingWithCopyWarning when new columns are assigned below)
df = dfs[1].iloc[:,[1,3]].copy()

# Rename the columns
df.columns = ['Date','Name']

# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)

# Clean up the name column using regex to pull out the name
df['Name'] =  [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]

# Drop duplicate rows
df.drop_duplicates(inplace = True) 


print (df)

Output:

print (df.to_string())
                      Name                  Start                               End
0        George Washington      April 30, 1789[d]                     March 4, 1797
1               John Adams          March 4, 1797                     March 4, 1801
2         Thomas Jefferson          March 4, 1801                     March 4, 1809
3            James Madison          March 4, 1809                     March 4, 1817
4             James Monroe          March 4, 1817                     March 4, 1825
5        John Quincy Adams          March 4, 1825                     March 4, 1829
6           Andrew Jackson          March 4, 1829                     March 4, 1837
7         Martin Van Buren          March 4, 1837                     March 4, 1841
8   William Henry Harrison          March 4, 1841     April 4, 1841(Died in office)
9               John Tyler       April 4, 1841[i]                     March 4, 1845
10           James K. Polk          March 4, 1845                     March 4, 1849
11          Zachary Taylor          March 4, 1849      July 9, 1850(Died in office)
12        Millard Fillmore        July 9, 1850[k]                     March 4, 1853
13         Franklin Pierce          March 4, 1853                     March 4, 1857
14          James Buchanan          March 4, 1857                     March 4, 1861
15         Abraham Lincoln          March 4, 1861      April 15, 1865(Assassinated)
16          Andrew Johnson         April 15, 1865                     March 4, 1869
17        Ulysses S. Grant          March 4, 1869                     March 4, 1877
18     Rutherford B. Hayes          March 4, 1877                     March 4, 1881
19       James A. Garfield          March 4, 1881  September 19, 1881(Assassinated)
20       Chester A. Arthur  September 19, 1881[n]                     March 4, 1885
21        Grover Cleveland          March 4, 1885                     March 4, 1889
22       Benjamin Harrison          March 4, 1889                     March 4, 1893
23        Grover Cleveland          March 4, 1893                     March 4, 1897
24        William McKinley          March 4, 1897  September 14, 1901(Assassinated)
25      Theodore Roosevelt     September 14, 1901                     March 4, 1909
26     William Howard Taft          March 4, 1909                     March 4, 1913
27          Woodrow Wilson          March 4, 1913                     March 4, 1921
28       Warren G. Harding          March 4, 1921    August 2, 1923(Died in office)
29         Calvin Coolidge      August 2, 1923[o]                     March 4, 1929
30          Herbert Hoover          March 4, 1929                     March 4, 1933
31   Franklin D. Roosevelt          March 4, 1933    April 12, 1945(Died in office)
32         Harry S. Truman         April 12, 1945                  January 20, 1953
33    Dwight D. Eisenhower       January 20, 1953                  January 20, 1961
34         John F. Kennedy       January 20, 1961   November 22, 1963(Assassinated)
35       Lyndon B. Johnson      November 22, 1963                  January 20, 1969
36           Richard Nixon       January 20, 1969          August 9, 1974(Resigned)
37             Gerald Ford         August 9, 1974                  January 20, 1977
38            Jimmy Carter       January 20, 1977                  January 20, 1981
39           Ronald Reagan       January 20, 1981                  January 20, 1989
40       George H. W. Bush       January 20, 1989                  January 20, 1993
41            Bill Clinton       January 20, 1993                  January 20, 2001
42          George W. Bush       January 20, 2001                  January 20, 2009
43            Barack Obama       January 20, 2009                  January 20, 2017
44            Donald Trump       January 20, 2017                         Incumbent
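
The Start and End values in the output above still carry footnote markers such as [d] and notes such as (Died in office). A minimal sketch of stripping them and parsing the rest with the standard library could look like this. clean_date is a hypothetical helper, not part of pandas, and the sketch assumes English month names in the default C locale.

```python
import re
from datetime import date, datetime

def clean_date(text):
    """Parse 'April 4, 1841(Died in office)' or 'April 30, 1789[d]' into a date.

    Returns None when no 'Month day, year' pattern is found (e.g. 'Incumbent').
    """
    # Keep only the leading "Month day, year" part and ignore any suffix.
    match = re.search(r"[A-Z][a-z]+ \d{1,2}, \d{4}", text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%B %d, %Y").date()

samples = ["April 30, 1789[d]", "April 4, 1841(Died in office)", "Incumbent"]
for raw in samples:
    print(raw, "->", clean_date(raw))
```

Applied to the dataframe, this would presumably be something like df['Start'].map(clean_date), which leaves None for rows such as the incumbent's end date.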
Answered 2019-12-24T12:14:00.057

Your code already selects the first "wikitable" table as container, so you can collect all of its date spans with a list comprehension:

date = [x.text for x in container.find_all('span' , attrs = {'class': 'date'})]
print(date)

This prints:

['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
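
Because the question asks for term start and end dates, the flat list can then be grouped into pairs. This is only a sketch: pair_terms is a hypothetical helper, and it assumes the spans strictly alternate start, end, start, end, which the printed list suggests but which has not been verified for every row of the page. The sample input is the first five values shown above.

```python
def pair_terms(dates):
    """Group a flat [start, end, start, end, ...] list into (start, end) tuples."""
    # Assumption: the spans strictly alternate start/end. A trailing unpaired
    # start (e.g. an incumbent with no end date) is paired with None.
    starts = dates[0::2]
    ends = dates[1::2]
    ends += [None] * (len(starts) - len(ends))
    return list(zip(starts, ends))

dates = ['April 30, 1789', 'March 4, 1797', 'March 4, 1797',
         'March 4, 1801', 'March 4, 1801']
for start, end in pair_terms(dates):
    print(start, "to", end)
```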
Answered 2019-12-24T11:31:31.890