I am trying to scrape a website. This is the structure of the page:

<h2>AFRICA (54)</h2>
<ul>
    <li> <a href="https://www.worldatlas.com/webimage/countrys/africa/dz.htm">Algeria</a> *54
</ul>

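This structure repeats 6 times, once for each of the six continents. My problem is that I get all the <a> tags, but I only want the text of the <a> tags that sit under the <h2> tags.

This is my code: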

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.worldatlas.com/cntycont.htm')
html_text = url.text
soup = BeautifulSoup(html_text,'lxml')

continent_name_resultset = soup.findAll('h2',limit=6)
country_name_resultset = soup.findAll('big',limit=1)


for i in continent_name_resultset:
    print((i.find(text=True).strip())[:-5])
    
# This matches every <a> on the page, not only the ones under an <h2>
links = soup.find_all('a')
for i in links:
    print(i.find(text=True))

My goal is to get this format:

Continent  |  Country
Africa        Algeria
Africa        Angola
          ...
          ...

3 Answers

1

Try this to get the desired output (countries in Africa only):

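import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.worldatlas.com/cntycont.htm')
soup = BeautifulSoup(response.text, 'lxml')
# limit=1 keeps only the first <h2>, i.e. AFRICA
for heading in soup.find_all("h2", limit=1):
    # Drop the trailing count: "AFRICA (54)" -> "AFRICA"
    continent = heading.get_text(strip=True).split(" (")[0]
    # The element right after the heading is the <ul> of countries
    for item in heading.find_next_sibling().find_all("li"):
        name = item.find("a").get_text(strip=True)
        print(f'{continent} {name}')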

The output starts like this:

AFRICA Algeria
AFRICA Angola
AFRICA Benin
AFRICA Botswana
AFRICA Burkina
AFRICA Burundi
AFRICA Cameroon
AFRICA Cape Verde

However, if you want all of the continents, try the following:

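import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.worldatlas.com/cntycont.htm')
soup = BeautifulSoup(response.text, 'lxml')
# limit=6 covers the six continent headings
for heading in soup.find_all("h2", limit=6):
    # Drop the trailing count: "AFRICA (54)" -> "AFRICA"
    continent = heading.get_text(strip=True).split(" (")[0]
    for item in heading.find_next_sibling().find_all("li"):
        name = item.find("a").get_text(strip=True)
        print(f'{continent} {name}')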
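
For anyone who wants to try the heading-then-sibling idea without hitting the site, here is a minimal offline sketch. The HTML string is a hand-written stand-in modeled on the snippet in the question, not the real page, and the function name continent_country_pairs is made up for illustration. It uses the built-in "html.parser" so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, modeled on the question's snippet.
SAMPLE_HTML = """
<h2>AFRICA (54)</h2>
<ul>
    <li><a href="#">Algeria</a></li>
    <li><a href="#">Angola</a></li>
</ul>
<h2>ASIA (44)</h2>
<ul>
    <li><a href="#">Afghanistan</a></li>
</ul>
<a href="#">Unrelated footer link</a>
"""


def continent_country_pairs(html):
    """Return (continent, country) tuples for links listed under each <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all("h2"):
        # Drop the trailing count: "AFRICA (54)" -> "AFRICA"
        continent = heading.get_text(strip=True).split(" (")[0]
        # Next sibling tag named <ul>; whitespace text nodes are skipped
        country_list = heading.find_next_sibling("ul")
        if country_list is None:
            continue
        for link in country_list.find_all("a"):
            pairs.append((continent, link.get_text(strip=True)))
    return pairs


pairs = continent_country_pairs(SAMPLE_HTML)
for continent, country in pairs:
    print(f"{continent} {country}")
```

Because the links are looked up only inside the list that follows each heading, the stray link at the bottom of the sample is left out, which is the behavior the question asks for.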
answered 2018-06-25T21:01:59.777
0

This gives a dictionary of the continents and their countries:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.worldatlas.com/cntycont.htm')
html_text = url.text
soup = BeautifulSoup(html_text,'lxml')

# This answer looks for the headings and lists inside <div class="miscTxt">
mydivs = soup.find_all("div", {"class": "miscTxt"})

for tag in mydivs:
    h2Tags = tag.find_all("h2", limit=6)
    ulTags = tag.find_all("ul", limit=6)
    continents = []
    countries = []
    for cont in h2Tags:
        # Drop the trailing count: "AFRICA (54)" -> "AFRICA"
        continents.append(cont.text.split('(')[0].strip())

    for countrygroup in ulTags:
        temp = []
        # Visit only the <li> tags, skipping the text nodes between them
        for country in countrygroup.find_all('li'):
            link = country.find('a')
            if link is not None:
                temp.append(link.text)
        countries.append(temp)

    # Pair each continent with its list of countries
    final_dict = dict(zip(continents, countries))
    print(final_dict)

The output (shown here as printed under Python 2, hence the u'' prefixes) is:

{u'AFRICA': [u'Algeria',
             u'Angola',
             u'Benin',
             u'Botswana',
             u'Burkina',
             u'Burundi',
             u'Cameroon',
             u'Cape Verde',
             u'Central African Republic',
             u'Chad',
             u'Comoros',
             u'Congo',
             u'Congo, Democratic Republic of',
             u'Djibouti',
             u'Egypt',
             u'Equatorial Guinea',
             u'Eritrea',
             u'Ethiopia',
             u'Gabon',
             u'Gambia',
             u'Ghana',
             u'Guinea',
             u'Guinea-Bissau',
             u'Ivory Coast',
             u'Kenya',
             u'Lesotho',
             u'Liberia',
             u'Libya',
             u'Madagascar',
             u'Malawi',
             u'Mali',
             u'Mauritania',
             u'Mauritius',
             u'Morocco',
             u'Mozambique',
             u'Namibia',
             u'Niger',
             u'Nigeria',
             u'Rwanda',
             u'Sao Tome and Principe',
             u'Senegal',
             u'Seychelles',
             u'Sierra Leone',
             u'Somalia',
             u'South Africa',
             u'South Sudan',
             u'Sudan',
             u'Swaziland',
             u'Tanzania',
             u'Togo',
             u'Tunisia',
             u'Uganda',
             u'Zambia',
             u'Zimbabwe\n'],
 u'ASIA': [u'Afghanistan',
           u'Bahrain',
           u'Bangladesh',
           u'Bhutan',
           u'Brunei',
           u'Burma (Myanmar)',
           u'Cambodia',
           u'China',
           u'East Timor',
           u'India',
           u'Indonesia',
           u'Iran',
           u'Iraq',
           u'Israel',
           u'Japan',
           u'Jordan',
           u'Kazakhstan',
           u'Korea, North',
           u'Korea, South',
           u'Kuwait',
           u'Kyrgyzstan',
           u'Laos',
           u'Lebanon',
           u'Malaysia',
           u'Maldives',
           u'Mongolia',
           u'Nepal',
           u'Oman',
           u'Pakistan',
           u'Philippines',
           u'Qatar',
           u'Russian Federation',
           u'Saudi Arabia',
           u'Singapore',
           u'Sri Lanka',
           u'Syria',
           u'Tajikistan',
           u'Thailand',
           u'Turkey',
           u'Turkmenistan',
           u'United Arab Emirates',
           u'Uzbekistan',
           u'Vietnam',
           u'Yemen'],
 u'EUROPE': [u'Albania',
             u'Andorra',
             u'Armenia',
             u'Austria',
             u'Azerbaijan',
             u'Belarus',
             u'Belgium',
             u'Bosnia and Herzegovina',
             u'Bulgaria',
             u'Croatia',
             u'Cyprus',
             u'Czech Republic',
             u'Denmark',
             u'Estonia',
             u'Finland',
             u'France',
             u'Georgia',
             u'Germany',
             u'Greece',
             u'Hungary',
             u'Iceland',
             u'Ireland',
             u'Italy',
             u'Latvia',
             u'Liechtenstein',
             u'Lithuania',
             u'Luxembourg',
             u'Macedonia',
             u'Malta',
             u'Moldova',
             u'Monaco',
             u'Montenegro',
             u'Netherlands',
             u'Norway',
             u'Poland',
             u'Portugal',
             u'Romania',
             u'San Marino',
             u'Serbia',
             u'Slovakia',
             u'Slovenia',
             u'Spain',
             u'Sweden',
             u'Switzerland',
             u'Ukraine',
             u'United Kingdom',
             u'Vatican City'],
 u'N. AMERICA': [u'Antigua and Barbuda',
                 u'Bahamas',
                 u'Barbados',
                 u'Belize',
                 u'Canada',
                 u'Costa Rica',
                 u'Cuba',
                 u'Dominica',
                 u'Dominican Republic',
                 u'El Salvador',
                 u'Grenada',
                 u'Guatemala',
                 u'Haiti',
                 u'Honduras',
                 u'Jamaica',
                 u'Mexico',
                 u'Nicaragua',
                 u'Panama',
                 u'Saint Kitts and Nevis',
                 u'Saint Lucia',
                 u'Saint Vincent and the Grenadines',
                 u'Trinidad and Tobago',
                 u'United States'],
 u'OCEANIA': [u'Australia',
              u'Fiji',
              u'Kiribati',
              u'Marshall Islands',
              u'Micronesia',
              u'Nauru',
              u'New Zealand',
              u'Palau',
              u'Papua New Guinea',
              u'Samoa',
              u'Solomon Islands',
              u'Tonga',
              u'Tuvalu',
              u'Vanuatu'],
 u'S. AMERICA': [u'Argentina',
                 u'Bolivia',
                 u'Brazil',
                 u'Chile',
                 u'Colombia',
                 u'Ecuador',
                 u'Guyana',
                 u'Paraguay',
                 u'Peru',
                 u'Suriname',
                 u'Uruguay',
                 u'Venezuela']}
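
Since the question asks for a Continent | Country layout rather than a dictionary, here is a rough sketch of how such a dictionary could be flattened into rows. The small dictionary is a hand-picked excerpt of the output above, and the helper names to_rows and format_table are my own, not part of any library.

```python
# Hypothetical excerpt of the dictionary shown above.
continent_to_countries = {
    "AFRICA": ["Algeria", "Angola"],
    "OCEANIA": ["Australia", "Fiji"],
}


def to_rows(mapping):
    """Flatten {continent: [countries]} into (continent, country) tuples."""
    rows = []
    for continent, countries in mapping.items():
        for country in countries:
            # title() turns "AFRICA" into "Africa"; strip() removes stray
            # whitespace such as the newline after "Zimbabwe" in the output
            rows.append((continent.title(), country.strip()))
    return rows


def format_table(rows):
    """Render the rows as two left-aligned columns with a header line."""
    width = max([len("Continent")] + [len(continent) for continent, _ in rows])
    lines = [f"{'Continent':<{width}}  |  Country"]
    for continent, country in rows:
        lines.append(f"{continent:<{width}}     {country}")
    return "\n".join(lines)


rows = to_rows(continent_to_countries)
print(format_table(rows))
```

The column width is taken from the longest continent name, so longer keys such as "N. AMERICA" should still line up.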
answered 2018-06-25T21:18:16.697
-1

Try this:

import re

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.worldatlas.com/cntycont.htm')
html_text = url.text
soup = BeautifulSoup(html_text,'lxml')

# Select only the links inside a <ul> that directly follows an <h2>
country_links = soup.select(".misc-content h2 + ul > li > a")

for i in country_links:
    country = i.text
    # The nearest preceding <h2> names the continent
    continent = i.find_previous("h2").text
    # Keep only letters, dots and hyphens: "AFRICA (54)" -> "AFRICA"
    continent = re.sub(r"[^a-zA-Z.-]", "", continent)
    print("Country : " + country + " , Continent : " + continent)

Sample output:

Country : Algeria , Continent : AFRICA
Country : Angola , Continent : AFRICA
Country : Benin , Continent : AFRICA
Country : Botswana , Continent : AFRICA
Country : Burkina , Continent : AFRICA
Country : Burundi , Continent : AFRICA
Country : Cameroon , Continent : AFRICA
Country : Cape Verde , Continent : AFRICA
Country : Central African Republic , Continent : AFRICA
Country : Chad , Continent : AFRICA
    .
    .
    .
    .
Country : Colombia , Continent : S.AMERICA
Country : Ecuador , Continent : S.AMERICA
Country : Guyana , Continent : S.AMERICA
Country : Paraguay , Continent : S.AMERICA
Country : Peru , Continent : S.AMERICA
Country : Suriname , Continent : S.AMERICA
Country : Uruguay , Continent : S.AMERICA
Country : Venezuela , Continent : S.AMERICA
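
The "S.AMERICA" in the output comes from the regular expression, which also deletes the space inside the heading. The sketch below compares it with cutting the heading at the first parenthesis instead. The sample heading strings are assumed from the question and the output above, and the two function names are only illustrative.

```python
import re


def squeeze_heading(heading):
    """Same pattern as the answer: keep only letters, '.' and '-'."""
    return re.sub(r"[^a-zA-Z.-]", "", heading)


def trim_count(heading):
    """Alternative: cut at the first '(' and trim surrounding whitespace."""
    return heading.split("(")[0].strip()


# Assumed sample headings in the "NAME (count)" shape from the question.
headings = ["AFRICA (54)", "S. AMERICA (12)"]

for heading in headings:
    print(repr(squeeze_heading(heading)), repr(trim_count(heading)))
```

Both give the same result for single-word names; they differ only for headings with an internal space, where the split keeps it.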
answered 2018-06-25T21:15:27.623