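I am trying to capture the 3 list (li) items in a specific unordered list. With the findAll function I can get what I want. However, although the returned list contains the 3 li items, everything in the list returned by findAll is treated as 1 element.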

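I tried the findChild function, and it sees 7 elements. What I really want is to retrieve the links, so that I can fetch their content, together with the text contained in the list items, using findAll, findChild, or anything else.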

This is what I did initially:

focus = soup.findAll('ul', {'class': 'sub-menu'})
#output

#[<ul class="sub-menu">
#<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-20588" id="menu-item-20588"><a href="http://www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
#<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-22412" id="menu-item-22412"><a href="http://www.air-shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
#<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18245" id="menu-item-18245"><a href="http://www.air-shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li>
#</ul>]

The length of the list is 1. However, with findChild I get the following:

# findChild returns the first matching <ul>; iterating over it yields its children
for i in soup.findChild('ul', {'class': 'sub-menu'}):
    print(i)
    print('===' * 10)

#output

==============================
#<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-20588" id="menu-item-20588"><a href="http://www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
==============================

==============================
#<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-22412" id="menu-item-22412"><a href="http://www.air-shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
==============================

==============================
#<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18245" id="menu-item-18245"><a href="http://www.air-shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li>
==============================

All I want is to get the URL in the href and the text of these 3 list items.

I am looking for something like this:

www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019
UK Airshow Calendar 2019

www.air-shows.org.uk/2018/07/european-airshow-calendar-2019
European Airshow Calendar 2019

2 Answers

Here you go.

from bs4 import BeautifulSoup
html='''
<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-20588" id="menu-item-20588"><a href="http://www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-22412" id="menu-item-22412"><a href="http://www.air-shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18245" id="menu-item-18245"><a href="http://www.air-shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li>'''

soup=BeautifulSoup(html,"html.parser")
for item in soup.find_all('a',href=True):
    print("link : " + item['href'])
    print("text : " + item.text)
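
As a side note on why the two attempts in the question behaved differently, here is a minimal sketch, assuming the real page's markup matches the snippet posted in the question (the trimmed `html` string and the names `menus`, `children`, `items`, and `pairs` are only illustrative). It suggests that `find_all('ul', ...)` returns a list holding the single `<ul>` tag, that iterating over that tag also walks the newline text nodes between the `<li>` tags, and that searching inside the `<ul>` keeps the result scoped to that menu.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the menu from the question (an assumed sample, not the live page).
html = """
<ul class="sub-menu">
<li id="menu-item-20588"><a href="http://www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
<li id="menu-item-22412"><a href="http://www.air-shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
<li id="menu-item-18245"><a href="http://www.air-shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('ul', ...) returns a list of matching <ul> tags, so its length is 1.
menus = soup.find_all("ul", {"class": "sub-menu"})

# Iterating over a tag walks all of its direct children, including the "\n"
# text nodes around the <li> tags: 4 whitespace strings + 3 <li> tags.
children = list(menus[0])

# Searching inside the <ul> for <li> tags gives just the three items.
items = menus[0].find_all("li")

# Pair each link's href with its visible text.
pairs = [(li.a["href"], li.a.get_text(strip=True)) for li in items]
for href, text in pairs:
    print(href)
    print(text)
    print()
```

Scoping the search to the `<ul>` rather than the whole document should avoid picking up unrelated links elsewhere on the page, although that depends on how many `sub-menu` lists the real page contains.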
answered 2019-04-05T12:25:35.123

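You can also use the following (I assume that in the actual page you don't have \n in the text or href). This also assumes that the selector .sub-menu li,.sub-menu a yields equally many li and a elements, alternating in document order, so the even-indexed matches are the li elements and the odd-indexed matches are their links: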

from bs4 import BeautifulSoup as bs

html = '''
<html>
 <head></head>
 <body>
  <ul class="sub-menu"> 
   <li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
 item-20588" id="menu-item-20588"><a href="http://www.air- 
 shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li> 
   <li class="menu-item menu-item-type-post_type menu-item-object-post menu- 
 item-22412" id="menu-item-22412"><a href="http://www.air- 
 shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li> 
   <li class="menu-item menu-item-type-taxonomy menu-item-object-category 
 menu-item-18245" id="menu-item-18245"><a href="http://www.air- 
 shows.org.uk/category/display-team-schedule/">Latest Display Team Dates</a></li> 
  </ul>
 </body>
</html>
 '''

soup = bs(html, 'lxml')

all_items = soup.select('.sub-menu li,.sub-menu a')
events = [item.text for item in all_items[0::2]]
links = [item['href'] for item in all_items[1::2]]
print(events, links)
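
If the equal-length assumption is a concern, a possible variation is to select only the anchors and read both the text and the href from each one, so the two values cannot drift out of step. This is a rough sketch rather than a tested solution for the live site: the shortened `html` sample, the use of `html.parser` instead of `lxml`, and the helper `display_url` (a hypothetical function that drops the scheme and trailing slash to match the format asked for in the question) are all assumptions.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Shortened sample based on the question's markup (assumed, not the live page).
html = """<ul class="sub-menu">
<li><a href="http://www.air-shows.org.uk/2018/06/uk-airshow-calendar-2019/">UK Airshow Calendar 2019</a></li>
<li><a href="http://www.air-shows.org.uk/2018/07/european-airshow-calendar-2019/">European Airshow Calendar 2019</a></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")


def display_url(href):
    """Hypothetical helper: drop the scheme and any trailing slash."""
    parts = urlparse(href)
    return (parts.netloc + parts.path).rstrip("/")


# Each selected <a> supplies both values, so no index slicing is needed.
results = [
    (display_url(a["href"]), a.get_text(strip=True))
    for a in soup.select(".sub-menu li > a[href]")
]

for url, text in results:
    print(url)
    print(text)
    print()
```

The `a[href]` part of the selector skips anchors without an href attribute, which would otherwise raise a `KeyError` on `a["href"]`.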
answered 2019-04-05T13:49:45.973