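I am trying to remove the content of the <nav> tag from scraped data. I tried several approaches, and the extraction itself succeeds. But when I clean up the rest of the data, the text from the <nav> tag still appears. I tried both extract() and decompose(), and they give the same result.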

Code

from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.parse
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

service = Service("/home/ubuntu/selenium_drivers/chromedriver")

options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3")
options.add_argument("--headless")
options.add_argument('--ignore-certificate-errors')
options.add_argument("--enable-javascript")
options.add_argument('--incognito')

URL = "https://michiganopera.org/season-schedule/frida/"

try:
    driver = webdriver.Chrome(service = service, options = options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
    driver.quit()
except WebDriverException:
    driver.quit()

soup = BeautifulSoup(html_content, 'html.parser')
z = soup.find("nav",{"class":"nav-main"})
z.extract()
for h in soup.find_all('header'):
    try:
        h.extract()
    except:
        pass
for f in soup.find_all('footer'):
    try:
        f.extract()
    except:
        pass
try:
    cols = soup.find("div",{"class":"modal fade"})
    cols.extract()
except:
    pass
text = soup.getText(separator=u' ')
print(text)

Running this code produces the cleaned data, but at the end of it there is a section like the one below that must be removed.

Part to remove

 Sponsors 
 
 
 
 
 Email Sign Up View Calendar 
 
 
       Season & Tickets + Season at a Glance MOT at Home Upcoming + Dance Theatre of Harlem Calendar Ways to save + Subscriptions Groups Gift Certificates Box Office + How to Avoid Scalper Tickets Plan Your Visit + Parking & Directions + Sunday Shuttles Dining + Cadillac Café Hotels Opera & Dance Talks FAQ Online Boutique PLAN YOUR EVENT + Catering & Events Weddings Corporate & Social Event Sky Deck COVID-19 Safety Plan Get Involved + Community Events Young Patrons Circle Opera Teens Opera Clubs Ambassadors Volunteers Dance Film Series Learn + Summer Programs + Operetta Remix Dance Classes Children’s Choruses For Schools + Field Trips In-School Performances Classroom Guides Tours Allesee Resource Library Dance Dialogues MOT Learns at Home Support + Annual Fund & DiChiera Society Other Ways to Give Planned Giving David DiChiera Artistic Fund Sponsorship Opportunities Why I give to MOT About Us + Our History + MOT History DOH History Past Seasons David DiChiera Leadership + Board of Directors Wayne S. Brown Yuval Sharon Christine Goerke Admin & Staff + Our mission Antiracism Statement of Commitment Opera America Member Musicians + Orchestra Roster Chorus Roster Children’s Choruses Non-Profit Status Press 

I am facing the same problem on several websites. I think I am missing something here.

1 Answer

from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.parse
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

service = Service("/home/ubuntu/selenium_drivers/chromedriver")

options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3")
options.add_argument("--headless")
options.add_argument('--ignore-certificate-errors')
options.add_argument("--enable-javascript")
options.add_argument('--incognito')

URL = "https://michiganopera.org/season-schedule/frida/"

driver = webdriver.Chrome(service = service, options = options)
try:
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
finally:
    # Always close the browser, even if loading the page fails
    driver.quit()

soup = BeautifulSoup(html_content, 'html.parser')
z = soup.find("nav",{"class":"nav-main"})
z.extract()
for h in soup.find_all('header'):
    try:
        h.extract()
    except:
        pass
for f in soup.find_all('footer'):
    try:
        f.extract()
    except:
        pass
try:
    cols = soup.find("div",{"class":"modal fade"})
    cols.extract()
except:
    pass
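text = soup.getText(separator=u' ')
# Keep only the text before the first occurrence of 'Sponsors',
# which drops the trailing menu text shown in the question
sep = 'Sponsors'
stripped = text.split(sep, 1)[0]
print(stripped)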
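
The last lines above cut the text at the first occurrence of 'Sponsors'. As a minimal sketch, that step can be isolated into a small helper. The name truncate_at_marker, the from_last option, and the sample string are hypothetical and only for illustration. Note that this approach is brittle: if the marker word also appears in the real content, cutting at the first occurrence removes too much, which is why the sketch allows cutting at the last occurrence instead.

```python
def truncate_at_marker(text, marker, from_last=False):
    """Return the part of text that precedes marker.

    If marker does not occur, the text is returned unchanged.
    With from_last=True the cut is made at the last occurrence.
    """
    if marker not in text:
        return text
    if from_last:
        head, _, _ = text.rpartition(marker)
    else:
        head, _, _ = text.partition(marker)
    return head.rstrip()


# Hypothetical sample standing in for soup.getText(separator=' ')
sample = "Frida Opera in two acts. Sponsors Email Sign Up View Calendar Season & Tickets"
print(truncate_at_marker(sample, "Sponsors"))
```

For the default case this behaves like text.split(sep, 1)[0] in the code above, apart from stripping trailing whitespace.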
answered 2021-10-30T21:18:21.797
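
A possible alternative to cutting the text at a marker is to remove the unwanted elements structurally. One plausible reason the menu text survives (an assumption, not verified against the live page) is that it appears in more than one element, for example a second <nav> without the nav-main class, while soup.find removes only the first match. The sketch below removes every match of a few CSS selectors. SAMPLE_HTML, its class names, and the function name strip_page_chrome are hypothetical; the real page's markup may differ, so the selectors would need to be adapted after inspecting it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <header><nav class="nav-main"><a>Season &amp; Tickets</a></nav></header>
  <main><h1>Frida</h1><p>Opera in two acts.</p></main>
  <div class="sponsors">Sponsors</div>
  <nav class="nav-mobile"><a>Season &amp; Tickets</a><a>Plan Your Visit</a></nav>
  <footer>Email Sign Up</footer>
</body></html>
"""


def strip_page_chrome(html, selectors=("nav", "header", "footer", "div.modal.fade")):
    """Remove every element matching the given CSS selectors and return the text."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        # select() returns all matches, unlike find(), which returns only the first
        for tag in soup.select(selector):
            tag.decompose()
    # Collapse the runs of whitespace left behind by the removed elements
    return " ".join(soup.get_text(separator=" ").split())


cleaned = strip_page_chrome(SAMPLE_HTML)
print(cleaned)
```

With the real page, html_content from the Selenium code above would be passed in place of SAMPLE_HTML.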