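I am trying to remove the contents of the nav tag from the data I scraped. I tried several approaches and the extraction itself succeeds, but when I clean up the rest of the data, the text from the nav tag still shows up. I tried both extract and decompose, but they give the same result.
Code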
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.parse
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
service = Service("/home/ubuntu/selenium_drivers/chromedriver")
options = webdriver.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3")
options.add_argument("--headless")
options.add_argument('--ignore-certificate-errors')
options.add_argument("--enable-javascript")
options.add_argument('--incognito')
URL = "https://michiganopera.org/season-schedule/frida/"
html_content = ""
driver = None
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException:
    # Leave html_content empty if the browser could not load the page
    pass
finally:
    # Close the browser only if it was actually started
    if driver is not None:
        driver.quit()

soup = BeautifulSoup(html_content, 'html.parser')

# Remove the main navigation, if present
z = soup.find("nav", {"class": "nav-main"})
if z is not None:
    z.extract()

# Remove every header and footer
for h in soup.find_all('header'):
    h.extract()
for f in soup.find_all('footer'):
    f.extract()

# Remove the modal dialog, if present
cols = soup.find("div", {"class": "modal fade"})
if cols is not None:
    cols.extract()

text = soup.get_text(separator=' ')
print(text)
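When we run this code we get the cleaned data, but the end of the output contains a section like the one below, which has to be removed.
Section to remove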
Sponsors
Email Sign Up View Calendar
Season & Tickets + Season at a Glance MOT at Home Upcoming + Dance Theatre of Harlem Calendar Ways to save + Subscriptions Groups Gift Certificates Box Office + How to Avoid Scalper Tickets Plan Your Visit + Parking & Directions + Sunday Shuttles Dining + Cadillac Café Hotels Opera & Dance Talks FAQ Online Boutique PLAN YOUR EVENT + Catering & Events Weddings Corporate & Social Event Sky Deck COVID-19 Safety Plan Get Involved + Community Events Young Patrons Circle Opera Teens Opera Clubs Ambassadors Volunteers Dance Film Series Learn + Summer Programs + Operetta Remix Dance Classes Children’s Choruses For Schools + Field Trips In-School Performances Classroom Guides Tours Allesee Resource Library Dance Dialogues MOT Learns at Home Support + Annual Fund & DiChiera Society Other Ways to Give Planned Giving David DiChiera Artistic Fund Sponsorship Opportunities Why I give to MOT About Us + Our History + MOT History DOH History Past Seasons David DiChiera Leadership + Board of Directors Wayne S. Brown Yuval Sharon Christine Goerke Admin & Staff + Our mission Antiracism Statement of Commitment Opera America Member Musicians + Orchestra Roster Chorus Roster Children’s Choruses Non-Profit Status Press
I am facing the same problem on several websites. I think I am missing something here.
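One thing that might be worth checking is whether the page contains more than one nav element. The sketch below is only an illustration under that assumption, which I have not confirmed for this site: it supposes the leftover menu text comes from a second nav whose class is not nav-main, so soup.find with that class never touches it. SAMPLE_HTML, the nav-mobile class, and the strip_chrome helper are all made up for the example. It removes every nav, header, and footer regardless of class.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: the menu appears twice, once in
# <nav class="nav-main"> and once in a second <nav> with a different class.
SAMPLE_HTML = """
<html><body>
<header><nav class="nav-main"><a href="/a">Season</a></nav></header>
<main><p>Frida opens in February.</p></main>
<nav class="nav-mobile"><a href="/a">Season</a><a href="/b">Support</a></nav>
<footer>Sponsors</footer>
</body></html>
"""


def strip_chrome(html, tags=("nav", "header", "footer")):
    """Return the text of html after removing every element named in tags."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and ignores the class attribute,
    # so every matching element is removed, not just the first one.
    for element in soup.find_all(list(tags)):
        element.extract()
    # strip=True drops whitespace-only strings before joining
    return soup.get_text(separator=" ", strip=True)


cleaned = strip_chrome(SAMPLE_HTML)
print(cleaned)
```

If the leftover text on the real page turns out to live in some other container (a div, for example), printing the parent of one of the unwanted strings in the soup should show which tag to add to the list.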