Thanks for taking the time to help me. I'm trying to scrape the public contact information of everyone listed on each page of a website, so I built three functions: one that builds the URL, one that extracts the page source with BeautifulSoup, and one that transforms it to pull out each person's name, title, email, personal website, and bio. For some reason I can't figure out, I only get back the first element of each page — it does cover the full number of pages, but it only scrapes the first person on each one.
This is how I wrote the code for that part; I'd appreciate any hints, or if you can spot the mistake I'm making :)
```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()

def paginas(pages):
    pag = f'https://www.hsph.harvard.edu/profiles/page/{pages}/'
    return pag

def extract(pag):
    url = requests.get(pag).text
    soup = BeautifulSoup(url, 'lxml')
    return soup

def transform(soup):
    #principal = soup.find('div', class_ = 'hsph-bootstrap')
    items = soup.find_all('div', class_ = 'grid-card grid-card-hover position-relative border rounded px-4 py-5')
    for item in items:
        try:
            #name = item.find('a').text.strip()  # another way of getting it on this website
            name = item.find('h2', class_ = 'h3 mb-0').text.strip()  # will always be present
        except AttributeError:
            name = 'not given'
        # Contact data.
        website = item.find('h2', class_ = 'h3 mb-0').a['href']
        main = item.find('div', class_ = 'grid-card-content')
        bio = main.find('div', class_ = 'faculty-bio small').text.strip()
        university = 'Harvard School of Public Health'
        # INSIDE THE LINK
        wd = webdriver.Chrome(options=options)
        wd.get(website)
        insideurl = requests.get(website).text
        insidesoup = BeautifulSoup(insideurl, 'lxml')
        # BIO DATA
        insitem = insidesoup.find('div', class_ = 'row rounded bg-white p-5')
        try:
            email = insitem.find('p', class_ = 'faculty-contact mb-2').text.strip()
        except AttributeError:
            email = ''
        try:
            ti = insitem.find('div', class_ = 'faculty-bio')
            title = ti.find('p').text
        except AttributeError:
            title = ''
        # EXTRA DATA ON BIO.
        try:
            bio2 = insidesoup.find('div', class_ = 'faculty-profile-container container mb-5')
            complete = bio2.find('div', class_ = 'faculty-profile-overview-section').text.strip()
        except AttributeError:
            complete = ''
        contact = {
            'name': name,
            'title': title,
            'university': university,
            'email': email,
            'website': website,
            'bio': complete,
            'area': bio,
        }
        leadlist.append(contact)
    return

leadlist = []
for pages in range(1, 127, 1):
    c = paginas(pages)
    b = extract(c)
    d = transform(b)

print(len(leadlist))
```
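For context, the collect-one-dict-per-card pattern I'm aiming for in `transform` looks like this (a minimal, self-contained sketch with made-up HTML and class names, using Python's built-in `html.parser` instead of `lxml` so it runs without a network or extra parser):

```python
from bs4 import BeautifulSoup

# Stand-in for one listing page with two person cards (hypothetical markup).
html = """
<div class="card"><h2>Alice</h2><a href="https://example.org/a">site</a></div>
<div class="card"><h2>Bob</h2><a href="https://example.org/b">site</a></div>
"""

def transform(soup):
    people = []
    # Every card matched by find_all should yield one dict;
    # the append must sit inside the loop body.
    for item in soup.find_all('div', class_='card'):
        people.append({
            'name': item.find('h2').text.strip(),
            'website': item.find('a')['href'],
        })
    return people  # returning inside the loop would stop after the first card

soup = BeautifulSoup(html, 'html.parser')
people = transform(soup)
print(len(people))  # expect one entry per card
```

If `transform` returns the list instead of mutating a global, the page loop can simply do `leadlist.extend(transform(extract(paginas(p))))`.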