
Thanks for taking the time to help me. I'm trying to scrape the public contact information of every person on each page of a website, so I built three functions: one that builds the URL, one that fetches the page source and parses it with BeautifulSoup, and one that transforms it to extract the name, title, email, personal website, and bio. For some reason I can't figure out, I only get back the first element of each page — the loop does cover the full number of pages, but it only scrapes the first person on each one.

This is how I wrote that part of the code. I would appreciate any hints, or if you can spot the mistake I'm making :)

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def paginas(pages):
  pag = (f'https://www.hsph.harvard.edu/profiles/page/{pages}/')
  return pag

def extract(pag):
  url = requests.get(pag).text
  soup = BeautifulSoup(url, 'lxml')
  return soup

def transform(soup):
  #principal = soup.find('div', class_ = 'hsph-bootstrap')
  items = soup.find_all('div', class_ = 'grid-card grid-card-hover position-relative border rounded px-4 py-5')
  for item in items:
      try:
        #name = item.find('a').text.strip()  # another way of getting the name on this site
        name = item.find('h2', class_ = 'h3 mb-0').text.strip()  # always present
      except: 
        name = 'not given'
      #Contact data. 
      website = item.find('h2', class_ = 'h3 mb-0').a['href']
      main = item.find('div', class_ = 'grid-card-content')
      bio = main.find('div', class_ = 'faculty-bio small').text.strip()
      university = 'Harvard School of Public Health'
      #INSIDE THE LINK
      wd = webdriver.Chrome(options=options)
      wd.get(website)
      insideurl = requests.get(website).text
      insidesoup = BeautifulSoup(insideurl, 'lxml')
      #BIO DATA
      insitem = insidesoup.find('div', class_ ='row rounded bg-white p-5')
      try:
        email = insitem.find('p', class_ = 'faculty-contact mb-2').text.strip()
      except: 
        email = ''
      try:
        ti = insitem.find('div', class_ = 'faculty-bio')
        title = ti.find('p').text
      except: 
        ti = ''
        title = ''
      #EXTRA DATA ON BIO. 
      try:
        bio2 = insidesoup.find('div', class_ = 'faculty-profile-container container mb-5')
        complete = bio2.find('div', class_ = 'faculty-profile-overview-section').text.strip()
      except: 
        bio2 = ''
        complete = ''

      contact = {
          'name' : name,
          'title' : title,
          'university' : university,
          'email' : email,
          'website' : website,
          'bio' : complete,
          'area' : bio,
      }
      leadlist.append(contact)
      return

leadlist = []


for pages in range(1,127,1):
    c = paginas(pages)
    b = extract(c)
    d = transform(b) 

print(len(leadlist))

```
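A likely culprit is the placement of `return` at the end of `transform`: it is indented inside the `for item in items:` loop, so the function exits after processing the first card on every page. Stripping away the scraping details, the control-flow difference can be isolated like this (a minimal sketch with placeholder data, not the site's real markup):

```python
# Three stand-in "profile cards" in place of the soup.find_all(...) results.
cards = ['card 1', 'card 2', 'card 3']

def transform_buggy(items):
    results = []
    for item in items:
        results.append(item)
        return results  # inside the loop: exits on the first iteration

def transform_fixed(items):
    results = []
    for item in items:
        results.append(item)
    return results  # after the loop: every card is processed

print(len(transform_buggy(cards)))  # 1
print(len(transform_fixed(cards)))  # 3
```

Dedenting the `return` (or simply removing it, since `transform` appends to `leadlist` rather than returning anything) should give you every person on each page. As a side note, the `webdriver.Chrome` instance created inside the loop is never used for parsing — the profile page is fetched again with `requests` — so it can probably be dropped; otherwise each iteration leaks an unused browser session.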
