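I am trying to scrape data from Yellow Pages. I have used this scraper successfully several times, but it recently stopped working. I noticed that the Yellow Pages site changed recently: they added a sponsored-links table containing three results. Since that change, the only thing my scraper finds is the ad below the sponsored-links table. It does not retrieve any of the actual results.

Where am I going wrong?

I have included my code below. As an example, it shows a search for 7-Eleven locations in Wisconsin.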

import requests
from bs4 import BeautifulSoup
import csv

my_url = "https://www.yellowpages.com/search?search_terms=7-eleven&geo_location_terms=WI&page={}"
for link in [my_url.format(page) for page in range(1,20)]:
  res = requests.get(link)
  soup = BeautifulSoup(res.text, "lxml")

placeHolder = []
for item in soup.select(".info"):
  try:
    name = item.select("[itemprop='name']")[0].text
  except Exception:
    name = ""
  try:
    streetAddress = item.select("[itemprop='streetAddress']")[0].text
  except Exception:
    streetAddress = ""
  try:
    addressLocality = item.select("[itemprop='addressLocality']")[0].text
  except Exception:
    addressLocality = ""
  try:
    addressRegion = item.select("[itemprop='addressRegion']")[0].text
  except Exception:
    addressRegion = ""
  try:
    postalCode = item.select("[itemprop='postalCode']")[0].text
  except Exception:
    postalCode = ""
  try:
    phone = item.select("[itemprop='telephone']")[0].text
  except Exception:
    phone = ""

  with open('yp-7-eleven-wi.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, streetAddress, addressLocality, addressRegion, postalCode, phone])

2 Answers

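There are several problems in your existing script. You created a for loop that is supposed to traverse 19 different pages, but the parsing is confined to a single page because it sits outside the loop. The selectors you defined no longer match the elements on the page. In addition, you repeat the try:except block many times, which makes your scraper look very messy. You can define a custom function to get rid of the IndexError or AttributeError problem. Finally, you can make use of csv.DictWriter() to write the scraped items to a csv file.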
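To illustrate the first point, here is a minimal sketch that needs no network access. The `fetch` function and the `fake_pages` data are made up for the illustration and stand in for `requests.get(...)` and the real listings. It shows why parsing after the loop only ever sees the last page, while parsing inside the loop body sees all of them.

```python
# Made-up data standing in for the listings found on each results page.
fake_pages = {1: ["A", "B"], 2: ["C"], 3: ["D", "E"]}


def fetch(page):
    """Hypothetical fetcher: returns the 'listings' on one page."""
    return fake_pages[page]


# Pattern from the question: the parsing step runs after the loop has
# finished, so `listings` only holds the last page that was fetched.
for page in range(1, 4):
    listings = fetch(page)
last_page_only = list(listings)

# Parsing inside the loop body collects the listings from every page.
all_pages = []
for page in range(1, 4):
    listings = fetch(page)
    all_pages.extend(listings)

print(last_page_only)  # ['D', 'E']
print(all_pages)       # ['A', 'B', 'C', 'D', 'E']
```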

Give this a try:

import csv

import requests
from bs4 import BeautifulSoup

urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=WI&page={}".format(page) for page in range(1, 5)]
fieldnames = ['name', 'streetAddress', 'addressLocality', 'addressRegion', 'postalCode', 'phone']


def get_text(item, path):
    # Return the text of the first match for the selector, or "" if nothing matches.
    node = item.select_one(path)
    return node.text if node else ""


placeHolder = []

for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    # Parsing stays inside the page loop so that every page is collected.
    for item in soup.select(".info"):
        d = {}
        d['name'] = get_text(item, "a.business-name span")
        d['streetAddress'] = get_text(item, ".street-address")
        d['addressLocality'] = get_text(item, ".locality")
        d['addressRegion'] = get_text(item, ".locality + span")
        d['postalCode'] = get_text(item, ".locality + span + span")
        d['phone'] = get_text(item, ".phones")
        placeHolder.append(d)

# Write all collected rows once, with a header row.
with open("yellowpageInfo.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames)
    writer.writeheader()
    writer.writerows(placeHolder)
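
As a side note, the csv.DictWriter() step can be tried on its own without touching the network. The following is only a sketch under assumptions: it writes to an in-memory `io.StringIO` buffer instead of a file, uses a shortened field list, and the second row is made up to show how `restval` fills in missing fields.

```python
import csv
import io

fieldnames = ["name", "streetAddress", "phone"]
rows = [
    {"name": "7-Eleven", "streetAddress": "1624 W Wells St", "phone": "(414) 342-9710"},
    {"name": "Example Store"},  # made-up row with missing fields
]

# In-memory buffer standing in for the output file.
buffer = io.StringIO(newline="")
# restval is the value written for any field missing from a row dict.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Because missing keys are filled with an empty string, every row ends up with the same number of columns even when a listing lacks a phone number or address.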
Answered 2018-11-24T20:06:25.053

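Scraping life... the struggle is real!

When a site changes its layout, element and class names often change with it. You will want to check the updated page carefully and fix anything in your scraper that relies on hard-coded values tied to page elements, class names, and so on, since those may have changed.

A quick inspection of the page shows that the information you are scraping now sits in a different structure: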

<div class="v-card">
    <div class="media-thumbnail"><a class="media-thumbnail-wrapper chain-img" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
            data-analytics="{&quot;click_id&quot;:509}" data-impressed="1"><img class="lazy" alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
                data-original="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d" width="40"
                height="40" style="display: block;"><noscript><img alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
                    width="40" height="40"></noscript></a></div>
    <div class="info">
        <h2 class="n">2.&nbsp;<a class="business-name" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
                data-analytics="{&quot;target&quot;:&quot;name&quot;,&quot;feature_click&quot;:&quot;&quot;}" rel=""
                data-impressed="1"><span>7-Eleven</span></a></h2>
        <div class="info-section info-primary">
            <div class="ratings" data-israteable="true"></div>
            <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee,&nbsp;</span><span>WI</span>&nbsp;<span>53233</span></p>
            <div class="phones phone primary">(414) 342-9710</div>
        </div>
        <div class="info-section info-secondary">
            <div class="categories"><a href="/wi/convenience-stores" data-analytics="{&quot;click_id&quot;:1171,&quot;adclick&quot;:false,&quot;listing_features&quot;:&quot;category&quot;,&quot;events&quot;:&quot;&quot;}"
                    data-impressed="1">Convenience Stores</a></div>
            <div class="links"><a class="track-visit-website" href="https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836"
                    rel="nofollow" target="_blank" data-analytics="{&quot;click_id&quot;:6,&quot;act&quot;:2,&quot;dku&quot;:&quot;https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836&quot;,&quot;FL&quot;:&quot;url&quot;,&quot;target&quot;:&quot;website&quot;,&quot;LOC&quot;:&quot;https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836&quot;,&quot;adclick&quot;:true}"
                    data-impressed="1">Website</a></div>
        </div>
        <div class="preferred-listing-features"></div>
        <div class="snippet">
            <p class="body"><span>From Business: At 7-Eleven, our doors are always open, and our friendly store teams
                    are ready to serve you. Our fresh, fast and convenient hot foods appeal to any craving, so yo…</span></p>
        </div>
    </div>
</div>

For example, for the address, instead of itemprop=address you now need .street-address, and so on.
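
One way to check this offline is to run the selectors against a trimmed copy of the markup shown above. This is a minimal sketch, not a definitive implementation: the `html` string is a shortened version of the listing, `get_text` is a small helper introduced here for illustration, and the stdlib `html.parser` backend is assumed so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the listing markup shown above.
html = """
<div class="info">
  <h2 class="n">2.&nbsp;<a class="business-name" href="#"><span>7-Eleven</span></a></h2>
  <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee,&nbsp;</span><span>WI</span>&nbsp;<span>53233</span></p>
  <div class="phones phone primary">(414) 342-9710</div>
</div>
"""


def get_text(item, selector):
    """Illustrative helper: stripped text of the first match, or ""."""
    node = item.select_one(selector)
    return node.get_text(strip=True) if node else ""


soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".info")

# The old itemprop-based selector no longer matches anything.
old_matches = item.select("[itemprop='streetAddress']")

record = {
    "name": get_text(item, "a.business-name span"),
    "streetAddress": get_text(item, ".street-address"),
    # The locality text ends with a comma, which is dropped here.
    "addressLocality": get_text(item, ".locality").rstrip(","),
    # State and ZIP are unclassed spans that follow the locality span.
    "addressRegion": get_text(item, ".locality + span"),
    "postalCode": get_text(item, ".locality + span + span"),
    "phone": get_text(item, ".phones"),
}

print(old_matches)
print(record)
```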

For the nested Locality example, use the built-in selectors that mimic CSS-style selectors.

try:
  locality = item.select(".locality")[0]
  addressLocality = locality.text
  # State and ZIP are sibling spans that follow the locality span, not its children.
  state_zip = locality.find_next_siblings("span") # returns a list
  state = state_zip[0].text
  zip_code = state_zip[1].text
  # Might want to add some checks if the state or zip is missing, etc.
except Exception:
  addressLocality = ""
  state = zip_code = ""
In summary:

Fix those hard-coded values to match the new class names and you should be back in business.
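
One possible way to make the next layout change easier to handle is to keep the selectors in a single mapping and report any field whose selector matches nothing. This is a sketch under assumptions: the `SELECTORS` mapping and the `extract` function are hypothetical names introduced here, and the sample listing is made up so that the phone element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping of field name to CSS selector. Keeping the selectors
# in one place means a layout change only requires edits here.
SELECTORS = {
    "name": "a.business-name span",
    "streetAddress": ".street-address",
    "phone": ".phones",
}


def extract(item, selectors):
    """Return (record, missing); missing lists fields whose selector matched nothing."""
    record, missing = {}, []
    for field, selector in selectors.items():
        node = item.select_one(selector)
        if node is None:
            missing.append(field)
            record[field] = ""
        else:
            record[field] = node.get_text(strip=True)
    return record, missing


# Made-up listing without a phone element, to exercise the warning path.
html = (
    '<div class="info">'
    '<a class="business-name"><span>Example Store</span></a>'
    '<span class="street-address">1 Main St</span>'
    "</div>"
)
item = BeautifulSoup(html, "html.parser").select_one(".info")
record, missing = extract(item, SELECTORS)

if missing:
    print("No match for:", ", ".join(missing))
print(record)
```

If every listing on a page reports the same missing field, that is a reasonable hint that a class name has changed rather than that the data is absent.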

Answered 2018-11-24T18:43:27.267