python-2.7 - Problem scraping data from a website using Beautiful Soup

Question

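I am trying to scrape a list of 41 items and their prices from a website, but my output CSV is missing 2-3 items near the end of the page. The reason is that some devices have their price listed under a different class than the rest. The loop in my code runs over names and prices together, so for an item whose price sits under the other class, it takes the price from the next device. As a result the last 2-3 items are skipped, because their prices have already been written against earlier devices. Here is the code for reference: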

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
    prices = soup.findAll('div', {"class": "listGrid-price"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:            
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])

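The price is normally listed under listGrid-price, but for the 2-3 items that are currently out of stock it is under listGrid-price-outOfStock. I need to include this class in my loop as well, so that the correct price appears against each item and the loop runs over all the devices.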

Please pardon my ignorance, as I am new to programming.

score 0 · Accepted Answer

You can write a comparator function that does a custom comparison and pass it to your findAll().

So, if you change the line that assigns prices to:

prices = soup.findAll('div', class_=match_both)

and define the function as:

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

(The function could be much more concise; it is verbose here only to show how it works.)
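As a rough sketch of what a more concise version might look like, the comparator can be reduced to a membership test. The name PRICE_CLASSES is an assumption introduced here for illustration; it does not appear in the original script.

```python
# Hypothetical constant (not in the original script): the two class names
# that mark a price block on the page.
PRICE_CLASSES = ("listGrid-price", "listGrid-price-outOfStock")

def match_both(arg):
    # arg is the class value being tested; it may be None when a tag has
    # no class attribute, in which case the membership test is simply False.
    return arg in PRICE_CLASSES
```

It can be passed to findAll() the same way as the verbose version, via class_=match_both.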

This way the class is compared against both names, and a div is returned if it matches either one.

More information can be found in the documentation (see the has_six_characters example).

Now, since you also asked how to exclude a particular text:

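The text argument of findAll() can also take a custom comparator. In this case you do not want the text Write a review to match, because each such match shifts the names out of step with the prices.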
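To see why the review links matter, here is a small illustration that uses plain lists in place of the parsed page. The device names and prices are made up, and the list comprehension with a truthiness check is only a rough stand-in for what text=True keeps; this not_review is a variant that always returns a boolean.

```python
def not_review(text):
    # Reject missing text (None or "") and any "Write a review" link text.
    return bool(text) and "Write a review" not in text

# Hypothetical link texts standing in for the <a class="clickStreamSingleItem"> tags.
link_texts = ["Phone A", "Write a review", "Phone B", None, "Phone C"]
# Hypothetical prices, one per device.
prices = ["$199.99", "$99.99", "$0.99"]

# Roughly what text=True keeps: every link that has some text.
unfiltered = [t for t in link_texts if t]
# What the custom comparator keeps: device names only.
filtered = [t for t in link_texts if not_review(t)]

# zip() stops at the shorter list, so an extra name shifts every later price
# and pushes the last device out of the output.
rows_before = list(zip(unfiltered, prices))
rows_after = list(zip(filtered, prices))
```

In rows_before the review link takes the second price and "Phone C" is dropped; in rows_after each device lines up with its own price.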

So here is your script, edited to exclude the review links:

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

def not_review(arg):
    if not arg:
        return arg
    return "Write a review" not in arg

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=not_review)
    prices = soup.findAll('div', class_=match_both)
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])
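Since zip() silently truncates to the shorter list, a mismatch between items and prices can go unnoticed. One possible safeguard, sketched here with a hypothetical helper named pair_strict that is not part of the original answer, is to check the counts before pairing:

```python
def pair_strict(items, prices):
    """Pair names with prices, refusing to continue if the counts differ."""
    items, prices = list(items), list(prices)
    if len(items) != len(prices):
        # A count mismatch means at least one name would get the wrong price.
        raise ValueError("found %d items but %d prices; the lists are misaligned"
                         % (len(items), len(prices)))
    return list(zip(items, prices))
```

Looping over pair_strict(items, prices) instead of zip(items, prices) would make the script fail loudly when the page layout produces extra or missing matches.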
