I am trying to scrape Groupon deals using the following link:

When I run it from the shell with scrapy shell, I see all the deals on the page. For example, titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall() gives me 37 titles.

The shell session gives me:

>>> titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
>>> titles = [ title.rstrip().lstrip()  for title in titles ]
>>> len(titles)
37
>>> titles
[u'Le Bar du Normandy - H\xf4tel Normandy', u'Michel Balmet', u'Passion Chocolat', u'Le Caf\xe9 Clairi\xe8re', u'LES CAVES DU LOUVRE', u"L'artiste Restaurant", u'Auberge Le Relais', u'Le Caf\xe9 des Initi\xe9s', u'La Mar\xe9e (75008)', u'Ko\xef', u'Casa Paco (75116)', u'Capitaine Fracasse', u'LePergol\xe8se', u'Wine Tours Paris', u'La Maison Du Rhum', u'Au Port du Salut', u'Grains Nobles', u"L'artiste Restaurant", u'Michel Balmet, 10e', u'Feyrouz C\xf4t\xe9 Mer', u"L'agap\xe9", u"Restaurant Au Bon'art", u'Shibuya Karaok\xe9', u'Eiffel Croisieres', u'Cfv', u'Made In Italy', u'Fuumi Restaurant', u'OfbPontault', u'Le Jackpot', u'La Brasserie Centrale', u'Le cheval blanc', u'LA CANTINE DES TSARS', u'Restaurant Guy Savoy \xe0 la Monnaie de Paris', u'Chez Ma Cousine', u'MAMABALI', u'LE COSMOS', u'Restaurant Le Sancerre']
>>> 
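
A side note on the cleanup step in the session above: chaining .rstrip().lstrip() does the same thing as a single .strip(). Below is a minimal sketch of that step as a helper. The name clean_titles and the raw_titles values are made up for illustration and are not taken from the page. It also drops whitespace-only entries, which the descendant form of ::text (with a space before it) can return.

```python
def clean_titles(raw_titles):
    """Strip surrounding whitespace and drop entries that end up empty."""
    stripped = (title.strip() for title in raw_titles)
    return [title for title in stripped if title]


# Hypothetical raw values: text nodes often come back padded with whitespace.
raw_titles = ["\n  Michel Balmet \n", "   ", "Passion Chocolat\n"]
print(clean_titles(raw_titles))
```

The same helper could be called inside the spider's parse method in place of the list comprehension.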

When I run it from the spider, I only get a small fraction of the results:

import scrapy


class GrouponSpider(scrapy.Spider):
    name = "deals"

    start_urls = [
            'https://www.groupon.fr/browse/paris?category=bars-et-restaurants&=undefined&gclid=Cj0KCQjwo7foBRD8ARIsAHTy2wm4-T4w6ps1KMDg5eG8S7jDsNco8VxuJIcoQO6OXkSrzQm4TWEe-QkaArFXEALw_wcB&utm_campaign=fr_dt_sea_ggl_txt_naq_sr_cbp_ch1_ybr_k*groupon%2Bparis_m*e_d*Groupon-Paris_g*Paris-Exact_c*96685051824_ap*1t1&utm_medium=cpc&utm_source=google&page0'
    ]

    def parse(self, response):
        # Collect the text of every deal title inside the deal cards.
        titles = response.css('figure.card-ui').css('div.cui-udc-title-with-subtitle ::text').getall()
        titles = [title.strip() for title in titles]
        for title in titles:
            yield {'title': title}

        # Follow the pagination link, if the page has one.
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

In this case I get the following results (running with the flags -o items.csv -t csv), which are only a small fraction of all the results:

$ cat items.csv
    title
    Le Bar du Normandy - Hôtel Normandy
    Michel Balmet
    Passion Chocolat
    Le Café Clairière
    Auberge Le Relais
    La Marée (75008)
    L'artiste Restaurant
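
To see exactly which deals are missing, the two lists can be compared directly. Below is a minimal sketch, assuming the shell titles and the CSV titles have been copied into two Python lists. Only the first eight shell titles are included here for brevity, and missing_titles is a hypothetical helper name.

```python
def missing_titles(expected, actual):
    """Return the titles in `expected` that are absent from `actual`, in order."""
    seen = set(actual)
    return [title for title in expected if title not in seen]


# First eight titles from the shell session above.
shell_titles = [
    "Le Bar du Normandy - Hôtel Normandy", "Michel Balmet", "Passion Chocolat",
    "Le Café Clairière", "LES CAVES DU LOUVRE", "L'artiste Restaurant",
    "Auberge Le Relais", "Le Café des Initiés",
]
# All seven titles from items.csv.
csv_titles = [
    "Le Bar du Normandy - Hôtel Normandy", "Michel Balmet", "Passion Chocolat",
    "Le Café Clairière", "Auberge Le Relais", "La Marée (75008)",
    "L'artiste Restaurant",
]

for title in missing_titles(shell_titles, csv_titles):
    print(title)
```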

Any ideas on how to get the full results from the spider code?
