python - NY Times API and Scrapy

Question

I would like to export the JSON response from the NYT API to a CSV file based on a few columns. I was recently having issues connecting to the API with a Scrapy spider, but with the help of this forum I've been able to fix that. Currently, I'm having trouble with the for loop that I believe I need to extract the data, since I'm getting a GET error with my current code (some items commented out). Here's a snippet of the JSON response for the first article (the next one begins with , {"web_url":...):

{"response":
{"meta":{"hits":1,"time":24,"offset":0},"docs":
[{"web_url":"http:\/\/www.nytimes.com\/2013\/09\/17\/arts\/design\/art-dealer-admits-role-in-selling-fake-works.html",
"snippet":"Glafira Rosales, a Long Island art dealer, pleaded guilty to fraud on Monday in the sale of counterfeit works for more than $80 million.",
"lead_paragraph":"Glafira Rosales, a Long Island art dealer, pleaded guilty to fraud on Monday in the sale of counterfeit works for more than $80 million.",
"abstract":null,"print_page":"1","blog":[],"source":"The New York Times",
"headline":{"main":"Art Dealer Admits to Role in Fraud","print_headline":"Art Dealer Admits To Role In Fraud "},
"keywords":
[{"rank":"3","is_major":"N","name":"persons","value":"Rosales, Glafira"},
{"rank":"1","is_major":"N","name":"subject","value":"Frauds and Swindling"},
{"rank":"2","is_major":"N","name":"subject","value":"Art"}],
"pub_date":"2013-09-17T00:00:00Z","document_type":"article","news_desk":"Culture","section_name":"Arts","subsection_name":"Art & Design",
"copyright":"Copyright (c) 2013 The New York Times Company.  All Rights Reserved."}

And a copy of my code so far:

from scrapy.spider import BaseSpider
from nytimesAPIjson.items import NytimesapijsonItem
import json
import urllib2

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["api.nytimes.com"]
    start_urls = ['http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx']

    def parse(self, response):
        jsonresponse = json.loads(response)
        ##NEED LOOP#
        #item = NytimesapijsonItem()
        #item ["pubDate"] = jsonresponse["pub_date"]
        #item ["description"] = jsonresponse["lead_paragraph"]
        #item ["title"] = jsonresponse["print_headline"]
        #item ["link"] = jsonresponse["web_url"]
        #items.append(item)
        #return items
        print jsonresponse #would like to remove with for loop for above items

I'm new to Python and am not sure of the syntax for the for loop. Sorry if there is too much detail, but I'd love to get this up and running. I appreciate everyone's time.

Also, if anyone has good ideas about extracting metadata and aggregating keywords (not necessarily in CSV), I'd be open to suggestions. To start, I'd like to look for trends in financial crime based on country and industry.

UPDATE AFTER SUGGESTION:

I'm still getting an error with that loop. I printed the output for more than one article and here's a snippet:

{u'copyright': u'Copyright (c) 2013 The New York Times Company.  All Rights Reserved.',
 u'response':{u'docs':[{u'_id':................
                        u'word_count'}
                        {u'_id':................
                        u'word_count'},
              u'faucet':{........},
              u'meta':{.........},
 u'status':u'OK'}

There seems to be a missing closing bracket after docs, which makes this appear to be a list. Here's the current code:

##from scrapy.spider import BaseSpider
##from nytimesAPIjson.items import NytimesapijsonItem
##import json
##
##class MySpider(BaseSpider):
##    name = "nytimesapijson"
##    allowed_domains = ["api.nytimes.com"]
##    start_urls=['http://api.nytimes.com/svc/search/v2/articlesearch.json?q=%22laundering%22&facet_field=day_of_week&begin_date=20130917&end_date=20130917&rank=newest&api-key=xxx']
##        
##    def parse(self, response):
##        items =[]
##        jsonresponse = json.loads(response.body_as_unicode())
##        for doc in jsonresponse['copyright']['response']['docs'][0]:
##            item = NytimesapijsonItem()
##            item ["pubDate"] = doc['pub_date']
##            item ["description"] = doc['lead_paragraph']
##            item ["title"] = doc['headline']['print_headline']
##            item ["link"] = doc['web_url']
##            items.append(item)
##
##        return items

The spider runs but still gets an error at the GET HTTP portion. It runs fine if I just print the jsonresponse though.

I've tried with and without the [0] (suggestion from someone else's post). Any other ideas? Thanks again.

score 1 · Accepted Answer

Here is how Python's pprint.pprint() displays your JSON object (it helps to understand the nesting):

>>> import pprint
>>> pprint.pprint(json.loads(nyt))
{u'response': {u'docs': [{u'abstract': None,
                          u'blog': [],
                          u'copyright': u'Copyright (c) 2013 The New York Times Company.  All Rights Reserved.',
                          u'document_type': u'article',
                          u'headline': {u'main': u'Art Dealer Admits to Role in Fraud',
                                        u'print_headline': u'Art Dealer Admits To Role In Fraud '},
                          u'keywords': [{u'is_major': u'N',
                                         u'name': u'persons',
                                         u'rank': u'3',
                                         u'value': u'Rosales, Glafira'},
                                        {u'is_major': u'N',
                                         u'name': u'subject',
                                         u'rank': u'1',
                                         u'value': u'Frauds and Swindling'},
                                        {u'is_major': u'N',
                                         u'name': u'subject',
                                         u'rank': u'2',
                                         u'value': u'Art'}],
                          u'lead_paragraph': u'Glafira Rosales, a Long Island art dealer, pleaded guilty to fraud on Monday in the sale of counterfeit works for more than $80 million.',
                          u'news_desk': u'Culture',
                          u'print_page': u'1',
                          u'pub_date': u'2013-09-17T00:00:00Z',
                          u'section_name': u'Arts',
                          u'snippet': u'Glafira Rosales, a Long Island art dealer, pleaded guilty to fraud on Monday in the sale of counterfeit works for more than $80 million.',
                          u'source': u'The New York Times',
                          u'subsection_name': u'Art & Design',
                          u'web_url': u'http://www.nytimes.com/2013/09/17/arts/design/art-dealer-admits-role-in-selling-fake-works.html'}],
               u'meta': {u'hits': 1, u'offset': 0, u'time': 24}}}
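
To make the nesting concrete, here is a minimal sketch (written for Python 3) that walks the same structure by hand. The `sample` string is a trimmed copy of the response above, and `keyword_counts` is a hypothetical helper, not part of Scrapy or the NYT API. It also shows one possible way to tally keywords across articles, assuming every keyword entry carries a "name" and a "value".

```python
import json
from collections import Counter

# Trimmed copy of the response shown above (only a few fields kept).
sample = '''
{"response": {"meta": {"hits": 1, "time": 24, "offset": 0},
  "docs": [{"web_url": "http://www.nytimes.com/2013/09/17/arts/design/art-dealer-admits-role-in-selling-fake-works.html",
            "headline": {"main": "Art Dealer Admits to Role in Fraud"},
            "keywords": [{"name": "persons", "value": "Rosales, Glafira"},
                         {"name": "subject", "value": "Frauds and Swindling"},
                         {"name": "subject", "value": "Art"}],
            "pub_date": "2013-09-17T00:00:00Z"}]}}
'''


def keyword_counts(data):
    """Count (name, value) keyword pairs across every article in data."""
    counts = Counter()
    for doc in data["response"]["docs"]:      # "docs" is a list of dicts
        for kw in doc.get("keywords", []):    # "keywords" is a list of dicts too
            counts[(kw["name"], kw["value"])] += 1
    return counts


data = json.loads(sample)
first = data["response"]["docs"][0]           # index the list to get one article
print(first["headline"]["main"])              # nested dict inside the article
print(keyword_counts(data))
```

The point to notice is that "docs" is a list, so you either index it (`[0]`) to get one article or iterate over it to get all of them; iterating over `docs[0]` would loop over the keys of a single article instead.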

Now you can write the loop like this:

from scrapy.spider import BaseSpider
from nytimesAPIjson.items import NytimesapijsonItem
import json

class MySpider(BaseSpider):
    name = "nytimesapijson"
    allowed_domains = ["api.nytimes.com"]
    start_urls = ['http://api.nytimes.com/svc/search/v2/articlesearch.json?q="financial crime"&facet_field=day_of_week&begin_date=20130101&end_date=20130916&page=2&rank=newest&api-key=xxx']

    def parse(self, response):
        items = []
        # json.loads() needs the response text, not the Response object itself
        jsonresponse = json.loads(response.body_as_unicode())
        # "docs" is a list holding one dict per article
        for doc in jsonresponse["response"]["docs"]:
            item = NytimesapijsonItem()
            item["pubDate"] = doc["pub_date"]
            item["description"] = doc["lead_paragraph"]
            # the headline is itself a nested dict
            item["title"] = doc["headline"]["print_headline"]
            item["link"] = doc["web_url"]
            items.append(item)

        return items
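
The same loop can be tried outside Scrapy, which may make it easier to debug. The sketch below is only an illustration, written for Python 3: `extract_rows` and `write_csv` are hypothetical helpers, the column names mirror the item fields used above, and `.get()` is used on the assumption that some articles might lack a field. It writes to an in-memory buffer; passing an open file instead would produce a real CSV.

```python
import csv
import io
import json

# Column names mirror the item fields used in the spider above.
FIELDS = ["pubDate", "description", "title", "link"]


def extract_rows(raw_json):
    """Turn the raw API response text into a list of flat dicts."""
    data = json.loads(raw_json)
    rows = []
    for doc in data["response"]["docs"]:
        headline = doc.get("headline") or {}  # tolerate a missing headline
        rows.append({
            "pubDate": doc.get("pub_date"),
            "description": doc.get("lead_paragraph"),
            "title": headline.get("print_headline"),
            "link": doc.get("web_url"),
        })
    return rows


def write_csv(rows, out):
    """Write the rows to any file-like object as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Stand-in for the response body, built from the article shown earlier.
raw = json.dumps({"response": {"docs": [{
    "web_url": "http://www.nytimes.com/2013/09/17/arts/design/art-dealer-admits-role-in-selling-fake-works.html",
    "lead_paragraph": "Glafira Rosales, a Long Island art dealer, pleaded guilty to fraud on Monday in the sale of counterfeit works for more than $80 million.",
    "headline": {"print_headline": "Art Dealer Admits To Role In Fraud "},
    "pub_date": "2013-09-17T00:00:00Z",
}]}})

buffer = io.StringIO()
write_csv(extract_rows(raw), buffer)
print(buffer.getvalue())
```

Inside Scrapy itself, returning the items from parse and letting the feed exports write the file (the -o option of scrapy crawl) is probably the simpler route to a CSV; the sketch is meant for checking the extraction logic in isolation.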
