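I am trying to write a script that fetches Google's AJAX search results (for example: http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=filetype:pdf) and downloads each file. Right now I am stuck trying to convert the response into a Python dictionary so that it is easier to work through.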

import subprocess
import ast

subprocess.call("curl -G -d 'q=filetype:pdf&v=1.0' http://ajax.googleapis.com/ajax/services/search/web  > output",stderr=subprocess.STDOUT,shell=True)
file = open('output','r')
contents = file.read()
output_dict = ast.literal_eval(contents)
print output_dict

When I run it, I get:

$ python script.py 
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100  2643    0  2643    0     0  15926      0 --:--:-- --:--:-- --:--:-- 26696
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    output_dict = ast.literal_eval(contents)
  File "/usr/lib/python2.7/ast.py", line 80, in literal_eval
    return _convert(node_or_string)
  File "/usr/lib/python2.7/ast.py", line 63, in _convert
    in zip(node.keys, node.values))
  File "/usr/lib/python2.7/ast.py", line 62, in <genexpr>
    return dict((_convert(k), _convert(v)) for k, v
  File "/usr/lib/python2.7/ast.py", line 79, in _convert
    raise ValueError('malformed string')
ValueError: malformed string

The file looks like this:

{"responseData": {"results":[{"GsearchResultClass":"GwebSearch",
                                    "unescapedUrl":"http://www.foundationdb.com/AlphaLicenseAgreement.pdf",
                                             "url":"http://www.foundationdb.com/AlphaLicenseAgreement.pdf",
                                      "visibleUrl":"www.foundationdb.com",
                                        "cacheUrl":"http://www.google.com/search?q\u003dcache:W7zhFlfbm6UJ:www.foundationdb.com",
                                           "title":"FoundationDB Alpha Software Evaluation License Agreement",
                               "titleNoFormatting":"FoundationDB Alpha Software Evaluation License Agreement",
                                         "content":"FOUNDATIONDB. ALPHA SOFTWARE EVALUATION LICENSE AGREEMENT.   PLEASE READ CAREFULLY THE TERMS OF THIS ALPHA SOFTWARE \u003cb\u003e...\u003c/b\u003e",
                                      "fileFormat":"PDF/Adobe Acrobat"
                             },
                             {"GsearchResultClass":"GwebSearch",
                                    "unescapedUrl":"https://subreg.cz/registration_agreement.pdf",
                                             "url":"https://subreg.cz/registration_agreement.pdf",
                                      "visibleUrl":"subreg.cz",
                                        "cacheUrl":"http://www.google.com/search?q\u003dcache:ODtRmQsiHD0J:subreg.cz",
                                           "title":"Registration Agreement",
                               "titleNoFormatting":"Registration Agreement",
                                         "content":"Registration Agreement. In order to complete the registration process you must   read and agree to be bound by all terms and conditions herein. TERMS AND \u003cb\u003e...\u003c/b\u003e",
                                      "fileFormat":"PDF/Adobe Acrobat"
                             },
                             {"GsearchResultClass":"GwebSearch",
                                    "unescapedUrl":"http://supportdetails.com/export.pdf",
                                             "url":"http://supportdetails.com/export.pdf",
                                      "visibleUrl":"supportdetails.com",
                                        "cacheUrl":"http://www.google.com/search?q\u003dcache:h0LvxrTTKzIJ:supportdetails.com",
                                           "title":"Export PDF - Support Details",
                               "titleNoFormatting":"Export PDF - Support Details",
                                         "content":"",
                                      "fileFormat":"PDF/Adobe Acrobat"
                             },
                             {"GsearchResultClass":"GwebSearch",
                                    "unescapedUrl":"http://www.fws.gov/le/pdf/travelpetbird.pdf",
                                             "url":"http://www.fws.gov/le/pdf/travelpetbird.pdf",
                                      "visibleUrl":"www.fws.gov",
                                        "cacheUrl":"",
                                           "title":"pet bird",
                               "titleNoFormatting":"pet bird",
                                         "content":"U.S. Fish \u0026amp; Wildlife Service. Traveling Abroad with. Your Pet Bird. The Wild Bird   Conservation Act (Act), a significant step in international conservation efforts to \u003cb\u003e...\u003c/b\u003e",
                                      "fileFormat":"PDF/Adobe Acrobat"
                             }],
                  "cursor":{"resultCount":"72,800,000",
                                  "pages":[{"start":"0","label":1},
                                           {"start":"4","label":2},
                                           {"start":"8","label":3},
                                           {"start":"12","label":4},
                                           {"start":"16","label":5},
                                           {"start":"20","label":6},
                                           {"start":"24","label":7},
                                           {"start":"28","label":8}],
                            "estimatedResultCount":"72800000",
                            "currentPageIndex":0,
                             "moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dfiletype:pdf","searchResultTime":"0.04"
                           }
                  },
  "responseDetails": null,
  "responseStatus": 200
 }

God, that took a long time to format.

1 Answer

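Google returns JSON, so use the json module instead of the ast module you are using now. ast.literal_eval only accepts Python literals, and the JSON value null (as in "responseDetails": null) is not one, which is why it raises ValueError: malformed string.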

import json

with open('output', 'r') as f:
    output_dict = json.load(f)
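To make the difference concrete, here is a minimal sketch that parses a trimmed-down sample in the same shape as the response from the question. The `sample` string, `pdf_urls`, and `literal_eval_failed` are names made up for illustration; only the keys (`responseData`, `results`, `unescapedUrl`, `responseDetails`) come from the output shown above.

```python
import ast
import json

# Trimmed-down sample in the same shape as the response in the question
# (illustrative data, not a full response).
sample = '''{"responseData": {"results": [
    {"unescapedUrl": "http://www.foundationdb.com/AlphaLicenseAgreement.pdf"},
    {"unescapedUrl": "https://subreg.cz/registration_agreement.pdf"}]},
 "responseDetails": null,
 "responseStatus": 200}'''

# json understands null and maps it to None.
output_dict = json.loads(sample)

# Collect the PDF links from each search result.
pdf_urls = [r["unescapedUrl"] for r in output_dict["responseData"]["results"]]
print(pdf_urls)

# ast.literal_eval only accepts Python literals, so the bare name null is rejected.
try:
    ast.literal_eval(sample)
    literal_eval_failed = False
except ValueError:
    literal_eval_failed = True
print(literal_eval_failed)
```

Once the response is a dictionary, the list of `unescapedUrl` values is what you would loop over to download each file.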

You may also want to look into the urllib2 module to load the URL response, rather than relying on curl.
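As a rough sketch of what that could look like, the snippet below builds the same request as the curl command in the question and parses the body with json. The helper names `build_url` and `fetch_results` are hypothetical, the import fallback is only there so the sketch also runs on Python 3 (where urllib2 became urllib.request), and it assumes the endpoint from the question still responds, which may not be the case.

```python
import json

try:
    from urllib2 import urlopen          # Python 2, as used in the question
    from urllib import urlencode
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent
    from urllib.parse import urlencode

# Endpoint taken from the question; it may no longer be available.
BASE_URL = "http://ajax.googleapis.com/ajax/services/search/web"


def build_url(query, version="1.0"):
    """Build the same request that the curl command in the question sends."""
    return BASE_URL + "?" + urlencode([("q", query), ("v", version)])


def fetch_results(query):
    """Fetch the search response and return it as a dict (needs network access)."""
    response = urlopen(build_url(query))
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
```

Calling `fetch_results("filetype:pdf")` would then replace both the curl call and the temporary output file, since the response is parsed straight from memory.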

answered 2012-09-03T22:43:59.940