python - Parsing what you search for on Google

Question

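I want to write a script (preferably in Python, but other languages are not a problem) that can parse what you type into a Google search. Say I search for "cat"; I then want to be able to capture the string cat and, for example, append it to a .txt file on my computer.

So if my searches were "cat", "dog", "cow", I could end up with a .txt file like this:

cat dog cow

Does anyone know of an API that can parse the search bar and return the string that was entered? Or some object that I can convert to a string?

Edit: I don't want to build a Chrome extension or anything like that. Ideally it would be a Python (or bash or Ruby) script that I can run in a terminal.

Thanks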

score 1 · Accepted Answer

If you have access to the URL, you can look for "&q=" to find the search term (e.g. http://google.com/...&q=cats ...).
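As a minimal sketch of this idea, assuming the search URLs are already available as strings (for example copied from the address bar), the standard library's `urllib.parse` can pull out the `q` parameter and append it to a text file. The function names and the default file name `searches.txt` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def extract_query(url):
    """Return the value of the 'q' parameter in a search URL, or None."""
    params = parse_qs(urlsplit(url).query)
    values = params.get("q")
    return values[0] if values else None


def append_query(url, path="searches.txt"):
    """Append the search term found in url (if any) to a text file."""
    term = extract_query(url)
    if term is not None:
        with open(path, "a", encoding="utf-8") as f:
            f.write(term + "\n")
    return term


if __name__ == "__main__":
    for u in ["https://www.google.com/search?q=cats",
              "https://www.google.com/search?hl=en&q=black+dogs"]:
        print(extract_query(u))
```

`parse_qs` also decodes `+` and percent-escapes, so multi-word and non-ASCII searches come back as plain text.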

score 1 · Accepted Answer

I can suggest two popular solutions:

1) Google has a search engine API, https://developers.google.com/products/#google-search (it is limited to 100 requests per day).

Code snippet:

import json
import re
import urllib.parse
import urllib.request

import config  # local module that holds api_key and cx


def gapi_parser(args):
    query = args.text
    max_sites = args.max_sites
    api_key = config.api_key
    cx = config.cx

    # Note: this API returns up to the first 100 results only.
    # https://developers.google.com/custom-search/v1/using_rest?hl=ru-RU#WorkingResults

    results = []     # domains in the order they were first seen
    domains = set()  # the same domains, for fast membership tests
    errors = []
    start = 1
    while start < 100:  # the API cannot page past the first 100 results
        # urlencode escapes spaces and non-ASCII characters in the query
        params = urllib.parse.urlencode(
            {'key': api_key, 'cx': cx, 'q': query, 'alt': 'json', 'start': start})
        req = 'https://www.googleapis.com/customsearch/v1?' + params
        with urllib.request.urlopen(req) as con:
            if con.getcode() != 200:
                break
            j = json.loads(con.read().decode('utf-8'))
        for item in j.get('items', []):
            match = re.search(r'^(https?://)?\w(\w|\.|-)+', item['link'])
            if match:
                domain = match.group(0)
                if domain not in domains:
                    domains.add(domain)
                    results.append(domain)
            else:
                errors.append("Can't recognize domain: %s" % item['link'])
        if len(domains) >= max_sites:
            break
        next_page = j['queries'].get('nextPage')
        if not next_page:  # no further pages to request
            break
        start = int(next_page[0]['startIndex'])

    print()
    for error in errors:
        print(error)
    return (results, domains)
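The regular expression above is one way to pull the domain out of each result link. As an alternative sketch (not the method used in this answer), `urllib.parse.urlsplit` from the standard library can do the same for links that include a scheme. The helper name `domain_of` is hypothetical.

```python
from urllib.parse import urlsplit


def domain_of(link):
    """Return 'scheme://host' for an absolute URL, or None if there is no host."""
    parts = urlsplit(link)
    if not parts.scheme or not parts.netloc:
        return None
    return "{}://{}".format(parts.scheme, parts.netloc)
```

Unlike the regular expression, this keeps a port number if the link has one, and it returns None for links without a scheme instead of guessing.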

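2) I wrote a Selenium-based script that parses the page in a real browser instance, but this solution has some limitations, for example you get a CAPTCHA if you run searches like a robot.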

score 0 · Accepted Answer

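A few options you might consider, with their advantages and disadvantages:

URL:
- Advantage: As Chris said, accessing the URL and changing it manually is one option. Writing a script for this should be easy, and I can send you my Perl script if you like.
- Disadvantage: I am not sure you are allowed to do this. I wrote a Perl script for it before, but it did not work because Google states that you may not use its services outside the Google interface. You may run into the same problem.
Google's search API:
- Advantage: The popular choice. Good documentation. Should be the safe option.
- Disadvantage: Google's restrictions.
Other search engines:
- Advantage: They may not have the same restrictions as Google. You might find a search engine that lets you experiment more and gives you more freedom overall.
- Disadvantage: You won't get results as good as Google's.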
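For the URL option, a script would typically build the search URL from a term rather than edit it by hand. Here is a minimal sketch, assuming the usual `https://www.google.com/search?q=...` form. The function name `build_search_url` is hypothetical, and the sketch only builds the URL without fetching it, given the usage restrictions mentioned above.

```python
from urllib.parse import urlencode


def build_search_url(term, base="https://www.google.com/search"):
    """Return a search URL with the term safely encoded in the 'q' parameter."""
    return base + "?" + urlencode({"q": term})
```

Passing a different `base` would point the same helper at another search engine that also uses a `q` parameter.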
