0
# -*- coding: utf-8 -*-

import re
import csv
import urllib
import urllib2
import BeautifulSoup
Filter = [' ab1',' ab2',' dc4',....]
urllists = ['myurl1','myurl2','myurl3',...]
csvfile = file('csv_test.csv','wb')
writer = csv.writer(csvfile)
writer.writerow(['keyword','url'])
for eachUrl in urllists:
    for kword in Filter:
        keyword = "site:" + urllib.quote_plus(eachUrl) + kword
        safeKeyword = urllib.quote_plus(keyword)
        fullQuery = 'http://www.google.com/search?sourceid=chrome&client=ubuntu&channel=cs&    ie=UTF-8&q=' + safeKeyword

        req = urllib2.Request(fullQuery, headers = {'User-Agent': 'Mozilla/15.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/12.04 Chrome/21.0.118083 Safari/535.11'})
        html = urllib2.urlopen(req).read()

        soup = BeautifulSoup.BeautifulSoup(html, fromEncoding = 'utf8')

        resultURLList = [t.a['href'] for t in soup.findAll('h3', {'class':'r'})]

        if resultURLList:
            for l in resultURLList:
                needCheckHtml = urllib2.urlopen(l).read()
                if needCheckHtml:
                    x = re.compile(r"\b" + kword + r"\b")
                    p = x.search(needCheckHtml)
                    if p:
                        data = [kword, l]
                        writer.writerow(data)

        else:
            print '%s: No Results' % kword
csvfile.close()

在google searchresults上显示一个关于检查url的简单脚本,打开它,检查并匹配list Filter use re中的关键字,上面的代码可能会导致一些错误,例如,HTTPERROR,URLError,但我不知道如何修复并改进代码,有人可以帮我吗?请.. 如果遇到一些谷歌拒绝,想使用 os.system("rasdial name user code") 重新连接 PPPOE 并更改 IP,那么如何修复此代码非常感谢!

4

1 回答 1

1

I'm not sure how much this helps, but there is a search API that you can use without Google blocking your request and without the need to change your IP address; although there are some restrictions here as well.

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=AnT4i

{"responseData": {"results":[{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.ncbi.nlm.nih.gov/pubmed/11526138","url":"http://www.ncbi.nlm.nih.gov/pubmed/11526138","visibleUrl":"www.ncbi.nlm.nih.gov","cacheUrl":"","title":"Identification of aminoglycoside-modifying enzymes by susceptibility \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Identification of aminoglycoside-modifying enzymes by susceptibility ...","content":"In 381 Japanese MRSA isolates, the \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e, aac(6\u0026#39;)-aph(2\u0026quot;), and aph(3\u0026#39;)-III   genes \u003cb\u003e...\u003c/b\u003e Isolates with only the \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e gene had coagulase type II or III, but   isolates \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.ncbi.nlm.nih.gov/pubmed/1047990","url":"http://www.ncbi.nlm.nih.gov/pubmed/1047990","visibleUrl":"www.ncbi.nlm.nih.gov","cacheUrl":"","title":"[\u003cb\u003eANT(4\u0026#39;)I\u003c/b\u003e: a new aminoglycoside nucleotidyltransferase found in \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"[ANT(4\u0026#39;)I: a new aminoglycoside nucleotidyltransferase found in ...","content":"[\u003cb\u003eANT(4\u0026#39;)I\u003c/b\u003e: a new aminoglycoside nucleotidyltransferase found in \u0026quot;staphylococcus   aureus\u0026quot; (author\u0026#39;s transl)]. [Article in French]. Le Goffic F, Baca B, Soussy CJ, \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://jcm.asm.org/content/27/11/2535","url":"http://jcm.asm.org/content/27/11/2535","visibleUrl":"jcm.asm.org","cacheUrl":"","title":"Use of plasmid analysis and determination of aminoglycoside \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Use of plasmid analysis and determination of aminoglycoside ...","content":"Aminoglycoside resistance pattern determinations revealed the presence of the   \u003cb\u003eANT(4\u0026#39;)-I\u003c/b\u003e enzyme (aminoglycoside 4\u0026#39; adenyltransferase) in all group 1 isolates \u003cb\u003e...\u003c/b\u003e"},{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://ukpmc.ac.uk/articles/PMC88306","url":"http://ukpmc.ac.uk/articles/PMC88306","visibleUrl":"ukpmc.ac.uk","cacheUrl":"","title":"Identification of Aminoglycoside-Modifying Enzymes by \u003cb\u003e...\u003c/b\u003e","titleNoFormatting":"Identification of Aminoglycoside-Modifying Enzymes by ...","content":"The technique used three sets of primers delineating specific DNA fragments of   the aph(3\u0026#39;)-III, \u003cb\u003eant(4\u0026#39;)-I\u003c/b\u003e, and aac(6\u0026#39;)-aph(2\u0026quot;) genes, which influence the MICs of \u003cb\u003e...\u003c/b\u003e"}],"cursor":{"resultCount":"342","pages":[{"start":"0","label":1},{"start":"4","label":2},{"start":"8","label":3},{"start":"12","label":4},{"start":"16","label":5},{"start":"20","label":6},{"start":"24","label":7},{"start":"28","label":8}],"estimatedResultCount":"342","currentPageIndex":0,"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dAnT4i","searchResultTime":"0.25"}}, "responseDetails": null, "responseStatus": 200}

see http://googlesystem.blogspot.hu/2008/04/google-search-rest-api.html

于 2012-09-06T09:07:35.233 回答