
There are a lot of posts here asking how to automate searches on Google. I chose to use BeautifulSoup, and I have read many of the questions about it here, but I couldn't find a direct answer to my problem, even though the specific task seems very ordinary. My code below is fairly self-explanatory; the bracketed sections are where I ran into trouble (by "ran into trouble" I mean I couldn't figure out how to implement my pseudocode for those parts, and after reading the documentation and searching online for similar problems with code, I still don't know how to do it). If it helps, I think my problem is probably very similar to that of anyone doing automated searches on PubMed to find specific articles of interest. Thanks very much.

#Find Description

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2

input_csv = "Company.csv"
output_csv = "output.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("Name",)  # note the comma: ("Name") is just a string
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("Name", "Description")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)  # skip the header row of the input file
            for row in reader:
                search_term = row["Name"]
                url = "http://google.com/search?q=%s" % urllib.quote_plus(search_term)

                #STEP ONE: Enter "search term" into Google Search
                #req = urllib2.Request(url, None, {'User-Agent':'Google Chrome'} )
                #res = urllib2.urlopen(req)
                #dat = res.read()
                #res.close()
                #BeautifulSoup(dat)


                #STEP TWO: Find Description
                #if there is a wikipedia page for the entity:
                    #return first sentence of wikipedia page
                #if other site:
                    #return all sentences that have the keyword "keyword" in them

                #STEP THREE: Return Description as "google_search" variable

                row["Description"] = google_search
                writer.writerow(row)

if __name__ == "__main__":
    main()
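Steps one to three above boil down to: fetch a page, then pull out the first sentence of its first paragraph. Below is a minimal sketch of just the extraction part, written for Python 3 using only the standard library (the question's code targets Python 2 and BeautifulSoup 3, so this is a modernized assumption, not the asker's method). The name `first_sentence` is hypothetical, and real Google or Wikipedia markup will need more careful selection than "first `<p>` on the page".

```python
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    """Collects the text of the first <p> element in an HTML document."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth while inside the first <p>
        self.done = False   # True once the first <p> has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.chunks.append(data)

def first_sentence(html):
    """Return the first sentence of the first paragraph, or None."""
    parser = FirstParagraph()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    if not text:
        return None
    # Naive sentence split: good enough for a sketch, not for
    # abbreviations like "Inc." in real company descriptions.
    return text.split(". ")[0].rstrip(".") + "."
```

With BeautifulSoup the same idea would be `soup.find('p')` followed by the same sentence split; the parser subclass above just avoids the external dependency.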

Appendix

For anyone working on this or researching it: I came up with a suboptimal solution that I am still finishing, but I thought I'd post it in case it helps anyone else who lands on this page. Basically, instead of tackling the problem of choosing which web page to select, I added an initial step that does all of the searches on Wikipedia. That isn't what I ultimately want, but at least it makes it easier to handle a subset of the entities. The code is in two files (Wikipedia.py and wiki_test.py):

#Wikipedia.py

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import wiki_test


input_csv = "Name.csv"
output_csv = "WIKIPEDIA.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("A", "C", "E", "M", "O", "N", "P", "Y")
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("A", "C", "E", "M", "O", "N", "P", "Y", "Description")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)  # skip the header row of the input file
            for row in reader:
                search_term = row["A"]
                result = wiki_test.wiki(search_term)
                row["Description"] = result
                writer.writerow(row)

if __name__ == "__main__":
    main()
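The CSV plumbing in Wikipedia.py can be exercised without touching the network by parameterizing the lookup step. This is a Python 3 sketch under assumed names (`annotate_csv` and the `lookup` callback are illustrative, not part of the code above), and it lets `csv.DictReader` read the header row itself instead of hard-coding the field names:

```python
import csv
import io

def annotate_csv(infile, outfile, lookup, key="A"):
    """Copy rows from infile to outfile, adding a Description column
    produced by calling lookup() on each row's value for `key`."""
    reader = csv.DictReader(infile)  # field names come from the header row
    fields = list(reader.fieldnames) + ["Description"]
    writer = csv.DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        row["Description"] = lookup(row[key])
        writer.writerow(row)

# Usage with in-memory files (no disk or network needed):
src = io.StringIO("A,C\nAcme,x\nGlobex,y\n")
dst = io.StringIO()
annotate_csv(src, dst, lambda name: "about " + name)
```

In the real script, `lookup` would be `wiki_test.wiki` and the file objects would come from `open()`; injecting them this way also makes the last-row and header-handling behavior easy to check.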

And here is a helper module, based on the post "Extract the first paragraph from a Wikipedia article (Python)":

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def wiki(article):
    article = urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')] #wikipedia needs this
    resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    #try:
    #    urllib2.urlopen(resource)
    #except urllib2.HTTPError, e:
    #    print(e)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    # return (rather than print) so Wikipedia.py can write the result to CSV
    return soup.find('div', id="bodyContent").p

I just need to fix it to handle HTTP 404 errors (i.e., page not found), and this code should work for anyone who wants to look up the basic company information available on Wikipedia. Again, I would rather have something that works against a Google search and finds the relevant sites, and the relevant parts of those sites that mention the "keyword", but at least this current program gets us something.
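The missing 404 handling might look like the sketch below, written for Python 3's `urllib.request`/`urllib.error` rather than the Python 2 `urllib2` used above. `fetch_article` is a hypothetical helper, and the `opener` parameter is injected purely so the error path can be exercised without the network:

```python
import urllib.request
import urllib.error

def fetch_article(url, opener=urllib.request.urlopen):
    """Fetch url and return the body as text, or None on an HTTP
    error such as 404 (page not found)."""
    try:
        resp = opener(url)
    except urllib.error.HTTPError as e:
        # A missing Wikipedia page comes back as 404; skip it
        # instead of letting the whole CSV run crash.
        print("skipping %s: HTTP %d %s" % (url, e.code, e.reason))
        return None
    try:
        return resp.read().decode("utf-8", "replace")
    finally:
        resp.close()
```

In the Python 2 code above, the equivalent is wrapping `opener.open(...)` in `try: ... except urllib2.HTTPError, e:` and returning None, which is essentially the commented-out block already in `wiki()`.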
