There are many posts here asking how to automate searches on Google. I chose to use BeautifulSoup and have read many of the questions about it here, but I could not find a direct answer to my problem, even though the specific task seems routine. My code below is fairly self-explanatory; the commented-out sections are where I am running into trouble (EDIT: by "running into trouble" I mean I cannot figure out how to implement my pseudocode for this part, and after reading the documentation and searching online for similar problems with code, I still do not know how to do it). If it helps, I think my problem is probably very similar to that of anyone running automated searches on PubMed to find specific articles of interest. Thank you very much.
#Find Description
from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2

input_csv = "Company.csv"
output_csv = "output.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("Name",)  # one-element tuple; ("Name") would just be a string
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("Name", "Description")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writerow(dict((h, h) for h in output_fields))
            next(reader)  # skip the header row of the input file
            for row in reader:
                search_term = row["Name"]
                url = "http://google.com/search?q=%s" % urllib.quote_plus(search_term)
                #STEP ONE: Enter "search term" into Google Search
                #req = urllib2.Request(url, None, {'User-Agent': 'Google Chrome'})
                #res = urllib2.urlopen(req)
                #dat = res.read()
                #res.close()
                #soup = BeautifulSoup(dat)
                #STEP TWO: Find Description
                #if there is a wikipedia page for the entity:
                #    return first sentence of wikipedia page
                #if other site:
                #    return all sentences that have the keyword "keyword" in them
                #STEP THREE: Return Description as "google_search" variable
                row["Description"] = google_search  # google_search is the result of the steps above
                writer.writerow(row)

if __name__ == "__main__":
    main()
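For anyone stuck on the same STEP ONE / STEP TWO, here is a minimal sketch of the two pieces I can actually demonstrate without hitting Google: building the search URL, and pulling the first paragraph of text out of an HTML page. This is Python 3 standard library only (html.parser instead of BeautifulSoup), and the sample HTML is an illustrative stand-in, not Google's real markup:

```python
# Sketch of STEPS ONE/TWO: build the search URL, then extract text from HTML.
# Python 3 stdlib only; the sample HTML below stands in for a fetched page.
from html.parser import HTMLParser
from urllib.parse import quote_plus

def search_url(term):
    """STEP ONE: construct the Google search URL for a term."""
    return "http://google.com/search?q=%s" % quote_plus(term)

class FirstParagraph(HTMLParser):
    """STEP TWO: collect the text of the first <p> element seen."""
    def __init__(self):
        super().__init__()
        self.in_p = False    # currently inside the first <p>?
        self.done = False    # already finished the first <p>?
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_p:
            self.text.append(data)

def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    return "".join(parser.text).strip()

sample = "<html><body><div><p>Acme Corp is a company.</p><p>More.</p></div></body></html>"
print(search_url("Acme Corp"))   # http://google.com/search?q=Acme+Corp
print(first_paragraph(sample))   # Acme Corp is a company.
```

The same `first_paragraph` helper would be fed the body of whichever page the search leads to; which page to pick is exactly the part the question leaves open.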
Appendix
For anyone working on this or looking into it, I came up with a suboptimal solution that I am still finishing. But I thought I would post it in case it helps anyone else who lands on this page. Basically, instead of tackling the problem of which web page to pick, I added an initial step that runs all the searches against Wikipedia. It is not what I want, but at least it makes it easier to get a subset of the entities. The code is in two files (Wikipedia.py and wiki_test.py):
#Wikipedia.py
from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import wiki_test

input_csv = "Name.csv"
output_csv = "WIKIPEDIA.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("A", "C", "E", "M", "O", "N", "P", "Y")
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("A", "C", "E", "M", "O", "N", "P", "Y", "Description")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writerow(dict((h, h) for h in output_fields))
            next(reader)  # skip the header row of the input file
            for row in reader:
                search_term = row["A"]
                row["Description"] = wiki_test.wiki(search_term)
                writer.writerow(row)

if __name__ == "__main__":
    main()
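As a possible refinement (my own suggestion, not part of the setup above): rather than scraping the rendered article HTML, the MediaWiki API can return a plain-text intro extract directly via its `extracts` property. The sketch below only builds the request URL; fetching it works the same way as the urllib2 calls elsewhere in this post:

```python
# Build a MediaWiki API query asking for a plain-text intro extract.
# Python 3 stdlib; endpoint and parameters follow the standard api.php interface.
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def extract_url(title):
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": "1",      # only the lead section
        "explaintext": "1",  # plain text instead of HTML
        "format": "json",
        "titles": title,
    }
    return API + "?" + urlencode(params)

print(extract_url("Python (programming language)"))
```

This sidesteps the brittle `div#bodyContent` scraping: the API response is JSON, and a missing page shows up as a result with a `missing` marker rather than an HTTP 404.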
And here is a helper module, based on this post, that extracts the first paragraph from a Wikipedia article (Python):
#wiki_test.py
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def wiki(article):
    article = urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')]  # Wikipedia rejects the default urllib2 user agent
    resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    #try:
    #    urllib2.urlopen(resource)
    #except urllib2.HTTPError, e:
    #    print(e)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    return soup.find('div', id="bodyContent").p  # return (not print) so the caller can write it to the CSV
I just need to fix it to handle HTTP 404 errors (i.e. page not found), and then this code will work for anyone who wants to look up the basic company information available on Wikipedia. Again, I would rather have something that works on a Google search and finds the relevant sites, and the relevant sections of those sites that mention the "keyword", but at least this current program gets us something.