An alternative solution to the OP's problem:
Solution outline:
- Send the user's input through an HTML form using the GET method
- Handle the URL-encoded values from that GET request in a shell script
- The shell script parses and saves the values, then passes them as arguments when it invokes the Python script.
JavaScript and PHP work well alongside this setup, and from there you can bring in MySQL and the like.
Using GET, the user's input travels from the client to the server, where a shell script processes the data.
Example Index.php:
<!DOCTYPE html>
<html>
<head>
  <title>Google Email Search</title>
</head>
<body>
  <h1>Script Options</h1>
  <form action="/cgi-bin/call.sh" method="get">
    <table border="1">
      <tr>
        <td>Keyword:</td>
        <td><input type="text" name="query" value="Query"></td>
      </tr>
      <tr>
        <td># of Pages:</td>
        <td><input type="text" name="pages" value="1"></td>
      </tr>
      <tr>
        <td>Output File Name:</td>
        <td><input type="text" name="output_name" value="results"></td>
      </tr>
      <tr>
        <td>E-mail Address:</td>
        <td><input type="text" name="email_address" value="example@gmail.com"></td>
      </tr>
      <tr>
        <td><input type="submit" value="Submit"></td>
      </tr>
    </table>
  </form>
</body>
</html>
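Submitting the form produces a GET request whose QUERY_STRING carries the URL-encoded field values. A minimal offline sketch (the sample values here are made up) of what that string looks like and how `sed` can pull a field out of it:

```shell
#!/bin/bash
# Hand-built sample of the QUERY_STRING a browser would send for this form;
# the values are hypothetical, purely for illustration.
QUERY_STRING='query=hello%20world&pages=2&output_name=results&email_address=user%40example.com'

# Extract one field, then decode %20 back to a space.
query=$(echo "$QUERY_STRING" | sed -n 's/^.*query=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
pages=$(echo "$QUERY_STRING" | sed -n 's/^.*pages=\([^&]*\).*$/\1/p')

echo "query='$query' pages='$pages'"
```

Running this prints `query='hello world' pages='2'`, which is exactly the kind of parsing the CGI script below performs on a real request.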
An example shell script that calls the Python script; it should live in your cgi-bin or another directory your server allows to execute.
#!/bin/bash
# CGI script: reads the form values from the GET request's QUERY_STRING,
# then passes them as options when invoking the Python script.
echo "Content-type: text/html"
echo ""
echo '<html>'
echo '<head>'
echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
echo '<title></title>'
echo '</head>'
echo '<body>'
# Extract each parameter from QUERY_STRING; note only %20 (space) is decoded here.
query=$(echo "$QUERY_STRING" | sed -n 's/^.*query=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
pages=$(echo "$QUERY_STRING" | sed -n 's/^.*pages=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
output_name=$(echo "$QUERY_STRING" | sed -n 's/^.*output_name=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
email_address=$(echo "$QUERY_STRING" | sed -n 's/^.*email_address=\([^&]*\).*$/\1/p' | sed 's/%20/ /g')
echo '<h1>'
echo 'Running...'
echo '</h1>'
# Run from the directory this script lives in so main.py is found.
DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
cd "$DIR"
python main.py -query "$query" -pages "$pages" -o "$output_name"
echo ''
echo '</body>'
echo '</html>'
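The sed pipeline above only turns %20 back into spaces, so any other percent-escape (for example the %40 in the e-mail address) would reach the Python script still encoded. A hedged sketch of a fuller decoder, assuming bash (the `urldecode` name is my own, not part of the original script):

```shell
#!/bin/bash
# Hypothetical helper: decode '+' and %HH escapes in a form-encoded value.
urldecode() {
  local data="${1//+/ }"        # '+' encodes a space in form data
  printf '%b' "${data//%/\\x}"  # rewrite %HH as \xHH and let printf expand it
}
```

For example, `urldecode 'user%40example.com'` yields `user@example.com`; you could run each extracted variable through it before handing the values to main.py.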
Example Python script, called from the shell script:
#!/usr/bin/env python
from xgoogle.search import GoogleSearch
import urllib2, re, csv, os
import argparse


class ScrapeProcess(object):
    emails = []  # collected so far, used to skip duplicates

    def __init__(self, filename):
        self.filename = filename
        self.csvfile = open(filename, 'wb+')
        self.csvwriter = csv.writer(self.csvfile)

    def go(self, query, pages):
        search = GoogleSearch(query)
        search.results_per_page = 10
        for i in range(pages):
            search.page = i
            results = search.get_results()
            for page in results:
                self.scrape(page)

    def scrape(self, page):
        try:
            request = urllib2.Request(page.url.encode("utf8"))
            html = urllib2.urlopen(request).read()
        except Exception:
            return
        emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
        for email in emails:
            if email not in self.emails:  # if not a duplicate
                self.csvwriter.writerow([page.title.encode('utf8'),
                                         page.url.encode("utf8"), email])
                self.emails.append(email)


parser = argparse.ArgumentParser(description='Scrape Google results for emails')
parser.add_argument('-query', type=str, default='test',
                    help='a query to use for the Google search')
parser.add_argument('-pages', type=int, default=10,
                    help='number of Google results pages to scrape')
parser.add_argument('-o', type=str, default='emails.csv', help='output filename')

args = parser.parse_args()
args.o = args.o + '.csv' if '.csv' not in args.o else args.o  # ensure a .csv extension

s = ScrapeProcess(args.o)
s.go(args.query, args.pages)
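The e-mail pattern used in `scrape()` can be sanity-checked offline with `grep -E`, without running the scraper at all. A small sketch against made-up sample HTML (the addresses are invented):

```shell
#!/bin/bash
# Made-up sample HTML; the pattern mirrors the one in the Python script above.
sample='<p>Contact alice@example.com or bob.smith+tag@mail.example.org</p>'
emails=$(echo "$sample" | grep -oE '[A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.[a-zA-Z]*')
echo "$emails"
```

This prints the two addresses, one per line, which matches what `re.findall` would return for the same input.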
A complete working example is here:
https://github.com/mhenes/Google-EmailScraper
Disclaimer: this is my Git repo - a forked project, used here to demonstrate this functionality.