Using Python to grab sites recursively with wget -P ... -r -l ..., processing them in parallel (gist here):
import multiprocessing
import subprocess

def getSiteRecursive(site_id, url, depth=2):
    # Pass the arguments as a list so the shell never interprets the id or URL.
    cmd = ["wget", "-P", site_id, "-r", "-l", str(depth), url]
    subprocess.call(cmd)

if __name__ == "__main__":
    input_file = "site_list.txt"
    jobs = []
    # Cap the number of wget processes running at once.
    max_jobs = multiprocessing.cpu_count() * 2 + 1
    with open(input_file) as f:
        for line in f:
            # Each line holds an id and a URL separated by whitespace.
            id_url = line.split()
            if len(id_url) >= 2:
                try:
                    print("Grabbing " + id_url[1] + " into " + id_url[0] + " recursively...")
                    if len(jobs) >= max_jobs:
                        # Wait for the oldest job before starting another.
                        jobs[0].join()
                        del jobs[0]
                    p = multiprocessing.Process(target=getSiteRecursive, args=(id_url[0], id_url[1], 2))
                    jobs.append(p)
                    p.start()
                except Exception as e:
                    print("Error for " + id_url[1] + ": " + str(e))
    # Wait for the remaining jobs to finish.
    for j in jobs:
        j.join()
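If you'd rather not manage the job list by hand, a bounded pool gives the same cap on concurrent downloads. Below is a minimal sketch, assuming Python 3 and swapping in concurrent.futures.ThreadPoolExecutor for the manual multiprocessing.Process bookkeeping; threads should be enough here because each worker only waits on an external wget process. The names wget_recursive and grab_all and the fetch parameter are hypothetical, added for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def wget_recursive(site_id, url, depth=2):
    """Run wget recursively for one site and return wget's exit code."""
    return subprocess.call(["wget", "-P", site_id, "-r", "-l", str(depth), url])


def grab_all(sites, fetch=wget_recursive, max_workers=4):
    """Fetch every (site_id, url) pair, with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool queues work beyond max_workers.
        futures = [(site_id, pool.submit(fetch, site_id, url)) for site_id, url in sites]
        # result() blocks until that download has finished.
        return [(site_id, future.result()) for site_id, future in futures]
```

Calling grab_all([("id_345", "http://www.stackoverflow.com")]) would return a list pairing each id with the exit code of its wget run. The fetch parameter is only there so the download step can be replaced, for example in a dry run.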
Using Python to save single pages into named files:
import urllib.request

input_file = "site_list.txt"
# open the site list file
with open(input_file) as f:
    # loop through lines
    for line in f:
        # split out the id and url
        id_url = line.split()
        if len(id_url) < 2:
            # skip blank or malformed lines
            continue
        print("Grabbing " + id_url[1] + " into " + id_url[0] + ".html...")
        try:
            # try to get the web page
            u = urllib.request.urlopen(id_url[1])
            # save the GET response data to the id file (with ".html" appended)
            with open(id_url[0] + ".html", "wb") as local_file:
                local_file.write(u.read())
            print("got " + id_url[0] + "!")
        except Exception:
            print("Could not get " + id_url[0] + "!")
Example site_list.txt:
id_345 http://www.stackoverflow.com
id_367 http://stats.stackexchange.com
Output:
Grabbing http://www.stackoverflow.com into id_345.html...
got id_345!
Grabbing http://stats.stackexchange.com into id_367.html...
got id_367!
Directory listing:
get_urls.py
id_345.html
id_367.html
site_list.txt
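The site_list.txt format shown above (an id, whitespace, then a URL on each line) can be read with a small helper that skips blank or incomplete lines. This is a minimal sketch; parse_site_list is a hypothetical name, not something the scripts above define.

```python
def parse_site_list(text):
    """Return a list of (site_id, url) tuples from site_list.txt content."""
    sites = []
    for line in text.splitlines():
        # split() with no arguments splits on any run of whitespace.
        fields = line.split()
        if len(fields) < 2:
            # Blank line or a line without a URL: ignore it.
            continue
        sites.append((fields[0], fields[1]))
    return sites
```

With the example file, this would yield the pairs ("id_345", "http://www.stackoverflow.com") and ("id_367", "http://stats.stackexchange.com").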
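If you prefer the command line or shell scripting, you can use awk to read each line with its default whitespace splitting, pipe the result into a while loop, and execute it with backticks: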
awk '{print "wget -O " $1 ".html " $2}' site_list.txt | while read line ; do `$line` ; done
Breaking it down...
awk '{print "wget -O " $1 ".html " $2}' site_list.txt |
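- Use the awk tool to read each line of site_list.txt and split it on whitespace (the default) into variables ($1, $2, $3, etc.), so that your id is in $1 and your url is in $2.
- Add the print AWK command to construct the call to wget.
- Add the pipe operator | to send the output to the next command.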
Next we run the wget call:
while read line ; do `$line` ; done
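- Loop over the previous command's output line by line, storing each line in the $line variable, and execute it with the backtick operator, which interprets the text and runs it as a command. Backticks are command substitution, so the shell would also try to run whatever the command prints to stdout; wget writes its progress to stderr, so nothing further is executed.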
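For comparison, the awk step above can be mirrored in Python: split each line on whitespace, treat the first field as the id and the second as the url, and build the same wget -O call. This is a minimal sketch assuming Python 3.8+ (for shlex.join); wget_command is a hypothetical helper name.

```python
import shlex


def wget_command(line):
    """Build the wget argument list for one site_list.txt line, or None if malformed."""
    fields = line.split()  # same default whitespace splitting as awk
    if len(fields) < 2:
        return None
    site_id, url = fields[0], fields[1]  # awk's $1 and $2
    return ["wget", "-O", site_id + ".html", url]


cmd = wget_command("id_345 http://www.stackoverflow.com")
# shlex.join renders the list as a shell-quoted command line.
print(shlex.join(cmd))
```

The list form could be passed straight to subprocess.call, which avoids having the shell re-interpret the line the way the backtick version does.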