text - wget: reading from a list with ID numbers and URLs

Question

I have a .txt file with 500 lines, each containing an ID number and a website homepage URL, like this:

id_345  http://www.example1.com
id_367  http://www.example2.org
...
id_10452 http://www.example3.net

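Using wget with the -i option, I am trying to recursively download part of these websites, but I want to store the files in a way that ties them to the ID number: either store the files in a directory named after the ID or (the best option, though I think the hardest to achieve) store the HTML content in a single txt file named after the ID. Unfortunately, the -i option cannot read a file in the format I am using. How can I link each website's content to its associated ID?

Thanks

PS: I imagine that to do this I have to step "outside" of wget and call it from a script. If so, please keep in mind that I am a novice in this area (just some Python experience), and in particular I cannot yet follow the logic and code of bash scripts, so step-by-step explanations for dummies are very welcome.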

score 1 · Accepted Answer

Use Python to fetch the sites recursively with wget -P ... -r -l ..., running the downloads in parallel:

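import multiprocessing, subprocess

def getSiteRecursive(site_id, url, depth=2):
  # -P: directory to save into (named after the id), -r: recursive, -l: max depth
  # passing a list (no shell) keeps characters such as & or ? in the url intact
  cmd = ["wget", "-P", site_id, "-r", "-l", str(depth), url]
  subprocess.call(cmd)

if __name__ == "__main__":
  input_file = "site_list.txt"
  jobs = []
  # upper limit on the number of wget processes running at the same time
  max_jobs = multiprocessing.cpu_count() * 2 + 1
  with open(input_file) as f:
    for line in f:
      # split the line on whitespace into [id, url]
      id_url = line.split()
      if len(id_url) >= 2:
        try:
          print("Grabbing " + id_url[1] + " into " + id_url[0] + " recursively...")
          if len(jobs) >= max_jobs:
            # wait for the oldest job to finish before starting another
            jobs[0].join()
            del jobs[0]
          p = multiprocessing.Process(target=getSiteRecursive, args=(id_url[0], id_url[1], 2))
          jobs.append(p)
          p.start()
        except Exception as e:
          print("Error for " + id_url[1] + ": " + str(e))
  # wait for the remaining jobs
  for j in jobs:
    j.join()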
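The question's preferred outcome was a single txt file per ID rather than a directory. A minimal sketch of that extra step, assuming the recursive download has already filled a directory named after the ID: walk the directory and concatenate every HTML file into one text file. The function name merge_downloads, the separator line, and the extension filter are illustrative choices, not part of the original answer. Note that wget only guarantees an .html suffix on saved pages when run with -E (--adjust-extension), so without that flag some pages may be skipped by the filter.

```python
import os

def merge_downloads(site_dir, out_path, extensions=(".html", ".htm")):
    """Concatenate every HTML file found under site_dir into out_path.

    Returns the number of files merged. out_path should live outside
    site_dir so the merged file is not picked up by the walk itself.
    """
    merged = 0
    with open(out_path, "wb") as out:
        for root, dirs, files in os.walk(site_dir):
            dirs.sort()  # visit subdirectories in a predictable order
            for name in sorted(files):
                if not name.lower().endswith(extensions):
                    continue  # skip images, stylesheets, and so on
                path = os.path.join(root, name)
                # separator line so each page's origin stays identifiable
                header = "\n==== " + os.path.relpath(path, site_dir) + " ====\n"
                out.write(header.encode("utf-8"))
                with open(path, "rb") as page:
                    out.write(page.read())
                merged += 1
    return merged
```

For the sample list, calling merge_downloads("id_345", "id_345.txt") after the download finishes would produce one text file for that ID. Files are copied as raw bytes, so no assumption is made about each page's encoding.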

Use Python to save a single page into a file named after the ID:

import urllib.request
input_file = "site_list.txt"
# open the site list file
with open(input_file) as f:
  # loop through lines
  for line in f:
    # split out the id and url
    id_url = line.split()
    if len(id_url) < 2:
      # skip blank or incomplete lines
      continue
    print("Grabbing " + id_url[1] + " into " + id_url[0] + ".html...")
    try:
      # try to get the web page
      u = urllib.request.urlopen(id_url[1])
      # save the GET response data to the id file (with ".html" appended)
      with open(id_url[0] + ".html", "wb") as localFile:
        localFile.write(u.read())
      print("got " + id_url[0] + "!")
    except Exception:
      print("Could not get " + id_url[0] + "!")

Example site_list.txt:

id_345  http://www.stackoverflow.com
id_367  http://stats.stackexchange.com

Output:

Grabbing http://www.stackoverflow.com into id_345.html...
got id_345!
Grabbing http://stats.stackexchange.com into id_367.html...
got id_367!

Directory listing:

get_urls.py
id_345.html
id_367.html
site_list.txt
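Both scripts depend on splitting each line of the list into an ID and a URL. As a rough sketch of that step in isolation, the helper below yields (id, url) pairs and skips lines that do not fit the format. The name parse_site_list and the restriction to http/https URLs are assumptions made for illustration, not something the original answer specifies.

```python
from urllib.parse import urlparse

def parse_site_list(lines):
    """Yield (site_id, url) pairs from lines such as 'id_345  http://...'."""
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) < 2:
            continue  # blank line, or an id without a url
        site_id, url = fields[0], fields[1]
        # assumption: only http and https urls are worth fetching
        if urlparse(url).scheme not in ("http", "https"):
            continue
        yield site_id, url
```

Because an open file iterates line by line, the helper can be fed directly with the result of open("site_list.txt").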

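If you prefer the command line or a shell script, you can use awk, which splits on whitespace by default, to read each line, pipe the result into a loop, and execute it with backticks: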

awk '{print "wget -O " $1 ".html " $2}' site_list.txt | while read line ; do `$line` ; done

Breaking it down...

awk '{print "wget -O " $1 ".html " $2}' site_list.txt |

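Use the awk tool to read each line of site_list.txt and split it on whitespace (the default) into variables ($1, $2, $3, etc.), so that your ID is in $1 and your URL is in $2.
Add the awk print command to construct the call to wget.
Add the pipe operator | to send the output to the next command.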

Next comes the wget call:

while read line ; do `$line` ; done

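Loop over the previous command's output line by line, store each line in the $line variable, and execute it with the backtick operator, which interprets the text and runs it as a command. Strictly speaking, backticks also substitute the command's standard output back into the shell; this works here because wget writes its progress messages to standard error, so there is nothing left to run.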
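For readers more at home in Python than in shell, the same two stages (build one wget call per line, then run it) can be sketched as follows. The function names build_wget_commands and as_shell_text are made up for this illustration. Each command is kept as a list of arguments, so nothing needs to be re-parsed by a shell, and shlex.quote is only used to show what the equivalent shell line would look like.

```python
import shlex

def build_wget_commands(lines):
    """Mirror the awk stage: one 'wget -O <id>.html <url>' command per line."""
    commands = []
    for line in lines:
        fields = line.split()  # like awk: $1 is the id, $2 is the url
        if len(fields) >= 2:
            commands.append(["wget", "-O", fields[0] + ".html", fields[1]])
    return commands

def as_shell_text(command):
    """Render an argument list as a safely quoted shell command line."""
    return " ".join(shlex.quote(part) for part in command)
```

Running the commands is then a matter of passing each list to subprocess.run(command), which plays the role of the while loop.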
