linux - wget spider returns every URL twice: where is the error?

Question

I was looking for a script to build a URL list for a sitemap and found this:

wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
  > urllist.txt

The result is:

http://sld.tld/
http://sld.tld/
http://sld.tld/home.html
http://sld.tld/home.html
http://sld.tld/news.html
http://sld.tld/news.html
...

Every URL entry is saved twice. How should the script be changed to fix this?

score 0 · Accepted Answer

If you look at wget's output when the --spider flag is used, it looks like this:

Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK

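wget first checks whether the link exists (so it prints a line starting with --), then has to download the page to look for further links (hence the second -- line). That is why each URL shows up twice when you use --spider.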
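As a rough illustration of that effect, the snippet below runs the filter from the question over a few of the log lines quoted above, without touching the network. The variable names log and urls are only placeholders for this sketch.

```shell
# Sample wget --spider log lines, copied from the output quoted above.
log='Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Remote file exists and could contain links to other resources -- retrieving.

--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.'

# Same filter as in the question: keep lines that start with "--"
# and print the third whitespace-separated field (the URL).
urls=$(printf '%s\n' "$log" | grep '^--' | awk '{ print $3 }')
printf '%s\n' "$urls"
```

Both -- lines carry the same URL, so the filter emits it once for the existence check and once for the retrieval.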

Compare that with the output without --spider:

Location: http://www.google.com/intl/en/about/products/ [following]
--2013-04-12 22:00:49--  http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.

So you only get one line that starts with --.

You can remove the --spider flag, but you may still get duplicates. If you really don't want any duplicates, add | sort | uniq to the command:

wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
  | sort | uniq > urllist.txt
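Note that sort | uniq (or the equivalent sort -u) reorders the list alphabetically. If the crawl order matters for the sitemap, one possible alternative is an awk filter that keeps only the first occurrence of each line. This is a minimal sketch; dedupe_urls is a hypothetical helper name, and the sample input simply mirrors the duplicated list from the question.

```shell
# dedupe_urls: hypothetical helper; prints each input line once,
# in the order it was first seen (no sorting involved).
dedupe_urls() {
  awk '!seen[$0]++'
}

# Sample input mirroring the duplicated list from the question.
printf '%s\n' \
  'http://sld.tld/' \
  'http://sld.tld/' \
  'http://sld.tld/home.html' \
  'http://sld.tld/home.html' \
  'http://sld.tld/news.html' \
  'http://sld.tld/news.html' \
  | dedupe_urls
```

In the real pipeline the function would take the place of sort | uniq, just before the redirect to urllist.txt.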
