If you look at wget's output when you use the --spider flag, it looks like this:
Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03-- http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.
--2013-04-12 22:01:03-- http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
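It first checks whether the link exists (which prints one line starting with --), then has to download the page to look for additional links (which prints the second line starting with --). This is why each URL appears at least twice when you use --spider.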
Compare that with the output without --spider:
Location: http://www.google.com/intl/en/about/products/ [following]
--2013-04-12 22:00:49-- http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
So you only get one line that starts with --.
You can remove the --spider flag, but you may still get duplicates. If you really don't want any duplicates, add | sort | uniq to your command:
wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
| sort | uniq > urllist.txt
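
To check the filtering stage without touching the network, the same pipeline can be run against a saved or hand-written log. The following is only a sketch: extract_urls is a hypothetical helper name, the style.css line is made up for illustration, grep -E is used in place of the escaped alternation above, and sort -u stands in for sort | uniq.

```shell
#!/bin/sh
# Hypothetical helper: read wget log text on stdin, print unique page URLs.
extract_urls() {
    grep '^--' \
        | awk '{ print $3 }' \
        | grep -Ev '\.(css|js|png|gif|jpg|ico|txt)$' \
        | sort -u
}

# Sample log in the same shape as the --spider output shown earlier.
# The style.css entry is invented to show the extension filter at work.
sample_log='--2013-04-12 22:01:03-- http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
--2013-04-12 22:01:03-- http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
--2013-04-12 22:01:04-- http://example.com/style.css'

# The duplicated URL collapses to one line; the .css URL is dropped.
printf '%s\n' "$sample_log" | extract_urls
```

With this sample, the only line printed is http://www.google.com/intl/en/about/products/. In real use, the same function could be fed from wget's stderr, as in the command above.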