
The goal

I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.

The command

This is almost perfect for my needs (...but not quite):

wget --mirror -nH -np -p -k -E -e robots=off http://mysite

What this does do

  • --mirror : Recursively download the entire site
  • -p : Download all necessary page requisites
  • -k : Convert the URLs to relative paths so I can host them anywhere

What this doesn't do

  • Prevent duplicate downloads
  • Maintain (exactly) the same URL structure

The problem

Some files are being downloaded more than once, which results in both myfile.html and myfile.1.html. This wouldn't be so bad, except that when wget rewrites the hyperlinks it points them at the myfile.1.html version, which changes the URLs and therefore has SEO consequences (Google will index ugly-looking URLs).

The -nc option would prevent this, but as of wget v1.13 I cannot use -k and -nc at the same time. Details are here.

Help?!

I was hoping to use wget, but I am now considering looking into using another tool, like httrack, but I don't have any experience with that yet.

Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!


2 Answers


httrack got me most of the way there; the only URL modification it made was to point links at /folder/index.html instead of /folder/.

Neither httrack nor wget seemed to produce a perfect URL structure, so we ended up writing a small bash script that runs the crawler and then uses sed to clean up some of the URLs (cropping index.html from links, replacing bla.1.html with bla.html, and so on); a sketch of that approach follows.
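For reference, here is a minimal sketch of what such a crawl-then-clean script could look like. It assumes the mirror is written to ./mysite, that the crawler is the wget invocation from the question, and that GNU sed is available; the paths and sed patterns are illustrative assumptions, not the exact script referred to above.

#!/usr/bin/env bash
# Hypothetical crawl-then-clean sketch; paths and patterns are assumptions.
set -euo pipefail

SITE_DIR="mysite"

# Step 1: run the crawler (httrack would slot in here just as well).
wget --mirror -nH -np -p -k -E -e robots=off -P "$SITE_DIR" http://mysite

# Step 2: post-process the rewritten links in every HTML file:
#   - crop index.html from links:  href="/folder/index.html" -> href="/folder/"
#   - collapse duplicate names:    href="bla.1.html"         -> href="bla.html"
find "$SITE_DIR" -name '*.html' -print0 | while IFS= read -r -d '' f; do
    sed -i \
        -e 's|/index\.html"|/"|g' \
        -e 's|\.1\.html"|.html"|g' \
        "$f"
done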

Answered on 2013-08-19T03:03:00.450
0

wget description and help

According to this (and a quick experiment of my own), you should have no problem using the -nc and -k options together to gather the pages you are after.

What will cause an issue is using -N with -nc (the two are flatly incompatible), so you won't be able to compare files by timestamp and still no-clobber them; and the --mirror option inherently includes -N.

Rather than using --mirror, try replacing it with "-r -l inf", which enables recursive downloading to infinite depth while still allowing your other options to work.
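For context, the wget manual documents --mirror as shorthand for a handful of options, which is where the implicit -N comes from:

# Per the wget manual, --mirror is currently equivalent to:
#   -r -N -l inf --no-remove-listing
# Spelling the options out and dropping -N avoids the timestamping
# conflict with -nc.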

An example, based on your original:

wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite

Notes: I would suggest adding -w 5 --random-wait --limit-rate=200k in order to avoid DoSing the server and be a little less rude, but obviously that is up to you.
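Putting those throttling flags together with the example above yields something like the following (the wait and rate values are just the ones suggested; tune them for your target server):

wget -r -l inf -k -nc -nH -p -E -e robots=off -w 5 --random-wait --limit-rate=200k http://yoursite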

Generally speaking, I try to avoid option groupings like --mirror, because conflicts like this are harder to trace.

I know this is an answer to a very old question, but I think it should be addressed: wget is a new command for me, but so far it is proving invaluable, and I hope others will find it the same.

Answered on 2015-01-25T21:15:31.033