The goal
I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.
The command
This is almost perfect for my needs (...but not quite):
wget --mirror -nH -np -p -k -E -e robots=off http://mysite
What this does do
--mirror
: Recursively download the entire site
-p
: Download all necessary page requisites
-k
: Convert the URLs to relative paths so I can host them anywhere
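For reference, here is the same command spelled out with wget's long-form option names (these are the documented equivalents of the short flags above):

# Long-form equivalents of the short flags, per the wget manual:
#   -nH = --no-host-directories  (don't create a top-level hostname directory)
#   -np = --no-parent            (never ascend above the start URL)
#   -p  = --page-requisites
#   -k  = --convert-links
#   -E  = --adjust-extension
#   -e  = --execute
wget --mirror --no-host-directories --no-parent --page-requisites \
     --convert-links --adjust-extension --execute robots=off http://mysite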
What this doesn't do
- Prevent duplicate downloads
- Maintain (exactly) the same URL structure
The problem
Some things are being downloaded more than once, which results in both myfile.html and myfile.1.html. This wouldn't be bad, except that when wget rewrites the hyperlinks, it writes them pointing at the myfile.1.html version, which changes the URLs and therefore has SEO considerations (Google will index ugly-looking URLs).
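One workaround I'm considering is a post-processing pass that deletes the numbered duplicates and points the links back at the originals. A rough sketch, assuming GNU find/grep/sed and that the .1.html files are byte-identical copies of the un-numbered files (matching on basenames alone could over-rewrite if two directories contain files with the same name):

cd mysite-mirror   # wherever wget put the mirror
find . -name '*.1.html' | while read -r dup; do
  orig="${dup%.1.html}.html"
  # only touch true duplicates: same bytes as the un-numbered file
  if [ -f "$orig" ] && cmp -s "$dup" "$orig"; then
    rm "$dup"
    # point any links wget rewrote at the duplicate back at the original
    # (the dots in the pattern are regex wildcards; good enough for a quick pass)
    grep -rl --include='*.html' "$(basename "$dup")" . |
      xargs -r sed -i "s/$(basename "$dup")/$(basename "$orig")/g"
  fi
done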
The -nc option would prevent this, but as of wget v1.13 I cannot use -k and -nc at the same time. Details for this are here.
Help?!
I was hoping to use wget, but I am now considering another tool, such as httrack, though I don't have any experience with it yet.
Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!
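For what it's worth, from skimming the httrack manual, I think the rough equivalent would be something like this (untested; -O is the output directory, -s0 disables robots.txt handling, and the "+..." filter is my guess at keeping the crawl on-site):

httrack "http://mysite/" -O ./mysite-mirror "+*mysite/*" -s0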