I think by default your key can be something like 1000 bytes. Are you really going to have URLs larger than that? Worst comes to worst, I'm pretty sure that's a hardcoded constant you could change.
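If you're still worried about the limit, one common workaround is to key on a hash of the URL instead of the URL itself. Here's a minimal sketch (the `url_key` name is mine, not from any particular store's API); the trade-off is that you can't recover the URL from the key, so keep the URL in the value if you need it:

```python
import hashlib

def url_key(url: str) -> bytes:
    """Derive a fixed-size key from a URL of any length.

    Hashing sidesteps any key-size limit in the underlying store:
    every key is exactly 32 bytes, no matter how long the URL is.
    """
    return hashlib.sha256(url.encode("utf-8")).digest()

key = url_key("https://example.com/some/very/long/path?with=query&strings=1")
print(len(key))  # 32 bytes, regardless of URL length
```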
On your other points:
There are lots of content synonyms, and you don't know this by crawling just one page.
- Huh? Do you mean that a site might be duplicated, with only minor differences in content centered on keyphrases, and you want to avoid indexing those copies?
What to do for HTTP 301, 302, 303, 307, etc. Store the original URL or the new location? This is especially an issue for link shorteners.
- I would store the destination. What if someone has shortened the same destination multiple times? What if the shortened link expires, or the shortener is taken offline? Both seem far more likely to happen to the short link than to the destination URL.
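In other words: follow the redirect chain before indexing and key on where it ends. A minimal sketch of that logic, with the network stubbed out by a dict (the `fetch_status` helper and the example URLs are made up for illustration):

```python
# Stub for a real HTTP fetch: maps a URL to (status_code, Location header).
REDIRECTS = {
    "http://short.example/abc": (301, "http://short.example/xyz"),
    "http://short.example/xyz": (302, "http://example.com/article"),
    "http://example.com/article": (200, None),
}

def fetch_status(url):
    return REDIRECTS.get(url, (404, None))

def resolve(url, max_hops=10):
    """Follow 3xx responses and return the final destination URL."""
    for _ in range(max_hops):
        status, location = fetch_status(url)
        if status in (301, 302, 303, 307, 308) and location:
            url = location
        else:
            return url
    raise RuntimeError("too many redirects starting from %s" % url)

print(resolve("http://short.example/abc"))  # http://example.com/article
```

Capping the hop count matters in practice: shorteners pointing at shorteners can loop.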
"The last.fm" problem. lastfm.com == last.fm ~= lastfm.it (etc.) and the site doesn't use a 30x result code to indicate. It just serves the content from multiple domains.
- Could you write a simple algorithm to flag domains that might be the same site? Strip the punctuation and last.fm and lastfm.com reduce to "lastfm" and "lastfmcom", which share their first six characters. If you also stored a bit of metadata, you could check whether a high-similarity domain match is actually serving an identical document.
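A rough sketch of that check using stdlib string similarity. The 0.75 threshold is an arbitrary guess, and stripping punctuation alone is crude; a real crawler would drop the TLD using the Public Suffix List before comparing:

```python
import difflib

def domain_core(domain: str) -> str:
    """Reduce a domain to lowercase alphanumerics, e.g. 'last.fm' -> 'lastfm'.

    Crude on purpose: it keeps the TLD, so 'lastfm.com' becomes 'lastfmcom'.
    Stripping the TLD properly requires the Public Suffix List.
    """
    return "".join(c for c in domain.lower() if c.isalnum())

def similar_domains(a: str, b: str, threshold: float = 0.75) -> bool:
    """Flag two domains as possibly the same site (threshold is a guess)."""
    ratio = difflib.SequenceMatcher(None, domain_core(a), domain_core(b)).ratio()
    return ratio >= threshold

print(similar_domains("last.fm", "lastfm.com"))   # True
print(similar_domains("last.fm", "lastfm.it"))    # True
print(similar_domains("last.fm", "example.org"))  # False
```

A hit here is only a candidate: you'd then use the stored metadata (content hash, title, length) to confirm the two domains really serve the same document.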
Given any URL that may or may not be in the database, let me query, with reasonable accuracy, whether I've previously crawled that document.
- See the previous point.
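The mechanical half of that lookup is cheap once you settle on a canonical form for URLs. A sketch with an in-memory set standing in for the database (the normalization rules here, lowercasing host and scheme, dropping the fragment, trimming a trailing slash, are assumptions you'd tune for your crawler):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonical(url: str) -> str:
    """Normalize a URL before lookup: lowercase scheme and host,
    drop the fragment, strip a trailing slash on the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

class CrawlIndex:
    """In-memory stand-in for the crawl database, keyed by URL hash."""
    def __init__(self):
        self._seen = set()

    def mark_crawled(self, url: str) -> None:
        self._seen.add(hashlib.sha256(canonical(url).encode()).digest())

    def was_crawled(self, url: str) -> bool:
        return hashlib.sha256(canonical(url).encode()).digest() in self._seen

idx = CrawlIndex()
idx.mark_crawled("http://Example.com/a/")
print(idx.was_crawled("http://example.com/a"))  # True
print(idx.was_crawled("http://example.com/b"))  # False
```

The "reasonable accuracy" part is exactly the synonym/redirect/last.fm discussion above: two URLs can canonicalize differently and still be the same document, so this check gives you "definitely seen this URL", not "definitely seen this content".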
Hope this helps!