
I'm writing a crawler that will store its results in a database (MongoDB).

Of course, using the URL as one possible query parameter is important. However, it has problems:

  • URLs can be very long, and MongoDB's maximum key length is limited.
  • There are lots of content synonyms, and you won't know this by crawling just one page.
  • What to do about HTTP 301, 302, 303, 307, etc.? Store the original URL or the new location? This is especially an issue for link shorteners.
  • The "last.fm" problem: lastfm.com == last.fm ~= lastfm.it (etc.), and the site doesn't use a 30x result code to indicate this. It just serves the content from multiple domains.
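The long-URL and domain-alias problems above can both be reduced by normalizing a URL before using it as a key. A minimal sketch (the alias table here is hypothetical — actually discovering which domains serve identical content is the hard part):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical alias map: domains known to serve identical content.
DOMAIN_ALIASES = {
    "lastfm.com": "last.fm",
    "lastfm.it": "last.fm",
}

def canonicalize(url: str) -> str:
    """Return a normalized form of `url` suitable for use as a dedup key."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is already lowercased
    host = DOMAIN_ALIASES.get(host, host)
    # Lowercase the scheme, sort query params, and default the path so
    # trivially different spellings map to the same key. (This sketch
    # drops the port and fragment entirely.)
    query = urlencode(sorted(parse_qsl(parts.query)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))
```

With this, `HTTP://LastFM.com/music?b=2&a=1` and `http://last.fm/music?a=1&b=2` canonicalize to the same string before they ever reach the database.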

Goals for the database:

  • Given any URL that may or may not be in the database, let me query to find out whether I've previously crawled that document, with reasonable accuracy.

Of course, any scheme other than "just go crawl it, store the exact URL, and don't worry about duplicates" will have some false positives. A false positive is a URL I believe is the same as a previously crawled one but is actually different.


1 Answer


I think by default your key can be something like 1000 bytes. Are you really going to have URLs larger than that? Worst comes to worst, I'm pretty sure this is a hard-coded constant that you could change.
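Even without touching that constant, the length limit disappears if you key on a hash of the URL instead of the URL itself — a digest is fixed-length no matter how long the URL gets, and you can keep the original URL in the document body. A sketch:

```python
import hashlib

def url_key(url: str) -> str:
    # A SHA-256 digest is always 64 hex characters, regardless of URL
    # length, so the database key-size limit never comes into play.
    # Store the original URL alongside it in the document if you need
    # to recover it later.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```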

On your other points:

There are lots of content synonyms, and you don't know this by crawling just one page. - Huh? Do you mean that a site might be duplicated, with only nuanced differences in content focused around key phrases, and you want to avoid indexing those?
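If that is the concern, one common approach (not specific to this answer — a general technique) is to fingerprint the page *content* as well as the URL, so that the same document served under different URLs collapses to one record. A crude sketch, assuming the markup has already been stripped:

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    """Collapse whitespace and case, then hash, so two pages whose
    visible text is effectively identical map to the same fingerprint
    even when served from different URLs. (A real crawler would strip
    tags and boilerplate first; this is only a sketch.)"""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

Near-duplicates with small keyword swaps would still slip past an exact hash; catching those needs a similarity hash (e.g. shingling), which is a bigger project.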

What to do for HTTP 301, 302, 303, 307, etc. Store the original URL or the new location? This is especially an issue for link shorteners. - I would think the destinations... what if someone has shortened the same destination multiple times? What if the shortened link expires, or the shortener is taken offline? I would think those are far more likely than the same thing happening with the destination URL.
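Storing the destination means following the redirect chain to its end before recording anything. The logic is the same whether the 30x responses come from live HTTP or not; here the chain is represented as a plain dict so the sketch is self-contained (a real crawler would get each hop from the `Location` header):

```python
def resolve_final(url, redirects, max_hops=10):
    """Follow a redirect chain to its final destination.

    `redirects` maps a URL to the URL its 30x response points at;
    a URL absent from the map is a final destination. Raises on
    loops or overly long chains."""
    seen = set()
    for _ in range(max_hops):
        if url not in redirects:
            return url          # no redirect: this is the destination
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
        url = redirects[url]
    raise ValueError("too many redirects")
```

Keeping the original (shortened) URL as a secondary field is still worthwhile, so a later query for the short link can be answered without re-fetching it.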

"The last.fm" problem. lastfm.com == last.fm ~= lastfm.it (etc.) and the site doesn't use a 30x result code to indicate it. It just serves the content from multiple domains. - Could you write a simple algorithm to check for domains that might be similar? last.fm shares most of its characters with lastfm.com, and both begin with "last". If you were to also store a bit of metadata, you could check whether a match with a high level of relevance is likely to be an identical document.
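One concrete way to score that kind of resemblance, offered here as an illustration rather than anything from the original answer, is Python's stdlib `difflib.SequenceMatcher`, which returns a similarity ratio in [0, 1] based on matching character runs:

```python
from difflib import SequenceMatcher

def domain_similarity(a: str, b: str) -> float:
    """Rough similarity score between two domain strings, 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

Candidate pairs scoring above some threshold could then be compared on stored metadata (title, content fingerprint) to decide whether they really are the same site.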

Given any URL that may or may not be in the database, let me query to find out if I've previously crawled that document, with reasonable accuracy. - See the last point.
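Pulling the threads together, the "have I seen this?" query reduces to: normalize, hash, look up. A self-contained sketch using an in-memory set as a stand-in for the MongoDB collection (in the real thing, make the hash the `_id` of each document and the lookup becomes an indexed `find_one`):

```python
import hashlib

class SeenStore:
    """In-memory stand-in for the crawl database: members are hashes of
    a (crudely) normalized URL. The normalization here is deliberately
    minimal; a production version would plug in a full canonicalizer."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def key(url: str) -> str:
        return hashlib.sha256(url.strip().lower().encode("utf-8")).hexdigest()

    def mark_crawled(self, url: str) -> None:
        self._seen.add(self.key(url))

    def already_crawled(self, url: str) -> bool:
        return self.key(url) in self._seen
```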

Hope this helps!

Answered 2011-04-12T23:27:55.347