12

I want to crawl for specific things. Specifically events that are taking place like concerts, movies, art gallery openings, etc, etc. Anything that one might spend time going to.

How do I implement a crawler?

I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/)

Are there others?

What opinions does everyone have?

-Jason


10 Answers

10

An excellent introductory text on the topic is Introduction to Information Retrieval (available in full online). It has a chapter on web crawling, but perhaps more importantly it lays the groundwork for what you will want to do with the documents once you have crawled them.

Introduction to Information Retrieval
(source: stanford.edu)

answered 2009-05-14T21:59:40.647
8

There's a good book on the subject I can recommend called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL.

answered 2009-04-08T00:07:21.780
5

Whatever you do, please be a good citizen and obey the robots.txt file. You may also want to check out the references on focused crawlers on the Wikipedia page. I just realized I know one of the authors of Topical Web Crawlers: Evaluating Adaptive Algorithms. Small world.
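As a concrete illustration of the robots.txt point: Python's standard library ships `urllib.robotparser`, so a crawler can check every URL before fetching it. This is a minimal sketch; the user-agent name "EventBot" and the rules shown are made-up examples, and a real crawler would load the file from the site with `set_url()`/`read()` rather than parsing an inline list:

```python
from urllib.robotparser import RobotFileParser

# Rules as they might appear in a site's robots.txt (made-up example);
# a real crawler would do rp.set_url("http://example.com/robots.txt"); rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check permission before every fetch.
print(rp.can_fetch("EventBot", "http://example.com/events"))     # True
print(rp.can_fetch("EventBot", "http://example.com/private/x"))  # False
```

Checking `can_fetch()` on each candidate URL (and honoring any Crawl-delay the site declares) is the cheapest way to stay a good citizen.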

answered 2009-04-08T02:49:04.590
4

Check out Scrapy. It is an open-source web crawling framework written in Python (I have heard it is similar to Django, except that instead of serving pages it downloads them). It is easily extensible, distributed/parallel, and looks very promising.

I would use Scrapy, because that way I could save my strength for the more tedious problems, such as how to extract the right data from the scraped content and insert it into a database.
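Whichever framework does the fetching, the "insert into a database" half of that work stays small. A minimal sketch with Python's built-in sqlite3; the table layout and the field names (title, venue, starts) are assumptions for illustration, not anything a framework prescribes:

```python
import sqlite3

def store_events(conn, events):
    """Insert scraped event dicts (assumed keys: title, venue, starts)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (title TEXT, venue TEXT, starts TEXT)"
    )
    # executemany with named placeholders maps each dict onto one row.
    conn.executemany(
        "INSERT INTO events VALUES (:title, :venue, :starts)", events
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
store_events(conn, [
    {"title": "Jazz concert", "venue": "Blue Note", "starts": "2009-05-01T20:00"},
    {"title": "Gallery opening", "venue": "MOMA", "starts": "2009-05-02T18:00"},
])
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 2
```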

answered 2009-06-09T21:08:36.050
3

I think the web crawler part will be the easiest part of the task. The hard part will be deciding which sites to visit and how to discover events on the sites you want to visit. Maybe you should look into using the Google or Yahoo APIs to get the data you want. They have already done the work of crawling huge numbers of pages on the Internet; in my opinion, you could then focus on the much trickier problem of sifting through the data for the events you are looking for.

answered 2009-04-08T01:01:17.283
2

Actually writing a scaled, directed crawler is quite a challenging task. I implemented one at work and maintained it for quite a while. There are a lot of problems that you don't know exist until you write one and hit them, specifically dealing with CDNs and friendly crawling of sites. Adaptive algorithms are very important, or you will trip DOS filters. In fact you will anyway, without knowing it, if your crawl is big enough.

Things to think about:

  • What's an acceptable throughput?
  • How do you deal with site outages?
  • What happens if you are blocked?
  • Do you want to engage in stealth crawling (controversial and actually quite hard to get right)?
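The adaptive-algorithm and throughput points above can be sketched as a per-host politeness scheduler that backs off after failures and slowly relaxes after successes. The class name and the delay constants here are illustrative, not taken from any particular crawler:

```python
import time
from collections import defaultdict

class PoliteScheduler:
    """Per-host adaptive delay: double it after a failure, halve after success."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base = base_delay
        self.max = max_delay
        self.delay = defaultdict(lambda: base_delay)   # current delay per host
        self.next_ok = defaultdict(float)              # earliest next-fetch time

    def record(self, host, ok):
        if ok:
            self.delay[host] = max(self.base, self.delay[host] / 2)
        else:
            self.delay[host] = min(self.max, self.delay[host] * 2)
        self.next_ok[host] = time.monotonic() + self.delay[host]

    def ready(self, host):
        return time.monotonic() >= self.next_ok[host]

sched = PoliteScheduler()
for _ in range(8):                     # repeated failures push the delay up...
    sched.record("example.com", ok=False)
print(sched.delay["example.com"])      # 60.0 (capped at max_delay)
```

Backing off exponentially when a site starts erroring is exactly the sort of adaptive behavior that keeps a big crawl from looking like a denial-of-service attack.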

I have actually written some stuff up that, if I ever get around to it, I might put online about crawler construction, since building a proper one is much tougher than people will tell you. Most of the open-source crawlers work well enough for most people, so if you can, I recommend you use one of those. Which one is a feature/platform choice.

answered 2009-07-31T21:52:54.287
1

If you find that crawling is becoming a chore, you might consider building an RSS aggregator instead and subscribing to the RSS feeds of popular event sites like craigslist and coming.org.

Each of those sites provides localized, searchable events. RSS gives you a (small) number of standardized formats, instead of having to deal with all the malformed HTML that makes up the web...
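Because RSS is one of those few standardized formats, extracting events from a feed takes only a few lines. A sketch using Python's standard-library XML parser on a made-up feed (the item titles and URLs are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 fragment standing in for a real event feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Jazz concert</title><link>http://example.com/events/1</link></item>
  <item><title>Gallery opening</title><link>http://example.com/events/2</link></item>
</channel></rss>"""

def parse_rss_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

print(parse_rss_items(SAMPLE_FEED)[0])  # ('Jazz concert', 'http://example.com/events/1')
```

Compare that to scraping the same data out of arbitrary HTML, and the appeal of feed aggregation over crawling is obvious.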

There are open-source libraries such as ROME (Java) that can help with consuming RSS feeds.

answered 2009-04-08T05:16:23.720
0

Following up on Kevin's suggestion of RSS feeds, you may want to check out Yahoo Pipes. I haven't tried them yet, but I think they let you process several RSS feeds and generate web pages or more RSS feeds.

answered 2009-05-14T21:40:24.110
0

Is there a language-specific requirement?

I spent some time playing with the Chilkat spider libs for .NET a while back for a personal experiment.

Last I checked there, the spider libs were licensed as freeware (although not open source, as far as I can tell :( )

It looks like they have a Python lib too.

http://www.example-code.com/python/pythonspider.asp #Python
http://www.example-code.com/csharp/spider.asp #.Net

answered 2009-04-08T02:07:24.517
0

Nutch crawler

answered 2009-06-15T19:45:51.483