
I'm looking for an implementation of a web crawler or link scraper in C# that I can modify to suit our needs. We need something we can run on demand to spider a list of our websites and keep an eye out for certain links. The spider doesn't need to store copies of the sites, download images or anything of the sort -- it just needs to report back any pages that link to sites matching a small list of substrings.

I've seen crawler implementations like arachnode.net (and a myriad of other examples), but they all contain a massive amount of code revolving around saving the content. We don't need to do that. We just need to parse every linked page and report back any that contain a link meeting certain criteria (a simple substring match against the link's URL).

Can anyone recommend a framework or example that might help me get started? There seem to be a number of ways to do it (especially with .NET 4 and the HTML Agility Pack), but since we'll need to run it on a regular schedule, a high-performance threaded or parallel implementation is important.

[edit]

I may have been unclear -- this will have to run on the desktop, not as part of an ASP.NET website. The company-owned sites span many domains, servers, and geographic locations, so it can't be a server-side solution.
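For concreteness, here's a minimal sketch of the kind of crawler I have in mind, using the HTML Agility Pack and .NET 4's Parallel.ForEach. The seed URL, the watch list, and the same-host restriction are all placeholder assumptions:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    class LinkWatcher
    {
        // Hypothetical watch list: report any page containing a link whose
        // URL includes one of these substrings.
        static readonly string[] WatchList = { "badsite.example", "tracker.example" };

        static void Main()
        {
            // Hypothetical seed URLs for the sites to spider.
            var seeds = new[] { new Uri("http://www.example.com/") };

            var seen = new ConcurrentDictionary<string, bool>();
            var frontier = new ConcurrentQueue<Uri>(seeds);

            while (!frontier.IsEmpty)
            {
                // Drain the queue into a batch, skipping pages already visited.
                var batch = new List<Uri>();
                Uri queued;
                while (frontier.TryDequeue(out queued))
                    if (seen.TryAdd(queued.AbsoluteUri, true))
                        batch.Add(queued);

                // Fetch and parse the batch in parallel (.NET 4 TPL).
                Parallel.ForEach(batch, page =>
                {
                    string html;
                    try
                    {
                        using (var client = new WebClient())
                            html = client.DownloadString(page);
                    }
                    catch (WebException)
                    {
                        return; // skip unreachable pages
                    }

                    var doc = new HtmlDocument();
                    doc.LoadHtml(html);

                    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                    if (anchors == null) return; // page has no links

                    foreach (var a in anchors)
                    {
                        Uri link;
                        if (!Uri.TryCreate(page, a.GetAttributeValue("href", ""), out link))
                            continue;

                        // The actual report: page -> offending link.
                        if (WatchList.Any(s => link.AbsoluteUri.Contains(s)))
                            Console.WriteLine("{0} -> {1}", page, link);

                        // Follow only http(s) links on the same host; a real run
                        // would whitelist all company-owned domains instead.
                        if ((link.Scheme == Uri.UriSchemeHttp || link.Scheme == Uri.UriSchemeHttps)
                            && link.Host == page.Host)
                            frontier.Enqueue(link);
                    }
                });
            }
        }
    }

Batching the frontier between Parallel.ForEach passes keeps the threading simple; a long-running producer/consumer setup (e.g. BlockingCollection) would keep connections busier if the site list is large.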


1 Answer


Might the SEO namespace help you here? The WebCrawler class could be what you're looking for:

http://msdn.microsoft.com/en-us/library/microsoft.web.management.seo.crawler.webcrawler(v=VS.90).aspx
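Something along these lines might work once the IIS SEO Toolkit is installed (the start URL is a placeholder, and the member names should be double-checked against the page above):

    using System;
    using System.Threading;
    using Microsoft.Web.Management.SEO.Crawler; // ships with the IIS SEO Toolkit

    class Program
    {
        static void Main()
        {
            // Placeholder start URL; CrawlerSettings and WebCrawler are the
            // types documented on the MSDN page linked above.
            var settings = new CrawlerSettings(new Uri("http://www.example.com/"));
            settings.Name = "link audit " + DateTime.Now.ToString("yyyy-MM-dd HH-mm");

            var crawler = new WebCrawler(settings);
            crawler.Start();

            // Poll until the crawl finishes; the crawler's report can then be
            // scanned for links matching your watch list.
            while (crawler.IsRunning)
                Thread.Sleep(1000);
        }
    }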

answered 2012-06-10T18:20:53.167