我有这个代码:
private List<string> webCrawler(string url, int levels)
{
HtmlAgilityPack.HtmlDocument doc;
HtmlWeb hw = new HtmlWeb();
List<string> webSites;
List<string> csFiles = new List<string>();
csFiles.Add("temp string to know that something is happening in level = " + levels.ToString());
csFiles.Add("current site name in this level is : "+url);
doc = hw.Load(url);
webSites = getLinks(doc);
if (levels == 0)
{
return csFiles;
}
else
{
int actual_sites = 0;
for (int i = 0; i < webSites.Count() && i< 20; i++) {
string t = webSites[i];
if ( (t.StartsWith("http://")==true) || (t.StartsWith("https://")==true) ) {
actual_sites++;
csFiles.AddRange(webCrawler(t, levels - 1));
Texts(richTextBox1, "Level Number " + levels + " " + t + Environment.NewLine, Color.Red);
}
}
return csFiles;
}
}
getLinks() 是:
private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
List<string> mainLinks = new List<string>();
var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
foreach (HtmlNode link in linkNodes)
{
var href = link.Attributes["href"].Value;
mainLinks.Add(href);
}
}
return mainLinks;
}
问题是,例如,我爬进了 google.com,所以几次之后它就进入了该网站:
http://picasa.google.co.il/intl/iw/#utm_source=iw-all-more&utm_campaign=iw-pic&utm_medium=et
然后我得到了异常:
doc = hw.Load(url);
错误是:无法解析远程名称:'picasa.google.co.il'
例外是:
System.Net.WebException was unhandled
Message=The remote name could not be resolved: 'picasa.google.co.il'
Source=System
StackTrace:
at System.Net.HttpWebRequest.GetResponse()
at HtmlAgilityPack.HtmlWeb.Get(Uri uri, String method, String path, HtmlDocument doc, IWebProxy proxy, ICredentials creds) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1446
at HtmlAgilityPack.HtmlWeb.LoadUrl(Uri uri, String method, WebProxy proxy, NetworkCredential creds) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1563
at HtmlAgilityPack.HtmlWeb.Load(String url, String method) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1152
at HtmlAgilityPack.HtmlWeb.Load(String url) in C:\Source\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlWeb.cs:line 1107
at GatherLinks.Form1.webCrawler(String url, Int32 levels) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 79
at GatherLinks.Form1.webCrawler(String url, Int32 levels) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 108
at GatherLinks.Form1.webCrawler(String url, Int32 levels) in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 108
at GatherLinks.Form1..ctor() in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Form1.cs:line 31
at GatherLinks.Program.Main() in D:\C-Sharp\GatherLinks\GatherLinks\GatherLinks\Program.cs:line 18
at System.AppDomain._nExecuteAssembly(Assembly assembly, String[] args)
at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
InnerException:
我该如何修复/修复/解决它?
谢谢你。