c# - C# URL Crawler not getting enough links?

Question

I have the following code, but when I run it, it only seems to return a few URLs.

while (stopFlag != true)
{
    WebRequest request = WebRequest.Create(urlList[i]);
    using (WebResponse response = request.GetResponse())
    {
        using (StreamReader reader = new StreamReader
           (response.GetResponseStream(), Encoding.UTF8))
        {
            string sitecontent = reader.ReadToEnd();
            //add links to the list
            // process the content
            //clear the text box ready for the HTML code
            //Regex urlRx = new Regex(@"((https?|ftp|file)\://|www.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
            Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);

            MatchCollection matches = urlRx.Matches(sitecontent);
            foreach (Match match in matches)
            {
                string cleanMatch = cleanUP(match.Value);
                urlList.Add(cleanMatch);

                updateResults(theResults, "\"" + cleanMatch + "\",\n");

            }
        }
    }
}

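I think the error is in the regular expression.

What I want to achieve is to pull a web page, get all the links from that page, add those links to a list, and then fetch the next page for each list item and repeat the process.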

score 3 · Accepted Answer

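I suggest not trying to parse HTML with regular expressions and using a good HTML parser instead. The HTML Agility Pack is a good choice:

What is the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).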
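
As a rough sketch of what this could look like (not a definitive implementation), the code below extracts links with the Html Agility Pack and feeds them into a queue-based crawl loop. The names `LinkExtractor`, `Crawler`, `ExtractLinks`, `Crawl` and the `maxPages` limit are hypothetical and chosen only for illustration. It assumes the `HtmlAgilityPack` NuGet package is referenced. Note also that the loop in the question, as posted, never advances `i`, so it may keep requesting the same page; the queue and the `seen` set here are one way to avoid that and to avoid revisiting URLs.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

static class LinkExtractor
{
    // Returns absolute http/https links found in <a href="..."> elements.
    public static List<string> ExtractLinks(string html, Uri pageUri)
    {
        var links = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode anchor in anchors)
        {
            string href = HtmlEntity.DeEntitize(anchor.GetAttributeValue("href", ""));

            // Resolve relative links such as "/about" against the page URL,
            // and skip non-web schemes such as mailto: or javascript:.
            if (Uri.TryCreate(pageUri, href, out Uri absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                // GetLeftPart(UriPartial.Query) drops any "#fragment" part.
                links.Add(absolute.GetLeftPart(UriPartial.Query));
            }
        }
        return links;
    }
}

static class Crawler
{
    // Breadth-first crawl starting at 'start', fetching at most 'maxPages' pages.
    public static void Crawl(Uri start, int maxPages)
    {
        var pending = new Queue<Uri>();
        var seen = new HashSet<string> { start.AbsoluteUri };
        pending.Enqueue(start);

        using (var client = new HttpClient())
        {
            int fetched = 0;
            while (pending.Count > 0 && fetched < maxPages)
            {
                Uri page = pending.Dequeue();
                string html;
                try { html = client.GetStringAsync(page).GetAwaiter().GetResult(); }
                catch (HttpRequestException) { continue; } // skip pages that fail to load
                fetched++;

                foreach (string link in LinkExtractor.ExtractLinks(html, page))
                {
                    if (seen.Add(link)) // Add returns false for URLs already seen
                    {
                        Console.WriteLine(link);
                        pending.Enqueue(new Uri(link));
                    }
                }
            }
        }
    }
}
```

A call such as `Crawler.Crawl(new Uri("http://example.com/"), 50);` would then print each newly discovered link once. Limiting the number of fetched pages is a design choice for the sketch; a real crawler would likely also want politeness delays and a content-type check.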
