c# - How to parse a website's HTML content

Question

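I'm trying to parse the HTML of a website such as CNN.com, but every time I navigate with a WebBrowser object, the object ends up full of null values. Whenever I call the Navigate method, mywebBrowser contains only null and blank values. How do I get tagCollection populated? I tried webClient.DownloadString just to fetch the full HTML of the page, but that alone doesn't help, because I need to find all the tags and doing that by hand is very tedious. I am not using the HTML Agility Pack, and I can't use it.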

        using (WebClient webClient = new WebClient())
        {
            webClient.Encoding = Encoding.UTF8;
            HtmlString = webClient.DownloadString(textBox1.Text);
        }

        WebBrowser mywebBrowser = new WebBrowser();
        Uri address = new Uri("http://www.cnn.com/");
        mywebBrowser.Navigate(address);

        //HtmlString does contain all the HTML from Page
        mywebBrowser.DocumentText = HtmlString; 
        //DocumentText only has "<HTML></HTML>" after assignment


        HtmlDocument doc = mywebBrowser.Document;
        HtmlElementCollection tagCollection;
        tagCollection = doc.GetElementsByTagName("<div");

score 0 · Accepted Answer

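The WebBrowser class lets you do a lot without depending on any external library. What you are missing is the DocumentCompleted event, which is part of WebBrowser's basic definition: until this event fires, the page has not finished loading, so the corresponding information is wrong (or null). Also remember that GetElementsByTagName expects only the tag name (without the "<"). Sample code showing this: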

 WebBrowser mywebBrowser;
 private void Form1_Load(object sender, EventArgs e)
 {
     mywebBrowser = new WebBrowser();
     mywebBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(mywebBrowser_DocumentCompleted);

     Uri address = new Uri("http://www.cnn.com/");
     mywebBrowser.Navigate(address);
 }

 private void mywebBrowser_DocumentCompleted(Object sender, WebBrowserDocumentCompletedEventArgs e)
 {
     //The page is not completely loaded until this handler runs
     HtmlDocument doc = mywebBrowser.Document;
     HtmlElementCollection tagCollection;
     tagCollection = doc.GetElementsByTagName("div");
 }
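
Once tagCollection is populated inside the handler, reading the elements might look like the sketch below. It is only an illustration under some assumptions that are not part of the answer above: the ReadyState check is a commonly used guard, because DocumentCompleted can fire more than once on pages with frames; ScriptErrorsSuppressed and the Console.WriteLine output are there purely for convenience; and the Main method is added only so the snippet is self-contained.

```csharp
using System;
using System.Windows.Forms;

public class Form1 : Form
{
    private WebBrowser mywebBrowser;

    public Form1()
    {
        Load += Form1_Load;
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        mywebBrowser = new WebBrowser();
        // Assumption: hide the script error dialogs that busy pages may trigger.
        mywebBrowser.ScriptErrorsSuppressed = true;
        mywebBrowser.DocumentCompleted += mywebBrowser_DocumentCompleted;
        mywebBrowser.Navigate(new Uri("http://www.cnn.com/"));
    }

    private void mywebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire more than once (for example, once per frame),
        // so skip the calls that arrive before the whole document is ready.
        if (mywebBrowser.ReadyState != WebBrowserReadyState.Complete)
            return;

        HtmlElementCollection tagCollection = mywebBrowser.Document.GetElementsByTagName("div");
        Console.WriteLine("Found {0} div elements", tagCollection.Count);

        foreach (HtmlElement div in tagCollection)
        {
            // "className" is the name the DOM uses for the HTML class attribute.
            Console.WriteLine("id='{0}' class='{1}'", div.Id, div.GetAttribute("className"));
        }
    }

    [STAThread]
    private static void Main()
    {
        // WebBrowser needs an STA thread with a running message loop.
        Application.Run(new Form1());
    }
}
```

Any work that depends on the parsed elements has to start from the handler (or be triggered by it), since Navigate returns before the page has loaded.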
