c# - Problem parsing HTML with HTMLAgilityPack in C#

Question

I'm just getting to grips with HTMLAgilityPack and XPath, and I'm trying to get the list of companies (the HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx

I currently have the following code:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

        // Create a request for the URL.        
        WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
        // Get the response.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        // Get the stream containing content returned by the server.
        Stream dataStream = response.GetResponseStream();
        // Open the stream using a StreamReader for easy access.
        StreamReader reader = new StreamReader(dataStream);
        // Read the content.
        string responseFromServer = reader.ReadToEnd();
        // Load the content into an HtmlDocument for HAP to parse
        htmlDoc.LoadHtml(responseFromServer);

        HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
        foreach (HtmlAgilityPack.HtmlNode node in tl)
        {
            Debug.Write(node.InnerText);
        }            

        // Cleanup the streams and the response.
        reader.Close();
        dataStream.Close();
        response.Close();

I used an XPath plugin for Chrome to get the XPath:

//*table[@id='indu_table']/tbody/tr[*]/td/b/a

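When I run my project, I get an unhandled XPath exception saying the expression has an invalid token.

I'm not sure what's wrong with it. I tried putting a number in place of the tr[*] part above, but I still get the same error.

I've been staring at this for the last hour. Is it something simple?

Thanks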

score 3 · Accepted Answer

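Because the data is generated by JavaScript, you have to parse the JavaScript rather than the HTML, so the Agility Pack isn't a great deal of help here, although it does make things a little easier. Here's how to do it using the Agility Pack and Newtonsoft JSON.Net to parse the JavaScript.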

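using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

HtmlDocument htmlDoc = new HtmlDocument();
// Download the page and parse the response stream, disposing both when done.
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"))
{
  htmlDoc.Load(stream);
}
List<string> listStocks = new List<string>();
// Find the script block that defines the table_body array.
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  // Use a regex to extract just the array we're interested in.
  string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
  JArray jArray = JArray.Parse(stockArray);
  foreach (JToken token in jArray.Children())
  {
    listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
  }
}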

To explain in a bit more detail: the data comes from one big JavaScript array on the page, var table_body = [.... Each stock is an element of that array and is itself an array.

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

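So by parsing the array, taking the first element of each entry, and appending it to the fixed URL prefix, we get the same result as the JavaScript does.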
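The parsing step can be tried offline against the single row quoted above. This is only a minimal sketch: the class and method names (StockLinkSketch, ExtractLinks) are made up for illustration, the sample script text is hand-built from that one row, and it assumes the array sits on a single line (the regex's `.` does not match newlines by default).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Newtonsoft.Json.Linq; // NuGet package: Newtonsoft.Json

public static class StockLinkSketch
{
    // Hypothetical helper: pulls the table_body array out of the script text
    // and turns the first element of each row into a symbol URL.
    public static List<string> ExtractLinks(string scriptText)
    {
        var links = new List<string>();
        Match match = Regex.Match(scriptText, @"table_body = (?<Array>\[.+?\]);");
        if (!match.Success)
        {
            return links; // no array found, so there is nothing to parse
        }

        JArray rows = JArray.Parse(match.Groups["Array"].Value);
        foreach (JToken row in rows)
        {
            string symbol = (string)row[0]; // first element is the ticker, e.g. "ATVI"
            links.Add("http://www.nasdaq.com/symbol/" + symbol.ToLowerInvariant());
        }
        return links;
    }

    public static void Main()
    {
        // Sample input built from the one row shown in the answer (an assumption
        // about the surrounding script text, not a copy of the real page).
        string script = "var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\", "
                      + "11.75, 0.06, 0.51, 3058125, 0.06, \"N\", \"N\"]];";

        foreach (string link in ExtractLinks(script))
        {
            Console.WriteLine(link); // prints http://www.nasdaq.com/symbol/atvi
        }
    }
}
```

The lazy `.+?` stops at the first `]` that is followed by `;`, which is why the inner arrays (each followed by `,` or `]`) don't end the match early.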

score 0 · Accepted Answer

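Why not just use the Descendants("a") method? It's simpler and more object-oriented. You just get a collection of node objects, and you can read the "href" attribute from each of them.

Sample code: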

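// Requires "using System.Linq;". Descendants returns a sequence, so project each node to its href.
var hrefs = htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", ""));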

If all you need is a list of the links on a web page, this approach is good enough.
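For anyone who wants to try this without hitting the network, here is a small sketch that runs the same idea against an inline HTML string. The class name, method name, and sample markup are all invented for illustration; only Descendants and GetAttributeValue come from the Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinkListSketch
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("a")
                  // GetAttributeValue returns the fallback when the attribute is missing.
                  .Select(a => a.GetAttributeValue("href", string.Empty))
                  .Where(href => href.Length > 0)
                  .ToList();
    }

    public static void Main()
    {
        // Made-up sample markup: one real link and one anchor without an href.
        string html = "<html><body>"
                    + "<a href=\"/symbol/atvi\">ATVI</a>"
                    + "<a name=\"top\">no link here</a>"
                    + "</body></html>";

        foreach (string href in GetHrefs(html))
        {
            Console.WriteLine(href); // prints /symbol/atvi
        }
    }
}
```

Note that this only sees anchors present in the HTML the server sends, so links that a page builds later in script would not show up.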

score 0 · Accepted Answer

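If you look at the page source for that URL, there isn't actually any element with id=indu_table. It appears to be generated dynamically (i.e. by JavaScript); the HTML you get when loading directly from the server won't reflect anything that client-side scripts change. That's probably why it isn't working.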
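Two quick checks may help separate the problems here; treat this as a rough sketch with made-up names (XPathChecks, IsValidXPath, CountMatches) and a made-up stand-in for the server HTML. The first check compiles each expression with .NET's System.Xml.XPath, which, as far as I know, is what the Agility Pack relies on for SelectNodes, so the plugin's //*table[...] form is rejected while //table[...] is accepted. The second shows that SelectNodes returns null rather than an empty collection when nothing matches, so the foreach in the question needs a null check.

```csharp
using System;
using System.Xml.XPath;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class XPathChecks
{
    // Returns true if .NET's XPath engine can compile the expression.
    public static bool IsValidXPath(string xpath)
    {
        try
        {
            XPathExpression.Compile(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false; // e.g. the "invalid token" error from the question
        }
    }

    // Counts matching nodes, treating HAP's null result as zero matches.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        Console.WriteLine(IsValidXPath("//*table[@id='indu_table']/tbody/tr[*]/td/b/a")); // False
        Console.WriteLine(IsValidXPath("//table[@id='indu_table']/tbody/tr[*]/td/b/a"));  // True

        // Invented stand-in for the server response: the table is absent
        // because the real page builds it in script.
        string staticHtml = "<html><body><div id='quotes'></div>"
                          + "<script>var table_body = [];</script></body></html>";
        Console.WriteLine(CountMatches(staticHtml, "//*[@id='indu_table']/tbody/tr[*]/td/b/a")); // 0
    }
}
```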
