c# - Select all links from a Html table using XPath (and HtmlAgilityPack)

Question

What I am trying to achieve is to extract all links with an href attribute that starts with http://, https:// or /. These links lie within a table (tbody > tr > td etc) with a certain class. I thought I could specify just the a element without the whole path to it, but it does not seem to work. I get a NullReferenceException at the line that selects the links:

var table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
if (table != null)
{
    foreach (HtmlNode item in table.SelectNodes("a[starts-with(@href, 'https://')]"))
    {
        // not working
    }
}

I don't know about any recommendations or best practices when it comes to XPath. Do I create overhead when I query the document two times?

score 3 · Accepted Answer

Use:

 //tbody/descendant::a[starts-with(@href,'https://')
                     or
                       starts-with(@href,'http://')
                     or
                       starts-with(@href,'./') 
                      ]

You will still run into problems unless you correct your code to reflect the fact that the XmlNode.SelectNodes() instance method returns an XmlNodeList, not an HtmlNode.
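On the question of querying twice: a minimal sketch of doing it in one pass with HtmlAgilityPack, assuming the NuGet package is referenced. The class name TableLinks, the method Extract and the sample markup are made up for illustration. It uses '/' as the third prefix, as the question asks, and starts from the table rather than tbody in case the parsed markup has no tbody element. As far as I know, HtmlNode.SelectNodes returns null by default when nothing matches, which would explain the NullReferenceException, so the sketch guards for that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class TableLinks
{
    // One XPath for the whole job: anchors anywhere below the table whose
    // href starts with one of the three prefixes from the question.
    private const string LinkXPath =
        "//table[@class='containerTable']//a[" +
        "starts-with(@href,'https://') or " +
        "starts-with(@href,'http://') or " +
        "starts-with(@href,'/')]";

    // Hypothetical helper: returns the matching href values, or an empty list.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // Guard against a null result before iterating.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(LinkXPath);
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        // Made-up sample markup, not taken from the question.
        const string html =
            "<table class='containerTable'><tr><td>" +
            "<a href='https://example.com/a'>absolute</a>" +
            "<a href='/local/page'>root-relative</a>" +
            "<a href='mailto:someone@example.com'>mail</a>" +
            "</td></tr></table>";

        foreach (string href in Extract(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Whether one query is measurably faster than two is something I have not benchmarked; the main gain here is that there is only one place where a null result has to be handled.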

score 2 · Accepted Answer

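The problem is that you are selecting the table and then immediately trying to select the anchors as if they were its direct children. There are tr and td tags in between.

So if you change your XPath to the following, it should work: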

"tbody/tr/td/a[starts-with(@href, 'https://')]"

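This will not work if your anchors are wrapped in something else, so you can instead select all anchors below the current node (i.e. the table). Note the leading dot, which keeps the search relative to that node:

".//a[starts-with(@href, 'https://')]"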

For more details on XPath syntax, see this.
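To make the difference between these expressions concrete, here is a small sketch that counts what each one matches when evaluated from the table node. It assumes HtmlAgilityPack is referenced; the class XPathContextDemo, the helper Count and the markup are invented for the example. Under standard XPath rules, a bare a looks only at the table's children, a fixed path misses wrapped anchors, .//a covers everything below the table, and //a starts again from the document root.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class XPathContextDemo
{
    // Hypothetical helper: counts matches, treating a null result as zero.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up markup: one link outside the table, two inside it.
        doc.LoadHtml(
            "<a href='https://outside.example/'>outside the table</a>" +
            "<table class='containerTable'><tbody><tr>" +
            "<td><a href='https://a.example/'>direct child of td</a></td>" +
            "<td><span><a href='https://b.example/'>wrapped in span</a></span></td>" +
            "</tr></tbody></table>");

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
        const string filter = "[starts-with(@href, 'https://')]";

        // Children of the table only: the expression from the question.
        Console.WriteLine(Count(table, "a" + filter));
        // Fixed path: anchors that are direct children of a td.
        Console.WriteLine(Count(table, "tbody/tr/td/a" + filter));
        // Any depth below the table.
        Console.WriteLine(Count(table, ".//a" + filter));
        // Absolute path: the whole document, regardless of the context node.
        Console.WriteLine(Count(table, "//a" + filter));
    }
}
```

The fixed path also depends on a tbody element actually being present in the parsed document, which is worth checking against your own HTML before relying on it.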
