3

What I am trying to achieve is to extract all links whose href attribute starts with http://, https:// or /. These links lie within a table (tbody > tr > td etc.) with a certain class. I thought I could specify just the a element without the whole path to it, but that does not seem to work. I get a NullReferenceException at the line that selects the links:

var table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
if (table != null)
{
    foreach (HtmlNode item in table.SelectNodes("a[starts-with(@href, 'https://')]"))
    {
        // not working
    }
}

I don't know about any recommendations or best practices when it comes to XPath. Do I create overhead when I query the document two times?

4

2 Answers

3

Use:

 //tbody/descendant::a[starts-with(@href,'https://')
                     or
                       starts-with(@href,'http://')
                     or
                       starts-with(@href,'./') 
                      ]

You will still have problems unless you correct your code to reflect the fact that the return type of the XmlNode.SelectNodes() instance method is XmlNodeList, not HtmlNode.
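
A minimal sketch of how an expression like this might be used, assuming the question's code is built on HtmlAgilityPack (the HtmlNode type suggests so). LinkExtractor and ExtractLinks are hypothetical names made up for illustration. In HtmlAgilityPack, SelectNodes returns an HtmlNodeCollection, or null when nothing matches, so the result is checked before iterating; that null is a likely source of the NullReferenceException. The sketch queries relative to the table rather than through tbody, since the parser may not add a tbody element that is missing from the markup, and it tests for '/' as the question asks rather than './'.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, assumed to be the library used in the question

public static class LinkExtractor // hypothetical helper, not part of HtmlAgilityPack
{
    // Returns the href values of all links inside the table that start with
    // http://, https:// or /.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        var table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
        if (table == null)
            return hrefs;

        // Relative to the table; descendant:: reaches anchors at any depth.
        var links = table.SelectNodes(
            "descendant::a[starts-with(@href,'https://')" +
            " or starts-with(@href,'http://')" +
            " or starts-with(@href,'/')]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (links == null)
            return hrefs;

        foreach (HtmlNode link in links)
            hrefs.Add(link.GetAttributeValue("href", ""));

        return hrefs;
    }
}
```

Calling LinkExtractor.ExtractLinks on a page that has no matching table, or a table with no matching links, gives back an empty list instead of throwing.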

answered 2010-03-21T04:37:28.667
2

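The problem is that you are selecting the table and then immediately trying to select the anchors as if they were its direct children. There are tr and td tags in between.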

So if you change your XPath to the following, it should work:

"tbody/tr/td/a[starts-with(@href, 'https://')]"

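This will not work if your anchors are wrapped in anything else, so you can instead select all anchors anywhere within the current node (i.e. the table). Note the leading dot, which keeps the search inside the table; without it, // starts from the document root:

".//a[starts-with(@href, 'https://')]"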

For more details on XPath syntax, see this
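
On scoping when querying from the table node: in XPath, a path beginning with // is evaluated from the document root even when it is run against a node, while .// stays inside that node. A minimal sketch, assuming HtmlAgilityPack and made-up markup (ScopeDemo, SampleHtml and CountFromTable are hypothetical names), that counts what each form matches:

```csharp
using System;
using HtmlAgilityPack; // NuGet package, assumed to be the library used in the question

public static class ScopeDemo // hypothetical name, for illustration only
{
    // One https link outside the table and one nested inside a <p> in a cell.
    public const string SampleHtml =
        "<a href='https://outside.example/'>outside</a>" +
        "<table class='containerTable'><tr><td><p>" +
        "<a href='https://inside.example/'>inside</a>" +
        "</p></td></tr></table>";

    // Runs the given XPath against the table node and counts the matches.
    public static int CountFromTable(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode.SelectSingleNode("//table[@class='containerTable']");
        if (table == null)
            return 0;

        var nodes = table.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count; // null means no match
    }
}
```

With SampleHtml, CountFromTable using ".//a[starts-with(@href, 'https://')]" should report one match (the link inside the table), while the same call with "//a[starts-with(@href, 'https://')]" should report two, because it also picks up the link outside the table.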

answered 2010-03-20T22:28:02.647