
[This question has a relative: Selective screen scraping with HTMLAgilityPack and XPath]

I have some HTML to parse that generally looks like this:

...
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
...

I am looking for a way to parse this into meaningful chunks, one line per row, like this:

(1), (2), (3), (4), (5), (6), {1} CRLF
(1), (2), (3), (4), (5), (6), {1} CRLF
and so on.

I have tried two approaches.
Approach 1:

var dataList = currentDoc.DocumentNode.Descendants("tr")
                .Select
                 (
                  tr => tr.Descendants("td").Select(td => td.InnerText).ToList()
                 ).ToList();

This gets the inner text of the tds, but it does not get the link {1}. It produces a list of lists, which I can handle with a nested foreach.

Approach 2:

var dataList = currentDoc.DocumentNode
               .SelectNodes("//tr//td//text()|//tr//td//a//@href");

This does get me the link {1} and all the data, but the result is unorganized: everything comes back as one big flat list of nodes. The values in one tr belong together, and with this approach I lose that relation.

So, how can I solve this problem?


1 Answer


The following query looks in each cell for a direct child a element with a non-empty href attribute. If there is no such element, the inner text of the cell is used instead:

var dataList = 
     currentDoc.DocumentNode.Descendants("tr")
               .Select(tr => from td in tr.Descendants("td")
                             let a = td.SelectSingleNode("a[@href!='']")
                             select a == null ? td.InnerText : 
                                                a.Attributes["href"].Value);

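The result is a lazily evaluated sequence of sequences, one inner sequence per tr. Feel free to add ToList() calls if you need materialized lists.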
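To get from that query to the "(1), (2), ..., {1} CRLF" lines asked for in the question, something along these lines should work. This is only a sketch: the class name RowScraper and the methods ExtractRows and ToLines are made up for illustration, and it assumes the HtmlAgilityPack package is referenced and that the link is a direct child of its td, as in the sample markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class RowScraper
{
    // Returns one list per <tr>. Each entry is the href of a direct child <a>
    // with a non-empty href, or the trimmed cell text when there is none.
    public static List<List<string>> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants("tr")
            .Select(tr => tr.Descendants("td")
                .Select(td =>
                {
                    // SelectSingleNode returns null when nothing matches.
                    var a = td.SelectSingleNode("a[@href!='']");
                    return a == null
                        ? td.InnerText.Trim()
                        : a.GetAttributeValue("href", "");
                })
                .ToList())   // keep the cells of one row together
            .ToList();
    }

    // Joins the cells of each row with ", " and the rows with CRLF.
    public static string ToLines(IEnumerable<IEnumerable<string>> rows)
    {
        return string.Join("\r\n", rows.Select(cells => string.Join(", ", cells)));
    }
}
```

Calling RowScraper.ToLines(RowScraper.ExtractRows(html)) then gives one comma-separated line per tr. Because each row stays in its own list, the relation between the text cells and the link is preserved, which is what the flat XPath union in approach 2 loses.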

answered 2013-03-14T09:12:24.283