[This question has a relative: Selective screen scraping with HTMLAgilityPack and XPath]
I have some HTML to parse, which generally looks like this:
...
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
...
I am looking for a way to parse this into meaningful chunks, one per row, like this:
(1), (2), (3), (4), (5), (6), {1} CRLF
(1), (2), (3), (4), (5), (6), {1} CRLF
and so on.
I have tried two approaches:
Way 1:
var dataList = currentDoc.DocumentNode.Descendants("tr")
    .Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
    .ToList();
This gets the inner text of the tds, but it cannot get the link {1}. It produces a list of lists (one inner list per tr), which I can walk with a nested foreach.
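A minimal sketch of that nested foreach, for illustration only. The hard-coded dataList is a stand-in with placeholder values (an assumption, not real scraped data) so the snippet runs without HtmlAgilityPack; the last cell of each row is left empty on the assumption that a td holding only a linked image has no inner text, which is exactly where the link {1} goes missing.

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Stand-in for the list of lists built by Way 1 (placeholder values).
        // The 7th cell is empty: InnerText has no text to return for the image link.
        var dataList = new List<List<string>>
        {
            new List<string> { "(1)", "(2)", "(3)", "(4)", "(5)", "(6)", "" },
            new List<string> { "(1)", "(2)", "(3)", "(4)", "(5)", "(6)", "" },
        };

        // Outer loop: one iteration per tr.
        foreach (var row in dataList)
        {
            // Inner loop: one iteration per td in that tr.
            foreach (var cell in row)
            {
                Console.Write(cell + ", ");
            }
            Console.WriteLine(); // end the row
        }
    }
}
```

The row grouping survives here, but the href value never makes it into the inner lists.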
Way 2:
var dataList = currentDoc.DocumentNode
.SelectNodes("//tr//td//text()|//tr//td//a//@href");
This does get me the link {1} and all the data, but the result is unorganized: everything comes back as one big flat chunk. Since the data within one tr belongs together, I now lose that relationship.
So, how can I solve this problem?