vb.net - Scraping between broken tags with HtmlAgilityPack

Question

I need to scrape a p tag that is followed by an h3 tag but has no closing p tag. It looks like this:

<script ad>asdasdasd</script>
<p>Translation companies are
-----------------------
-----------------------
<h3 class="this_class">mind blown site</h3>

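There is no </p> tag, so I can't parse it cleanly. Now I have three questions:

1) Can this be parsed with HtmlAgilityPack and XPath?

2) I have a function that finds the text between two strings (getbetween). But I have a doubt: if I use "asdasdasd" as the start string, will VB.NET always, 100% of the time, pick the script tag just above the h3, given that the same "asdasdasd" line appears 2-3 times?

3) Do you know of any other approach?

(I had to post it as code so the HTML wouldn't get mangled.)

Regards,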

score 1 · Accepted Answer

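It would probably be a good idea to post some more "real" HTML so we can really help you, at least the part between the h3 and the p. Anyway, this should get you the p tag, starting from the h3 tag.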

HtmlDocument doc = new HtmlDocument();
doc.Load(... //Load the Html...

//Either of these lines will do
HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[@class='this_class']/preceding-sibling::p");
//HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[contains(text(),'mind blown site')]/preceding-sibling::p");

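//The text is read from NextSibling: the unclosed <p> is parsed without children, so its text lands in the node after it
string pInnerHtml = pNode.NextSibling.InnerHtml; //Has the text "Translation companies are...."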

score 0 · Accepted Answer

So in general, to get all the nodes from the opening p tag up to the start of the tag you don't want, you could do this:

// Start at the first <p> and locate the <h3> sibling that marks where to stop
var p = doc.DocumentNode.SelectSingleNode("//p");
var h3 = p.SelectSingleNode("following-sibling::h3[@class='this_class']");
var following = new List<string>();
// Collect the text of every sibling node between the <p> and the <h3>
for (var current = p.NextSibling; current != h3; current = current.NextSibling)
{
    following.Add(current.InnerText);
}
var innerText = String.Concat(following);
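
On the second question (the getbetween helper): the original function was not posted, so the following is only a sketch, assuming it is built on String.IndexOf. IndexOf always returns the first match, so with a repeated start string such as "asdasdasd" a helper like that would start at the first occurrence in the page, not at the one just above the h3. One way around it is to find the end marker first and then search backwards for the closest start marker. The names TextBetween, FirstMatch and ClosestBefore are hypothetical and exist only for this illustration. The code is C# like the snippets above; the same String methods can be called from VB.NET.

```csharp
using System;

// Hypothetical stand-in for the asker's getbetween helper (the real one was not posted).
public static class TextBetween
{
    // Assumed behaviour of a typical IndexOf-based helper:
    // it starts at the FIRST occurrence of `start` in the source.
    public static string FirstMatch(string source, string start, string end)
    {
        int startIndex = source.IndexOf(start, StringComparison.Ordinal);
        if (startIndex < 0) return null;
        startIndex += start.Length;

        int endIndex = source.IndexOf(end, startIndex, StringComparison.Ordinal);
        if (endIndex < 0) return null;

        return source.Substring(startIndex, endIndex - startIndex);
    }

    // Alternative: anchor on the end marker, then take the LAST
    // occurrence of `start` that lies entirely before it.
    public static string ClosestBefore(string source, string start, string end)
    {
        if (string.IsNullOrEmpty(start) || string.IsNullOrEmpty(end))
            throw new ArgumentException("Markers must be non-empty.");

        int endIndex = source.IndexOf(end, StringComparison.Ordinal);
        if (endIndex < 0) return null;

        // Only look at the text in front of the end marker.
        string head = source.Substring(0, endIndex);
        int startIndex = head.LastIndexOf(start, StringComparison.Ordinal);
        if (startIndex < 0) return null;

        return head.Substring(startIndex + start.Length);
    }
}
```

For the input "asdasdasd A asdasdasd B <h3" with start "asdasdasd" and end "<h3", FirstMatch returns " A asdasdasd B " while ClosestBefore returns " B ". Both return null when a marker is missing, so the caller should check for that before using the result.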
