csquery - HTML node InnerText to include anchor text in CsQuery

Question

I'm using CsQuery to parse some WordPress blog posts so that I can run text clustering analysis on them. I want to extract the text from the relevant <p> node.

var content = dom["div.entry-content>p"];
if (content.Length == 1)
{
    System.Diagnostics.Debug.WriteLine(content[0].InnerHTML);
    System.Diagnostics.Debug.WriteLine(content[0].InnerText);
}

In one of the posts, the InnerHTML looks like this:

An MIT Europe project that attempts to <a title="Wired News: Gizmo Puts Cards 
on the Table" href="http://www.wired.com/news/technology/0,1282,61265,00.html?
tw=rss.TEK">connect two loved ones seperated by distance</a> through the use 
of two tables, a bunch of RFID tags and a couple of projectors.

and the corresponding InnerText like this:

An MIT Europe project that attempts to through the use of two tables, a bunch of RFID tags and a couple of projectors.

That is, the inner text is missing the anchor text. I could parse the HTML myself, but I'm hoping there is a way to have CsQuery give me

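An MIT Europe project that attempts to connect two loved ones seperated by distance through the use of two tables, a bunch of RFID tags and a couple of projectors.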

(My italics.) How should I get this?

score 4 · Accepted Answer

   string result = dom["div.entry-content>p"].Text();

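The Text() method returns the combined text of everything under the selected p elements, including the text inside nested tags such as the anchor.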
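
For anyone who wants to try this end to end, here is a minimal sketch. It assumes the CsQuery NuGet package is referenced; the class name AnchorTextDemo, the helper ExtractText, and the shortened sample markup are made up for illustration and are not from the question.

```csharp
using System;
using CsQuery;

public static class AnchorTextDemo
{
    // Returns the combined text of every <p> that is a direct child of
    // div.entry-content, including text inside nested tags such as <a>.
    public static string ExtractText(string html)
    {
        CQ dom = CQ.Create(html);
        return dom["div.entry-content>p"].Text();
    }

    public static void Main()
    {
        // Hypothetical, shortened stand-in for the blog post markup.
        const string html =
            "<div class=\"entry-content\"><p>An MIT Europe project that attempts to " +
            "<a href=\"http://www.wired.com/\">connect two loved ones</a> " +
            "through the use of two tables.</p></div>";

        Console.WriteLine(ExtractText(html));
    }
}
```

Note that Text() works on the whole selection, so if the selector matches several paragraphs their text is combined into one string. If you need one string per paragraph, select each element separately before calling Text().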

score 1 · Accepted Answer

Try using HtmlAgilityPack:

using HAP = HtmlAgilityPack;
...
var doc = new HAP.HtmlDocument();
doc.LoadHtml("Your html");
var node = doc.DocumentNode.SelectSingleNode(@"node xPath");
Console.WriteLine(node.InnerText); // InnerText is a property, not a method

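The XPath is the path to the node on the page.

For example, in Google Chrome, press F12, select your node, then right-click it and choose "Copy XPath".

The XPath of this topic's header: //*[@id="question-header"]/h1/a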
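
Applied to the markup from the question, the same idea might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the div's class attribute is exactly "entry-content"; the class name HapInnerTextDemo, the helper ExtractText, and the sample markup are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

public static class HapInnerTextDemo
{
    // Returns the text of the first <p> directly under div.entry-content,
    // or an empty string when no such node exists.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Rough XPath counterpart of the CSS selector div.entry-content>p.
        // Assumption: the class attribute holds only "entry-content".
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='entry-content']/p");
        if (node == null)
        {
            return string.Empty; // SelectSingleNode returns null on no match
        }

        // InnerText includes the text of descendant nodes such as <a>;
        // DeEntitize decodes HTML entities like &amp; into plain characters.
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    public static void Main()
    {
        // Hypothetical, shortened stand-in for the blog post markup.
        const string html =
            "<div class=\"entry-content\"><p>An MIT Europe project that attempts to " +
            "<a href=\"http://www.wired.com/\">connect two loved ones</a> " +
            "through the use of two tables.</p></div>";

        Console.WriteLine(ExtractText(html));
    }
}
```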
