c# - Extracting content from a web page

Question

I am trying to extract all of the content from a web page using HtmlAgilityPack.

foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    sb.AppendLine(node.Text);
}

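When I parse google.com with the code above, I get a lot of JavaScript. All I want is the actual content of the page, such as the text inside h or p tags. On this page, for example, I would want to keep the question, answers, and comments and drop everything else.

I am really new to XPath and don't know where to go from here, so any help would be appreciated.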

score 0 · Accepted Answer

You can filter the unwanted tags by name and remove them from the document.

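        var page = new HtmlWeb(); // requires using HtmlAgilityPack; and using System.Linq;
        var doc = page.Load("http://www.google.com");
        // ToList() snapshots the matches so nodes can be removed while iterating.
        doc.DocumentNode.Descendants().Where(n => n.Name == "script" || n.Name == "style").ToList().ForEach(n => n.Remove());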
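
To show how the removal step fits with the text-extraction loop from the question, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the `ExtractText` helper and the inline sample HTML are hypothetical and only there so the example runs without a network call.

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper (not from the original answer): returns the text
    // left in an HTML string after script and style elements are removed.
    static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Remove script and style elements first, as the answer suggests.
        doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList()
            .ForEach(n => n.Remove());

        var sb = new StringBuilder();

        // SelectNodes returns null when nothing matches, so guard against it.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            foreach (var node in textNodes)
            {
                // Decode entities such as &amp; and skip whitespace-only nodes.
                var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                {
                    sb.AppendLine(text);
                }
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Sample markup is an assumption made for illustration.
        const string html =
            "<html><head><style>p { color: red; }</style></head>" +
            "<body><h1>Title</h1><script>var x = 1;</script>" +
            "<p>Hello &amp; welcome</p></body></html>";

        Console.Write(ExtractText(html));
    }
}
```

With this sample input, the output should contain the heading and paragraph text but none of the script or style contents. For a live page, `new HtmlWeb().Load(url)` can supply the document instead of `LoadHtml`.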

score 0 · Accepted Answer

You can use this XPath expression:

//body//*[local-name() != 'script']/text()

It selects only the text of elements inside body and skips script elements.
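
A short sketch of how this expression might be used with HtmlAgilityPack follows. The sample HTML is an assumption made for illustration, and the null check is there because `SelectNodes` returns null when no node matches.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();

        // Sample markup is hypothetical; a real page could be loaded with HtmlWeb.
        doc.LoadHtml(
            "<html><body><h1>Title</h1><script>var x = 1;</script>" +
            "<p>Some <b>bold</b> text</p></body></html>");

        // The XPath expression from the answer: text nodes whose parent is
        // any element under body other than script.
        var nodes = doc.DocumentNode.SelectNodes(
            "//body//*[local-name() != 'script']/text()");

        if (nodes == null)
        {
            Console.WriteLine("No text nodes matched.");
            return;
        }

        foreach (var node in nodes)
        {
            // Skip whitespace-only text nodes.
            var text = node.InnerText.Trim();
            if (text.Length > 0)
            {
                Console.WriteLine(text);
            }
        }
    }
}
```

Two things to keep in mind with this expression: it does not match text placed directly inside body (outside any child element), and it does not exclude style elements that appear in the body. If that matters, a variant such as `//body//text()[not(parent::script) and not(parent::style)]` may be worth trying; treat it as an untested suggestion rather than part of the original answer.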
