c# - How to get all displayed text from a web page in C#

Question

Hi, I am developing a data scraping application in C#.

I want to get all of the displayed text, without the HTML tags.

Here is my code:

HtmlWeb web  = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.
   Load(@"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str =  doc.DocumentNode.InnerText;

This InnerText also returns some tags and scripts, but I only want the displayed text that is visible to the user. Please help. Thanks.

score 2 · Accepted Answer

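[I believe this will solve your problem][1]

Method 1 – Cut and paste in memory

Use a WebBrowser control object to process the web page, then copy the text from the control...

Use the following code to download the web page: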

//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add a new event to process document when download is completed   
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage
wb.Url = urlPath;

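Use the following event handler to process the downloaded web page text:

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    // Select the whole rendered page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    // CleanText is a helper that is not shown in this answer
    textResultsBox.Text = CleanText(Clipboard.GetText());
}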

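Method 2 – Select object in memory

This is a second way to process the downloaded web page text. It seems to take only slightly longer (the difference is minimal). However, it avoids using the clipboard and the limitations that come with it.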

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Get the WebBrowser control and its underlying IHTMLDocument2 (mshtml interop)
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument = wb.Document.DomDocument as IHTMLDocument2;
    // Select all the text on the page and get the selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    // Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

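Method 3 – The elegant, simple, slower XmlDocument approach

A good friend shared this example with me. I am a huge fan of simplicity, and this example wins the simplicity contest hands down. Unfortunately, it is very slow compared with the other two methods.

The XmlDocument object loads and processes an HTML file in just 3 simple lines of code: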

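XmlDocument document = new XmlDocument();
// Load treats a string without a scheme as a file path, and it only succeeds on well-formed XML
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;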
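One caveat with this approach is that XmlDocument is an XML parser, so it throws an XmlException on markup that is not well-formed XML, which is common for real-world HTML. The sketch below illustrates that behaviour on in-memory strings. The class name `XmlTextDemo` and the method `TryGetInnerText` are hypothetical names chosen for this illustration, not part of the original answer.

```csharp
using System;
using System.Xml;

public static class XmlTextDemo
{
    // Hypothetical helper: returns true and the concatenated text
    // if the markup parses as XML, false otherwise.
    public static bool TryGetInnerText(string markup, out string text)
    {
        try
        {
            var document = new XmlDocument();
            document.LoadXml(markup);        // parses a string rather than a URL or file
            text = document.InnerText;       // text of all nodes, concatenated without separators
            return true;
        }
        catch (XmlException)
        {
            // Raised for unclosed tags such as <br>, mismatched end tags, and so on
            text = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        string text;

        // Well-formed XHTML-style input parses successfully
        bool ok = TryGetInnerText("<html><body><p>Hello</p><p>world</p></body></html>", out text);
        Console.WriteLine(ok + " " + text);  // prints "True Helloworld"

        // An unclosed <br> is not well-formed XML, so parsing fails
        ok = TryGetInnerText("<p>Hello<br></p>", out text);
        Console.WriteLine(ok);               // prints "False"
    }
}
```

Note that InnerText joins adjacent text nodes with no whitespace between them, so block-level elements run together unless separators are added afterwards.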

There you have it! Three simple ways to scrape only the displayed text from a web page, without involving external "packages".

score 0 · Accepted Answer

To remove all HTML tags from a string, you can use:

String output = inputString.replaceAll("<[^>]*>", "");

To remove a specific tag:

String output = inputString.replaceAll("(?i)<td[^>]*>", "");

Hope this helps :)
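The two snippets above call `replaceAll`, which is a Java `String` method rather than a C# one. A rough C# equivalent, using the same patterns with `Regex.Replace`, might look like the sketch below. The class and method names (`TagStripper`, `StripAllTags`, `StripTdTags`) are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes anything that looks like a tag: "<", a run of non-">" characters, then ">".
    public static string StripAllTags(string input)
    {
        return Regex.Replace(input, "<[^>]*>", "");
    }

    // Removes only opening <td ...> tags, ignoring case,
    // mirroring the "(?i)<td[^>]*>" pattern from the answer.
    public static string StripTdTags(string input)
    {
        return Regex.Replace(input, "<td[^>]*>", "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(StripAllTags("<p>Hello <b>world</b></p>"));   // prints "Hello world"
        Console.WriteLine(StripTdTags("<TD class=\"x\">cell</td>"));    // prints "cell</td>"
    }
}
```

A regular expression like this only deletes the tags themselves. The contents of script and style elements remain in the output, so it does not by itself produce only the text that a user would see.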

score 0 · Accepted Answer

To remove JavaScript and CSS:

foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();

To remove comments (untested):

foreach(var comment in doc.DocumentNode.Descendants().Where(n => n.NodeType == HtmlNodeType.Comment).ToArray())
    comment.Remove();
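Putting these pieces together with the question's HtmlAgilityPack code, a minimal sketch might look like the following. It assumes the HtmlAgilityPack package is referenced and parses an in-memory string instead of downloading a page. The names `VisibleText` and `Extract` are hypothetical. It also cannot see text that the page adds through JavaScript after loading, since HtmlAgilityPack does not execute scripts.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VisibleText
{
    // Hypothetical helper: strips script, style and comment nodes,
    // then returns the remaining text with HTML entities decoded.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the nodes first (ToArray) so the tree is not modified while it is being enumerated
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToArray();

        foreach (var node in unwanted)
            node.Remove();

        // InnerText keeps entities such as &amp; encoded, so decode them afterwards
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        string html = "<html><head><style>p { color: red; }</style></head>"
                    + "<body><!-- note --><p>Hello &amp; welcome</p>"
                    + "<script>alert(1);</script></body></html>";
        Console.WriteLine(Extract(html));
    }
}
```

Elements hidden with CSS (for example `display: none`) are still included, because this sketch works on the markup rather than on the rendered page.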
