Method 1 – Cut and Paste in Memory
Use a WebBrowser control object to process the web page, then copy the text from the control...
//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add a new event to process document when download is completed
wb.DocumentCompleted +=
new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage
wb.Url = urlPath;
Use the following event handler to process the downloaded web page text:
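private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    //Select the whole page and copy it to the clipboard (this overwrites the clipboard contents)
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    //Read the copied text back and clean it up
    textResultsBox.Text = CleanText(Clipboard.GetText());
}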
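The handler above calls CleanText, but the article does not show its body. The following is only a minimal sketch of what such a helper might do, assuming its job is to tidy the whitespace that a clipboard copy of a rendered page tends to leave behind. The class name TextCleaner and the exact cleanup rules are assumptions for illustration, not the author's implementation.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the article's CleanText helper (its real body is not shown).
    public static string CleanText(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Normalize Windows (\r\n) and lone \r line endings to \n.
        string normalized = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        normalized = Regex.Replace(normalized, @"[ \t]+", " ");

        // Drop the single space left on either side of a line break.
        normalized = Regex.Replace(normalized, @" ?\n ?", "\n");

        // Collapse three or more consecutive line breaks into one blank line.
        normalized = Regex.Replace(normalized, @"\n{3,}", "\n\n");

        return normalized.Trim();
    }
}
```

If the result is shown in a multiline Windows Forms TextBox, the "\n" line breaks would need to be converted back with Replace("\n", Environment.NewLine) before assigning to Text, since that control expects "\r\n".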
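Method 2 – Select an Object in Memory
This is a second way to process the downloaded web page text. It seems to take slightly longer, though the difference is very small. However, it avoids using the clipboard and the limitations that come with it.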
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    //Get the WebBrowser control and its IHTMLDocument2
    //(the IHTML* interfaces come from the mshtml namespace; add a reference to Microsoft.mshtml)
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    //Select all the text on the page and create a selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    //Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
An XmlDocument object can load and process an HTML file with only 3 simple lines of code:
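XmlDocument document = new XmlDocument();
//Load the page; urlPath is assumed to be the same address used in Method 1.
//Load throws an XmlException if the page is not well-formed XML.
document.Load(urlPath);
string allText = document.InnerText;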