当我使用 HTML Agility Pack 抓取 H3 标记的 InnerText 时,与源代码相比,我拾取了额外的字符 (Â)。
我不确定这些字符来自哪里或如何删除它们。
提取字符串:
 Week 1
HTML 源代码:
<h3>
<span> </span>Week 1</h3>
当前代码:
private void getWeekNumber(string url)
{
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(new System.IO.StringReader(url));
foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
{
MessageBox.Show(h3.InnerText);
}
}
当前的解决方法(从stackoverflow上的某个地方被盗,丢失了链接):
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Method = "GET";
using (var stream = request.GetResponse().GetResponseStream())
using (var reader = new System.IO.StreamReader(stream, Encoding.UTF8))
{
result = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(new System.IO.StringReader(result));
foreach (HtmlAgilityPack.HtmlNode h3 in htmlDoc.DocumentNode.SelectNodes("//h3"))
{
MessageBox.Show(h3.InnerText);
}