I use the following code to convert an HTTP response stream into an XmlDocument.

HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
Stream responseStream = response.GetResponseStream();
StreamReader responseReader = new StreamReader(responseStream);
String responseString = responseReader.ReadToEnd();
Console.WriteLine(responseString);
Int32 htmlTagIndex = responseString.IndexOf("<html",
   StringComparison.OrdinalIgnoreCase);
XmlDocument responseXhtml = new XmlDocument();
responseString = responseString.Substring(htmlTagIndex); // MARK 1
responseString = responseString.Replace("&nbsp;", " "); // MARK 2
responseXhtml.LoadXml(responseString);
return responseXhtml;

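The MARK 1 line skips the DOCTYPE declaration.

The MARK 2 line avoids the error "Reference to undeclared entity 'nbsp'".

Is there a better way to do this? The code above does too much string manipulation.

Thanks!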

1 Answer

I would just use HtmlAgilityPack to parse the HTML. Even if you have to convert the HTML to XML, you can use it for that too.

// Requires the HtmlAgilityPack NuGet package.
// Namespaces used: System.IO, System.Net, System.Xml.Linq
using (WebClient wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.google.com"));
    doc.OptionOutputAsXml = true; // serialize as XML rather than HTML on Save

    StringWriter writer = new StringWriter();
    doc.Save(writer);

    var xDoc = XDocument.Load(new StringReader(writer.ToString()));
}
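
Since the question asks for an XmlDocument and the snippet above ends with an XDocument, here is a minimal sketch of one way to bridge the two using only the framework's own XML classes. The class name XmlBridge and the method ToXmlDocument are hypothetical names chosen for illustration; they are not part of HtmlAgilityPack or the .NET libraries.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlBridge
{
    // Hypothetical helper (an assumption, not from the original answer):
    // copies an XDocument into a new XmlDocument by streaming its nodes
    // through an XmlReader, so no intermediate string is built.
    public static XmlDocument ToXmlDocument(XDocument source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        var target = new XmlDocument();
        using (XmlReader reader = source.CreateReader())
        {
            target.Load(reader);
        }
        return target;
    }
}
```

Inside the using block above, this could be called as XmlDocument responseXhtml = XmlBridge.ToXmlDocument(xDoc);. If the XDocument is not needed at all, passing writer.ToString() straight to XmlDocument.LoadXml should work as well.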
Answered 2012-10-10T15:17:30.773