c# - XmlDocument.Load fails, LoadXml works

Question

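While answering this question, I ran into a situation I don't understand. The OP was trying to load XML from the following location: http://www.google.com/ig/api?weather=12414&hl=it

The obvious solution is: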

string m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
XmlDocument myXmlDocument = new XmlDocument();
myXmlDocument.Load(m_strFilePath); //Load NOT LoadXml

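However, this fails with

XmlException: Invalid character in the given encoding. Line 1, position 499.

It seems to be choking on the à in Umidità.

OTOH, the following works fine: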

var m_strFilePath = "http://www.google.com/ig/api?weather=12414&hl=it";
string xmlStr;
using(var wc = new WebClient())
{
    xmlStr = wc.DownloadString(m_strFilePath);
}
var xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlStr);

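I'm baffled by this. Can anyone explain why the former fails but the latter works fine?

It is worth noting that the document's XML declaration omits the encoding.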

score 13 · Accepted Answer

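WebClient uses the encoding information in the HTTP response headers to determine the correct encoding (in this case ISO-8859-1, which is ASCII-based and single-byte, i.e. 8 bits per character).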

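It looks like XmlDocument.Load does not use this information, and since the XML declaration also lacks an encoding, it has to guess the encoding and gets it wrong. Some digging leads me to believe that it picks UTF-8.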
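
If that is indeed what happens, one way around it is to decode the bytes yourself and hand XmlDocument a TextReader, so the parser never has to guess. The following is only a sketch under assumptions: the helper name LoadWithEncoding is hypothetical, the sample XML is a stand-in for the real response, and ISO-8859-1 is hard-coded where real code would take the charset from the response's Content-Type header.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlEncodingSketch
{
    // Hypothetical helper: decodes the stream with the given encoding
    // before the XML parser sees it, so no encoding detection is needed.
    public static XmlDocument LoadWithEncoding(Stream input, Encoding encoding)
    {
        var doc = new XmlDocument();
        using (var reader = new StreamReader(input, encoding))
        {
            doc.Load(reader); // XmlDocument.Load(TextReader) overload
        }
        return doc;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Stand-in for the response body (assumption): no encoding in the
        // declaration, and "à" encoded as the single byte 0xE0.
        byte[] body = latin1.GetBytes(
            "<?xml version=\"1.0\"?><humidity data=\"Umidit\u00e0: 50%\"/>");

        try
        {
            // Loading the raw bytes leaves encoding detection to the parser.
            new XmlDocument().Load(new MemoryStream(body));
            Console.WriteLine("raw stream loaded");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("raw stream failed: " + ex.Message);
        }

        // Decoding explicitly as ISO-8859-1 avoids the guess entirely.
        XmlDocument doc = LoadWithEncoding(new MemoryStream(body), latin1);
        Console.WriteLine(doc.DocumentElement.GetAttribute("data"));
    }
}
```

This is essentially what the WebClient.DownloadString plus LoadXml combination in the question does: the text is already decoded into a string by the time the XML parser receives it.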

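If we want to get really technical, the character it chokes on is "à", which is 0xE0 in ISO-8859-1. On its own, that byte is not a valid character in UTF-8. Specifically, its binary representation is: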

11100000

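If you dig around in the UTF-8 Wikipedia article, you can see that this leading byte announces a code point (i.e. character) made up of 3 bytes in total, in the following format: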

Byte 1      Byte 2      Byte 3
----------- ----------- -----------
1110xxxx    10xxxxxx    10xxxxxx

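But if we look back at the document, the next two characters are ": " (a colon and a space), which are 0x3A and 0x20 in ISO-8859-1. This means that what we actually end up with is: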

Byte 1      Byte 2      Byte 3
----------- ----------- -----------
11100000    00111010    00100000

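Neither the 2nd nor the 3rd byte of the sequence has 10 as its two most significant bits (which would mark a continuation byte), so this sequence makes no sense in UTF-8.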
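
The byte-level argument above can be checked with a small sketch. The class and method names are made up for illustration; the only library behaviour relied on is that a UTF8Encoding constructed with throwOnInvalidBytes set to true throws on malformed input instead of substituting a replacement character.

```csharp
using System;
using System.Text;

public static class Utf8CheckSketch
{
    // Returns true when the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes: true) makes decoding throw
        // on malformed sequences rather than emitting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (ArgumentException) // DecoderFallbackException derives from this
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] fromDocument = { 0xE0, 0x3A, 0x20 };     // "à: " in ISO-8859-1
        byte[] properUtf8 = { 0xC3, 0xA0, 0x3A, 0x20 }; // "à: " in UTF-8

        Console.WriteLine(IsValidUtf8(fromDocument)); // False
        Console.WriteLine(IsValidUtf8(properUtf8));   // True

        // The same three bytes decode cleanly as ISO-8859-1.
        string decoded = Encoding.GetEncoding("ISO-8859-1").GetString(fromDocument);
        Console.WriteLine(decoded == "\u00e0: ");     // True
    }
}
```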

score 2 · Answer

The string Umidità, as the inner text of a node, must be wrapped in <![CDATA[Umidità]]>. That way XmlDocument.Load will not give any error.
