c# - Detecting XML encoding from a byte array in C#?

Question

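Well, I have a byte array, and I know it holds an XML-serialized object. Is there any way to get the encoding from it?

I'm not going to deserialize it, but I am saving it in an xml field on a SQL Server... so I need to convert it to a string?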

score 14 · Accepted Answer

A solution similar to this question could solve this by using a Stream over the byte array. Then you won't have to fiddle at the byte level. Like this:

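using System.IO;
using System.Text;
using System.Xml;

Encoding encoding;
using (var stream = new MemoryStream(bytes))
using (var xmlreader = new XmlTextReader(stream))
{
    // Reading up to the first content node makes the reader consume any BOM
    // and XML declaration, after which Encoding reports what it found.
    xmlreader.MoveToContent();
    encoding = xmlreader.Encoding;
}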
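Since the question also asks about getting a string out of the bytes, here is a minimal sketch that wraps the snippet above and then decodes with whatever encoding the reader reported. The class and method names (`XmlEncodingProbe`, `DetectEncoding`, `ToXmlString`) are made up for illustration, and it assumes the bytes are well-formed XML in an encoding the runtime supports.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlEncodingProbe
{
    // Let XmlTextReader inspect the BOM / XML declaration and report the encoding.
    public static Encoding DetectEncoding(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new XmlTextReader(stream))
        {
            reader.MoveToContent();
            return reader.Encoding;
        }
    }

    // Decode the whole array using the detected encoding.
    // StreamReader skips a leading BOM when BOM detection is enabled.
    public static string ToXmlString(byte[] bytes)
    {
        Encoding encoding = DetectEncoding(bytes);
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, encoding, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Usage would be something like `string xml = XmlEncodingProbe.ToXmlString(bytes);`. The array is read twice here for simplicity, which is fine for small documents.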

score 7 · Accepted Answer

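You could look at the first 40 bytes¹. They should contain the document declaration (assuming it has one), which should state the encoding. Otherwise you can assume it's UTF-8 or UTF-16, and which of the two should be obvious from how the <?xml part is laid out in bytes. (Just check for both patterns.)

Realistically, do you expect to ever get anything other than UTF-8 or UTF-16? If not, you could check which of those two patterns appears at the start and throw an exception if it follows neither. Alternatively, if you want to make one more attempt, you could always try decoding the document as UTF-8, re-encoding it, and checking whether you get the same bytes back. It's not ideal, but it may just work.

I'm sure there are more rigorous ways of doing this, but they're likely to be finicky :)

¹ Quite possibly fewer than this. I figure 20 characters should be enough, which is 40 bytes in UTF-16.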
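The decode-and-re-encode check mentioned above can be sketched roughly as follows. The names `Utf8RoundTrip` and `LooksLikeUtf8` are hypothetical. It relies on `Encoding.UTF8` substituting U+FFFD for invalid byte sequences when decoding, so invalid input does not come back byte-for-byte.

```csharp
using System.Linq;
using System.Text;

public static class Utf8RoundTrip
{
    // True when decoding as UTF-8 and re-encoding reproduces the input exactly.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // Invalid sequences decode to U+FFFD, which re-encodes as EF BF BD,
        // so the comparison below fails for input that is not valid UTF-8.
        string text = Encoding.UTF8.GetString(bytes);
        byte[] again = Encoding.UTF8.GetBytes(text);
        return bytes.SequenceEqual(again);
    }
}
```

One caveat: BOM-less UTF-16 that contains only ASCII characters (3C 00 3F 00 ...) is also valid UTF-8, because NUL bytes are legal there, so it makes sense to check for the UTF-16 <?xml pattern before trusting this test.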

score 7 · Accepted Answer

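The first 2 or 3 bytes may be a Byte Order Mark (BOM), which can tell you whether the stream is UTF-8, Unicode-LittleEndian or Unicode-BigEndian.

UTF-8 BOM is 0xEF 0xBB 0xBF
Unicode-BigEndian is 0xFE 0xFF
Unicode-LittleEndian is 0xFF 0xFE

If none of these are present, you can test for <?xml using ASCII (note that most modern XML generators follow the standard that no whitespace may precede the XML declaration).

ASCII is used up until ?>, so you can look for encoding= and read its value. If encoding isn't present, or the <?xml declaration isn't present, you can assume UTF-8.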
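As a rough sketch of those steps: check the three BOMs, then read the declaration as ASCII and pull out the encoding value. The class name, the 200-byte window, the regular expression, and the "utf-16LE"/"utf-16BE" labels are all assumptions made for this example rather than anything a library prescribes.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlDeclarationSniffer
{
    public static string GuessEncodingName(byte[] bytes)
    {
        // 1. Byte order marks.
        if (StartsWith(bytes, new byte[] { 0xEF, 0xBB, 0xBF })) return "utf-8";
        if (StartsWith(bytes, new byte[] { 0xFE, 0xFF })) return "utf-16BE";
        if (StartsWith(bytes, new byte[] { 0xFF, 0xFE })) return "utf-16LE";

        // 2. No BOM: read the start of the document as ASCII (window size is arbitrary).
        string head = Encoding.ASCII.GetString(bytes, 0, Math.Min(bytes.Length, 200));
        if (!head.StartsWith("<?xml", StringComparison.Ordinal)) return "utf-8";

        int end = head.IndexOf("?>", StringComparison.Ordinal);
        if (end < 0) return "utf-8";

        // 3. Look for encoding="..." (or single quotes) inside the declaration.
        Match m = Regex.Match(head.Substring(0, end),
            @"encoding\s*=\s*[""']([A-Za-z][A-Za-z0-9._\-]*)[""']");
        return m.Success ? m.Groups[1].Value : "utf-8";
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

The returned name could then be handed to `Encoding.GetEncoding(name)`, bearing in mind that this throws for names the runtime doesn't recognise.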

score 7 · Accepted Answer

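The W3C XML specification has a section on how to determine the encoding of a byte string.

First check for a Unicode Byte Order Mark

The BOM is just another character; it's the:

'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)

For example: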

ZWNBSP<?xml vers
"\ufeff<?xml vers"
"\ufeff\u003c\u003f\u0078\u006d\u006c\u0020\u0076\u0065\u0072\u0073"
U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073

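The character U+FEFF, along with every other character in the file, is encoded using the appropriate encoding scheme:

00 00 FE FF: UCS-4, big-endian machine (1234 order)
FF FE 00 00: UCS-4, little-endian machine (4321 order)
00 00 FF FE: UCS-4, unusual octet order (2143)
FE FF 00 00: UCS-4, unusual octet order (3412)
FE FF ## ##: UTF-16, big-endian
FF FE ## ##: UTF-16, little-endian
EF BB BF: UTF-8

where ## ## can be anything, except for both being zero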

U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073
ff fe  3c 00  3f 00  78 00  6d 00  6c 00  20 00  76 00  65 00  72 00  73 00
ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 65 00 72 00 73 00
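To see where those bytes come from, here is a small sketch that encodes the same example string with a few of .NET's built-in encodings and prints the result as hex. `BomDemo` and `Hex` are illustrative names only.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Encode the text and format the bytes as dash-separated hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string sample = "\uFEFF<?xml vers";
        Console.WriteLine(Hex(Encoding.Unicode, sample));          // UTF-16, little-endian
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, sample)); // UTF-16, big-endian
        Console.WriteLine(Hex(Encoding.UTF8, sample));             // UTF-8
    }
}
```

The first line printed is `FF-FE-3C-00-3F-00-78-00-6D-00-6C-00-20-00-76-00-65-00-72-00-73-00`, which matches the little-endian byte dump above.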

So first check the initial bytes for any of these signatures. If you find one of them, return that code-page identifier

UInt32 GuessEncoding(byte[] XmlString)
{
   // BytesEqual is assumed to test whether XmlString starts with the given bytes.
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0xFE, 0xFF })) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if (BytesEqual(XmlString, new byte[] { 0xFF, 0xFE, 0x00, 0x00 })) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0xFF, 0xFE })) throw new Exception("Nobody supports 2143 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0xFE, 0xFF, 0x00, 0x00 })) throw new Exception("Nobody supports 3412 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0xFE, 0xFF }))
   {
      // The next two bytes may be anything except both zero.
      if (XmlString.Length < 4 || XmlString[2] != 0 || XmlString[3] != 0)
         return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   }
   if (BytesEqual(XmlString, new byte[] { 0xFF, 0xFE }))
   {
      if (XmlString.Length < 4 || XmlString[2] != 0 || XmlString[3] != 0)
         return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   }
   if (BytesEqual(XmlString, new byte[] { 0xEF, 0xBB, 0xBF })) return 65001; //"utf-8" - Unicode (UTF-8)

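Or look for <?xml

If the XML document has no byte order mark character, move on to looking for the five characters that begin an XML declaration (the declaration is optional, but when present it must be the very first thing in the document):

<?xml

It's helpful to know that

< is #x0000003C
? is #x0000003F

With that, we can look at the first four bytes:

00 00 00 3C: UCS-4, big-endian machine (1234 order)
3C 00 00 00: UCS-4, little-endian machine (4321 order)
00 00 3C 00: UCS-4, unusual octet order (2143)
00 3C 00 00: UCS-4, unusual octet order (3412)
00 3C 00 3F: UTF-16, big-endian
3C 00 3F 00: UTF-16, little-endian
3C 3F 78 6D: UTF-8 (or another ASCII-compatible encoding; the encoding declaration tells them apart)
4C 6F A7 94: some flavor of EBCDIC

So we can add more to our code: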

   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0x00, 0x3C })) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x3C, 0x00, 0x00, 0x00 })) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x00, 0x3C, 0x00 })) throw new Exception("Nobody supports 2143 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x3C, 0x00, 0x00 })) throw new Exception("Nobody supports 3412 UCS-4");
   if (BytesEqual(XmlString, new byte[] { 0x00, 0x3C, 0x00, 0x3F })) return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x3C, 0x00, 0x3F, 0x00 })) return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   if (BytesEqual(XmlString, new byte[] { 0x3C, 0x3F, 0x78, 0x6D })) return 65001; //"utf-8" - Unicode (UTF-8)
   if (BytesEqual(XmlString, new byte[] { 0x4C, 0x6F, 0xA7, 0x94 }))
   {
      //Some variant of EBCDIC, e.g.:
      //20273   IBM273  IBM EBCDIC Germany
      //20277   IBM277  IBM EBCDIC Denmark-Norway
      //20278   IBM278  IBM EBCDIC Finland-Sweden
      //20280   IBM280  IBM EBCDIC Italy
      //20284   IBM284  IBM EBCDIC Latin America-Spain
      //20285   IBM285  IBM EBCDIC United Kingdom
      //20290   IBM290  IBM EBCDIC Japanese Katakana Extended
      //20297   IBM297  IBM EBCDIC France
      //20420   IBM420  IBM EBCDIC Arabic
      //20423   IBM423  IBM EBCDIC Greek
      //20424   IBM424  IBM EBCDIC Hebrew
      //20833   x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
      //20838   IBM-Thai    IBM EBCDIC Thai
      //20871   IBM871  IBM EBCDIC Icelandic
      //20880   IBM880  IBM EBCDIC Cyrillic Russian
      //20905   IBM905  IBM EBCDIC Turkish
      //20924   IBM00924    IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
      throw new Exception("We don't support EBCDIC. Sorry");
   }

   //Otherwise assume UTF-8, and fail to decode it anyway
   return 65001; //"utf-8" - Unicode (UTF-8)

   //Any code is in the public domain. No attribution required.
}
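The code above calls BytesEqual without showing it. A minimal sketch, assuming it is meant as a "starts with" comparison, might look like the following. The `Decode` helper and the `PrefixHelpers` class name are additions for illustration: `Decode` turns a returned code-page identifier into a string via `Encoding.GetEncoding`, skipping a leading BOM if there is one.

```csharp
using System;
using System.Text;

public static class PrefixHelpers
{
    // Assumed semantics: true when 'data' begins with every byte of 'prefix'.
    public static bool BytesEqual(byte[] data, byte[] prefix)
    {
        if (data == null || prefix == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Decode 'xml' with the encoding for the given code page, dropping a leading BOM.
    public static string Decode(byte[] xml, int codePage)
    {
        Encoding encoding = Encoding.GetEncoding(codePage);
        byte[] preamble = encoding.GetPreamble();
        int skip = (preamble.Length > 0 && BytesEqual(xml, preamble)) ? preamble.Length : 0;
        return encoding.GetString(xml, skip, xml.Length - skip);
    }
}
```

For example, `PrefixHelpers.Decode(bytes, (int)GuessEncoding(bytes))` would give a string suitable for passing on. Code pages 1200, 1201, 12000, 12001 and 65001 are available by default; other code pages may need an extra encoding provider on newer .NET runtimes.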
