compression - GZipStream doesn't detect corrupt data (even CRC32 passes)?

Question

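I'm using GZipStream to compress and decompress data. I chose it over DeflateStream because the documentation states that GZipStream also adds a CRC to detect corrupt data, which is another feature I wanted. My "positive" unit tests work well: I can compress some data, save the compressed byte array, and then successfully decompress it again. The post ".NET GZipStream compress and decompress problem" helped me realize that I need to close the GZipStream before accessing the compressed or decompressed data.

Next, I went on to write a "negative" unit test to make sure corrupt data can be detected. Previously, I had used the GZipStream class example from MSDN to compress a file, opened the compressed file in a text editor, changed one byte to corrupt it (as if opening it in a text editor wasn't bad enough!), saved it, and then decompressed it to confirm that I got an InvalidDataException as expected.

When I wrote the unit test, I picked an arbitrary byte to corrupt (e.g., compressedDataBytes[50] = 0x99) and got an InvalidDataException. So far so good. I was curious, so I picked another byte, and to my surprise I did not get an exception. That might be fine (e.g., I happened to hit an unused byte in a block of data), as long as the data could still be recovered successfully. However, I didn't get the correct data back either!

To be sure "it wasn't me", I took the cleaned-up code from the bottom of ".NET GZipStream compress and decompress problem" and modified it to corrupt each byte of the compressed data in turn until it fails to decompress properly. Here are the changes (note that I'm using the Visual Studio 2010 test framework):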

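// successful compress / decompress example code from:
//    https://stackoverflow.com/questions/1590846/net-gzipstream-compress-and-decompress-problem
// Namespaces needed: System.IO, System.IO.Compression, System.Text,
//    Microsoft.VisualStudio.TestTools.UnitTesting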
[TestMethod]
public void Test_zipping_with_memorystream_and_corrupting_compressed_data()
{
   const string sample = "This is a compression test of microsoft .net gzip compression method and decompression methods";
   var encoding = new ASCIIEncoding();
   var data = encoding.GetBytes(sample);
   string sampleOut = null;
   byte[] cmpData;

   // Compress 
   using (var cmpStream = new MemoryStream())
   {
      using (var hgs = new GZipStream(cmpStream, CompressionMode.Compress))
      {
         hgs.Write(data, 0, data.Length);
      }
      cmpData = cmpStream.ToArray();
   }

   int corruptBytesNotDetected = 0;

   // corrupt data byte by byte
   for (var byteToCorrupt = 0; byteToCorrupt < cmpData.Length; byteToCorrupt++)
   {
      // corrupt the data
      cmpData[byteToCorrupt]++;

      using (var decomStream = new MemoryStream(cmpData))
      {
         using (var hgs = new GZipStream(decomStream, CompressionMode.Decompress))
         {
            using (var reader = new StreamReader(hgs))
            {
               try
               {
                  sampleOut = reader.ReadToEnd();

                  // if we get here, the corrupt data was not detected by GZipStream
                  // ... okay so long as the correct data is extracted
                  corruptBytesNotDetected++;

                  var message = string.Format("ByteCorrupted = {0}, CorruptBytesNotDetected = {1}",
                     byteToCorrupt, corruptBytesNotDetected);

                  Assert.IsNotNull(sampleOut, message);
                  Assert.AreEqual(sample, sampleOut, message);
               }
               catch(InvalidDataException)
               {
                  // data was corrupted, so we expect to get here
               }
            }
         }
      }

      // restore the data
      cmpData[byteToCorrupt]--;
   }
}

When I run this test, I get:

Assert.AreEqual failed. Expected:<This is a compression test of microsoft .net gzip compression method and decompression methods>. Actual:<>. ByteCorrupted = 11, CorruptBytesNotDetected = 8

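So this means there were actually 7 cases where corrupting the data made no difference (the string was recovered successfully), but corrupting byte 11 neither threw an exception nor recovered the data.

Am I missing something or doing something wrong? Can anyone see why the corrupt compressed data is not being detected?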

score 7 · Accepted Answer

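The gzip format has a 10-byte header, and the last seven bytes of it can be changed without causing a decompression error. So the seven cases you noted where the corruption made no difference are expected.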
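The header layout described above can be sketched as follows. This is a minimal illustration based on the gzip format specification (RFC 1952), not code from the original test; the names GzipHeaderSketch and HasValidFixedFields are hypothetical.

```csharp
using System;

static class GzipHeaderSketch
{
    // Checks only the first three header bytes, which are the ones a
    // decompressor has to reject when they are wrong.
    public static bool HasValidFixedFields(byte[] gz)
    {
        if (gz == null || gz.Length < 10)
            return false; // the fixed gzip header is 10 bytes long

        bool magicOk = gz[0] == 0x1F && gz[1] == 0x8B; // ID1, ID2
        bool methodOk = gz[2] == 8;                    // CM: 8 = deflate

        // The remaining seven bytes are FLG (byte 3), MTIME (bytes 4-7),
        // XFL (byte 8) and OS (byte 9). They describe the stream rather
        // than the compressed data, so they are not checked here.
        return magicOk && methodOk;
    }
}
```

With a header such as 1F 8B 08 00 00 00 00 00 04 00, incrementing any of the first three bytes makes this check fail, while incrementing byte 9 (OS) does not. That matches the two header errors, one compression-method error, and seven successes seen in the tests.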

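It should be vanishingly rare for corruption anywhere else in the stream to go undetected. Most of the time the decompressor will detect an error in the format of the compressed data and never even get to the point of checking the CRC. If it does get as far as checking the CRC, that check should fail very nearly all the time for a corrupted input stream. ("Nearly all the time" means a probability of about 1 - 2^-32.)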

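I just tried it (in C, using zlib) with your sample string, which produces an 84-byte gzip stream. Incrementing each of the 84 bytes while leaving the rest unchanged, as you did, resulted in: two incorrect header checks, one invalid compression method, seven successes, one invalid block type, four invalid distances set, seven invalid code lengths set, four missing end-of-block, 11 invalid bit length repeat, three invalid bit length repeat, two invalid bit length repeat, two unexpected end of stream, 36 incorrect data check (that is the actual CRC error), and four incorrect length check (another check in the gzip format, for the correct uncompressed data length). In no case did a corrupted compressed stream go undetected.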

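So there must be a bug somewhere, either in your code or in the class.

Update:

It appears that there is a bug in the class.

Remarkably (or perhaps not so remarkably), Microsoft has concluded that they will not fix this bug!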
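Given that, one possible workaround is to re-check the gzip trailer yourself after decompressing. The sketch below is an assumption-laden illustration, not Microsoft's or zlib's implementation: it assumes a single-member gzip stream with no trailing bytes, so that the last eight bytes are the CRC-32 and the uncompressed length (both little-endian, per RFC 1952). The names GzipTrailerCheck, Crc32 and DecompressAndVerify are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class GzipTrailerCheck
{
    // Bitwise CRC-32 with the reflected polynomial 0xEDB88320 used by gzip.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Reads a 32-bit little-endian value regardless of machine byte order.
    static uint ReadUInt32LE(byte[] buf, int o)
    {
        return (uint)buf[o] | (uint)buf[o + 1] << 8
             | (uint)buf[o + 2] << 16 | (uint)buf[o + 3] << 24;
    }

    // Decompresses with GZipStream, then independently compares the result
    // against the CRC-32 and length stored in the last 8 bytes.
    // Assumption: one gzip member, nothing appended after it.
    public static byte[] DecompressAndVerify(byte[] gz)
    {
        if (gz == null || gz.Length < 18) // 10-byte header + 8-byte trailer
            throw new InvalidDataException("Too short to be a gzip stream.");

        byte[] output;
        using (var input = new MemoryStream(gz))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var result = new MemoryStream())
        {
            gzip.CopyTo(result);
            output = result.ToArray();
        }

        uint storedCrc = ReadUInt32LE(gz, gz.Length - 8);
        uint storedSize = ReadUInt32LE(gz, gz.Length - 4);
        if (storedSize != (uint)output.Length || storedCrc != Crc32(output))
            throw new InvalidDataException("gzip trailer does not match the decompressed data.");

        return output;
    }
}
```

In the byte 11 case from the question, the empty output should fail the length comparison even though GZipStream itself raised nothing. On runtimes where GZipStream already validates the stream, the exception may simply come from the class first.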
