zip - JPG+Zip file combination problem with the Zip format

Question

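Hopefully you've heard of the neat hack that lets you combine a JPG and a Zip file into a single file that is valid (or at least readable) in both formats. Well, I realized that since JPG allows arbitrary content at the end, and ZIP allows it at the beginning, you could stick one more format in there, in the middle. For the purposes of this question, assume the middle data is arbitrary binary data guaranteed not to conflict with the JPG or ZIP formats (meaning it doesn't contain the magic zip header 0x04034b50). Illustration: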

0xFFD8 <- start jpg data end -> 0xFFD9 ... ARBITRARY BINARY DATA ... 0x04034b50 <- start zip file ... EOF
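As a rough sketch of the layout above, the concatenation that `cat` performs can also be done in Python, with a guard that rejects middle data containing the zip magic. The function name `build_polyglot` is hypothetical and not from the question. Note that 0x04034b50 is stored little-endian, so on disk it appears as the bytes 50 4B 03 04 (`PK\x03\x04`).

```python
import struct

# 0x04034b50 written as a little-endian 32-bit integer: bytes 50 4B 03 04.
ZIP_LOCAL_MAGIC = struct.pack("<I", 0x04034B50)


def build_polyglot(jpg: bytes, middle: bytes, zip_bytes: bytes) -> bytes:
    """Concatenate JPG + arbitrary data + ZIP, like `cat a b c > out`."""
    if not jpg.startswith(b"\xff\xd8"):
        raise ValueError("JPG data should start with the SOI marker FFD8")
    if ZIP_LOCAL_MAGIC in middle:
        raise ValueError("middle data contains the zip local-header magic")
    # Assumption: a non-empty zip, which begins with a local file header.
    if not zip_bytes.startswith(ZIP_LOCAL_MAGIC):
        raise ValueError("zip data should start with a local file header")
    return jpg + middle + zip_bytes
```

The guard only enforces the assumption stated in the question; it does not make the result readable by any particular zip tool.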

I'm catting like so:

cat "mss_1600.jpg" filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea filea fileb filea fileb "null.bytes" "randomzipfile.zip" > temp.zip

This produces a 6,318 KB file. It does not open in 7-Zip. However, when I cat one fewer "double" (12 filea/fileb pairs instead of 13):

cat "mss_1600.jpg" filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb filea fileb "null.bytes" "randomzipfile.zip" > temp.zip

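It produces a 5,996 KB file, and that file does open in 7-Zip.

So I know my arbitrary binary data doesn't contain the magic Zip file header to screw things up. I have reference files of the working jpg+data+zip and the non-working jpg+data+zip (use Save As, because the browser thinks they are images, and add the .zip extension yourself).

I want to know why it fails with 13 combinations but works with 12. For bonus points, I need to get around this somehow.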

score 22 · Accepted Answer

I downloaded the source code for 7-Zip and figured out what is causing this.

In CPP/7zip/UI/Common/OpenArchive.cpp, you'll see the following:

// Static-SFX (for Linux) can be big.
const UInt64 kMaxCheckStartPosition = 1 << 22;

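That means only the first 4194304 bytes of the file are searched for the header. If it isn't found there, 7-Zip considers the file invalid.

You can double that limit by changing 1 << 22 to 1 << 23. I tested that change by rebuilding 7-Zip, and it works.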
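To see whether a given combined file is affected, one can check where the first zip signature lands relative to that limit. This is a minimal sketch, not 7-Zip's actual logic: the function names are hypothetical, and whether 7-Zip treats the exact boundary as inclusive or exclusive is not checked here.

```python
ZIP_LOCAL_MAGIC = b"PK\x03\x04"        # 0x04034b50 in little-endian byte order
K_MAX_CHECK_START_POSITION = 1 << 22   # 4194304, the value quoted above


def first_zip_signature_offset(data: bytes) -> int:
    """Return the offset of the first zip local-header magic, or -1."""
    return data.find(ZIP_LOCAL_MAGIC)


def starts_within_scan_window(data: bytes,
                              limit: int = K_MAX_CHECK_START_POSITION) -> bool:
    """True if the first zip signature begins within the first `limit` bytes."""
    offset = first_zip_signature_offset(data)
    return 0 <= offset < limit
```

For example, `starts_within_scan_window(open("temp.zip", "rb").read())` would report whether the zip part of the question's file begins inside the first 4 MiB.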

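EDIT: To get around this, you can download the source code, make the change above, and build it. I built it with VS 2008. Open the VS command prompt, navigate to the extracted source location\CPP\7zip\Bundles and type "nmake". Then, in the Alone directory, run "7za t nonworking.jpg" and you should see "Everything is Ok".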

score 10 · Accepted Answer

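Actually, this is a two-part answer :)

First, no matter what people say, a zip file technically cannot be put verbatim at the end of another file. The end of central directory record holds a value that gives the byte offset from the start of the current disk (if you have just one .zip file, that means the current file). Many zip processors ignore this value, but Windows' zip folders do not, so you need to correct it to make the file work in Windows Explorer (not that you probably care ;P). See the Zip APPNOTE for details of the file format. Basically, use a hex editor (or write a tool) to find the "offset of start of central directory with respect to the starting disk number" value. Then find the first "central file header signature" (504b0102 in hex) and set the value to that signature's offset.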
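Before patching anything, it can help to compare the offset recorded in the end of central directory record with where the central directory actually is. The sketch below is a diagnostic only, with a hypothetical function name. Rather than scanning for the first 504b0102 (which could in principle also occur by chance inside compressed data), it assumes a single-disk, non-Zip64 archive whose central directory sits directly before the end record, and uses the recorded directory size to locate it.

```python
import struct

EOCD_MAGIC = b"PK\x05\x06"     # end of central directory record signature
CENTRAL_MAGIC = b"PK\x01\x02"  # central file header signature


def central_directory_offsets(data: bytes):
    """Return (recorded_offset, actual_offset) for the central directory."""
    eocd = data.rfind(EOCD_MAGIC)
    if eocd < 0:
        raise ValueError("no end of central directory record found")
    # Fields after the 4-byte signature: disk number (2), disk holding the
    # central directory (2), entries on this disk (2), total entries (2),
    # central directory size (4), central directory offset (4),
    # comment length (2). Size and offset therefore start at byte 12.
    cd_size, recorded = struct.unpack_from("<II", data, eocd + 12)
    # Assumption: the central directory ends exactly where the EOCD begins.
    actual = eocd - cd_size
    if actual < 0 or data[actual:actual + 4] != CENTRAL_MAGIC:
        raise ValueError("central directory not found where expected")
    return recorded, actual
```

If the two numbers differ, the difference is the amount of data that was placed in front of the zip.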

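Alas, that doesn't fix 7zip, and the reason is the way 7zip tries to guess the file format. Basically, it only searches the first ~4 MiB for the binary sequence 504b0304; if it doesn't find it, it assumes the file isn't a Zip and tries its other archive formats. This is obviously why adding one more file breaks things: it pushes the sequence past the search limit.

To fix it, you need to add that hex string to the jpeg without breaking it. One way is to add the following hex data right after the FFD8 JPEG SOI header: FFEF0005504B030400. This adds a custom block containing your sequence and is well-formed, so jpeg readers should ignore it.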
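The insertion can be sketched in Python as follows. The function name is hypothetical. The sketch computes the segment length instead of hard-coding it: a JPEG segment length is big-endian and counts its own two bytes, so for this five-byte payload it comes out as 0007, which is the value reported as working elsewhere in this thread.

```python
import struct

SOI = b"\xff\xd8"                # JPEG start-of-image marker
APP15 = b"\xff\xef"              # application segment marker used as a carrier
ZIP_LOCAL_MAGIC = b"PK\x03\x04"  # the sequence 7-Zip scans for


def add_zip_hint_segment(jpg: bytes) -> bytes:
    """Insert an FFEF segment carrying the zip magic right after SOI."""
    if not jpg.startswith(SOI):
        raise ValueError("not a JPEG: missing SOI marker")
    payload = ZIP_LOCAL_MAGIC + b"\x00"
    # Length field = payload length + the 2 bytes of the length field itself.
    segment = APP15 + struct.pack(">H", len(payload) + 2) + payload
    return SOI + segment + jpg[2:]
```

The fake signature ends up at bytes 6 to 9 of the file, well inside the range 7-Zip scans.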

score 4 · Accepted Answer

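So for anyone else who finds this question, here's the story:

Yes, Andy is entirely correct about why 7-Zip fails on the file, but that doesn't help my problem, since I can't exactly get people to use my version of 7-Zip.

However, tyranid gave me the solution.

First, adding a small byte string to the JPG as he suggested does let 7-Zip open it. However, his string was slightly off from being a valid JPG segment: it needs to be FFEF00 07 504B030400, because the length was off by 2 bytes.
This lets 7-Zip open the file, but not extract from it; extraction fails silently. That's because the entries in the central directory contain internal pointers/offsets to the file entries. Since you've put a bunch of stuff in front of them, you need to correct all of those pointers!
To make the zip open with Windows' built-in zip support, as tyranid said, you need to correct the "offset of start of central directory with respect to the starting disk number". Here is a python script that does the last two; note that it is a fragment, not copypasta-ready-to-use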


# Now we need to read the file and rewrite all the zip headers.  Fun!
# (Fragment: magicfilename, the path of the combined file, is defined earlier in the full script.)
from struct import pack, unpack

with open(magicfilename, 'rb') as torewrite:
    magicdata = torewrite.read()

# Change the central directory's offset in the end of central directory record
offsetOfCentralRepro = magicdata.find(b'\x50\x4B\x01\x02')  # this is the beginning of the central directory
# It so happens that, on my files, the offset field ends 2 bytes before the end of the file:
# datadatadata OF FS ET !! 00 00 EOF, where OFFSET!! is the 4-byte value and
# 00 00 (the zip comment length) are the last two bytes.
start = len(magicdata) - 6
magicdata = magicdata[:start] + pack('<I', offsetOfCentralRepro) + magicdata[start+4:]

# Now change the individual offsets in the central directory entries
startOfCentralDirectoryEntry = magicdata.find(b'\x50\x4B\x01\x02', 0)  # find the first central directory entry
startOfFileDirectoryEntry = magicdata.find(b'\x50\x4B\x03\x04', 10)  # find the first file entry (start at 10 to skip the fake entry in the jpg)
while startOfCentralDirectoryEntry > 0:
    # The local header offset is stored 42 bytes into a central directory entry (really! It's 42!)
    startOfCentralDirectoryEntry = startOfCentralDirectoryEntry + 42

    # get the current offset just to output something to the terminal
    (oldoffset,) = unpack('<I', magicdata[startOfCentralDirectoryEntry : startOfCentralDirectoryEntry+4])
    print("Old Offset: ", oldoffset, " New Offset: ", startOfFileDirectoryEntry, " at ", startOfCentralDirectoryEntry)
    # now replace it (zip integers are little-endian, hence '<I')
    magicdata = magicdata[:startOfCentralDirectoryEntry] + pack('<I', startOfFileDirectoryEntry) + magicdata[startOfCentralDirectoryEntry+4:]

    # now move to the next central directory entry, and the next file entry
    startOfCentralDirectoryEntry = magicdata.find(b'\x50\x4B\x01\x02', startOfCentralDirectoryEntry)
    startOfFileDirectoryEntry = magicdata.find(b'\x50\x4B\x03\x04', startOfFileDirectoryEntry+1)

# Finally write the rewritten headers' data
with open(magicfilename, 'wb') as towrite:
    towrite.write(magicdata)
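After rewriting, one way to sanity-check the result is to walk the central directory and confirm that every entry's stored offset really points at a local file header. The sketch below uses a hypothetical function name and assumes a single-disk, non-Zip64 archive whose central directory sits immediately before the end of central directory record. Unlike the fragment above, it steps through entries using their length fields instead of searching for signatures. It checks only the per-entry offsets, not the central directory offset stored in the end record.

```python
import struct

EOCD_MAGIC = b"PK\x05\x06"     # end of central directory record
CENTRAL_MAGIC = b"PK\x01\x02"  # central file header
LOCAL_MAGIC = b"PK\x03\x04"    # local file header


def local_offsets_are_valid(data: bytes) -> bool:
    """Check that each central directory entry points at a local header."""
    eocd = data.rfind(EOCD_MAGIC)
    if eocd < 0:
        return False
    # Total entry count is at byte 10 of the EOCD, directory size at byte 12.
    total_entries, cd_size = struct.unpack_from("<HI", data, eocd + 10)
    pos = eocd - cd_size
    if pos < 0:
        return False
    for _ in range(total_entries):
        if data[pos:pos + 4] != CENTRAL_MAGIC:
            return False
        # File name, extra field and comment lengths sit at bytes 28..33
        # of the fixed 46-byte central header.
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        # The local header offset is at byte 42 (the "magic" 42 above).
        (local_offset,) = struct.unpack_from("<I", data, pos + 42)
        if data[local_offset:local_offset + 4] != LOCAL_MAGIC:
            return False
        pos += 46 + name_len + extra_len + comment_len
    return True
```

A file that 7-Zip opens but silently fails to extract would be expected to return False here until the offsets are corrected.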

score 2 · Accepted Answer

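You can produce hybrid JPG+ZIP files with DotNetZip. DotNetZip can save to a stream, and it is smart enough to recognize the original offset of a pre-existing stream before it starts writing zip content into it. So, in pseudo-code, you can get a JPG+ZIP this way: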

 open stream on an existing JPG file for update
 seek to the end of that stream
 open or create a zip file
 call ZipFile.Save to write zip content to the JPG stream
 close

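All the offsets are computed correctly. The same technique is used to produce self-extracting archives: you can open a stream on an EXE, seek to the end, and write the ZIP content to that stream. If you do it this way, all the offsets are calculated correctly.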
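DotNetZip is a .NET library, so this is not the answer's code. As a rough analogue in Python, the standard library's `zipfile` module can open a stream in append mode; when the stream is not already a zip, a new archive is written after the existing content, and the offsets it records appear to be relative to the start of the stream. The helper name below is hypothetical.

```python
import io
import zipfile


def append_zip_to_stream(stream, files: dict) -> None:
    """Append a new zip archive to the end of an existing seekable stream.

    `files` maps archive member names to their content as bytes.
    """
    # Mode "a" on something that is not already a zip appends a new
    # archive after the existing data instead of overwriting it.
    with zipfile.ZipFile(stream, mode="a") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
```

With a real file, the stream would be opened with `open(path, "r+b")` on the existing JPG; an `io.BytesIO` holding the JPG bytes works the same way for experiments.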

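One more thing, regarding one of the comments in another post: a ZIP can have arbitrary data at the start and at the end of the file. As far as I know, there is no requirement that the zip central directory be at the END of the file, though that is typical.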
