c# - Problem Flate-decoding an embedded font from a PDF

Question

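Okay, before we begin: I work for a company that holds licenses to redistribute PDF files from various publishers in any media form. So, that said, extracting the embedded fonts from a given PDF file is not only legal but also vital to the presentation.

I am using code that I found on this site. I do not remember the author, but I will credit them when I find the post again. I have located the stream in the PDF file that contains the embedded font, and I have isolated this encoded stream as a string and then as a byte[]. When I run the following code I get the error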

Block length does not match with its complement.

Code (the error occurs on the while line, marked by a comment):

private static byte[] DecodeFlateDecodeData(byte[] data)
{
    MemoryStream outputStream;
    using (outputStream = new MemoryStream())
    {
        using (var compressedDataStream = new MemoryStream(data))
        {
            // Remove the first two bytes to skip the header (it isn't recognized by the DeflateStream class)
            compressedDataStream.ReadByte();
            compressedDataStream.ReadByte();

            var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
            var decompressedBuffer = new byte[compressedDataStream.Length];
            int read;

            // The error occurs in the following line
            while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
            {
                outputStream.Write(decompressedBuffer, 0, read);
            }
            outputStream.Flush();
            compressedDataStream.Close();
        }

        return ReadFully(outputStream);
    }
}

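After using the usual tools (Google, Bing, the archives here), I found that most of the time this happens when one has not consumed the first two bytes of the encoded stream. That is already done here, so I cannot find the source of this error. Below is the encoded stream: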

H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlÇ±“ºu“°tƒ¦t0ÊD¶jˆ
Ö   m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ýÝ‡Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü    ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

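Please help, I am banging my head against a wall here!

Note: the stream above is the encoded version of Arial Black, according to the specification within the PDF: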

661 0 obj
<< 
/Type /FontDescriptor 
/FontFile3 662 0 R 
/FontBBox [ -194 -307 1688 1083 ] 
/FontName /HLJOBA+ArialBlack 
/Flags 4 
/StemV 0 
/CapHeight 715 
/XHeight 518 
/Ascent 0 
/Descent -209 
/ItalicAngle 0 
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>> 
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >> 
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlÇ±“ºu“°tƒ¦t0ÊD¶jˆ
Ö   m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ýÝ‡Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü    ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

score 1 · Accepted Answer

Is there a particular reason why you're not using the GetStreamBytes() method that is provided with iText? What about data? Are you sure you are looking at the correct bytes? Did you create the PRStream object correctly and did you get the bytes with PdfReader.GetStreamBytesRaw()? If so, why decode the bytes yourself? Which brings me to my initial counter-question: is there a particular reason why you're not using the GetStreamBytes() method?
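
For readers who have not used these calls, here is a minimal sketch of what the answer suggests. It assumes iTextSharp 5.x is referenced and that the font program lives in object 662, as in the question's excerpt. The class name FontStreamDump and the method ReadDecodedStream are hypothetical, made up for illustration.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;

public static class FontStreamDump
{
    // Returns the decoded bytes of the stream stored in the given indirect
    // object, or null if that object does not exist or is not a stream.
    public static byte[] ReadDecodedStream(string pdfPath, int objectNumber)
    {
        var reader = new PdfReader(pdfPath);
        try
        {
            PdfObject obj = reader.GetPdfObject(objectNumber);
            if (obj == null || !obj.IsStream())
                return null;

            // GetStreamBytes applies the filters named in the stream dictionary
            // (here /FlateDecode). GetStreamBytesRaw would instead return the
            // bytes exactly as stored, still compressed.
            return PdfReader.GetStreamBytes((PRStream)obj);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call such as FontStreamDump.ReadDecodedStream(@"C:\mypdf.pdf", 662) should then hand back the decoded font program. Because the stream is marked /Subtype /Type1C, those bytes are CFF font data, which still needs wrapping or conversion before most tools will accept it as a font file.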

score 1 · Accepted Answer

Looks like GetStreamBytes() might solve your problem outright, but let me point out that I think you're doing something dangerous concerning end-of-line markers. The PDF Specification in 7.3.8.1 states that:

The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone.

In your code it looks like you always skip two bytes while the spec says it could be either one or two (CR LF or LF).

You should be able to catch whether you are running into this by comparing the exact number of bytes you want to decode with the value of the (Required) "Length" key in the stream dictionary.
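
A minimal sketch of that check, using only the .NET base class library. It assumes the PDF is held as a byte[] throughout (round-tripping binary data through a string can alter bytes), that /Length is a direct number like the 1700 in the question, that the first "stream" text after the given offset is the keyword itself, and that the data carries the usual two-byte zlib header with no preset dictionary. PdfStreamSketch, ExtractRaw and Inflate are hypothetical names for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class PdfStreamSketch
{
    // Finds the first "stream" keyword at or after `from`, skips the CR LF or
    // LF end-of-line marker that follows it, and returns exactly `length` bytes.
    public static byte[] ExtractRaw(byte[] pdf, int from, int length)
    {
        byte[] keyword = Encoding.ASCII.GetBytes("stream");
        int pos = IndexOf(pdf, keyword, from);
        if (pos < 0) throw new InvalidDataException("No 'stream' keyword found.");
        pos += keyword.Length;

        if (pos + 1 < pdf.Length && pdf[pos] == (byte)'\r' && pdf[pos + 1] == (byte)'\n')
            pos += 2;                       // CR LF
        else if (pos < pdf.Length && pdf[pos] == (byte)'\n')
            pos += 1;                       // LF only
        else
            throw new InvalidDataException("'stream' is not followed by CR LF or LF.");

        if (pos + length > pdf.Length)
            throw new InvalidDataException("/Length runs past the end of the data.");

        var raw = new byte[length];
        Buffer.BlockCopy(pdf, pos, raw, 0, length);
        return raw;
    }

    // Validates the two-byte zlib header, then inflates the deflate data after it.
    public static byte[] Inflate(byte[] zlibData)
    {
        // Compression method must be 8 (deflate) and the 16-bit header
        // value must be a multiple of 31.
        if (zlibData.Length < 2
            || (zlibData[0] & 0x0F) != 8
            || ((zlibData[0] << 8) | zlibData[1]) % 31 != 0)
            throw new InvalidDataException("Not a zlib stream; bytes may be misaligned or corrupted.");

        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }

    // Naive byte-sequence search; returns -1 when the needle is absent.
    private static int IndexOf(byte[] haystack, byte[] needle, int start)
    {
        for (int i = Math.Max(start, 0); i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

If ExtractRaw throws, or Inflate rejects the header, the offsets are off or the bytes were changed before they reached the decoder. For what it is worth, the first two characters of the question's dump, "H‰", correspond to 0x48 0x89 if the dump was rendered as Windows-1252, and that pair passes the header check above.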

score 0 · Accepted Answer

Okay, for anyone who might stumble across this issue themselves, allow me to warn you: this is a rocky road without many good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command line calls to mutool.exe:

    mutool extract C:\mypdf.pdf

This pulls all of the fonts into the folder mutool resides in. It also extracts some images; these are the fonts that could not be converted (usually small subsets, I think). I then wrote a method to move those from that folder into the one I wanted them in.
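
A rough sketch of that workflow in C#, not the code actually used. It assumes mutool is on the PATH, that mutool extract writes its output into the process's working directory, and that extracted font files have names starting with "font-"; check all three against your MuPDF version. MutoolSketch, RunExtract and MoveFonts are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class MutoolSketch
{
    // Runs "mutool extract <pdf>" with workDir as the working directory and
    // returns the process exit code. The PDF path is made absolute because
    // the working directory differs from the caller's.
    public static int RunExtract(string pdfPath, string workDir)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "mutool",
            Arguments = "extract \"" + Path.GetFullPath(pdfPath) + "\"",
            WorkingDirectory = workDir,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    // Moves files whose names start with "font-" from sourceDir to targetDir,
    // overwriting same-named files, and returns how many were moved.
    public static int MoveFonts(string sourceDir, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        int moved = 0;
        foreach (string path in Directory.GetFiles(sourceDir, "font-*"))
        {
            string dest = Path.Combine(targetDir, Path.GetFileName(path));
            if (File.Exists(dest)) File.Delete(dest);
            File.Move(path, dest);
            moved++;
        }
        return moved;
    }
}
```

If the working-directory assumption holds, passing the final destination as workDir makes the separate move step unnecessary; MoveFonts is kept for the case where the output has to be sorted afterwards.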

Of course, to convert these to anything usable is a headache in itself - but I have found it to be doable.

As a reminder, font piracy IS piracy.
