c# - Extracting FlateDecode images with iTextSharp

Question

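I want to extract images from a PDF. I'm currently using iTextSharp. Some images are extracted correctly, but most come out distorted and with the wrong colors. I experimented with different PixelFormats, but I haven't found a solution...

This is the code that separates the image types: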

if (filter == "/FlateDecode")
{
   // ...
   int w = int.Parse(width);
   int h = int.Parse(height);
   int bpp = tg.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;

   byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)tg);
   byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
   byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, tg.GetAsDict(PdfName.DECODEPARMS));

   PixelFormat[] pixFormats = new PixelFormat[23] { 
         PixelFormat.Format24bppRgb,
         // ... all Pixel Formats
    };
    for (int i = 0; i < pixFormats.Length; i++)
    {
        Program.ToPixelFormat(w, h, pixFormats[i], streamBytes, bpp, images);
    }
}

This is the code that saves the image to a MemoryStream. Saving the images to a folder will be implemented later.

private static void ToPixelFormat(int width, int height, PixelFormat pixelformat, byte[] bytes, int bpp, IList<Image> images)
{
    Bitmap bmp = new Bitmap(width, height, pixelformat);
    BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, width, height),
       ImageLockMode.WriteOnly, pixelformat);
    Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
    bmp.UnlockBits(bmd);
    using (var ms = new MemoryStream())
    {
       bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Tiff);
       bytes = ms.GetBuffer();
    }
    images.Add(bmp);
}

Please help me.

score 3 · Accepted Answer

Even though you have found a solution to your problem, I would still suggest fixing the code above.

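I believe the distortion is caused by a mismatch in row data boundaries. PdfReader returns image data with each row padded to a byte boundary. For example, for a grayscale image (8 bits per pixel) that is 21 pixels wide, you get 21 bytes of data per image row. The Bitmap class works with a 32-bit (4-byte) boundary: when you create a bitmap 21 pixels wide, it produces a grayscale bitmap whose stride (row width in bytes) is 24 bytes. This means you cannot simply copy the bytes retrieved from PdfReader into the new bitmap with a single Marshal.Copy() call, as ToPixelFormat() does.

The first pixel of the second row is the 22nd byte in the source array, but the destination bitmap expects it at the 25th byte because of the bitmap's 32-bit row boundary. To fix this, I had to create a byte array sized to account for the 32-bit boundary of every data row.

I then copied the data row by row from the byte array retrieved from PdfReader into the new byte array, taking the 32-bit row boundary into account. The resulting bytes match the Bitmap class's row boundaries, so they can be copied into the new bitmap with Marshal.Copy().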
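The row-by-row copy described in this answer might look roughly like the sketch below. It is not the answerer's actual code: the class and method names (`RowPadding`, `PaddedStride`, `PadRows`) are hypothetical, and it assumes the decoded stream holds rows padded to a byte boundary while the target bitmap pads rows to a 4-byte boundary.

```csharp
using System;

public static class RowPadding
{
    // Row size in bytes after rounding the row's bit count up to a multiple of 32 bits.
    public static int PaddedStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Copies byte-aligned source rows into a new buffer whose rows are 4-byte aligned.
    // The padding bytes at the end of each row are left as zero.
    public static byte[] PadRows(byte[] source, int width, int height, int bitsPerPixel)
    {
        int sourceStride = (width * bitsPerPixel + 7) / 8;   // byte-aligned row, as assumed for the PDF data
        int targetStride = PaddedStride(width, bitsPerPixel); // 4-byte-aligned row expected by the bitmap

        if (source.Length < sourceStride * height)
        {
            throw new ArgumentException("Source buffer is too small for the given dimensions.", nameof(source));
        }

        byte[] padded = new byte[targetStride * height];
        for (int row = 0; row < height; row++)
        {
            Buffer.BlockCopy(source, row * sourceStride, padded, row * targetStride, sourceStride);
        }
        return padded;
    }
}
```

In ToPixelFormat(), the padded array would then be passed to Marshal.Copy() in place of the raw bytes. When the bitmap is already locked, reading the row size from `bmd.Stride` is safer than recomputing it, since that is the value the bitmap actually uses.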

score 2 · Accepted Answer

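I found a solution to my own problem. To extract all images on all pages, there is no need to implement the different filters. iTextSharp has an image renderer that saves every image in its original image type.

Just do the following: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx You don't need to implement the HttpHandler ...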

score 1 · Accepted Answer

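PDF supports quite a few image formats. I don't think I would take the approach you have chosen here. You need to determine the image format from the bytes of the stream itself. For example, a JPEG starts with the bytes FF D8, and a JFIF file carries the ASCII marker JFIF shortly after, at byte offset 6.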
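As a rough illustration of determining the format from the stream bytes, the sketch below checks a few well-known file signatures. The class and method names (`ImageSniffer`, `GuessFormat`) are hypothetical and the signature list is deliberately short. It only helps for streams that embed a complete image file; a FlateDecode stream usually holds raw samples with no header, so it would come back as "unknown".

```csharp
using System;

public static class ImageSniffer
{
    private static readonly byte[] Jpeg = { 0xFF, 0xD8, 0xFF };
    private static readonly byte[] Png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
    private static readonly byte[] TiffLittleEndian = { 0x49, 0x49, 0x2A, 0x00 };
    private static readonly byte[] TiffBigEndian = { 0x4D, 0x4D, 0x00, 0x2A };

    // Returns a short format name based on the leading bytes, or "unknown".
    public static string GuessFormat(byte[] data)
    {
        if (StartsWith(data, Jpeg)) return "jpeg";
        if (StartsWith(data, Png)) return "png";
        if (StartsWith(data, TiffLittleEndian) || StartsWith(data, TiffBigEndian)) return "tiff";
        return "unknown";
    }

    // True when data begins with exactly the bytes in prefix.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller could run `ImageSniffer.GuessFormat(rawBytes)` on the stream bytes and fall back to the raw-pixel path only when the result is "unknown".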

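.NET (3.0+) does provide a method that tries to pick the correct decoder: BitmapDecoder.Create. See http://msdn.microsoft.com/en-us/library/system.windows.media.imaging.bitmapdecoder.aspx

If that doesn't work, you may want to consider a third-party imaging library. I've used ImageMagick.NET and LeadTools (overpriced).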
