java - Inserting a NULL character with PDFBox

Question

Consider this code:

import java.io.IOException;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

// The imports assume the PDFBox 1.8.x API (COSVisitorException, appendRawCommands).
public class Test1 {

    public static void CreatePdf(String src) throws IOException, COSVisitorException {
        PDRectangle rec = new PDRectangle(400, 400);
        PDDocument document = new PDDocument();
        PDPage page = new PDPage(rec);
        document.addPage(page);

        PDDocumentInformation info = document.getDocumentInformation();
        info.setAuthor("PdfBox");
        info.setCreator("Pdf");
        info.setSubject("Stéganographie");
        info.setTitle("Stéganographie dans les documents PDF");
        info.setKeywords("Stéganographie, pdf");

        PDPageContentStream content = new PDPageContentStream(document, page, true, false);
        PDFont font = PDType1Font.HELVETICA;

        String hex = "4C0061f";  // shows "La"
        // Notice that we have 00 between 4C and 61, where 00 = null character.
        // The loop reads two hex digits at a time, so the trailing "f" is ignored.

        StringBuilder sb = new StringBuilder();
        for (int count = 0; count < hex.length() - 1; count += 2) {
            String output = hex.substring(count, count + 2);
            int decimal = Integer.parseInt(output, 16);
            sb.append((char) decimal);
        }
        String tt = sb.toString();

        content.beginText();
        content.setFont(font, 12);
        content.appendRawCommands("15 385 Td\n");
        content.appendRawCommands("(" + tt + ")" + "Tj\n");
        content.endText();
        content.close();

        document.save("doc.pdf");
        document.close();
    }
}

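My question is: why is the "00" replaced by a space in the PDF document rather than by a null character? Note that I get a width of 0.0 for this null character, yet it is displayed as a space in the PDF document! So I get "L a" instead of "La".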

score 1 · Accepted Answer

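Why is the "00" replaced by a space in the PDF document rather than by a null character?

If you look into your PDF, you will see that the font used for the text is defined as: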

9 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
endobj

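So you are using a font with WinAnsiEncoding. If you look at the definition of that encoding in Annex D of the PDF specification, you will see that the codes below 32 (decimal) are not mapped to anything. What you are doing, therefore, is using a character that is undefined in the encoding at hand. The behavior is thus undefined; Acrobat Reader seems to use a positive width for those undefined code points.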
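
As a rough illustration of the point above, the following sketch decodes a hex string the same way the question's loop does and reports which positions hold codes below 32. It uses only the Java standard library; the class and method names are made up for this example, and the threshold of 32 is simply taken from the statement above about WinAnsiEncoding.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: decode a hex string into character codes and flag codes below 32. */
final class UndefinedCodeCheck {

    /** Decodes two hex digits at a time; an unpaired trailing digit is ignored, as in the question's loop. */
    static int[] decodeHexPairs(String hex) {
        int pairs = hex.length() / 2;
        int[] codes = new int[pairs];
        for (int i = 0; i < pairs; i++) {
            codes[i] = Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return codes;
    }

    /** Returns the positions whose code is below 32 (assumed unmapped in WinAnsiEncoding). */
    static List<Integer> positionsBelow32(int[] codes) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < codes.length; i++) {
            if (codes[i] < 32) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] codes = decodeHexPairs("4C0061f"); // 0x4C, 0x00, 0x61
        System.out.println(positionsBelow32(codes)); // prints [1]
    }
}
```

For the question's string, only the middle code (00) falls into the unmapped range, which is exactly the character whose rendering differs between viewers.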

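If you want to make sure that your hidden characters cause no displacement at all, you should add an explicit widths array to your font dictionary (cf. section 9.6.2 of the PDF specification) and make sure that your invisible characters have a width of 0. (By the way, you will also see there that not embedding a widths array, as PDFBox does, was deprecated years ago.)

Note that I get a width of 0.0 for this null character

Once you are in undefined territory, anything can happen, and different programs make different assumptions.

PS: Some code... Between your lines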

font= PDType1Font.HELVETICA;

and

String hex = "4C0061f";  // shows "La"

I added the following code:

InputStream afmStream = ResourceLoader.loadResource("org/apache/pdfbox/resources/afm/Helvetica.afm");
AFMParser afmParser = new AFMParser(afmStream);
afmParser.parse();
FontMetric afmMetrics = afmParser.getResult();
List<Float> newWidths = new ArrayList<Float>();
for (CharMetric charMetric : afmMetrics.getCharMetrics())
{
    if (charMetric.getCharacterCode() < 0)
        continue;
    while (charMetric.getCharacterCode() >= newWidths.size())
        newWidths.add(0f);
    newWidths.set(charMetric.getCharacterCode(), charMetric.getWx());
}
font.setFirstChar(0);
font.setLastChar(newWidths.size() - 1);
font.setWidths(newWidths);

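This code is meant to read the Helvetica.afm font metrics resource included in PDFBox and to create the FirstChar, LastChar, and Widths entries from it. It works fine here, but if it does not in your installation, simply extract the afm file from the PDFBox jar and use a FileInputStream.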
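
The array-filling logic of that loop can be tried without PDFBox. The sketch below is only an approximation of it: a plain map of character code to width stands in for the parsed AFM metrics, the class and method names are invented for the example, and the three widths are illustrative values rather than data read from Helvetica.afm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: build a dense widths list starting at code 0 from sparse metrics. */
final class WidthsArraySketch {

    /** Codes without an entry keep width 0; negative codes (unencoded glyphs) are skipped. */
    static List<Float> buildWidths(Map<Integer, Float> metrics) {
        List<Float> widths = new ArrayList<>();
        for (Map.Entry<Integer, Float> entry : metrics.entrySet()) {
            int code = entry.getKey();
            if (code < 0) {
                continue;
            }
            while (code >= widths.size()) {
                widths.add(0f); // pad with zero widths up to this code
            }
            widths.set(code, entry.getValue());
        }
        return widths;
    }

    public static void main(String[] args) {
        Map<Integer, Float> metrics = new LinkedHashMap<>();
        metrics.put(32, 278f);  // space (illustrative width)
        metrics.put(76, 556f);  // L (illustrative width)
        metrics.put(97, 556f);  // a (illustrative width)
        metrics.put(-1, 611f);  // unencoded glyph, ignored

        List<Float> widths = buildWidths(metrics);
        // FirstChar would be 0 and LastChar would be widths.size() - 1.
        System.out.println("LastChar = " + (widths.size() - 1)); // prints LastChar = 97
        System.out.println("width of code 0 = " + widths.get(0)); // prints width of code 0 = 0.0
    }
}
```

Because the list starts at code 0 and is padded with zeros, every code below 32 ends up with an explicit width of 0, which is the effect the answer's code aims for.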

For some reason the 00 character still seems to be treated as having some width, but other characters below 32 (decimal) can be used, e.g.

String hex = "4C0461f";

shows "La" without a gap. If I interpret your earlier (now deleted) question about 1C and 1D correctly, this should already help you carry on.
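
To see why a zero width removes the gap, here is a small sketch of how a viewer advances the text position for a simple font. It assumes character spacing and word spacing of 0 and 100% horizontal scaling, so each glyph advances by width / 1000 × font size. The class name, the illustrative widths for L and a, and the fallback width of 278 are assumptions made for this example, not values taken from any particular viewer.

```java
/** Sketch: horizontal advance of a string in a simple font, in text space units. */
final class TextAdvanceSketch {

    /** Sums width / 1000 * fontSize over all codes (no character or word spacing). */
    static double advance(int[] codes, double[] widths, double fontSize) {
        double total = 0;
        for (int code : codes) {
            total += widths[code] / 1000.0 * fontSize;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] codes = {0x4C, 0x04, 0x61}; // "L", hidden code 04, "a"

        double[] explicit = new double[256];
        explicit[0x4C] = 556; // illustrative width for L
        explicit[0x61] = 556; // illustrative width for a
        // explicit[0x04] stays 0: the hidden character causes no displacement

        double[] fallback = explicit.clone();
        fallback[0x04] = 278; // hypothetical width a viewer might assume instead

        System.out.println("explicit zero width: " + advance(codes, explicit, 12));
        System.out.println("viewer fallback:     " + advance(codes, fallback, 12));
    }
}
```

With these numbers the second line comes out roughly 3.3 units larger than the first, which at 12 pt is about the size of the visible gap between "L" and "a".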

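PPS: Concerning the question in the comments:

Can you tell me all the drawbacks of this approach? And why does this approach not work for accented characters, e.g. (Lé)? Your code only works for characters without accents, but when we have an accent we get L é instead of Lé .. I just want to know what the drawbacks of your code are :)

I cannot tell you all of them (as I'm really not that deep into font matters), but essentially the approach above is somewhat incomplete.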

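As stated at the beginning, you are using a font with WinAnsiEncoding, in which no code below 32 (decimal) is mapped to anything. By adding FirstChar, LastChar, and Widths entries, we try to define a zero width for the characters with codes below 32.

Still, we neither take care of the encoding information for those codes (the encoding is still plain WinAnsiEncoding) nor consider whether the font actually contains any information for those codes. Furthermore, to make matters even harder to control, we are talking about Helvetica, one of the standard 14 fonts for which PDF viewers have to bring along their own information anyway. Wherever the explicitly given information contradicts the information the viewer brings along, the viewer may well prefer its own.

Why do problems arise for accented characters in particular? I'm not sure. I would guess, though, that it has to do with the fact that fonts often do not carry accented characters as separate entities but instead combine the accent and the unaccented character. Perhaps the font used internally by the viewer has some information on such combining characters mapped to code points below 32. In that case, when your explicit use of codes below 32 and the font's implicit use of such codes occur side by side, the display becomes quirky.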

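Essentially, I would generally advise against doing such things. For ordinary PDF documents it is simply not necessary.

In your case, though, as you titled your document Stéganographie dans les documents PDF, you evidently do want to hide information in PDFs somehow. Using invisible, non-printable characters seems to be one way to do so, so you may experiment in that direction. But PDF does offer many more ways to put arbitrary amounts of information into a document without it being directly visible.

Thus, depending on your specific goals, I think other approaches may hide information more safely, e.g. private PieceInfo sections or custom tags in some other dictionary...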

score 0 · Accepted Answer

Final code:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.fontbox.afm.AFMParser;
import org.apache.fontbox.afm.CharMetric;
import org.apache.fontbox.afm.FontMetric;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.util.ResourceLoader;

// The imports assume the PDFBox 1.8.x API.
public class Test4 {

    public static final String src = "...";

    public static void CreatePdf(String src) throws IOException, COSVisitorException {
        PDRectangle rec = new PDRectangle(400, 400);
        PDDocument document = new PDDocument();
        PDPage page = new PDPage(rec);
        document.addPage(page);
        PDPageContentStream canvas = new PDPageContentStream(document, page, true, false);
        PDFont font = PDType1Font.HELVETICA;
        String hex = "4C1D61f"; // the trailing "f" is ignored by the loop below

        // Read the Helvetica metrics bundled with PDFBox.
        InputStream afmStream = ResourceLoader.loadResource("org/apache/pdfbox/resources/afm/Helvetica.afm");
        AFMParser afmParser = new AFMParser(afmStream);
        afmParser.parse();
        FontMetric afmMetrics = afmParser.getResult();

        // Build a widths list starting at code 0; codes without metrics keep width 0.
        List<Float> newWidths = new ArrayList<Float>();
        for (CharMetric charMetric : afmMetrics.getCharMetrics()) {
            if (charMetric.getCharacterCode() < 0)
                continue;
            while (charMetric.getCharacterCode() >= newWidths.size())
                newWidths.add(0f);
            newWidths.set(charMetric.getCharacterCode(), charMetric.getWx());
        }
        font.setFirstChar(0);
        font.setLastChar(newWidths.size() - 1);
        font.setWidths(newWidths);

        // Decode the hex string two digits at a time into raw character codes.
        StringBuilder sb = new StringBuilder();
        for (int count = 0; count < hex.length() - 1; count += 2) {
            String output = hex.substring(count, count + 2);
            int decimal = Integer.parseInt(output, 16);
            sb.append((char) decimal);
        }
        String tt = sb.toString();

        canvas.beginText();
        canvas.setFont(font, 12);
        canvas.appendRawCommands("15 385 Td\n");
        canvas.appendRawCommands("(" + tt + ")" + "Tj\n");
        canvas.endText();
        canvas.close();

        document.save("doc.pdf");
        document.close();
    }

    public static void main(String[] args) throws IOException, COSVisitorException {
        CreatePdf(src);
    }
}
