
This is an important question for me, and I would appreciate any help.

I used PDFBox to create a simple PDF document. What I am trying to do is read the existing document and then rewrite the same text into it, at the same positions.

1) First, I create a PDF named "Musique.pdf".

2) Read this existing document.

3) Extract the text from the document with PDFTextStripper.

3) Find the position of each character in the document (x, y, width, fs, etc.).

4) Create a table that contains the x and y of each character, for example table1[0]=x1, table1[1]=y1, table1[2]=x2, table1[3]=y2, etc.

5) Then create a loop of PDPageContentStream objects to rewrite each character at the correct position.

The problem is:

The first line is written completely, but the second line is not.

"I noticed that if, for example, the text consists of 3 lines and contains 225 characters, then the length reported for the text is 231. So it seems that 2 spaces are added at the end of each line, but when we look up the position of each character, the program does not take these added spaces into account."

Please run my code below and tell me how to resolve this problem.

My code so far:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package test;

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;


public class Test extends PDFTextStripper {
    private static final String src = "...";
    private static int i;
    private static float[] table1;
    private static PDPageContentStream content;
    private static float jjj;

    public Test() throws IOException {
        super.setSortByPosition(true);
    }


    public static void createPdf(String src) throws IOException, COSVisitorException {
        // create a document named "Musique.pdf"
        PDRectangle rec = new PDRectangle(400, 400);
        PDDocument document = null;
        document = new PDDocument();
        PDPage page = new PDPage(rec);
        document.addPage(page);
        PDFont font = PDType1Font.HELVETICA;
        PDPageContentStream canvas1 = new PDPageContentStream(document, page, true, true);
        canvas1.setFont(font, 10);
        canvas1.beginText();
        canvas1.appendRawCommands("15 385 Td");
        canvas1.appendRawCommands("(La musique est très importante dans notre vie moderne. Sans la musique, non)Tj\n");
        canvas1.endText();
        canvas1.close();
        PDPageContentStream canvas2 = new PDPageContentStream(document, page, true, true);
        canvas2.setFont(font, 11);
        canvas2.beginText();
        canvas2.appendRawCommands("15 370 Td");
        canvas2.appendRawCommands("(Donc il est très necessaire de jouer chaque jours la musique.)Tj\n");
        canvas2.endText();
        canvas2.close();
        document.save("Musique.pdf");
        document.close();
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException, COSVisitorException {
        Test tes = new Test();
        tes.createPdf(src);

        // read the existing document
        PDDocument doc;
        doc = PDDocument.load("Musique.pdf");
        List pages = doc.getDocumentCatalog().getAllPages();
        PDPage page = (PDPage) pages.get(0);
        // extract the text contained in the document
        PDFTextStripper stripper = new PDFTextStripper();
        String texte = stripper.getText(doc);
        PDStream contents = page.getContents();

        if (contents != null) {
            i = 1;
            table1 = new float[texte.length() * 2];
            table1[0] = (float) 15.0;
            // the call below invokes processTextPosition to find the position of
            // each character and store each value in a slot of table1
            tes.processStream(page, page.findResources(), page.getContents().getStream());

            // after processTextPosition has run, execution continues with the code below

            int iii = 0;
            int kkk = 0;
            // loop over PDPageContentStream objects to rewrite the whole text into the document
            // when you run this code, you will notice a problem with the second line; how can it be resolved?
            PDFont font = PDType1Font.HELVETICA;
            while (kkk < table1.length) {
                content = new PDPageContentStream(doc, page, true, true);
                content.setFont(font, 10);
                content.beginText();
                jjj = 400 - table1[kkk + 1];
                content.appendRawCommands("" + table1[kkk] + " " + jjj + " Td");
                content.appendRawCommands("(" + texte.charAt(iii) + ")" + " Tj\n");
                content.endText();
                content.close();
                iii = iii + 1;
                kkk = kkk + 2;
            }
        }
        // save the modified document
        doc.save("Modified-musique.pdf");
        doc.close();
    }

    /**
     * @param text The text to be processed
     */
    public void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());

        if (i > 1) {
            table1[i] = text.getXDirAdj();
            System.out.println(table1[i]);
            i = i + 1;
            table1[i] = text.getYDirAdj();
            System.out.println(table1[i]);
            i = i + 1;
        } else {
            table1[i] = text.getYDirAdj();
            System.out.println(table1[i]);
            i = i + 1;
        }
    }
}

Best Regards,

Liszt.


1 Answer


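Both your concept and your code are flawed.

First, the concept: two of your items are numbered 3:

3) Extract the text from the document with PDFTextStripper.

3) Find the position of each character in the document (x, y, width, fs, etc.).

In my opinion, separating these two steps is a bad idea, because in general you will have difficulty working out which character from the text extraction corresponds to which glyph in the content.

It is difficult in general because, for example, which of the "e" glyphs in the content corresponds to which "e" character in the text? Relying on the order of appearance in the content stream being the same as the order in the parsed text only works for very simple page contents.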
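To illustrate the ordering problem, here is a minimal sketch that uses no PDFBox classes at all. The Glyph class and the sample coordinates are hypothetical stand-ins for what a parser might report, with y growing downward. The same four glyphs read differently depending on whether they are taken in content stream order or sorted by position:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GlyphOrderDemo {
    /** Hypothetical stand-in for one glyph reported by a parser (y grows downward). */
    static final class Glyph {
        final char c;
        final float x;
        final float y;

        Glyph(char c, float x, float y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates the glyph characters in the order given. */
    static String text(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.c);
        }
        return sb.toString();
    }

    /** Returns a copy sorted top to bottom, then left to right. */
    static List<Glyph> sortedByPosition(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.<Glyph>comparingDouble(g -> g.y).thenComparingDouble(g -> g.x));
        return copy;
    }

    /** Made-up content stream in which the lower line happens to be drawn first. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('e', 15f, 30f));
        glyphs.add(new Glyph('t', 21f, 30f));
        glyphs.add(new Glyph('l', 15f, 15f));
        glyphs.add(new Glyph('e', 18f, 15f));
        return glyphs;
    }

    public static void main(String[] args) {
        List<Glyph> streamOrder = sample();
        System.out.println("stream order:   " + text(streamOrder));
        System.out.println("position order: " + text(sortedByPosition(streamOrder)));
    }
}
```

If one list is in stream order and the other in position order, pairing by index attaches the first "e" to the wrong glyph.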

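Then there are further issues with substitutions: for example, text extraction may well expand ligatures, giving you the two characters ff for a single ﬀ ligature glyph.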
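The effect of such an expansion on character indices can be sketched with the JDK alone. This uses java.text.Normalizer as an illustration only; it is an assumption that this resembles what a given PDFBox version does internally. The point is that one glyph becomes two characters, so every later index shifts by one:

```java
import java.text.Normalizer;

final class LigatureDemo {
    /** Expands compatibility characters such as the ff ligature (U+FB00) into plain letters. */
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Four glyphs as they might appear in the content: e, ff ligature, e, t
        String glyphs = "e\uFB00et";
        String extracted = expand(glyphs);
        System.out.println(glyphs.length() + " glyphs, " + extracted.length() + " extracted characters");
    }
}
```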

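Furthermore, there is the issue of switching back and forth between font encoding and String encoding, which can be quite lossy.

Moreover, text extraction may add whitespace characters to the text that are not present in the content. For example, it may add newline characters where it recognizes a jump in the y direction, or space characters where it recognizes a jump in the x direction.

This, by the way, is most likely the cause of your observation:

I noticed that if, for example, the text consists of 3 lines and contains 225 characters, then the length reported for the text is 231. So it seems that 2 spaces are added at the end of each line, but when we look up the position of each character, the program does not take these added spaces into account.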
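The numbers fit that reading: 225 + 3 × 2 = 231, which is what you would get if a two-character line separator (CR LF, the default on Windows) were appended after each of the 3 lines. That is an assumption about your environment, not something I have verified. Under that assumption, a minimal sketch of realigning the extracted text with the reported positions is to drop the line breaks before indexing. The class and method names are hypothetical, and this does not cover spaces the stripper may insert elsewhere:

```java
final class SeparatorAlignment {
    /** Removes CR and LF so that index k lines up with the k-th reported glyph position. */
    static String stripLineBreaks(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int k = 0; k < extracted.length(); k++) {
            char c = extracted.charAt(k);
            if (c != '\r' && c != '\n') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up extraction result: three short lines, each followed by CR LF
        String extracted = "ab\r\ncd\r\nef\r\n";
        String glyphChars = stripLineBreaks(extracted);
        System.out.println(extracted.length() + " extracted characters, "
                + glyphChars.length() + " of them with a position");
    }
}
```

Handling the character inside processTextPosition itself avoids the need for this kind of realignment.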

Furthermore, your code makes the PDF size explode:

5) Then create a loop of PDPageContentStream objects to rewrite each character at the correct position.

while(kkk<table1.length){
    content = new PDPageContentStream(doc,page,true,true);
    ...
}

I would suggest creating at least only one additional content stream...

How about something like this to start with:

// read the existing document
PDDocument doc;
doc = PDDocument.load(musiqueFileName);
List<?> pages = doc.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) pages.get(0);

PDPageContentStream content = new PDPageContentStream(doc, page, true, true);

TestRewriter rewriter = new TestRewriter(content);
rewriter.processStream(page, page.findResources(), page.getContents().getStream());

content.close();

// save the modified document
doc.save(modifiedMusiqueFileName);
doc.close();

Here TestRewriter is also a subclass of PDFTextStripper:

public static class TestRewriter extends PDFTextStripper
{
    final PDPageContentStream canvas;

    public TestRewriter(PDPageContentStream canvas) throws IOException
    {
        this.canvas = canvas;
    }

    /**
     * @param text
     *            The text to be processed
     */
    public void processTextPosition(TextPosition text)
    {
        try
        {
            PDFont font = PDType1Font.HELVETICA;
            canvas.setFont(font, 10);
            canvas.beginText();
            canvas.appendRawCommands("" + (text.getXDirAdj()) + " " + (400 - text.getYDirAdj()) + " Td");
            canvas.appendRawCommands("(" + text.getCharacter() + ")" + " Tj\n");
            canvas.endText();
        }
        catch(IOException e)
        {
            e.printStackTrace();
        }
    }
}

This is still far from perfect, but it may help you continue...
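One of the remaining imperfections: in a PDF literal string written as (…) Tj, a backslash or an unbalanced parenthesis has to be escaped with a backslash, so extracted text containing "(", ")" or "\" would break the raw command. A minimal sketch of an escaping helper follows. PdfLiteral and escape are hypothetical names, not part of PDFBox, and the helper does nothing about characters outside the font's encoding:

```java
final class PdfLiteral {
    /** Escapes backslashes and parentheses for use inside a PDF literal string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Builds the operand and operator for showing the text "a(b)"
        System.out.println("(" + escape("a(b)") + ") Tj");
    }
}
```

In the rewriter, the idea would be to pass text.getCharacter() through such a helper before appending it as a raw command.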

If you need the actual text parsed in parallel, integrate more of the PDFTextStripper method processTextPosition to combine the functionality.

answered 2013-08-27T09:29:34.880