I recently tried to use PDFBox to extract text from a PDF file. It works fine for most PDFs, but for one PDF (which unfortunately I am not permitted to share), none of the periods at the ends of sentences are extracted. Instead, I get phrases like the following:

...what it would be It’ll be important later on...

It looks as if each period-space has become just a space, but that is not what happened (at least on Mac OS X). If you copy the text into a text editor and move the text cursor through the phrase, there is an "empty character" right after the "t" in "feet". To reproduce:

  • Place the cursor right before the letter "t" in "feet" and press the right arrow key. The cursor moves one step to the right.
  • Press the right arrow key again. The cursor stays right where it is.
  • Press the right arrow key one more time. The cursor continues to the other side of the space.
  • Continuing to press the right arrow key behaves as expected.

It appears that PDFBox extracted some sort of "empty character" in place of a period. I've tried to replace it a few different ways but have had no luck:

String oldText = text;
text = text.replace('\u0000', '.'); // Unicode null
text = text.replace('\0', '.');     // C-style null (the same character as '\u0000' in Java)
System.out.println(oldText.equals(text)); // prints true, so nothing was replaced
// Also tried text.replace(null, '.'), but it doesn't compile

What is this "empty character" and how can I replace it with the text that is supposed to be there?

EDIT: This answer suggested that it might be a character such as \uFEFF, but replacing it with a regex as suggested there did not work.

1 Answer

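After realizing that the character was neither \uFEFF nor \u0000, the two Unicode values other Stack Overflow users had run into, I decided to run a test to find out what the code point actually was. Using the code from this answer to determine the Unicode value, I found that the mystery character is \u0008, i.e. "backspace". Why that gets extracted from the PDF, I don't know, but text = text.replace('\u0008', '.') now replaces it with the missing period.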
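The linked code is not reproduced here, so the following is only a minimal sketch of the same idea: print every code point of the extracted string so that an invisible character shows up as a number. The class name `CodePointDump`, both method names, and the sample string are hypothetical; only `String.codePoints()`, `String.format` and `String.replace(char, char)` are standard Java (8 or later).

```java
public class CodePointDump {

    /** Returns one "U+XXXX" token per code point, separated by spaces. */
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /**
     * Replaces every backspace (U+0008) with a period.
     * Assumption: in the affected PDF, U+0008 only ever stands in for '.'.
     */
    public static String restorePeriods(String text) {
        return text.replace('\u0008', '.');
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for text returned by PDFBox.
        String extracted = "be\b It";
        System.out.println(describe(extracted));
        System.out.println(restorePeriods(extracted));
    }
}
```

Dumping the few characters around the spot where the cursor "sticks" is usually enough to identify the culprit. If other control characters turn up, the same `replace` call can be repeated with the appropriate code point.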

answered 2013-03-27T01:43:49.817