java - An "Empty" Character Extracted from a PDF

Question

I recently tried to use PDFBox to extract text from a PDF file. It works fine for most PDFs, but for one PDF (which unfortunately I am not permitted to share), none of the periods at the ends of sentences are extracted. Instead, I get phrases like the following:

...what it would be It’ll be important later on...

It looks like instead of a period-space it is just a space, but it's not (at least on Mac OS X). If you copy the text into a text editor and start moving the text cursor through the phrase, there is an "empty character" right after the "t" in "feet". To reproduce:

1. Place the cursor right before the letter "t" in "feet" and press the right arrow key. The cursor moves one step to the right.
2. Press the right arrow key again. The cursor stays where it is.
3. Press the right arrow key one more time. The cursor moves to the other side of the space.
4. Continuing to press the right arrow key behaves as expected.

It appears that PDFBox extracted some sort of "empty character" in place of a period. I've tried to replace it a few different ways but have had no luck:

String oldText = text;
text = text.replace('\u0000', '.'); // Unicode null
text = text.replace('\0', '.'); // C-style null (the same character as \u0000)
System.out.println(oldText.equals(text)); // Prints true, so nothing was replaced
// Also tried text.replace(null, '.'), but it doesn't compile

What is this "empty character" and how can I replace it with the text that is supposed to be there?

EDIT: This answer suggested that the character might be a character such as \uFEFF, but trying to replace it with a regex as suggested did not work.

score 2 · Accepted Answer

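After realizing that the character was neither \uFEFF nor \u0000, the two Unicode values other Stack Overflow users had run into, I decided to run a test to find out what the code point actually was. Using the code from this answer to determine the Unicode value, I found that the mystery character is \u0008, i.e. "backspace". I don't know why it gets extracted from the PDF, but text = text.replace('\u0008', '.') now replaces it with the missing period.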
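
For anyone who needs to repeat the diagnosis, here is a minimal sketch of that kind of check. It is not the code from the linked answer: the class name ControlCharInspector, both method names, and the sample string are made up for illustration. It reports the code point and position of each control character in the extracted text, then swaps the backspace for a period.

```java
import java.util.ArrayList;
import java.util.List;

public class ControlCharInspector {

    // Reports every ISO control character in the text as "U+XXXX at index N".
    // Tab, line feed and carriage return are skipped because they are
    // expected in extracted text.
    public static List<String> findControlChars(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean ordinaryWhitespace = cp == '\t' || cp == '\n' || cp == '\r';
            if (Character.isISOControl(cp) && !ordinaryWhitespace) {
                found.add(String.format("U+%04X at index %d", cp, i));
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return found;
    }

    // Replaces each backspace (U+0008) with a period, as in the fix above.
    public static String restorePeriods(String text) {
        return text.replace('\b', '.');
    }

    public static void main(String[] args) {
        // Hypothetical sample: a backspace placed where a period is assumed missing.
        String sample = "what it would be\b It'll be important later on";
        System.out.println(findControlChars(sample));
        System.out.println(restorePeriods(sample));
    }
}
```

Replacing every backspace with a period only makes sense if, as in this PDF, the backspaces turn up exactly where periods are missing, so it is worth reading the report before applying the replacement.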
