I recently tried to use PDFBox to extract text from a PDF file. It works fine for most PDFs, but for one PDF (which unfortunately I am not permitted to share), none of the periods in the sentences are extracted. Instead, I get phrases like the following:
...what it would be It’ll be important later on...
It looks as if each period-space has simply become a space, but that is not what happened (at least on Mac OS X). If you copy the text into a text editor and start moving the text cursor through the phrase, there is an "empty character" right after the "t" in "feet". To reproduce:
- Place the cursor right before the letter "t" in "feet" and press the right arrow key. The cursor moves one step to the right.
- Press the right arrow key again. The cursor stays right where it is.
- Press the right arrow key one more time. The cursor continues to the other side of the space.
- Continuing to press the right arrow key behaves as expected.
It appears that PDFBox extracted some sort of "empty character" in place of a period. I've tried to replace it a few different ways but have had no luck:
String oldText = text;
text = text.replace('\u0000', '.'); // Unicode null
text = text.replace('\0', '.'); // C-style null (octal escape; the same char as above in Java)
System.out.println(oldText.equals(text)); // prints true, so nothing was replaced
// Also tried text.replace(null, '.'), but it doesn't compile
What is this "empty character" and how can I replace it with the text that is supposed to be there?
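One way to pin down what the character actually is would be to dump the code points of the extracted text rather than guessing at candidates. This is only a minimal sketch, assuming Java 8 or later. The class name `CodePointDump`, the helper `describe`, and the sample string (including the U+200B in it) are made up for illustration and are not what my PDF contains:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Hypothetical helper: returns one line per code point in the text,
    // formatted as "U+XXXX <Unicode name>" (name is "null" if unassigned).
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
            out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a short slice of the PDFBox output around the
        // missing period; the invisible character here is an assumption.
        String sample = "feet\u200B It";
        describe(sample).forEach(System.out::println);
    }
}
```

Running this on a short substring around the missing period should show the exact code point sitting where the period belongs, which can then be passed to `text.replace(...)`.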
EDIT: This answer suggested that the character might be a character such as \uFEFF, but trying to replace it with a regex as suggested did not work.