linux - Inaccurate pdf to text conversion

Question

I have tried almost every PDF-to-text converter available on Linux, but parts of the extracted text are corrupted or inaccurate. Some characters are replaced with others, some words that are present in the PDF are missing from the text, and some words in the converted text contain stray semicolons and similar characters.

I also tried aspell to correct the words, but aspell does not flag some of the corrupted words.

NOTE: The PDF contains Swedish text.

So, is there any way to fix this inaccuracy in PDF-to-text conversion?

score 1 · Accepted Answer

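No. I don't think there is a solution that works for all PDF files, because the actual text stored underneath the displayed glyphs can be encoded in a variety of ways.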

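For example, when LaTeX generates a PDF, how non-ASCII characters are embedded depends on several configuration options. Sometimes I get :o instead of ö, sometimes o:, and sometimes the character is embedded directly. Each of these variants is displayed as something that looks like ö.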
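If the damage is limited to accents that were split from their base letter, a post-processing pass over the extracted text can sometimes repair it. The following is only a minimal sketch under assumptions: it assumes the extractor emits the spacing diaeresis (U+00A8) or spacing ring (U+02DA) directly before or after the vowel, and `repair_accents` and `MARKS` are hypothetical names chosen for illustration. If your extractor emits a different marker, the table has to be adjusted, and a marker that also occurs as ordinary punctuation cannot be repaired reliably this way.

```python
import re
import unicodedata

# (spacing mark emitted by the extractor, combining equivalent, letters it may attach to)
# Assumption: only the Swedish letters å, ä, ö (plus ü for names) need repair.
MARKS = [
    ("\u00a8", "\u0308", "AOUaou"),  # diaeresis: ä, ö, ü
    ("\u02da", "\u030a", "Aa"),      # ring above: å
]


def repair_accents(text):
    """Re-attach stray spacing accents to the neighbouring vowel, then compose."""
    for spacing, combining, letters in MARKS:
        mark = re.escape(spacing)
        # Mark emitted before the vowel, e.g. "¨o".
        text = re.sub(mark + "([" + letters + "])",
                      lambda m: m.group(1) + combining, text)
        # Mark emitted after the vowel, e.g. "o¨".
        text = re.sub("([" + letters + "])" + mark,
                      lambda m: m.group(1) + combining, text)
    # NFC turns "o" + combining diaeresis into the single character "ö".
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(repair_accents("k\u00a8op"))   # mark before the vowel
    print(repair_accents("ko\u0308p"))   # already decomposed form
```

This is a heuristic: a mark that sits between two vowels is ambiguous, and the sketch simply attaches it to the following one. It also does nothing for words that are missing from the extracted text altogether.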

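You will probably see the same effect if you copy and paste the text from your favorite PDF viewer, or if you try to search for the corrupted words.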

To work around these problems you can use OCR software, with all the drawbacks that recognition by such tools entails.
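As a rough illustration of the OCR route, the sketch below renders each page to an image with `pdftoppm` (from poppler-utils) and passes it to `tesseract` with the Swedish language model. It assumes both tools and the `swe` language data are installed; the function names, the 300 dpi default and the `page` file prefix are illustrative choices, not something from the original answer.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    """Command that renders every page of the PDF to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def ocr_command(image_path, lang="swe"):
    """Command that OCRs one image; the output base "stdout" prints the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path, workdir, lang="swe"):
    """Run both steps and return the recognized text of all pages."""
    prefix = str(Path(workdir) / "page")
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)
    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(Path(workdir).glob("page-*.png")):
        result = subprocess.run(ocr_command(str(image), lang),
                                check=True, capture_output=True, text=True)
        pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_pdf("input.pdf", "/tmp/ocr")` would return the recognized text, provided the directory exists. OCR ignores the broken text layer entirely, so it avoids the encoding problem but introduces its own recognition errors.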
