python-3.x - Extracting data from a scanned PDF using Python

Question

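I am extracting data from a scanned PDF with Tesseract OCR. The extraction works, but the accuracy is poor: in many places the output is wrong. How can I get 100% accurate data with Python?

First I convert the PDF to JPG, then I use the pytesseract module to extract the text from the image.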

from PIL import Image
import pytesseract

text=(pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg")))
text=repr(text)
text=text.replace(r"\n","")
print(text)

I expect the correct data from the PDF, but I get different characters: for example, z is shown as 2, 5 as s, 1 as I, and so on.

score -1 · Accepted Answer

Try using "DPI=500" after the file path; it may help. For more information, see the answer I posted here: "How to convert a .png image to a searchable PDF/Word using Python".
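One possible reading of the DPI suggestion is to render the PDF pages at a higher resolution before OCR, instead of using a low-resolution JPG. The sketch below assumes the pdf2image package (which needs Poppler installed) and its `convert_from_path` function; the answer does not name a library, so that choice is an assumption. The function names `ocr_scanned_pdf` and `join_pages` and the path in the usage comment are hypothetical.

```python
def ocr_scanned_pdf(pdf_path, dpi=500, lang="eng"):
    """Render each PDF page at `dpi` and OCR it; returns one string per page."""
    # Imported inside the function so this sketch can be loaded even when
    # the OCR stack is not installed.
    # Assumption: pdf2image (with Poppler) and pytesseract are available.
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path returns a list of PIL images, one per page.
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def join_pages(page_texts):
    """Join per-page OCR text, separating pages with a blank line."""
    return "\n\n".join(text.strip() for text in page_texts)


# Hypothetical usage (the path is a placeholder):
# print(join_pages(ocr_scanned_pdf(r"C:\path\to\scan.pdf", dpi=500)))
```

A higher DPI gives Tesseract more pixels per character, which often helps with small print, but it also makes rendering and OCR slower.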

score -1 · Accepted Answer

I hope the small change below helps.

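from PIL import Image
import pytesseract

# Pass the language explicitly; image_to_string already returns a str.
text = pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg"), lang='eng')

# Strip line breaks (use " " instead of "" to keep words on adjacent lines apart).
text = text.replace("\n", "")

print(text)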
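If the language hint alone is not enough, cleaning the image before OCR and repairing look-alike characters afterwards may reduce errors such as the z/2, s/5 and I/1 confusions from the question. The following is a minimal sketch, not a guaranteed fix: the threshold of 150, the `--psm 6` page segmentation mode, and the helper names `preprocess`, `ocr_image` and `fix_numeric_tokens` are all assumptions chosen for illustration. The repair step only touches tokens that already contain a digit, so ordinary words are left alone.

```python
import re

# Look-alike letters reported in the question, mapped to the digits they
# probably stand for inside numbers (an assumption about the documents).
DIGIT_LOOKALIKES = str.maketrans({"z": "2", "Z": "2", "s": "5", "S": "5", "I": "1"})

# A token made only of digits and look-alike letters, with at least one digit.
NUMERIC_TOKEN = re.compile(r"\b(?=[zZsSI]*\d)[\dzZsSI]+\b")


def preprocess(image, threshold=150):
    """Convert a PIL image to grayscale, then to pure black and white."""
    from PIL import ImageOps  # imported lazily; requires Pillow

    gray = ImageOps.grayscale(image)
    return gray.point(lambda p: 255 if p > threshold else 0)


def ocr_image(path, lang="eng"):
    """OCR one image file after preprocessing (requires pytesseract)."""
    from PIL import Image
    import pytesseract

    image = preprocess(Image.open(path))
    # --psm 6 tells Tesseract to assume a single uniform block of text.
    return pytesseract.image_to_string(image, lang=lang, config="--psm 6")


def fix_numeric_tokens(text):
    """Replace look-alike letters with digits, only inside numeric tokens."""
    return NUMERIC_TOKEN.sub(lambda m: m.group(0).translate(DIGIT_LOOKALIKES), text)


# Hypothetical usage (the path is a placeholder):
# print(fix_numeric_tokens(ocr_image(r"C:\path\to\page.jpg")))
```

Rules like these depend on the document: if the scans contain genuine codes that mix letters and digits, the mapping would need to be restricted to known numeric fields.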
