python - How do I extract text from a PDF file in Python?

Question

How do I extract text from a PDF file in Python?

I tried the following:

import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + " \n"
        content = " ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

f = open('a.txt','w+')

f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
f.close()

But instead of readable text, the output looks like this:

728;ˇ^~ ˚ˇˇ!""˘ˇ^˙^˝˛˛˛˛^~^^ ^˘^˛˙^"^˘"^^^#$˙^^^ %&^ ˘˛^~'˙˙% * _ _ ˝+,-3˙^/0245)6#57+82,55)6#57+,+2,+ /!#!!&˘˘1"%˘20˛˛3^07%4!˘"6 ˛ ^ ˝^ ^˘&/&4"9^ %6ˇ%4%4&5˘2)˘˘˛%:6(

score 22 · Accepted Answer

If you are on Linux or macOS, you can call the ps2ascii command from your code:

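import os

# Avoid shadowing the built-in input() with a variable name.
input_pdf = "someFile.pdf"
output_txt = "out.txt"
# ps2ascii ships with Ghostscript; the quotes keep paths with spaces intact.
os.system('ps2ascii "%s" "%s"' % (input_pdf, output_txt))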
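
If you want the text back as a Python string, as convertPdf2String in the question does, a rough Python 3 sketch could capture the command's output instead of writing a file. This assumes ps2ascii is on your PATH and relies on it printing to stdout when only an input file is given. The names pdf_to_string and normalize_whitespace are made up for this illustration and are not part of any library.

```python
import shutil
import subprocess


def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs,
    mirroring the cleanup step in the question's code."""
    return " ".join(text.replace("\xa0", " ").split())


def pdf_to_string(path):
    """Return the text of a PDF by running ps2ascii (hypothetical helper).

    Assumes Ghostscript's ps2ascii is installed; raises if it is missing.
    """
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found; it ships with Ghostscript")
    # Passing a list avoids the shell, so paths with spaces are safe.
    # With only an input file given, ps2ascii writes the text to stdout.
    result = subprocess.run(
        ["ps2ascii", path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return normalize_whitespace(result.stdout)
```

Calling pdf_to_string("someFile.pdf") should then return the extracted text. How readable it is still depends on how the PDF encodes its fonts, so scanned or oddly encoded files may need a different tool.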
