python - How can I convert many separate PDFs to text?

Question

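Question: How can I use the Python package "slate" to read multiple PDFs in the same path?

I have a folder containing more than 600 PDFs.

I know how to convert a single PDF to text with the slate package, using the following code: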

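import os
import re
import slate

migFiles = [filename for filename in os.listdir(path)
            if re.search(r'(.*\.pdf$)', filename) is not None]
# os.listdir returns bare filenames, so join with `path`; PDFs are binary ('rb')
with open(os.path.join(path, migFiles[0]), 'rb') as f:
    doc = slate.PDF(f)

len(doc)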

However, this limits you to one PDF at a time, the one specified by "migFiles[0]", where 0 is the first PDF in my path.

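How can I read multiple PDFs as text at once and keep them as separate strings or txt files? Should I use a different package? How do I create a "for loop" that reads every PDF in the path?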

score 0 · Accepted Answer

What you can do is use a simple loop:

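docs = []
for filename in migFiles:
    # migFiles holds bare filenames, so join with `path`; PDFs are binary ('rb')
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(slate.PDF(f))  # slate.PDF, as in the question
        # or, instead of keeping the file in memory, just process it now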

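Then docs[i] holds the text of the (i+1)-th PDF file, and you can do whatever you like with the files at any time. Alternatively, you can process each file inside the for loop.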
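Because docs is filled in the same order as migFiles, the two lists line up index by index. A minimal sketch of pairing them, using made-up stand-in data (the filenames and page strings below are hypothetical, not real slate output):

```python
# Hypothetical stand-ins for the lists built by the loop above.
migFiles = ["a.pdf", "b.pdf", "c.pdf"]
docs = [["a page 1", "a page 2"], ["b page 1"], ["c page 1"]]

# docs[i] belongs to migFiles[i], so zip() pairs each name with its pages.
pages_by_name = dict(zip(migFiles, docs))

for name, pages in zip(migFiles, docs):
    print("{}: {} page(s)".format(name, len(pages)))
```

A dictionary keyed by filename may be easier to work with than remembering which index belongs to which file.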

If you want to convert each PDF to plain text, you can do the following:

docs = []
# The string used to separate the contents of consecutive pages;
# to put each page on its own line, use separator = '\n'
separator = ' '
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(separator.join(slate.PDF(f)))  # join the pages into plain text

Or:

separator = ' '
for filename in migFiles:
    pdf_path = os.path.join(path, filename)
    with open(pdf_path, 'rb') as f:
        # if pdf_path ends in "abc.pdf", pdf_path[:-4] ends in "abc"
        with open(pdf_path[:-4] + ".txt", 'w') as txtfile:
            txtfile.write(separator.join(slate.PDF(f)))
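The [:-4] slice above assumes the extension is exactly four characters long (".pdf"). A slightly more robust sketch uses os.path.splitext from the standard library; the helper name txt_path_for is made up for illustration:

```python
import os


def txt_path_for(pdf_path):
    """Return pdf_path with its extension swapped for '.txt'."""
    # splitext splits at the last dot, whatever the extension's length or case.
    stem, _extension = os.path.splitext(pdf_path)
    return stem + ".txt"


print(txt_path_for("abc.pdf"))
print(txt_path_for("report.v2.PDF"))
```

This keeps working if a filename has an upper-case extension or extra dots in its name.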

score 0 · Accepted Answer

Try this version:

import glob
import os

import slate

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:  # PDFs are binary, so use 'rb'
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF gives a list of page strings; join them before writing
            txt.write("\n".join(slate.PDF(pdf)))

For each PDF, this creates a text file with the same name in the same directory, containing the converted content.
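The same idea can be wrapped in a function so it is easy to try without slate installed. This is only a sketch: convert_folder and its extract_pages parameter are hypothetical names, and extract_pages is assumed to be any callable that takes an open binary file and returns a list of page strings.

```python
import glob
import os


def convert_folder(folder, extract_pages, separator="\n"):
    """Write one .txt file next to every .pdf in `folder`; return the .txt paths.

    `extract_pages` is a hypothetical callable: it receives an open binary
    file and must return a list of page strings.
    """
    written = []
    for pdf_file in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        txt_file = os.path.splitext(pdf_file)[0] + ".txt"
        with open(pdf_file, "rb") as pdf, open(txt_file, "w") as txt:
            txt.write(separator.join(extract_pages(pdf)))
        written.append(txt_file)
    return written
```

Assuming slate.PDF accepts an open file and behaves like a list of page strings, as in the question, a call such as convert_folder(path, slate.PDF) should convert the whole folder.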

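Or, if you want to keep the contents in memory, try this version; but keep in mind that you may run out of available memory if the converted content is large: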

import glob
import os

import slate

pdf_as_text = {}

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:  # PDFs are binary, so use 'rb'
        # the key keeps the directory part, e.g. "<path>/somefile"
        file_without_extension = os.path.splitext(pdf_file)[0]
        pdf_as_text[file_without_extension] = slate.PDF(pdf)

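Now you can get the content with pdf_as_text[key], where the key is the PDF's path without its extension (for example, os.path.join(path, 'somefile')).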
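If memory is a concern, one option is a generator that hands back one file's text at a time instead of building a dictionary of everything. The sketch below is an assumption-laden illustration: iter_pdf_texts and extract_pages are hypothetical names, and extract_pages stands for any callable that takes an open binary file and returns a list of page strings.

```python
import glob
import os


def iter_pdf_texts(folder, extract_pages, separator="\n"):
    """Yield (name_without_extension, text) for each PDF, one file at a time."""
    for pdf_file in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        # Use only the base name, so the key is "somefile", not "<folder>/somefile".
        name = os.path.splitext(os.path.basename(pdf_file))[0]
        with open(pdf_file, "rb") as pdf:
            text = separator.join(extract_pages(pdf))
        yield name, text
```

A loop such as "for name, text in iter_pdf_texts(path, slate.PDF):" would then hold only the current file's text, provided nothing else keeps references to earlier results.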
