python - PDF scraping: how to automatically create a txt file for each PDF scraped in Python?

Question

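Here is what I want to do: a program that takes a list of pdf files as its input and returns one .txt file for each file in the list.

For example, given listA = ["file1.pdf", "file2.pdf", "file3.pdf"], I want Python to create three txt files (one per pdf file), say "file1.txt", "file2.txt" and "file3.txt".

Thanks to this guy, the conversion part works fine. The only change I made was in the maxpages statement, where I assigned 1 instead of 0 so that only the first page is extracted. As I said, this part of my code works well. Here is the code.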

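# Imports were not shown in the original post; these are the pdfminer
# module paths assumed here (Python 3 with pdfminer.six).
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    #maxpages = 0
    maxpages = 1  # extract only the first page
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text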

The problem is that I can't get Python to return what I described in the second paragraph. I tried the following code:

def save(lst):
    i = 0

    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lst[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1

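I ran this save function with a list containing two pdf files as input, but it produced only one txt file and then kept running for several minutes without producing the second txt file. What is a better way to achieve my goal?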

score 1 · Accepted Answer

You never update i, so your code gets stuck in an infinite loop; you need i += 1:

def save(lst):
    i = 0   # set to 0 but never changes
    while i < len(lst):
        txtfile = "enegep"+str(i)+".txt" #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        i += 1 # you need to increment i inside the while loop

A better option is simply to use range:

def save(lst):
    for i in range(len(lst)): 
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(lista[0])
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)

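You also only ever use lista[0], so every output file gets the text of the first pdf. You probably want to change that code so it moves through the list on each iteration.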
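
Moving through the list could look like the following minimal sketch. It assumes convert is any function that takes a pdf path and returns its text (for example the question's convert_pdf_to_txt); the names save_indexed, convert and prefix are hypothetical and do not appear in the original post.

```python
def save_indexed(lst, convert, prefix="enegep"):
    """Write one text file per entry of lst, reading lst[i] on each pass."""
    written = []
    for i in range(len(lst)):
        txtfile = "{}{}.txt".format(prefix, i)
        artigo = convert(lst[i])  # lst[i] instead of lst[0]
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
        written.append(txtfile)
    return written
```

Returning the list of written file names is only a convenience for checking the result; it is not needed for the files to be created.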

If lst is actually lista, you can use enumerate:

def save(lst):
    for i, ele in enumerate(lst):
        txtfile = "enegep{}.txt".format(i) #enegep is like the identifier of the files
        artigo = convert_pdf_to_txt(ele)
        with open(txtfile, "w") as textfile:
            textfile.write(artigo)
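
The question also asks for output names that follow the input names ("file1.pdf" to "file1.txt"). A minimal sketch of that variant is below, under the same assumption that convert is a function mapping a pdf path to its text; save_named, convert and out_dir are hypothetical names introduced for illustration, and the encoding argument assumes Python 3.

```python
import os

def save_named(pdf_paths, convert, out_dir="."):
    """Write <name>.txt for every <name>.pdf and return the written paths."""
    written = []
    for pdf_path in pdf_paths:
        # "some/dir/file1.pdf" -> "file1"
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        txt_path = os.path.join(out_dir, base + ".txt")
        with open(txt_path, "w", encoding="utf-8") as textfile:
            textfile.write(convert(pdf_path))
        written.append(txt_path)
    return written
```

With the question's function this might be called as save_named(listA, convert_pdf_to_txt). Note that two pdfs with the same base name in different folders would overwrite each other's output.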
