To convert a PDF to text, I am using the following command:

pdf2txt.py -o text.txt example.pdf # It will convert example.pdf to text.txt

But I have more than 1,000 PDF files that I need to convert to text files before I can do the analysis.

Is there a way to use this command to iterate over the PDF files and convert all of them?

2 Answers

I suggest using a shell script (zsh syntax):

for f (*.pdf) {pdf2txt.py -o $f.txt $f}

Then use Python to read all the .txt files for the analysis.

To do the conversion using only Python:
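As a minimal sketch of that reading step, the converted files can be loaded into a dictionary keyed by file name. The helper name `load_texts` is hypothetical, and the sketch assumes the .txt files are UTF-8 encoded and sit in a single directory; undecodable bytes are replaced rather than raising an error.

```python
from pathlib import Path


def load_texts(directory="."):
    """Map each .txt file's name (without the final extension) to its contents."""
    texts = {}
    for txt_path in sorted(Path(directory).glob("*.txt")):
        # Assumption: the converted files are UTF-8; bad bytes are replaced.
        texts[txt_path.stem] = txt_path.read_text(encoding="utf-8", errors="replace")
    return texts
```

Calling `load_texts(".")` after the conversion would give one string per document, ready for whatever analysis follows.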

from subprocess import call
import glob

for pdf_file in glob.glob('*.pdf'):
    # pdf2txt.py takes the output file after -o and the input PDF last
    call(["pdf2txt.py", "-o", pdf_file[:-3] + "txt", pdf_file])
answered 2015-06-03T15:51:21.683

The Python code above failed on my Windows 10 system (OSError: [WinError 193] %1 is not a valid Win32 application); the for loop should be:

for pdf_file in glob.glob('*.pdf'):
    call(['python.exe','pdf2txt.py','-o',pdf_file[:-3]+'txt',pdf_file])

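Note the order of the file arguments: pdf2txt.py expects the output file after -o and the input PDF last. If you pass the PDF as the -o argument instead, your PDF files will be overwritten with empty files.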
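Both points (invoking the interpreter explicitly and keeping the output after -o) can be combined in a small sketch. The function names `build_command` and `convert_all` are hypothetical, as is the choice to skip PDFs that already have a .txt file next to them. It assumes `script` is a path to pdf2txt.py that the interpreter can open from the current directory, and it uses `sys.executable` in place of a hard-coded `python.exe`.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path, script="pdf2txt.py"):
    """Return the argument list for one conversion: output after -o, input PDF last."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # sys.executable is the running interpreter, so the OS never has to
    # execute the .py file directly (the WinError 193 situation).
    return [sys.executable, script, "-o", str(txt_path), str(pdf_path)]


def convert_all(directory=".", script="pdf2txt.py", runner=subprocess.call):
    """Convert every *.pdf in directory; return the PDFs whose conversion failed."""
    failed = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        if pdf_path.with_suffix(".txt").exists():
            continue  # assumption: an existing .txt means it was already converted
        if runner(build_command(pdf_path, script)) != 0:
            failed.append(pdf_path)
    return failed

# Example call (not executed here): failed = convert_all(".")
```

Returning the list of failures rather than stopping at the first error lets a run over 1,000 files finish and be retried only for the PDFs that did not convert.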

Still, thanks to Gurupad Hegde for showing me how to convert the files; it helped a lot!

answered 2016-08-25T16:15:42.887