python - How to read a pdf file line by line in Python/Django?

Question

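I'm working with text and pdf files that are 5KB or smaller. If the file is a text file, I take the file from the form and build the string I need as input for summarization: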

 file = file.readlines()
 file = ''.join(file)
 result = summarize(file, num_sentences)

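That was easy to do, but for pdf files it turns out not to be so easy. Is there a way to get the sentences of a pdf file as a string, the same way I do with my txt files, in Python/Django?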

score 3 · Accepted Answer

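I don't think it's possible to read a pdf the way you read a txt file. You need to convert the pdf to a txt file first (see "Python module for converting PDF to text") and then process it. You can also refer to this recipe to convert pdf to txt easily: http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/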

score 0 · Accepted Answer

In Django, you can do it like this:

views.py:

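import io

import PyPDF2
from django.shortcuts import render

def upload_pdf(request):
    # .get() avoids a KeyError when the form is submitted without a file
    if request.method == 'POST' and request.FILES.get('myfile'):
        pdfFileObj = request.FILES['myfile'].read()
        # PdfFileReader/numPages/getPage/extractText are the PyPDF2 1.x/2.x names;
        # PyPDF2 3.x renamed them to PdfReader, len(reader.pages), reader.pages[i], extract_text()
        pdfReader = PyPDF2.PdfFileReader(io.BytesIO(pdfFileObj))
        content = []  # one string of extracted text per page
        for i in range(pdfReader.numPages):
            content.append(pdfReader.getPage(i).extractText())
        # depends on what you want to do with the pdf parsing results
        return render(request, .....)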

HTML part:

<form method="post" enctype="multipart/form-data" action="/url">
    {% csrf_token %}
      <input  type="file" name="myfile"> <!-- the name is the same as the one you put in FILES['myfile'] -->
    <button class="butto" type="submit">Upload</button>
</form>
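The view above treats every upload as a PDF, while the question handles both txt and pdf files. A minimal sketch of telling the two apart by the `%PDF-` header that PDF files start with; `is_pdf`, `upload_to_text` and the `extract_pdf_text` callable are hypothetical names, not part of Django or PyPDF2, and UTF-8 is an assumed encoding for the text files:

```python
PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def is_pdf(data):
    """Return True if the raw upload bytes look like a PDF."""
    return data.startswith(PDF_MAGIC)


def upload_to_text(data, extract_pdf_text, encoding="utf-8"):
    """Return the upload's contents as one string.

    extract_pdf_text is a hypothetical callable (bytes -> str), for example
    a small wrapper around the PyPDF2 page loop shown in the view.
    """
    if is_pdf(data):
        return extract_pdf_text(data)
    # Plain text upload: decode the bytes (UTF-8 is an assumption here)
    return data.decode(encoding, errors="replace")
```

In the view this would be called as `upload_to_text(request.FILES['myfile'].read(), my_extractor)`, where `my_extractor` is whatever function wraps the PDF parsing.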

In plain Python, you can simply do the following:

import PyPDF2

fileName = "path/test.pdf"
content = []  # one string of extracted text per page
# the with block closes the file once all pages have been read
with open(fileName, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # PdfReader in PyPDF2 3.x
    for i in range(pdfReader.numPages):
        content.append(pdfReader.getPage(i).extractText())
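The `content` list ends up holding one string per page. To get from there to the single string that the question passes to `summarize`, or to walk the text line by line, something like the following could work. It is only a sketch: `pages_to_text` and `iter_lines` are hypothetical helper names, the sample `content` is a stand-in for real extracted text, and where the line breaks fall in real output depends on how the PDF was produced.

```python
def pages_to_text(pages):
    """Join per-page strings into one string, skipping empty pages."""
    return "\n".join(page.strip() for page in pages if page and page.strip())


def iter_lines(pages):
    """Yield non-blank lines across all pages, in page order."""
    for page in pages:
        for line in (page or "").splitlines():
            line = line.strip()
            if line:
                yield line


# Stand-in for the `content` list built by the page loop
content = [
    "First sentence. Second sentence.\nThird sentence.\n",
    "",
    "  Last page line.  ",
]
text = pages_to_text(content)      # one string, like ''.join(file.readlines())
lines = list(iter_lines(content))  # individual non-blank lines
```

`text` can then go straight into `summarize(text, num_sentences)` just as in the txt case.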
