python - How do I unlock a "secured" (read-protected) PDF in Python?

Question

In Python, I am using pdfminer to read the text from a PDF, with the code at the bottom of this post. I now get the following error message:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

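When I open this PDF in Acrobat Pro, it turns out to be secured (or "read-protected"). From this link, however, I learned that there are many services that can easily disable this read protection (for example pdfunlock.com). Digging into the pdfminer source code, I see that the error above is raised on these lines: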

if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

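Since there are many services that can disable this read protection within a second, I assume it is easy to do. It looks like .is_extractable is a simple attribute of doc, but I doubt it is as simple as changing .is_extractable to True.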

Does anyone know how to disable the read protection on a PDF using Python? All tips are welcome!

=================================================

Below is the code I currently use to extract text from PDFs that are not read-protected.

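# Python 2 code, matching the traceback above (cStringIO, python2.7).
from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(rawFile):
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    # Wrap the raw PDF data in a file-like object for pdfminer.
    fileData = StringIO()
    fileData.write(rawFile)
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()

    outfp.close()
    return result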

score 45 · Accepted Answer

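I had some trouble getting qpdf to work from within my program. I found a useful library, pikepdf, which is based on qpdf and automatically converts PDFs so that they are extractable.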

The code to use it is very simple:

import pikepdf

pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')
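
The same two calls can also be applied to in-memory data, which matches the question's setup where the PDF arrives as raw bytes. The following is a minimal Python 3 sketch, not the answerer's code: the function name `unlock_pdf_bytes` and its `password` parameter are illustrative, and it assumes the file opens with the given (by default empty) user password.

```python
import io


def unlock_pdf_bytes(raw, password=""):
    """Return a re-saved, unencrypted copy of the PDF held in `raw`.

    Hypothetical helper for illustration; requires pikepdf to be installed.
    """
    # Imported here so the module still loads where pikepdf is missing.
    import pikepdf

    out = io.BytesIO()
    # pikepdf.open accepts a file path or a binary stream.
    with pikepdf.open(io.BytesIO(raw), password=password) as pdf:
        # Saving without an `encryption` argument writes the file unencrypted.
        pdf.save(out)
    return out.getvalue()
```

The returned bytes can then be handed to the extraction function in place of the original data.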

score 27 · Accepted Answer

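As far as I know, in most cases the entire content of the PDF is actually encrypted, with the password used as the encryption key, so simply setting .is_extractable to True will not help you.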

According to this thread:

Does a library exist to remove passwords from PDFs programmatically?

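I would recommend removing the read protection with a command-line tool such as qpdf (easy to install, e.g. on Ubuntu with apt-get install qpdf, if you don't have it already):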

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf

Then open the unlocked file with pdfminer and do your thing.
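
If you would rather drive the command above from Python than from a shell, a small wrapper along these lines may help. It is a sketch for Python 3 that assumes qpdf is installed and on PATH; the function names `build_qpdf_command` and `decrypt_with_qpdf` are made up for this example.

```python
import shutil
import subprocess


def build_qpdf_command(src, dst, password=""):
    """Build the argument list for the qpdf invocation shown above."""
    return ["qpdf", "--password=" + password, "--decrypt", src, dst]


def decrypt_with_qpdf(src, dst, password=""):
    """Write a decrypted copy of `src` to `dst` by running qpdf."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf was not found on PATH")
    # An argument list (no shell) keeps odd characters in the password literal.
    result = subprocess.run(build_qpdf_command(src, dst, password))
    # qpdf exits with 0 on success and 3 when it succeeded with warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit code %d" % result.returncode)
```

For example, decrypt_with_qpdf('SECURED.pdf', 'UNSECURED.pdf', 'PASSWORD') runs the same command as the line above.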

For a pure Python solution, you can try PyPDF2 and its .decrypt() method, but it does not work with all types of encryption, so really you are better off just using qpdf. See:

https://github.com/mstamy2/PyPDF2/issues/53

score 5 · Accepted Answer

I used pikepdf with the code below and was able to overwrite the original file.

import pikepdf

pdf = pikepdf.open('filepath', allow_overwriting_input=True)
pdf.save('filepath')

score 2 · Accepted Answer

In my case there was no password, and simply setting check_extractable=False got around the PDFTextExtractionNotAllowed exception for the problematic file (which opened fine in other viewers).

score 1 · Accepted Answer

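The check_extractable=True argument is by design. Some PDFs explicitly disallow text extraction, and PDFMiner follows that directive. You can override it (by passing check_extractable=False), but you do so at your own risk.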
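
For reference, here is a minimal sketch of the question's function with the check turned off. It is ported to Python 3 and pdfminer.six, which is an assumption since the question uses Python 2, and the name `get_text_from_pdf` is illustrative. It only skips the permission check, so it is meant for files that open without a password.

```python
from io import BytesIO, StringIO


def get_text_from_pdf(raw_bytes):
    """Extract text from PDF bytes without enforcing the extractable flag."""
    # Imported here so the module still loads where pdfminer.six is missing.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager(caching=True)
    out = StringIO()
    device = TextConverter(resource_manager, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    # check_extractable=False skips the check that raises
    # PDFTextExtractionNotAllowed in the traceback from the question.
    for page in PDFPage.get_pages(BytesIO(raw_bytes), check_extractable=False):
        interpreter.process_page(page)
    device.close()
    return out.getvalue()
```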

score 1 · Accepted Answer

If you want to unlock all PDF files in a folder without renaming them, you can use this code:

import glob
import os

import pikepdf

p = os.getcwd()
for file in glob.glob('*.pdf'):
    file_path = os.path.join(p, file).replace('\\', '/')
    init_pdf = pikepdf.open(file_path)
    # Copy the pages into a fresh, empty PDF and save it under the original name.
    new_pdf = pikepdf.new()
    new_pdf.pages.extend(init_pdf.pages)
    new_pdf.save(str(file))

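In the pikepdf library you cannot, by default, overwrite the file you opened by saving under the same name. Instead, you copy the pages into a newly created empty PDF and save that.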

score 1 · Accepted Answer

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

This issue was fixed in 2020 by disabling check_extractable by default. It now shows a warning instead of raising an error.

A similar question and answer can be found here.

score 0 · Accepted Answer

I ran into the same problem parsing secured PDFs, and using the pikepdf library resolved it. I tried the library in a Jupyter notebook on Windows, where it gave an error, but it ran smoothly on Ubuntu.
