python - Python、pyPdf、Adobe PDF OCR 错误：不支持的过滤器/lzwdecode

Question

我的东西：python 2.6 64 位（安装了 pyPdf-1.13.win32.exe）。翼IDE。视窗 7 64 位。

我收到以下错误：

NotImplementedError：不支持的过滤器/LZWDecode

当我运行以下代码时：

from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

path = 'C:\\Users\\Homer\\Documents\\' # This is where I put my pdfs

filelist = os.listdir(path)

has_text_list = []
does_not_have_text_list = []

for pdf_name in filelist:
    pdf_file_with_directory = os.path.join(path, pdf_name)
    pdf = pyPdf.PdfFileReader(open(pdf_file_with_directory, 'rb'))

    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText() #this is the line what done it
        does_it_have_text = re.findall(r'\w{2,}', content) 
        if does_it_have_text == []:
            does_not_have_text_list.append(pdf_name)
            print pdf_name
        else:
            has_text_list.append(pdf_name)

print does_not_have_text_list

这里有一点背景。该路径充满了pdf。有些是使用 Adobe pdf 打印机从文本文档中保存的（至少我认为他们是这样做的）。有些被扫描为图像。我想将它们和 OCR 分开，那些是图像（非图像是完美的，不应该被弄乱）。

几天前我在这里问过如何做到这一点：

PDF 的批量 OCR 程序

我得到的唯一回应是在 VB 中，我只说 python。所以我想我会尝试为我自己的问题写一个答案。我的策略（反映在上面的代码中）是这样的。如果它只是一个图像，那么该正则表达式将返回一个空列表。如果它有文本，则正则表达式（表示任何具有 2 个或更多字母数字字符的单词）将返回一个填充有 u'word' 之类的内容的列表（在 python 中，我认为这是一个 unicode 字符串）。

所以代码应该可以工作，我们可以迈出第一步，使用开源软件完成其他线程（将 ocrd 与图像 pdf 分开），但我不知道如何处理这个过滤器错误并且谷歌搜索不是有帮助。因此，如果有人知道，那将很有帮助。

我真的不知道如何使用这些东西。我不确定 pyPdf speak 中的 filter 是什么意思。我认为它说它不能真正阅读 pdf 或其他东西，即使它是 ocrd。有趣的是，我将非 ocrd pdf 和 ocrd pdf 之一与 python 文件放在同一个文件夹中，这仅适用于没有 for 循环的那个，所以我不知道为什么要使用创建的 for 循环过滤器错误。我将在下面发布单个代码。谢谢。

from pyPdf import PdfFileWriter, PdfFileReader
import sys, os, pyPdf, re

pdf = pyPdf.PdfFileReader(open(my_ocrd_file.pdf', 'rb'))

has_text_list = []
does_not_have_text_list = []

for i in range(0, pdf.getNumPages()):
    content = pdf.getPage(i).extractText()
    does_it_have_text = re.findall(r'\w{2,}', content)
      print does_it_have_text

它会打印东西，所以我不知道为什么我会在一个而不是另一个上得到过滤器错误。当我对目录中的另一个文件（不是 ocrd 的文件）运行此代码时，输出是一行的空字符串和下一行的空字符串，如下所示：

[]
[]

所以我也不认为这是非 ocrd pdf 的过滤器问题。这就像我的头，我需要一些帮助。

编辑：

谷歌搜索找到了这个，但我不知道该怎么做：

http://vaitls.com/treas/pdf/pyPdf/filters.py

score 2 · Accepted Answer

将 pyPdf 的 filter.py 替换为pyPdf 源文件夹中的http://vaitls.com/treas/pdf/pyPdf/filters.py。这对我有用。

score 1 · Accepted Answer

LZW是一种用于 GIF 和有时用于 PDF 的压缩格式。如果您查看可用的过滤器，pyPdf.filters您会发现 LZW 不存在，因此NotImplementedError.您发布的链接是在有人实现 LZW 过滤器的 subversion 存储库中编码。

python - Python、pyPdf、Adobe PDF OCR 错误：不支持的过滤器/lzwdecode

2 回答 2

Related

Reference