python - Python script to iterate through PDFs in a directory and find a matching line

Question

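Currently, all my reports are emailed to me as PDF attachments. I have set up Outlook to download these files automatically to a directory every day. Sometimes a PDF contains no data, only the line "There is no data to present that matches the selection criteria". I would like to create a Python program that goes through each PDF file in that directory, opens it, and looks for those words. If a file contains the phrase, the program should delete that PDF; if it does not, it should do nothing. With help from reddit, I pieced together the following code: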

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open("{}/{}".format(directory,file), 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

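I have tested it with 3 files, one of which contains the matching phrase. It fails no matter how the files are named or what order they are in. I have also tested it with a single file in the directory, named 3.pdf. Below is the error.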

FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'

This would greatly reduce my workload, and it is a good learning example for a newbie like me. All help/criticism is welcome.

score 2 · Accepted Answer

See below:

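import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
phrase = "There is no data to present that matches the selection criteria"
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    # os.listdir returns bare file names, so build the full path once
    path = os.path.join(directory, file)  # Changes here
    with open(path, 'rb') as pdfFileObj:
        # In PyPDF2 3.0 and later these calls are PdfReader, .pages[0] and .extract_text()
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        is_empty = phrase in pageObj.extractText()
    # Delete only after the with block has closed the file:
    # Windows refuses to remove a file that is still open
    if is_empty:
        # os.remove(file) looked in the current working directory, hence the FileNotFoundError
        os.remove(path)
        print("{} was removed.".format(file))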
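
The error in the question comes from the fact that os.listdir returns bare file names, so os.remove(file) is resolved against the current working directory rather than the reports directory. A minimal sketch of the same loop, with the path handling separated from the PDF parsing, might look like the following. The names remove_matching, contains_phrase and demo are hypothetical, and the byte-level search is only a stand-in for real PDF text extraction so that the example runs without PyPDF2.

```python
import os
import tempfile

PHRASE = b"There is no data to present that matches the selection criteria"


def contains_phrase(path, phrase):
    """Stand-in check (assumption): search the raw bytes of the file.

    A real script would extract the page text with a PDF library instead.
    The file is closed when the with block ends, before any deletion happens.
    """
    with open(path, "rb") as handle:
        return phrase in handle.read()


def remove_matching(directory, should_remove):
    """Delete every .pdf in directory for which should_remove(full_path) is true."""
    removed = []
    for name in os.listdir(directory):
        if not name.lower().endswith(".pdf"):
            continue
        # os.listdir gives only the name, so join it with the directory
        full_path = os.path.join(directory, name)
        if should_remove(full_path):
            os.remove(full_path)
            removed.append(name)
    return sorted(removed)


def demo():
    """Build a throwaway directory and show which files get removed."""
    with tempfile.TemporaryDirectory() as tmp:
        contents = {
            "empty.pdf": PHRASE,             # matches, should be deleted
            "full.pdf": b"real report data",  # no match, should stay
            "notes.txt": PHRASE,             # not a .pdf, should stay
        }
        for name, data in contents.items():
            with open(os.path.join(tmp, name), "wb") as handle:
                handle.write(data)
        removed = remove_matching(tmp, lambda p: contains_phrase(p, PHRASE))
        return removed, sorted(os.listdir(tmp))
```

Passing the check in as a function keeps the delete logic independent of the PDF library, so the PyPDF2 call can be swapped in for contains_phrase without touching the path handling.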
