pdf - How can I batch-list PDFs that contain annotations? qpdf? pdfinfo?

Question

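When I printed a PDF that I had annotated with Okular, I was surprised that it came out without the annotations, even though they do show on screen. I had to save the annotated file as a printed PDF, and then print that.

Question: how can I list all PDFs that have at least one annotation on at least one page?

Apparently pdfinfo returns AcroForm when there are annotations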

            find -type f -iname "*.pdf" -exec pdfinfo {} \;

but that does not show the file names.

I am not familiar with qpdf, but it does not seem to provide this information.

Thanks

score 0 · Accepted Answer

With pdfinfo from poppler-utils you could say,

find . -type f -iname '*.pdf' | while read -r pn
do  pdfinfo "$pn" |
    grep -q '^Form: *AcroForm' && printf '%s\n' "$pn"
done

to list the names of the PDF files for which pdfinfo reports:

Form:           AcroForm

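However, in my tests it missed several PDFs that have text annotations, and listed several that have none, so I would avoid it for this job. Below are 2 alternatives: qpdf supports all annotation subtypes, while python3-poppler-qt5 supports only a subset but can be faster.

(For non-POSIX shells, adapt the commands in this post.)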
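Where no POSIX shell is available, the file-enumeration half of these commands (the find . -type f -iname '*.pdf' part) could be sketched in Python roughly as follows. The function name iter_pdfs is hypothetical and not part of any tool mentioned here; symbolic links are skipped on the assumption that this is what -type f would do.

```python
import os


def iter_pdfs(root="."):
    """Yield paths of regular files under root whose name ends in .pdf, ignoring case."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symbolic links (assumption: mirrors find's -type f).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if name.lower().endswith(".pdf"):
                yield path
```

Each yielded path can then be handed to whichever check is preferred, without any concern about whitespace or newlines in file names.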

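Edit: the find constructs were edited to avoid unsafe and GNU-dependent {}s.

qpdf versions since 8.3.0 support a json representation of non-content PDF data, and if your system has the jq JSON processor you can list the unique PDF annotation types as tab-separated values (in this case discarding the output and using only the exit code):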

find . -type f -iname '*.pdf' | while read -r pn
do  qpdf --json --no-warn -- "$pn" |
    jq -e -r --arg typls '*' -f annots.jq > /dev/null && 
    printf '%s\n' "$pn"
done

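where

--arg typls '*' specifies the wanted annotation subtypes, e.g. * for all (the default), or Text,FreeText,Link for a selection
-e sets exit code 4 if there is no output (no annotations found)
-r produces raw (non-JSON) output
the jq script file annots.jq contains the following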

#! /usr/bin/env jq-1.6
def annots:
    ( if ($typls | length) > 0 and $typls != "*"
      then $typls
      else
        # annotation types, per Adobe's PDF Reference 1.7 (table 8.20)
        "Text,Link,FreeText,Line,Square,Circle,Polygon"
        + ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
        + ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
        + ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
      end | split(",")
    ) as $whitelist
    | .objects
    | .[]
    | objects
    | select( ."/Type" == "/Annot" )
    | select( ."/Subtype" | .[1:] | IN($whitelist[]) )
    | ."/Subtype" | .[1:]
    ;
[ annots ] | unique as $out
| if ($out | length) > 0 then ($out | @tsv) else empty end
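For readers who do not have jq, the same filtering logic can be sketched in Python. This is only an illustration, under the assumption that the qpdf JSON has a top-level "objects" map whose values are the PDF objects, which is what the script above relies on. As far as I know, newer qpdf releases may emit a different JSON layout by default, so check the output of your version first. The names annot_subtypes and has_annots are hypothetical.

```python
import json

# Default subtype list, copied from annots.jq above.
ALL_SUBTYPES = (
    "Text,Link,FreeText,Line,Square,Circle,Polygon"
    ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
    ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
    ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
).split(",")


def annot_subtypes(qpdf_json, typls="*"):
    """Return the sorted unique annotation subtypes present in parsed qpdf JSON."""
    wanted = set(typls.split(",")) if typls and typls != "*" else set(ALL_SUBTYPES)
    found = set()
    # Assumption: "objects" maps object ids to the objects themselves.
    for obj in qpdf_json.get("objects", {}).values():
        # Only dictionaries can be annotations; skip arrays, numbers, etc.
        if not isinstance(obj, dict) or obj.get("/Type") != "/Annot":
            continue
        subtype = obj.get("/Subtype")
        # Drop the leading "/" of the PDF name before comparing.
        if isinstance(subtype, str) and subtype[1:] in wanted:
            found.add(subtype[1:])
    return sorted(found)


def has_annots(json_text, typls="*"):
    """True if the qpdf JSON text contains at least one wanted annotation."""
    return bool(annot_subtypes(json.loads(json_text), typls))
```

The text produced by qpdf --json can be passed to has_annots, whose boolean result plays the role of the jq -e exit code.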

For many purposes it is tempting to use python-3.x with python3-poppler-qt5 to process the whole file list in one go,

find . -type f -iname '*.pdf' -exec python3 path/to/script -t 1,7 {} '+'

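where the -t option lists the wanted annotation subtypes, per the poppler documentation; 1 is AText and 7 is ALink. Without -t, all subtypes known to poppler (0 through 14) are selected, i.e. not all existing subtypes are supported.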

#! /usr/bin/env python3.8
import popplerqt5

def gotAnnot(pdfPathname, subtypls):
    pdoc = popplerqt5.Poppler.Document.load(pdfPathname)
    for pgindex in range(pdoc.numPages()):
        annls = pdoc.page(pgindex).annotations()
        if annls is not None and len(annls) > 0:
            for a in annls:
                if a.subType() in subtypls:
                    return True
    return False

if __name__ == "__main__":
    import sys, getopt
    typls = range(14+1)         ## default: all subtypes
    opts, args = getopt.getopt(sys.argv[1:], "t:")
    for o, a in opts:
        if o == "-t" and a != "*":
            typls = [int(c) for c in a.split(",")]
    for pathnm in args:
        if gotAnnot(pathnm, typls):
            print(pathnm)
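Since the numeric subtype codes are hard to remember, a small helper could let -t accept names as well. The sketch below is not part of the script above. Only 1 (AText) and 7 (ALink) are confirmed in this post; the other values in the table are my recollection of the Poppler::Annotation::SubType enum and should be treated as assumptions to verify against the poppler documentation for your version. The names SUBTYPE_NUMBERS and parse_subtypes are hypothetical.

```python
# Assumed values of Poppler::Annotation::SubType. Only AText = 1 and
# ALink = 7 are confirmed above; verify the rest before relying on them.
SUBTYPE_NUMBERS = {
    "AText": 1, "ALine": 2, "AGeom": 3, "AHighlight": 4, "AStamp": 5,
    "AInk": 6, "ALink": 7, "ACaret": 8, "AFileAttachment": 9, "ASound": 10,
    "AMovie": 11, "AScreen": 12, "AWidget": 13, "ARichMedia": 14,
}


def parse_subtypes(arg):
    """Turn a -t value such as '1,ALink' into a list of subtype numbers."""
    if arg == "*":
        return list(range(14 + 1))  # same default as the script: 0 through 14
    numbers = []
    for token in arg.split(","):
        token = token.strip()
        # Plain digits are taken as codes; anything else is looked up by name.
        numbers.append(int(token) if token.isdigit() else SUBTYPE_NUMBERS[token])
    return numbers
```

With this in place, the list comprehension that builds typls in the script could be swapped for parse_subtypes(a); an unknown name raises KeyError rather than being silently ignored.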
