linux - Bash script to check whether PDFs are OCR'd

Question

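I really don't know where to start.

I have a Linux server with over 8000 PDFs and need to know which of them have been OCR'd and which have not.

I was thinking of some kind of script that calls XPDF to check the PDFs, but to be honest I'm not sure whether that is even possible.

Thanks in advance for any help.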

score 4 · Accepted Answer

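Make sure you have the pdffonts command-line tool installed. (There are two versions of it: one ships as part of xpdf-utils, the other as part of poppler-utils.)

PDFs that consist only of scanned pages do not use any fonts (neither embedded nor non-embedded).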

The command line

pdffonts /path/to/scanned.pdf

will then show no font information for that file.

This may already be enough for you to sort your files into two different sets.
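As a rough sketch of that split, the helpers below count the font rows pdffonts reports and append each path to one of two list files. This assumes pdffonts prints a two-line table header before any font rows. The names `count_font_rows`, `sort_pdf`, `scanned.txt` and `with-fonts.txt` are made up for illustration, and a file that pdffonts cannot read is counted as having no fonts.

```shell
#!/bin/sh
# Count the font rows pdffonts reports for one file.
# Assumption: the first two lines of output are the table header.
count_font_rows() {
    pdffonts "$1" 2>/dev/null | tail -n +3 | grep -c .
}

# Append the path to scanned.txt (no fonts) or with-fonts.txt (some fonts).
sort_pdf() {
    if [ "$(count_font_rows "$1")" -eq 0 ]; then
        echo "$1" >> scanned.txt
    else
        echo "$1" >> with-fonts.txt
    fi
}

# Usage: sort_pdf /path/to/some.pdf
```

Both list files are created in the current directory, so it is probably easiest to run this from a scratch directory.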

If your PDFs mix scanned pages with "normal" pages (or scanned-and-OCR'ed pages), you will have to extend and refine the simplistic approach above. See man pdffonts or pdffonts --help for more information.
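One possible refinement is to ask for the fonts of each page separately and list the pages that report none. The sketch below assumes pdfinfo (shipped in the same packages as pdffonts) is installed, and uses the `-f` and `-l` options of pdffonts to restrict the page range to a single page. `pages_without_fonts` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the numbers of all pages for which pdffonts lists no fonts.
# Assumptions: pdfinfo prints a "Pages: N" line, and the first two lines
# of pdffonts output are the table header.
pages_without_fonts() {
    total=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
    page=1
    while [ "$page" -le "${total:-0}" ]; do
        rows=$(pdffonts -f "$page" -l "$page" "$1" 2>/dev/null | tail -n +3)
        if [ -z "$rows" ]; then
            echo "$page"
        fi
        page=$((page + 1))
    done
}

# Usage: pages_without_fonts /path/to/mixed.pdf
```

This starts pdffonts once per page, so it will be noticeably slower than a single call per file and is probably best reserved for files that need a closer look.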

score 4 · Accepted Answer

The problem with pdffonts is that sometimes it returns nothing, like this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

And sometimes it returns:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
[none]                               Type 3            yes no  no     297  0
[none]                               Type 3            yes no  no     341  0
[none]                               Type 3            yes no  no     381  0
[none]                               Type 3            yes no  no     394  0
[none]                               Type 3            yes no  no     428  0
[none]                               Type 3            yes no  no     441  0
[none]                               Type 3            yes no  no     451  0
[none]                               Type 3            yes no  no     480  0
[none]                               Type 3            yes no  no     492  0
[none]                               Type 3            yes no  no     510  0
[none]                               Type 3            yes no  no     524  0
[none]                               Type 3            yes no  no     560  0
[none]                               Type 3            yes no  no     573  0
[none]                               Type 3            yes no  no     584  0
[none]                               Type 3            yes no  no     593  0
[none]                               Type 3            yes no  no     601  0
[none]                               Type 3            yes no  no     644  0

With that in mind, let's write a small text tool to get all the font names from a PDF:

pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

If your PDF has not been OCR'ed, this outputs either nothing or [none].
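To see what the pipeline does to the two kinds of output shown earlier, the sketch below feeds saved sample text through the same filters instead of calling pdffonts. The function name `font_names` and the shortened sample are illustrative only.

```shell
#!/bin/sh
# Same filters as the pipeline, reading pdffonts output from standard input:
# drop the two header lines, keep the first column, de-duplicate.
font_names() {
    tail -n +3 | cut -d' ' -f1 | sort | uniq
}

# Header only (no fonts at all): prints nothing.
font_names <<'EOF'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
EOF

# Only unnamed Type 3 fonts: prints a single line, [none]
font_names <<'EOF'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
EOF
```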

If you want it to run faster, use the -l flag to analyze only the first 5 pages:

pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

Now wrap it in a bash script, e.g. is-pdf-ocred.sh:

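#!/bin/bash
# Collect the distinct font names from the first 5 pages of the PDF given as $1.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only unnamed Type 3 fonts, is treated as "not OCR'ed".
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi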

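Finally, we want to be able to search for PDFs. The find command does not know about the aliases or functions in your .bashrc, so we need to give it the path to the script. Run it in a directory of your choice like this: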

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;

I'm assuming the PDF files end in .pdf, although that is not always an assumption you can make. You may want to pipe the output to less or write it to a text file:

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt
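If the .pdf extension cannot be relied on, one option is to look at the file contents instead. The sketch below assumes a PDF starts with the bytes `%PDF-` at the very beginning of the file, which holds for well-formed files but not for ones with leading junk. `is_pdf` and `list_pdfs` are hypothetical helper names, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Succeed if the file starts with the PDF signature "%PDF-".
# tr strips NUL bytes so binary files do not upset the comparison.
is_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null | tr -d '\000')" = "%PDF-" ]
}

# Print every regular file under the given directory (default: current
# directory) that looks like a PDF, whatever its name is.
list_pdfs() {
    find "${1:-.}" -type f | while IFS= read -r f; do
        if is_pdf "$f"; then
            echo "$f"
        fi
    done
}

# Usage: list_pdfs /path/to/documents
```

The paths printed by `list_pdfs` could then be passed to the checking script one by one in place of the `-name "*.pdf"` filter.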

With the -l 5 flag, I was able to get through about 200 PDFs in a little over 10 seconds.
