shell - How do I write a shell script to find the number of pages in a PDF?

Question

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?

score 60 · Accepted Answer

Without any extra packages:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
    | sort -rn | head -n 1
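
To see what the sed expression actually extracts, here is a minimal sketch that wraps the same pipeline stages in a function and can be fed plain text. The name max_count is made up for illustration. Keep in mind that /Count can also appear in other dictionaries (outlines, for example), and that PDFs using compressed object streams may not expose it to strings at all, so treat the result as a best guess.

```shell
# Hypothetical helper (not part of any tool): reads text on stdin and
# prints the largest number that follows "/Count ".
max_count() {
    # Keep only the digits after "/Count " (an optional "-" is skipped),
    # then sort numerically in descending order and take the first line.
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}
```

Usage would look like: strings < file.pdf | max_count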

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also use pdfinfo to recursively sum the total number of pages across all PDFs, like this:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
    awk '/^Pages:/ {n += $2} END {print n}'
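
If you also want to see which file contributes what, a variation along these lines might help. It is only a sketch: the function name list_pdf_pages is invented here, it assumes pdfinfo is on PATH, and it assumes file names contain no newlines.

```shell
# Hypothetical helper: print "<pages> <file>" for every PDF under a
# directory (default: current directory), followed by a total line.
# Assumes pdfinfo is installed and file names contain no newlines.
list_pdf_pages() {
    find "${1:-.}" -xdev -type f -name '*.pdf' | sort | while IFS= read -r f; do
        # Files pdfinfo cannot read are reported with a count of 0.
        n=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ { print $2; exit }')
        printf '%s %s\n' "${n:-0}" "$f"
    done | awk '{ total += $1; print } END { print total + 0, "total" }'
}
```

Call it as list_pdf_pages /path/to/dir; the last line of output is the sum.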

score 10 · Accepted Answer

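The ImageMagick library provides a tool called identify. It prints one line per page, so counting its output lines gets you what you are after. ImageMagick is easy to install on OS X with brew.

Here is a working bash script that captures the count in a shell variable and prints it back to the screen: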

#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
# identify prints info for each page; dump stderr to /dev/null
# count the lines of output
# trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"

The output from running it:

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$

score 9 · Accepted Answer

The pdftotext utility converts a PDF file to text, inserting page breaks (form-feed characters, $'\f') between the pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  '-',  the  text  is
       sent to stdout.

There are many ways to combine these tools to solve your problem; choose one of them:

1) pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2) pdftotext + awk (v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3) pdftotext + awk (v2):

$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4) pdftotext + awk (v3):

$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
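
One thing to watch with the grep variant: grep -c counts lines that contain a form feed, not the form feeds themselves, so two page breaks on the same line (which could happen with consecutive blank pages) are counted once. A sketch that counts every form-feed byte instead, assuming pdftotext emits one form feed per page; the function name count_form_feeds is made up here:

```shell
# Hypothetical helper: count every form-feed byte arriving on stdin.
count_form_feeds() {
    # tr -cd deletes everything except form feeds; wc -c counts what is left.
    # The arithmetic expansion strips the padding some wc versions print.
    echo "$(( $(tr -cd '\f' | wc -c) ))"
}
```

Usage would be: pdftotext file.pdf - | count_form_feeds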

Hope it Helps!

score 8 · Accepted Answer

Here is a version for the command line directly (based on pdfinfo):

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
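
A small caveat: grep Pages matches any line containing that word, so a document whose title happens to include "Pages" would produce extra output. Anchoring on the field name avoids that. A minimal sketch, with pages_from_info being a name invented for this example:

```shell
# Hypothetical helper: reads pdfinfo-style "Key: value" lines on stdin
# and prints the value of the "Pages:" field only.
pages_from_info() {
    # ^Pages: anchors at the start of the line, so a "Title: ... Pages"
    # line is ignored; exit stops after the first match.
    awk '/^Pages:/ { print $2; exit }'
}
```

On the command line: for f in *.pdf; do pdfinfo "$f" | pages_from_info; done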

score 3 · Accepted Answer

Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

How does this work? Well, if you specify a first page which is larger than the pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:

Wrong page range given: the first page (1000000) can not be after the last page (142).

So I redirect that stderr message to stdout with 2>&1, then pipe it to grep to match the (142). part with the regular expression ([0-9]*)\.$ (in grep's basic syntax the parentheses are literal), then pipe that to grep again with the regular expression [0-9]* to isolate the number, which is 142 in this case. That's it!
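
The two grep stages can also be folded into a single sed substitution. This is just a sketch: last_page_from_error is a made-up name, and it assumes the error message has exactly the wording shown above, which may differ between pdftoppm versions.

```shell
# Hypothetical helper: reads pdftoppm's "Wrong page range" message on
# stdin and prints the number inside the final "(N)." of the line.
last_page_from_error() {
    # In basic regex syntax "(" and ")" are literal, while "\(...\)"
    # captures the digits so that \1 can print them.
    sed -n 's/.*last page (\([0-9][0-9]*\))\.$/\1/p'
}
```

Usage would be: pdftoppm mypdf.pdf -f 1000000 2>&1 | last_page_from_error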

Wrapper functions and speed testing

Here are a couple of wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi

    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"

    echo "$_num_pgs"
}

Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg pdf, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same pdf. The pdfinfo technique in Ocaso's answer below is also very fast--the same as the pdftoppm one.

#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
#       pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'

[[ "$#" != "1" ]] && {
   printf "ERROR: No file specified\n"
   exit 1
}

numpages=0
while read line; do
   num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
   (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages

score 2 · Accepted Answer

mupdf/mutool solution:

mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2

answered 2020-11-10T03:41:24.863

score 1 · Accepted Answer

If you're on macOS you can query pdf metadata like this:

mdls -name kMDItemNumberOfPages -raw file.pdf

as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal

score 0 · Accepted Answer

I made a small improvement to Marius Hofert's tip so that it sums the returned values.

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'

score 0 · Accepted Answer

To build on Marius Hofert's answer, this command uses a bash for loop to show you the number of pages, display the filename, and it will ignore the case of the file extension.

for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done
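
If the directory has no matching files, the glob is passed through literally and pdfinfo complains about a file that does not exist. A sketch of the same loop as a function with that case guarded; pages_per_file is a name invented here, and it assumes pdfinfo is on PATH:

```shell
# Hypothetical helper: print "<pages> <file>" for each PDF directly inside
# a directory (default: current), matching the extension in any case.
pages_per_file() {
    for f in "${1:-.}"/*.[pP][dD][fF]; do
        [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
        n=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ { print $2; exit }')
        # "?" marks files for which pdfinfo reported no page count.
        printf '%s %s\n' "${n:-?}" "$f"
    done
}
```

Run it as pages_per_file, or pages_per_file /some/dir for another directory.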
