pdftotext - Remove a page number, header and footer from pdf file

Question

I want to parse a PDF file. To do that I am using the pdftotext utility, which converts a PDF file into a text file. Now I want to remove the page numbers, headers and footers from the text file.

I am converting the PDF file with the following command:

pdftotext -layout input.pdf output.txt

Can anyone help me on this?

score 11 · Accepted Answer

You need to crop the page using the -H, -W, -y and -x parameters; at minimum -H, -W and -y.

Example:

pdftotext -y 80 -H 650 -W 1000 -nopgbrk -eol unix example.pdf


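-y 80   -> start the crop area 80 pixels below the top of each page (removes the header);
-H 650  -> keep 650 pixels of height below the -y offset (removes the footer);
-W 1000 -> a high value so that nothing is cropped horizontally (a width must be specified);

You need to tune -y and -H for each PDF, sometimes decreasing -y and increasing -H, to fit its header and footer.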
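
Since the crop values have to be tuned per PDF, it can help to wrap the command in a small script. The following is only a sketch, assuming pdftotext is on the PATH; the helper names build_crop_command and convert_cropped are made up for illustration, and the default numbers are simply the ones from the example above:

```python
import subprocess


def build_crop_command(pdf_path, txt_path, y=80, height=650, width=1000, x=0):
    """Return the pdftotext argument list for a cropped conversion.

    The defaults mirror the example command; they are not universal values.
    """
    return [
        "pdftotext",
        "-x", str(x),        # left edge of the crop area
        "-y", str(y),        # top edge of the crop area (skips the header)
        "-W", str(width),    # crop width; large so nothing is cut off sideways
        "-H", str(height),   # crop height (stops before the footer)
        "-nopgbrk",          # no form feed between pages
        "-eol", "unix",
        pdf_path,
        txt_path,
    ]


def convert_cropped(pdf_path, txt_path, **crop):
    """Run pdftotext with the crop box; raises CalledProcessError on failure."""
    subprocess.run(build_crop_command(pdf_path, txt_path, **crop), check=True)
```

For example, convert_cropped("example.pdf", "example.txt", y=60, height=700) would retry the conversion with a smaller header margin and a taller crop area.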

score 0 · Accepted Answer

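Look for a pattern in how the page numbers, headers or footers appear in the text. For example, when I converted a PDF file to text with pdftotext, I noticed that each page number sat on a line of its own, so I used a regular expression to remove them, like this: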

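import os
import re

src = "."  # placeholder: directory that holds the converted .txt files
for root, dirs, files in os.walk(src, topdown=False):
    for name in files:
        if name.endswith('.txt'):
            with open(os.path.join(root, name), "r") as fin:
                data = fin.read()
            # Drop lines that contain only a page number. re.DOTALL is omitted:
            # as the 4th positional argument it would be taken as `count`.
            new_text = re.sub(r'\n\d+\n\s', '', data)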

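This works because each page number is on a line by itself (with no other text) and is followed by a new line. I did the same for the headers and footers of the PDF file.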
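
The same idea can be extended to headers and footers that repeat verbatim on most pages. This is a rough sketch, not a general solution: it assumes the text was produced without -nopgbrk, so pages are separated by a form feed character, and the function name strip_page_furniture and the min_share threshold are invented for illustration. It removes any line that consists only of digits, plus any line that is the first or last text line on most pages:

```python
import re
from collections import Counter

# A line made up of digits only is treated as a page number (an assumption).
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def strip_page_furniture(text, min_share=0.6):
    """Remove page-number lines and repeated header/footer lines."""
    pages = [p.splitlines() for p in text.split("\f") if p.strip()]

    # Count on how many pages each first/last non-blank line occurs.
    counts = Counter()
    for lines in pages:
        body = [l.strip() for l in lines
                if l.strip() and not PAGE_NUMBER.match(l)]
        if body:
            counts.update({body[0], body[-1]})

    # A line counts as header/footer if it occurs on at least two pages
    # and on at least min_share of all pages.
    threshold = max(2, min_share * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = [l for l in lines
                if not PAGE_NUMBER.match(l) and l.strip() not in repeated]
        cleaned.append("\n".join(kept).strip("\n"))
    return "\n".join(cleaned)
```

Headers that change from page to page (for instance ones that embed the page number) are not caught by this, and a body line that happens to contain only a number would be removed too, so the result should be checked against the original.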
