Problem statement

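  1. Read a PDF and search for a word.
  2. If the word is found, annotate it and crop the region around the annotated text in the PDF file.
  3. Each cropped image should contain only one annotation.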

Libraries and versions

  1. python-3.6
  2. fitz-0.0.1.dev2
  3. pymupdf-1.17.5

Issue faced

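For the first two iterations, the annotation is perfect and the cropping works as expected. On later iterations, when I move to the next occurrence of the search word in the text instances and crop around that region, the annotation of the search word fails. I could not find a solution to this problem.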

import fitz  # PyMuPDF

def cropPdf(pdfName, word):
    c = 0
    # open the PDF file with fitz
    fitz_doc = fitz.open(pdfName)

    # get the first page of the document
    fitz_page = fitz_doc[0]
    # find every instance of the search word on the page
    text_instances = fitz_page.searchFor(word)
    # iterate over each text instance
    for text_cord in text_instances:
        c = c + 1
        # add a highlight (rectangle annotation) around the search word
        highlight = fitz_page.addRectAnnot(text_cord)
        # get the bounding-box coordinates
        x0, y0, x1, y1 = highlight.rect
        # here I set the cropping area around the annotated text
        fitz_page.setCropBox(fitz.Rect(x0+600, y0+600, x0-600, y0-600))
        # render the cropped page to a pixmap
        pix = fitz_page.getPixmap()
        print(fitz_page.number)
        base_name_highlight = "output" + str(c) + ".png"
        # save the cropped area as a PNG file
        pix.writeImage("./highlight_folder/" + base_name_highlight)
        # delete the annotation to avoid duplicate annotations inside a cropped area
        # when annotating the next occurrence of the word
        fitz_page.deleteAnnot(highlight)

cropPdf(pdfName="A4_4.pdf", word="INSULATION")

Resulting images

  1. Expected output for all cropped images

  2. Failure case while cropping

1 Answer

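Changing the crop box can affect all coordinates on the page. So before entering the annotation loop, I should store the initial state of the crop box in a variable, and at the end of each iteration I should reset the page to that initial crop box. This lets the next occurrence be annotated without its coordinates shifting.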
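
The save-and-reset idea could be sketched roughly as below. This is only an illustration, not tested against the asker's PDF: `temporary_cropbox` and `crop_area` are hypothetical names, and it assumes the value read from `page.CropBox` can be handed back to `page.setCropBox` unchanged. `CropBox` and `setCropBox` are the PyMuPDF 1.17.x spellings used in the question; later releases renamed them to `cropbox` and `set_cropbox`.

```python
from contextlib import contextmanager


@contextmanager
def temporary_cropbox(page, crop_rect):
    """Apply crop_rect to the page, then restore the previous crop box on exit.

    `page` is assumed to expose `CropBox` and `setCropBox` (PyMuPDF 1.17.x names).
    """
    original = page.CropBox        # remember the crop box before changing it
    page.setCropBox(crop_rect)     # crop for this one rendering only
    try:
        yield page
    finally:
        page.setCropBox(original)  # reset so later coordinates are not shifted


# Hypothetical usage inside the question's loop (requires PyMuPDF):
# for text_cord in text_instances:
#     highlight = fitz_page.addRectAnnot(text_cord)
#     with temporary_cropbox(fitz_page, crop_area):  # crop_area: a fitz.Rect you choose
#         fitz_page.getPixmap().writeImage("output.png")
#     fitz_page.deleteAnnot(highlight)
```

Wrapping the reset in `finally` means the page returns to its initial crop box even if rendering or saving raises an error part-way through an iteration.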

Answered 2020-08-16T19:20:52.953