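I have successfully extracted text from a single PDF using a combination of the R magick package and tesseract, but I have hit a wall when trying to process multiple images. (This is for a non-profit.)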

I welcome answers in bash, but ask that they be comprehensive and not skip the tesseract component.

The answers to this question deal with image cleanup without OCR, so I am not sure how the first answer there would be integrated here.

Image data: [image]

My process:

library(tesseract)
library(dplyr)
library(stringr)
library(pdftools)
library(readr)
library(magick)
library(purrr)
# original data
#pdf <- https://github.com/pembletonc/Project44_Text_Extraction/blob/master/test-data/001_0145.pdf

#image file (note that size here doesn't match processing below because of 2mb limit)

file_name <- tools::list_files_with_exts(dir = "./test-data", exts = "pdf")
page_count <- pdf_info(file_name)$pages  

multi_files <- list(pdftools::pdf_convert(file_name, page = 1:page_count,
                                          filenames = paste0("./test-data/", "page", 1:page_count, ".png"),dpi = 250))

#or just get the file names if the images were already created
#multi_files <- list(tools::list_files_with_exts(dir = "./test-data", exts = "png"))

To read the images in as magick objects:

multi_images <- map(multi_files, image_read)

which creates a magick image object (printed as a tibble) with the images more or less joined as frames:

[[1]]
# A tibble: 5 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG     3243   2010 sRGB       FALSE        0 98x98  
2 PNG     3247   2013 sRGB       FALSE  4515441 98x98  
3 PNG     3243   2013 sRGB       FALSE  4559229 98x98  
4 PNG     3247   2010 sRGB       FALSE  4270145 98x98  
5 PNG     3247   2010 sRGB       FALSE  3212528 98x98  

How do I access each PNG in it so that it can be cleaned up and run through OCR?

multi_text_clean <- function(images){

  Map(function(x) {
    x %>% 
      image_crop(geometry_area(width = 2200, height = 1600, y_off = 500, x_off = 650)) %>%  
      image_resize("2000x") %>%
      image_background("white", flatten = TRUE) %>% 
      image_noise(noisetype = "Uniform") %>%          # Reduce noise in image using a noise peak elimination filter
      image_enhance() %>%                             # Enhance image (minimize noise)
      image_normalize() %>% 
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_contrast(sharpen = 1) %>%
      #image_deskew(threshold = 40) %>% 
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)

}

This only runs on the first image:

text_list <-  multi_text_clean(multi_images)
(text_multi <- stringr::str_split(text_list, pattern = "\\s{5,}"))

[[1]]
 [1] "Weather clear all day. A small arms inspection held at 1400 hrs. A recce party went\njout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nfor B Coy personnel by our YMCA Supervisor."                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
 [2] ")\nWeather clear and cold all day. Personnel packed equipment early in the morning and |~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,,\nPW brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division.\nNo other activity during the day. Patrols were sent out during the night by all coys}) u\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chateawv .\n\\Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
 [3] "“y\neather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
 [4] "f\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\nling but the CO cancelled it. Two Polish deserters from the German army walked into\n|A Coy lines."                                                                                                                                                                                                                                                                                                                          
 [5] "iz\nWeather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\nnew location at 0830 hrs. Unit started to move to new location at 1200 hrs, Unit   Bs\narrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
 [6] "| 9\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.| |\nQuiet all day. No enemy activity during the day
 [7] "|\neather overcast and snowing. Intelligence Section set up another OP at MR 268814.\nNo enemy activity during the day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan
 [8] ":"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 [9] "‘\nWeather clear and cold, Bm started to move at 0830 hrs. Bn reached Champlon
[10] "&\nFamenine, MR 3182 at 1230 hrs. Bn relieved the HLI. Coys immediately took up
[11] ":\npositions for all around defence
[12] "4\n"                                                                                                                                                                                                                             

How can I run this over every image in that magick object?

2 Answers

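Here is a script I put together (using many examples from Stack Overflow) that processes multiple .pdf files in a directory (or just one). Perhaps it will help?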

You can download the script at: https://drive.google.com/file/d/1fB9P0TQchE6vEr2MBug47aJIPc4yag45/view?usp=sharing

#!/bin/bash
echo -ne "\033]0;CREATE SEARCHABLE PDF MULTIPLE .PDF FILES\007" # set terminal title

# notes to user

# install ImageMagick
# install tesseract
# install pdftk
# install libtiff

# get current directory

current_dir=$(pwd)

# set temp directory name

temp_dir_nme=pdf_OCR_temp_nkAIumgy430qIRVn3Np6ZQx

# warn user that spaces in names are unsupported and files will be renamed: \e[1;31m turns red on, \033[0m turns color off

echo -e "\e[1;31mTAKE NOTE that directory names containing spaces are unsupported!\033[0m"
printf "\n"

echo -e "\e[1;31mTAKE NOTE FURTHER that your .pdf file/files will be renamed to replace any spaces in their names with underscores!\033[0m"
printf "\n"

# give user a chance to exit...

read -p "Press enter to continue..."
printf "\n"

# rename .pdf files to replace spaces with underscores

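for f in *.pdf
    do
        # skip names without spaces so mv is not asked to rename a file onto itself
        [ "$f" = "${f// /_}" ] || mv -- "$f" "${f// /_}"
done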

# run script for each .pdf file in folder
for f in *.pdf
    do

    # establish path to input .pdf file
        path_to_file="$current_dir/"$f

    # make temp directory for operations
        mkdir $current_dir/$temp_dir_nme

    # copy .pdf file to temp directory
        cp $f $current_dir/$temp_dir_nme

    # change to temp directory to work the magick
        cd $current_dir/$temp_dir_nme

        no_pgs=$(pdftk $f dump_data | grep NumberOfPages | awk '{print $2}')
        pgs_per_vol=10 # for .pdf's of more than ten pages
        min_volumes=$(( no_pgs / pgs_per_vol )) # for .pdf's of more than ten pages
        fin_volume=$(( min_volumes+1 )) # for .pdf's of more than ten pages
        unlik_nme="nkAIumgy430qIRVn3Np6ZQx_" # give .pdf volumes unlikely names

        let "ss = $min_volumes * 10" 1> /dev/null
        let "tt = $ss + 1" 1> /dev/null

    # chop .pdf into volumes

    # chop .pdf into one volume with a new name if it has ten or less pages
        if [ $no_pgs -lt 11 ]
            then
            pdftk $f cat 1-$no_pgs output $unlik_nme.pdf
        fi

    # chop .pdf into multiple volumes if it has eleven or more pages, excluding the final volume
        if [ $no_pgs -gt 10 ]
            then
                echo Chopping $f into $fin_volume volumes, to a maximum of ten pages per volume...
    
                i=1
                j=1
                k=$pgs_per_vol

                pdftk $f cat 1-$pgs_per_vol output "${unlik_nme}${i}".pdf # concatenate variables

                while [ $i -ne $min_volumes ]
                    do  
                        j=$(( $j + pgs_per_vol ))
                        k=$(( $k + pgs_per_vol ))
                        i=$(( $i + 1 ))
        
                        pdftk $f cat $j-$k output "${unlik_nme}${i}".pdf
                    done

        # create final volume of whatever number of pages           
            pdftk $f cat $tt-end output "${fin_volume}${unlik_nme}".pdf 2> /dev/null
        fi

    # remove initial .pdf file
        rm $f

    # rename pdf volumes in directory sorted by modification time, oldest first: ls -tr
        n=0; ls -tr | while read a; do n=$(( n+1 )); mv -- "$a" "$(printf '%03d' "$n")"_"$a"; done
    
        total_vols=$(( $min_volumes + 2 ))

    # loop over .pdf volumes in directory

        for files in *.pdf
            do
                echo Exporting $files to .png images...

            # export .pdf volume  to .png images
                pdftoppm -r 150 $files exported -png

            # delete .pdf volume
                rm $files
                echo Converting .png files to .jpg files...

            # convert .png files to .jpg files
                magick convert *.png %03d_converted.jpg

            # delete first .png images
                rm *.png
                
            # deskew images
                echo Deskewing text...
                magick convert *.jpg -deskew 90% %03d_deskewed.jpg

            # delete converted .jpg images
                rm *converted.jpg

            # enhance contrast
                echo Enhancing contrast...
                magick convert -brightness-contrast 0x10 *.jpg %03d_contrast.jpg

            # delete deskewed images
                rm *deskewed.jpg

            # crop and resize .jpg images to A4 ratio
                echo Resizing, and cropping...
                magick mogrify -format jpg -geometry "1680x2376^" -gravity center -extent 1680x2376 *.jpg

            # generate compressed .tiff image from .jpg images
                echo Converting resized and cropped .jpg"'s" to a compressed .tiff file...
                magick convert -compress lzw *.jpg images.tiff

            # delete .jpg images
                rm *.jpg
        
            # create .pdf
                magick convert images.tiff images.pdf
                printf "\n"
                echo Recognizing text...
                
            # recognize text
                tesseract images.tiff text -l eng -c textonly_pdf=1 pdf
                printf "\n"

            # delete .tiff to save space
                rm images.tiff

            # put the page images behind the OCR text layer; the result is prefixed with "OC_"
                pdftk text.pdf multibackground images.pdf output "OC_"$files

            # compressing .pdf
                ps2pdf -dPDFSETTINGS=/ebook "OC_"$files "CR_"$files
        
        # end files loop
            done

# combine output .pdf files into "OCR"_original_file_name.pdf   
    pdftk CR_*.pdf cat output "OCR_"$f

# copy "OCR"_original_file_name.pdf to initial directory
    cp "OCR_"$f $current_dir

# change to previous directory
    cd $current_dir

# delete temporary directory, and temporary files
    rm -r $current_dir/$temp_dir_nme

    done

# print line

printf "\n"
printf '=%.s' {1..40}; echo
read -p "Press enter to exit..."
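
The volume-chopping arithmetic in the script above can be hard to follow, so here is a minimal sketch of it as a standalone function. `volume_ranges` is a hypothetical helper, not part of the original script. It prints one `first-last` page range per volume and clamps the last range to the real page count. That also covers the case where the page count is an exact multiple of ten, in which the script's `$tt-end` range would start past the last page (the script silences that pdftk error with `2> /dev/null`).

```shell
#!/bin/bash
# volume_ranges is a hypothetical helper, not part of the script above.
# Prints one "first-last" page range per line for a PDF of $1 pages,
# split into volumes of at most $2 pages (default 10).
volume_ranges() {
    local no_pgs=$1
    local pgs_per_vol=${2:-10}
    local first=1
    local last
    while [ "$first" -le "$no_pgs" ]; do
        last=$(( first + pgs_per_vol - 1 ))
        # clamp the final volume to the real page count
        if [ "$last" -gt "$no_pgs" ]; then
            last=$no_pgs
        fi
        echo "${first}-${last}"
        first=$(( last + 1 ))
    done
}
```

Each printed range could then be passed to `pdftk $f cat <range> output ...` in a `while read` loop, in place of the separate `if` branches and the `i`, `j`, `k` counters.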
answered 2022-02-09T14:54:08.780

You can do the following in ImageMagick.

Input:

[image]

convert img.jpg -negate -lat 20x20+10% -negate img_lat.jpg


[image]

Alternatively, I have a bash shell script built on ImageMagick, called textcleaner, that does the following:

textcleaner -f 20 -o 10 img.jpg img_textcleaner.jpg


[image]
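
Since the question asks for every page image to be cleaned and then passed to tesseract, here is a rough sketch of how the `-lat` cleanup above might be applied to a whole directory. `ocr_dir` and `out_name` are hypothetical helpers, and the directory layout is an assumption. The sketch also assumes the ImageMagick `convert` command and the `tesseract` CLI are on the PATH, and it reuses the `preserve_interword_spaces` setting from the question.

```shell
#!/bin/bash
# Sketch only: out_name and ocr_dir are hypothetical helpers.

# Map an image path to an output base name without directory or extension.
out_name() {
    local base
    base=$(basename "$1")
    echo "${base%.*}"
}

# Clean every .png in $1 with the -lat threshold from the answer above,
# then OCR the cleaned copy; results are written to $2/<name>.txt.
ocr_dir() {
    local in_dir=$1
    local out_dir=$2
    local img name
    mkdir -p "$out_dir"
    for img in "$in_dir"/*.png; do
        [ -e "$img" ] || continue   # nothing matched the glob
        name=$(out_name "$img")
        convert "$img" -negate -lat 20x20+10% -negate "$out_dir/${name}_lat.png"
        # tesseract appends .txt to the output base itself
        tesseract "$out_dir/${name}_lat.png" "$out_dir/$name" -l eng \
            -c preserve_interword_spaces=1
    done
}
```

Calling `ocr_dir ./test-data ./ocr-out` would then leave one cleaned image and one text file per page in `./ocr-out`. The `20x20+10%` values are the ones from this answer and would probably need tuning for other scans.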

answered 2019-10-04T16:04:39.813