
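However, since Xpdf can convert PDF to PNG directly, I skipped the ImageMagick conversion step, along with the faulty logic in the function(i) step: pdftopng takes a root name for its output (here "ocrbook", giving files such as "ocrbook-000001.png"), so looking for a PNG named after the original PDF raised an error.

My problem now is getting Tesseract to do anything at all with my PNG files. I get this error: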

Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in pixCreateNoInit: pix_malloc fail for data
Error in pixCreate: pixd not made
Error in pixReadStreamPng: pix not made
Error in pixReadStream: png: no pix returned
Error in pixRead: pix not read
Error during processing.

Here is my code:

lapply(myfiles, function(i){
  shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 ", i, " ocrbook")))
  mypngs <- list.files(path = dest, pattern = "png", full.names = TRUE)
  lapply(mypngs, function(z){
    shell(shQuote(paste0("tesseract ", z, " out")))
    file.remove(paste0(z))
  })
})

1 Answer


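Apparently the problem was that the DPI was set too high for Tesseract to handle. Changing the pdftopng DPI parameter (-r) from 600 to 150 seems to have fixed it. Tesseract appears to have an upper limit on the resolution it can work with; the pix_malloc failure in the log suggests the 600 DPI images were simply too large to load into memory.

I also changed my code from a static naming convention to a more dynamic one that mirrors each file's original name.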
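
To get a feel for why the resolution matters, here is a rough back-of-the-envelope sketch. It assumes a US Letter page (8.5 x 11 inches) and 4 bytes per pixel once the image is decoded in memory; both figures are my own assumptions, not something taken from the error log, and the helper names are made up for illustration.

```r
# Rough estimate of the in-memory size of one rendered page.
# Assumptions (not from the original post): 8.5 x 11 inch page,
# 4 bytes per pixel for the decoded image.
page_pixels <- function(dpi, width_in = 8.5, height_in = 11) {
  c(width = width_in * dpi, height = height_in * dpi)
}

page_megabytes <- function(dpi, bytes_per_pixel = 4) {
  px <- page_pixels(dpi)
  unname(px["width"] * px["height"] * bytes_per_pixel / 1e6)
}

page_megabytes(600)  # about 134.6 MB per page
page_megabytes(150)  # about 8.4 MB per page
```

Under those assumptions, dropping from 600 to 150 DPI shrinks each page by a factor of 16, which may be what brings the allocation back within reach.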

  dest <- "C:\\users\\YOURNAME\\desktop"

  # Render pages 1-10 of each PDF to PPM at 150 DPI, using the PDF's own name as the output root
  files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
    lapply(files, function(i){
      shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 150 ", i,".pdf", " ",i)))
      })

  # Convert each PPM to TIFF with ImageMagick, then delete the PPM
  myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.ppm$", full.names = TRUE))
    lapply(myppms, function(y){
      shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
      file.remove(paste0(y,".ppm"))
      })

  # OCR each TIFF (Tesseract writes its text to <name>.txt), then delete the TIFF
  mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.tif$", full.names = TRUE))
    lapply(mytiffs, function(z){
      shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
      file.remove(paste0(z,".tif"))
      })
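
One caveat with the code above is that a path containing spaces would be split into several arguments. A minimal sketch of a workaround is to quote each path separately instead of quoting the whole command. The helper name build_pdftoppm_cmd and the example file name are hypothetical, and I have not tested this against every Windows shell setup.

```r
# Build (but do not run) the pdftoppm command line for one PDF,
# quoting the input path and the output root individually.
# build_pdftoppm_cmd is a hypothetical helper, not part of any package.
build_pdftoppm_cmd <- function(pdf, dpi = 150, first = 1, last = 10) {
  root <- tools::file_path_sans_ext(pdf)  # output root mirrors the PDF name
  paste("pdftoppm", "-f", first, "-l", last, "-r", dpi,
        shQuote(pdf, type = "cmd"), shQuote(root, type = "cmd"))
}

cmd <- build_pdftoppm_cmd("C:/users/YOURNAME/desktop/my book.pdf")
print(cmd)
# To actually run it on Windows: shell(cmd)
```

The same pattern should carry over to the magick and tesseract calls, since only the program name and arguments change.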
answered 2017-11-06T19:42:34.103