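I am trying to use the tesseract and pdftools packages to convert a series of scanned PDFs into searchable PDFs. I have completed the first two steps below. Now I need to write the result back as a searchable PDF.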

  1. Read the scanned PDF
  2. Run OCR
  3. Write back a searchable PDF

eg <- download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", "example.pdf", mode = "wb")

results <- tesseract::ocr_data("example.pdf", engine = "eng")
R> results
# A tibble: 406 x 3
   word        confidence bbox             
   <chr>            <dbl> <chr>            
 1 PFU               96.9 228,181,404,249  
 2 Business          96.2 459,180,847,249  
 3 report            96.2 895,182,1145,259 
 4 |                 52.5 3980,215,3984,222
 5 No.068            91.0 4439,163,4754,237
 6 New               96.0 493,503,1005,687 
 7 customer's        94.6 1069,484,2231,683
 8 development       96.5 2304,483,3714,732
 9 di                90.4 767,763,1009,959 
10 ing               96.3 1754,773,1786,807
# ... with 396 more rows

Alternatively, is there another package or a command-line tool that I can call from R on Windows?
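
One possible approach for step 3, offered as a rough sketch rather than a confirmed solution: the Tesseract command-line program can emit a PDF with an invisible text layer when `pdf` is passed as its output config, so R can render each page to PNG with `pdftools::pdf_convert()`, call the CLI through `system2()`, and merge the per-page results with `pdftools::pdf_combine()`. The helper names `tesseract_pdf_args()` and `make_searchable_pdf()` are hypothetical, and the sketch assumes the `tesseract` executable is installed separately and is on the `PATH`.

```r
# Hypothetical helper: argument vector for
#   tesseract <image> <outputbase> -l <lang> pdf
# The trailing "pdf" config asks the CLI to write <outputbase>.pdf.
tesseract_pdf_args <- function(image, out_base, lang = "eng") {
  c(image, out_base, "-l", lang, "pdf")
}

# Hypothetical helper: scanned PDF in, searchable PDF out.
# Assumes the tesseract CLI is on the PATH and pdftools is installed.
make_searchable_pdf <- function(input_pdf, output_pdf, lang = "eng", dpi = 300) {
  stopifnot(file.exists(input_pdf), nzchar(Sys.which("tesseract")))

  work_dir <- tempfile("ocr_")
  dir.create(work_dir)

  # 1. Render every page of the scanned PDF to a PNG image.
  n_pages <- pdftools::pdf_info(input_pdf)$pages
  pngs <- file.path(work_dir, sprintf("page_%04d.png", seq_len(n_pages)))
  pdftools::pdf_convert(input_pdf, format = "png", dpi = dpi, filenames = pngs)

  # 2. Run the CLI once per page; each call writes <base>.pdf.
  bases <- sub("\\.png$", "", pngs)
  for (i in seq_along(pngs)) {
    args <- shQuote(tesseract_pdf_args(pngs[i], bases[i], lang))
    status <- system2("tesseract", args)
    if (status != 0) stop("tesseract failed on ", pngs[i])
  }

  # 3. Merge the single-page searchable PDFs into one file.
  pdftools::pdf_combine(paste0(bases, ".pdf"), output = output_pdf)
}
```

With the file from the question, this would be called as `make_searchable_pdf("example.pdf", "example_searchable.pdf")`. A higher `dpi` generally helps recognition at the cost of larger intermediate images.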
