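I have to extract data from a PDF that is published at a URL. The PDF is a scanned image (a PNG wrapped in a PDF) rather than text, so I am running OCR on it with the tesseract package, but a few lines are not recognized.

Code: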

library(rvest)
library(dplyr)
library(stringr)   # provides str_replace() and fixed()
library(pdftools)
library(tesseract)

url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the table and take its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>% html_attr("href") %>%
  # strip the leading "../.." literally (fixed() stops "." acting as a regex wildcard)
  str_replace(fixed("../.."), "")
str(links)

#using pdftools to read the pdf
base_url <- 'https://www.hindustancopper.com'
# combine the base url with the event url
event_url <- paste0(base_url, links)
event_url

# the link points to a scanned copy, not a text PDF, so render the page to PNG and OCR it with tesseract
pdf_convert(event_url,
            pages = 1,
            dpi = 850,
            filenames = "page1.png")
# what does the data look like?
text <- ocr("page1.png")
cat(text)

The actual output reads the product list and prices as:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567 
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.

The expected output is:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc

I have tried changing the value of the dpi parameter several times, but it did not help much. Thanks in advance!

1 Answer

I used Ubuntu 18.04 and tesseract 5.0.0-alpha-647-g4a00 for the commands below.

I downloaded one of the sample PDFs that your code points to:

https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf

Then I converted it to PNG with this command:

pdftoppm 0-637189269505122500-AnnualReport.pdf report.png -png

Then, using GIMP, I rotated the document so that the text lines are level.

Then I ran this tesseract command to OCR the document:

tesseract report.png stdout -l eng --oem 3 --psm 6 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-/ "
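
For readers who want to reuse that invocation, here is a minimal sketch that wraps it in a shell function. The name `ocr_page` and the `WHITELIST` variable are made up for illustration; the flags are exactly those of the command above. `--oem 3` selects the default OCR engine mode, `--psm 6` tells tesseract to treat the page as a single uniform block of text, and `tessedit_char_whitelist` restricts the characters it may emit.

```shell
# Characters the OCR engine is allowed to output (copied from the command above).
WHITELIST='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-/ '

# ocr_page IMAGE
# Hypothetical helper: runs tesseract on IMAGE and prints the recognised text to stdout.
ocr_page() {
  tesseract "$1" stdout \
    -l eng \
    --oem 3 \
    --psm 6 \
    -c "tessedit_char_whitelist=$WHITELIST"
}
```

With the rotated image saved as report.png, calling `ocr_page report.png > report.txt` would store the text in a file. Whether the whitelist is honoured can depend on the tesseract version and engine, so treat it as a hint rather than a guarantee.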

The OCR output for the rotated page was:

HINDUSTAN COPPER LIMITED
A GOVT. OF INDIA ENTERPRISE
kK
Registered Head Office
Tamra Bhavan
1 Ashutosh Chowdhury Avenue
Kolkata - 700019
Ref: HCL/HO/MKTG/Cu-P/ 2019-2020
Date : 02-MAR-20
Sub: Basic Price of Cathodes and CC Rods for the month of MAR 2020.
The Basic Price of Copper Cathodes and CC Copper Rods for the month of MAR 2020 are as follows:
Basic Price Ex-Works /
Ex.Godown basis Rs. / MT
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
COPPER CATHODE CUT 437856
CONTINUOUS CAST COPPER WIRE ROD 8 MM 440078
CONTINUOUS CAST COPPER WIRE ROD 19.6 MM 444546
CONTINUOUS CAST COPPER WIRE ROD 12.5 MM 441567
Note : Monthly LME CSP Avg. : 5686.45 Monthly Avg. Exchange Rate : 71.59
The price ruling on the date of delivery will be applicable. irrespective of the date of making financial arrangements i.e.
advance payment/opening of letter of credit. GST other statutory levies will be extra as applicable.
For purchase against usance Letter of Credit the interest rate chargeable shall be 10 per annum for the credit
period up to 90/60/30 days.
Customers may note that the price and interest rate is subject to change without prior notice. The price and interest rate
ruling on the date of delivery will be applicable irrespective of the date of their making financial arrangements. All bank
charges of negotiating bank will be borne by us.
ADD YAS
Zl Bl rTeri68
S Parashar
DGM Commercial
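
Since the question is only after the product and price rows, the OCR text still has to be filtered. The following is a rough sketch of one way to do that with sed. The function name `extract_prices` is hypothetical, and it assumes that every price in the circular has exactly six digits and that product names consist of capital letters, digits, dots and spaces, which holds for the sample output above but may not hold for other circulars.

```shell
# extract_prices
# Hypothetical filter: reads OCR text on stdin and prints "product,price" lines.
# Assumption: a price row is an upper-case product name followed by a 6-digit number.
extract_prices() {
  sed -nE 's/^([A-Z][A-Z0-9. ]*[A-Z0-9]) ([0-9]{6})[[:space:]]*$/\1,\2/p'
}
```

Piping the tesseract output through `extract_prices` would leave only the rows of the price table. A row whose price was misread with a digit missing (such as the 44567 in the question) does not match and is dropped, so counting the rows that remain is a cheap check that the OCR step worked.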
Answered 2020-04-12T13:14:29.000