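I need to extract data from a PDF that is published at a URL. The PDF is a scanned image (effectively a .png) rather than text, so when I run it through the tesseract package, a few lines are not recognized.
Code: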
library(rvest)
library(dplyr)
library(stringr)   # provides str_replace(), used below
library(pdftools)
library(tesseract)
url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the table and extract its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>%
  html_attr("href") %>%
  # strip the leading "../.." literally (fixed() stops "." being treated as a regex wildcard)
  str_replace(fixed("../.."), "")
str(links)
# base URL of the site; the scraped href is relative to it
base_url <- "https://www.hindustancopper.com"
# combine the base URL with the relative link to get the full PDF URL
event_url <- paste0(base_url, links)
event_url
# the PDF is a scanned image with no text layer, so render page 1 to a PNG with pdftools
pdf_convert(event_url,
            pages = 1,
            dpi = 850,
            filenames = "page1.png")
# run tesseract OCR on the rendered page and print the recognized text
text <- ocr("page1.png")
cat(text)
The actual output reads the list of products and their prices as:
CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.
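The expected output is shown below. Compared with it, the actual output drops a digit from the first price (44567 instead of 441567) and omits the CATHODE FULL line entirely: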
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc
I have tried changing the value of the dpi argument several times, but it did not help much. Thanks in advance!