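I'm really struggling to extract the right information (specifically some dates and numbers) from several thousand NTSB PDF files. The PDFs do not need to be OCRed, and every report has almost the same length and layout.
I need to extract the date and time of the accident (first page) plus some other information, such as the pilot's age or flight experience. What I've tried works for a few files but not for all of them, because my code is poorly written.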
# an example with a single file
library(pdftools)
library(readr)
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- file.path(getwd(), "example.pdf")
download.file(file, destfile, mode = "wb") # binary mode, otherwise the PDF can be corrupted on Windows
pdf <- pdf_text(destfile)
rows <- scan(textConnection(pdf),
             what = "character", sep = "\n")
# Extract the date of the accident based on the 'Date & Time' occurrence.
date <- rows[grep(pattern = 'Date & Time', x = rows, ignore.case = TRUE)]
date <- strsplit(date, " ")
date[[1]][9] # this method is not desirable since the date will not always be in that position
# Pilot age
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = FALSE)]
age <- strsplit(age, split = ' ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age) # convert the extracted token to a number
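The main problem I run into is extracting the date and time of the accident. Is it possible to extract the exact information without relying on list positions as I've done here?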
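To make the goal concrete, here is a minimal sketch of a label-based lookup instead of a position-based one. It assumes that `pdf_text()` keeps the two-column layout, so that a value is separated from the next field on the same line by two or more spaces. That is an assumption about these reports, not something verified across all of them. `extract_field` is a hypothetical helper name, and the `rows` below are made-up sample lines, not text from a real report.

```r
# Hypothetical helper: return the text that follows `label` on the first
# matching line, cut off at the next column gap (two or more spaces).
extract_field <- function(rows, label) {
  hit <- grep(label, rows, fixed = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Drop everything up to and including the label and an optional colon.
  # \\Q...\\E makes PCRE treat the label as literal text.
  after <- sub(paste0("^.*?\\Q", label, "\\E:?\\s*"), "", hit[1], perl = TRUE)
  # Keep only the text before the next run of two or more spaces.
  trimws(strsplit(after, " {2,}")[[1]][1])
}

# Made-up lines that imitate the assumed layout (not real report data).
rows <- c(
  "Location:       Hemet, California            Accident Number:   WPR15LA001",
  "Date & Time:    October 1, 2014, 14:30 Local      Registration:   N123AB",
  "Certificate:    Private                      Age:    56"
)

date_time <- extract_field(rows, "Date & Time")
age <- as.integer(extract_field(rows, "Age"))
```

With this approach the result no longer depends on how many space-separated tokens precede the value. It would still fail if a value itself contains a double space, or if the field is empty on that line.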