r - R: Extracting dates and numbers from PDFs

Question

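I am really struggling to extract the right information (specifically some dates and numbers) from several thousand NTSB PDF files. These PDFs do not need to be OCRed, and the reports are almost identical in length and layout.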

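I need to extract the date and time of the accident (on the first page) and some other information, such as the pilot's age or flight experience. What I have tried works for a few files but not for all of them, because the code I am using is poorly written.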

# an example with a single file
library(pdftools)
library(readr)

# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(),"/example.pdf")
download.file(file, destfile)

pdf <- pdf_text(destfile)
rows <-scan(textConnection(pdf), 
            what="character", sep = "\n")

# Extract the date of the accident based on the 'Date & Time' occurrence.
date <-rows[grep(pattern = 'Date & Time', x = rows, ignore.case = T, value = F)]
date <- strsplit(date, "  ")
date[[1]][9] #this method is not desirable since the date will not be always in that position

# Pilot age 
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = F, value = F)]
age <- strsplit(age, split = '  ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age) #

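My main problem is extracting the date and time of the accident. Is it possible to extract the exact information without relying on list positions, as I have done here?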

score 1 · Accepted Answer

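I think the best way to achieve what you want is to use a regex. In this case I use the stringr library. The main idea of a regex is to find the desired string pattern, which here is the date 'July 29, 2014, 11:15'.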

Keep in mind that you will have to check the date format in each PDF file.
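One way to check the format of an extracted string is to try parsing it and flag the files where parsing fails. The sketch below assumes the 'Month day, year, hh:mm' layout of the example date above; the function name `parse_ntsb_datetime` is hypothetical, and `tz = "UTC"` is only a placeholder, since the time zone used in the reports is not known here.

```r
# Parse strings like 'July 29, 2014, 11:15'; return NA for any other format.
# Hypothetical helper: the name and the UTC time zone are assumptions.
parse_ntsb_datetime <- function(x) {
  pattern <- "^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4}), ([0-9]{1,2}):([0-9]{2})$"
  parts <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(parts) == 0) return(as.POSIXct(NA))
  # month.name holds the English month names, which avoids locale-dependent %B
  month <- match(parts[2], month.name)
  ISOdatetime(as.integer(parts[4]), month, as.integer(parts[3]),
              as.integer(parts[5]), as.integer(parts[6]), 0, tz = "UTC")
}

ok  <- parse_ntsb_datetime("July 29, 2014, 11:15")
bad <- parse_ntsb_datetime("29/07/2014 11:15") # different layout, so NA
```

Files whose extracted string comes back as `NA` are the ones whose date format, or whose regex pattern, needs a closer look.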

library(pdftools)
library(readr)
library(stringr)

# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)

pdf <- pdf_text(destfile)

## New code

# Regex pattern for a date such as 'July 29, 2014, 11:15'
# [Tt] matches 'T' or 't' (the form [T|t] would also match a literal '|')
regex_pattern <- "[Tt]ime:(.*\\d{2}:\\d{2})"

# Getting date from page 1
grouped_matched <- str_match_all(pdf[1], regex_pattern)

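# This returns a list with one matrix per input string: one row per match,
# column 1 is the full match and column 2 is the captured group.
# Indexing with [1, 2] assumes the pattern matched at least once.
raw_date <- grouped_matched[[1]][1, 2] # First match, captured group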
# Clean date
date <- trimws(raw_date)


# Using dplyr
library(dplyr)

date <- pdf[1] %>%
            str_match_all(regex_pattern) %>%
            .[[1]] %>%  # First list element: the matrix of matches
            .[1, 2] %>% # First match, captured group
            trimws()    # Remove surrounding whitespace

You can write a function that extracts the date and lets you change the regex pattern for files with a different layout.
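A minimal sketch of such a function, using base R only so that it runs without downloading a PDF. The helper name `extract_first_group`, the sample page text, and the age pattern are assumptions for illustration; the real reports may lay these fields out differently, so the patterns should be checked against actual `pdf_text()` output.

```r
# Return the first capture group of the first match of `pattern` in `text`,
# or NA if nothing matches. (Hypothetical helper, not part of any package.)
extract_first_group <- function(text, pattern) {
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) < 2) return(NA_character_)
  trimws(m[2])
}

# Made-up text standing in for pdf[1]; the real layout may differ.
sample_page <- paste(
  "Location:      Somewhere          Aircraft:   Example",
  "Date & Time:   July 29, 2014, 11:15 Local      Injuries:  None",
  "Certificate:   Private            Age:   56",
  sep = "\n"
)

# Lazy .*? stops at the first hh:mm after 'Time:' instead of the last one
date_pattern <- "[Tt]ime:\\s*(.*?\\d{1,2}:\\d{2})"
age_pattern  <- "Age:\\s*(\\d+)" # assumed label for the pilot's age

accident_date <- extract_first_group(sample_page, date_pattern)
pilot_age     <- as.integer(extract_first_group(sample_page, age_pattern))
```

With real files, `sample_page` would be replaced by `pdf_text(destfile)[1]` (or the page that holds the field). Because the helper returns `NA` instead of failing, it can be run over many files and the misses inspected afterwards.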

Regards
