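I'm trying to scrape a rather difficult PDF in R using both `pdftools::pdf_text` and `tabulizer::extract_tables`. However, given the nature of this PDF, neither seems to help much. The PDF contains "nested" information, as shown in the image.
What is the best way to approach this? Using `stringr::str_split_fixed` with `n=3` to split on whitespace gives me a matrix, but it seems difficult to write a regex that detects the information I want in each column (only what follows the description and the incident date/time).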
I don't think the regex approach is that complicated:
```r
library(pdftools)
library(tidyverse)
library(magrittr)  # for add()

mylog <- "https://www.lsu.edu/police/files/crime-log/2021/jan2021.pdf"
pdf.text <- pdf_text(mylog)  # character vector, one element per page

map_dfr(pdf.text, ~ {
  # Split the page text into individual lines
  vectors <- str_split(.x, "\\n") %>% unlist()
  # Indices of the lines immediately after those starting with "Case", "Desc", "Addr"
  cases        <- vectors %>% str_detect("^Case") %>% which() %>% add(1)
  descriptions <- vectors %>% str_detect("^Desc") %>% which() %>% add(1)
  addresses    <- vectors %>% str_detect("^Addr") %>% which() %>% add(1)
  # Split on 2+ whitespace, a whitespace before a m/ or mm/ date, or whitespace after AM/PM
  cases <- vectors[cases] %>%
    str_split("(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)") %>%
    map_dfr(~ setNames(., c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")[seq_along(.)]))
  # Description lines only need splitting on runs of 2+ whitespace
  descriptions <- vectors[descriptions] %>%
    str_split("\\s{2,}") %>%
    map_dfr(~ setNames(., c("Description", "Date.Incident.End")[seq_along(.)]))
  bind_cols(cases, descriptions, data.frame(Address = vectors[addresses]))
})
```
```
# A tibble: 155 x 7
   Case.Number  Date.Report     Date.Incident     Case.Status Description                   Date.Incident.End   Address
   <chr>        <chr>           <chr>             <chr>       <chr>                         <chr>               <chr>
 1 20210101-001 January 01, 20… 1/1/2021 10:28:0… Inactive    COMPLAINT ANIMAL              1/1/2021 10:28:00AM UREC FIELDS
 2 20210101-002 January 01, 20… 1/1/2021 2:48:00… Inactive    911 HNGUP/OP - 911 HANG-UP/O… 1/1/2021 2:48:00PM  PMAC
 3 20210101-003 January 01, 20… 1/1/2021 3:27:00… Pending     UNAUTHORIZED ENTRY OF A PLAC… 1/1/2021 3:27:00PM  COMPANION ANIMAL AL…
 4 20210102-001 January 02, 20… 1/2/2021 5:12:00… Inactive    SUSPICIOUS INCIDENT           1/2/2021 5:12:00PM  TIGER STADIUM
 5 20210103-001 January 03, 20… 12/23/2020 12:00… Pending     HIT AND RUN                   1/3/2021 9:15:00AM  BROUSSARD HALL TRAF…
 6 20210103-002 January 03, 20… 1/3/2021 9:28:46… Inactive    DISTURBANCE                   1/3/2021 9:28:00PM  VET SCHOOL
 7 20210104-001 January 04, 20… 11/23/2018 11:00… Inactive    NONCRIMINAL INFORMATION ONLY  11/23/2018 11:00:0… Oaks Lot
 8 20210104-002 January 04, 20… 1/4/2021 7:26:00… Inactive    SUSPICIOUS INCIDENT           1/4/2021 7:26:00AM  ECE
 9 20210104-003 January 04, 20… 8/1/2017 12:00:0… Pending     INVESTIGATN - INVESTIGATION   1/2/2021 3:00:00PM  EAST CAMPUS APARTME…
10 20210104-004 January 04, 20… 1/4/2021 12:30:0… Pending     HIT AND RUN                   1/4/2021 12:30:00PM HIGHLAND ROAD @ STU…
# … with 145 more rows
```
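To see why a single pattern is enough for the case lines, it can help to try the split on one line in isolation. The snippet below is only a sketch: the `line` value is a made-up example whose spacing is an assumption (the real text returned by `pdf_text` may be spaced differently), and the names `line`, `pattern` and `fields` are introduced here purely for illustration.

```r
library(stringr)

# Hypothetical sample line; the spacing in the real PDF text may differ
line <- "20210101-001   January 01, 2021 1/1/2021 10:28:00AM Inactive"

# Same split pattern as used on the case lines:
#   \\s{2,}               a run of two or more whitespace characters
#   \\s(?=[0-9]{1,2}/)    one whitespace followed by a m/ or mm/ date start
#   (?<=[AP]M)\\s+        whitespace directly preceded by AM or PM
pattern <- "(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)"

fields <- str_split(line, pattern)[[1]]

# Name only as many fields as were actually produced
col_names <- c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")
fields <- setNames(fields, col_names[seq_along(fields)])
print(fields)
```

The lookahead keeps the single space inside `January 01, 2021` intact, because `01` and `2021` are not followed by a slash, while the lookbehind separates the status from the time even when only one space follows `AM` or `PM`.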