
I am trying to create a data frame from the following PDF:

library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)

However, when I print tab1, it has only one column:

      [,1]                                                                     
 [1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
 [2,] "AS OF JUNE 29, 2020 AT 3:00 PM"                                         
 [3,] "POSITIVE CASE STATUS OTHER TESTS"                                       
 [4,] "TOTAL"                                                                  
 [5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"                  
 [6,] "TOTAL 495 16 519 97 805"                                                
 [7,] "ADIRONDACK 0 0 0 75 0"                                                  
 [8,] "ALBION 0 0 0 0 2"                                                       
 [9,] "ALTONA 0 0 0 0 1"  

                                                 

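I want to extract what should be the individual columns to build a data frame (for example, for row 7, I would extract its contents into the following columns: Facility ("ADIRONDACK"), Recovered (0), Deceased (0), Positive (0), Pending (75), Negative (0)). I thought the most efficient approach would be to split tab1 on spaces, but that doesn't work because some facility names contain more than one word, so splitting on spaces breaks them apart. Does anyone have an idea for a solution? Thanks for your help!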


2 Answers


Here is how I would approach this, using the "lattice" method for extracting tables from the tabulizer package.

#install.packages("tidyverse")
library(tidyverse)
#install.packages("janitor")
library(janitor)
#install.packages("tabulizer")
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- tabulizer::extract_tables(url, method = "lattice") %>% 
  as.data.frame() %>%
  dplyr::slice(-1,-2) %>% 
  janitor::row_to_names(row_number = 1)
answered 2020-07-01T00:10:38.680

Here is a workaround:

library(tabulizer)

url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)

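# Keep the data rows (row 6 onward); rows 1-5 are title and header lines
plouf <- tab1[[1]][6:nrow(tab1[[1]]), ]
# Join multi-word facility names with "_" so that a space only separates columns
plouf <- gsub("(?<=[A-Z]) (?=[A-Z])", "_", plouf, perl = TRUE)
# Parse the space-delimited lines, one record per line
df <- read.table(text = paste0(plouf, collapse = "\n"), sep = " ")
# Take the column names from the header line (row 5)
names(df) <- strsplit(tab1[[1]][5, ], " ")[[1]]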

           FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE
1             TOTAL       495       16      519      97      805
2        ADIRONDACK         0        0        0      75        0
3            ALBION         0        0        0       0        2
4            ALTONA         0        0        0       0        1
5            ATTICA         2        0        2       1        7
6            AUBURN         0        0        0       0       10
7         BARE_HILL         0        0        0       0        6
8     BEDFORD_HILLS        43        1       44       5       53
9      CAPE_VINCENT         0        0        0       0        0
10           CAYUGA         0        0        0       2        1
11          CLINTON         1        0        1       0       25
12          COLLINS         1        0        1       0       13
13        COXSACKIE         1        0        1       0       57
14        DOWNSTATE         1        0        1       0       12
15          EASTERN        17        1       20       0       17
16        EDGECOMBE         0        0        0       0        0
17           ELMIRA         0        0        0       1       20
18         FISHKILL        78        5       83       4       98
19      FIVE_POINTS         0        0        0       0        4
20         FRANKLIN         1        0        1       0       24

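I take the part of the table after the header, then remove the spaces inside the FACILITY names with gsub (I actually replace them with _, so you can change them back to spaces afterwards if you want. You could also use str_replace from stringr instead of gsub).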
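As a small illustration of the replacement step, here is a sketch that runs on a few sample lines copied from the printed output instead of on the PDF. The name sample_rows is hypothetical, and the lookaround pattern is an assumption on my part, chosen so that names with three or more words would also be joined.

```r
# Sample rows copied from the extracted output (not read from the PDF)
sample_rows <- c("ADIRONDACK 0 0 0 75 0",
                 "BARE HILL 0 0 0 0 6",
                 "BEDFORD HILLS 43 1 44 5 53")

# Replace only the spaces that sit between two capital letters
joined <- gsub("(?<=[A-Z]) (?=[A-Z])", "_", sample_rows, perl = TRUE)

# Later on, the underscores can be turned back into spaces
restored <- gsub("_", " ", joined, fixed = TRUE)

print(joined)
```

Spaces followed by a digit are left alone, so they still work as column separators.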

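Then I use read.table, forcing each line of the text to end with a line break. I add the names afterwards (because otherwise they would be altered by gsub and read.table would not read them correctly).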
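The read.table step can be tried without downloading the PDF. The sketch below assumes the rows are already in the joined form; rows_txt and header are hypothetical stand-ins for plouf and for row 5 of tab1[[1]].

```r
# Sample lines in the already-joined form (stand-in for plouf)
rows_txt <- c("TOTAL 495 16 519 97 805",
              "ADIRONDACK 0 0 0 75 0",
              "BARE_HILL 0 0 0 0 6")
# Stand-in for the header line, tab1[[1]][5, ]
header <- "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"

# One record per line; read.table parses the string as if it were a file
df <- read.table(text = paste(rows_txt, collapse = "\n"), sep = " ",
                 stringsAsFactors = FALSE)

# Assign the column names afterwards, from the header line
names(df) <- strsplit(header, " ")[[1]]

# Optional: restore the spaces in the facility names
df$FACILITY <- gsub("_", " ", df$FACILITY, fixed = TRUE)

str(df)
```

The count columns come back as numbers and FACILITY as character, so nothing needs converting before summarising.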

answered 2020-06-30T22:07:42.663