I am trying to create a data frame from the following PDF:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
However, when I print tab1, it has only one column:
[,1]
[1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
[2,] "AS OF JUNE 29, 2020 AT 3:00 PM"
[3,] "POSITIVE CASE STATUS OTHER TESTS"
[4,] "TOTAL"
[5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"
[6,] "TOTAL 495 16 519 97 805"
[7,] "ADIRONDACK 0 0 0 75 0"
[8,] "ALBION 0 0 0 0 2"
[9,] "ALTONA 0 0 0 0 1"
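I want to extract the individual columns that should make up the data frame. For example, for row 7 I would split the contents into the following columns: facility ("ADIRONDACK"), recovered (0), deceased (0), positive (0), pending (75), negative (0). I thought the most efficient approach would be to split tab1 on whitespace, but that does not work because some facility names contain more than one word, so splitting on spaces breaks those rows. Does anyone have an idea for a solution? Thanks for your help!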
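One possible direction, as a rough sketch rather than a tested solution for this PDF: since every data row ends in exactly five numbers, a regular expression can peel those five numbers off the end and treat whatever comes before them as the facility name, however many words it has. The sketch below uses base R only. It assumes the rows are available as a character vector (for instance something like tab1[[1]][, 1], which is an assumption about the shape of the result), and the "EXAMPLE HILLS" row is a hypothetical multi-word facility with made-up numbers, added only to show the behaviour.

```r
# Rows as they appear in the single-column output above.
# In practice this might be rows <- tab1[[1]][, 1] (assumption).
rows <- c(
  "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE",
  "TOTAL 495 16 519 97 805",
  "ADIRONDACK 0 0 0 75 0",
  "ALBION 0 0 0 0 2",
  "ALTONA 0 0 0 0 1",
  "EXAMPLE HILLS 1 0 1 2 3"  # hypothetical row, made-up numbers
)

# Facility name (any text), followed by exactly five integers at the end of the line.
pattern <- "^(.+?)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)$"

# Keep only rows that match; title and header lines do not end in five numbers.
data_rows <- grep(pattern, rows, value = TRUE, perl = TRUE)

# For each row, regmatches() returns the full match followed by the six capture groups.
parts <- regmatches(data_rows, regexec(pattern, data_rows, perl = TRUE))
mat <- do.call(rbind, parts)

# Column 1 of mat is the full match, so the captured fields start at column 2.
df <- data.frame(
  facility  = mat[, 2],
  recovered = as.integer(mat[, 3]),
  deceased  = as.integer(mat[, 4]),
  positive  = as.integer(mat[, 5]),
  pending   = as.integer(mat[, 6]),
  negative  = as.integer(mat[, 7]),
  stringsAsFactors = FALSE
)

print(df)
```

The "TOTAL" summary row also matches the pattern, so it ends up in df as if it were a facility; it can be dropped afterwards with df[df$facility != "TOTAL", ] if it is not wanted. Rows that do not end in five whole numbers (for example, if the PDF uses thousands separators or blanks) would be silently skipped by this pattern, so it is worth comparing length(data_rows) against the expected number of facilities.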