r - Extract PDF data with varying whitespace as the separator

Question

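I'm trying to pull data out of this PDF.

I'm running into a problem where place names made up of more than one word (e.g. "North Island") get split across different columns.

The "sep" argument of "read.table" only seems to treat a single space as the separator. Ideally, I'd like any run of more than one space to count as the separator. Is this possible?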


# pdf_text() comes from the pdftools package
library(pdftools)

url <- "C:/Users/files/PSSS Weekly Bulletin - W1 2019 (Dec 31-Jan 06).pdf"

# Convert the PDF to a text string
txt <- pdf_text(url)

# get the working directory
wd <- getwd()

# Write the file to the working directory
file_name <- paste0(wd, "/", "temp.txt")
write(txt, file = file_name, sep = "\t")

# Convert to a table. Data is located starting line 25, and lasts 25 lines
# P.S: I've tried this code with and without the "sep" argument. No change. 
dtaPCF <- read.table(file_name, skip = 24, nrows = 25, fill = TRUE, header = TRUE)

# Here is the text that I'd like to read.table with. Ideally, I'd want to keep the headers, but it's not a dealbreaker if that doesn't work.


Country/Area      No. sites  No. reported  % reported  AFR  Diarrhoea  ILI  PF  DLI

American Samoa   0          0             0%          0    0          0    0   0

Cook Islands     13         11            85%         0    3          3    0   0

FSM              4          3             75%         0    21         74   0   3

Fiji             0          0             0%          0    0          0    0   0

French Polynesia 31         16            52%         3    9          11   3   3

Guam             0          0             0%          0    0          0    0   0

Kiribati         7          7             100%        0    172        609  22  0

Marshall Islands 2          2             100%        0    4          0    2   0

N Mariana Is     7          7             100%        4    13         60   17  0

Nauru            0          0             0%          0    0          0    0   0

New Caledonia    0          0             0%          0    0          0    0   0

New Zealand      0          0             0%          0    0          0    0   0

Niue             0          0             0%          0    0          0    0   0

PNG              0          0             0%          0    0          0    0   0

Palau            0          0             0%          0    0          0    0   0

Pitcairn Islands 1          1             100%        0    0          0    0   0

Samoa            13         6             46%         0    262        606  18  4

Solomon Islands  13         4             31%         0    75         59   4   1

Tokelau          2          2             100%        0    2          9    0   0

Tonga            11         11            100%        0    17         73   0   0

Tuvalu           0          0             0%          0    0          0    0   0

Vanuatu          11         7             64%         0    49         171  0   1

Wallis & Futuna  0          0             0%          0    0          0    0   0

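A note on the separator question, as a rough sketch rather than a definitive fix. With its default `sep = ""`, `read.table` already treats any run of whitespace as one separator, so the place names are split because of the single spaces inside them. Splitting on two or more spaces would not be enough either, because rows such as "French Polynesia 31" have only one space between the name and the first number. One possible workaround, assuming that no place name contains a digit, is to cut each line at its first digit. The four sample rows and the name `place` are illustrative only.

```r
# Four rows copied from the table above (sample input for this sketch)
lines <- c(
  "American Samoa   0          0             0%          0    0          0    0   0",
  "French Polynesia 31         16            52%         3    9          11   3   3",
  "N Mariana Is     7          7             100%        4    13         60   17  0",
  "Wallis & Futuna  0          0             0%          0    0          0    0   0"
)

# Assumption: a place name never contains a digit, so the name is
# everything that comes before the first digit on the line.
place <- trimws(sub("^([^0-9]*).*$", "\\1", lines))

# The remainder of each line is purely whitespace-separated values,
# which read.table handles with its default separator.
rest <- sub("^[^0-9]*", "", lines)
values <- read.table(text = rest, header = FALSE, stringsAsFactors = FALSE)

dtaPCF <- data.frame(place, values, stringsAsFactors = FALSE)
names(dtaPCF) <- c("Country/Area", "No. sites", "No. reported",
                   "% reported", "AFR", "Diarrhoea", "ILI", "PF", "DLI")

# Optional: turn "52%" into the number 52
dtaPCF[["% reported"]] <- as.numeric(sub("%", "", dtaPCF[["% reported"]], fixed = TRUE))
```

With the real file, `lines` would instead come from `readLines(file_name)` restricted to the rows that hold the table.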
score 0 · Accepted Answer

This is the code I ended up using. I used Notepad to check the maximum character length of each column and passed those lengths to fwf_widths().

library(readr)

# Column widths were measured by hand; column names follow the table header
dtaPCF <- read_fwf(file_name,
                   col_positions = fwf_widths(
                     c(17, 11, 14, 12, 5, 11, 5, 4, 1),
                     c("Country/Area", "No. sites", "No. reported",
                       "% reported", "AFR", "Diarrhoea", "ILI", "PF", "DLI")),
                   skip = 47,
                   n_max = 23,
                   trim_ws = TRUE)

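The widths measured by hand in Notepad could probably also be derived in code. The sketch below is one possible base R approach, not the method used above. It treats a character position as a gap when it is blank in every row, and starts a new column wherever a non-blank position follows a gap. It assumes the rows are aligned in fixed columns and that no position inside the place names happens to be blank in every row. The four sample rows are copied from the question; the names `blank`, `starts` and `widths` are illustrative.

```r
# Four rows copied from the table in the question (sample input)
lines <- c(
  "American Samoa   0          0             0%          0    0          0    0   0",
  "French Polynesia 31         16            52%         3    9          11   3   3",
  "N Mariana Is     7          7             100%        4    13         60   17  0",
  "Wallis & Futuna  0          0             0%          0    0          0    0   0"
)

# Pad every line to the same length and split it into single characters,
# giving a matrix with one row per line and one column per position.
n <- max(nchar(lines))
padded <- formatC(lines, width = n, flag = "-")
chars <- do.call(rbind, strsplit(padded, ""))

# A position is a gap if it is blank in every row.
blank <- colSums(chars != " ") == 0

# A column starts at a non-blank position that follows a gap
# (or at the very first position).
starts <- which(!blank & c(TRUE, head(blank, -1)))
widths <- diff(c(starts, n + 1))

# Read the same lines as fixed-width fields using the derived widths.
con <- textConnection(lines)
dta <- read.fwf(con, widths = widths, strip.white = TRUE,
                stringsAsFactors = FALSE)
close(con)
```

The resulting `widths` vector can be compared against the hand-measured values before passing it to `fwf_widths()`.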