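I would like to split the information in a PDF document based on the presence of a colon. A sample is here.
I am trying the following. After reading the PDF, I try to split its text on the colons.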
library(textreadr)
dat <- '~Here is the thing1.pdf' %>%
textreadr::read_pdf()
dat
Source: local data frame [26 x 3]
page_id element_id text
1 1 1 Here is the thing.
2 1 2 Case ID 1
3 1 3 Exploring Angels: It is a long establish
4 1 4 page when looking at its layout. The poi
5 1 5 distribution of letters, as opposed to u
6 1 6 English. Many desktop publishing package
7 1 7 model text, and a search for 'lorem ipsu
8 1 8 versions have evolved over the years, so
9 1 9 and the like).
10 1 10 New agency: Lorem Ipsum is simply dummy
.. ... ... ...
Or:
library(pdftools)
dat <- pdf_text("~Here is the thing1.pdf")
dat1 <- strsplit(dat[[1]], "\n")[[1]]
head(dat1)
[1] "Here is the thing.\r"
[2] "Case ID 1\r"
[3] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a\r"
[4] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal\r"
[5] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable\r"
[6] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default\r"
library(stringr)  # provides str_split() and re-exports the %>% pipe
dat2 <- dat1 %>%
  str_split(pattern = "\r")
head(dat2)
[[1]]
[1] "Here is the thing." ""
[[2]]
[1] "Case ID 1" ""
[[3]]
[1] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a"
[2] ""
[[4]]
[1] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal"
[2] ""
[[5]]
[1] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable"
[2] ""
[[6]]
[1] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default"
[2] ""
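The empty strings in the output above appear because splitting on "\r" leaves an empty piece after the carriage return at the end of each line. A minimal sketch of one way to avoid that, assuming the page text uses "\r\n" line endings as the output suggests: split once on the regular expression "\r?\n". The `page` string below is a made-up stand-in for `dat[[1]]`, not the real PDF text.

```r
# Hypothetical stand-in for one element of the pdf_text() result
page <- "Here is the thing.\r\nCase ID 1\r\nExploring Angels: It is a long established fact\r\n"

# Split on an optional carriage return followed by a newline,
# so neither "\r" nor empty strings are left in the result
lines <- strsplit(page, "\r?\n")[[1]]
lines
```

This gives a plain character vector with one element per line, which is easier to work with than the list returned by str_split().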
I want to organize my data into a table like this:
  Case.ID                            Exploring.Angels                       New.agency New.Factor New.Factor2 Creative.One
1       1 It is a long established fact that a reader Lorem Ipsum is simply dummy text        ABC         BNM         <NA>
2       2               Various versions have evolved    It has survived not only five        ABC        <NA>          DFZ
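For illustration, here is a rough base R sketch of how lines like these could be turned into a table of that shape. It rests on several assumptions that may not hold for the real document: every record starts with a line of the form "Case ID <number>", every field starts a line with a short label followed by ": ", and any other line continues the previous field. The `lines` vector is invented sample data, not the contents of the PDF.

```r
# Hypothetical lines, standing in for the cleaned text of the PDF
lines <- c(
  "Here is the thing.",
  "Case ID 1",
  "Exploring Angels: It is a long established",
  "fact that a reader",
  "New agency: Lorem Ipsum is simply dummy text",
  "New Factor: ABC",
  "Case ID 2",
  "Exploring Angels: Various versions have evolved",
  "Creative One: DFZ"
)

# Drop title lines that come before the first "Case ID" line
is_case <- grepl("^Case ID [0-9]+$", lines)
lines   <- lines[cumsum(is_case) > 0]

# Rewrite "Case ID 1" as "Case ID: 1" so it parses like any other field
lines <- sub("^Case ID ([0-9]+)$", "Case ID: \\1", lines)

# A field starts with a label followed by ": " (assumed convention);
# every other line is glued onto the field before it
pat      <- "^([A-Za-z][A-Za-z0-9 ]*): (.*)$"
field_id <- cumsum(grepl(pat, lines))
merged   <- unname(vapply(split(lines, field_id), paste,
                          character(1), collapse = " "))

# Split each merged field at its first colon
key   <- sub(pat, "\\1", merged)
value <- sub(pat, "\\2", merged)

# Long format: one row per (case, field) pair
long <- data.frame(
  case  = cumsum(key == "Case ID"),
  key   = make.names(key),   # "Case ID" becomes "Case.ID", and so on
  value = value,
  stringsAsFactors = FALSE
)

# Wide format: one row per case, one column per field, NA where missing
wide <- reshape(long, idvar = "case", timevar = "key", direction = "wide")
wide$case <- NULL
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

The label pattern is deliberately strict (letters, digits and spaces only), so a continuation line that happens to contain a colon after punctuation is not mistaken for a new field. It is still a heuristic and would need adjusting to the real labels.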