The following assumes that your records are stored one entry per line, that is, the data looks like this:
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
"Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")
If that is not the case, but we can assume that the "Address:" field always appears on the first line of each record, we can do this:
## Starting point
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other",
"address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
"Atodo - Asociación de todo Address: calle 12 Bogota Colombia",
"Other address: Phone.: 12-1-23-32 E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")
## Locate the elements that have "Address:" and use cumsum to get an index
## Use tapply to paste the relevant vector elements together into single strings
text <- tapply(text,
cumsum(grepl("Address:", text)),
paste, collapse = " ")
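To see why this grouping works, here is a minimal sketch on a made-up four-line input (the `lines` vector and its contents are hypothetical, not taken from the real data). It relies on "Address:" being matched case-sensitively, so a continuation line that begins with the lower-case "address:" is not mistaken for the start of a new record.

```r
## Hypothetical toy input: two records, each broken across two lines
lines <- c("A Address: x Other", "address: Phone.: 1",
           "B Address: y", "Other address: Phone.: 2")

## TRUE marks the first line of each record (the match is case-sensitive)
starts <- grepl("Address:", lines, fixed = TRUE)

## cumsum() turns those flags into a running record number
idx <- cumsum(starts)

## Paste together all lines that share a record number
joined <- tapply(lines, idx, paste, collapse = " ")
print(idx)
print(joined)
```

If a record could ever have "Address:" on a later line instead of the first one, this index would be wrong, so that assumption is worth checking against the real file.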
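From there, the approach is basically as follows:
- Extract a list of the "heading" parts.
- Extract a list of the corresponding values.
- Paste them back together as vectors.
- Split them apart again.
- Reshape the result from "long" format to "wide" format.
The tools used are as follows: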
library(devtools)
library(data.table)
library(reshape2)
source_gist("11380733") ## For cSplit (also available in the "splitstackshape" package)
The approach is similar to @won782's.
splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
"Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")
I have found some of the "stringr" functions to be a bit slow, so I stick with base R here:
X1 <- regmatches(text, gregexpr(pattern, text))
X2 <- regmatches(text, gregexpr(pattern, text), invert = TRUE)
Combined <- Map(paste0,
lapply(X1, append, values = "Company:", after = 0),
lapply(X2, trimws)) ## trimws() is base R; avoids the unexported data.table:::trim
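As a rough illustration of what the two regmatches calls return, here is a sketch on a single made-up string (the names `s`, `pat`, `heads`, and `vals` are hypothetical). With invert = TRUE the pieces between the matches come back, including the text before the first match, so there is always one more value than there are headings. That is why "Company:" is prepended to the headings before pasting.

```r
## Hypothetical one-record example with only two headings
s   <- "Foo Address: bar Phone.: 123"
pat <- "Address:|Phone.:"
m   <- gregexpr(pat, s)

heads <- regmatches(s, m)[[1]]                 # the matched headings
vals  <- regmatches(s, m, invert = TRUE)[[1]]  # the text around the headings

## One more value than headings: the leading piece is the company name
combined <- paste0(c("Company:", heads), trimws(vals))
print(combined)
```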
This is where we are so far:
Combined
# [[1]]
# [1] "Company:Anada - Asociación de nada" "Address:calle 13 13 Medellin Colombia"
# [3] "Other address:" "Phone.:13-13-136131 13-13-13-1313"
# [5] "E-mail:anada@13.co" "Web page:"
# [7] "Category:3." "Private sector Notes:"
#
# [[2]]
# [1] "Company:Atodo - Asociación de todo"
# [2] "Address:calle 12 Bogota Colombia"
# [3] "Other address:"
# [4] "Phone.:12-1-23-32"
# [5] "E-mail:"
# [6] "Web page:www.atodoooo.com,"
# [7] "Category:99."
# [8] "Public sector Notes:note that there are missing fields."
The cSplit function works well with data.tables, so let's use it directly.
DT <- data.table(V1 = unlist(Combined)) ## unlist the values
DT <- cSplit(DT, "V1", ":") ## Split by a colon
DT[, V1_1 := gsub("Public sector |Private sector ", "", V1_1)] ## Just "notes"
DT[, id := cumsum(V1_1 == "Company")] ## Add an id column
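If the gist is unavailable, or if a value might itself contain a colon (a URL starting with "http://", for instance), one possible alternative is to split each string at its first colon only. The following is a base R sketch under that assumption, not the cSplit approach itself; the `combined` vector and the column names `field` and `value` are made up for illustration.

```r
## Hypothetical "heading:value" strings, one of which has a colon in its value
combined <- c("Company:Foo", "Address:bar",
              "Web page:http://example.com", "E-mail:")

## Position of the first colon in each string
pos <- regexpr(":", combined, fixed = TRUE)

long <- data.frame(
  field = substr(combined, 1, pos - 1),   # text before the first colon
  value = substring(combined, pos + 1),   # everything after it
  stringsAsFactors = FALSE
)

long$value[long$value == ""] <- NA             # empty fields become NA
long$id <- cumsum(long$field == "Company")     # one id per record
print(long)
```

The resulting data.frame has the same long shape as DT above and could be converted with as.data.table() before reshaping.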
From there, we can use dcast.data.table to convert the dataset from "long" to "wide", as follows:
dcast.data.table(DT, id ~ V1_1, value.var = "V1_2")
# id Address Category Company
# 1: 1 calle 13 13 Medellin Colombia 3. Anada - Asociación de nada
# 2: 2 calle 12 Bogota Colombia 99. Atodo - Asociación de todo
# E-mail Notes Other address
# 1: anada@13.co NA NA
# 2: NA note that there are missing fields. NA
# Phone. Web page
# 1: 13-13-136131 13-13-13-1313 NA
# 2: 12-1-23-32 www.atodoooo.com,
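For readers who want to see the long-to-wide step in isolation, here is a small sketch using base R's reshape() on made-up data (the `long` data.frame below is hypothetical). It is a swapped-in base R technique, not what dcast.data.table does internally, and the column names it produces are prefixed with "value.".

```r
## Hypothetical long-format data: one row per (record, field) pair
long <- data.frame(
  id    = c(1, 1, 2, 2),
  field = c("Company", "Phone.", "Company", "Phone."),
  value = c("A", "1", "B", NA),
  stringsAsFactors = FALSE
)

## One row per id, one column per distinct field
wide <- reshape(long, idvar = "id", timevar = "field", direction = "wide")
print(wide)
```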