I have the following text:

Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other
address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:
Atodo - Asociación de todo Address: calle 12 Bogota Colombia
Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.

I would like to obtain a matrix with the following column names, which can then be written to a .csv file:

Company, Address, Other Address, Tel, E-mail, Web page, Category, Sector, Notes

And rows:

Anada - Asociación de nada,calle 13 13 Medellin Colombia,,13-13-136131 13-13-13-1313,anada@13.co,,3,Private,

Atodo - Asociación de todo,calle 12 Bogota Colombia,,12-1-23-32,,www.atodoooo.com,99,Public,note that there are missing fields.

How can it be done with R?

2 Answers

This may be tedious, but it seems to call for string processing.

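library(stringr)

# `text` is assumed to hold one record per element
# (the output below comes from the "Atodo" record)
splitlist = 'Address|Other address|Phone|E-mail|Web page|Category'
a = str_split(text[1], ':')   # split the record at every colon

# str_replace_all() is vectorised, so the labels can be stripped in one call
a[[1]] = str_replace_all(a[[1]], splitlist, "")
a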

# [[1]]
# [1] "Atodo - Asociación de todo "          " calle 12 Bogota Colombia "          
# [3] " ."                                   " 12-1-23-32  "                       
# [5] " "                                    " www.atodoooo.com, "                 
# [7] " 99. Public sector Notes"             " note that there are missing fields."

You can then extract each field with only a little further string processing.

In this case, I can't think of anything simpler than regular expressions.
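
As a rough illustration of that remaining clean-up, the sketch below starts from the pieces printed above (typed in by hand so that it runs on its own) and tidies them into a named vector. The field names and the regular expression for the category are assumptions based on the layout requested in the question, not part of the original answer.

```r
# The pieces shown in the output above, copied by hand.
pieces <- c("Atodo - Asociación de todo ", " calle 12 Bogota Colombia ", " .",
            " 12-1-23-32  ", " ", " www.atodoooo.com, ",
            " 99. Public sector Notes", " note that there are missing fields.")

pieces <- trimws(pieces)            # drop leading/trailing whitespace
pieces[pieces == "."] <- ""         # leftover dot from the "Phone.:" label
pieces <- sub(",$", "", pieces)     # stray trailing comma after the URL

# Capture the category number and the sector name in one go.
# m[1] is the full match, m[2] and m[3] are the two capture groups.
m <- regmatches(pieces[7],
                regexec("^([0-9]+)\\. (.*) sector Notes$", pieces[7]))[[1]]

record <- c(Company = pieces[1], Address = pieces[2],
            "Other address" = pieces[3], Tel = pieces[4],
            "E-mail" = pieces[5], "Web page" = pieces[6],
            Category = m[2], Sector = m[3], Notes = pieces[8])
print(record)
```

This relies on every record producing the same eight pieces in the same order, which holds for the sample data but would need checking on real input.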

answered 2014-08-06T23:36:42.973

The following assumes that your records have one line per entry, that is, that the data looks like this:

text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:", 
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

If that is not the case, but we can assume that the " Address:" field always appears on the first line of a record, we can do the following:

## Starting point
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other", 
          "address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:", 
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia", 
          "Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

## Locate the elements that have "Address:" and use cumsum to get an index
## Use tapply to paste the relevant vector elements together into single strings
text <- tapply(text, 
               cumsum(grepl("Address:", text)), 
               paste, collapse = " ")
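
To see why the `cumsum(grepl(...))` trick groups the lines correctly, here is a small sketch on made-up data (the four `lines` values are hypothetical). The match is case-sensitive, so "Other address:" does not start a new record.

```r
# Toy input: two records, each broken over two lines (hypothetical data).
lines <- c("A Address: x Other",
           "address: Phone.: 1",
           "B Address: y",
           "Other address: Phone.: 2")

# TRUE only on lines containing "Address:" with a capital A,
# i.e. the first line of each record.
starts <- grepl("Address:", lines)

# The running total of the TRUEs gives every line its record number.
group <- cumsum(starts)
print(group)

# Paste together the lines that share a record number.
joined <- tapply(lines, group, paste, collapse = " ")
print(joined)
```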

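From there, the approach is basically as follows:

  • Extract a list of the "heading" parts.
  • Extract a list of the corresponding values.
  • Paste them back together as vectors.
  • Split them apart again.
  • Reshape the result from "long" to "wide" format.

The tools used are as follows: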

library(devtools)
library(data.table)
library(reshape2)
source_gist("11380733") ## For cSplit

The approach is similar to @won782's.

splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
               "Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")

I have found some of the "stringr" functions to be a bit slow, so I'm sticking with base R:

X1 <- regmatches(text, gregexpr(pattern, text))
X2 <- regmatches(text, gregexpr(pattern, text), invert = TRUE)

Combined <- Map(paste0, 
                lapply(X1, append, values = "Company:", after = 0), 
                lapply(X2, trimws))   ## trimws() is base R (>= 3.2.0)

This is where we are so far:

Combined
# [[1]]
# [1] "Company:Anada - Asociación de nada"    "Address:calle 13 13 Medellin Colombia"
# [3] "Other address:"                        "Phone.:13-13-136131 13-13-13-1313"    
# [5] "E-mail:anada@13.co"                    "Web page:"                            
# [7] "Category:3."                           "Private sector Notes:"                
# 
# [[2]]
# [1] "Company:Atodo - Asociación de todo"                     
# [2] "Address:calle 12 Bogota Colombia"                       
# [3] "Other address:"                                         
# [4] "Phone.:12-1-23-32"                                      
# [5] "E-mail:"                                                
# [6] "Web page:www.atodoooo.com,"                             
# [7] "Category:99."                                           
# [8] "Public sector Notes:note that there are missing fields."
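
The pairing of headings and values above works because `regmatches(..., invert = TRUE)` returns the text between the matches, and therefore always one more piece than there are matches. That extra leading piece is the company name, which is why "Company:" is prepended to the headings. A minimal sketch on a made-up record (the string `x` and the shortened pattern are hypothetical):

```r
x   <- "Acme Address: Main St Phone.: 555 Notes:"
pat <- "Address:|Phone\\.:|Notes:"
m   <- gregexpr(pat, x)

# The matched headings themselves.
labels <- regmatches(x, m)[[1]]

# Everything between the headings, including the text before the first
# heading and an empty string after the last one.
pieces <- regmatches(x, m, invert = TRUE)[[1]]

# One more piece than headings, so prepend a heading for the first piece.
combined <- paste0(c("Company:", labels), trimws(pieces))
print(combined)
```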

The cSplit function works nicely with data.tables, so let's go straight to using one.

DT <- data.table(V1 = unlist(Combined))       ## unlist the values
DT <- cSplit(DT, "V1", ":")                   ## Split by a colon
DT[, V1_1 := gsub("Public sector |Private sector ", "", V1_1)]  ## Just "notes"
DT[, id := cumsum(V1_1 == "Company")]         ## Add an id column

From there, we can use dcast.data.table to convert the dataset from "long" to "wide" form, as follows:

dcast.data.table(DT, id ~ V1_1, value.var = "V1_2")
#    id                       Address Category                    Company
# 1:  1 calle 13 13 Medellin Colombia       3. Anada - Asociación de nada
# 2:  2      calle 12 Bogota Colombia      99. Atodo - Asociación de todo
#         E-mail                               Notes Other address
# 1: anada@13.co                                  NA            NA
# 2:          NA note that there are missing fields.            NA
#                        Phone.          Web page
# 1: 13-13-136131 13-13-13-1313                NA
# 2:                 12-1-23-32 www.atodoooo.com,
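
The result above still keeps the category as "3." and drops the sector, and it has not been written to a file yet. As a possible way to get the exact layout asked for in the question, here is a base-R-only sketch that goes from the one-line records straight to a CSV. It assumes that every record contains all seven labels in the same order; the output path `csv_path` and the column names are illustrative choices, not part of the original answer.

```r
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:",
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

labels  <- c("Address:", "Other address:", "Phone\\.:", "E-mail:",
             "Web page:", "Category:", "Notes:")
pattern <- paste(labels, collapse = "|")

## Text between the labels: eight pieces per record, trimmed.
pieces <- regmatches(text, gregexpr(pattern, text), invert = TRUE)
fields <- do.call(rbind, lapply(pieces, trimws))   ## one row per record

out <- data.frame(
  Company         = fields[, 1],
  Address         = fields[, 2],
  `Other Address` = fields[, 3],
  Tel             = fields[, 4],
  `E-mail`        = fields[, 5],
  `Web page`      = sub(",$", "", fields[, 6]),    ## drop the stray comma
  ## "3. Private sector" -> Category "3" and Sector "Private"
  Category        = sub("^([0-9]+)\\..*$", "\\1", fields[, 7]),
  Sector          = sub("^[0-9]+\\.\\s*(.*) sector$", "\\1", fields[, 7]),
  Notes           = fields[, 8],
  check.names = FALSE, stringsAsFactors = FALSE
)

## Hypothetical output location; change as needed.
csv_path <- file.path(tempdir(), "companies.csv")
write.csv(out, csv_path, row.names = FALSE)
```

Missing fields come out as empty strings rather than NA, which matches the empty cells in the requested rows.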
answered 2014-08-07T08:06:19.090