My data looks like this:

DocID             Impact
CCRB-9-569  114;Adaptation - Strategic
CCRB-9-531  173;Nutrient trading
CCRB-9-886  
CCRB-9-989  
CCRB-9-530  71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
CCRB-9-671  106;Adaptation Responses;98;Climate Change
CCRB-9-570  114;Adaptation - Strategic
CCRB-9-990  
CCRB-9-526  98;Climate Change

Ideally, I would like to end up with:

DocID             Impact
CCRB-9-569  Adaptation - Strategic
CCRB-9-531  Nutrient trading
CCRB-9-886  
CCRB-9-989  
CCRB-9-530  Change in Temperature
CCRB-9-530  Extreme weather events
CCRB-9-530  Lower Rainfall
CCRB-9-671  Adaptation Responses
CCRB-9-671  Climate Change
CCRB-9-570  Adaptation - Strategic
CCRB-9-990  
CCRB-9-526  Climate Change

I started by trying:

test1=lapply(unlist(strsplit(test$Impact,"\\;")),as.character)

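But this gives me no way to link the pieces back to DocID, and the rows with no entries do not come through as blanks. I have tried omitting unlist, trying relist, using a cbind.fill function, merge, and so on, but I am missing something. It is fine if the numbers in the Impact column (114, 173, etc.) end up in the output file, as long as they are assigned the correct DocID. Thanks for your help.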

3 Answers

A similar data.table solution:

library(data.table)

# some dummy data
.data <- data.frame(id = letters[1:5], text = c('12;a-b;34', '', 'a-c', 'a-c;12;12', ''))
# make both columns character, not factor, and make it a data.table
.data <- as.data.table(lapply(.data, as.character))
# for each id, split and return the pieces (returning '' if there is nothing)
.data[, { value <- unlist(strsplit(text, split = ';'))
          if (length(value) == 0) text else value },
      by = id]
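The `length(value) == 0` check matters because splitting an empty string gives back a zero-length vector, which would otherwise drop that id from the result. A minimal base R sketch of that behaviour, using made-up input strings rather than the asker's data:

```r
# Base R only; the two input strings are illustrative assumptions.
parts <- strsplit(c("12;a-b;34", ""), split = ";")

# Number of pieces per input string: the empty string yields none
lengths(parts)

# Substitute "" for zero-length results so every input keeps at least one row
filled <- lapply(parts, function(p) if (length(p) == 0) "" else p)
lengths(filled)
```

The same substitution is what the `if ... else` inside the `data.table` call does for each id.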
answered 2012-08-10T05:34:35.013
I could not get the strsplit call in @csgillespie's function to work correctly, so I wrote my own:

library(plyr)

foo <- function(x) {
  # The regex split pattern can be read as:
  #   "find any section optionally starting with a space or a ';',
  #    followed necessarily by one or more digits and a ';'".
  # strsplit() splits on these segments and removes them;
  # unlist() converts the list returned by strsplit() to a vector.
  ivec <- unlist(strsplit(as.character(x), split = "\\s?;?[[:digit:]]+;"))

  # Remove zero-length items, except for DocIDs that have no Impact at all
  if (any(nchar(ivec) > 0)) ivec[nchar(ivec) > 0] else ""
}  # end of function

out <- ddply(dta, .(DocID), summarise, Impact = foo(Impact))
out
#--------------
         DocID                 Impact
1  CCRB-9-526          Climate Change
2  CCRB-9-530   Change in Temperature
3  CCRB-9-530  Extreme weather events
4  CCRB-9-530          Lower Rainfall
5  CCRB-9-531        Nutrient trading
6  CCRB-9-569  Adaptation - Strategic
7  CCRB-9-570  Adaptation - Strategic
8  CCRB-9-671    Adaptation Responses
9  CCRB-9-671          Climate Change
10 CCRB-9-886                        
11 CCRB-9-989                        
12 CCRB-9-990                        
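To see what the split pattern does before `ddply` gets involved, it can help to run it on a single string. The sketch below uses one Impact value from the question and only base R; the variable names `pieces` and `labels` are illustrative, not part of the answer's code.

```r
# One Impact value taken from the question's data
x <- "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall"

# Each "digits;" segment (with an optional leading space or ';') is removed
pieces <- unlist(strsplit(x, split = "\\s?;?[[:digit:]]+;"))
pieces

# The match at the very start of the string leaves an empty first piece,
# which is why zero-length items are filtered out afterwards
labels <- pieces[nchar(pieces) > 0]
labels
```

The empty leading piece is the reason `foo` filters on `nchar(ivec) > 0` instead of returning the split result directly.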

Construction of the test case (a non-whitespace separator is needed):

dta <- read.table(text="DocID     |        Impact
 CCRB-9-569 | 114;Adaptation - Strategic
 CCRB-9-531 | 173;Nutrient trading
 CCRB-9-886 | 
 CCRB-9-989 | 
 CCRB-9-530 | 71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall
 CCRB-9-671 | 106;Adaptation Responses;98;Climate Change
 CCRB-9-570 | 114;Adaptation - Strategic
 CCRB-9-990 | 
 CCRB-9-526 | 98;Climate Change", header=TRUE, sep="|")
answered 2012-08-10T05:54:46.510
You can do this fairly easily with the plyr package. First, create some dummy data and load the package:

dd = data.frame(DocID = c("CCRB-9-569", "CCRB-9-530", "CCRB-9-886"),
                 Impact=c("114;Adaptation - Strategic", 
     "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall",
                          ""), stringsAsFactors=FALSE)
library(plyr)

Next, we create a function that operates on the Impact column:

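f = function(i) {
    l = unlist(strsplit(as.character(i), ";"))
    ## An empty string splits to a zero-length vector; return "" so the DocID is kept
    if (length(l) == 0) return("")
    ## Otherwise keep only the even-numbered elements (the labels)
    if (length(l) > 1) l = l[seq(2, length(l), by = 2)]
    return(l)
}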

Then we use ddply:

ddply(dd, "DocID", summarise, Impact = f(Impact))

Here we take dd as the input, split it up by DocID, and apply the function f to the Impact values of each piece.


Note that my function f assumes that you want to split the string on ;

Function logic

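The plyr function "creates" smaller data frames based on the value of DocID. I then assume that the Impact entry for a particular DocID value has the following format: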

 Number;string;Number;string;Number;string

When we split on ;, we get the vector:

Number, string, Number, string, Number, string

So we just need to select the even-numbered elements, i.e.

l[seq(2, length(l), 2)]
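As a worked example of that selection, here is the same logic applied to one Impact value from the question. This is only a sketch in base R, and it relies on the stated assumption that numbers and strings strictly alternate; the name `evens` is illustrative.

```r
# One Impact value from the question, in Number;string;Number;string form
x <- "71;Change in Temperature;65;Extreme weather events;96;Lower Rainfall"

# Splitting on ";" gives six elements: number, string, number, string, ...
l <- unlist(strsplit(x, ";"))
length(l)

# Positions 2, 4, 6 hold the strings; positions 1, 3, 5 hold the numbers
evens <- l[seq(2, length(l), by = 2)]
evens
```

If an entry ever had a string without a preceding number, the positions would shift and this selection would pick the wrong elements.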
answered 2012-08-10T05:20:29.233