我正在寻找一种方法来搜索构成数据框行的因素级别之间的特定交互形式。
我有一个数据框,例如这个,其中每一列都是一个个体,每一行都是一个观察:
A B C D E G H I
1 NA "1" "1" "1" "1" NA "1" "1"
2 "2" "1" "2" "1" "1" NA "1" "1"
3 "1" "2" "2" "1" "1" "1" "1" "2"
4 "1" "2" "2" "2" "3" "3" "4" "2"
5 "1" "1" "2" "2" "1" "2" "1" "2"
我要检测的是因子水平组合的存在(或不存在),例如 for an x:x' and x:y'
exists also a combination y:x' and y:y'
。例如在这里,第 2 行和第 3 行存在这样的组合,我可以通过使用interaction
or看到:
:
> df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1",
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1",
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1",
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1",
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))
> interaction(df["2",],df["3",])
[1] 2.1 1.2 2.2 1.1 1.1 <NA> 1.1 1.2
Levels: 1.1 2.1 1.2 2.2
也:
> as.factor(df["2",]):as.factor(df["3",])
[1] 2:1 1:2 2:2 1:1 1:1 <NA> 1:1 1:2
Levels: 1:1 1:2 2:1 2:2
但是,现在,我希望自动完成检测,这样我就可以将所有行对的标签放在数据框中,其中包含这样的配置 (x:y, x:y', x':y, x':y') 被检测到我以后要绘制的网络的边缘列表中(例如,我想"2","3"
在边缘列表中添加一行)。
我找到了一种使用 Perl 和正则表达式的复杂方法来做到这一点,但我想知道在 R 中是否存在不使用 Regexp 的方法来做到这一点。
编辑[04/05/2013]
为避免不清楚,以下是有关我正在寻找的配置的更多详细信息:
let {x,y,...} be observations of the first row
let {x',y',...} be observations of the second row
for interactions ({x,x'} and {x,y'}) does it exists interactions ({y,x'} and {y,y'})
因此,举几个例子,交互例如:
1:1, 1:2, 2:1, 2:2 (rows 2 and 3)
或者
1:1, **2:1**, **2:2**, **3:1**, **3:2**, 4:1 (rows 4 and 5)
会匹配,但不匹配
1:1,1:2,1:3,1:4, 2:2 (rows 3 and 4)
或者
1:1,1:2 (rows 1 and 2)
例如。
我现在拥有的是一个代码,它可以在很长一段时间内完成我想做的事情(模仿以前的 Perl 脚本)(即使我添加了一个 while 循环以避免不必要的比较),并使用多个循环和正则表达式. 我希望有一种不那么复杂的方法来进行这种比较。这是我现在的做法:
df <- structure(c(NA, "2", "1", "1", "1", "1", "1", "2", "2", "1",
"1", "2", "2", "2", "2", "1", "1", "1", "2", "2", "1", "1", "1",
"3", "1", NA, NA, "1", "3", "2", "1", "1", "1", "4", "1", "1",
"1", "2", "2", "2"), .Dim = c(5L, 8L), .Dimnames = list(c("1",
"2", "3", "4", "5"), c("A", "B", "C", "D", "E", "G", "H", "I")))
"myfunction" = function(x){
TableVariantes = as.matrix(x) ;
#Creating the edgelist for the network
edgelist = c(character(0),character(0));
TotalVL = nrow(TableVariantes);
for(i in 1:(TotalVL-1)){
VLA = i;
if(!(i+1) > TotalVL){
for(j in (i+1):TotalVL){
VLB = j ;
problematic.configuration = FALSE;
#False until proven otherwise
interactions = interaction(as.factor(TableVariantes[VLA,]):as.factor(TableVariantes[VLB,]),drop=TRUE);
if(nlevels(as.factor(interactions)) > 3){
#More than three configurations, let's go
#Testing every level of the first variant location
for(k in levels(as.factor(TableVariantes[VLA,]))){
# We create the regexp we will need afterwards. Impossible to use variables inside a regex in R.
searchforK = paste(k,":(.+)",sep="")
if (length(grep(searchforK,levels(interactions), ignore.case = TRUE, perl = TRUE)) > 1){
#More than one configuration for this level of the first row
#capturing corresponding observations of the second row
second.numbers = regexec(searchforK,levels(interactions), ignore.case = TRUE)
second.numbers = do.call(rbind,lapply(regmatches(levels(interactions),second.numbers),`[`))
#Interactions with first number other than the one we are testing
invert.matches = grep(searchforK,levels(interactions), ignore.case = TRUE, perl = TRUE, value=TRUE, invert=TRUE)
#listing these alternative first numbers
alternative.first.numbers = regexec("(.+?):.+",levels(as.factor(invert.matches)), ignore.case = TRUE)
alternative.first.numbers = do.call(rbind,lapply(regmatches(levels(as.factor(invert.matches)),alternative.first.numbers),`[`))
#testing each alternative first number
for(l in levels(as.factor(alternative.first.numbers[,2]))){
#variable problems to count the problematic configurations
problems = 0 ;
#with each alternative second number
for(m in levels(as.factor(second.numbers[,2]))){
searchforproblem = paste(l,":",m,sep="");
if(length(grep(searchforproblem,invert.matches,ignore.case = TRUE, perl = TRUE)) > 0){
#if it matches
problems = problems + 1;
}
if(problems > 1){
#If two possibilities at least
problematic.configuration = TRUE;
}
}
}
}
}
}
if(problematic.configuration == TRUE){
edgelist = rbind(edgelist,c(rownames(TableVariantes)[VLA],rownames(TableVariantes)[VLB]));
#adding a new edge to the network of conflicts !
}
}
}
}
return(edgelist);
}