
I am trying to write the Apriori algorithm in R. First, I want to count the frequency of each item in a list. My initial code is:

a_list <- list(c("I1","I2","I5"),
           c("I2","I4"),
           c("I2","I3"),
           c("I1","I2","I4"),
           c("I1","I3"),
           c("I2","I3"),
           c("I1","I3"),
           c("I1","I2","I3","I5"),
           c("I1","I2","I3"))
sapply(a_list, function(x) length(x))
un <- unique(unlist(a_list))
nm <- lapply(un, function(x) sapply(a_list, function(y) sum(y == x)))
names(nm) <- un
nm

My result is:

> nm

$I1
[1] 1 0 0 1 1 0 1 1 1

$I2
[1] 1 1 1 1 0 1 0 1 1

$I5
[1] 1 0 0 0 0 0 0 1 0

$I4
[1] 0 1 0 1 0 0 0 0 0

$I3
[1] 0 0 1 0 1 1 1 1 1

However, I would like it arranged like this (perhaps recast as a matrix or array so that I can use it further):

> nm

I1 6
I2 7
I3 6
I4 2
I5 2

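Each item is shown with its frequency count, in alphabetical order. Is there a way to achieve this? I tried cbind, apply, and relist, but have not found a solution. Thanks.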

Update:

library(dplyr)
a_list <- list(c("I1","I2","I5"),
           c("I2","I4"),
           c("I2","I3"),
           c("I1","I2","I4"),
           c("I1","I3"),
           c("I2","I3"),
           c("I1","I3"),
           c("I1","I2","I3","I5"),
           c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
a
minsupport = 3
b <- data.frame(a)
c <- b[b$Freq > minsupport,]
c

Now my results are:

> a
   . Freq
1 I1    6
2 I2    7
3 I3    6
4 I4    2
5 I5    2

> c
   . Freq
1 I1    6
2 I2    7
3 I3    6

How do I then form the combinations "I1,I2", ..., "I2,I3" so that I can scan the original list for them?

Update: I tried combn as follows, and it returns a matrix.

> combn(c$.,2)
     [,1] [,2] [,3]
[1,] I1   I1   I2  
[2,] I2   I3   I3  
Levels: I1 I2 I3 I4 I5

I then modified it to:

d <- combn(c$.,2)
result <- unique(sapply(d,function(i) paste(d[,i],collapse=",")))
result

My result is:

> result
[1] "I1,I2" "I1,I3" "I2,I3"

The next step is to count the frequency of the itemsets above in the original a_list. Perhaps it would be better to output them as

""I1","I2"", ""I1","I3"", ""I2","I3""

so that they can be compared with the original list.

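How do I get the frequency of the itemsets in this matrix from the original a_list? The Apriori algorithm requires scanning all itemsets whose support is no less than the minimum support, from 1 dimension (i.e. "I1", "I2", ..., "I5" in a_list) to 2 dimensions (i.e. "I1,I2", "I1,I3", "I2,I3" in this case), and then onward where applicable (e.g. "I1,I2,I3").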

Update: I can now find matches for one specific pattern at a time, e.g. ("I1","I2") or ("I1","I3").

toMatch <- c("I1","I2")
matches <- grepRaw(toMatch,a_list,ignore.case = TRUE)
matches

Result:

> matches
[1] 4

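What remains unsolved is matching all the patterns in result at once (I typed the pattern by hand in the example above, but it needs to be taken from result) and outputting them in the following form: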

Itemset     Freq
""I1","I2"" 4     
""I1","I3"" 4
""I2","I3"" 4

1 Answer


The dplyr package makes this straightforward.

library(dplyr)
unlist(a_list) %>% table %>% data.frame

  unlist.a_list. Freq
1             I1    6
2             I2    7
3             I3    6
4             I4    2
5             I5    2
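
The same table can also be built without dplyr. This is a minimal base-R sketch, assuming the a_list from the question; the column name Item, the variable min_support, and the use of `>=` for the threshold are choices made for this illustration, not part of the original code.

```r
# Transactions copied from the question.
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# table() counts each distinct item and orders the names alphabetically.
# Naming the argument gives the first column a readable name instead of ".".
item_freq <- as.data.frame(table(Item = unlist(a_list)),
                           responseName = "Freq",
                           stringsAsFactors = FALSE)

# Keep the items that reach the (assumed) minimum support.
min_support <- 3
frequent <- item_freq[item_freq$Freq >= min_support, ]

print(item_freq)
print(frequent)
```

Because Item is stored as character rather than factor, it can be passed to combn() later without any factor-level surprises.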

Update:

I am not sure what you are looking for, but here is how to get the combinations:

Cols <- paste0("I",1:3)
p <- length(Cols)
id <- unlist(lapply(1:p, function(i) combn(1:p,i,simplify=F)), recursive=F)
formulas <- sapply(id,function(i) paste(Cols[i],collapse=","))

> formulas
[1] "I1"       "I2"       "I3"       "I1,I2"    "I1,I3"    "I2,I3"    "I1,I2,I3"
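
The enumeration above lists every combination up front. Apriori instead grows itemsets level by level and drops the infrequent ones before moving on. The following is a rough sketch of that idea, not a full implementation. It assumes the a_list from the question, a threshold of 3 compared with `>=`, and a simplified candidate step (all k-combinations of the items that survived the previous level, rather than the classic join-and-prune step). The names support and min_support are made up for this illustration.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
min_support <- 3  # assumed threshold: keep itemsets with count >= 3

# Count the transactions that contain every item of `itemset`.
support <- function(itemset, transactions) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

items <- sort(unique(unlist(a_list)))
frequent <- list()  # frequent[[k]] holds the counts of frequent k-itemsets
k <- 1
repeat {
  if (length(items) < k) break
  # Candidate k-itemsets built from the items that are still alive.
  candidates <- combn(items, k, simplify = FALSE)
  counts <- vapply(candidates, support, integer(1), transactions = a_list)
  keep <- counts >= min_support
  if (!any(keep)) break
  kept <- candidates[keep]
  frequent[[k]] <- setNames(counts[keep],
                            vapply(kept, paste, character(1), collapse = ","))
  # An item that is in no frequent k-itemset cannot be in a frequent
  # (k+1)-itemset, so it is dropped before the next level.
  items <- sort(unique(unlist(kept)))
  k <- k + 1
}

print(frequent)
```

On this data the loop stops at level 3, because "I1,I2,I3" appears in only two transactions.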

Update 2:

This should do what you need:

library(dplyr)
a_list <- list(c("I1","I2","I5"),
           c("I2","I4"),
           c("I2","I3"),
           c("I1","I2","I4"),
           c("I1","I3"),
           c("I2","I3"),
           c("I1","I3"),
           c("I1","I2","I3","I5"),
           c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
minsupport = 3
b <- data.frame(a)
c <- b[b$Freq > minsupport,]
# combn() applies paste() to each pair; as.character() avoids indexing by factor codes
result <- combn(as.character(c$.), 2, paste, collapse = ",")
> result
[1] "I1,I2" "I1,I3" "I2,I3"

Then collapse your a_list so that it looks like result:

a.new.list <- sapply(a_list, paste, collapse=",")
> a.new.list
[1] "I1,I2,I5"    "I2,I4"       "I2,I3"       "I1,I2,I4"    "I1,I3"       "I2,I3"       "I1,I3"      
[8] "I1,I2,I3,I5" "I1,I2,I3" 

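Use the match function and loop over all the results. Note that match compares whole strings, so a transaction is counted only when it is exactly equal to an itemset, not when it merely contains it: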

hits <- sapply(1:length(result), function(j) match(a.new.list,result[j]))
colnames(hits) <- result
rownames(hits) <- a.new.list
> hits
            I1,I2 I1,I3 I2,I3
I1,I2,I5       NA    NA    NA
I2,I4          NA    NA    NA
I2,I3          NA    NA     1
I1,I2,I4       NA    NA    NA
I1,I3          NA     1    NA
I2,I3          NA    NA     1
I1,I3          NA     1    NA
I1,I2,I3,I5    NA    NA    NA
I1,I2,I3       NA    NA    NA

> apply(hits,2, sum, na.rm=TRUE)
I1,I2 I1,I3 I2,I3 
0     2     2 
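
The counts above cover only transactions that are identical to an itemset, which is why "I1,I2" comes out as 0. If the goal is the Apriori support count (how many transactions contain the itemset), one possible approach is to split each string back into items and test containment with %in%. This is a sketch assuming the a_list and result from the question; the names itemsets, freq, and support_table are illustrative.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
result <- c("I1,I2", "I1,I3", "I2,I3")

# Split each "I1,I2" string back into its individual items.
itemsets <- strsplit(result, ",", fixed = TRUE)

# A transaction supports an itemset when it contains every item in it.
freq <- vapply(itemsets, function(s) {
  sum(vapply(a_list, function(tr) all(s %in% tr), logical(1)))
}, integer(1))

support_table <- data.frame(Itemset = result, Freq = freq)
print(support_table)
```

On this data each of the three pairs is contained in four transactions, which matches the Itemset/Freq layout requested in the question.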
answered 2015-03-23T03:47:47.027