I am trying to implement the Apriori algorithm in R. First, I want to count the frequency of each item in a list. My initial code is:
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
sapply(a_list, function(x) length(x))
un <- unique(unlist(a_list))
nm <- lapply(un, function(x) sapply(a_list, function(y) sum(y == x)))
names(nm) <- un
nm
My result is:
> nm
$I1
[1] 1 0 0 1 1 0 1 1 1
$I2
[1] 1 1 1 1 0 1 0 1 1
$I5
[1] 1 0 0 0 0 0 0 1 0
$I4
[1] 0 1 0 1 0 0 0 0 0
$I3
[1] 0 0 1 0 1 1 1 1 1
However, I would like it arranged like this (perhaps restructured into a matrix or array that I can use further):
> nm
I1 6
I2 7
I3 6
I4 2
I5 2
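That is, each item is shown with its frequency count, in alphabetical order. Is there a way to achieve this? I have tried cbind, apply, and relist, but have not found a solution. Thanks.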
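One possible way to get from the per-transaction 0/1 vectors in nm to the totals shown above is to sum each vector and sort by item name. This is only a minimal sketch; the names counts and freq_matrix are hypothetical and not part of the original code.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# Same construction as in the question: one 0/1 vector per item
un <- unique(unlist(a_list))
nm <- lapply(un, function(x) sapply(a_list, function(y) sum(y == x)))
names(nm) <- un

# Sum each item's 0/1 vector to get its total count (hypothetical name: counts)
counts <- sapply(nm, sum)

# Reorder alphabetically by item name
counts <- counts[order(names(counts))]

# Store as a one-column matrix with the items as row names
freq_matrix <- matrix(counts, ncol = 1,
                      dimnames = list(names(counts), "Freq"))
freq_matrix
```

A named vector such as counts is often enough on its own; the matrix form is only there because the question asks for a matrix or array.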
Update:
library(dplyr)
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
a
minsupport = 3
b <- data.frame(a)
c <- b[b$Freq > minsupport,]
c
Now my result is:
> a
. Freq
1 I1 6
2 I2 7
3 I3 6
4 I4 2
5 I5 2
> c
. Freq
1 I1 6
2 I2 7
3 I3 6
How do I then build the combinations "I1,I2", ..., "I2,I3" so that I can scan the original list for them?
Update: I tried combn as follows; it outputs a matrix.
> combn(c$.,2)
[,1] [,2] [,3]
[1,] I1 I1 I2
[2,] I2 I3 I3
Levels: I1 I2 I3 I4 I5
I then modified it further to:
d <- combn(c$.,2)
result <- unique(sapply(d,function(i) paste(d[,i],collapse=",")))
result
My result is:
> result
[1] "I1,I2" "I1,I3" "I2,I3"
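The same labels can probably be produced more directly by running combn on a plain character vector rather than a factor, which also keeps each pair available as a character vector for later comparison with a_list. This is a sketch only; frequent_items, pairs, and labels are hypothetical names, and the three items are typed in by hand here instead of being taken from c.

```r
# Frequent single items from the previous step (assumed; typed in by hand here)
frequent_items <- c("I1", "I2", "I3")

# simplify = FALSE returns a list with one character vector per pair
pairs <- combn(frequent_items, 2, simplify = FALSE)

# Comma-separated labels, one per pair
labels <- vapply(pairs, paste, character(1), collapse = ",")

pairs
labels
```

If the items come from a data frame column that is a factor, wrapping it in as.character() first avoids indexing by factor codes.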
The next step is to count the frequency of the above itemsets in the original a_list. Perhaps it would be better to output them as
""I1","I2"", ""I1","I3"", ""I2","I3""
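so that they can be compared with the original list.
How do I get the frequency of the itemsets in this matrix from the original a_list? The Apriori algorithm requires scanning for all itemsets whose support is not less than the minimum support, from size 1 (i.e. "I1", "I2", ..., "I5" in a_list) to size 2 (i.e. "I1,I2", "I1,I3", "I2,I3" in this case), and then on to larger sizes where applicable (e.g. "I1,I2,I3").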
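The level-by-level scan described above could be sketched roughly as follows. This is a simplified illustration, not a full Apriori implementation: candidates at each size are formed from all combinations of the items that survived the previous level, instead of the usual join-and-prune step. The names support, min_support, level, and frequent are hypothetical, and the threshold uses >= to follow the "not less than" wording.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
min_support <- 3L  # assumed threshold, as in the question

# Number of transactions that contain every item of the itemset
support <- function(itemset, transactions) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

level <- as.list(sort(unique(unlist(a_list))))  # size-1 candidates
frequent <- list()                              # one data frame per size
k <- 1
while (length(level) > 0) {
  counts <- vapply(level, support, integer(1), transactions = a_list)
  ok <- counts >= min_support
  if (!any(ok)) break
  keep <- level[ok]
  frequent[[k]] <- data.frame(
    Itemset = vapply(keep, paste, character(1), collapse = ","),
    Freq = counts[ok]
  )
  # Simplified candidate generation: combine the surviving items
  surviving <- sort(unique(unlist(keep)))
  k <- k + 1
  level <- if (length(surviving) >= k) {
    combn(surviving, k, simplify = FALSE)
  } else {
    list()
  }
}
frequent
```

On the sample data this stops after size 2, because the only size-3 candidate, "I1,I2,I3", appears in just two transactions.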
Update: I can now find matches for one specific pattern at a time, e.g. ("I1","I2") or ("I1","I3").
toMatch <- c("I1","I2")
matches <- grepRaw(toMatch,a_list,ignore.case = TRUE)
matches
Result:
> matches
[1] 4
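What remains unsolved is matching all the patterns in result at once (I typed the pattern by hand in the example above, but it needs to be taken from result) and outputting them in the following form: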
Itemset Freq
""I1","I2"" 4
""I1","I3"" 4
""I2","I3"" 4