regex - How do I convert multi-value strings into a usable frequency table in R?

Question

I have a field in a data frame named plugins_Apache_module that contains strings like the following:

c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
    "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
    "mod_ssl/2.2.9")

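I need a frequency table of the modules, along with their versions.

What is the best way to do this in R? Being fairly new to R, I have seen strsplit and gsub, and some people in chat rooms also suggested the qdap package.

Ideally, I would like to convert the strings into a data frame with one column per module, where the version goes into that column's field whenever the module is present. How would I accomplish such a transformation?

Which data frame format is recommended if I want top frequencies, say for mod_ssl (all versions), as well as relationships between modules (for example, mod_perl is often used together with mod_ssl)?

I am not quite sure how to handle variable-length data like this when pushing it into a data frame for processing. Any suggestions are welcome.

I think the correct answer would look like: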

mod_perl   mod_python  mod_ssl  mod_auth_passthrough mod_bwlimited 
1.99_16    3.1.3       2.0.52                      
                       2.2.23   2.1                  1.4
                       2.2.9

So basically the first part (the module name) becomes a column, and the version that follows it becomes the row entry.

score 1 · Accepted Answer

st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52", "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23", "mod_ssl/2.2.9")

 scan(text=st, what="", sep=",")
Read 7 items
[1] "mod_perl/1.99_16"         "mod_python/3.1.3"         "mod_ssl/2.0.52"          
[4] "mod_auth_passthrough/2.1" "mod_bwlimited/1.4"        "mod_ssl/2.2.23"          
[7] "mod_ssl/2.2.9"

strsplit( scan(text=st, what="", sep=","), "/")
Read 7 items
[[1]]
[1] "mod_perl" "1.99_16" 

[[2]]
[1] "mod_python" "3.1.3"     

[[3]]
[1] "mod_ssl" "2.0.52" 

[[4]]
[1] "mod_auth_passthrough" "2.1"                 

[[5]]
[1] "mod_bwlimited" "1.4"          

[[6]]
[1] "mod_ssl" "2.2.23" 

[[7]]
[1] "mod_ssl" "2.2.9"  

table( sapply(strsplit( scan(text=st, what="", sep=","), "/"), "[",1)  )
#----------------
Read 7 items
mod_auth_passthrough        mod_bwlimited             mod_perl           mod_python 
                   1                    1                    1                    1 
             mod_ssl 
                   3 

 table( scan(text=st, what="", sep=",") )
#-----------
Read 7 items

mod_auth_passthrough/2.1        mod_bwlimited/1.4         mod_perl/1.99_16 
                       1                        1                        1 
        mod_python/3.1.3           mod_ssl/2.0.52           mod_ssl/2.2.23 
                       1                        1                        1 
           mod_ssl/2.2.9 
                       1
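
The scan() calls above pool all seven items, so the link back to the original string is lost. If you also need to know which input row each module came from, a minimal base R sketch could look like the following. The names items, row_id, parts and long are my own, chosen for illustration, and the code assumes every item has exactly one "/" separating module and version.

```r
# Sketch only: items, row_id, parts and long are illustrative names.
st <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
        "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
        "mod_ssl/2.2.9")

# Split each input string on commas, keeping one list element per original row
items <- strsplit(st, ",", fixed = TRUE)

# Record which input row every module/version pair came from
row_id <- rep(seq_along(items), lengths(items))

# Split "module/version" into its two pieces (assumes exactly one "/")
parts <- strsplit(unlist(items), "/", fixed = TRUE)

# Long format: one line per (row, module, version)
long <- data.frame(
  row = row_id,
  mod = vapply(parts, `[`, character(1), 1),
  ver = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Module counts across all versions, most frequent first
sort(table(long$mod), decreasing = TRUE)

# Version breakdown for a single module
table(long$ver[long$mod == "mod_ssl"])
```

A long table like this is usually the easier format for "top frequency" questions, because table() and subsetting work on it directly.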

score 1 · Accepted Answer

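You are asking for at least two different things. Adding the desired output helped a lot. I am not sure that what you are asking for is what you really want, but you asked, and it seemed like a fun problem. OK, here is how I would approach it with qdap (note that this requires qdap version 1.1.0):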

## load qdap
library(qdap)

## your data
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
    "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
    "mod_ssl/2.2.9")

## strsplit on commas and slashes
dat <- unlist(lapply(x, strsplit, ",|/"), recursive=FALSE)

## make just a list of mods per row
mods <- lapply(dat, "[", c(TRUE, FALSE))

## make a string of versions
ver <- unlist(lapply(dat, "[", c(FALSE, TRUE)))

## make a lookup key and split it into lists
key <- data.frame(mod = unlist(mods), ver, row = rep(seq_along(mods), 
   sapply(mods, length)))
key2 <- split(key[, 1:2], key$row)

## make it into freq. counts
freqs <- mtabulate(mods)

## copy the freq table to vers (keeping freqs for the counts) and replace 0 with NA
vers <- freqs
vers[vers==0] <- NA

## loop through and fill the ones in each row using an env. lookup (%l%)
## (loop variable named `present` so the input vector x is not overwritten)
for(i in seq_len(nrow(vers))) {
    present <- vers[i, !is.na(vers[i, ]), drop = FALSE]
    vers[i, !is.na(vers[i, ])] <- colnames(present) %l% key2[[i]]
}

## Don't print the NAs
print(vers, na.print = "")

##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                                     1.99_16      3.1.3  2.0.52
## 2                  2.1           1.4                      2.2.23
## 3                                                          2.2.9

## the frequency counts per mods 
freqs

##   mod_auth_passthrough mod_bwlimited mod_perl mod_python mod_ssl
## 1                    0             0        1          1       1
## 2                    1             1        0          0       1
## 3                    0             0        0          0       1
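
If you would rather not depend on qdap, the same wide layout can probably be built in base R with matrix indexing. This is only a sketch under a few assumptions: each item has exactly one "/", and a module appears at most once per string (if it appears twice, the later version silently overwrites the earlier one). The names wide, present and cooc are mine, not part of qdap. The last step uses crossprod() on a 0/1 presence matrix to address the "mod_perl together with mod_ssl" part of the question.

```r
# Sketch only: wide, present and cooc are illustrative names.
x <- c("mod_perl/1.99_16,mod_python/3.1.3,mod_ssl/2.0.52",
       "mod_auth_passthrough/2.1,mod_bwlimited/1.4,mod_ssl/2.2.23",
       "mod_ssl/2.2.9")

# Split into items and remember the source row of each item
items  <- strsplit(x, ",", fixed = TRUE)
row_id <- rep(seq_along(items), lengths(items))

# Separate module names from versions (assumes exactly one "/" per item)
parts <- strsplit(unlist(items), "/", fixed = TRUE)
mod   <- vapply(parts, `[`, character(1), 1)
ver   <- vapply(parts, `[`, character(1), 2)

# Empty character matrix: one row per input string, one column per module
mods <- sort(unique(mod))
wide <- matrix(NA_character_, nrow = length(x), ncol = length(mods),
               dimnames = list(NULL, mods))

# Fill cells addressed by (row index, column index) pairs
wide[cbind(row_id, match(mod, mods))] <- ver

# Print without the NAs and without quotes
print(wide, na.print = "", quote = FALSE)

# 0/1 presence matrix; its cross-product counts how often two modules
# occur in the same string (the diagonal holds per-module totals)
present <- (!is.na(wide)) * 1L
cooc <- crossprod(present)
cooc["mod_perl", "mod_ssl"]
```

Wrap the matrix in as.data.frame(wide, stringsAsFactors = FALSE) if you need a data frame rather than a matrix.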
