I'm new to R and not sure how to do this. Suppose I have a TSV (tab-separated) file and read it into a table like this:

> table <- read.delim(file='test.tsv',sep='\t',header=TRUE,stringsAsFactors=FALSE)

    id              features
1. 131  FeatureA,FeatureB,FeatureC,
2. 132  FeatureA,FeatureD,FeatureE,FeatureF
3. 135  FeatureD,FeatureE,FeatureC
4. 139  FeatureF,FeatureB

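I'd like to visualize how the features cluster, but to do that in R I need to change the type of the column named features to a list.

What is the best way to do this?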

2 Answers

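My "splitstackshape" package was written to handle exactly this kind of task. You can explore the concat.split family of functions.

Here are some examples (mydf is assumed to hold the data from the question):

As a list (note that this function sorts the output; until I add an option that does not sort, you would be better off with strsplit):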

library(splitstackshape)
x1 <- concat.split.list(mydf, split.col="features", sep=",", drop = TRUE)
x1
#     id                          features_list
# 1. 131           FeatureA, FeatureB, FeatureC
# 2. 132 FeatureA, FeatureD, FeatureE, FeatureF
# 3. 135           FeatureD, FeatureE, FeatureC
# 4. 139                     FeatureF, FeatureB
str(x1)
# 'data.frame':  4 obs. of  2 variables:
#  $ id           : int  131 132 135 139
#  $ features_list:List of 4
#   ..$ : chr  "FeatureA" "FeatureB" "FeatureC"
#   ..$ : chr  "FeatureA" "FeatureD" "FeatureE" "FeatureF"
#   ..$ : chr  "FeatureD" "FeatureE" "FeatureC"
#   ..$ : chr  "FeatureF" "FeatureB"

As a "wide" data.frame:

x2 <- concat.split.multiple(mydf, split.cols="features", seps=",")
x2
#     id features_1 features_2 features_3 features_4
# 1. 131   FeatureA   FeatureB   FeatureC       <NA>
# 2. 132   FeatureA   FeatureD   FeatureE   FeatureF
# 3. 135   FeatureD   FeatureE   FeatureC       <NA>
# 4. 139   FeatureF   FeatureB       <NA>       <NA>

As a "long" data.frame:

x3 <- concat.split.multiple(mydf, split.cols="features", seps=",", direction="long")
x3
#     id time features
# 1  131    1 FeatureA
# 2  132    1 FeatureA
# 3  135    1 FeatureD
# 4  139    1 FeatureF
# 5  131    2 FeatureB
# 6  132    2 FeatureD
# 7  135    2 FeatureE
# 8  139    2 FeatureB
# 9  131    3 FeatureC
# 10 132    3 FeatureE
# 11 135    3 FeatureC
# 12 139    3     <NA>
# 13 131    4     <NA>
# 14 132    4 FeatureF
# 15 135    4     <NA>
# 16 139    4     <NA>
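
For comparison, a similar long layout can be sketched in base R without the package. This is only an illustration: the inline mydf is a stand-in for the question's data, the names parts and long are made up for the example, and unlike the output above the rows come out grouped by id, with no time column and no NA padding.

```r
# Stand-in for the question's data, built inline so the sketch runs
# without test.tsv.
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# One character vector of features per row; a trailing comma does not
# produce an extra empty element.
parts <- strsplit(mydf$features, ",", fixed = TRUE)

# Repeat each id once per feature, then flatten the list of features.
long <- data.frame(
  id = rep(mydf$id, times = lengths(parts)),
  features = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

Each row of long is one (id, feature) pair, which is often the most convenient shape for counting or cross-tabulating features.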

Update, based on your comment:

As I mentioned in the comments, this is the direct result of strsplit. Note how the elements are extracted.

> mydf$featuresList <- strsplit(mydf$features, ",")
> mydf
    id                            features                           featuresList
1. 131         FeatureA,FeatureB,FeatureC,           FeatureA, FeatureB, FeatureC
2. 132 FeatureA,FeatureD,FeatureE,FeatureF FeatureA, FeatureD, FeatureE, FeatureF
3. 135          FeatureD,FeatureE,FeatureC           FeatureD, FeatureE, FeatureC
4. 139                   FeatureF,FeatureB                     FeatureF, FeatureB
> mydf[, "featuresList"][[2]]
[1] "FeatureA" "FeatureD" "FeatureE" "FeatureF"
> mydf[, "featuresList"][[2]][2]
[1] "FeatureD"
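
Once the column is a list, the usual list tools apply to it. The sketch below is only an illustration of that idea: the inline mydf stands in for the question's data, and n_features, has_b and ids_with_b are names invented for the example.

```r
# Stand-in for the question's data, built inline so the sketch runs
# without test.tsv.
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per row.
mydf$featuresList <- strsplit(mydf$features, ",", fixed = TRUE)

# Number of features in each row.
n_features <- lengths(mydf$featuresList)

# Logical flag per row: does the row contain FeatureB?
has_b <- vapply(mydf$featuresList,
                function(f) "FeatureB" %in% f,
                logical(1))

# ids of the rows that contain FeatureB.
ids_with_b <- mydf$id[has_b]
ids_with_b
```

vapply is used instead of sapply here so that the result is guaranteed to be a logical vector with one element per row.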
answered 2013-12-11T06:33:07.120

You could use strsplit:

table$list.features = strsplit(table$features,",")

You might also want to create indicator variables for these features:

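# Add one column per distinct feature, initialized to 0
table[unique(unlist(table$list.features))] <- 0
# Set the columns named in each row's feature list to 1
for (i in seq_len(nrow(table))) table[i, table$list.features[[i]]] <- 1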
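
Since the question is ultimately about clustering the features, here is one possible way to go from indicator variables to a hierarchical clustering. It is a rough sketch under assumptions, not a recommendation of a particular method: the data is retyped inline, the names feature_list, indicator, d and hc are invented for the example, and the choice of binary (Jaccard-style) distance with hclust defaults is mine rather than anything the question specifies.

```r
# Features per id, retyped from the question's table.
feature_list <- list(
  "131" = c("FeatureA", "FeatureB", "FeatureC"),
  "132" = c("FeatureA", "FeatureD", "FeatureE", "FeatureF"),
  "135" = c("FeatureD", "FeatureE", "FeatureC"),
  "139" = c("FeatureF", "FeatureB")
)

all_features <- sort(unique(unlist(feature_list)))

# 0/1 matrix: one row per id, one column per feature.
indicator <- t(vapply(feature_list,
                      function(f) as.integer(all_features %in% f),
                      integer(length(all_features))))
colnames(indicator) <- all_features

# Distance between features (columns), based on which ids share them.
d <- dist(t(indicator), method = "binary")

# Hierarchical clustering of the features.
hc <- hclust(d)
```

Calling plot(hc) afterwards would draw the dendrogram. With only four rows the result is not meaningful in itself; the point is the shape of the pipeline from list column to indicator matrix to clustering.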
answered 2013-12-11T06:37:04.150