I have a dataset of 100,000 rows in the following transaction format:

B038-82C81778E81C   Toy Story
B038-82C81778E81C   Planet of the apes
B038-82C81778E81C   Iron Man
9C05-EE9B44E8C18F   Bruce Almighty
9C05-EE9B44E8C18F   Iron Man
9C05-EE9B44E8C18F   Toy Story
8F59-9956070D8005   Toy Story
8F59-9956070D8005   Gravity
8F59-9956070D8005   Iron Man
8F59-9956070D8005   Gone
B52F-9936734525AF   Planet of the Apes
B52F-9936734525AF   Bruce Almighty

I want to convert it to a matrix like the one below (or to TRUE/FALSE flags):

Matrix              Toy Story  Planet of the Apes  Iron Man  Bruce Almighty   Gone  Gravity
B038-82C81778E81C    1             1                 1             0            0     0
9C05-EE9B44E8C18F    1             0                 1             1            0     0 
8F59-9956070D8005    1             0                 1             0            1     1
B52F-9936734525AF    0             1                 0             1            0     0

I tried the following steps:

TrnsDataset1<-read.transactions("~/Desktop/movieswid_1Copy.txt", format= c("single"), sep="\t", cols = c(1,2), rm.duplicates=TRUE);
L <- as(TrnsDataset1,"list");
M <- as(L,"matrix")
CM<- as (M,"ngCMatrix");

However, after the list conversion, my output looks like this:

B038-82C81778E81C   c("Toy Story\nB038-82C81778E81C\tPlanet of the apes\nB038-82C81778E81C\tIron Man")
9C05-EE9B44E8C18F   c("Bruce Almighty","Iron Man","Toy Story")

So some rows come out fine, but in others the unique id ends up inside the movie list, together with \t and \n characters.

I would like every list entry to have the following format: 9C05-EE9B44E8C18F c("Bruce Almighty","Iron Man","Toy Story")

With that, I believe I could easily get to the desired result. Any help is much appreciated.

1 Answer

I'm a bit confused because you say you want two things. If you just want the sparse matrix, you can skip the list and standard-matrix conversions and coerce the transactions object directly:

TrnsDataset1 <- read.transactions(...);
mm <- t(as(TrnsDataset1,"ngCMatrix"))

This results in

4 x 6 sparse Matrix of class "ngCMatrix"
                  Bruce Almighty Gone Gravity Iron Man Planet Toy Story
8F59-9956070D8005              .    |       |        |      .         |
9C05-EE9B44E8C18F              |    .       .        |      .         |
B038-82C81778E81C              .    .       .        |      |         |
B52F-9936734525AF              |    .       .        .      |         .

which is a matrix of true/false values (abbreviated here to fit the available space); `|` marks TRUE and `.` marks FALSE. There is no need to go through the list form at all.
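To try the coercion without the original file, here is a minimal sketch that rebuilds the question's sample rows in memory. The data frame `df` and the names `movie_list`, `trans`, `sparse_flags` and `dense_flags` are made up for illustration, and "Planet of the apes" is written with a capital "A" throughout on the assumption that both spellings refer to the same movie (otherwise they would become two separate items).

```r
library(arules)  # attaches Matrix as well, which provides the ngCMatrix class

# Sample rows from the question, held in memory instead of read from a file.
# Assumption: "Planet of the apes" and "Planet of the Apes" are the same item.
df <- data.frame(
  id = c(rep("B038-82C81778E81C", 3), rep("9C05-EE9B44E8C18F", 3),
         rep("8F59-9956070D8005", 4), rep("B52F-9936734525AF", 2)),
  movie = c("Toy Story", "Planet of the Apes", "Iron Man",
            "Bruce Almighty", "Iron Man", "Toy Story",
            "Toy Story", "Gravity", "Iron Man", "Gone",
            "Planet of the Apes", "Bruce Almighty"),
  stringsAsFactors = FALSE
)

# One character vector of movies per transaction id.
movie_list <- split(df$movie, df$id)

# Coerce the named list to a transactions object.
trans <- as(movie_list, "transactions")

# Sparse pattern matrix; arules keeps items in rows, so transpose
# to get one row per transaction.
sparse_flags <- t(as(trans, "ngCMatrix"))

# Dense incidence matrix with one row per transaction, if the data is small
# enough to hold in full.
dense_flags <- as(trans, "matrix")
```

For the full 100,000-row dataset the sparse form is probably the safer choice, since the dense matrix grows with the number of transactions times the number of distinct movies.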

Answered 2014-07-26T05:42:47.030