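I have a table with an unequal number of elements in each row, where each element is a string with a count of 1 or 2 appended to it. I want to create a presence/absence matrix of the strings that keeps the count (1 or 2) and puts a zero where a string is not found.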

From this:

  V1      V2      V3         V4      V5
1  A   cat:2   dog:1    mouse:1 horse:2
2  B   dog:2 mouse:2 dolphin:2        
3  C horse:2                           
4  D   cat:1 mouse:2  dolphin:2   

To this:

  cat dog mouse horse dolphin
A 2 1 1 2 0
B 0 2 2 0 2
C 0 0 0 2 0
D 1 0 2 0 2

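I have looked at previous solutions to a similar problem: Convert a dataframe to presence absence matrix

But they create a 0/1 presence/absence matrix and do not include the counts.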

Sample data:

structure(list(V1 = c("A", "B", "C", "D"),
               V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
               V3 = c("dog:1", "mouse:2", "", "mouse:2"),
               V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
               V5 = c("horse:2", "", "", "")),
               .Names = c("V1", "V2", "V3", "V4", "V5"),
               class = "data.frame", row.names = c(NA, -4L))

2 Answers


Maybe some package makes this easier, but here is a base R solution. It won't be fast on large data, but it gets the job done:

# DF is assumed to hold the sample data frame from the question

# split each cell into its name and count parts
tmp <- apply(DF[, -1], 1, strsplit, ":")

# extract the names (first part); empty cells yield NA and are dropped
animals <- lapply(tmp, function(x) c(na.omit(sapply(x, "[", 1))))
uniquenames <- unique(unlist(animals))

# extract the counts (second part)
reps <- lapply(tmp, function(x) as.numeric(na.omit(sapply(x, "[", 2))))

# turn the counts into named vectors; SIMPLIFY = FALSE keeps a list
# even when every row has the same number of entries
res <- mapply(setNames, reps, animals, SIMPLIFY = FALSE)

# subset each named vector by all names and bind the rows into a matrix
res <- do.call(rbind, lapply(res, "[", uniquenames))

# cosmetics
colnames(res) <- uniquenames
rownames(res) <- DF$V1
res[is.na(res)] <- 0
res
#  cat dog mouse horse dolphin
#A   2   1     1     2       0
#B   0   2     2     0       2
#C   0   0     0     2       0
#D   1   0     2     0       2
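
The same idea can be sketched more compactly with matrix indexing. This is only an illustration under a few assumptions: the sample data is stored in DF (the question does not assign it to a name), every non-empty cell has the form name:count, and no name occurs twice in the same row (if one did, the last count would win). The helper names cells, row_id, keep, animal and count are made up for this sketch.

```r
# Sample data from the question, assigned to DF (the name is an assumption).
DF <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
  V3 = c("dog:1", "mouse:2", "", "mouse:2"),
  V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
  V5 = c("horse:2", "", "", ""),
  stringsAsFactors = FALSE
)

# Transpose so that R's column-major traversal visits the cells row by row.
cells  <- t(as.matrix(DF[, -1]))
row_id <- DF$V1[col(cells)]   # row label of DF that each cell belongs to
keep   <- nzchar(cells)       # TRUE for non-empty cells

# Split "name:count" into its two parts.
animal <- sub(":.*$", "", cells[keep])
count  <- as.numeric(sub("^.*:", "", cells[keep]))

# Start from a zero matrix and fill it by (row label, animal) pairs.
res <- matrix(0, nrow = nrow(DF), ncol = length(unique(animal)),
              dimnames = list(DF$V1, unique(animal)))
res[cbind(row_id[keep], animal)] <- count
res
```

On the sample data this should give the same matrix as above, with the columns in order of first appearance.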
answered 2019-07-19T10:46:55.090

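After melting the data to long format, you can split each value into animal and count with separate() from tidyr, then cast back to wide using the count as the value (it needs to be converted from character to numeric in the step before).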

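# assumes reshape2 (melt, dcast), tidyr (separate) and dplyr (select, %>%);
# data is the sample data frame from the question
library(reshape2)
library(tidyr)
library(dplyr)

data %>%
  melt("V1") %>%
  separate(value, c("animal", "count"), ":", fill = "left") %>%
  transform(count = as.numeric(count)) %>%
  dcast(V1 ~ animal, value.var = "count", fun.aggregate = sum) %>%
  select(-"NA")  # drop the column produced by the empty cells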

#   V1 cat dog dolphin horse mouse
# 1  A   2   1       0     2     1
# 2  B   0   2       2     0     2
# 3  C   0   0       0     2     0
# 4  D   1   0       2     0     2
answered 2019-07-19T12:37:42.547