
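I have an obnoxious data table with a few different kinds of untidiness, and I can't figure out how to apply the other answers out there that use some combination of the tidyr and splitstackshape packages.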

library(data.table)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
data.table(cbind(subject, review))

This gives:

   subject                    review
1:       A               Bill: [1.0]
2:       B Bill: [2.0], Cathy: [3.0]
3:       C Fred: [4.0], Cathy: [2.0]

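This is untidy in two ways: multiple variables are stored in one column, and the values carry some ugly formatting.

What I want is a table like this: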

subject  Bill  Fred  Cathy
A        1.0   0.0   0.0
B        2.0   0.0   3.0
C        0.0   4.0   2.0

4 Answers


Here is an option using data.table:

library(data.table)
dcast(dt[, strsplit(review, ", "),  subject][, 
    c('v1', 'v2') := tstrsplit(V1, ":\\s+\\[|\\]")],
       subject ~ v1, value.var = 'v2', fill = 0)
#   subject Bill Cathy Fred
#1:       A  1.0     0    0
#2:       B  2.0   3.0    0
#3:       C    0   2.0  4.0

Data

dt <- data.table (subject, review) 
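
To see what the split pattern in the call above does, here is a minimal base R sketch. It uses strsplit() rather than tstrsplit(), and the names pieces, parts and fields are only illustrative. The pattern matches either the ": [" between name and score or the closing "]", so each piece breaks into a name and a score string.

```r
# Two example pieces, as they look after the first split on ", "
pieces <- c("Bill: [1.0]", "Cathy: [3.0]")

# The pattern matches either ": [" (colon, whitespace, opening bracket)
# or a closing "]"; strsplit() drops the trailing empty string
parts <- strsplit(pieces, ":\\s+\\[|\\]")
print(parts)

# Transposing gives one vector per field, which is the shape
# that tstrsplit() returns and that `:=` assigns to v1 and v2
fields <- list(v1 = sapply(parts, `[`, 1), v2 = sapply(parts, `[`, 2))
print(fields)
```

The scores stay character strings at this point, which is why the output above shows "1.0" next to a plain 0 from fill.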
answered 2018-03-09T01:33:48.927

This should do it. I recommend inspecting the intermediate results to understand the different steps:

# example setup
library(tidyverse)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)

# solution
dt %>% 
  # split on the comma plus any following space so names carry no leading blank
  separate_rows(review, sep = ",\\s*") %>%
  separate(review, c("name", "interval"), sep = ":\\s*") %>%
  mutate(interval = as.numeric(str_replace_all(interval, "\\[|\\]", ""))) %>%
  complete(subject, name) %>%
  replace_na(list(interval = 0)) %>%
  spread(name, interval)
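
If it helps to look at the intermediate results without any packages, the same long-then-wide idea can be sketched in base R. This is only an illustration under my own naming (pieces, long, wide are made up here), and it uses xtabs() for the final step instead of spread().

```r
subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")

# Step 1: split each review into "name: [score]" pieces, one list element per subject
pieces <- strsplit(review, ",\\s*")

# Step 2: long format, one row per (subject, reviewer) pair
long <- data.frame(
  subject = rep(subject, lengths(pieces)),
  piece = unlist(pieces),
  stringsAsFactors = FALSE
)
long$name <- sub(":.*$", "", long$piece)                       # text before the colon
long$score <- as.numeric(gsub("^.*\\[|\\]$", "", long$piece))  # number inside the brackets
print(long[, c("subject", "name", "score")])

# Step 3: wide format; xtabs() sums score per cell and leaves absent pairs at 0
wide <- xtabs(score ~ subject + name, data = long)
print(wide)
```

Printing long shows the five subject/reviewer rows before reshaping, which is the intermediate step worth checking. Note that wide is a contingency table rather than a data frame.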
answered 2018-03-08T22:30:33.483

The "splitstackshape" approach similarly splits to a "long" form first, then to a "wide" form, and then reshapes the data.

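library(data.table)
library(splitstackshape)
library(magrittr)

# DT is the question's data as a data.table; `:=` below modifies it by reference
DT <- data.table(subject, review)

DT %>% 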
  .[, review := gsub("\\[|\\]", "", review)] %>% 
  cSplit("review", ",", "long") %>% 
  cSplit("review", ":", "wide") %>% 
  dcast(subject ~ review_1, value.var = "review_2", fill = 0)
##    subject Bill Cathy Fred
## 1:       A    1     0    0
## 2:       B    2     3    0
## 3:       C    0     2    4
answered 2018-03-30T02:38:28.190

This could be another way.

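library(data.table)
library(tidyr)
library(reshape2)

# named tbl rather than t, which would mask base::t()
tbl <- data.table(subject, review)
# split each review on spaces so names and scores become separate tokens
tmp <- tbl[, .(text = unlist(strsplit(review, " ", fixed = TRUE))), by = subject]
# strip colons, commas and brackets, keeping letters, digits and dots
tmp$text <- gsub("[^[:alnum:][:space:].]", "", tmp$text)

# extract_numeric() returns NA for name tokens and a number for score tokens
# (it is deprecated in current tidyr in favor of readr::parse_number())
num <- extract_numeric(tmp$text)
tmp2 <- data.frame(subject = tmp$subject[is.na(num)],
                   col2 = tmp$text[is.na(num)],
                   col3 = num[!is.na(num)])
m <- reshape2::dcast(tmp2, subject ~ col2, value.var = "col3")
m[is.na(m)] <- 0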
answered 2018-03-09T18:39:30.703