r - Identify the highest "letter" within each "ID" in a longitudinal dataset, ignoring B

Question

I am trying to identify the highest score value for each ID in a longitudinal dataset.

Say my data looks like this,

dfL <- data.frame(ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 10L), week = c("baseline", 4L, 6L, "baseline", 6L, 9L, 9L, 12L, "baseline", 4L, 6L, 9L, 12L, "baseline"), score = c(NA, "A", "B", NA, "B", "E", "D", "C", NA, "B", "A", "A", "B", NA)); dfL
   ID     week score
1   1 baseline  <NA>
2   1        4     A
3   1        6     B
4   4 baseline  <NA>
5   4        6     B
6   4        9     E
7   4        9     D
8   4       12     C
9   9 baseline  <NA>
10  9        4     B
11  9        6     A
12  9        9     A
13  9       12     B
14 10 baseline  <NA>

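What I want to do is find the highest score for each ID, expressed as a letter and ignoring B, and then put that letter on the baseline row of that ID. The desired result looks like this,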

dfL$hi_score <- c("A", NA, NA, "E", NA, NA, NA, NA, "A", NA, NA, NA, NA, NA);dfL
   ID     week score hi_score
1   1 baseline  <NA>        A
2   1        4     A     <NA>
3   1        6     B     <NA>
4   4 baseline  <NA>        E
5   4        6     B     <NA>
6   4        9     E     <NA>
7   4        9     D     <NA>
8   4       12     C     <NA>
9   9 baseline  <NA>        A
10  9        4     B     <NA>
11  9        6     A     <NA>
12  9        9     A     <NA>
13  9       12     B     <NA>
14 10 baseline  <NA>     <NA>

For those who know how to solve this: can you recommend any books or web pages with good tutorials on manipulating longitudinal data?

score 2 · Accepted Answer

Here's a quick solution.

dfL <- data.frame(ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 10L), week = c("baseline", 4L, 6L, "baseline", 6L, 9L, 9L, 12L, "baseline", 4L, 6L, 9L, 12L, "baseline"), score = c(NA, "A", "B", NA, "B", "E", "D", "C", NA, "B", "A", "A", "B", NA));

#find the highest score per id excluding "B"
highestScore = by(dfL$score, dfL$ID, function(ids){ 
    head(rev(sort(ids[ids != "B"])), 1) 
})

dfL$hi_score = NA
for (id in names(highestScore)){
    best = as.character(highestScore[[id]])
    #IDs with no usable score (e.g. ID 10) give an empty result, so store NA
    best = if (length(best) == 0) NA else best
    #only update hi_score at the baseline row of this ID
    dfL[which(dfL$ID == id & dfL$week == "baseline"), "hi_score"] = best
}

dfL
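
The loop above can also be written as a lookup. What follows is only a minimal base R sketch, not the answer's own method. It assumes score is stored as character (hence stringsAsFactors = FALSE) and that "highest" means alphabetically last, as the desired output in the question suggests. The helper name best_letter is hypothetical.

```r
dfL <- data.frame(
  ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 10L),
  week = c("baseline", 4L, 6L, "baseline", 6L, 9L, 9L, 12L,
           "baseline", 4L, 6L, 9L, 12L, "baseline"),
  score = c(NA, "A", "B", NA, "B", "E", "D", "C", NA, "B", "A", "A", "B", NA),
  stringsAsFactors = FALSE  # assumption: keep score as character
)

# Hypothetical helper: alphabetically last letter, skipping NA and "B".
# Returns NA when nothing is left (e.g. an ID with only a baseline row).
best_letter <- function(x) {
  x <- x[!is.na(x) & x != "B"]
  if (length(x) == 0) NA_character_ else max(x)
}

# One value per ID, named by the ID
best <- tapply(dfL$score, dfL$ID, best_letter)

# Look up each row's ID, then keep the value only on baseline rows
per_row <- as.vector(best[as.character(dfL$ID)])
dfL$hi_score <- ifelse(dfL$week == "baseline", per_row, NA_character_)

dfL
```

Handling the empty case inside the helper avoids special-casing it later, which is what the length check in the loop does.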

As for the tutorials, it's all about practice. Reading the questions and answers on this site is a great start.

score 1 · Accepted Answer

I think this does the job.

dfL <- data.frame(ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 10L), week = c("baseline", 4L, 6L, "baseline", 6L, 9L, 9L, 12L, "baseline", 4L, 6L, 9L, 12L, "baseline"), score = c(NA, "A", "B", NA, "B", "E", "D", "C", NA, "B", "A", "A", "B", NA)); dfL
library(plyr)

dfL$score <- as.character(dfL$score)
dfL$score <- ifelse(dfL$score!="B",dfL$score,NA)
maxdat <- ddply(dfL,.(ID),summarise,hi_score=max(score,na.rm=TRUE))
finaldat <- merge(dfL, maxdat, by="ID")

If you really want missing values in every row other than the baseline week, you can do this:

finaldat$hi_score<- ifelse(finaldat$week=="baseline", finaldat$hi_score,NA)
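
For comparison, the same two steps (compute the per-ID maximum on every row, then blank out the non-baseline rows) can be sketched in base R with ave(), which needs no extra package and keeps the original row order. This is an illustrative sketch under the same assumptions as the answer (score held as character, "B" treated as missing). The names clean and group_max are made up for the example.

```r
dfL <- data.frame(
  ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 10L),
  week = c("baseline", 4L, 6L, "baseline", 6L, 9L, 9L, 12L,
           "baseline", 4L, 6L, 9L, 12L, "baseline"),
  score = c(NA, "A", "B", NA, "B", "E", "D", "C", NA, "B", "A", "A", "B", NA),
  stringsAsFactors = FALSE  # assumption: keep score as character
)

# Treat "B" as missing without touching the original column
clean <- dfL$score
clean[clean %in% "B"] <- NA

# ave() applies the function within each ID and recycles the result
# across that ID's rows, so group_max has one entry per row of dfL
group_max <- ave(clean, dfL$ID, FUN = function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else max(x)
})

# Keep the value only on baseline rows
dfL$hi_score <- ifelse(dfL$week == "baseline", group_max, NA_character_)

dfL
```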

If you want to learn more about data transformation you should certainly check Hadley's packages like reshape2 http://had.co.nz/reshape/ and plyr http://plyr.had.co.nz/.
