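I have a data frame containing monthly survey scores for a number of hospitals. For each month, we store the score the hospital received (the _Score column) and the corresponding average score across all hospitals for that month (the _Average column).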

Here is a short example of what it looks like:

df = data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

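library(dplyr)  # needed for select() and mutate() further down
library(tidyr)
# reshape from long to wide: one column per "<date>_<measure>" key
df = df %>% spread(key,value)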

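I would like to transform this data frame so that:

1) The first two columns, Hospital and Question, remain unchanged

2) Only the _Score columns for the latest three months are kept

3) Only the _Average column for the latest month is kept

4) Ideally, the columns are reordered from oldest to most recent, i.e. in this order: Month M-2_Score, Month M-1_Score, Month M_Score, Month M_Average

5) A final Variance column is computed as the difference between Score M and Score M-1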

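What I am trying to achieve

With dplyr, this can be done manually by reordering the columns. However, I am looking for a way to build logic that automatically arranges the columns for the latest 3 months in the order described above, by extracting the date values embedded in the column names and ordering the columns by them.

The resulting table would look like this: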

#Final table
df_transformed = df %>%
  select(1:2,8,10,12,11) %>%
  mutate(Variance=.[[5]]-.[[4]])

Any tips on how to do this more efficiently using the date values in the column names would be highly appreciated.

3 Answers

If the columns in your dataset are already in chronological order, here is a possible solution:

# create vectors of variables: 3 last "_Score" and 1 last "_Average"
score_vars <- tail(names(df)[grep("_Score", names(df))], 3)
average_var <- tail(names(df)[grep("_Average", names(df))], 1)

df %>% 
  select(Hospital, Question, !!score_vars, !!average_var) %>% 
  mutate(Variance = !!rlang::sym(score_vars[3]) - !!rlang::sym(score_vars[2]))

Output

# Hospital Question 2020-03-31_Score 2020-04-30_Score 2020-05-31_Score 2020-05-31_Average Variance
# 1 Hospital A       Q1             0.28             0.69             0.31               0.94    -0.38
# 2 Hospital B       Q1             0.19             0.41             0.27               0.91    -0.14
# 3 Hospital C       Q1             0.53             0.03             0.25               0.05     0.22
# 4 Hospital D       Q1             0.43             0.59             0.46               0.36    -0.13
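
The solution above relies on the columns already being in chronological order. As a rough sketch of how that assumption could be dropped, the dates can be parsed out of the column names and sorted explicitly. The helper name `latest_cols`, the `set.seed(1)` call and the deliberate column shuffle are assumptions added for illustration; they are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question (seed added here for reproducibility)
set.seed(1)
dates <- c("2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30", "2020-05-31")
df_long <- data.frame(
  Hospital = rep(paste("Hospital", c("A", "B", "C", "D")), each = 10),
  Question = "Q1",
  key      = rep(paste0(rep(dates, each = 2), c("_Average", "_Score")), 4),
  value    = round(runif(40, 0, 1), 2)
)
df_wide <- pivot_wider(df_long, names_from = key, values_from = value)

# Shuffle the value columns on purpose, to show that their order no longer matters
df_wide <- df_wide[, c(1, 2, sample(3:ncol(df_wide)))]

# Hypothetical helper: the n most recent columns with a given suffix, oldest first
latest_cols <- function(nms, suffix, n) {
  cols      <- nms[endsWith(nms, suffix)]
  col_dates <- as.Date(sub(suffix, "", cols, fixed = TRUE))
  tail(cols[order(col_dates)], n)
}

score_vars  <- latest_cols(names(df_wide), "_Score", 3)
average_var <- latest_cols(names(df_wide), "_Average", 1)

result <- df_wide %>%
  select(Hospital, Question, all_of(c(score_vars, average_var))) %>%
  # Variance = latest score minus the score of the month before
  mutate(Variance = .data[[score_vars[3]]] - .data[[score_vars[2]]])
```

This sketch assumes at least three `_Score` columns exist; with fewer, `score_vars[3]` would be `NA` and the `mutate()` step would fail.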
answered 2020-06-03T09:03:38.923

I didn't really get points 4 and 5, but they feel a bit like "can you do my homework for me?". For points 1 to 3, consider the following:

library(tidyverse)
library(lubridate)

df <- data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

# take the dataframe
df %>%
    # get month as a date and key separately
    mutate(month = str_replace(key, "_[[:alpha:]]*$", "") %>% ymd()
           , key = str_extract(key, "[[:alpha:]]*$")) %>%
    # filter Score for the last 3 and Average for the last 1 months
    filter(!(today() - month > months(3) & 
                 key == "Score")) %>%
    filter(!(today() - month > months(1) &
                 key == "Average"))
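
The filters above measure each month against `today()`, so what they keep depends on when the code is run. A possible variant, sketched here under the assumption that "latest" should mean the newest month present in the data, ranks the distinct months instead. The column name `recency` and the `set.seed(1)` call are illustrative additions.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Rebuild the long-format example data (seed added here for reproducibility)
set.seed(1)
dates <- c("2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30", "2020-05-31")
df_long <- data.frame(
  Hospital = rep(paste("Hospital", c("A", "B", "C", "D")), each = 10),
  Question = "Q1",
  key      = rep(paste0(rep(dates, each = 2), c("_Average", "_Score")), 4),
  value    = round(runif(40, 0, 1), 2)
)

latest <- df_long %>%
  # split the key into the month (as a Date) and the measure name
  mutate(month = ymd(str_remove(key, "_[[:alpha:]]+$")),
         key   = str_extract(key, "[[:alpha:]]+$")) %>%
  # rank the distinct months, 1 = most recent month in the data
  mutate(recency = dense_rank(desc(month))) %>%
  # keep Score for the 3 newest months and Average for the newest month only
  filter((key == "Score" & recency <= 3) | (key == "Average" & recency == 1))
```

Because the ranking is computed over distinct months rather than calendar distance, it should also cope with a missing month, in which case the three most recent available months are kept.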
answered 2020-06-03T08:53:28.703

I have used your original df in long format, i.e. as it was before the spread step.

library(dplyr)
library(tidyr)

df %>%
  #Bring date and key in separate columns
  separate(key, c('Date', 'key'), sep = '_') %>%
  #Convert date column to date class
  mutate(Date = as.Date(Date)) %>%
  #arrange data according with highest date first
  arrange(Hospital, key, desc(Date)) %>%
  #For each hospital and key
  group_by(Hospital, key) %>%
  #If it is a "score" column select top 3 values and 
  #for average column select only 1 value
  slice(if(first(key) == 'Score') 1:3 else 1) %>%
  select(-Question) %>%
  ungroup() %>%
  #Get the data in wide format
  pivot_wider(names_from = c(key, Date), values_from = value) %>%
  #Calculate variance column
  mutate(Variance = .[[3]] - .[[4]])

# A tibble: 4 x 6
#  Hospital   `Average_2020-05-31` `Score_2020-05-31` `Score_2020-04-30` `Score_2020-03-31` Variance
#  <chr>                     <dbl>              <dbl>              <dbl>              <dbl>    <dbl>
#1 Hospital A                 0.45               0.44               0.66               0.97    -0.22
#2 Hospital B                 0.11               0.53               0.68               0.27    -0.15
#3 Hospital C                 1                  0.18               0.56               0.41    -0.38
#4 Hospital D                 0.31               0.83               0.6                0.79     0.23

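The positions used in the variance calculation, .[[3]] - .[[4]], are fixed: the "Hospital" column is always the first column, the "Average" column comes before the "Score" columns (alphabetically), and because the data is sorted by Date in descending order, the latest date comes first, followed by the second latest, and so on. Column 3 is therefore always the latest Score and column 4 the Score of the month before.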
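
If relying on column positions feels fragile, one alternative is to build the two column names from the dates and refer to them by name. This is only a sketch: the variable names `month_dates`, `col_m` and `col_m1` and the `set.seed(1)` call are assumptions made for illustration, and the selection step uses a plain `filter()` in place of the grouped `slice()` shown above.

```r
library(dplyr)
library(tidyr)

# Rebuild the long-format example data (seed added here for reproducibility)
set.seed(1)
dates <- c("2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30", "2020-05-31")
df_long <- data.frame(
  Hospital = rep(paste("Hospital", c("A", "B", "C", "D")), each = 10),
  Question = "Q1",
  key      = rep(paste0(rep(dates, each = 2), c("_Average", "_Score")), 4),
  value    = round(runif(40, 0, 1), 2)
)

# Distinct months found in the keys, most recent first
month_dates <- sort(unique(as.Date(sub("_.*$", "", df_long$key))), decreasing = TRUE)

# Names the two latest Score columns will get after pivot_wider()
col_m  <- paste0("Score_", month_dates[1])  # month M
col_m1 <- paste0("Score_", month_dates[2])  # month M-1

result <- df_long %>%
  separate(key, c("Date", "key"), sep = "_") %>%
  mutate(Date = as.Date(Date)) %>%
  # keep Score for the 3 latest months and Average for the latest month
  filter((key == "Score" & Date %in% month_dates[1:3]) |
           (key == "Average" & Date == month_dates[1])) %>%
  pivot_wider(names_from = c(key, Date), values_from = value) %>%
  # refer to the columns by name, so their position no longer matters
  mutate(Variance = .data[[col_m]] - .data[[col_m1]])
```

With this form, the Variance column stays correct even if the sort order or the set of id columns changes.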

answered 2020-06-03T08:59:53.207