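I have a long table containing multiple variables (CPI - Workers, CPI - Consumers, (Seas) Unemployment Level (thous), and so on), but for brevity I have truncated the dataset to 3 variables and 6 time periods. I want to create a new variable that combines the first two. Let's call it CPI - Average, which is simply the average of the first two: (CPI - Workers + CPI - Consumers) / 2. This is a trivial calculation in a wide table, but to satisfy ggplot I store my data in long format.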
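Note that I store all of my variables in a single long table. When I need to visualize a trend, I filter down to the one or more variables I want inside the ggplot command.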
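To illustrate the filtering step just described, here is a minimal sketch. It assumes a toy table named `toy_long` (a hypothetical name, holding only the first three months from the dataset in this question) and assumes that `date` is stored as a `Date`; the choice of `geom_line()` is also just an assumption for illustration.

```r
library(data.table)
library(ggplot2)

# Toy long table (hypothetical name; values taken from the first three months)
toy_long <- data.table(
  date = rep(as.Date(c("1994-01-01", "1994-02-01", "1994-03-01")), times = 3),
  variable_name = rep(c("CPI - Workers", "CPI - Consumers",
                        "(Seas) Unemployment Level (thous)"), each = 3),
  value = c(143.8, 144.0, 144.3, 146.3, 146.7, 147.1, 8630, 8583, 8470)
)

# Keep only the series to be plotted
vars_to_plot <- c("CPI - Workers", "CPI - Consumers")
plot_data <- toy_long[variable_name %in% vars_to_plot]

# Draw one line per variable
p <- ggplot(plot_data, aes(x = date, y = value, colour = variable_name)) +
  geom_line()
print(p)
```

Because every series lives in the same `value` column, switching to a different set of variables only requires changing `vars_to_plot`.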
My question is: how can I create the new variable without first converting the data to wide format?
First, here is my dataset:
library(data.table)
DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date
1994 M01 January 143.8 "CPI - Workers" 1994-01-01
1994 M02 February 144.0 "CPI - Workers" 1994-02-01
1994 M03 March 144.3 "CPI - Workers" 1994-03-01
1994 M04 April 144.5 "CPI - Workers" 1994-04-01
1994 M05 May 144.8 "CPI - Workers" 1994-05-01
1994 M06 June 145.3 "CPI - Workers" 1994-06-01
1994 M01 January 146.3 "CPI - Consumers" 1994-01-01
1994 M02 February 146.7 "CPI - Consumers" 1994-02-01
1994 M03 March 147.1 "CPI - Consumers" 1994-03-01
1994 M04 April 147.2 "CPI - Consumers" 1994-04-01
1994 M05 May 147.5 "CPI - Consumers" 1994-05-01
1994 M06 June 147.9 "CPI - Consumers" 1994-06-01
1994 M01 January 8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994 M02 February 8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994 M03 March 8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994 M04 April 8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994 M05 May 7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994 M06 June 7927 "(Seas) Unemployment Level (thous)" 1994-06-01
'))
Second, the output of the calculation should look like this:
DT_long <- as.data.table(read.table(header=TRUE, text='year period periodName value variable_name date
1994 M01 January 143.8 "CPI - Workers" 1994-01-01
1994 M02 February 144.0 "CPI - Workers" 1994-02-01
1994 M03 March 144.3 "CPI - Workers" 1994-03-01
1994 M04 April 144.5 "CPI - Workers" 1994-04-01
1994 M05 May 144.8 "CPI - Workers" 1994-05-01
1994 M06 June 145.3 "CPI - Workers" 1994-06-01
1994 M01 January 146.3 "CPI - Consumers" 1994-01-01
1994 M02 February 146.7 "CPI - Consumers" 1994-02-01
1994 M03 March 147.1 "CPI - Consumers" 1994-03-01
1994 M04 April 147.2 "CPI - Consumers" 1994-04-01
1994 M05 May 147.5 "CPI - Consumers" 1994-05-01
1994 M06 June 147.9 "CPI - Consumers" 1994-06-01
1994 M01 January 8630 "(Seas) Unemployment Level (thous)" 1994-01-01
1994 M02 February 8583 "(Seas) Unemployment Level (thous)" 1994-02-01
1994 M03 March 8470 "(Seas) Unemployment Level (thous)" 1994-03-01
1994 M04 April 8331 "(Seas) Unemployment Level (thous)" 1994-04-01
1994 M05 May 7915 "(Seas) Unemployment Level (thous)" 1994-05-01
1994 M06 June 7927 "(Seas) Unemployment Level (thous)" 1994-06-01
1994 M01 January 145.05 "CPI - Average" 1994-01-01
1994 M02 February 145.35 "CPI - Average" 1994-02-01
1994 M03 March 145.70 "CPI - Average" 1994-03-01
1994 M04 April 145.85 "CPI - Average" 1994-04-01
1994 M05 May 146.15 "CPI - Average" 1994-05-01
1994 M06 June 146.60 "CPI - Average" 1994-06-01
'))
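The fourth variable (CPI - Average) is the average of the first two variables on each date. Please ignore the fact that this average makes no economic sense; I just wanted a simple calculation for this example.
A calculation like this is very simple in wide format. So let's first convert the data to wide and then do the calculation.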
library(dplyr)  # provides %>% and mutate()
library(tidyr)  # provides pivot_wider()
DT_wide <- DT_long %>% pivot_wider(names_from = variable_name, values_from = value)
DT_wide_with_average <- DT_wide %>% mutate(`CPI - Average` = (`CPI - Workers` + `CPI - Consumers`) / 2)
This takes the wide table and adds a new column containing the result of the calculation:
DT_wide_with_average <- as.data.table(read.table(header=TRUE, check.names=FALSE, text='year period periodName date "CPI - Workers" "CPI - Consumers" "(Seas) Unemployment Level (thous)" "CPI - Average"
1994 M01 January 1994-01-01 144. 146. 8630 145.
1994 M02 February 1994-02-01 144 147. 8583 145.
1994 M03 March 1994-03-01 144. 147. 8470 146.
1994 M04 April 1994-04-01 144. 147. 8331 146.
1994 M05 May 1994-05-01 145. 148. 7915 146.
1994 M06 June 1994-06-01 145. 148. 7927 147.
'))
Please ignore the fact that the decimals appear truncated in the printed wide table.
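Working in wide format, creating variables, analyzing them, revising calculations, reordering columns, and dropping unwanted columns is how we humans think when analyzing simple data tables.
Unfortunately, ggplot requires the long format, which the R gods consider "tidy" but which looks rather messy to us mere mortals. I'm sorry, but if I piled my sofa, table, chairs, lamps, and rug into one corner of the room, that would be messy, whereas if I left them arranged around the room as usual, they would be tidy. In the real world, I might pile the furniture into a corner in order to paint the room or sand the floor. That is useful for the task at hand, but it would be considered untidy and of no use for everyday living. So regarding long tables as tidy and wide tables as messy is counterintuitive. When I was first introduced to the tidyverse, it took me a long time to get my head around this counterintuitive logic. Sorry for the rant, but hopefully this is useful customer feedback for the R gods. At the very least, it would help R learners if the gods acknowledged the counterintuitive nomenclature. If I had been warned before entering the bathroom that the tap handle marked "H" was cold water and the one marked "C" was hot, I would have been less likely to scald my hand!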
Data analysis is iterative. I don't want to go through the following steps on every iteration:
- pivot_wider
- compute the new variable
- pivot_longer
- inspect the trend in ggplot
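For reference, the data-handling part of that round trip can be sketched as follows. This is only an illustration: `toy_long`, `toy_wide`, and `toy_long_new` are hypothetical names, the toy table is a tibble holding two months of the two CPI series, and the final ggplot step is omitted.

```r
library(dplyr)
library(tidyr)

# Toy long table (hypothetical; two months of the two CPI series)
toy_long <- tibble(
  date = rep(as.Date(c("1994-01-01", "1994-02-01")), times = 2),
  variable_name = rep(c("CPI - Workers", "CPI - Consumers"), each = 2),
  value = c(143.8, 144.0, 146.3, 146.7)
)

# Step 1: long -> wide, one column per variable
toy_wide <- toy_long %>%
  pivot_wider(names_from = variable_name, values_from = value)

# Step 2: compute the new variable as an ordinary column
toy_wide <- toy_wide %>%
  mutate(`CPI - Average` = (`CPI - Workers` + `CPI - Consumers`) / 2)

# Step 3: wide -> long again, ready for plotting
toy_long_new <- toy_wide %>%
  pivot_longer(cols = -date, names_to = "variable_name", values_to = "value")
```

Every new derived variable means repeating all three steps, which is the overhead in question.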
I would rather:
- compute the new variable
- inspect the trend in ggplot
In short, I want to focus on my economic analysis, not on unnecessary R programming.
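So, how can I select a subset of variables from my long-format table, use them in a calculation to create a new variable, and make sure the new variable gets rbind-ed onto the end of my long table... without having to convert to wide format?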
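To make the goal concrete, here is a minimal sketch of what such a long-format calculation could look like with data.table. It is one possible approach under stated assumptions, not necessarily the idiomatic one: `toy_long`, `cpi_vars`, and `cpi_avg` are hypothetical names, the toy table holds only two months, and the sketch assumes every selected variable has exactly one row per date.

```r
library(data.table)

# Toy long table (hypothetical; same columns as DT_long, two months only)
toy_long <- data.table(
  year = 1994L,
  period = rep(c("M01", "M02"), times = 3),
  periodName = rep(c("January", "February"), times = 3),
  value = c(143.8, 144.0, 146.3, 146.7, 8630, 8583),
  variable_name = rep(c("CPI - Workers", "CPI - Consumers",
                        "(Seas) Unemployment Level (thous)"), each = 2),
  date = rep(as.Date(c("1994-01-01", "1994-02-01")), times = 3)
)

# Variables that feed into the new one
cpi_vars <- c("CPI - Workers", "CPI - Consumers")

# Filter to those variables, then average within each time period.
# Assumption: each variable appears exactly once per period.
cpi_avg <- toy_long[variable_name %in% cpi_vars,
                    .(value = mean(value), variable_name = "CPI - Average"),
                    by = .(year, period, periodName, date)]

# Append the new rows; use.names = TRUE matches columns by name,
# because the column order of cpi_avg differs from toy_long
toy_long <- rbind(toy_long, cpi_avg, use.names = TRUE)
```

Note that `mean()` works here only because the new variable happens to be a plain average; a formula that treats the input variables differently would need another way of lining them up by date.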
Thanks for your help!