I'm using R to pull in data through an API and merge all of it into a single table, which I then write to a CSV file. To graph it properly in Tableau, however, I need to prepare the data with their reformatting tool for Excel, which converts it from a cross-tabulated format to one where each line contains only one piece of data. For example, taking something from the format:

ID,Gender,School,Math,English,Science
1,M,West,90,80,70
2,F,South,50,50,50

To:

ID,Gender,School,Subject,Score
1,M,West,Math,90
1,M,West,English,80
1,M,West,Science,70
2,F,South,Math,50
2,F,South,English,50
2,F,South,Science,50

Are there any existing tools in R or in an R library that would allow me to do this, or that would provide a starting point? I am trying to automate the preparation of data for Tableau so that I just need to run a single script to get it formatted properly, and would like to remove the manual Excel step if possible.

1 Answer

In R and several other programs, this process is known as "reshaping" data. In fact, the Tableau page you originally linked to mentions their "Excel Reshaper add-in".

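Base R has a few functions for reshaping data, such as the (notorious) reshape() function, which takes panel data from a wide format to a long format, and stack(), which creates a skinny stack of your data.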

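The "reshape2" package, however, seems to be more popular for these kinds of data transformations. Here's an example of "melting" your sample data, which I have stored in a data.frame named "mydf":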

library(reshape2)
melt(mydf, id.vars=c("ID", "Gender", "School"), 
     value.name="Score", variable.name="Subject")
#   ID Gender School Subject Score
# 1  1      M   West    Math    90
# 2  2      F  South    Math    50
# 3  1      M   West English    80
# 4  2      F  South English    50
# 5  1      M   West Science    70
# 6  2      F  South Science    50

For this example, base R's reshape() is not a very good fit, but stack() is. Here, I have stacked just the last three columns:

stack(mydf[4:6])
#   values     ind
# 1     90    Math
# 2     50    Math
# 3     80 English
# 4     50 English
# 5     70 Science
# 6     50 Science

To get the data.frame you are looking for, you would cbind the first three columns with the above output.
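As a rough sketch of that last step, something like the following should work in base R. The data frame is rebuilt from the sample in the question so the snippet runs on its own; the names `stacked`, `ids`, and `long` are illustrative only, and the identifier rows are repeated explicitly rather than relying on recycling.

```r
# Sample data from the question, rebuilt here so the snippet is self-contained
mydf <- data.frame(ID = 1:2,
                   Gender = c("M", "F"),
                   School = c("West", "South"),
                   Math = c(90, 50),
                   English = c(80, 50),
                   Science = c(70, 50))

# Stack the three score columns; the result has columns "values" and "ind"
stacked <- stack(mydf[4:6])

# stack() works column by column, so repeat the identifier rows once per
# stacked column to line them up with the stacked values
ids <- mydf[rep(seq_len(nrow(mydf)), times = 3), 1:3]

# Bind the identifiers to the stacked output, renaming the columns on the way
long <- cbind(ids, Subject = stacked$ind, Score = stacked$values)
rownames(long) <- NULL

# Optional: sort by ID to match the row order asked for in the question
long <- long[order(long$ID), ]
```

From there, `write.csv(long, "output.csv", row.names = FALSE)` would write the reshaped table to a file (the file name is a placeholder).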


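For reference, Hadley Wickham's Tidy Data paper is a good entry point for thinking about how the structure of data can facilitate further processing and visualization.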

answered 2013-08-26T15:41:37.617