r - Pivoting a CSV file using R

Question

I have a file that looks like this:

                 type          created_at repository_name
1         IssuesEvent 2012-03-11 06:48:31       bootstrap
2         IssuesEvent 2012-03-11 06:48:31       bootstrap
3         IssuesEvent 2012-03-11 06:48:31       bootstrap
4         IssuesEvent 2012-03-11 06:52:50       bootstrap
5         IssuesEvent 2012-03-11 06:52:50       bootstrap
6         IssuesEvent 2012-03-11 06:52:50       bootstrap
7   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
8   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
9   IssueCommentEvent 2012-03-11 07:03:57       bootstrap
10        IssuesEvent 2012-03-11 07:03:58       bootstrap
11        IssuesEvent 2012-03-11 07:03:58       bootstrap
12        IssuesEvent 2012-03-11 07:03:58       bootstrap
13         WatchEvent 2012-03-11 07:15:44       bootstrap
14         WatchEvent 2012-03-11 07:15:44       bootstrap
15         WatchEvent 2012-03-11 07:15:44       bootstrap
16         WatchEvent 2012-03-11 07:18:45        hogan.js
17         WatchEvent 2012-03-11 07:18:45        hogan.js
18         WatchEvent 2012-03-11 07:18:45        hogan.js

The dataset I am working with is available at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/twitter_events_mini.csv.

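I want to create a column for each entry in the "repository_name" column (e.g. bootstrap, hogan.js). Each of those columns should hold the values from the "type" column that belong to that entry (i.e. only rows whose "repository_name" is "bootstrap" should contribute their "type" value to the new "bootstrap" column). Hence:

Timestamps are only used for ordering and do not need to be synchronized across rows (in fact they can be dropped, since the data is already sorted by timestamp)
Even if "IssuesEvent" is repeated 10 times, I need to keep all of them, because I will be doing sequence analysis with the R package TraMineR
Columns may be of unequal length
There is no relationship between the columns of different repositories ("repository_name")

In other words, I want a table that looks like this: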

     bootstrap            hogan.js
1    IssuesEvent          PushEvent
2    IssuesEvent          IssuesEvent
3    IssueCommentEvent    WatchEvent

How can I do this in R?

Some of my failed attempts with the reshape package can be found at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/reshaping_bigqueries.R.

score 5 · Accepted Answer

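I just joined stackoverflow; I hope my answer is of some use.

By table, I assume you mean a data frame. However, the columns are unlikely to be of equal length, and the rows do not seem to carry much meaning anyway. Perhaps a list would be better?

Here is a messy solution: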

# Collect the event types of each repository, in their original order
names <- unique(olddataframe$repository_name)
results <- sapply(1:length(names), function(j){
    sapply(which(olddataframe$repository_name == names[j]), function(i){
        olddataframe$type[i]
    })
})
names(results) <- names
results
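
As a side note, sapply() simplifies its result to a matrix whenever every repository happens to have the same number of events. The sketch below shows the same idea with lapply(), which always returns a list. The data frame here is a small hypothetical stand-in for olddataframe, not the asker's file.

```r
# Hypothetical stand-in for olddataframe (not the real dataset)
olddataframe <- data.frame(
  type = c("IssuesEvent", "IssueCommentEvent", "WatchEvent", "WatchEvent"),
  repository_name = c("bootstrap", "bootstrap", "bootstrap", "hogan.js"),
  stringsAsFactors = FALSE
)

repos <- unique(olddataframe$repository_name)

# One vector of event types per repository, in original row order
results <- lapply(repos, function(repo) {
  olddataframe$type[olddataframe$repository_name == repo]
})
names(results) <- repos
results
```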

score 5 · Accepted Answer

Your sample data:

data <- structure(list(type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("IssueCommentEvent", 
"IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L), .Label = c("2012-03-11 06:48:31", "2012-03-11 06:52:50", 
"2012-03-11 07:03:57", "2012-03-11 07:03:58", "2012-03-11 07:15:44", 
"2012-03-11 07:18:45"), class = "factor"), repository_name = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L), .Label = c("bootstrap", "hogan.js"), class = "factor")), .Names = c("type", 
"created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
-18L))

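I gather from your expected output that when the same type shows up several times with the same created_at value, you want it only once; in other words, you want to remove duplicates: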

data <- unique(data)
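
To make the effect of unique() on a data frame concrete, here is a minimal sketch on a small hypothetical data frame (the values are made up for illustration). Only rows that are identical in every column are dropped, so the same type at a different timestamp is kept.

```r
# Hypothetical events: rows 1 and 2 are exact duplicates
events <- data.frame(
  type = c("IssuesEvent", "IssuesEvent", "IssuesEvent", "WatchEvent"),
  created_at = c("06:48:31", "06:48:31", "06:52:50", "07:15:44"),
  stringsAsFactors = FALSE
)

# Drops the second row only; the IssuesEvent at 06:52:50 survives
deduped <- unique(events)
deduped$type
```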

Then, to extract all the type entries for each repository_name, in the order they appear, you can simply use:

data.split <- split(data$type, data$repository_name)
data.split
# $bootstrap
# [1] IssuesEvent       IssuesEvent       IssueCommentEvent
# [4] IssuesEvent       WatchEvent       
# Levels: IssueCommentEvent IssuesEvent WatchEvent
# 
# $hogan.js
# [1] WatchEvent
# Levels: IssueCommentEvent IssuesEvent WatchEvent

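It returns a list, which is R's data structure of choice for a collection of vectors of different lengths.

Edit: now that you have provided an example of your output data, it is clear that your expected output is indeed a data.frame. You can convert the list above into a data.frame padded with NAs using the following function: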

list.to.df <- function(arg.list) {
   max.len  <- max(sapply(arg.list, length))
   arg.list <- lapply(arg.list, `length<-`, max.len)
   as.data.frame(arg.list)
}

df.out <- list.to.df(data.split)
df.out
#           bootstrap   hogan.js
# 1       IssuesEvent WatchEvent
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>

Then you can save it to a file using

write.csv(df.out, file = "out.csv", quote = FALSE, na = "", row.names = FALSE)

to get exactly the same output format as the one you posted on github.
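
If the file later needs to be read back into R, something along these lines should restore the NA padding. This is only a sketch: the data frame is a hypothetical one shaped like df.out, and the temporary file path is an assumption made so the example runs on its own.

```r
# Hypothetical padded data frame, shaped like df.out
df.out <- data.frame(
  bootstrap = c("IssuesEvent", "IssueCommentEvent", "WatchEvent"),
  hogan.js  = c("WatchEvent", NA, NA),
  stringsAsFactors = FALSE
)

# Temporary path used only for this sketch
out.file <- tempfile(fileext = ".csv")
write.csv(df.out, file = out.file, quote = FALSE, na = "", row.names = FALSE)

# Empty cells were written for NA, so map "" back to NA on the way in
df.back <- read.csv(out.file, na.strings = "", stringsAsFactors = FALSE)
df.back
```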

score 1 · Accepted Answer

Using @flodel's data object, you could also try aggregate(), but with many event types this will quickly become unreadable:

aggregate(list(Type = unique(data)$type), 
          list(Repository = unique(data)$repository_name), 
          function(x) paste0(x))
#   Repository                                                                 Type
# 1  bootstrap IssuesEvent, IssuesEvent, IssueCommentEvent, IssuesEvent, WatchEvent
# 2   hogan.js                                                           WatchEvent
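
If a single string per repository is enough, a variant of the same idea is to collapse each group with paste(). The sketch below uses the formula interface of aggregate() on a small hypothetical data frame; the " > " separator is an arbitrary choice for illustration.

```r
# Hypothetical events, already in time order
events <- data.frame(
  type = c("IssuesEvent", "IssueCommentEvent", "WatchEvent", "WatchEvent"),
  repository_name = c("bootstrap", "bootstrap", "bootstrap", "hogan.js"),
  stringsAsFactors = FALSE
)

# Arguments after FUN (here, collapse) are passed on to paste()
collapsed <- aggregate(type ~ repository_name, data = events,
                       FUN = paste, collapse = " > ")
collapsed
```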

You could also try reshape() with a bit of trickery using t() (transpose), as follows.

temp = unique(data)
temp = reshape(temp, direction = "wide", 
               idvar="repository_name", timevar="created_at")
# If you want to keep the times, remove `row.names=NULL` below
temp1 = data.frame(t(temp[-1]), row.names=NULL)
names(temp1) = t(temp[1])
temp1
#           bootstrap   hogan.js
# 1       IssuesEvent       <NA>
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>
# 6              <NA> WatchEvent

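However, I find all those NAs annoying. I would say @flodel's answer is the most direct and probably the most useful in the long run (that is, without knowing what you want to do once you have the data in this form).

Update (more trickery)

(Really, this is a "perfect for procrastinating" moment)

My final (very inefficient) answer is as follows.

Proceed as above, but drop the date/time stuff and convert from factors to characters.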

# Using @flodel's data
temp1 = unique(data)[-2]
# Remove the factors
temp1[sapply(temp1, is.factor)] = lapply(temp1[sapply(temp1, is.factor)], 
                                         as.character)
# Split and unlist your data
temp2 = split(temp1[-c(2:3)], temp1$repository_name)
temp3 = sapply(temp2, as.vector)

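rbind() and cbind() will "recycle" objects of different lengths to make them the same length, but we don't want that. So we need to force R into believing that the lengths are the same. To do that, find the maximum length. While we are at it, extract a cleaned-up version of the names in the temp3 object.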

# What is the max number of rows we need?
LEN = max(sapply(temp3, length))
# What are the names we want for our columns?
NAMES = gsub(".type", "", names(temp3))

Now, extract the items from temp3 into your workspace and make sure they are all the same length.

# Use assign to unlist the vectors to the workspace
for (i in 1:length(temp3)) assign(NAMES[i], temp3[[i]])
# Make sure they have the same lengths
length(hogan.js) = LEN
length(bootstrap) = LEN

Finally, use cbind() to put your data together.

# Use cbind to put these together
data.frame(cbind(bootstrap, hogan.js))
#           bootstrap   hogan.js
# 1       IssuesEvent WatchEvent
# 2       IssuesEvent       <NA>
# 3 IssueCommentEvent       <NA>
# 4       IssuesEvent       <NA>
# 5        WatchEvent       <NA>
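
The difference between recycling and NA padding mentioned above can be seen in a small sketch. The two vectors are hypothetical and chosen so that the shorter length divides the longer one evenly, which is the case where cbind() recycles without any warning.

```r
# Hypothetical event sequences of unequal length
long  <- c("IssuesEvent", "IssuesEvent", "IssueCommentEvent", "WatchEvent")
short <- c("WatchEvent", "PushEvent")

# cbind() silently repeats the shorter vector to fill the column
recycled <- cbind(long, short)

# Extending the length first pads with NA instead of repeating values
length(short) <- length(long)
padded <- cbind(long, short)

recycled
padded
```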
