database - Combining multiple observations in R

Question

I have a flat file of tweets and want to aggregate their attributes by user.

For example:

user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8

I would like to convert this to:

user1, hashtag1, hashtag2, hashtag3, hashtag4
user2, hashtag5, hashtag6, hashtag7, hashtag8

Is there an elegant way to do this?

score 3 · Accepted Answer

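Unless every user always has the same number of hashtags, I would aggregate the results into a list. Each element of the list is a (possibly variable-length) vector of one user's hashtags.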

# Read in your example data
df <- read.table(text="user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8", sep=",", header=FALSE, stringsAsFactors=FALSE)


lapply(split(df[-1], df[1]), function(X) unname(unlist(X)))
# $user1
# [1] " hashtag1"  " hashtag3"  " hashtag2 " " hashtag4 "
# 
# $user2
# [1] " hashtag5"  " hashtag7"  " hashtag6 " " hashtag8"
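The output above lists each user's tags column by column (tag 1 of every row, then tag 2). If row order matters, and to see the variable-length case the list is meant to handle, here is a minimal sketch. The data frame `df2` and its column names are made up for illustration; transposing each subset before flattening is one way to get row-wise order, not part of the original answer.

```r
# Hypothetical data: user1 has two rows of tags, user2 has only one
df2 <- data.frame(
  user = c("user1", "user1", "user2"),
  tag1 = c("hashtag1", "hashtag3", "hashtag5"),
  tag2 = c("hashtag2", "hashtag4", "hashtag6"),
  stringsAsFactors = FALSE
)

# Split the tag columns by user; t() turns each subset into a matrix with
# one column per original row, so as.vector() reads the tags row by row
tags_by_user <- lapply(split(df2[-1], df2$user), function(X) as.vector(t(X)))

# Number of tags collected per user
lengths(tags_by_user)
```

Because the result is a list, the users can carry different numbers of tags without any padding.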

score 1 · Accepted Answer

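You are looking for reshaping. Either the reshape command (its syntax is painful, but basically you want to go from "long" to "wide", with "user" as your id variable) or the reshape2 package's melt followed by dcast will do what you want.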
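As a rough sketch of the long-to-wide idea using base R's reshape() (chosen here only to avoid a package dependency; the melt/dcast route is analogous), assuming the example data with illustrative column names `user`, `tag1`, `tag2` and a helper column `idx` that are not in the original:

```r
# Example data with assumed column names
x <- data.frame(
  user = c("user1", "user1", "user2", "user2"),
  tag1 = c("hashtag1", "hashtag3", "hashtag5", "hashtag7"),
  tag2 = c("hashtag2", "hashtag4", "hashtag6", "hashtag8"),
  stringsAsFactors = FALSE
)

# Long format: one row per (user, tag), reading the tags row by row
n_tags <- ncol(x) - 1
long <- data.frame(
  user = rep(x$user, each = n_tags),
  tag  = as.vector(t(x[-1])),
  stringsAsFactors = FALSE
)

# Number the tags within each user; this serves as the "time" variable
long$idx <- ave(seq_along(long$user), long$user, FUN = seq_along)

# Wide format: one row per user, one column per tag position
wide <- reshape(long, idvar = "user", timevar = "idx", direction = "wide")
wide
```

If users had different numbers of tags, the shorter rows would be padded with NA in the wide result.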

Alternatively, since it seems likely that the number of hashtags will vary, you could use plyr:

> colnames(x) <- c("user","tag1","tag2")
> 
> library(plyr)
> extract.hashtags <- function(x) {
+   x <- subset(x,select=c(-user))
+   mat <- as.matrix(x)
+   dim(mat) <- c(1,length(mat))
+   as.data.frame(mat)
+ }
> ddply(x, .(user), extract.hashtags )
   user       V1       V2       V3       V4
1 user1 hashtag1 hashtag3 hashtag2 hashtag4
2 user2 hashtag5 hashtag7 hashtag6 hashtag8

score 1 · Accepted Answer

One approach is to use the aggregate() function. From ?aggregate:

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

First, read in your data (you should do this in future questions to provide a reproducible example; see: How to make a great R reproducible example?):

txt <- "user1, hashtag1, hashtag2 
user1, hashtag3, hashtag4 
user2, hashtag5, hashtag6 
user2, hashtag7, hashtag8"

x <- read.delim(file = textConnection(txt), header = FALSE, sep = ",", 
        strip.white = TRUE, stringsAsFactors = FALSE)

Then use aggregate() to split the data into subsets and convert each subset into a one-dimensional array:

aggregate(x[-1], by = x[1], function(z)
        {
            dim(z) <- c(length(z)) # Change dimensions of z to 1-dimensional array
            z
        })
#      V1     V2.1     V2.2     V3.1     V3.2
# 1 user1 hashtag1 hashtag3 hashtag2 hashtag4
# 2 user2 hashtag5 hashtag7 hashtag6 hashtag8

Edit

This approach only works if all users have the same number of hashtags, which seems unlikely. @Josh O'Brien's answer is the better approach.
