r - How to aggregate tweets per minute

Question

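I've been doing some fun Twitter mining. I used Twitter's streaming API and recorded tweets before, during, and after a football match. Now I want to make a ggplot2 graph showing the tweet frequency over the course of the match.

In the original data frame I have one row per tweet, and a variable "created_at" that contains a date like this: Sat Dec 13 13:04:34 +0000 2014

Then I changed the time format like this: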

tweets$format <- as.POSIXct(tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", tz="")

and got this: 2014-12-13 14:04:34 CET. Since I don't need the date, I thought I could get rid of it:

tweets$Uhrzeit <- sub(".* ", "", tweets$format)

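With this, I am left with just the time: 14:04:34.

My problem is that I want to analyse the tweet frequency at a resolution of tweets per minute. How can I aggregate the tweets per minute? As I said before, every row is one tweet. I made a data frame with the time and a second variable containing "1". Now I want to count (aggregate, sum) the second variable per minute. I tried to find a solution and read about the zoo and chron libraries, but it left me confused.

I hope someone can help me.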

Edit: reproducible data. The data frame is a subset of this: names(tweets)

 [1] "X"                         "text"                      "retweet_count"            
 [4] "favorited"                 "truncated"                 "id_str"                   
 [7] "in_reply_to_screen_name"   "source"                    "retweeted"                
[10] "created_at"                "in_reply_to_status_id_str" "in_reply_to_user_id_str"  
[13] "lang"                      "listed_count"              "verified"                 
[16] "location"                  "user_id_str"               "description"              
[19] "geo_enabled"               "user_created_at"           "statuses_count"           
[22] "followers_count"           "favourites_count"          "protected"                
[25] "user_url"                  "name"                      "time_zone"                
[28] "user_lang"                 "utc_offset"                "friends_count"            
[31] "screen_name"               "country_code"              "country"                  
[34] "place_type"                "full_name"                 "place_name"               
[37] "place_id"                  "place_lat"                 "place_lon"                
[40] "lat"                       "lon"                       "expanded_url"             
[43] "url"                       "timeformat"

I converted the "created_at" variable to the "timeformat" variable, which looks like this:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")

I just plotted the data. With stat="bin", the default bin width is 1/30 of the data range. It would be better to have one bin per minute.

ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")


score 2 · Accepted Answer

Given your example data set:

tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1), stringsAsFactors=FALSE)
colnames(tweets.df)<-c("time","freq")

First, your time column contains text strings, and you need POSIXct objects:

tweets.df$time <- as.POSIXct(tweets.df$time)
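One detail worth knowing about the conversion above: when as.POSIXct parses these strings, the trailing "CET" is ignored, so the times are interpreted in the session's time zone. A minimal sketch of making that explicit follows; the choice of "Europe/Berlin" and the names `x` and `parsed` are assumptions for illustration, not something stated in the question.

```r
# Two sample timestamps in the same shape as the question's data.
x <- c("2014-12-13 14:04:34 CET", "2014-12-13 14:05:23 CET")

# Assumption: the wall-clock times are Central European Time, so we name
# the zone explicitly instead of relying on the session default.
# The trailing " CET" text is not consumed by the format and is ignored.
parsed <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "Europe/Berlin")

# Show only the clock time of each parsed value.
format(parsed, "%H:%M:%S")
```

Fixing the zone this way keeps the minute bins stable if the script is later run on a machine with a different default time zone.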

Then, bin by minute using the cut.POSIXt function:

by.mins <- cut.POSIXt(tweets.df$time,"mins")
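To see what this binning step produces, here is a small sketch on three made-up timestamps (the names `times` and `bins` are illustrative). It calls the generic cut(), which dispatches to cut.POSIXt for date-time input. The result is a factor with one level per minute between the first and last observation, including minutes in which nothing was recorded.

```r
# Three sample times: two in minute 14:06, none in 14:07, one in 14:08.
times <- as.POSIXct(c("2014-12-13 14:06:33",
                      "2014-12-13 14:06:59",
                      "2014-12-13 14:08:16"), tz = "UTC")

# cut() on POSIXct values dispatches to cut.POSIXt; "min" requests 1-minute bins.
bins <- cut(times, "min")

# One level per minute in the covered range, even for the empty minute.
nlevels(bins)

# Counting per level therefore reports a zero for the empty minute.
table(bins)
```

Because empty minutes are kept as levels, any later per-level summary shows them as 0 rather than silently dropping them.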

Then you want to split your data frame by this and sum the freq column on each subset:

tweets.mins <- split(tweets.df, by.mins)
sapply(tweets.mins,function(x)sum(as.integer(x$freq)))
2014-12-13 14:04:00 2014-12-13 14:05:00 2014-12-13 14:06:00 2014-12-13 14:07:00 2014-12-13 14:08:00 
                  3                   3                   3                   0                   1 
2014-12-13 14:09:00 2014-12-13 14:10:00 2014-12-13 14:11:00 2014-12-13 14:12:00 2014-12-13 14:13:00 
                  2                   3                   2                   2                   0 
2014-12-13 14:14:00 2014-12-13 14:15:00 2014-12-13 14:16:00 2014-12-13 14:17:00 2014-12-13 14:18:00 
                 20                   2                   2                   4                   2 
2014-12-13 14:19:00 
                  1

In this case, since freq is always equal to 1, this is equivalent to using table(by.mins).
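Since the original goal was a ggplot2 graph, the per-minute counts can be turned into a data frame that ggplot2 can plot directly. The following is only a sketch on made-up timestamps; the names `times`, `counts`, `per_min`, and `p` are illustrative, and the plotting part assumes ggplot2 is installed.

```r
# Sample times: two tweets in 14:04, none in 14:05, one in 14:06.
times <- as.POSIXct(c("2014-12-13 14:04:34",
                      "2014-12-13 14:04:37",
                      "2014-12-13 14:06:33"), tz = "UTC")

# Count tweets per 1-minute bin; empty minutes are kept with a count of 0.
counts <- table(cut(times, "min"))

# Convert the bin labels back to POSIXct so the x axis is a real time axis.
per_min <- data.frame(
  minute = as.POSIXct(names(counts), tz = "UTC"),
  tweets = as.vector(counts)
)

# Build the line chart only if ggplot2 is available (assumed, not required).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(per_min, ggplot2::aes(x = minute, y = tweets)) +
    ggplot2::geom_line()
}
```

Calling print(p) then draws one point per minute, with quiet minutes shown at zero instead of being interpolated away.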
