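I'm doing some Twitter mining for fun. I used Twitter's streaming API and recorded tweets before, during, and after a football match. Now I want to make a ggplot2 graph that shows the tweet frequency over the course of the match.
In the original data frame I have one row per tweet, plus a variable "created_at" that contains dates like this: Sat Dec 13 13:04:34 +0000 2014
Then I changed the time format like this: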
tweets$format <- as.POSIXct(tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "")
and got this: 2014-12-13 14:04:34 CET
Since I don't need the date, I thought I could get rid of it:
tweets$Uhrzeit <- sub(".* ", "", tweets$format)
With this, I'm left with just the time: 14:04:34
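As an aside, the time of day can also be extracted without a regular expression. The following is only a minimal sketch: it assumes an English locale (so that %a and %b match "Sat" and "Dec") and uses "Europe/Berlin" as an assumed stand-in for the local CET zone; the variable names are illustrative.

```r
# Sample value copied from the question.
created_at <- "Sat Dec 13 13:04:34 +0000 2014"

# %a and %b are locale dependent; this assumes English day and month names.
parsed <- as.POSIXct(created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# Format the time of day in an explicit zone instead of the session default.
uhrzeit <- format(parsed, "%H:%M:%S", tz = "Europe/Berlin")
print(uhrzeit)
```

Keeping the POSIXct column alongside the character time is probably useful, since the character version can no longer be used for date-time arithmetic or a date-time axis.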
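My problem is that I want to analyse the tweet frequency at a resolution of tweets per minute. How can I aggregate the tweets per minute? As I said before, each row is one tweet. I made a data frame with the time and a second variable that contains "1". Now I want to aggregate (sum) the second variable per minute. I tried to find a solution and read about the zoo library and the chron library, but they confused me.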
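One possible way to count rows per minute in base R, sketched here under assumptions: the timestamps are a handful of values from the sample data, the zone "Europe/Berlin" is assumed, and the names times and per_minute are made up for illustration.

```r
# A few timestamps taken from the sample data in the question.
times <- as.POSIXct(
  c("2014-12-13 14:04:34", "2014-12-13 14:04:37", "2014-12-13 14:04:45",
    "2014-12-13 14:05:23", "2014-12-13 14:05:53", "2014-12-13 14:05:58",
    "2014-12-13 14:06:33", "2014-12-13 14:08:16"),
  tz = "Europe/Berlin"
)

# cut() with breaks = "min" assigns each timestamp to its minute;
# table() then counts the rows per minute, including minutes with no tweets.
per_minute <- as.data.frame(table(minute = cut(times, breaks = "min")),
                            responseName = "tweets")

# Turn the factor labels back into date-times so they can go on a time axis.
per_minute$minute <- as.POSIXct(as.character(per_minute$minute),
                                tz = "Europe/Berlin")
print(per_minute)
```

Because every row is one tweet, counting rows gives the same result as summing a helper column of ones, so the second variable is not strictly needed for this approach.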
I hope someone can help me.
EDIT: reproducible data
The data frame is a subset of this:
names(tweets)
[1] "X" "text" "retweet_count"
[4] "favorited" "truncated" "id_str"
[7] "in_reply_to_screen_name" "source" "retweeted"
[10] "created_at" "in_reply_to_status_id_str" "in_reply_to_user_id_str"
[13] "lang" "listed_count" "verified"
[16] "location" "user_id_str" "description"
[19] "geo_enabled" "user_created_at" "statuses_count"
[22] "followers_count" "favourites_count" "protected"
[25] "user_url" "name" "time_zone"
[28] "user_lang" "utc_offset" "friends_count"
[31] "screen_name" "country_code" "country"
[34] "place_type" "full_name" "place_name"
[37] "place_id" "place_lat" "place_lon"
[40] "lat" "lon" "expanded_url"
[43] "url" "timeformat"
I converted the "created_at" variable into the "timeformat" variable, like this:
tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")
I simply plotted the data. By default, stat="bin" uses a bin width of 1/30 of the data range. It would be better to have one bin per minute.
ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")
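A sketch of how one-minute bins might be requested directly in the plot, assuming timeformat is a POSIXct column. The small data frame below is a hypothetical stand-in built from a few of the sample timestamps, and "Europe/Berlin" is an assumed zone.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: a few timestamps from the sample.
tweets <- data.frame(timeformat = as.POSIXct(
  c("2014-12-13 14:04:34", "2014-12-13 14:04:37", "2014-12-13 14:04:45",
    "2014-12-13 14:05:23", "2014-12-13 14:05:53", "2014-12-13 14:06:33",
    "2014-12-13 14:08:16"),
  tz = "Europe/Berlin"
))

# POSIXct values are measured in seconds, so binwidth = 60 means one minute;
# boundary = 0 aligns the bin edges to whole minutes.
p <- ggplot(tweets, aes(x = timeformat)) +
  geom_line(stat = "bin", binwidth = 60, boundary = 0)
print(p)
```

Here binwidth and boundary are passed through geom_line() to the bin stat, which replaces the default of 30 bins across the data range.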