r - Analyzing Twitter data using R

Question

I am trying to analyze Twitter data using R by plotting the number of tweets over a period of time. When I write

plot(tweet_df$created_at, tweet_df$text)

I get this error message:

Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf

Here is the code I used:

library("rjson")
json_file <- "tweet.json"
json_data <- fromJSON(file=json_file)
library("streamR")
tweet_df <- parseTweets(tweets=file)
#using the twitter data frame
tweet_df$created_at
tweet_df$text
plot(tweet_df$created_at, tweet_df$text)

score 3 · Accepted Answer

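You have a few problems here, but none of them is insurmountable. If you want to track tweets over time, what you are really asking for is the number of tweets created per time interval (tweets per minute, per second, and so on). That means you only need the created_at column, and you can build the graph with R's hist function.

If you want to split by words mentioned in the text or by something else, that is also doable, but you should probably use ggplot2 for it, and probably ask a separate question. In any case, it looks like parseTweets returns the Twitter timestamp as a character field, which plot() cannot coerce to numbers (hence the NA warnings), so you need to convert it to a POSIXct timestamp that R can understand. Suppose you have a data frame that looks like this: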

❥ head(tweet_df[,c("id_str","created_at")])
              id_str                     created_at
1 597862782101561346 Mon May 11 20:36:09 +0000 2015
2 597862782097346560 Mon May 11 20:36:09 +0000 2015
3 597862782105694208 Mon May 11 20:36:09 +0000 2015
4 597862782105694210 Mon May 11 20:36:09 +0000 2015
5 597862782076198912 Mon May 11 20:36:09 +0000 2015
6 597862782114078720 Mon May 11 20:36:09 +0000 2015

You can do this:

❥ dated_tweets <- as.POSIXct(tweet_df$created_at, format = "%a %b %d %H:%M:%S +0000 %Y", tz = "UTC")
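
As a quick sanity check of this kind of conversion, here is a minimal sketch on a few hand-written timestamps in the same format as the created_at column. The sample_created_at and sample_times names are made up for illustration. Because %a and %b match day and month names in the current locale, the sketch switches LC_TIME to "C" (English names) first; passing tz = "UTC" is my assumption based on the literal +0000 offset in the strings.

```r
# Use the C locale so %a / %b match English day and month names
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical sample in the same format as the created_at column
sample_created_at <- c(
  "Mon May 11 20:36:09 +0000 2015",
  "Mon May 11 20:36:10 +0000 2015",
  "Mon May 11 20:37:45 +0000 2015"
)

# The literal +0000 in the format string is matched as plain text,
# so the time zone is stated explicitly with tz = "UTC"
sample_times <- as.POSIXct(sample_created_at,
                           format = "%a %b %d %H:%M:%S +0000 %Y",
                           tz = "UTC")

class(sample_times)   # should include "POSIXct"
anyNA(sample_times)   # TRUE would mean some strings failed to parse
range(sample_times)   # earliest and latest timestamp in the sample
```

If anyNA() returns TRUE on real data, the usual suspects are a non-English locale or rows whose created_at value does not follow this format.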

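The as.POSIXct call gives you a vector of tweet timestamps in R's date-time format (POSIXct). You can then plot them as follows. I had a sample Twitter feed open for 15 minutes or so. Here is the result: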

❥ hist(dated_tweets, breaks ="secs", freq = TRUE)

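
If you would rather have the counts as numbers than as a picture, the same binning can be sketched with cut() and table(). The tweet_times vector below is made-up data standing in for dated_tweets, and one-minute bins are just one possible choice of interval.

```r
# Hypothetical timestamps standing in for dated_tweets
tweet_times <- as.POSIXct("2015-05-11 20:36:09", tz = "UTC") +
  c(0, 1, 20, 65, 70, 130)

# Assign each timestamp to the minute it falls in, then count per minute
per_minute <- table(cut(tweet_times, breaks = "min"))
per_minute

# The same one-minute bins drawn as a histogram
hist(tweet_times, breaks = "mins", freq = TRUE)
```

With real data, pass dated_tweets instead of tweet_times. Changing breaks to "hour" or "sec" changes the interval in the same way.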
