I text-mined Twitter data with the rtweet package, but after saving it to a data frame I cannot make full use of packages such as ggplot2. What am I doing wrong?
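I used the rtweet library to text-mine some Twitter data, collecting it in batches so that I would not exceed the API rate limit. After collecting all the data I needed, I merged it into a single data frame. I installed the dplyr and ggplot2 packages and wanted to visualize tweets over time, but the time variable is not recognized as such when it comes from the data frame. However, if I use one of the mined batches under its original name, it is recognized as a time variable and plots fine. Here is the code I used to mine the data, save it to a data frame, and combine everything into one data frame.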
library(rtweet)
library(dplyr)
library(ggplot2)
##data splits used for mining - 7/8 companies at a time
#1st batch
food_comp1 <- get_timelines(c("accentcatering","angussoftfruits","wmbarrowcliffes","brewdog","brewhouse","capestonefarm","stpierregroupe"), n=3200)
#2nd batch
food_comp2 <- get_timelines(c("cavedirect","brockmoor","cherryfieldltd","fazendagroup","ddcfoods","dairypartners","drakeandmorgan"), n=3200)
#3rd batch
food_comp3 <- get_timelines(c("drinkwarehouse","etsteas","fentimansltd"), n=3200)
#4th batch
food_comp4 <- get_timelines(c("goustocooking","edinburgh_gin","innisandgunn","grapetreefoods","kkfinefoods","finnebrogue","thewhiskyshop"), n=3200)
#5th batch
food_comp5 <- get_timelines(c("onarollsandwich","potsandco","primacheese","purecircle","silburyoils","thealchemistuk"), n=3200)
#6th batch
food_comp6 <- get_timelines(c("artisan_glasgow","thebigprawnco","foodfellas","freshfoodco","wagyurestaurant","charlesfaram","wenzelsthebaker","westerrosssalmo"), n=3200)
#7th batch
food_comp7 <- get_timelines(c("elitefoods_","fevertreemixers","gordon_macphail","gosh_freefrom","specialitydrink"), n=3200)
#merging the datasets into 1
comp <- rbind(food_comp1,food_comp2,food_comp3,food_comp4,food_comp5,food_comp6,food_comp7)
write_as_csv(comp, file_name = "comp", prepend_ids = TRUE, na="", fileEncoding = "UTF-8")
comp <- read.csv("comp.csv", header = TRUE)
View(comp)
#creating a subset with the variables I need
compsubset <- subset(comp, select=c("created_at","screen_name","text","display_text_width","favorite_count","retweet_count","hashtags","media_type","lang"))
write_as_csv(compsubset, file_name = "compsubset", prepend_ids = TRUE, na="", fileEncoding = "UTF-8")
##final dataset with 9 variables
compsubset <- read.csv("compsubset.csv", header=TRUE)
#attempting to create a ggplot with created_at as the time variable (displays year-date-time)
ggplot(data = compsubset, aes(x = created_at)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
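Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
The same code with one of the mined batches, before it was converted to a data frame: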
ggplot(data = food_comp1, aes(x = created_at)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
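After this last ggplot call I get a nice bar chart of tweets over time.
Something similar happens when I try to filter tweets by time with dplyr: it fails on the data frame, but works fine on the mined data before it is put into the data frame.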
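One way to narrow down the difference between the two plots might be to compare the class of `created_at` before and after a CSV round trip. The sketch below is only an illustration under assumptions: `tweets` is a made-up stand-in for a batch returned by `get_timelines()`, and base R's `write.csv()` is used in place of rtweet's `write_as_csv()`.

```r
# Hypothetical stand-in for one mined batch (values are invented)
tweets <- data.frame(
  created_at  = as.POSIXct(c("2018-03-01 10:15:00", "2018-07-12 18:40:00"),
                           tz = "UTC"),
  screen_name = c("brewdog", "fentimansltd"),
  stringsAsFactors = FALSE
)

# Before saving: a date-time column, which ggplot2 treats as continuous
class(tweets$created_at)

# Round trip through a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(tweets, tmp, row.names = FALSE)
reread <- read.csv(tmp, header = TRUE)

# After reading back: plain text (character in R >= 4.0.0, factor in older
# versions), which ggplot2 treats as discrete
class(reread$created_at)
```

Running the same two `class()` calls on `food_comp1$created_at` and `compsubset$created_at` would show whether the real data behaves the same way.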
comptrial %>%
  dplyr::filter(created_at >= "2018-01-01") %>%
  dplyr::filter(created_at < "2019-01-01") %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("days", trim = 1L) +
  ggplot2::geom_point() +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.title = ggplot2::element_blank(),
    legend.position = "bottom",
    plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Twitter statuses posted by SMEs",
    subtitle = "Twitter status (tweet) counts",
    caption = "\nSource: Data collected from SME's REST API via rtweet"
  )
Error in seq.POSIXt(data[[dtvar]][1], data[[dtvar]][length(data[[dtvar]])], :
  'to' must be of length 1
In addition: Warning messages:
1: In Ops.factor(created_at, "2018-01-01") : '>=' not meaningful for factors
2: Factor screen_name contains implicit NA, consider using forcats::fct_explicit_na
3: Factor screen_name contains implicit NA, consider using forcats::fct_explicit_na
4: Factor screen_name contains implicit NA, consider using forcats::fct_explicit_na
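The `Ops.factor` warning suggests that `created_at` is a factor by the time it reaches `filter()`. A minimal sketch of converting such a column back to a date-time before comparing it is shown below. The data frame `tweets_csv`, its values, the `"%Y-%m-%d %H:%M:%S"` format and the UTC time zone are all assumptions for illustration, and the real file may store timestamps differently.

```r
# Hypothetical data frame as it might look after read.csv() in older R,
# with text columns stored as factors (values are invented)
tweets_csv <- data.frame(
  created_at  = factor(c("2017-12-31 23:59:00",
                         "2018-03-01 10:15:00",
                         "2019-01-02 08:00:00")),
  screen_name = factor(c("brewdog", "brewdog", "fentimansltd"))
)

# as.character() makes the conversion explicit whether the column was read
# as character or as factor; format and tz are assumptions about the file
tweets_csv$created_at <- as.POSIXct(as.character(tweets_csv$created_at),
                                    format = "%Y-%m-%d %H:%M:%S",
                                    tz = "UTC")

# Comparisons against date-times are now meaningful
start <- as.POSIXct("2018-01-01", tz = "UTC")
end   <- as.POSIXct("2019-01-01", tz = "UTC")
in_2018 <- tweets_csv[tweets_csv$created_at >= start &
                      tweets_csv$created_at <  end, ]
nrow(in_2018)
```

With the column back in POSIXct form, the same range could also be expressed with `dplyr::filter()` as in the pipeline above.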