I'm trying to figure out how I could identify documents (tweets in this case) based on a term they may include.
Say I have this data frame (df), which holds the screen names of Twitter users and one tweet from each of them.
> df
ScreenName tweet
[1,] "Guy A" "one random tweet"
[2,] "Guy B" "another random tweet"
[3,] "Guy C" "a third random piece of text"
Within this data frame I would like to find the tweets that include a certain term (say "tweet") and extract them into a new data frame (df2), like so:
> df2
ScreenName tweet
[1,] "Guy A" "one random tweet"
[2,] "Guy B" "another random tweet"
I assume there must be a way to do this with the tm or qdap packages, but I could not find anything and ended up with the mess below.
After cleaning the corpus, I convert it to a TermDocumentMatrix:
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))  # keep terms of any length; older tm releases spelled this minWordLength = 1
I then pull out the row of the term-document matrix that belongs to the term I am interested in:
t <- as.vector(tdm[term,])
Subset to the documents in which the term is mentioned at least once:
t.df <- as.data.frame(t)
t.sub <- subset(t.df, t >= 1)
Get the document numbers (row numbers):
t.n <- as.numeric(rownames(t.sub))
Create two new data frames from tw (my data frame of tweets): t.tw, with only the tweets that mention the term, and t.o, with all the other tweets:
t.tw <- tw[t.n,]
t.o <- tw[!(seq_len(nrow(tw)) %in% t.n), ]
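For comparison, here is a minimal base R sketch of the same split that skips the term-document matrix. It assumes a plain, case-sensitive substring match on the tweet text is enough. The example data is rebuilt from the printout above, and the names term, has_term and df_other are made up for illustration.

```r
# Example data mirroring the printout in the question.
df <- data.frame(
  ScreenName = c("Guy A", "Guy B", "Guy C"),
  tweet = c("one random tweet", "another random tweet",
            "a third random piece of text"),
  stringsAsFactors = FALSE
)

term <- "tweet"  # hypothetical search term

# TRUE for every row whose tweet contains the term as a literal substring
# (fixed = TRUE switches off regular-expression interpretation).
has_term <- grepl(term, df$tweet, fixed = TRUE)

df2      <- df[has_term, ]   # tweets mentioning the term
df_other <- df[!has_term, ]  # all remaining tweets
```

Because this is a substring match, it would also pick up words such as "tweets" or "retweet". If only whole words should count, something like grepl(paste0("\\b", term, "\\b"), df$tweet, ignore.case = TRUE) may be closer to what the term-document matrix approach does, though tokenisation in tm can still differ.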
Thanks for your help.
Apologies if the horrendous piece of code above has offended any accomplished R users.