I'm trying to figure out how I could identify documents (tweets in this case) based on a term they may include.

Say I have this data frame (df), which holds the screen names of Twitter users and one tweet from each of them.

> df
     ScreenName tweet                         
[1,] "Guy A"    "one random tweet"            
[2,] "Guy B"    "another random tweet"        
[3,] "Guy C"    "a third random piece of text"

Within this data frame I would like to get the tweets that include a certain term (say "tweet") and extract those into a new data frame (df2), like so:

> df2
     ScreenName tweet                 
[1,] "Guy A"    "one random tweet"    
[2,] "Guy B"    "another random tweet"

I assume there must be a way to do it using the tm or qdap packages, but I could not find anything and so ended up with this mess:

After cleaning the corpus, I convert it to a TermDocumentMatrix:

tdm <- TermDocumentMatrix(corpus, control=list(minWordLength=1))

I then identify which row of the term-document matrix holds the term I am interested in:

t <- as.vector(tdm[term,])

Subset to the documents in which the term is mentioned at least once:

t.df <- as.data.frame(t)
t.sub <- subset(t.df, t >= 1)

Get the document numbers (row numbers):

t.n <- as.numeric(rownames(t.sub))

Create new data frames: t.tw, containing only the tweets that mention the term, and t.o, containing the other tweets:

t.tw <- tw[t.n,]
t.o <- tw[!1:nrow(tw) %in% t.n, ]

Thanks for your help.

Apologies if the horrendous piece of code above has offended any accomplished R users.

1 Answer

I would stay in base R and use the grep function, as in the following line (if you already have a data.frame):

df[grep("tweet", df$tweet), ]

Here is the whole thing with your data:

df <- read.table(text='ScreenName tweet                         
"Guy A"    "one random tweet"            
"Guy B"    "another random tweet"        
"Guy C"    "a third random piece of text"', header=TRUE)

df[grep("tweet", df$tweet), ]

##   ScreenName                tweet
## 1      Guy A     one random tweet
## 2      Guy B another random tweet
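
The question also asks for the tweets that do not mention the term. As a minimal sketch, assuming the same sample data, a logical mask from grepl can produce both subsets; the names has_term, df_match and df_other are made up for illustration, and fixed = TRUE is an assumption that the term should be matched literally rather than as a regular expression.

```r
# Sample data copied from the question.
df <- data.frame(
  ScreenName = c("Guy A", "Guy B", "Guy C"),
  tweet = c("one random tweet", "another random tweet",
            "a third random piece of text"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row, so the mask can simply be negated.
# fixed = TRUE treats the term as a literal string (an assumption here).
has_term <- grepl("tweet", df$tweet, fixed = TRUE)

df_match <- df[has_term, ]   # tweets that mention the term
df_other <- df[!has_term, ]  # all remaining tweets
```

A logical mask is used here instead of negative grep indices because df[-grep("tweet", df$tweet), ] returns zero rows when nothing matches, whereas df[!has_term, ] then returns every row.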
Answered 2014-06-10T20:34:05.590