I have a data frame (myDF) with 700,000+ rows and two columns, id and text. The text column contains 140-character texts (tweets), and I would like to run a sentiment analysis function I got off the web on them. However, no matter what I try, I run into memory problems on a MacBook with 4 GB of RAM.
I was thinking that maybe I could loop through the rows in batches, e.g. score the first 10, then the next 10, and so on, but I run into problems even with batches of 100. Would this solve the problem, and what is the best way to write such a loop?
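This is roughly the loop I have in mind (just a sketch, untested; the chunk size of 10 is arbitrary, and score.sentiment is the function defined in my code below):

chunk.size = 10
n = nrow(myDF)
starts = seq(1, n, by = chunk.size)
results = vector("list", length(starts))
for (i in seq_along(starts)) {
  rows = starts[i]:min(starts[i] + chunk.size - 1, n)
  # score one small slice of the data frame at a time
  results[[i]] = score.sentiment(myDF$text[rows], pos, neg)
}
# combine the per-chunk data frames back into one
all.scores = do.call(rbind, results)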
Here is my code:
library(plyr)
library(stringr)

# function score.sentiment
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none')
{
  # Parameters:
  # sentences: vector of text to score
  # pos.words: vector of words of positive sentiment
  # neg.words: vector of words of negative sentiment
  # .progress: passed to laply() to control the progress bar

  # create a simple array of scores with laply
  scores = laply(sentences,
                 function(sentence, pos.words, neg.words)
                 {
                   # split the sentence into words with str_split (stringr package)
                   word.list = str_split(sentence, "\\s+")
                   words = unlist(word.list)
                   # compare the words to the dictionaries of positive & negative terms
                   pos.matches = match(words, pos.words)
                   neg.matches = match(words, neg.words)
                   # match() returns the position of the matched term or NA;
                   # we just want TRUE/FALSE
                   pos.matches = !is.na(pos.matches)
                   neg.matches = !is.na(neg.matches)
                   # final score: positive matches minus negative matches
                   score = sum(pos.matches) - sum(neg.matches)
                   return(score)
                 }, pos.words, neg.words, .progress = .progress)

  # data frame with a score for each sentence
  scores.df = data.frame(text = sentences, score = scores)
  return(scores.df)
}
# import positive and negative words
pos = readLines("positive_words.txt")
neg = readLines("negative_words.txt")
# apply function score.sentiment; it returns a data frame,
# so keep just the numeric score column
myDF$score = score.sentiment(myDF$text, pos, neg, .progress = 'text')$score
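For what it's worth, the function itself seems fine on small inputs. Here is a tiny check (the sample tweets and word lists are invented for illustration):

sample.pos = c("good", "great")
sample.neg = c("bad", "awful")
sample.tweets = c("great day, feeling good", "awful traffic, bad mood")
score.sentiment(sample.tweets, sample.pos, sample.neg)
# should return a two-row data frame with scores 2 and -2

So the problem only appears at scale, which is why I suspect memory rather than the scoring logic.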