
I am trying to do basic Twitter sentiment analysis using Apache Spark.

The page below describes the Naive Bayes implementation in Apache Spark, which looks like a candidate for this problem: http://spark.apache.org/docs/1.0.0/mllib-naive-bayes.html

In the Java example, the training and test sets are declared as

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

I don't have a clue what data these hold, but I gather that they are not plain English text.

Say I have a list of tweets:

"I love my country."
"Great day at office."
"Google Chrome sucks!"

How do I use the Naive Bayes function to process this text?

Any insights on this would be helpful.


1 Answer


A LabeledPoint has the form (double, Vector): the first parameter is the label and the second is a Vector of features, which for Naive Bayes must be non-negative real values. Your raw tweets do not match that format, so you have to find a way to convert the text to real values. TF-IDF is one way to do that. You might be interested in reading this example for a better understanding.
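To make the conversion concrete, here is a minimal plain-Java sketch (no Spark dependency) of turning tweets into non-negative TF-IDF vectors via the hashing trick. The class `TweetFeatures`, its methods, the bucket count of 16, and the 1.0/0.0 sentiment labels are all assumptions made up for illustration; they are not part of Spark's API. Each resulting (label, double[]) pair has the shape that could then be wrapped as `new LabeledPoint(label, Vectors.dense(features))`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TweetFeatures {

    // Hypothetical helper: hash every token into one of numFeatures buckets
    // and count how often each bucket is hit (raw term frequency).
    static double[] termFrequencies(String text, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            tf[Math.floorMod(token.hashCode(), numFeatures)] += 1.0;
        }
        return tf;
    }

    // Hypothetical helper: weight each count by a smoothed inverse document
    // frequency, log((n + 1) / (df + 1)), which is never negative.
    static List<double[]> tfIdf(List<double[]> tfs) {
        int n = tfs.size();
        int dim = tfs.get(0).length;
        double[] df = new double[dim];
        for (double[] tf : tfs) {
            for (int i = 0; i < dim; i++) {
                if (tf[i] > 0) {
                    df[i] += 1.0;
                }
            }
        }
        List<double[]> weighted = new ArrayList<>();
        for (double[] tf : tfs) {
            double[] v = new double[dim];
            for (int i = 0; i < dim; i++) {
                v[i] = tf[i] * Math.log((n + 1.0) / (df[i] + 1.0));
            }
            weighted.add(v);
        }
        return weighted;
    }

    public static void main(String[] args) {
        String[] tweets = {
            "I love my country.",
            "Great day at office.",
            "Google Chrome sucks!"
        };
        // Assumed labels for illustration only: 1.0 = positive, 0.0 = negative.
        double[] labels = {1.0, 1.0, 0.0};

        List<double[]> tfs = new ArrayList<>();
        for (String tweet : tweets) {
            tfs.add(termFrequencies(tweet, 16));
        }
        List<double[]> features = tfIdf(tfs);

        for (int i = 0; i < tweets.length; i++) {
            System.out.println(labels[i] + " -> " + Arrays.toString(features.get(i)));
        }
    }
}
```

In a real job you would likely not hand-roll this: MLlib releases from 1.1.0 onward provide `HashingTF` and `IDF` in `org.apache.spark.mllib.feature`, which apply the same idea to RDDs. A much larger bucket count than 16 would also be needed to keep hash collisions rare on a real vocabulary.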

Answered 2014-09-19T12:33:43.290