
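I am trying to classify tweets into three categories (Buy, Hold, Sell) according to their sentiment. I am using R and the package e1071.

I have two data frames: a training set and a set of new tweets whose sentiment needs to be predicted.

The training set data frame: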

   **text** | **sentiment**
   --- | ---
   *this stock is a good buy* | Buy
   *markets crash in tokyo* | Sell
   *everybody excited about new products* | Hold

Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4]:

classifier<-naiveBayes(trainingset[,2],as.factor(trainingset[,4]), laplace=1)

Looking at the elements of the classifier

classifier$tables$x

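I can see that the conditional probabilities have been computed. Each tweet has different probabilities for Buy, Hold and Sell. So far so good.

However, when I predict on the training set: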

predict(classifier, trainingset[,2], type="raw")

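I get a classification based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet gets the same probabilities for Buy, Hold and Sell: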

      **Id** | **Buy** | **Hold** | **Sell**
      --- | --- | --- | ---
      1 | 0.25 | 0.5 | 0.25
      2 | 0.25 | 0.5 | 0.25
      3 | 0.25 | 0.5 | 0.25
      .. | .. | .. | ..
      N | 0.25 | 0.5 | 0.25

Any idea what I am doing wrong? I appreciate your help!

Thanks


1 Answer


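It looks like you trained the model with whole sentences as input, whereas you apparently want to use individual words as input features.

Usage: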

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)


## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, ...)

Arguments:

  x: A numeric matrix, or a data frame of categorical and/or
     numeric variables.

  y: Class vector.

In particular, if you train naiveBayes this way:

x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad")) 
bayes<-naiveBayes( x,y )

you get a classifier that recognizes exactly these two sentences and nothing else:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.5  0.5 

Conditional probabilities:
      x
y      john likes cake marry likes cats and john
  bad                0                         1
  good               1                         0
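A quick base R check illustrates why this happens. It is only a sketch, assuming the character vector is treated as a single categorical variable; the names `sentence_factor` and `is_known` are made up for illustration.

```r
# The same two training sentences as above
x <- c("john likes cake", "marry likes cats and john")

# Treated as one categorical variable, each whole sentence is a single level
sentence_factor <- factor(x)
levels(sentence_factor)

# A new sentence that reuses known words is still not a known level,
# so no per-word evidence is available for it
is_known <- "john likes cats" %in% levels(sentence_factor)
is_known
```

Individual words such as "john" or "likes" never appear as values of the feature, so the classifier cannot use them.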

To build a word-level classifier, you need to run it with words as input:

x <-             c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor(  c("good","good", "good","bad",  "bad",  "bad", "bad","bad") )
bayes<-naiveBayes( x,y )

You get:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.625 0.375 

Conditional probabilities:
      x
y            and      cake      cats      john     likes     marry
  bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
  good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
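The table above holds per-word probabilities, while a tweet contains several words. One way to score a whole sentence, sketched here in base R under the usual naive Bayes independence assumption, is to add the log prior and the log of each word's smoothed conditional probability. The helper `score_sentence` and the add-one smoothing constant `alpha` are illustrative choices, not part of e1071.

```r
# Word-level training data from the example above
x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
y <- factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))

counts <- table(y, x)            # word counts per class
prior  <- prop.table(table(y))   # class priors, as in the output above

# Hypothetical helper: per-class log score of a sentence with add-alpha smoothing
score_sentence <- function(sentence, counts, prior, alpha = 1) {
  words <- unlist(strsplit(tolower(sentence), " ", fixed = TRUE))
  words <- words[words %in% colnames(counts)]   # skip words never seen in training
  # Smoothed P(word | class); each row is divided by its own class total
  smoothed <- (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))
  log_lik <- rowSums(log(smoothed[, words, drop = FALSE]))
  scores <- log(as.vector(prior)) + log_lik
  names(scores) <- rownames(counts)
  scores
}

scores <- score_sentence("john likes cake", counts, prior)
predicted <- names(which.max(scores))
predicted
```

Working in log space avoids numeric underflow when a sentence has many words, and the smoothing keeps a single unseen word/class combination from forcing a probability of zero.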

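In general, R is not well suited to processing NLP data; Python (or at least Java) would be a better choice.

To split a sentence into words, you can use the strsplit function: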

unlist(strsplit("john likes cake"," "))
[1] "john"  "likes" "cake" 
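Building on strsplit, the sketch below shows one possible way to turn each tweet into word-presence features and train on those with e1071. The helpers `tokenize` and `word_features` are hypothetical, the three tweets are the toy rows from the question, and it is an assumption here that the data passed to predict should carry the same column names as the training data.

```r
library(e1071)

# Toy training data taken from the question
tweets <- c("this stock is a good buy",
            "markets crash in tokyo",
            "everybody excited about new products")
sentiment <- factor(c("Buy", "Sell", "Hold"))

# Hypothetical helper: lower-case and split on single spaces
tokenize <- function(text) strsplit(tolower(text), " ", fixed = TRUE)

# Vocabulary = every distinct word in the training tweets
vocab <- sort(unique(unlist(tokenize(tweets))))

# Hypothetical helper: one factor column per vocabulary word (absent / present)
word_features <- function(text, vocab) {
  tokens <- tokenize(text)
  cols <- lapply(vocab, function(w) {
    present <- vapply(tokens, function(tok) w %in% tok, logical(1))
    factor(present, levels = c(FALSE, TRUE))
  })
  features <- as.data.frame(cols)
  names(features) <- vocab   # set names afterwards so words like "in" stay unchanged
  features
}

train_x <- word_features(tweets, vocab)
classifier <- naiveBayes(train_x, sentiment, laplace = 1)

# New tweets are encoded with the same vocabulary, hence the same column names
new_x <- word_features(c("good stock", "tokyo markets crash"), vocab)
raw <- predict(classifier, new_x, type = "raw")
pred_train <- predict(classifier, train_x)
```

With only three training tweets the probabilities are not meaningful; the point is the shape of the input, one column per word rather than one column holding the whole sentence.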
answered 2013-08-15T17:42:31.007