r - Naive Bayes classifier bases its decision only on a-priori probabilities

Question

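I am trying to classify tweets into three categories (Buy, Hold, Sell) according to their sentiment. I am using R and the package e1071.

I have two data frames: a training set and a set of new tweets whose sentiment needs to be predicted.

Training set data frame: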

   | text | sentiment |
   |------|-----------|
   | this stock is a good buy | Buy |
   | markets crash in tokyo | Sell |
   | everybody excited about new products | Hold |

Now I want to train the model using the tweet text, trainingset[,2], and the sentiment category, trainingset[,4].

classifier<-naiveBayes(trainingset[,2],as.factor(trainingset[,4]), laplace=1)

Looking at the elements of the classifier

classifier$tables$x

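I find that the conditional probabilities have been computed: each tweet has different probabilities for Buy, Hold, and Sell. So far so good.

However, when I predict on the training set with: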

predict(classifier, trainingset[,2], type="raw")

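I get a classification that is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet has the same probabilities for Buy, Hold, and Sell: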

   | Id | Buy | Hold | Sell |
   |----|------|------|------|
   | 1  | 0.25 | 0.5  | 0.25 |
   | 2  | 0.25 | 0.5  | 0.25 |
   | 3  | 0.25 | 0.5  | 0.25 |
   | .. | .... | .... | .... |
   | N  | 0.25 | 0.5  | 0.25 |

Any ideas what I am doing wrong? Thanks for your help!

score 8 · Accepted Answer

It looks like you trained the model using whole sentences as input, while it seems that you want to use words as input features.

Usage:

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)


## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, ...)

Arguments:

  x: A numeric matrix, or a data frame of categorical and/or
     numeric variables.

  y: Class vector.

In particular, if you train naiveBayes this way:

x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad")) 
bayes<-naiveBayes( x,y )

you get a classifier that is able to recognize only these two exact sentences:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.5  0.5 

Conditional probabilities:
      x
y      john likes cake marry likes cats and john
  bad                0                         1
  good               1                         0
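A small base-R sketch of why this sentence-level setup cannot generalize: the contingency table behind the conditional probabilities has one column per distinct sentence, so a sentence that is not an exact match for a training sentence has no column to look up. The variable `new_sentence` is a hypothetical example, not part of the original question, and the snippet only inspects the counts rather than calling e1071.

```r
x <- c("john likes cake", "marry likes cats and john")
y <- factor(c("good", "bad"))

# Each whole sentence becomes a single categorical level of the feature x.
tab <- table(y, x)
print(colnames(tab))

# Hypothetical unseen sentence: it shares words with the training data,
# but as a whole string it matches no column of the table.
new_sentence <- "john likes cats"
seen <- new_sentence %in% colnames(tab)
print(seen)
```

Splitting the text into words, as described next, gives the classifier features that can recur across sentences.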

To get a word-level classifier, you need to run it with words as input:

x <-            c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor( c("good","good", "good","bad",  "bad",  "bad", "bad","bad") )
bayes<-naiveBayes( x,y )

and you get

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
  bad  good 
0.625 0.375 

Conditional probabilities:
      x
y            and      cake      cats      john     likes     marry
  bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
  good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
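The numbers in the printed tables are consistent with plain relative frequencies, which can be reproduced in base R without the package. The sketch below recomputes the a-priori and conditional probabilities from the same `x` and `y`, and then shows what add-one (Laplace) smoothing would do to the zero entries. The smoothing formula here is the usual textbook one written out by hand; it is an assumption, not taken from the e1071 source.

```r
x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
y <- factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))

# A-priori probabilities: share of each class among the training rows.
prior <- prop.table(table(y))

# Word counts per class, normalised row-wise so each class sums to 1.
counts <- table(y, x)
cond   <- prop.table(counts, margin = 1)

# Add-one smoothing (assumed textbook formula): every word gets one
# extra count per class, so no conditional probability stays at 0.
smoothed <- (counts + 1) / (rowSums(counts) + ncol(counts))

print(prior)
print(cond)
print(smoothed)
```

Without smoothing, a word that never occurs with a class (for example "cake" with "bad") gets probability 0 for that class.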

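In general, R is not well suited to processing NLP data; Python (or at least Java) would be a better choice.

To convert a sentence into words, you can use the strsplit function: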

unlist(strsplit("john likes cake"," "))
[1] "john"  "likes" "cake"
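As a minimal sketch, the word-level `x` and `y` vectors that were typed by hand earlier can be derived from the labelled sentences with `strsplit`. The names `sentences`, `labels`, and `tokens` are illustrative, and splitting on a single space is a simplifying assumption (real tweets would need handling of punctuation, case, and repeated whitespace).

```r
sentences <- c("john likes cake", "marry likes cats and john")
labels    <- c("good", "bad")

# Split each sentence on single spaces: a list with one character
# vector of words per sentence.
tokens <- strsplit(sentences, " ", fixed = TRUE)

# Flatten to one word per element, repeating each sentence's label
# once for every word it contains.
x <- unlist(tokens)
y <- factor(rep(labels, times = lengths(tokens)))

print(data.frame(word = x, label = y))
```

The resulting `x` and `y` can then be passed to `naiveBayes(x, y)` as in the word-level example.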
