I've been wrestling with R, trying to classify tweets using a Naive Bayes classifier model.

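Data

A training set with 2 columns: Tweet and Class (the column is named `class` in the data frame). There are 300 tweets in total: 150 labelled "App" and 150 labelled "Other".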

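Objective:

A test set of 20 data points (tweets): the first 10 are "App" and the last 10 are "Other". These are what I want to predict. I can successfully build a Naive Bayes model in Excel (blekh) and correctly predict 19 of the 20.

I want to replicate that in R.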

Code snippet

library(tm)
library('e1071')

# Custom Function 
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x
}

# Process text - tolower(), remove punctuation etc. 
tweets.all$Tweet <- replacePunctuation(tweets.all$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)

# Create a corpus for training and testing data set
tweets.train.corpus <- Corpus(VectorSource(as.vector(tweets.all$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))

# Create term-document matrices, keeping only words of length 4 or more
tweets.train.matrix <- t(TermDocumentMatrix(tweets.train.corpus,control=list(wordLengths=c(4,Inf))));
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus,control = list(wordLengths=c(4,Inf))));

# Build model with additive smoothing as 1
model <- naiveBayes(as.matrix((tweets.train.matrix)),as.factor(tweets.all$class),laplace=1)

#Predict
results <- predict(object=model,newdata=as.matrix(tweets.test.matrix));
results
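
For comparison, here is a minimal base-R sketch of a word-count (multinomial) Naive Bayes calculation with add-one smoothing. It assumes this is the kind of model the Excel version implements, which is not stated above. The toy documents and the helper names `tokenize`, `count_words`, and `predict_class` are hypothetical and do not come from tm or e1071.

```r
# Hypothetical toy data for illustration (not the tweets from the question).
train_docs  <- c("mandrill email smtp server",
                 "mandrill transactional email service",
                 "mandrill monkey at the zoo",
                 "saw a mandrill monkey today")
train_class <- factor(c("App", "App", "Other", "Other"))
test_docs   <- c("email service for smtp", "monkey at the zoo today")

# Lower-case, split on whitespace, keep words of 4+ characters
# (mirrors wordLengths = c(4, Inf) in the snippet above).
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[[:space:]]+"))
  words[nchar(words) >= 4]
}

# The vocabulary is built from the training documents only.
vocab <- sort(unique(unlist(lapply(train_docs, tokenize))))

# Word counts over the training vocabulary for a set of documents.
count_words <- function(docs) {
  tokens <- unlist(lapply(docs, tokenize))
  as.vector(table(factor(tokens, levels = vocab)))
}

# One column of counts per class.
counts <- sapply(levels(train_class),
                 function(cl) count_words(train_docs[train_class == cl]))
rownames(counts) <- vocab

# Additive (Laplace) smoothing: (count + 1) / (class total + vocabulary size).
laplace <- 1
log_lik <- log(sweep(counts + laplace, 2,
                     colSums(counts) + laplace * length(vocab), "/"))

# Class priors from the training labels.
log_prior <- log(as.vector(table(train_class)) / length(train_class))
names(log_prior) <- levels(train_class)

# Score a document by summing log-probabilities of its known words.
predict_class <- function(doc) {
  words <- tokenize(doc)
  words <- words[words %in% vocab]  # words unseen in training are ignored
  scores <- log_prior + colSums(log_lik[words, , drop = FALSE])
  names(which.max(scores))
}

predicted <- vapply(test_docs, predict_class, character(1), USE.NAMES = FALSE)
print(predicted)
```

On this toy data the first test document scores higher for App and the second for Other. Note that the test documents are scored against the training vocabulary rather than a vocabulary of their own, and that smoothing is applied to word counts.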

Data sample

Calling head(tweets.all) gives:

 Tweet class
 1                            [blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail:  https://opensourcehacker.com/2013/03/25/using-nullmailer-and-mandrill-for-your-ubuntu-linux-server-outboud-mail/?utm_source=twitterfeed&utm_medium=twitter  #plone   App
 2                     [blog] Using Postfix and free Mandrill email service for SMTP on Ubuntu Linux server:  https://opensourcehacker.com/2013/03/26/using-postfix-and-free-mandrill-email-service-for-smtp-on-ubuntu-linux-server/?utm_source=twitterfeed&utm_medium=twitter  #plone   App
 3 @aalbertson There are several reasons emails go to spam. Mind submitting a request at http://help.mandrill.com  with additional details?   App
 4                    @adrienneleigh I just switched it over to Mandrill, let's see if that improve the speed at which the emails are sent.   App
 5      @ankeshk +1 to @mailchimp We use MailChimp for marketing emails and their Mandrill app for txn emails... @sampad @abhijeetmk @hiway   App
 6 @biggoldring That error may occur if unsupported auth method used. Can you email us via http://help.mandrill.com  so we can get details?   App

Calling head(tweets.test) gives:

Tweet
1   Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
2   @rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
3   @veroapp Any chance you'll be adding Mandrill support to Vero?
4   @Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
5   would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
6   From Coworker about using Mandrill:  "I would entrust email handling to a Pokemon".

Output

This is what I get:

 [1] Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other
 Levels: App Other

This is garbage, i.e. nothing is classified correctly: every tweet is predicted as "Other". Any idea what I'm doing wrong?
