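I've been wrestling with R while trying to classify tweets with a Naive Bayes classifier model.
Data:
A training set with 2 columns: Tweet and Class. There are 300 tweets in total: 150 labelled "App" and 150 labelled "Other".
Objective:
A test set with 20 data points (tweets): the first 10 are "App" and the last 10 are "Other". I'd like to predict these. I can successfully build a Naive Bayes model in Excel (blekh) and correctly predict 19 of the 20.
I want to replicate this in R.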
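To give an idea of the kind of calculation I mean, here is a rough sketch in base R. It assumes a bag-of-words model with add-one smoothing and simply ignores words never seen in training. The toy tweets and the helper names (tokenize, train_nb, predict_nb) are made up for illustration; this is not the spreadsheet itself.

```r
# Toy training data, invented for illustration (not the real tweets).
train <- data.frame(
  Tweet = c("mandrill email api rocks",
            "send email with mandrill smtp",
            "mandrill monkey at the zoo",
            "saw a mandrill monkey today"),
  class = c("App", "App", "Other", "Other"),
  stringsAsFactors = FALSE
)

# Split one string into lower-case tokens on whitespace.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[[:space:]]+")[[1]]
  tokens[nchar(tokens) > 0]
}

# Count tokens per class and turn them into smoothed log-likelihoods.
train_nb <- function(tweets, classes) {
  tokens_by_doc <- lapply(tweets, tokenize)
  vocab <- sort(unique(unlist(tokens_by_doc)))
  labels <- sort(unique(classes))
  log_lik <- sapply(labels, function(cl) {
    toks <- unlist(tokens_by_doc[classes == cl])
    counts <- table(factor(toks, levels = vocab))
    # Add-one smoothing: (count + 1) / (tokens in class + vocabulary size)
    log((as.numeric(counts) + 1) / (length(toks) + length(vocab)))
  })
  rownames(log_lik) <- vocab
  prior <- table(factor(classes, levels = labels)) / length(classes)
  list(vocab = vocab,
       log_prior = setNames(log(as.numeric(prior)), labels),
       log_lik = log_lik)
}

# Score a new tweet per class and return the label with the highest score.
predict_nb <- function(model, tweet) {
  toks <- tokenize(tweet)
  toks <- toks[toks %in% model$vocab]  # assumption: unseen words are ignored
  scores <- model$log_prior + colSums(model$log_lik[toks, , drop = FALSE])
  names(which.max(scores))
}

model <- train_nb(train$Tweet, train$class)
predict_nb(model, "mandrill email service")
predict_nb(model, "monkey at the zoo")
```

The point is only that each word contributes a per-class count, which is the behaviour I am hoping to reproduce with a library model.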
Code snippet
library(tm)
library('e1071')
# Custom Function
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)
  x <- gsub("[:]+[ ]", " ", x)
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x
}
# Process text - tolower(), remove punctuation etc.
tweets.all$Tweet <- replacePunctuation(tweets.all$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)
# Create a corpus for training and testing data set
tweets.train.corpus <- Corpus(VectorSource(as.vector(tweets.all$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))
# Create term document matrix but only keep words of length 4 or above
tweets.train.matrix <- t(TermDocumentMatrix(tweets.train.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))
# Build model with additive smoothing as 1
model <- naiveBayes(as.matrix(tweets.train.matrix), as.factor(tweets.all$class), laplace = 1)
# Predict
results <- predict(object = model, newdata = as.matrix(tweets.test.matrix))
results
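As a quick sanity check of the preprocessing step, this is what replacePunctuation returns for two strings. The inputs are made up for illustration; the function body is the same as in the snippet above.

```r
# Same custom function as in the snippet, repeated so this runs on its own.
replacePunctuation <- function(x) {
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)
  x <- gsub("[:]+[ ]", " ", x)
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x
}

# Invented sample inputs (not taken from the data set).
samples <- c("Hello, World! Is this OK?",
             "See http://help.mandrill.com. Thanks")
cleaned <- replacePunctuation(samples)
print(cleaned)
# [1] "hello  world  is this ok "
# [2] "see http://help.mandrill.com thanks"
```

So punctuation becomes extra whitespace, and dots or colons inside a URL survive because they are only removed when followed by a space.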
Data sample
Calling head(tweets.all) yields:
Tweet class
1 [blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail: https://opensourcehacker.com/2013/03/25/using-nullmailer-and-mandrill-for-your-ubuntu-linux-server-outboud-mail/?utm_source=twitterfeed&utm_medium=twitter #plone App
2 [blog] Using Postfix and free Mandrill email service for SMTP on Ubuntu Linux server: https://opensourcehacker.com/2013/03/26/using-postfix-and-free-mandrill-email-service-for-smtp-on-ubuntu-linux-server/?utm_source=twitterfeed&utm_medium=twitter #plone App
3 @aalbertson There are several reasons emails go to spam. Mind submitting a request at http://help.mandrill.com with additional details? App
4 @adrienneleigh I just switched it over to Mandrill, let's see if that improve the speed at which the emails are sent. App
5 @ankeshk +1 to @mailchimp We use MailChimp for marketing emails and their Mandrill app for txn emails... @sampad @abhijeetmk @hiway App
6 @biggoldring That error may occur if unsupported auth method used. Can you email us via http://help.mandrill.com so we can get details? App
Calling head(tweets.test) yields:
Tweet
1 Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
2 @rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
3 @veroapp Any chance you'll be adding Mandrill support to Vero?
4 @Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
5 would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
6 From Coworker about using Mandrill: "I would entrust email handling to a Pokemon".
Output
This is what I get:
[1] Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other Other
Levels: App Other
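To put a number on it, the output can be compared with the known labels (first 10 "App", last 10 "Other"). This sketch rebuilds both vectors by hand rather than from the model, so the variable names are only for illustration.

```r
# Known test labels: first 10 are App, last 10 are Other.
expected <- factor(rep(c("App", "Other"), each = 10), levels = c("App", "Other"))
# What predict() returned above: every tweet labelled Other.
predicted <- factor(rep("Other", 20), levels = c("App", "Other"))

confusion <- table(expected, predicted)
print(confusion)

accuracy <- mean(expected == predicted)
print(accuracy)
# [1] 0.5
```

That is 10 out of 20 correct, which is no better than always guessing one class.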
This is garbage, i.e. nothing is being classified correctly. Any idea what I'm doing wrong?