I am trying to use the Naive Bayes learner from e1071 for spam analysis. This is the code I use to set up the model:
library(e1071)
emails=read.csv("emails.csv")
emailstrain=read.csv("emailstrain.csv")
model<-naiveBayes(type ~.,data=emailstrain)
There are two sets of emails, one for training and one for testing. Each row has a "statement" and a type. When I run
model
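and just read the raw output, it seems to give a greater-than-zero chance that a statement is spam when it really is spam, and the same is true when the statement is not spam. However, when I try to use the model to predict the test data with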
table(predict(model,emails),emails$type)
I get

       ham spam
  ham 2086  321
  spam   2    0
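This seems wrong. I also tried predicting on the training set itself, in which case it should give fairly good results, or at least results as good as what I observed in the model. Instead it gave

       ham spam
  ham 2735  420
  spam   0    6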
This is only slightly better than on the test set. I think there must be a problem with how the predict function is working.
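To put numbers on "only slightly better", here is a minimal sketch that recomputes accuracy and spam recall from the two tables above. The counts are copied from the outputs; the helper names (`accuracy`, `spam_recall`, `ham_baseline`) are hypothetical and not part of e1071. It assumes rows are predictions and columns are the true labels, which is what `table(predict(...), emails$type)` produces.

```r
# Confusion matrices copied from the outputs above
# (rows = predicted label, columns = actual label).
# matrix() fills column by column, so each pair is one "actual" column.
labels <- c("ham", "spam")
test_cm <- matrix(c(2086, 2, 321, 0), nrow = 2,
                  dimnames = list(predicted = labels, actual = labels))
train_cm <- matrix(c(2735, 0, 420, 6), nrow = 2,
                   dimnames = list(predicted = labels, actual = labels))

# Hypothetical helpers for illustration only.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)
spam_recall <- function(cm) cm["spam", "spam"] / sum(cm[, "spam"])
# Accuracy of a classifier that always answers "ham".
ham_baseline <- function(cm) sum(cm[, "ham"]) / sum(cm)

acc_test <- accuracy(test_cm)
acc_train <- accuracy(train_cm)
recall_test <- spam_recall(test_cm)
recall_train <- spam_recall(train_cm)
base_test <- ham_baseline(test_cm)

print(c(test = acc_test, train = acc_train))
print(c(test = recall_test, train = recall_train))
print(base_test)
```

Comparing `acc_test` with `base_test` shows whether the model is doing anything beyond labelling almost everything as ham, which the near-empty `spam` rows already suggest.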
Here is how the data files are set up, with some examples of their contents:
type,statement
ham,How much did ur hdd casing cost.
ham,Mystery solved! Just opened my email and he's sent me another batch! Isn't he a sweetie
ham,I can't describe how lucky you are that I'm actually awake by noon
spam,This is the 2nd time we have tried to contact u. U have won the £1450 prize to claim just call 09053750005 b4 310303. T&Cs/stop SMS 08718725756. 140ppm
ham,"TODAY is Sorry day.! If ever i was angry with you, if ever i misbehaved or hurt you? plz plz JUST SLAP URSELF Bcoz, Its ur fault, I'm basically GOOD"
ham,Cheers for the card ... Is it that time of year already?
spam,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"
ham,"When people see my msgs, They think Iam addicted to msging... They are wrong, Bcoz They don\'t know that Iam addicted to my sweet Friends..!! BSLVYL"
ham,Ugh hopefully the asus ppl dont randomly do a reformat.
ham,"Haven't seen my facebook, huh? Lol!"
ham,"Mah b, I'll pick it up tomorrow"
ham,Still otside le..u come 2morrow maga..
ham,Do u still have plumbers tape and a wrench we could borrow?
spam,"Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/reward. Ts&Cs apply."
ham,It vl bcum more difficult..
spam,UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator to claim either Bahamas cruise of£2000 CASH 18+only. To opt out txt X to 07786200117
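As a quick check of how R parses this layout, the sketch below reads a few of the sample rows from an inline string instead of the real files. The names `csv_text` and `toy` are hypothetical, and `stringsAsFactors = TRUE` is set explicitly as an assumption, since the default of `read.csv` differs between R versions.

```r
# A few rows copied from the sample above; the real data lives in
# emails.csv and emailstrain.csv.
csv_text <- 'type,statement
ham,How much did ur hdd casing cost.
ham,Cheers for the card ... Is it that time of year already?
spam,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"
ham,It vl bcum more difficult..
'

# Assumption: text columns are converted to factors on load.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(toy)                             # column types as R parsed them
class_counts <- table(toy$type)      # class balance (ham vs spam)
n_levels <- nlevels(toy$statement)   # number of distinct statement values

print(class_counts)
print(n_levels)
```

If the number of `statement` levels is close to the number of rows, each message is being treated as one categorical value rather than as a set of words, which may be worth keeping in mind when reading the prediction tables.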
I would really appreciate any suggestions. Thank you very much for your help.