
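We have a dataset of 20k tweets that our professor has already processed, so each word is followed by its part-of-speech (POS) tag. The POS tags come from the Penn Treebank project. Here are some example sentences: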

+ 1005//CD I//PRP have//VBP to//TO second//JJ the//DT Garnier//NNP Fructis//NNP Brilliant//NNP Shine//NNP Wax//NNP .//. 
+ 1006//CD it//PRP is/be/VBZ everything//NN I//PRP have//VBP ever//RB wanted/want/VBD in//IN a//DT gel//NN .//. 
= 1007//CD TITLETITLE//NNP KelseysATrick//NNP ://: I//PRP miss//VBP my//PRP$ Pantene//NNP Pro-V//NNP ,//, 
+ 1008//CD KelseysATrick//NNP ://: I//PRP miss//VBP my//PRP$ Pantene//NNP Pro-V//NNP brunette//JJ expressions/expression/NNS shampoo//NN and//CC conditioner//NN .//. 
+ 1009//CD It//PRP made/make/VBD my//PRP$ hair//NN happier/happy/JJR than//IN this//DT Herbal//NNP Essence//NNP crap//NN .//. 
= 1010//CD TITLETITLE//NNP Best/Good/JJS CO//NNP Washing//NNP Conditioner//NNP ?//. Weaves/Weave/NNP and//CC non//FW weaves/weave/NNS 
+ 1011//CD Originally//RB posted/post/VBD by//IN CarmenKay//NNP I//PRP am/be/VBP in//IN love//NN with//IN the//DT Dove//NN conditioner//NN in//IN the//DT blue//JJ bottle//NN it//PRP always//RB works/work/VBZ wonders/wonder/NNS for//IN me//PRP !//. 
= 1012//CD ditto//NN 

The first character of each line is the sentence's class label, and every word in the sentence is annotated with its POS tag.
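For reference, here is a minimal Python sketch of how a line in this format could be split into its parts. It assumes, based only on the samples above, that each token has the shape `word/lemma/TAG` with the middle slot often left empty, and that the first whitespace-separated field is the class label. The names `Token` and `parse_line` are hypothetical, and falling back to the word itself when the lemma slot is empty is my own choice, not something the dataset specifies.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


def parse_line(line: str) -> Tuple[str, List[Token]]:
    """Split one tagged line into (class_label, tokens).

    Assumes tokens look like word/lemma/TAG, as in the samples.
    """
    parts = line.split()
    if not parts:
        raise ValueError("empty line")
    label, raw_tokens = parts[0], parts[1:]

    tokens = []
    for raw in raw_tokens:
        # Split from the right so a '/' inside the word itself survives.
        pieces = raw.rsplit("/", 2)
        if len(pieces) != 3:
            raise ValueError(f"unexpected token shape: {raw!r}")
        word, lemma, tag = pieces
        # Assumption: an empty lemma slot means "use the word as written".
        tokens.append(Token(word, lemma or word, tag))
    return label, tokens


label, tokens = parse_line("+ 1006//CD it//PRP is/be/VBZ everything//NN .//.")
print(label, tokens[2])
```

Splitting from the right keeps punctuation tokens such as `.//.` and `://:` intact. The leading number tagged `CD` is kept as an ordinary token here; drop it if it is only a sentence id.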

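Does Weka have any support for parsing POS tags when reading in data? So far we have stripped the POS tags and are not using them, but I suspect they would help improve the classifier's accuracy.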
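To make concrete what "using the tags" might mean, here is a rough preprocessing sketch in Python, done outside Weka. It does not answer whether Weka can read this format itself. It simply fuses each word with its tag into a single `word_TAG` feature instead of discarding the tag, so that a later bag-of-words step sees `works_VBZ` and `works_NNS` as different features. The function name `tagged_text`, the `_` separator, and the lowercasing are all my own assumptions.

```python
from typing import Tuple


def tagged_text(line: str, sep: str = "_") -> Tuple[str, str]:
    """Return (class_label, text) where every token becomes word<sep>TAG.

    Assumes the word/lemma/TAG token shape seen in the samples.
    """
    label, *raw_tokens = line.split()
    features = []
    for raw in raw_tokens:
        # The lemma slot is ignored here; only word and tag are kept.
        word, _lemma, tag = raw.rsplit("/", 2)
        features.append(f"{word.lower()}{sep}{tag}")
    return label, " ".join(features)


label, text = tagged_text("= 1012//CD ditto//NN")
print(label, text)
```

Whether these combined features actually improve accuracy is something we would still have to test against the tag-free baseline.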

Thanks!
