java - How to represent text for classification in Weka?

Question

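Please let me know how to represent the attributes or the class for text classification in Weka. Which attributes can be used for classification: word frequencies, or the words themselves? What would a possible ARFF structure look like? Could you give me a few example lines of that structure?

Thank you very much in advance.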

score 11 · Accepted Answer

One of the easiest alternatives is to start with an ARFF file for a two-class problem, like:

@relation corpus 

@attribute text string
@attribute class {pos,neg}

@data
'long text with words ... ',pos

The text is represented as a String type and the class is a nominal with two values.
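If the documents live in memory rather than in a hand-written file, a file of this shape can be generated with a few lines of plain Java. The following is only a minimal sketch: the class name `ArffCorpusWriter` and its helper methods are hypothetical, and the quoting (backslash-escaping single quotes and backslashes, flattening line breaks) is an assumption about what a Weka ARFF reader accepts, so check the result by loading it in Weka.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka: writes labeled texts as a two-class ARFF file.
public class ArffCorpusWriter {

    // Wrap a text in single quotes; escape backslashes first, then single quotes,
    // and replace line breaks with a space so each instance stays on one line.
    static String quote(String text) {
        String cleaned = text.replace("\\", "\\\\")
                             .replace("'", "\\'")
                             .replaceAll("[\\r\\n]+", " ");
        return "'" + cleaned + "'";
    }

    // Each row is {text, label}; labels are assumed to be "pos" or "neg".
    static String toArff(String relation, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {pos,neg}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',').append(row[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"I loved this movie", "pos"},
                new String[] {"It's a waste of time", "neg"});
        System.out.print(toArff("corpus", rows));
    }
}
```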

Then you could apply two filters:

StringToWordVector, which transforms the texts into a word-vector representation. The filter creates one attribute per word. You can tweak its parameters to choose a binary or frequency representation, stemming, or stopword removal. The best representation depends on the problem. If the texts are not long, a binary representation is usually enough.
Reorder, which moves the class attribute to the last position; by default Weka assumes the class is the last attribute.
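To make the difference between the binary and the frequency representation concrete, here is a hand-rolled illustration in plain Java. It is not Weka's StringToWordVector implementation and does no stemming or stopword handling; the class `WordVectorSketch`, its tokenizer, and the sorted vocabulary are all assumptions chosen to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration of word vectors; not the Weka filter itself.
public class WordVectorSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // One attribute per distinct word across all documents, in sorted order.
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // One value per vocabulary word: the occurrence count,
    // or just 0/1 (absent/present) when binary is true.
    static int[] vectorize(String doc, List<String> vocab, boolean binary) {
        int[] vector = new int[vocab.size()];
        for (String token : tokenize(doc)) {
            int i = vocab.indexOf(token);
            if (i >= 0) {
                vector[i] = binary ? 1 : vector[i] + 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("good good movie", "bad movie");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);                                                 // [bad, good, movie]
        System.out.println(Arrays.toString(vectorize(docs.get(0), vocab, false))); // [0, 2, 1]
        System.out.println(Arrays.toString(vectorize(docs.get(0), vocab, true)));  // [0, 1, 1]
    }
}
```

With short texts most words occur at most once, so the two vectors rarely differ, which is one reason the binary form is often sufficient there.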

You may find more info and other approaches to transform your data in this Weka wiki page: http://weka.wikispaces.com/Text+categorization+with+WEKA

score 0 · Accepted Answer

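In Weka you can choose your own attributes. In this example there are only 2 classes, and every unique word is used as an attribute. If you choose word frequency as the attribute value, a word that appears twice in the text gets "2", a word that appears only once gets "1", and a word that does not appear gets "0".

Here is a sample .arff format.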

@RELATION anyrelation

@ATTRIBUTE word1 NUMERIC
@ATTRIBUTE word2 NUMERIC
...
@ATTRIBUTE wordn NUMERIC
@ATTRIBUTE class {class1, class2}

@DATA
1,2,....,0,class1
0,3,....,1,class2
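A file in this layout can also be produced programmatically once the word counts are known. The sketch below assumes the counts are already computed, that every word is a simple token without spaces or quotes, and that no word is literally named "class"; `WordCountArff` and its parameters are hypothetical names. Each word attribute is declared NUMERIC, since ARFF expects a type on every attribute.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka: renders word counts as an ARFF document.
public class WordCountArff {

    // counts[r][c] is how often words.get(c) occurs in document r; labels[r] is its class.
    static String toArff(String relation, List<String> words, List<String> classes,
                         int[][] counts, String[] labels) {
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION ").append(relation).append("\n\n");
        for (String word : words) {
            sb.append("@ATTRIBUTE ").append(word).append(" NUMERIC\n");
        }
        // The class attribute goes last, listing its nominal values.
        sb.append("@ATTRIBUTE class {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@DATA\n");
        for (int r = 0; r < counts.length; r++) {
            for (int count : counts[r]) {
                sb.append(count).append(',');
            }
            sb.append(labels[r]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bad", "good", "movie");
        List<String> classes = Arrays.asList("class1", "class2");
        int[][] counts = { {0, 2, 1}, {1, 0, 1} };
        String[] labels = {"class1", "class2"};
        System.out.print(toArff("anyrelation", words, classes, counts, labels));
    }
}
```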
