java - Classifying blog authors by gender with Weka

Question

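I'm trying to use Weka from Java to classify blog authors as male or female. I built a class named Weka that defines the attributes to use in the training set and then calls a method that loads all the known data from an Excel sheet. The data in the file is organized so that each row has the blog text in cell 0 and an M or an F in cell 1.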

    blog text    M
    more text    F

I'm also loosely following this tutorial: Weka Java Tutorial

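When I run the program, text starts appearing in Eclipse's console window, but then I suddenly get a red error that reads "Value not defined for given nominal attribute!" I'm not sure why this happens. The text changes from row to row, so I don't see how I could define every nominal value in advance. Can anyone see what I'm doing wrong, or what silly mistake I've made? I'd appreciate any help. I've been stuck on this for hours.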

Code:

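import java.io.File;
import java.io.IOException;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Weka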
{
    static FastVector fvWekaAttributes;
    static Instances isTrainingSet;
    static Classifier cModel;

    public static void main(String[] args) throws Exception
    {



        // Declaring attributes
        Attribute stringAttribute = new Attribute("text", (FastVector) null);

        // Declaring a class attribute along with values
        FastVector fastVClassVal = new FastVector(2);
        fastVClassVal.addElement("M");
        fastVClassVal.addElement("F");

        Attribute classAttribute = new Attribute("theClass", fastVClassVal);

        // Declaring the feature vector
        fvWekaAttributes = new FastVector(2);
        fvWekaAttributes.addElement(stringAttribute);
        fvWekaAttributes.addElement(classAttribute);

        // create the training set
        isTrainingSet = new Instances("Rel", fvWekaAttributes, 10);

        // set class index
        isTrainingSet.setClassIndex(1);

        // create however many instances is in my excel file
        // and add it to the training set in a loop.
        Weka.LoadExcelWorkBook(isTrainingSet);
        Weka.TestSetWork();

    }

    public static void TestSetWork() throws Exception
    {
        // test the model
        Evaluation testing = new Evaluation(isTrainingSet);
        testing.evaluateModel(cModel, isTrainingSet);

        // printing the results....
        String strSummary = testing.toSummaryString();
        System.out.println(strSummary);

        // get confusion matrix.

        double[][] cmMatrix = testing.confusionMatrix();
        for (int i = 0; i < cmMatrix.length; i++)
        {
            for (int col = 0; col < cmMatrix.length; col++)
            {
                System.out.print(cmMatrix[i][col]);
                System.out.print("|");
            }
            System.out.println();
        }

    }

    public static void LoadExcelWorkBook(Instances trainingSet)
            throws Exception
    {
        System.out.println("LOADING EXCEL WORKBOOK!!!");
        Workbook wb = null;
        // opening excel file.

        try
        {
            wb = WorkbookFactory
                    .create(new File("C://blog-gender-dataset.xlsx"));

        } catch (IOException ieo)
        {
            ieo.printStackTrace();
        }

        // opening worksheet.
        Sheet sheet = wb.getSheetAt(0);

        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(isTrainingSet);

        Instances dataFiltered = Filter.useFilter(isTrainingSet, filter);

        for (Row row : sheet)
        {

            Cell textCell = row.getCell(0);
            Cell MFCell = row.getCell(1);

            String blogText = textCell.getStringCellValue();
            String MFIndicator = MFCell.getStringCellValue();
            System.out.println("TEXT FROM EXCEL " + blogText);
            Instance iText = new Instance(2);

            iText.setValue((Attribute) fvWekaAttributes.elementAt(0), blogText);
            iText.setValue((Attribute) fvWekaAttributes.elementAt(1),
                    MFIndicator);

            isTrainingSet.add(iText);

            cModel = (Classifier) new J48();
            cModel.buildClassifier(dataFiltered);

        }
    }

}

score 0 · Accepted Answer

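The "Value not defined for given nominal attribute!" error occurs when the instance you construct contains a value that differs from the values you defined for that nominal attribute (the ones that would be listed in the @attribute section of an ARFF file). For example, you defined the expected values as "M" or "F", but the value you actually read might be empty, N/A, and so on. The solution is to validate your data rigorously, debug or trace what you load to find the attribute on which the error occurs, and add the offending value to that attribute's list of possible values. Alternatively, if this happens systematically in your case, define the attribute with a more general type (string, numeric, ...).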
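The validation step described above could be sketched roughly as follows. This is only an illustration in plain Java with no Weka calls: the class `NominalValueCheck` and its methods are hypothetical names, and trimming and upper-casing the cell value is an assumption about how the labels ought to be cleaned, not something the question or Weka prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: not part of Weka and not part of the question's code.
final class NominalValueCheck {
    // The values declared for the class attribute in the question ("M" and "F").
    static final Set<String> ALLOWED = new LinkedHashSet<>(Arrays.asList("M", "F"));

    // Trim and upper-case a raw cell value; null becomes the empty string.
    // (Assumed cleanup rule, chosen only for this sketch.)
    static String normalize(String raw) {
        return raw == null ? "" : raw.trim().toUpperCase(Locale.ROOT);
    }

    // True if the normalized value is one of the declared nominal values.
    static boolean isDeclared(String raw) {
        return ALLOWED.contains(normalize(raw));
    }

    // Return the zero-based indices of rows whose label is not declared.
    static List<Integer> findBadRows(List<String> labels) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            if (!isDeclared(labels.get(i))) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Sample labels as they might come out of column 1 of the sheet.
        List<String> labels = Arrays.asList("M", " f ", "", "N/A", null, "F");
        System.out.println("Rows with undeclared labels: " + findBadRows(labels));
    }
}
```

Running a check like this over the label column before building any instances shows which rows would trigger the error. If normalized labels are accepted, the normalized string (not the raw cell text) is what would have to be passed on when setting the attribute value.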
