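I am doing incremental learning in MOA for a text application. This requires creating Instance objects that represent text numerically, for example as a TF-IDF score for each stem in the vocabulary. My MOA version is 2019.05.0.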
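To make the target representation concrete, here is a minimal sketch of how such TF-IDF scores could be computed in plain Java, without MOA or Weka. The class name TfIdfSketch is hypothetical, plain tokens stand in for stems, and the weighting (tf = count / document length, idf = ln(N / df)) is an assumption; it is only one of several common TF-IDF variants.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of MOA or Weka.
final class TfIdfSketch {

    // Returns one term -> score map per input document.
    // Assumed weighting: tf = count / document length, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: the number of documents each term occurs in.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                scores.put(e.getKey(), tf * idf);
            }
            result.add(scores);
        }
        return result;
    }
}
```

Under this weighting, a term that occurs in every document gets a score of 0, because its idf is ln(1).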

I looked for text-processing tools in MOA but could not find any.

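I saw that Weka has a class, StringToWordVector, so I decided to try it. Weka's classes are not the same as MOA's, but there is a class called WekaToSamoaInstanceConverter, so I figured I could create Weka Instances, run StringToWordVector on them, and then convert the result to MOA Instances. Maybe this is the wrong track, or maybe it is the right track and I am missing something in the syntax.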

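//Imports used by this method (they belong at the top of the enclosing file).
import com.yahoo.labs.samoa.instances.Instances;
import com.yahoo.labs.samoa.instances.WekaToSamoaInstanceConverter;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

//The unqualified Instances here is the MOA (SAMOA) type, not weka.core.Instances.
public static Instances convertDirectoryToInstances(String directory) throws Exception {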
    //Create an object that reads training or test files from a directory.
    //In the future, I'll want to add one file at a time. That's not the part I'm worried about at the moment.
    TextDirectoryLoader loader = new TextDirectoryLoader();
    String[] options = new String[] {"-dir", directory, "-charset", "UTF-8"};
    loader.setOptions(options);
    loader.getStructure();

    //Create Weka Instances that represent unprocessed text.
    weka.core.Instances plainTextInstances = loader.getDataSet();

    //A StringToWordVector is a Filter that converts text to word vectors.
    //I'm not using any bells and whistles for this example, so I expect each Instance to be a set of terms in the document.
    StringToWordVector stringToWordVector = new StringToWordVector();
    stringToWordVector.setInputFormat(plainTextInstances);
    weka.core.Instances wekaWordVectors = Filter.useFilter(plainTextInstances, stringToWordVector);

    //A MOA Instance is different from a Weka Instance, so we need to convert them.
    WekaToSamoaInstanceConverter converter = new WekaToSamoaInstanceConverter();

    //This is what fails.
    Instances moaWordVectors = converter.samoaInstances(wekaWordVectors);
    return moaWordVectors;
}

wekaWordVectors.size() is the number of files in the subdirectories, so that much is what I expect.

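The call to samoaInstances() fails. Line 220 tries to call locateIndex(0). There is no class at 0, so it returns -1. That -1 is then used as an array index, so I get an ArrayIndexOutOfBoundsException. I don't know what a class of 0 means, but I do know that an ArrayIndexOutOfBoundsException means I did something wrong.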
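The general failure pattern described here (a lookup that returns -1, which is then used as an array index) can be reproduced without Weka or MOA. The following is a minimal sketch, assuming a sparse row that stores only some attribute indices alongside their values. SparseRowSketch and its methods are hypothetical simplifications, not the actual Weka or MOA implementation, and the real locateIndex may follow different rules. Whether this is exactly what happens at line 220 is not confirmed.

```java
import java.util.Arrays;

// Hypothetical, simplified model of a sparse row; not Weka's or MOA's class.
final class SparseRowSketch {
    private final int[] indices;   // sorted attribute indices that have a stored value
    private final double[] values; // values[i] belongs to attribute indices[i]

    SparseRowSketch(int[] indices, double[] values) {
        this.indices = indices.clone();
        this.values = values.clone();
    }

    // Position of the given attribute in the internal arrays, or -1 if it is not stored.
    int locateIndex(int index) {
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? pos : -1;
    }

    // Unsafe lookup: uses the result of locateIndex directly as an array index.
    double valueUnchecked(int index) {
        return values[locateIndex(index)];
    }

    // Safe lookup: treats an attribute that is not stored as 0.0.
    double value(int index) {
        int pos = locateIndex(index);
        return pos >= 0 ? values[pos] : 0.0;
    }
}
```

For a row that stores only attributes 2 and 5, locateIndex(0) returns -1, valueUnchecked(0) throws an ArrayIndexOutOfBoundsException, and value(0) returns 0.0.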
