machine-learning - Performing LSA on training and test sets with the WEKA API

Question

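I need to do text classification with Weka and its AttributeSelection algorithm LatentSemanticAnalysis. My dataset is split into a training set and a test set, and I want to apply LSA to both. I have read several posts about LSA, but I have not found out how to apply it to separate datasets while keeping them compatible. This is what I have so far, but it runs out of memory: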

// Configure LSA as the attribute evaluator, with Ranker as the search method
AttributeSelection selecter = new AttributeSelection();
weka.attributeSelection.LatentSemanticAnalysis lsa = new weka.attributeSelection.LatentSemanticAnalysis();
Ranker rank = new Ranker();

selecter.setEvaluator(lsa);
selecter.setSearch(rank);
selecter.setRanking(true);

// Build the LSA transformation on the input data, then project the data onto it
selecter.SelectAttributes(input);
Instances outputData = selecter.reduceDimensionality(input);

Edit1: In response to @Jose's answer, I have added a new version of the source code. It results in an OutOfMemoryError:

AttributeSelection filter = new AttributeSelection(); // package weka.filters.supervised.attribute!
LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
Ranker rank = new Ranker();
filter.setEvaluator(lsa);
filter.setSearch(rank);
filter.setInputFormat(train); // set the options first, then the input format

// The first call builds the transformation on train; the second reuses it for test
train = Filter.useFilter(train, filter);
test = Filter.useFilter(test, filter);

Edit2: The error I get:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at weka.core.matrix.Matrix.getArrayCopy(Matrix.java:301)
at weka.core.matrix.SingularValueDecomposition.<init>(SingularValueDecomposition.java:76)
at weka.core.matrix.Matrix.svd(Matrix.java:913)
at weka.attributeSelection.LatentSemanticAnalysis.buildAttributeConstructor(LatentSemanticAnalysis.java:511)
at weka.attributeSelection.LatentSemanticAnalysis.buildEvaluator(LatentSemanticAnalysis.java:416)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:596)
at weka.filters.supervised.attribute.AttributeSelection.batchFinished(AttributeSelection.java:455)
at weka.filters.Filter.useFilter(Filter.java:682)
at test.main(test.java:44)
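
The trace above fails in Matrix.getArrayCopy while the SingularValueDecomposition is being constructed. weka.core.matrix.Matrix stores a dense array of doubles, so the SVD needs at least one extra full copy of the data matrix on the heap. The following is a rough back-of-the-envelope sketch of that cost. The class name, method name, and the dataset dimensions are made up for illustration and are not taken from the question.

```java
public class DenseMatrixHeapEstimate {

    /** Bytes for one dense rows x cols matrix of doubles (8 bytes each), ignoring array headers. */
    public static long denseMatrixBytes(long rows, long cols) {
        return rows * cols * 8L;
    }

    public static void main(String[] args) {
        long instances = 5_000;   // hypothetical number of documents
        long attributes = 20_000; // hypothetical vocabulary size after vectorization

        long oneCopy = denseMatrixBytes(instances, attributes);
        long mib = 1024L * 1024L;

        // The SVD works on a copy, so at least two dense copies exist at the same time.
        System.out.println("One dense copy:  " + (oneCopy / mib) + " MiB");
        System.out.println("Two dense copies: " + (2 * oneCopy / mib) + " MiB");
        System.out.println("JVM max heap:     " + (Runtime.getRuntime().maxMemory() / mib) + " MiB");
    }
}
```

If the estimate exceeds the maximum heap reported by the JVM, raising the limit with the standard -Xmx option or reducing the number of attributes before LSA are the two obvious things to try.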

score 2 · Accepted Answer

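Since AttributeSelection is a filter, you can apply it in batch mode (the -b option) to the training and test subsets in one go, so that the test dataset is represented in the dimensions defined on the training set.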

You can see how to do this programmatically in "Use Weka in your Java code", in the Batch filtering part of the Filter section.
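
A minimal sketch of batch filtering through the Java API might look like the following. It assumes weka.jar (with LatentSemanticAnalysis available) is on the classpath, that the two sets are stored in ARFF files with the hypothetical names train.arff and test.arff, and that the class attribute is the last one. None of these details come from the question, so adjust them to your data.

```java
import weka.attributeSelection.LatentSemanticAnalysis;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class LsaBatchFilter {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names; both files must share the same header.
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");

        // Assumption: the class attribute is the last attribute.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Configure the filter completely before setting the input format.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new LatentSemanticAnalysis());
        filter.setSearch(new Ranker());
        filter.setInputFormat(train); // initialize with the training set only

        // First call: the LSA transformation is built from the training set.
        Instances newTrain = Filter.useFilter(train, filter);
        // Second call: the same transformation is reused, because
        // setInputFormat is not called again in between.
        Instances newTest = Filter.useFilter(test, filter);

        // Both outputs should now share one header.
        System.out.println("Compatible: " + newTrain.equalHeaders(newTest));
    }
}
```

The important detail is that setInputFormat is called once, with the training set, and the same filter instance is then passed to both useFilter calls. Because the stack trace in the question is a Java heap space error raised inside the SVD, the program may also need a larger heap, for example by starting the JVM with a higher -Xmx value.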
