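I have a CSV file with 24,231 rows. I want to apply LOOCV (leave-one-out cross-validation) by project name rather than by individual observation across the whole dataset. So if my dataset contains information on 15 projects, I want a training set built from 14 projects and a test set built from the remaining one.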

I rely on Weka's API. Is there anything that can automate this process?

1 Answer

For non-numeric attributes, Weka lets you retrieve the unique values via Attribute.numValues() (how many there are) and Attribute.value(int) (the i-th value).

package weka;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

public class LOOByValue {

  /**
   * 1st arg: dataset file to load (ARFF, or e.g. CSV; DataSource picks the loader from the file extension)
   * 2nd arg: 0-based index of the attribute to use as class
   * 3rd arg: 0-based index of the attribute to use for LOO
   *
   * @param args    the command-line arguments
   * @throws Exception  if loading/processing of data fails
   */
  public static void main(String[] args) throws Exception {
    // load data
    Instances full = ConverterUtils.DataSource.read(args[0]);
    full.setClassIndex(Integer.parseInt(args[1]));
    int looCol = Integer.parseInt(args[2]);
    Attribute looAtt = full.attribute(looCol);
    if (looAtt.isNumeric())
      throw new IllegalStateException("Attribute cannot be numeric!");
    // iterate unique values to create train/test splits
    for (int i = 0; i < looAtt.numValues(); i++) {
      String value = looAtt.value(i);
      System.out.println("\n" + (i+1) + "/" + full.attribute(looCol).numValues() + ": " + value);
      Instances train = new Instances(full, full.numInstances());
      Instances test = new Instances(full, full.numInstances());
      for (int n = 0; n < full.numInstances(); n++) {
        Instance inst = full.instance(n);
        if (inst.stringValue(looCol).equals(value))
          test.add((Instance) inst.copy());
        else
          train.add((Instance) inst.copy());
      }
      train.compactify();
      test.compactify();
      // TODO do something with the data
      System.out.println("train size: " + train.numInstances());
      System.out.println("test size: " + test.numInstances());
    }
  }
}

Using the surface-quality attribute of Weka's anneal UCI dataset for the leave-one-out split, you get output like the following:

1/5: ?
train size: 654
test size: 244

2/5: D
train size: 843
test size: 55

3/5: E
train size: 588
test size: 310

4/5: F
train size: 838
test size: 60

5/5: G
train size: 669
test size: 229
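
The partitioning idea itself does not depend on Weka. As a minimal sketch (plain Java, not part of the Weka API), the class below builds one fold per distinct group label from a list that holds the project name of each row. The names LeaveOneGroupOut, Fold and split are made up for this illustration, and the sketch assumes that labels are non-null and that you map the returned row indices back onto your own data structure yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a Weka class.
public class LeaveOneGroupOut {

  /** One fold: the held-out group plus the row indices used for training and testing. */
  public static final class Fold {
    public final String heldOut;
    public final List<Integer> train;
    public final List<Integer> test;

    Fold(String heldOut, List<Integer> train, List<Integer> test) {
      this.heldOut = heldOut;
      this.train = train;
      this.test = test;
    }
  }

  /**
   * Builds one fold per distinct label.
   * labels.get(i) is the group (e.g. project name) of row i; labels are assumed non-null.
   */
  public static List<Fold> split(List<String> labels) {
    // Collect the row indices of every group, keeping first-seen order.
    Map<String, List<Integer>> rowsByGroup = new LinkedHashMap<>();
    for (int i = 0; i < labels.size(); i++) {
      rowsByGroup.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
    }
    // For each group: its rows form the test set, all other rows the training set.
    List<Fold> folds = new ArrayList<>();
    for (Map.Entry<String, List<Integer>> entry : rowsByGroup.entrySet()) {
      List<Integer> train = new ArrayList<>();
      for (int i = 0; i < labels.size(); i++) {
        if (!labels.get(i).equals(entry.getKey())) {
          train.add(i);
        }
      }
      folds.add(new Fold(entry.getKey(), train, entry.getValue()));
    }
    return folds;
  }

  public static void main(String[] args) {
    // Toy data: six rows belonging to three projects.
    List<String> projects = Arrays.asList("A", "B", "A", "C", "B", "A");
    for (Fold fold : split(projects)) {
      System.out.println(fold.heldOut + ": train size " + fold.train.size()
          + ", test size " + fold.test.size());
    }
  }
}
```

For the toy data in main this prints "A: train size 3, test size 3", "B: train size 4, test size 2" and "C: train size 5, test size 1". As in the anneal output above, train and test sizes always add up to the total number of rows.
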
Answered 2021-11-15T22:17:43.977