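Hi, I am trying to run the Chapter 7 (k-Means Clustering) example from Mahout in Action. Can someone guide me on how to run that example with Mahout (0.7) on a Hadoop cluster (single-node CDH-4.2.1)?

These are the steps I followed:

  1. Copied the code (from GitHub) into the Eclipse IDE on my local machine.

  2. Added these jars to my Eclipse project: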

hadoop-common-2.0.0-cdh4.2.1.jar

hadoop-hdfs-2.0.0-cdh4.2.1.jar

hadoop-mapreduce-client-core-2.0.0-cdh4.2.1.jar

mahout-core-0.7-cdh4.3.0.jar

mahout-core-0.7-cdh4.3.0-job.jar

mahout-math-0.7-cdh4.3.0.jar

  3. Built a jar of this project and copied it to my Hadoop cluster.

  4. Ran this command:

user@INFPH01463U:~$ hadoop jar /home/user/apurv/Kmean.jar tryout.SimpleKMeansClustering

This gave me the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: FileSystem
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
        at java.lang.Class.getMethod0(Class.java:2670)
        at java.lang.Class.getMethod(Class.java:1603)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:202)
Caused by: java.lang.ClassNotFoundException: FileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 5 more

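Can anyone help me figure out what I am missing, or whether I am running it the wrong way?

Secondly, I would like to know how to run k-means clustering on a CSV file.

Thanks in advance :)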


1 Answer


The given code is misleading. This code:

Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
    writer.append(new Text(cluster.getIdentifier()), cluster);
}
writer.close();

KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
  new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10,
  true, false);

SequenceFile.Reader reader = new SequenceFile.Reader(fs,
    new Path("output/" + Cluster.CLUSTERED_POINTS_DIR
             + "/part-m-00000"), conf);

should be replaced with:

Kluster cluster = new Kluster(vec, i, new EuclideanDistanceMeasure());
    writer.append(new Text(cluster.getIdentifier()), cluster);
}
writer.close();

KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
  new Path("output"), new EuclideanDistanceMeasure(), 0.001, 10,
  true, false);

SequenceFile.Reader reader = new SequenceFile.Reader(fs,
    new Path("output/" + Kluster.CLUSTERED_POINTS_DIR
             + "/part-m-00000"), conf);

Cluster is an interface, whereas Kluster is a class, so only Kluster can be instantiated with new. See the Mahout API Javadoc for more information.
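To make the interface-versus-class point concrete, here is a minimal sketch in plain Java. The nested types below are simplified stand-ins that only borrow the names; they are not the Mahout classes, and the constant value and the identifier format are assumptions made for illustration.

```java
// Simplified stand-ins for illustration only; these are NOT the Mahout types.
class InterfaceVsClass {

    // Plays the role of an interface such as Cluster: it declares behaviour
    // and constants, but it cannot be instantiated with `new`.
    interface Cluster {
        String CLUSTERED_POINTS_DIR = "clusteredPoints"; // assumed value
        String getIdentifier();
    }

    // Plays the role of a concrete class such as Kluster.
    static class Kluster implements Cluster {
        private final int id;

        Kluster(int id) {
            this.id = id;
        }

        @Override
        public String getIdentifier() {
            return "CL-" + id; // hypothetical identifier format
        }
    }

    public static void main(String[] args) {
        // Cluster c = new Cluster(0);  // does not compile: an interface has no constructor
        Cluster c = new Kluster(0);     // a concrete implementation is required
        System.out.println(c.getIdentifier());

        // A class inherits the constants of the interfaces it implements,
        // so both names below resolve to the same field.
        System.out.println(Kluster.CLUSTERED_POINTS_DIR.equals(Cluster.CLUSTERED_POINTS_DIR));
    }
}
```

In this sketch the variable can still be declared with the interface type; only the `new` expression has to name the concrete class.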

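To run k-means on a CSV file, you first have to create a SequenceFile to pass as the input argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it to a vector, and writes it to the SequenceFile "points.seq":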

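import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// fs (FileSystem) and conf (Configuration) are assumed to be initialized already.
try (
    BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
            new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
        String[] c = line.split(",");
        // Skip blank lines and rows with fewer than two columns.
        if (c.length > 1) {
            double[] d = new double[c.length];
            for (int i = 0; i < c.length; i++) {
                d[i] = Double.parseDouble(c[i]);
            }
            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);

            // Key each vector by its row index.
            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
    }
    // No explicit writer.close() is needed: try-with-resources closes both resources.
}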
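The parsing step can be tried on its own, without Hadoop or Mahout on the classpath. The following is a minimal sketch using only the JDK; the class name CsvPoints and the method parse are hypothetical helpers, not part of any library. It applies the same "more than one column" check as the code above and returns plain double arrays instead of Mahout vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: parses CSV text into rows of doubles using only the JDK.
class CsvPoints {

    // Splits the text into lines, skips blank lines and rows with fewer than
    // two columns, and parses every remaining column as a double.
    static List<double[]> parse(String csvText) {
        List<double[]> rows = new ArrayList<>();
        for (String line : csvText.split("\\r?\\n")) {
            String[] cols = line.split(",");
            if (cols.length > 1) {
                double[] values = new double[cols.length];
                for (int i = 0; i < cols.length; i++) {
                    values[i] = Double.parseDouble(cols[i].trim());
                }
                rows.add(values);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample data made up for this sketch.
        String csv = "1.0,2.0\n\n3.5, 4.5\n";
        for (double[] row : parse(csv)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Note that a header row or any non-numeric cell makes Double.parseDouble throw a NumberFormatException, so such rows would have to be filtered out before parsing.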

Hope this helps!

answered 2013-08-01T07:40:32.560