hadoop - Running MapReduce on Cassandra data using Hector

Question

I have been trying to run a simple map-reduce job on data stored in Cassandra using the Java client Hector.

I have already run the hadoop-wordcount example explained in this nice blog post successfully. I have also read the Hadoop Support article.

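What I want to do is a bit different in terms of implementation (the wordcount example uses a script in which mapreduce-site.xml is mentioned). I would like someone to help me understand how to run the map-reduce job on Cassandra data from Hector in distributed mode rather than locally.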

My code runs the map-reduce job successfully in local mode. What I want is to run it in distributed mode and write the results to the Cassandra keyspace as a new ColumnFamily.

I probably have to set this somewhere (as mentioned in the blog post above)
$PATH_TO_HADOOP/conf/mapred-site.xml
to run it in distributed mode, but I do not know where.

Here is my code:

public  class test_forum implements Tool {

private String KEYSPACE = "test_forum";
private String COLUMN_FAMILY ="posts";
private String OUTPUT_COLUMN_FAMILY = "output_post_count";
private static String CONF_COLUMN_NAME = "text";


public int run(String[] strings) throws Exception {

    Configuration conf = new Configuration();

    conf.set(CONF_COLUMN_NAME, "text");
    Job job = new Job(conf,"test_forum");

    job.setJarByClass(test_forum.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(ReducerToCassandra.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setOutputKeyClass(ByteBuffer.class);
    job.setOutputValueClass(List.class);

    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    job.setInputFormatClass(ColumnFamilyInputFormat.class);


    System.out.println("Job Set");


    ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
    ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");

    ConfigHelper.setInputColumnFamily(job.getConfiguration(),KEYSPACE,COLUMN_FAMILY);
    ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);

    SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text")));

    ConfigHelper.setInputSlicePredicate(job.getConfiguration(),predicate);

    System.out.println("running job now..");

    boolean success = job.waitForCompletion(true);

    return success ? 0 : 1;

}



public static class TokenizerMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private ByteBuffer sourceColumn;
    protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
    throws IOException, InterruptedException
    {
        sourceColumn = ByteBufferUtil.bytes(context.getConfiguration().get(CONF_COLUMN_NAME));
    }

    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context) throws IOException, InterruptedException
    {



        IColumn column = columns.get(sourceColumn);

        if (column == null)  {
            return;
        }

        String value = ByteBufferUtil.string(column.value());
        System.out.println("read " + key + ":" + value + " from " + context.getInputSplit());

        StringTokenizer itr = new StringTokenizer(value);

        while (itr.hasMoreTokens())
        {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }


}

    public static class ReducerToCassandra extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>>
{
    private ByteBuffer outputKey;

    public void reduce(Text word, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int sum = 0;

        // Text.getBytes() returns the backing array, which is only valid up to getLength()
        outputKey = ByteBuffer.wrap(Arrays.copyOf(word.getBytes(), word.getLength()));


        for (IntWritable val : values)
            sum += val.get();

        System.out.println(word.toString()+" -> "+sum);
        context.write(outputKey, Collections.singletonList(getMutation(word, sum)));

    }

    private static Mutation getMutation(Text word, int sum)
    {
        Column c = new Column();
        c.setName(Arrays.copyOf(word.getBytes(), word.getLength()));
        c.setValue(ByteBufferUtil.bytes(String.valueOf(sum)));
        c.setTimestamp(System.currentTimeMillis());

        Mutation m = new Mutation();
        m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
        m.column_or_supercolumn.setColumn(c);
        System.out.println("Mutating");
        return m;

    }

}




public static void main(String[] args) throws Exception, ClassNotFoundException, InterruptedException {

    System.out.println("Working..!");

    int ret=ToolRunner.run(new Configuration(), new test_forum(), args);

    System.out.println("Done..!");

    System.exit(ret);

}

}

These are the warnings I get:

WARN  - JobClient                  - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN  - JobClient                  - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

The code runs successfully and executes the map-reduce tasks, but I do not know where it writes the data.

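EDIT: I had not created the columnFamily for the output in Cassandra, which is why nothing was being written. So the only remaining problem is how to run it in distributed mode.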

Thank you.

score 2 · Accepted Answer

Did you create a jar containing your classes?

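Hadoop needs a jar to distribute your job classes across the cluster. If you have not built one, that explains the "No job jar file set" warning and why the job cannot run in distributed mode. Take care to launch your job with the "hadoop jar ..." command and to add your jar dependencies (at least apache-cassandra!). Your Cassandra server must be up and listening on the thrift port when you submit the job.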
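
As a minimal sketch of the setting the question asks about, assuming a Hadoop 1.x style installation (the JobClient warnings suggest one): in that generation of Hadoop, mapred.job.tracker defaults to "local", which runs the whole job inside the client JVM, and pointing it at a JobTracker is what makes the submission distributed. The host name and port below are placeholders, not values taken from the question.

```xml
<!-- $PATH_TO_HADOOP/conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- hypothetical host:port of your JobTracker; replace with your own -->
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```

The "hadoop jar ..." launcher puts this conf directory on the classpath, so a job started that way should pick the setting up without any change to the Java code.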

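By the way, Hadoop and Cassandra do not need Hector. ColumnFamilyInputFormat and ColumnFamilyOutputFormat know how to read data from (and write data to) Cassandra on their own. That is why you had to configure the RpcPort, InitialAddress and Partitioner (and you did).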

Last note: ColumnFamilyOutputFormat will not create your output column family. It must already exist, otherwise you will get an error when writing.

Hope this helps,

Benoit
