Questions tagged [accumulo]

For questions regarding Apache Accumulo, a sorted, distributed key/value store based on Google's BigTable design, built on top of Apache Hadoop, ZooKeeper, and Thrift.

0 votes
2 answers
4347 views

scala - How do I create a Spark RDD from Accumulo 1.6 in spark-notebook?

I have a Vagrant image with Spark Notebook, Spark, Accumulo 1.6, and Hadoop all running. From notebook, I can manually create a Scanner and pull test data from a table I created using one of the Accumulo examples:
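The scanner code was along these lines (a sketch; the instance name, ZooKeeper address, credentials, and table name are placeholders):

    import org.apache.accumulo.core.client.ZooKeeperInstance
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations
    import scala.collection.JavaConverters._

    val instance  = new ZooKeeperInstance("instance", "localhost:2181")
    val connector = instance.getConnector("root", new PasswordToken("password"))
    val scanner   = connector.createScanner("batchtest1", new Authorizations())

    // print the first ten entries
    scanner.asScala.take(10).foreach(println)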

Running that gives the first ten rows of table data.

When I try to create the RDD thusly:
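Sketched from the stack trace below (the bare, unconfigured Configuration is the point: nothing about the table, connector, or auths has been set on it):

    import org.apache.hadoop.conf.Configuration
    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}

    // sc is the SparkContext that spark-notebook provides
    val rdd = sc.newAPIHadoopRDD(new Configuration(),
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])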

I get an RDD returned to me that I can't do much with due to the following error:

java.io.IOException: Input info has not been set.
    at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
    at org.apache.spark.rdd.RDD.count(RDD.scala:927)

This makes sense, given that I haven't specified any parameters: which table to connect to, what the auths are, and so on.

So my question is: What do I need to do from here to get those first ten rows of table data into my RDD?

update one -- Still doesn't work, but I did discover a few things. It turns out there are two nearly identical packages,

org.apache.accumulo.core.client.mapreduce

&

org.apache.accumulo.core.client.mapred

Both have nearly identical members, except that some of the method signatures differ. I'm not sure why both exist, as there's no deprecation notice that I could see. I attempted to implement Sietse's answer, with no joy. Below is what I did, and the responses:
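Roughly, the attempt looked like this (a sketch; credentials, instance name, and table name are placeholders; note that from Scala the static config methods must be called on the classes that declare them, AbstractInputFormat and InputFormatBase):

    import org.apache.hadoop.mapred.JobConf
    import org.apache.hadoop.conf.Configuration
    import org.apache.accumulo.core.client.mapred.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.accumulo.core.data.{Key, Value}

    val jobConf = new JobConf()

    // point the old-API (mapred) input format at the Accumulo instance and table
    AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password"))
    AbstractInputFormat.setZooKeeperInstance(jobConf, "instance", "localhost:2181")
    AbstractInputFormat.setScanAuthorizations(jobConf, new Authorizations())
    InputFormatBase.setInputTableName(jobConf, "batchtest1")

    // sc is the SparkContext that spark-notebook provides
    val rdd2 = sc.hadoopRDD(jobConf, classOf[AccumuloInputFormat], classOf[Key], classOf[Value])
    rdd2.first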

import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml

Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml

rdd2: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = HadoopRDD[1] at hadoopRDD at <console>:62

java.io.IOException: Input info has not been set.
    at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
    at org.apache.accumulo.core.client.mapred.AbstractInputFormat.validateOptions(AbstractInputFormat.java:308)
    at org.apache.accumulo.core.client.mapred.AbstractInputFormat.getSplits(AbstractInputFormat.java:505)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
    at ...

edit 2

re: Holden's answer - still no joy:
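What I ran, sketched with placeholder connection details (the new-API, mapreduce flavor):

    import org.apache.hadoop.mapreduce.Job
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.accumulo.core.data.{Key, Value}

    val job = Job.getInstance()
    AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
    AbstractInputFormat.setZooKeeperInstance(job, "instance", "localhost:2181")
    AbstractInputFormat.setScanAuthorizations(job, new Authorizations())
    InputFormatBase.setInputTableName(job, "batchtest1")

    val rddX = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])
    rddX.first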

rddX: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = NewHadoopRDD[0] at newAPIHadoopRDD at <console>:58

Out[15]: NewHadoopRDD[0] at newAPIHadoopRDD at <console>:58

java.io.IOException: Input info has not been set.
    at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
    at ...

edit 3 -- progress!

I was able to figure out why the 'input info has not been set' error was occurring. The eagle-eyed among you will no doubt see that the following code is missing a closing ')':
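A sketch of the shape of the offending line (the method name is the one named below; the arguments are placeholders):

    // as typed -- the unbalanced parenthesis means the expression never closes,
    // so the notebook just keeps waiting for more input:
    //   AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
    // as intended:
    AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password"))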

As I'm doing this in spark-notebook, I'd been clicking the execute button and moving on, because I wasn't seeing an error. What I forgot was that the notebook will do what spark-shell does when you leave off a closing ')': it will wait forever for you to add it. So the error was the result of the setConnectorInfo method never getting executed.

Unfortunately, I'm still unable to get the Accumulo table data into an RDD that's usable to me. When I execute
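a count over the RDD, along the lines of this sketch (the res15 output below is consistent with a count):

    rddX.count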

I get back

res15: Long = 10000

which is the correct response; there are 10,000 rows of data in the table I pointed to. However, when I try to grab the first element of data thusly:
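Sketched (the stack trace below ends in RDD.first, so this was a first() call):

    rddX.first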

I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.accumulo.core.data.Key

Any thoughts on where to go from here?

edit 4 -- success!

The accepted answer plus comments are 90% of the way there, except that the Accumulo Key/Value need to be converted into something serializable. I got this working by invoking the .toString() method on both. I'll try to post complete working code soon, in case anyone else runs into the same issue.
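A sketch of that conversion, assuming the rddX from above:

    // Key and Value aren't Serializable in Accumulo 1.6, so turn them into
    // Strings on the executors before anything ships back to the driver
    val rows = rddX.map { case (key, value) => (key.toString, value.toString) }
    rows.first   // succeeds: a tuple of Strings serializes fine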

0 votes
1 answer
61 views

java - Setting up an Accumulo table via the API

New to Accumulo, and this may sound silly, but I'd like to know how to set up a table through the API? The documentation is definitely lacking. I've been able to find

as well as how to set locality groups:

from the documentation, but I'd like to know how to take the first approach and build a table, and then build the columns.

Thanks in advance!
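A minimal sketch of the kind of thing being asked for, against the TableOperations API (in Scala; instance name, credentials, and all table/family names are placeholders):

    import org.apache.accumulo.core.client.ZooKeeperInstance
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.hadoop.io.Text
    import java.util.Collections

    val conn = new ZooKeeperInstance("instance", "localhost:2181")
      .getConnector("root", new PasswordToken("password"))

    // create the table if it doesn't already exist
    if (!conn.tableOperations().exists("mytable"))
      conn.tableOperations().create("mytable")

    // columns are not declared up front: column families/qualifiers simply
    // appear as mutations are written. Locality groups, however, can be set:
    val groups = new java.util.HashMap[String, java.util.Set[Text]]()
    groups.put("meta", Collections.singleton(new Text("metadata")))
    conn.tableOperations().setLocalityGroups("mytable", groups)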

0 votes
1 answer
171 views

java - How do I detect when Accumulo has finished a major compaction?

I have an iterator set at the major-compaction scope, which I use for writing to a table, so whenever I want a major compaction it kicks off. But I want to query this table after the write process has finished. For that, I need to know whether the major compaction has completed and all the data has been written to the table. Is there a way for me to find that out?
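One way this is commonly handled, sketched with the table name and connector assumed: TableOperations.compact takes a wait flag, and with wait = true the call blocks until the major compaction finishes.

    // compact(table, startRow, endRow, flush, wait); null row bounds mean the whole table.
    // With wait = true this does not return until the major compaction completes.
    conn.tableOperations().compact("mytable", null, null, true, true)
    // the table can safely be queried from here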

0 votes
1 answer
288 views

java - Getting the Accumulo column family from the API?

Currently learning Accumulo, and I've noticed that I haven't found a direct call to determine an entry's column family. I need data from an Accumulo table in the format

For example:

The dots are where I'm trying to pull the data from:

So obviously the key and the value are easy to get hold of, but the "name" I mentioned, i.e. the column family name, is also very important to me.
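A sketch of reading the family off each entry's Key during a scan (scanner setup omitted; assumes a connected Scanner):

    import scala.collection.JavaConverters._

    for (entry <- scanner.asScala) {
      val key    = entry.getKey
      val family = key.getColumnFamily   // the "name", as an org.apache.hadoop.io.Text
      println(s"$family : $key -> ${entry.getValue}")
    }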

0 votes
2 answers
471 views

java - Strange error populating an Accumulo 1.6 Mutation object via spark-notebook

Using spark-notebook to update an Accumulo table, following the approach specified in the Accumulo documentation and the Accumulo example code. Below is verbatim what I put into the notebook, and the responses:

clientRqrdTble: org.apache.accumulo.core.cli.ClientOnRequiredTable = org.apache.accumulo.core.cli.ClientOnRequiredTable@6c6a18ed
bwConfig: org.apache.accumulo.core.client.BatchWriterConfig = [maxMemory=52428800, maxLatency=120000, maxWriteThreads=3, timeout=9223372036854775807]
batchWriter: org.apache.accumulo.core.client.BatchWriter = org.apache.accumulo.core.client.impl.BatchWriterImpl@298aa736

rowIdS: String = row_0736460000

mutation: org.apache.accumulo.core.data.Mutation = org.apache.accumulo.core.data.Mutation@0

java.lang.IllegalStateException
    at org.apache.accumulo.core.data.Mutation.put(Mutation.java:163)
    at org.apache.accumulo.core.data.Mutation.put(Mutation.java:211)

I dug into the code and found that the culprit is an if-check verifying whether the UnsynchronizedBuffer.Writer buffer is null. The line numbers won't line up because this is a slightly different version than what's in the 1.6 accumulo-core jar; I've looked at both, and the difference doesn't matter in this case. As far as I can tell, the object is created before that method executes and isn't being dumped.

So either I'm missing something in the code, or something else is up. Do any of you know what might cause this behavior?

update one

I've run the following code from the Scala console and via straight Java 1.8. It fails in Scala but works in Java. At this point I think this is an Accumulo issue, so I'm going to open a bug ticket and dig deeper into the source. If I come up with a solution I'll post it here.

Below is the code in Java form. There's some extra stuff in there because I wanted to make sure I could connect to the table I created using the Accumulo batch writer example:

update two

An Accumulo bug ticket has been created for this issue. Their target is to fix it in v1.7.0. Until then, the solution I provided below is a functional workaround.
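A hedged sketch of a write of this shape that works: create and fully populate the Mutation in a single notebook cell, then hand it to the BatchWriter (family, qualifier, and value are placeholders):

    import org.apache.accumulo.core.data.{Mutation, Value}
    import org.apache.hadoop.io.Text

    // rowIdS and batchWriter are the values shown in the output above
    val mutation = new Mutation(new Text(rowIdS))
    mutation.put(new Text("cf"), new Text("cq"), new Value("some value".getBytes))
    batchWriter.addMutation(mutation)
    batchWriter.flush()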

0 votes
2 answers
93 views

hadoop - Accumulo table name mapping issue

My Accumulo instance got corrupted. When I started the instance, it threw an error saying my metadata table was corrupted. So I backed up my data from the Accumulo directory in HDFS and re-initialized the instance. What I hadn't realized is that the Accumulo table names are not recorded in the data I backed up; the table names are integers. Is there somewhere that maps between the integers and the table names? Or did I screw myself by initializing and blowing away ZooKeeper?

Thanks.
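For reference, on a healthy instance the client API exposes the name-to-ID mapping; a sketch, assuming a working connector:

    import scala.collection.JavaConverters._

    // table name -> internal table ID (the integers that show up under the
    // Accumulo directory in HDFS)
    val idMap = conn.tableOperations().tableIdMap()
    idMap.asScala.foreach { case (name, id) => println(s"$name -> $id") }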

0 votes
1 answer
648 views

java - Number of rows inserted into an Accumulo table

I have inserted some rows into a table in Accumulo. Some rows are newly created and some are updates.

How do I find, in Java, the number of rows inserted or updated in an Accumulo table?

This is what is currently being done: the count is taken as the number of entries written to the table. But if only some of the entries/rows are new inserts and the others are updates, the count cannot be treated as the number of new entries put into the table.
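Counting distinct rows rather than entries can be done client-side with a scan; a sketch, with the table name and connector assumed:

    import org.apache.accumulo.core.security.Authorizations
    import scala.collection.JavaConverters._

    val scanner = conn.createScanner("mytable", new Authorizations())
    var rowCount = 0L
    var lastRow: String = null
    // entries arrive sorted by key, so a change of row ID marks a new row
    for (entry <- scanner.asScala) {
      val row = entry.getKey.getRow.toString
      if (row != lastRow) { rowCount += 1; lastRow = row }
    }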

0 votes
1 answer
247 views

java - Unable to communicate with remote Accumulo

I am trying to connect to an Accumulo instance hosted on a remote CentOS VirtualBox on a Windows host.

I haven't added the code written after this point, because I never get the conn instance and can't proceed from there.
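The connection attempt would be of this shape (a sketch; the IP, instance name, and credentials are placeholders, and the instance name must match the one chosen at accumulo init on the VM):

    import org.apache.accumulo.core.client.ZooKeeperInstance
    import org.apache.accumulo.core.client.security.tokens.PasswordToken

    // ZooKeeper's port (2181 by default) must be reachable from the Windows
    // host, e.g. via a host-only or bridged VirtualBox network
    val inst = new ZooKeeperInstance("accumulo", "192.168.56.101:2181")
    val conn = inst.getConnector("root", new PasswordToken("secret"))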

0 votes
1 answer
53 views

accumulo - How do I get the input file name in the mapper of an Accumulo program?

I am trying the wordcount example, and I want to print the name of the file in which the word was found, but I don't know how to get the name of the input split in the map function of the Accumulo program.
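For file-based input the split carries the source path; a hypothetical sketch of what the map function could check (in Scala for consistency with the other sketches; context is the Mapper's Context, and note that Accumulo's own input format yields a RangeInputSplit, which has no file to report):

    // inside Mapper.map(...), via the task context
    context.getInputSplit match {
      case fs: org.apache.hadoop.mapreduce.lib.input.FileSplit =>
        println(s"word came from ${fs.getPath.getName}")
      case other =>
        println(s"non-file split: $other")
    }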

0 votes
1 answer
75 views

search - The input parameter to setRange() in Accumulo

I have code like the following:

My question is: does anyone know what the 'input' part of 'new Range(input)' should be? Is it an input RowId?
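For reference, Range has constructors for both a single row and a span of rows; a sketch with placeholder row IDs, where new Range(input) treats input as one row ID covering every column in that row:

    import org.apache.accumulo.core.data.Range

    scanner.setRange(new Range("row_0000001"))                 // just the row with this ID
    scanner.setRange(new Range("row_0000001", "row_0000010"))  // all rows in [start, end]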