Questions tagged [sequencefile]
hadoop - Converting a space-separated file (one vector per line) into a SequenceFile
I created a large text file (4 GB) as shown below.
Each line describes one vector, and each column holds one element of that vector. The elements are separated by single spaces.
Now I want to run K-Means clustering over all the vectors using Apache Mahout, but I get the error "not a SequenceFile".
How can I create a file whose format meets Mahout's requirements?
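A minimal sketch of one way to do the conversion, assuming Mahout's `VectorWritable` and illustrative file names: read each line, split it on spaces, and append a `Text`/`VectorWritable` pair to a `SequenceFile`, which is the input format Mahout's K-Means driver reads.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class TextToVectorSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("vectors.seq"); // illustrative output path
        try (BufferedReader in = new BufferedReader(new FileReader("vectors.txt"));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(out),
                     SequenceFile.Writer.keyClass(Text.class),
                     SequenceFile.Writer.valueClass(VectorWritable.class))) {
            String line;
            long i = 0;
            while ((line = in.readLine()) != null) {
                String[] cols = line.trim().split(" ");
                double[] elems = new double[cols.length];
                for (int j = 0; j < cols.length; j++) {
                    elems[j] = Double.parseDouble(cols[j]);
                }
                // One record per input line: key = line number,
                // value = the vector itself.
                writer.append(new Text(Long.toString(i++)),
                              new VectorWritable(new DenseVector(elems)));
            }
        }
    }
}
```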
hadoop - How do I copy the output of the -text HDFS command into another file?
Is there a way, using an HDFS command, to copy the text content of an HDFS file into a file on another file system:
Can I use -cat, or any other method, to print the output of -text into another file?:
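On the shell side, redirecting standard output already does this: hdfs dfs -text /path/file > local.txt. For a programmatic route, here is a minimal Java sketch (paths are illustrative) that streams an HDFS file into a local file:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsToLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Byte-for-byte copy of the HDFS file into a local file. Note this
        // does not decompress or decode the content the way -text does.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             OutputStream out = new FileOutputStream("/tmp/input-copy.txt")) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```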
hadoop - SequenceFiles and Hadoop streaming
I have a use case in which I use Hadoop streaming to run an executable as the map process. On the input side I have a large number of sequence files. Each seq file has, say, 8 keys with corresponding values, which are lists of float arrays. Instead of letting one map process handle one seq file, I would prefer to allocate a group of seq files to each map process. Hence, I decided to merge all those seq files into one large file. Assume this big seq file is made up of 50,000 small seq files.
Now, is it possible to configure my Hadoop streaming utility to allocate a portion of the seq file to each map process?
How do I make each map process get the list of file names it needs to process, and how can I retrieve this information in my map executable? The executable is a plain Groovy script designed to process stdin. In that case, what will my stdin look like (how do I determine the key/value pairs, and what will their contents be)? Or, since I merged the sequence files into one big file, have they lost their individual identities, meaning I cannot get their filenames and instead have to work with the sequence files' keys/values directly?
I think this big seq file will have key/value pairs where the key is a filename and the value is the contents of that file, which in turn contains 8 keys and their corresponding values. If that is the case, when Hadoop splits this big file according to the number of maps possible (let's say 10 maps are possible in my cluster), each map would get around 5,000 keys and their corresponding values. Then, in my map executable, how can I access these keys and values?
Any hint would greatly help.
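For reference, here is a minimal sketch (file name and Writable types are assumptions) of how the merged file's key/value pairs can be walked on the Java side. When the same file is consumed via streaming with -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat, the Groovy script sees one line per record on stdin: the key's text form, a tab, then the value's text form.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("merged.seq"); // illustrative path
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            // Instantiate key/value holders of whatever types the file declares.
            Writable key = (Writable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}
```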
hadoop - Appending to an existing sequence file
In my use case I need a way to append key/value pairs to an existing sequence file. How can I do that? Any clue would greatly help. I am using Hadoop 2.x.
Also, I came across the following documentation. Can anyone tell me how to use it to append?
public static org.apache.hadoop.io.SequenceFile.Writer createWriter(FileContext fc, Configuration conf, Path name, Class keyClass, Class valClass, org.apache.hadoop.io.SequenceFile.CompressionType compressionType, CompressionCodec codec, org.apache.hadoop.io.SequenceFile.Metadata metadata, EnumSet<CreateFlag> createFlag, org.apache.hadoop.fs.Options.CreateOpts... opts) throws IOException
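A minimal sketch of appending, assuming Hadoop 2.6.1 or later, where SequenceFile.Writer gained an appendIfExists option (HADOOP-7139); the path and key/value types are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class AppendToSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("existing.seq"); // illustrative path
        // appendIfExists(true) reopens the file for appending rather than
        // overwriting it; the key/value classes must match the existing file.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.appendIfExists(true))) {
            writer.append(new Text("new-key"), new IntWritable(42));
        }
    }
}
```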
java - How do I pass Hadoop sequence file values to a Jackson parser?
I have a problem and I really don't know what to do about it. I have a Hadoop sequence file containing links to web pages. For each entry of the sequence file, the key is a web page's URL and the value is its attributes and links. The value is actually in JSON format. I want to read the whole sequence file and pass each value to a Jackson parser to extract the links, but it always fails. Here is my code:
The file "metadata-00000" is the original Hadoop sequence file. As you can see, the value really is in JSON format, and I want to analyze it with the Jackson parser. However, this line always fails:
The exception is:
So how should I handle this? How can I get a Writable value into the JSON parser? Thanks!
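This failure mode often comes from handing Jackson the Writable object or its serialized bytes rather than the decoded string. A minimal sketch, assuming the values are Text and reusing the file name from the question: convert each value with toString() and parse the result.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SeqFileToJson {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        ObjectMapper mapper = new ObjectMapper();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(new Path("metadata-00000")))) {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                // Decode the Writable to a String before parsing it as JSON.
                // The "links" field name is an assumption for illustration.
                JsonNode root = mapper.readTree(value.toString());
                System.out.println(key + " -> " + root.path("links"));
            }
        }
    }
}
```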
java - I'm in trouble with K-Means using MapReduce (modified)
I think my code is not wrong, but it doesn't work correctly. This is K-Means clustering using MapReduce. (https://github.com/30stm/K-Means-using-mapreduce/tree/master)
Make a dataset using DatasetWriter.java and centroids using CreateCentroids.java, then execute KMeansClusteringJob.java.
This code works on the first iteration, but it doesn't work from the second iteration onward. I checked the map function and the reduce function, and I think the problem is in the reduce function. (The map function finds the closest centroid for each point; the reduce function calculates the new centroids and replaces the old ones.) After the first iteration, cen.seq (the centroid file) is incomplete.
Somebody help me ;)
P.S.: I wrote a question about the reduce code; my original problem is this one.
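For comparison, a minimal sketch of what the reduce step typically looks like (class name and value encoding are assumptions, not the poster's code): it averages all points assigned to a centroid and emits the new centroid for the next iteration.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes 2-D points are shipped as comma-separated "x,y" Text values keyed
// by centroid id; real implementations often use custom Writables instead.
public class KMeansReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text centroidId, Iterable<Text> points, Context ctx)
            throws IOException, InterruptedException {
        double sumX = 0, sumY = 0;
        long count = 0;
        for (Text p : points) {
            String[] xy = p.toString().split(",");
            sumX += Double.parseDouble(xy[0]);
            sumY += Double.parseDouble(xy[1]);
            count++;
        }
        // The new centroid is the mean of all points in the cluster.
        ctx.write(centroidId, new Text((sumX / count) + "," + (sumY / count)));
    }
}
```

If the reducer's math is right, it is worth checking how cen.seq is rewritten between iterations: emitting centroids in a different format than the next map phase expects, or replacing the file before all reduce output is committed, would produce exactly the "incomplete after the first iteration" symptom described above.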
hadoop - Writing to a SequenceFile with Pig fails
I want to store some Pig variables into a Hadoop SequenceFile in order to run an external MapReduce job on them.
Suppose my data has a (chararray, int) schema:
I wrote this store function:
And this Pig script:
However, the store fails and I get this error:
Is there any way to solve this?
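For reference, a minimal sketch of a StoreFunc for a (chararray, int) schema (the class name is illustrative; this is not the poster's failing code). A common cause of this kind of store failure is a mismatch between the output key/value classes declared on the job and what putNext actually writes, so the sketch declares them explicitly:

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class SequenceFileStorage extends StoreFunc {
    private RecordWriter<Text, IntWritable> writer;

    @Override
    public OutputFormat getOutputFormat() {
        return new SequenceFileOutputFormat<Text, IntWritable>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
        // Declare the key/value classes the SequenceFile will contain.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }

    @Override
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            // Field 0 is the chararray, field 1 the int, per the schema above.
            writer.write(new Text((String) t.get(0)),
                         new IntWritable((Integer) t.get(1)));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```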
hadoop - Is it possible to check whether a file on HDFS is a SequenceFile without (mis)using exceptions?
I want to read specific SequenceFile content from HDFS in a client application. I can do that with a SequenceFile.Reader, and it works fine. But is it also possible to check whether a file is a SequenceFile at all, other than by analyzing the IOExceptions thrown when it is some other kind of file?
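One exception-free approach is to peek at the file header: every SequenceFile starts with the 3-byte magic "SEQ" followed by a version byte. A minimal sketch (the helper name is illustrative):

```java
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeqFileCheck {
    /** Returns true if the file starts with the SequenceFile magic "SEQ". */
    public static boolean isSequenceFile(FileSystem fs, Path path)
            throws IOException {
        byte[] header = new byte[3];
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(header);
        } catch (EOFException e) {
            return false; // shorter than the magic header, not a SequenceFile
        }
        return header[0] == 'S' && header[1] == 'E' && header[2] == 'Q';
    }
}
```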
hadoop - Converting a text file to sequence format in Spark Java
In Spark Java, how can I convert a text file into a sequence file? The following is my code:
I got the error below.
Does anyone have any ideas? Thanks!
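A minimal sketch of one way to do this with Spark's Java API (paths are illustrative; this is not the poster's failing code): pair each line with a NullWritable key and save via SequenceFileOutputFormat. Creating the Writables inside the lambda, rather than capturing them in the closure, avoids the NotSerializableException that Hadoop Writables otherwise cause:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class TextToSequenceFile {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TextToSequenceFile");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Build (NullWritable, Text) pairs; the Writables are created per
        // record inside the lambda, not captured from the driver.
        JavaPairRDD<NullWritable, Text> pairs =
            sc.textFile("hdfs:///data/input.txt")
              .mapToPair(line -> new Tuple2<>(NullWritable.get(), new Text(line)));
        pairs.saveAsHadoopFile("hdfs:///data/output-seq",
            NullWritable.class, Text.class, SequenceFileOutputFormat.class);
        sc.stop();
    }
}
```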
c# - Sequence file format in Hadoop
Is there any option to write a Hadoop Distributed File System file as a sequence file using C# code? If so, could you suggest a link or other details?