我目前正在使用 Hadoop 0.20.2 和旧 API。我想做一个地图侧连接。我有一个图形数据集,它由两个文件组成,一个带有边,另一个带有节点。边的格式为“目标源标签”,节点为“nodeID '-' label 0”,并按第一个值排序。
小型玩具示例(6 个节点和 8 个边)一切正常,但是当我扩大规模时,它只会加入输入文件的第一部分。我找不到原因,因为一切都应该是正确的格式和排序。我错过了什么吗?
对于“调试”,我只是打印了带有在地图上缩进的 TupleWritable 中的值的键。可以在这里看到,http ://pastebin.com/rcNx2r5c 。
节点文件为 9416 字节,边缘文件为 15797 字节,并在结果中完全输出。但是在键 98 之后连接停止。然后它首先输出边,然后输出节点,但是节点和边都有键 99。
我对 CompositeInputFormat 的工作设置:
conf.setInputFormat(CompositeInputFormat.class);
Path[] input = new Path[] { inputNodes, inputEdges};
conf.set("key.value.separator.in.input.line", "\t");
conf.set("mapred.join.expr", CompositeInputFormat.compose(
"outer", KeyValueTextInputFormat.class,
input )
);
任何帮助将不胜感激,所以提前致谢!
编辑:我已经解决了这个问题。对于那些感兴趣的人;问题是 KeyValueTextInputFormat。它具有作为 Text 的键和值,我应该有 LongWritable 键和 Text 值。虽然我认为这不是问题,但它似乎失败了。所以我根据KeyValueTextInputFormat制作了自己的输入格式。
public class KeyValueLongInputFormat extends FileInputFormat<LongWritable, Text> implements JobConfigurable {
private CompressionCodecFactory compressionCodecs = null;
@Override
public void configure(JobConf conf) {
compressionCodecs = new CompressionCodecFactory(conf);
}
protected boolean isSplitable(FileSystem fs, Path file) {
return compressionCodecs.getCodec(file) == null;
}
@Override
public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job, Reporter reporter)
throws IOException {
reporter.setStatus(genericSplit.toString());
return new KeyValueLongLineRecordReader(job, (FileSplit) genericSplit);
}
}
和
public class KeyValueLongLineRecordReader implements RecordReader<LongWritable, Text> {
private final LineRecordReader lineRecordReader;
private byte separator = (byte) '\t';
private LongWritable dummyKey;
private Text innerValue;
public Class getKeyClass() {
return LongWritable.class;
}
public LongWritable createKey() {
return new LongWritable();
}
public Text createValue() {
return new Text();
}
public KeyValueLongLineRecordReader(Configuration job, FileSplit split) throws IOException {
lineRecordReader = new LineRecordReader(job, split);
dummyKey = lineRecordReader.createKey();
innerValue = lineRecordReader.createValue();
String sepStr = job.get("key.value.separator.in.input.line", "\t");
this.separator = (byte) sepStr.charAt(0);
}
public static int findSeparator(byte[] utf, int start, int length, byte sep) {
for (int i = start; i < (start + length); i++) {
if (utf[i] == sep) {
return i;
}
}
return -1;
}
/** Read key/value pair in a line. */
public synchronized boolean next(LongWritable key, Text value) throws IOException {
LongWritable tKey = key;
Text tValue = value;
byte[] line = null;
int lineLen = -1;
if (lineRecordReader.next(dummyKey, innerValue)) {
line = innerValue.getBytes();
lineLen = innerValue.getLength();
} else {
return false;
}
if (line == null)
return false;
int pos = findSeparator(line, 0, lineLen, this.separator);
if (pos == -1) {
tKey.set(Long.valueOf(new String(line, 0, lineLen)));
tValue.set("");
} else {
int keyLen = pos;
byte[] keyBytes = new byte[keyLen];
System.arraycopy(line, 0, keyBytes, 0, keyLen);
int valLen = lineLen - keyLen - 1;
byte[] valBytes = new byte[valLen];
System.arraycopy(line, pos + 1, valBytes, 0, valLen);
tKey.set(Long.valueOf(new String(keyBytes)));
tValue.set(valBytes);
}
return true;
}
public float getProgress() {
return lineRecordReader.getProgress();
}
public synchronized long getPos() throws IOException {
return lineRecordReader.getPos();
}
public synchronized void close() throws IOException {
lineRecordReader.close();
}
}
解决了我的问题。