
I'm using the following code to read bytes from a file:

FileSystem fs = config.getHDFS();
try {
    Path path = new Path(dirName + '/' + fileName);
    byte[] bytes = new byte[(int) fs.getFileStatus(path).getLen()];
    in = fs.open(path);
    in.read(bytes);
    result = new DataInputStream(new ByteArrayInputStream(bytes));
} catch (Exception e) {
    e.printStackTrace();
    if (in != null) {
        try {
            in.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }
}

There are around 15,000 files in the directory I'm reading from. After a certain point, I get this exception on the in.read(bytes) line:

2012-05-31 14:11:45,477 [INFO:main] (DFSInputStream.java:414) - Failed to connect to /165.36.80.28:50010, add to deadNodes and continue
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Status.read(DataTransferProtocol.java:115)
        at org.apache.hadoop.hdfs.BlockReader.newBlockReader(BlockReader.java:427)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:725)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:390)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)
        at java.io.DataInputStream.read(DataInputStream.java:83)

The other exception that gets thrown is:

2012-05-31 15:09:14,849 [INFO:main] (DFSInputStream.java:414) - Failed to connect to /165.36.80.28:50010, add to deadNodes and continue
java.net.SocketException: No buffer space available (maximum connections reached?): connect
    at sun.nio.ch.Net.connect(Native Method)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
    at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:719)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:390)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:514)
    at java.io.DataInputStream.read(DataInputStream.java:83)

Please advise what the problem might be.


1 Answer


You're ignoring the return value from in.read and assuming you can read the whole file in one go. Don't do that. Loop until either read returns -1 or you've read as much data as you wanted. It's also not clear whether you should really trust getLen() like this - what happens if the file grows (or shrinks) between the two calls?
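For illustration, here's a minimal sketch of that loop with the pre-sized array (it keeps the question's getLen()-based sizing, with the caveat above; EOFException is java.io.EOFException):

int offset = 0;
while (offset < bytes.length) {
    // read returns how many bytes were actually read, or -1 at end of stream
    int n = in.read(bytes, offset, bytes.length - offset);
    if (n == -1) {
        throw new EOFException("Stream ended after " + offset + " of " + bytes.length + " bytes");
    }
    offset += n;
}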

I would suggest creating a ByteArrayOutputStream to write into and a small (16K?) buffer as temporary storage, then looping: read into the buffer, write that many bytes into the output stream, lather, rinse, repeat until read returns -1 to signal the end of the stream. Then you can get the data out of the ByteArrayOutputStream and put it into a ByteArrayInputStream as before.

EDIT: Quick code, untested - there is similar (better) code in Guava, by the way.

public static byte[] readFully(InputStream stream) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buffer = new byte[16 * 1024]; // small temporary buffer
    int bytesRead;
    // Keep copying until read signals end of stream
    while ((bytesRead = stream.read(buffer)) > 0) {
        baos.write(buffer, 0, bytesRead);
    }
    return baos.toByteArray();
}

Then just use:

in = fs.open(path);
byte[] data = readFully(in);
result = new DataInputStream(new ByteArrayInputStream(data));
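For comparison, a sketch of the same flow using Guava's ByteStreams.toByteArray instead of the hand-rolled helper (assuming Guava is on the classpath; the import is com.google.common.io.ByteStreams):

in = fs.open(path);
byte[] data = ByteStreams.toByteArray(in); // reads until end of stream, as above
result = new DataInputStream(new ByteArrayInputStream(data));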

Also note that you should close your stream in a finally block, not just in the exceptional case. I'd also advise against catching Exception itself.
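For example, a sketch of the question's code restructured that way (fs, path, and result are as in the question; readFully is the helper above):

FSDataInputStream in = null;
try {
    in = fs.open(path);
    byte[] data = readFully(in);
    result = new DataInputStream(new ByteArrayInputStream(data));
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Close unconditionally, on success or failure
    if (in != null) {
        try {
            in.close();
        } catch (IOException closeException) {
            closeException.printStackTrace();
        }
    }
}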

Answered 2012-05-31T19:11:43.303