hadoop - Hadoop Streaming MapReduce on Flume sink files - FileNotFoundException

Question

I am getting the following exception:

java.io.FileNotFoundException: File does not exist: /log1/20131025/2013102509_at1.1382659200021.tmp
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
...

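This happens while the MR job is running. Flume renames the file from xxx.tmp to xxx, so the MR task can no longer find the file (it is still trying to read xxx.tmp) and throws the error.

I don't know how to avoid the FileNotFoundException.

I am running the MR job through Hadoop Streaming ($ hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar ...).

Is there an option to exclude the xxx.tmp files?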

score 1 · Accepted Answer

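I had the same problem, and I solved it by adding these HDFS sink settings to my Flume configuration file:

hdfs.inUsePrefix = .
hdfs.inUseSuffix = .temp

I used "." as the value of hdfs.inUsePrefix so that files are hidden from my Hive queries while they are still being streamed.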
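The leading dot works, as far as I understand it, because Hadoop's FileInputFormat by default skips input files whose names start with "_" or ".", so in-use files named this way are not handed to the job. Below is a minimal Python sketch of that rule. The function names `is_hidden` and `visible_inputs` are made up for illustration, and the second and third file names are invented; the code only mimics the filter and is not Hadoop code.

```python
import posixpath


def is_hidden(path):
    """Return True if the file name starts with '_' or '.'.

    This mirrors the naming rule assumed above for the default
    hidden-file filter; it is an illustration, not Hadoop's code.
    """
    name = posixpath.basename(path)
    return name.startswith(("_", "."))


def visible_inputs(paths):
    """Keep only the paths a job would read under that rule."""
    return [p for p in paths if not is_hidden(p)]


if __name__ == "__main__":
    listing = [
        # closed file: read by the job
        "/log1/20131025/2013102509_at1.1382659200021",
        # in use, with inUsePrefix "." and inUseSuffix ".temp": skipped
        "/log1/20131025/.2013102510_at1.1382662800021.temp",
        # in use, default ".tmp" suffix only: still picked up
        "/log1/20131025/2013102511_at1.1382666400021.tmp",
    ]
    for p in visible_inputs(listing):
        print(p)
```

Note that a suffix alone (the third entry) does not hide the file; it is the prefix that keeps the in-use file out of the input listing.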

Problem

I noticed that when running a SELECT query in Hive on external tables whose locations have files being streamed into them, I got errors like this:

java.io.FileNotFoundException: File does not exist: hdfs://hmaster:9000/data/etl/sdp/statistics/ppasinterface/some/path/to/a/partition/some_files.tmp

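Checking the Flume log files showed that the rename of some_file.tmp to some.file was the cause of the failure.

You can refer to the book "Using Flume" by Hari Shreedharan [pages 177/178 if using the epub], and also see http://flume.apache.org/FlumeUserGuide.html#hdfs-sink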

score 0 · Accepted Answer

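The way I solved this is through partitioning. Suppose I want to run Hadoop queries on data coming from Flume: Flume publishes the data with a timestamp (you can use the timestamp interceptor and configure the sink to use the date in the target directory).

After that, you make sure you never read the latest partition (for example, the most recent day). Personally, I keep a master store somewhere completely different, and I periodically aggregate the data written by the HDFS sink by reading the partition for the previous period.

For example, Flume puts events in the folder 2013-10-27-01 because this is the 1 AM data, and I want to process it hourly. At 2 AM I run a Hadoop job that moves this data to the master store, but only this folder; I do not read from 2013-10-27-02, which is the folder Flume is writing to now (at 2 AM).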
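The "read only the previous period" rule can be sketched in Python as below. This is an illustration under assumptions, not my actual job: the function name `last_closed_partition` is hypothetical, the folder format `%Y-%m-%d-%H` is inferred from the 2013-10-27-01 example, and the base path is a placeholder. It also assumes Flume has closed every file in the previous hour's folder by the time the job runs.

```python
from datetime import datetime, timedelta


def last_closed_partition(now, base="/path/to/target", fmt="%Y-%m-%d-%H"):
    """Return the folder for the hour before `now`.

    The folder for the current hour is still being written by Flume,
    so the job should only be given the previous hour's folder.
    """
    # Truncate to the start of the current hour, then step back one hour.
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    previous_hour = current_hour - timedelta(hours=1)
    return f"{base}/{previous_hour.strftime(fmt)}"


if __name__ == "__main__":
    # Path to pass as the job input when run from an hourly scheduler.
    print(last_closed_partition(datetime.now()))
```

A job started at 2:05 AM on 2013-10-27 would get /path/to/target/2013-10-27-01 as its input and leave 2013-10-27-02 alone.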

Example flume.conf

...
agent.sources.avroSource.interceptors = timestamp
agent.sources.avroSource.interceptors.timestamp.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
...
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /path/to/target/%y-%m-%d/
...
