hadoop - How to use Flume to preprocess data at the source and keep the original file name in the HDFS sink

Question

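I am new to Apache Flume and I am having trouble understanding how it works. To explain my problem, I will describe what I need and what I have done so far.

I want to set up a stream between a directory of csv files (a new file is produced every 5 minutes) and an HDFS cluster.

I determined that a "spooling directory" source and an HDFS sink are what I need. That gave me this flume.conf file: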

agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = hdfsSink

# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = spooldir
agent.sources.seqGenSrc.spoolDir = /home/user/data

# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://localhost/Flume/data
agent.sinks.hdfsSink.hdfs.fileType = DataStream

agent.sinks.hdfsSink.hdfs.writeFormat = Text

#Specify the channel the sink should use
agent.sinks.hdfsSink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100

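The result is that each input file is renamed with a ".complete" suffix on my local file system, and the data is uploaded to HDFS under a new name which, I guess, is unique and generated by Flume.

This is almost what I need.

But before uploading, I want to perform some file-specific operations (remove the header, escape commas, and so on). I don't know how to do this. I considered using an interceptor, but once the data is in Flume it has been converted into events and is streamed; at that point there is no knowledge of the file.

In addition, the original event time is written in the file name, so I would like that time to be associated with my events rather than the current date.

I also want to keep the original file name in HDFS (it contains some useful information).

Does anyone have suggestions that could help me?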

score 1 · Accepted Answer

The original file name can be kept as an event header if you specify

agent.sources.seqGenSrc.fileHeader=true

It can then be retrieved in your sink.

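If you want to manipulate the data in the files, use an interceptor. Keep in mind that an event is essentially one line of a file in the spooling directory.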
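As a rough illustration of the interceptor approach, the fragment below chains two of Flume's built-in interceptors on the source from the question. It is only a sketch: the interceptor names (dropHeader, fixSeparators), the header regex and the replacement rule are hypothetical placeholders, and both interceptors look only at a single event body (one line), not at the file as a whole.

```properties
# Sketch only: interceptor names and patterns below are assumptions, not from the question.
agent.sources.seqGenSrc.interceptors = dropHeader fixSeparators

# Drop any line that looks like the CSV header (hypothetical column names).
agent.sources.seqGenSrc.interceptors.dropHeader.type = regex_filter
agent.sources.seqGenSrc.interceptors.dropHeader.regex = ^id,timestamp,
agent.sources.seqGenSrc.interceptors.dropHeader.excludeEvents = true

# Rewrite characters inside each remaining line (placeholder rule: ';' becomes ',').
agent.sources.seqGenSrc.interceptors.fixSeparators.type = search_replace
agent.sources.seqGenSrc.interceptors.fixSeparators.searchPattern = ;
agent.sources.seqGenSrc.interceptors.fixSeparators.replaceString = ,
```

Anything that needs real per-file context, such as deriving an event timestamp from the file name, would probably require a custom interceptor that reads the file-name header instead of the body.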

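Last but not least, you need to use the header added by the fileHeader property (its key is "file" by default) to route events back to the correct file. This can be done by referencing it in the sink path, as follows: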

agent.sinks.hdfsSink.hdfs.path = hdfs://localhost/Flume/data/%{file}
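One thing to watch for: according to the Flume user guide, the header written by fileHeader holds the absolute path of the source file, so %{file} expands to something like /home/user/data/<name> under the sink path. If only the bare file name is wanted, a variant along these lines may be closer to the goal. It assumes a Flume release whose spooling directory source supports the basenameHeader property, which stores the file's base name under the key "basename" by default.

```properties
# Sketch, assuming the spooling directory source supports basenameHeader.
# Store only the file's base name (not the absolute path) in the "basename" header.
agent.sources.seqGenSrc.basenameHeader = true

# One HDFS directory per original file, named after that file.
agent.sinks.hdfsSink.hdfs.path = hdfs://localhost/Flume/data/%{basename}
```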

You can further configure the file name with a prefix and a suffix:

Name              Default     Description
hdfs.filePrefix   FlumeData   Name prefixed to files created by Flume in the HDFS directory
hdfs.fileSuffix   –           Suffix to append to the file name (e.g. .avro; NOTE: the period is not added automatically)
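Putting the two properties together, a sink configuration might look like the sketch below. It rests on two assumptions that should be checked against the Flume version in use: that the source supports basenameHeader, and that hdfs.filePrefix accepts the same %{header} escapes as hdfs.path. The .csv suffix is simply chosen to match the input files.

```properties
# Sketch under the assumptions stated above; verify against your Flume version.
agent.sources.seqGenSrc.basenameHeader = true

agent.sinks.hdfsSink.hdfs.path = hdfs://localhost/Flume/data
# Start each HDFS file name with the original file's base name.
agent.sinks.hdfsSink.hdfs.filePrefix = %{basename}
# The leading period must be written explicitly; Flume does not add it.
agent.sinks.hdfsSink.hdfs.fileSuffix = .csv
```

Flume still inserts its own numeric counter between the prefix and the suffix, so the HDFS name starts with the original file name but is not identical to it.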
