
I have configured an Apache Hadoop 2 cluster with HA/automatic failover on CentOS 6.5 (64-bit), and installed Flume 1.5 (apache-flume-1.5.0-bin.tar.gz). I want to analyze Twitter data using Flume/Hive with some keyword filtering. Here are the hadoop2 configuration file contents (important properties only).

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

hdfs-site.xml

<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value><final>true</final></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn1</name><value>nn1.mycluster1.com:50070</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn2</name><value>nn2.mycluster1.com:50070</value></property>

Here are the Flume configuration file contents:

flume-env.sh

JAVA_HOME=/usr/java/jdk1.7.0_60
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

twitter.conf

# Name the components on this agent
TwitterAgent.sources = Twitter
TwitterAgent.sinks = HDFS
TwitterAgent.channels = MemChannel

# Describe/configure the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = **************
TwitterAgent.sources.Twitter.consumerSecret = **********
TwitterAgent.sources.Twitter.accessToken = **************
TwitterAgent.sources.Twitter.accessTokenSecret = **************

TwitterAgent.sources.Twitter.maxBatchSize = 1000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 1000

TwitterAgent.sources.Twitter.keywords=hadoop, big data, analytics, bigdata, cloudera, data science, mapreduce, mahout, nosql

TwitterAgent.sources.Twitter.bind = localhost
TwitterAgent.sources.Twitter.port = 44444

# Describe the sink
TwitterAgent.sinks.HDFS.type = logger
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path=/user/flume/tweets/20140814/1_55
TwitterAgent.sinks.HDFS.fileType = DataStream
TwitterAgent.sinks.HDFS.writeFormat = Text
TwitterAgent.sinks.HDFS.batchSize = 100
TwitterAgent.sinks.HDFS.rollSize = 0
TwitterAgent.sinks.HDFS.rollCount = 100
TwitterAgent.sinks.HDFS.rollInterval = 100

# Use a channel which buffers events in memory
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

I am running the following command:

flume-ng agent --conf conf --conf-file conf/twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console

I have the following questions/problems:

  • a) It seems the keyword filtering is not working. Have I set a wrong property in the configuration file?
  • b) The process does not write any files to /user/flume/tweets/20140814/1_55 on HDFS.
  • c) The access level of my Twitter API access token is read-only. Do I need read-write permissions?
  • d) Is the hdfs.path style I used in twitter.conf correct?
  • e) The process keeps running and never stops; I am not sure on what criteria it would stop.

It keeps printing the following output.

14/08/14 03:58:14 INFO twitter.TwitterSource: Processed 45,000 docs
14/08/14 03:58:14 INFO twitter.TwitterSource: Total docs indexed: 45,000, total skipped docs: 0
14/08/14 03:58:14 INFO twitter.TwitterSource:     53 docs/second
14/08/14 03:58:14 INFO twitter.TwitterSource: Run took 846 seconds and processed:
14/08/14 03:58:14 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
14/08/14 03:58:14 INFO twitter.TwitterSource:     11.111 MB text sent to index
14/08/14 03:58:14 INFO twitter.TwitterSource: There were 0 exceptions ignored:
14/08/14 03:58:14 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:15 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:16 INFO twitter.TwitterSource: Processed 45,100 docs
14/08/14 03:58:16 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:17 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:18 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:18 INFO twitter.TwitterSource: Processed 45,200 docs
14/08/14 03:58:19 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:20 INFO twitter.TwitterSource: Processed 45,300 docs
14/08/14 03:58:20 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:21 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:22 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:22 INFO twitter.TwitterSource: Processed 45,400 docs
14/08/14 03:58:23 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:24 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:24 INFO twitter.TwitterSource: Processed 45,500 docs
14/08/14 03:58:25 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:26 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:26 INFO twitter.TwitterSource: Processed 45,600 docs
14/08/14 03:58:27 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:28 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:28 INFO twitter.TwitterSource: Processed 45,700 docs
14/08/14 03:58:29 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:30 INFO twitter.TwitterSource: Processed 45,800 docs
14/08/14 03:58:30 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:31 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:32 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:32 INFO twitter.TwitterSource: Processed 45,900 docs
14/08/14 03:58:33 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:34 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
14/08/14 03:58:34 INFO twitter.TwitterSource: Processed 46,000 docs
14/08/14 03:58:34 INFO twitter.TwitterSource: Total docs indexed: 46,000, total skipped docs: 0
14/08/14 03:58:34 INFO twitter.TwitterSource:     53 docs/second
14/08/14 03:58:34 INFO twitter.TwitterSource: Run took 867 seconds and processed:
14/08/14 03:58:34 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
14/08/14 03:58:34 INFO twitter.TwitterSource:     11.36 MB text sent to index
14/08/14 03:58:34 INFO twitter.TwitterSource: There were 0 exceptions ignored:

Can anyone help me figure out what I am missing?

Should I rebuild Flume with Maven before using it for this task?


2 Answers


There is no need to grant read-write permission to the Twitter API access token; read-only is enough. The way you are using the hdfs.path style is also correct.

To fix the main problem (no files being written to HDFS), make the following changes:

Changes in the conf/twitter.conf file

  • a) Replace the following line:

TwitterAgent.sinks.HDFS.type = logger

with this line:

TwitterAgent.sinks.HDFS.type = hdfs

  • b) Comment out the Cloudera class line:

#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

and use the Apache class instead:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

Changes in flume-env.sh

Comment out the following line (there is no need to set this value):

#FLUME_CLASSPATH=""

Set appropriate values for the following HDFS sink properties; a corrected sketch follows the list below:

hdfs.filePrefix         
hdfs.fileSuffix         
hdfs.inUsePrefix        
hdfs.inUseSuffix        
hdfs.rollInterval       
hdfs.rollSize           
hdfs.rollCount          
hdfs.idleTimeout        
hdfs.batchSize          
hdfs.fileType   
hdfs.maxOpenFiles   
hdfs.minBlockReplicas   
hdfs.writeFormat    
hdfs.callTimeout    
hdfs.threadsPoolSize    
hdfs.rollTimerPoolSize  
hdfs.kerberosPrincipal  
hdfs.kerberosKeytab 
hdfs.proxyUser  
hdfs.round  
hdfs.roundValue 
hdfs.roundUnit  
hdfs.timeZone   
hdfs.useLocalTimeStamp  
hdfs.closeTries 
hdfs.retryInterval  
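
Putting changes a) and b) together, here is a minimal sketch of a corrected sink section. It is illustrative only: the path, roll values, and batch size are assumptions carried over from the question. Also note that every sink property needs the hdfs. prefix; the original twitter.conf omitted it on fileType, writeFormat, batchSize, and the roll settings.

# HDFS sink: actually write events to HDFS instead of logging them
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
# With HA, the logical cluster name from fs.defaultFS can be used in the path
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://mycluster/user/flume/tweets/%Y%m%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 100
# The %Y%m%d escape needs an event timestamp; take it from the local clock
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true

After restarting the agent, files should start appearing under the path; you can verify with hdfs dfs -ls /user/flume/tweets/.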

For more details, see the Flume User Guide:

https://flume.apache.org/FlumeUserGuide.html

Answered 2014-08-15T21:30:43.347

Here you can only see that events are being processed; you cannot see the actual JSON files or JSON strings those events carry. That is probably because the sink is "logger", which logs everything according to your log4j pattern. To fix this:

In your log4j.properties, change the configuration accordingly:

flume.root.logger=ALL,LOGFILE
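
Equivalently, you can override the logger per run on the command line instead of editing log4j.properties; a sketch, assuming the stock Flume conf/log4j.properties with its LOGFILE appender:

flume-ng agent --conf conf --conf-file conf/twitter.conf \
  --name TwitterAgent -Dflume.root.logger=ALL,LOGFILE

This sends the full event output to the log file (flume.log by default) rather than flooding the console.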

Cheers!

Answered 2014-10-01T16:11:51.353