hdfs - Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

Question

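I ran into a small problem while trying out Cloudera 5.4.2. I was following this article:

Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm

It uses Flume and Twitter streaming to fetch tweets for data analysis. Everything went smoothly: I created the Twitter app, created the directory on HDFS, configured Flume, started fetching data, and created a schema on top of the tweets.

Then the problem appeared. The Twitter source converts tweets to Avro format and sends the Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error message "Avro block size is invalid or too large".

What is an Avro block, and what is the limit on block size? Can I change it? What does this message mean? Is the file at fault? Are some of the records at fault? If the Twitter source had hit bad data, it should have crashed. And if it managed to convert the tweets to Avro format, the Avro data should be readable in turn, right?

I also tried avro-tools-1.7.7.jar: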

java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232

{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}

{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`･ω･´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40

at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)

at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more

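Same problem. I googled a lot and found no answer at all.

If you have run into this problem too, can you give me a solution? Or, if you fully understand Avro or the Twitter source underneath, can you offer a clue?

This is a really interesting problem. Think about it.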

score 0 · Accepted Answer

Use the Cloudera TwitterSource.

Otherwise you will run into this problem:

Unable to correctly load twitter avro data into hive table

The article uses the Apache TwitterSource:

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.

It should be the Cloudera TwitterSource instead:

https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
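To show the changed line in context, here is a minimal sketch that writes the source section of an agent configuration to a file. The agent and source names (TwitterAgent, Twitter) come from the lines above. The channel name MemChannel, the output file name, the credential placeholders, and the keyword values are assumptions for illustration. The property names (consumerKey, consumerSecret, accessToken, accessTokenSecret, keywords) are as I recall them from the Cloudera example configuration, so check them against the repository. Also note that, as far as the linked Cloudera posts describe, this source emits tweets as raw JSON rather than Avro, so the Hive table would likely need a JSON SerDe instead of an Avro one.

```shell
#!/usr/bin/env bash
set -eu

# Write the Twitter source section of a Flume agent configuration to file $1.
write_twitter_source_conf() {
  # The quoted 'EOF' keeps the text literal (no shell expansion inside).
  cat > "$1" <<'EOF'
# Agent and source names follow the configuration lines quoted in this answer.
# MemChannel is an assumed channel name; use the channel your agent defines.
TwitterAgent.sources = Twitter
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
# Placeholders: fill in the credentials of your own Twitter application.
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
# Optional filter terms (example values only).
TwitterAgent.sources.Twitter.keywords = hadoop, big data
EOF
}

# Default output file name is hypothetical; pass a path as the first argument.
write_twitter_source_conf "${1:-twitter-source.conf}"
```

The channel and HDFS sink sections are left out on purpose, since only the source type differs from the tutorial's configuration.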

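Also, do not just download a prebuilt jar, because our Cloudera version is 5.4.2; otherwise you will get this error:

Cannot run Flume because of JAR conflict

You should compile it with Maven: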

https://github.com/cloudera/cdh-twitter-example

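Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.

Steps:

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip

sudo yum install apache-maven

mvn package

Put the jar into the Flume plugins directory:

/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

Note: run yum update to bring the system to the latest version first; otherwise the compile (mvn package) fails because of some security problem.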
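The steps above can be sketched as one script. This is a rough outline, not a tested installer: it assumes the zip unpacks to cdh-twitter-example-master/ and that the Flume source module sits in its flume-sources/ subdirectory with the jar built under target/. The unzip step, the mkdir step, and the RUN variable are additions for illustration. By default the script only prints each command (dry run); invoke it with RUN set to an empty string to execute them.

```shell
#!/usr/bin/env bash
set -eu

# Dry run by default: "echo" prints each command instead of running it.
# Call the script as `RUN= ./build.sh` (empty RUN) to execute for real.
RUN="${RUN-echo}"

# Assumed layout of the unpacked archive (not stated in the answer).
SRC_DIR="cdh-twitter-example-master/flume-sources"
JAR="flume-sources-1.0-SNAPSHOT.jar"
PLUGIN_LIB="/var/lib/flume-ng/plugins.d/twitter-streaming/lib"

build_and_install() {
  # Update first; the answer notes that mvn package fails on an outdated system.
  $RUN sudo yum -y update
  $RUN sudo yum -y install apache-maven
  # Fetch and unpack the example sources.
  $RUN wget -O master.zip https://github.com/cloudera/cdh-twitter-example/archive/master.zip
  $RUN unzip -o master.zip
  # Build against the local environment instead of using a prebuilt jar.
  $RUN mvn -f "$SRC_DIR/pom.xml" package
  # Install the jar into the Flume plugins directory.
  $RUN sudo mkdir -p "$PLUGIN_LIB"
  $RUN sudo cp "$SRC_DIR/target/$JAR" "$PLUGIN_LIB/"
}

build_and_install
```

Restart the Flume agent afterwards so it picks up the new jar.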
