检查以下链接。
http://zdatainc.com/2014/07/real-time-streaming-apache-storm-apache-kafka/
实现 Kafka 生产者 在这里,定义了用于测试我们集群的 Kafka 生产者代码的主要部分。在主类中,我们设置数据管道和线程:
LOGGER.debug("Setting up streams");
PipedInputStream send = new PipedInputStream(BUFFER_LEN);
PipedOutputStream input = new PipedOutputStream(send);
LOGGER.debug("Setting up connections");
LOGGER.debug("Setting up file reader");
BufferedFileReader reader = new BufferedFileReader(filename, input);
LOGGER.debug("Setting up kafka producer");
KafkaProducer kafkaProducer = new KafkaProducer(topic, send);
LOGGER.debug("Spinning up threads");
Thread source = new Thread(reader);
Thread kafka = new Thread(kafkaProducer);
source.start();
kafka.start();
LOGGER.debug("Joining");
kafka.join();
The BufferedFileReader in its own thread reads off the data from disk:
rd = new BufferedReader(new FileReader(this.fileToRead));
wd = new BufferedWriter(new OutputStreamWriter(this.outputStream, ENC));
int b = -1;
while ((b = rd.read()) != -1)
{
wd.write(b);
}
Finally, the KafkaProducer sends asynchronous messages to the Kafka Cluster:
rd = new BufferedReader(new InputStreamReader(this.inputStream, ENC));
String line = null;
producer = new Producer<Integer, String>(conf);
while ((line = rd.readLine()) != null)
{
producer.send(new KeyedMessage<Integer, String>(this.topic, line));
}
Doing these operations on separate threads gives us the benefit of having disk reads not block the Kafka producer or vice-versa, enabling maximum performance tunable by the size of the buffer.
Implementing the Storm Topology
Topology Definition
Moving onward to Storm, here we define the topology and how each bolt will be talking to each other:
TopologyBuilder topology = new TopologyBuilder();
topology.setSpout("kafka_spout", new KafkaSpout(kafkaConf), 4);
topology.setBolt("twitter_filter", new TwitterFilterBolt(), 4)
.shuffleGrouping("kafka_spout");
topology.setBolt("text_filter", new TextFilterBolt(), 4)
.shuffleGrouping("twitter_filter");
topology.setBolt("stemming", new StemmingBolt(), 4)
.shuffleGrouping("text_filter");
topology.setBolt("positive", new PositiveSentimentBolt(), 4)
.shuffleGrouping("stemming");
topology.setBolt("negative", new NegativeSentimentBolt(), 4)
.shuffleGrouping("stemming");
topology.setBolt("join", new JoinSentimentsBolt(), 4)
.fieldsGrouping("positive", new Fields("tweet_id"))
.fieldsGrouping("negative", new Fields("tweet_id"));
topology.setBolt("score", new SentimentScoringBolt(), 4)
.shuffleGrouping("join");
topology.setBolt("hdfs", new HDFSBolt(), 4)
.shuffleGrouping("score");
topology.setBolt("nodejs", new NodeNotifierBolt(), 4)
.shuffleGrouping("score");
值得注意的是,数据会被打乱到每个螺栓,直到连接时除外,因为将相同的推文提供给连接螺栓的同一实例非常重要。