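I'm using Confluent Platform 3.2.0 with Kafka 0.10.2.
I've had Confluent Control Center running for about a week, but it is now in an unstable state. The problem seems to stem from the Kafka Streams logic that Control Center uses internally.
I'm seeing a lot of errors and exceptions. I'm not sure which are relevant and which are noise, so I'll paste what I see.
When I bounce the brokers, I see many messages like this: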
[2017-04-04 17:20:04,387] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/var/lib/kafka/_confluent-controlcenter-3-1-0-1-error-topic-3/00000000000000000000.index) has non-zero size but the last offset is 0 which is no larger than the base offset 0.}. deleting /var/lib/kafka/_confluent-controlcenter-3-1-0-1-error-topic-3/00000000000000000000.timeindex, /var/lib/kafka/_confluent-controlcenter-3-1-0-1-error-topic-3/00000000000000000000.index and rebuilding index... (kafka.log.Log)
[2017-04-04 17:20:04,387] INFO Recovering unflushed segment 0 in log _confluent-controlcenter-3-1-0-1-error-topic-3. (kafka.log.Log)
[2017-04-04 17:20:04,388] INFO Completed load of log _confluent-controlcenter-3-1-0-1-error-topic-3 with 1 log segments and log end offset 0 in 2 ms (kafka.log.Log)
When I bounce Control Center, it starts to come up but then rebalances continuously:
[2017-04-04 17:22:20,607] INFO tocheck=[Store{name=KSTREAM-OUTEROTHER-0000000107-store, rollup=false}, Store{name=KSTREAM-OUTERTHIS-0000000106-store, rollup=false}, Store{name=Group, rollup=true}, Store{name=MonitoringStream, rollup=true}, Store{name=TriggerActionsStore, rollup=false}, Store{name=MonitoringMessageAggregatorWindows, rollup=true}, Store{name=MonitoringTriggerStore, rollup=false}, Store{name=aggregatedTopicPartitionTableWindows, rollup=true}, Store{name=MonitoringVerifierStore, rollup=false}, Store{name=TriggerEventsStore, rollup=false}, Store{name=AlertHistoryStore, rollup=false}, Store{name=MetricsAggregateStore, rollup=false}] (io.confluent.controlcenter.streams.KafkaStreamsManager:115)
[2017-04-04 17:22:20,607] INFO streams in state=REBALANCING (io.confluent.controlcenter.streams.KafkaStreamsManager:137)
It does this for a long time and then starts spitting out these exceptions, which I think may be the root cause:
[2017-04-04 17:29:02,732] WARN Could not create task 2_1. Will retry. (org.apache.kafka.streams.processor.internals.StreamThread:1184)
org.apache.kafka.streams.errors.LockException: task [2_1] Failed to lock the state directory: /var/lib/confluent/control-center/1/kafka-streams/_confluent-controlcenter-3-2-0-1/2_1
    at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:102)
    at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:73)
    at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:108)
    at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:834)
    at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:1207)
    at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:1180)
    at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:937)
    at org.apache.kafka.streams.processor.internals.StreamThread.access$500(StreamThread.java:69)
    at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:236)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:255)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:339)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:303)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1030)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:582)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:368)
Then it goes back into the rebalance loop. After a long time it appears to give up, and Control Center never finishes starting.
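One way to separate signal from noise here is to tally the lock failures by task ID, which shows whether a handful of tasks keep failing or whether every task is affected. The following is only a minimal sketch: the function name `count_lock_failures` is hypothetical, and the regular expression assumes the exact message wording shown in the `LockException` line above.

```python
import re
from collections import Counter

# Assumption: the failure message matches the wording in the log excerpt above,
# e.g. "task [2_1] Failed to lock the state directory: /var/lib/...".
LOCK_FAILURE = re.compile(r"task \[(\d+_\d+)\] Failed to lock the state directory")


def count_lock_failures(lines):
    """Return a Counter mapping task ID (such as "2_1") to its lock-failure count."""
    counts = Counter()
    for line in lines:
        match = LOCK_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to the function (a file object iterates line by line) and printing `most_common()` on the result lists the tasks that fail most often first.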
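I don't know what to try next. I tried pointing the Control Center data directory setting at a different directory, but the same lock exception still occurs. I'm certain only one Control Center instance is running. Nothing in particular stands out in the ZooKeeper logs.
Any debugging tips would be much appreciated.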