0

我正在尝试独立运行每个 cadence 服务,以便我可以轻松地扩展和扩展它们。我的团队正在使用 docker-swarm,我们使用 Portainer UI 管理所有内容。到目前为止,我已经能够将前端服务扩展为拥有两个副本,但是如果我对匹配服务做同样的事情,我将通过DecisionTaskTimedOut工作流执行获得很多收益。最终,执行将成功完成,但需要很长时间。有一个想法,两个匹配的服务副本需要 2 分钟,而只有一个只需要 7 秒。

这是一个测试环境。我正在使用 dockerized cassand db(由于某些预算限制,我们不能使用真正的 cassand db)也许这就是问题所在?Docker 映像配置有以下环境变量:

RINGPOP_BOOTSTRAP_MODE=dns
KEYSPACE=cadence
BIND_ON_IP=0.0.0.0
SKIP_SCHEMA_SETUP=false
VISIBILITY_KEYSPACE=cadence_visibility
CASSANDRA_HOSTNAME=soap_cassandra
RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
CADENCE_HOME=/etc/cadence
SERVICES=matching

您可以假设您在上面没有看到的任何其他环境变量的默认值

RINGPOP_SEEDS 是分配给每个 cadence 服务的服务名称,如果声明的副本超过 1 个,docker-swarm 将从它们中创建一个 DNS 条目以及负载均衡器。

匹配服务似乎正确启动,日志:

{"level":"info","ts":"2021-02-18T22:47:36.296Z","msg":"Created RPC dispatcher and listening","service":"cadence-matching","address":"0.0.0.0:7935","logging-call-at":"rpc.go:81"},
{"level":"warn","ts":"2021-02-18T22:47:36.321Z","msg":"Failed to fetch key from dynamic config","key":"system.advancedVisibilityWritingMode","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.336Z","msg":"Add new peers by DNS lookup","address":"0.0.0.0","addresses":"[0.0.0.0:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.321Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"0.0.0.0:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Starting service matching","logging-call-at":"server.go:217"},
{"level":"warn","ts":"2021-02-18T22:47:36.441Z","msg":"Failed to fetch key from dynamic config","key":"matching.throttledLogRPS","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"127.0.0.1:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.442Z","msg":"Add new peers by DNS lookup","address":"127.0.0.1","addresses":"[127.0.0.1:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.713Z","msg":"matching starting","service":"cadence-matching","logging-call-at":"service.go:90"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"RuntimeMetricsReporter started","service":"cadence-matching","logging-call-at":"runtime.go:169"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"PProf not started due to port not set","logging-call-at":"pprof.go:64"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[[::]:7935]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-worker","addresses":"[[::]:7939]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.800Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-frontend","addresses":"[[::]:7933]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"service started","service":"cadence-matching","logging-call-at":"resourceImpl.go:383"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"matching started","service":"cadence-matching","logging-call-at":"service.go:99"}

工作流执行时,我可以在日志中看到以下错误:

{"level":"error","ts":"2021-02-18T22:17:07.281Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:52:03.740Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 16, db rangeID: 17","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1500002,"next-number":1500002,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:10:10.971Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"store-operation":"create-task","error":"Failed to create task. TaskList: FeaTaskList, taskListType: 1, rangeID: 94, db rangeID: 95","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"number":9300001,"next-number":9300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:09:53.345Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:53:56.145Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 17, db rangeID: 18","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1600001,"next-number":1600001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}

我目前使用的docker镜像版本是:ubercadence/server:0.15.1

有没有办法解决这个问题?

4

1 回答 1

0

我最好的猜测是问题BIND_ON_IP=0.0.0.0。每个实例都应使用唯一的 hostIP:Port 作为其地址。因为它是 all 0.0.0.0,所以每个服务只有在使用一个实例运行时才能工作。因为多实例会有冲突。

但是,前端服务不是问题,因为 FE 是有状态的。匹配/历史会遇到这个问题---

HostA 使用 0.0.0.0:7935 将其注册到数学服务,然后 HostB 尝试执行相同操作。这将导致一致性哈希环不稳定。任务列表所有权不断在 HostA 和 HostB 之间切换。

要解决此问题,您需要让每个实例使用自己的 hostIP。就像在 K8s 中使用 PodIP。

解决此问题后,您将在 FE/history 中的日志中看到它们成功连接到两个匹配主机:

{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[HostA_IP:7935, HostB_IP:7935]","logging-call-at":"rpServiceResolver.go:246"},

请参阅 Cadence Helm 图表中的示例,了解我们如何为 K8s 执行此操作:https ://github.com/banzaicloud/banzai-charts/blob/87cf2946434c22cb963fea47b662ea85974ecfc0/cadence/templates/server-configmap.yaml#L82

于 2021-02-19T21:33:27.837 回答