
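I've run into a problem using tensorflow-data-validation with the direct runner to generate statistics from some large datasets (over 400GB). All workers appear to stop after the error message "Keepalive watchdog fired. Closing transport." This looks like a gRPC keepalive timeout.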

E0804 17:49:07.419950276   44806 chttp2_transport.cc:2881]   ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
2020-08-04 17:49:07  local_job_service.py : INFO  Worker: severity: ERROR timestamp {   seconds: 1596563347   nanos: 420487403 } message: "Python sdk harness failed: \nTraceback (most recent call last):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n    sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n    for work_request in self._control_stub.Control(get_responses()):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n    sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n    for work_request in self._control_stub.Control(get_responses()):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = 
StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location: "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161" thread: "MainThread"
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 248, in <module>
    main(sys.argv)
  File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 158, in main
    sdk_pipeline_options.view_as(ProfilingOptions))).run()
  File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 213, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "keepalive watchdog timeout"
        debug_error_string = "{"created":"@1596563347.420024732","description":"Error received from peer ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"
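To back up the keepalive reading, the `debug_error_string` from the traceback above can be decoded with the standard library alone. This is only a rough sketch for reading the log: the helper name `summarize_grpc_error` is made up for illustration, and it assumes the string is valid JSON once the outer quotes are stripped, which holds for the one pasted here.

```python
import json
from datetime import datetime, timezone

# debug_error_string copied from the traceback above, outer quotes removed.
DEBUG_ERROR_STRING = (
    '{"created":"@1596563347.420024732",'
    '"description":"Error received from peer ipv6:[::1]:40823",'
    '"file":"src/core/lib/surface/call.cc","file_line":1055,'
    '"grpc_message":"keepalive watchdog timeout","grpc_status":14}'
)


def summarize_grpc_error(debug_error_string):
    """Hypothetical helper: pull the useful fields out of a gRPC debug string."""
    payload = json.loads(debug_error_string)
    # "created" is a Unix timestamp prefixed with "@".
    created = float(payload["created"].lstrip("@"))
    return {
        "status": payload["grpc_status"],
        "message": payload["grpc_message"],
        # The peer address is the last whitespace-separated token.
        "peer": payload["description"].rsplit(" ", 1)[-1],
        "created_utc": datetime.fromtimestamp(created, tz=timezone.utc)
        .strftime("%Y-%m-%d %H:%M:%S"),
    }


if __name__ == "__main__":
    for key, value in summarize_grpc_error(DEBUG_ERROR_STRING).items():
        print(f"{key}: {value}")
```

Status 14 is gRPC's UNAVAILABLE code, the peer is a loopback address (`::1`), and the creation time lines up with the 17:49:07 log line, so the transport that was closed is a local connection between the worker and the job service rather than anything going over the network.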
