
I am running dask-scheduler on one node, and my dask-worker runs on another node. I submit tasks to the dask-scheduler from a third node.

It sometimes throws this error: distributed.utils - ERROR - Existing exports of data: object cannot be re-sized

I am using Python 2.7, Tornado 4.5.2, and TensorFlow 1.3.0.
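For reference, the tasks are submitted from the third node roughly like this (a minimal sketch; the scheduler address is a placeholder, and my_task here stands in for the function shown in the worker log below, which launches the TF-Slim training run):

from dask.distributed import Client

def my_task(params):
    # Placeholder body: in my code this runs train_image_classifier.py
    # with the hyperparameters passed in `params`.
    pass

client = Client("tcp://scheduler-host:8786")   # connect to the dask-scheduler node

params = {
    "script_name": "train_image_classifier.py",
    "model_name": "inception_v3",
    "dataset_name": "flowers",
    # ... remaining hyperparameters as shown in the worker log below
}
future = client.submit(my_task, params)        # executes on the dask-worker node
print(future.result())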

INFO:tensorflow:Restoring parameters from /home/mapr/mano/slim_data/flowers/model/inception/inception_v3.ckpt
distributed.utils - ERROR - Existing exports of data: object cannot be re-sized
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/variable.py", line 179, in _get
    client=self.client.id)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/core.py", line 464, in send_recv_from_rpc
    result = yield send_recv(comm=comm, op=key, **kwargs)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/core.py", line 348, in send_recv
    yield comm.write(msg)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/usr/lib/python2.7/site-packages/distributed/comm/tcp.py", line 218, in write
    future = stream.write(frame)
  File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 406, in write
    self._handle_write()
  File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized
distributed.worker - WARNING -  Compute Failed
Function:  my_task
args:      ({'upper': '1.4', 'trainable_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'checkpoint_path': '/home/mapr/mano/slim_data/flowers/model/inception/inception_v3.ckpt', 'log_every_n_steps': '1', 'dataset_split_name': 'train', 'learning_rate': '0.01', 'train_dir': '/home/mapr/mano/slim_data/flowers/train_dir/train_outs_19', 'clone_on_cpu': 'True', 'batch_size': '32', 'resize_method': '3', 'hue_max_delta': '0.3', 'lower': '0.6', 'trace_every_n_steps': '1', 'script_name': 'train_image_classifier.py', 'checkpoint_exclude_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'dataset_dir': '/home/mapr/mano/slim_data/flowers/slim_data_dir', 'max_number_of_steps': '4', 'model_name': 'inception_v3', 'dataset_name': 'flowers'})
kwargs:    {}
Exception: BufferError('Existing exports of data: object cannot be re-sized',)

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/mapr/mano/slim_data/flowers/train_dir/train_outs_19/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 2.6281 (19.799 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 2: loss = nan (7.406 sec/step)
INFO:tensorflow:global step 3: loss = nan (6.953 sec/step)
INFO:tensorflow:global step 4: loss = nan (6.840 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

I am fairly sure this is related to dask.
