
I am following the tutorial for building Kubeflow on GCP.

In the last step, I deployed the code to train on CPU:

kustomize build . | kubectl apply -f -

Distributed TensorFlow then fails with this error:

tensorflow.python.framework.errors_impl.NotFoundError: /tmp/tmprIn1Il/model.ckpt-1_temp_a890dac1971040119aba4921dd5f631a; No such file or directory
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv_layer1/conv2d/bias, conv_layer1/conv2d/kernel, conv_layer2/conv2d/bias, conv_layer2/conv2d/kernel, dense/bias, dense/kernel, dense_1/bias, dense_1/kernel, global_step)]]

I found a similar bug report, but I don't know how to resolve the problem.


1 Answer


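From the bug report:

You can work around this problem by using a shared filesystem (e.g. HDFS, GCS, or an NFS mount at the same mount point) on the workers and parameter servers.

Just put the data on GCS and it works.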
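The `/tmp/tmprIn1Il` path in the error suggests the checkpoint directory was a temporary directory local to one pod, which the parameter server writing the checkpoint cannot see. As a minimal sketch of guarding against that, the helper below rejects a checkpoint directory that does not use a remote scheme. The names `SHARED_SCHEMES`, `is_shared_path`, and `require_shared_model_dir`, and the `mnist/checkpoints` sub-path, are hypothetical and not part of the tutorial. An NFS mount looks like a local path, so this check cannot recognize it.

```python
from urllib.parse import urlparse

# Remote filesystem schemes that every worker and parameter server can reach.
# This set is an assumption for illustration; adjust it for your cluster.
SHARED_SCHEMES = {"gs", "hdfs"}


def is_shared_path(path):
    """Return True if `path` uses a remote scheme rather than a pod-local disk."""
    return urlparse(path).scheme in SHARED_SCHEMES


def require_shared_model_dir(model_dir):
    """Return `model_dir` unchanged, or raise if it points at local storage."""
    if not is_shared_path(model_dir):
        raise ValueError(
            "model_dir %r is local to one pod; use a shared location such as "
            "gs://<bucket>/<path>" % model_dir
        )
    return model_dir


if __name__ == "__main__":
    # Bucket name taken from the answer's model.py; the sub-path is made up.
    print(require_shared_model_dir("gs://kubeflow-tf-bucket/mnist/checkpoints"))
    # A temp directory like the one in the error message is rejected.
    print(is_shared_path("/tmp/tmprIn1Il"))
```

If the training code uses the Estimator API (an assumption; the question does not show it), the validated path would be passed as `model_dir` to `tf.estimator.RunConfig` or `tf.estimator.Estimator` so that checkpoints land on GCS.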

model.py

import tensorflow_datasets as tfds
import tensorflow as tf

# tfds works in both Eager and Graph modes
tf.enable_eager_execution()

# See available datasets
print(tfds.list_builders())

ds_train, ds_test = tfds.load(name="mnist", split=["train", "test"], data_dir="gs://kubeflow-tf-bucket", batch_size=-1)
ds_train = tfds.as_numpy(ds_train)
ds_test = tfds.as_numpy(ds_test)

(x_train, y_train) = ds_train['image'], ds_train['label']
(x_test, y_test) = ds_test['image'], ds_test['label']
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
answered 2019-06-02T09:23:43.490