tensorflow-datasets - tf.data.Dataset.zip(a, b) changes the order of elements if a is shuffled

Question

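I am preparing a dataset and training a model, then storing the model's outputs (for knowledge distillation).

To store them in TFRecords format, I need to use the .zip() function.

I reproduced the bug with the following code. My actual training files are several hundred lines long, so I have not included them here.

I am using TensorFlow 2.1 and Python 3.7 on Ubuntu 18.04.

The problem I can't solve is:

The data is shuffled (which is fine). But after zipping, the two elements of each tuple no longer correspond to each other (which is not fine).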

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])

# Prepare the dataset for training
batch_size = 2
ds = ds.cache().repeat().shuffle(buffer_size=5, reshuffle_each_iteration=True).batch(batch_size)

# Create the model. Here: an identity mapping
model = tf.keras.models.Sequential([tf.keras.layers.Lambda(lambda x: x, input_shape=(1,))])

# Train with model.fit() (omitted)

# Make predictions
pred = model.predict(ds, steps=5 // batch_size)

# Prepare for saving to TFRecords
ds = ds.unbatch()
ds = ds.take(5)
pred = tf.data.Dataset.from_tensor_slices(pred)
combined = tf.data.Dataset.zip((ds, pred))

# Show the unwanted behaviour
for a, c in combined:
    print(a, c)

The output of the snippet shows that the elements in each row do not match (e.g. row 1: 3 should map to 3).

tf.Tensor(3, shape=(), dtype=int32) tf.Tensor([4.], shape=(1,), dtype=float32)
tf.Tensor(1, shape=(), dtype=int32) tf.Tensor([1.], shape=(1,), dtype=float32)
tf.Tensor(4, shape=(), dtype=int32) tf.Tensor([1.], shape=(1,), dtype=float32)
tf.Tensor(3, shape=(), dtype=int32) tf.Tensor([2.], shape=(1,), dtype=float32)

score 0 · Accepted Answer

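TensorFlow shuffles along the first axis, so if your tensor has shape (x,), shuffling changes the order of its elements. Here is a test in which the two datasets are zipped first and shuffled afterwards:

import tensorflow as tf

a = tf.data.Dataset.from_tensor_slices(tf.constant([[x] for x in range(10)]))
b = tf.data.Dataset.from_tensor_slices(tf.constant([[x] for x in range(10)]))
# Zip first, then shuffle, so each (a, b) pair is shuffled as one unit
c = tf.data.Dataset.zip((a, b)).shuffle(10)
for i, j in c.batch(1):
    print(i.numpy(), j.numpy())

The output is: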

[[3]] [[3]]
[[6]] [[6]]
[[5]] [[5]]
[[8]] [[8]]
[[7]] [[7]]
[[1]] [[1]]
[[2]] [[2]]
[[0]] [[0]]
[[9]] [[9]]
[[4]] [[4]]

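As you can see, the pairing is preserved: each row still holds matching values, while the items along the first axis have been shuffled.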

score 0 · Accepted Answer

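TensorFlow applies the shuffle again on every iteration over the dataset. Zipping and reading the result is one such iteration, which is why the order seen by model.predict does not match the order seen through zip (the data was shuffled both times).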
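
To make that concrete, here is a minimal sketch (not taken from the question's code) that iterates over the same shuffled dataset twice. The dataset size of 100 and the names first_pass and second_pass are illustrative assumptions. Both passes contain the same elements, but in general they arrive in a different order:

```python
import tensorflow as tf

# A shuffled dataset; reshuffle_each_iteration=True is also the default.
ds = tf.data.Dataset.range(100).shuffle(buffer_size=100, reshuffle_each_iteration=True)

# Each `for` loop (or list comprehension) starts a new iteration,
# and every new iteration triggers a fresh shuffle.
first_pass = [int(x.numpy()) for x in ds]
second_pass = [int(x.numpy()) for x in ds]

print(first_pass[:5])
print(second_pass[:5])

# Same elements in both passes, but the order is not tied together.
print(sorted(first_pass) == sorted(second_pass))
```

This is the situation in the question: model.predict consumes one pass, and the zipped dataset later consumes another.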

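In any case, you do not actually need to shuffle the dataset in order to make predictions. A prediction should not depend on what the model saw in earlier predictions.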
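
One possible way to apply this, sketched under assumptions: keep an unshuffled copy of the dataset for prediction, so that the inputs and the predictions come out in the same deterministic order and can be zipped safely. The doubling Lambda model and the names features, ds and combined are placeholders for illustration, not the asker's real training code:

```python
import tensorflow as tf

# Illustrative inputs with an explicit feature axis: shape (5, 1).
features = tf.constant([[1.0], [2.0], [3.0], [4.0], [5.0]])
ds = tf.data.Dataset.from_tensor_slices(features)  # note: no shuffle here

# Stand-in model that doubles its input, so mismatches would be easy to spot.
inputs = tf.keras.Input(shape=(1,))
outputs = tf.keras.layers.Lambda(lambda x: x * 2.0)(inputs)
model = tf.keras.Model(inputs, outputs)

# Predict on the unshuffled data; the order is the same on every pass.
pred = model.predict(ds.batch(2), verbose=0)

# Zip the unshuffled inputs with their predictions.
combined = tf.data.Dataset.zip((ds, tf.data.Dataset.from_tensor_slices(pred)))

for x, y in combined:
    print(x.numpy(), y.numpy())
```

A separately shuffled pipeline can still be used for model.fit(); only the pipeline that feeds model.predict and zip needs a stable order.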
