python-3.x - Getting the length of a dataset in TensorFlow

Question

source_dataset = tf.data.TextLineDataset('primary.csv')
target_dataset = tf.data.TextLineDataset('secondary.csv')
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
dataset = dataset.shard(10000, 0)
dataset = dataset.map(lambda source, target: (tf.string_to_number(tf.string_split([source], delimiter=',').values, tf.int32),
                                              tf.string_to_number(tf.string_split([target], delimiter=',').values, tf.int32)))
dataset = dataset.map(lambda source, target: (source, tf.concat(([start_token], target), axis=0), tf.concat((target, [end_token]), axis=0)))
dataset = dataset.map(lambda source, target_in, target_out: (source, tf.size(source), target_in, target_out, tf.size(target_in)))

dataset = dataset.shuffle(NUM_SAMPLES)  # This is the important line of code

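I want to fully shuffle my entire dataset, but shuffle() requires the number of samples to pull (its buffer size), and tf.size() does not work on a tf.data.Dataset.

How can I shuffle properly?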

score 2 · Accepted Answer

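I was working with tf.data.FixedLengthRecordDataset() and ran into a similar problem. In my case, I was trying to take only a certain percentage of the raw data. Since I knew that all the records have a fixed length, my workaround was: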

import os
import tensorflow as tf
# filepath, filenames, percentage, bytesPerRecord and recordBytes are defined elsewhere
totalBytes = sum([os.path.getsize(os.path.join(filepath, filename)) for filename in os.listdir(filepath)])
numRecordsToTake = tf.cast(0.01 * percentage * totalBytes / bytesPerRecord, tf.int64)
dataset = tf.data.FixedLengthRecordDataset(filenames, recordBytes).take(numRecordsToTake)

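In your case, my suggestion would be to count the records in 'primary.csv' and 'secondary.csv' directly in Python. Alternatively, I think that for your purpose, setting the buffer_size argument does not actually require counting the files. According to the accepted answer about the meaning of buffer_size, a number greater than the number of elements in the dataset ensures a uniform shuffle across the whole dataset. So simply passing a very large number (one you expect to exceed the dataset size) should work.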
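Counting the records in plain Python could look like the minimal sketch below. The helper name count_records and the temporary demo file are assumptions made for illustration; the sketch also assumes one record per line, no header row, and no quoted fields containing newlines.

```python
import os
import tempfile


def count_records(path):
    """Count the lines in a text file, assuming one record per line."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demo on a small temporary CSV file (a stand-in for 'primary.csv').
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3\n4,5\n6\n")

n_records = count_records(tmp.name)
os.remove(tmp.name)
```

The resulting count could then be passed as the buffer_size argument of dataset.shuffle(); any value at least as large as the number of elements has the same effect.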

score 0 · Accepted Answer

As of TensorFlow 2, the length of a dataset can be easily retrieved via the cardinality() function.

import tensorflow as tf

dataset = tf.data.Dataset.range(42)
# both evaluate to 42
dataset_length_v1 = tf.data.experimental.cardinality(dataset).numpy()
dataset_length_v2 = dataset.cardinality().numpy()

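Note: when using predicates (e.g. filter), the returned length may be -2. An explanation can be found here; otherwise, just read the following paragraph:

If you use a filter predicate, cardinality() may return the value -2 (tf.data.UNKNOWN_CARDINALITY), meaning the length is unknown. If you do use filter predicates on your dataset, make sure you have calculated the length of the dataset in another way (for example, the length of the pandas dataframe before applying .from_tensor_slices() to it).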
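When no other source for the length is available, one possible fallback is to count the elements with a single pass over the dataset. The sketch below is an illustration under assumptions, not part of the original answer: the helper name dataset_length is hypothetical, and it assumes a TensorFlow 2 version in which Dataset.cardinality(), tf.data.UNKNOWN_CARDINALITY and tf.data.INFINITE_CARDINALITY are available.

```python
import tensorflow as tf


def dataset_length(dataset):
    """Return the number of elements, counting by iteration if it is unknown."""
    n = int(dataset.cardinality().numpy())
    if n == tf.data.INFINITE_CARDINALITY:
        raise ValueError("dataset is infinite, so it has no length")
    if n == tf.data.UNKNOWN_CARDINALITY:
        # Fall back to one full pass over the data, adding 1 per element.
        n = int(
            dataset.reduce(
                tf.constant(0, dtype=tf.int64), lambda count, _: count + 1
            ).numpy()
        )
    return n


# filter() makes the cardinality unknown, so the fallback path is used here.
filtered = tf.data.Dataset.range(42).filter(lambda x: x % 2 == 0)
filtered_cardinality = int(filtered.cardinality().numpy())
filtered_length = dataset_length(filtered)
```

The full pass reads every element once, so for large file-backed datasets it can be as expensive as an epoch of training.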
