python - Tensorflow Estimator.predict_scores does not produce the correct number of predictions when using the Dataset API in the input function

Question

I am using TensorFlow 1.5 and I am puzzled by a strange behavior that I cannot explain.
I have made a minimal example:

import tensorflow as tf
import numpy as np


def input_function(x, y, batch_size=128, shuffle=True, n_epochs=None):
    data_set = tf.data.Dataset.from_tensor_slices({"x": x, "y": y})
    if shuffle:
        data_set = data_set.shuffle(buffer_size=1024, seed=None, reshuffle_each_iteration=True)
    data_set = data_set.batch(batch_size)
    data_set = data_set.repeat(n_epochs)
    iterator = data_set.make_one_shot_iterator()
    example = iterator.get_next()
    return {"features": example["x"]}, example["y"]


def main():
    n_samples = 256
    n_features = 16
    n_labels = 1

    x = np.random.rand(n_samples, n_features).astype(np.float32)
    y = np.random.rand(n_samples, n_labels).astype(np.float32)

    feature_column = tf.contrib.layers.real_valued_column(column_name='features', dimension=n_features)
    estimator = tf.contrib.learn.DNNRegressor([10], [feature_column], optimizer=tf.train.AdamOptimizer())

    estimator.fit(input_fn=lambda: input_function(x, y, batch_size=128, shuffle=True, n_epochs=32))
    pred = estimator.predict_scores(input_fn=lambda: input_function(x, y, batch_size=16, shuffle=False, n_epochs=1))
    print("len(pred) = {} (should be {})".format(len(list(pred)), n_samples))


if __name__ == '__main__':
    main()

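In this example, the call to "fit" seems to work fine (although I am not sure), but the call to "predict_scores" produces only batch_size (=16) predictions instead of n_samples (=256). What am I doing wrong?
The problem disappears if I use tf.estimator.inputs.numpy_input_fn, although eventually I will have to use an input function that reads large amounts of training data from tfrecord files with TFRecordDataset, similar to what is shown here: https://www.tensorflow.org/programmers_guide/datasets#using_high-level_apis
Any help would be much appreciated.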

score 0 · Accepted Answer

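This is a bug in the tf.contrib.learn.Estimator class, which incorrectly assumes that the input is constant and reads only a single batch, rather than running the input function repeatedly to fetch all of the data. The tf.contrib.learn.Estimator and tf.contrib.learn.DNNRegressor classes have been deprecated and are scheduled for removal, so the bug is unlikely to be fixed.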
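The difference between reading one batch and draining the whole input can be illustrated without TensorFlow. This is only a pure-Python sketch of the behavior described above, not the library's actual code; `batches`, `predict_single_batch`, and `predict_all_batches` are hypothetical names made up for the illustration.

```python
def batches(samples, batch_size):
    """Yield consecutive batches, mimicking Dataset.batch() over one epoch."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def predict_single_batch(samples, batch_size):
    """Model of the buggy behavior: only the first batch is consumed."""
    first_batch = next(batches(samples, batch_size), [])
    return list(first_batch)


def predict_all_batches(samples, batch_size):
    """Model of the expected behavior: drain the iterator until exhausted."""
    predictions = []
    for batch in batches(samples, batch_size):
        predictions.extend(batch)
    return predictions


samples = list(range(256))
print(len(predict_single_batch(samples, 16)))  # 16
print(len(predict_all_batches(samples, 16)))   # 256
```

With 256 samples and a batch size of 16, the first strategy returns 16 items, which matches the count reported in the question, while the second returns all 256.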

However, the tf.estimator.DNNRegressor class has been fixed to work with tf.data, and you can modify your code to use it as follows:

def main():
    n_samples = 256
    n_features = 16
    n_labels = 1

    x = np.random.rand(n_samples, n_features).astype(np.float32)
    y = np.random.rand(n_samples, n_labels).astype(np.float32)

    feature_column = tf.contrib.layers.real_valued_column(
        column_name='features', dimension=n_features)

    # Use the `tf.estimator.DNNRegressor` constructor instead of
    # `tf.contrib.learn.DNNRegressor`.
    estimator = tf.estimator.DNNRegressor(
        [10], [feature_column], optimizer=tf.train.AdamOptimizer())

    # Replace `estimator.fit()` with `estimator.train()`.
    estimator.train(input_fn=lambda: input_function(
        x, y, batch_size=128, shuffle=True, n_epochs=32))

    # Replace `estimator.predict_scores()` with `estimator.predict()`.
    pred = estimator.predict(input_fn=lambda: input_function(
        x, y, batch_size=16, shuffle=False, n_epochs=1))

    print("len(pred) = {} (should be {})".format(len(list(pred)), n_samples))
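One difference to be aware of after the switch: `estimator.predict()` yields one dictionary per example rather than a bare score, and for the canned regressors the value is, as far as I know, stored under the key "predictions". Below is a minimal sketch of collecting the scores into a flat list. `fake_predict` is a hypothetical stand-in for the real generator so that the snippet runs without TensorFlow, and `collect_scores` is an illustrative helper, not part of any library.

```python
def fake_predict(n_samples):
    """Hypothetical stand-in for estimator.predict(): one dict per example."""
    for i in range(n_samples):
        yield {"predictions": [float(i)]}


def collect_scores(prediction_iter, key="predictions"):
    """Flatten the per-example dicts into a list of scalar scores.

    Assumes each value under `key` is a length-1 sequence (n_labels = 1).
    """
    return [float(p[key][0]) for p in prediction_iter]


scores = collect_scores(fake_predict(256))
print(len(scores))  # 256
```

In the real program you would pass `estimator.predict(input_fn=...)` to `collect_scores` in place of `fake_predict(256)`.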
