
I am able to train my model and use ML Engine for prediction, but my results don't include any identifying information. This works fine when submitting one row at a time for prediction, but when submitting multiple rows I have no way of connecting the predictions back to the original input data. The GCP documentation discusses using instance keys, but I can't find any example code that trains and predicts using an instance key. Taking the GCP census example, how would I update the input functions to pass a unique ID through the graph, ignore it during training, and yet return the unique ID with predictions? Alternatively, if anyone knows of a different example that already uses keys, that would help as well.

From Census Estimator Sample

def serving_input_fn():
    # One placeholder per raw input feature, fed one value per instance.
    feature_placeholders = {
        column.name: tf.placeholder(column.dtype, [None])
        for column in INPUT_COLUMNS
    }

    # The model expects rank-2 tensors, so expand each input to [None, 1].
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }

    return input_fn_utils.InputFnOps(
        features,
        None,  # no labels at serving time
        feature_placeholders
    )


def generate_input_fn(filenames,
                      num_epochs=None,
                      shuffle=True,
                      skip_header_lines=0,
                      batch_size=40):

    def _input_fn():
        files = tf.concat([
          tf.train.match_filenames_once(filename)
          for filename in filenames
        ], axis=0)

        filename_queue = tf.train.string_input_producer(
          files, num_epochs=num_epochs, shuffle=shuffle)
        reader = tf.TextLineReader(skip_header_lines=skip_header_lines)

        _, rows = reader.read_up_to(filename_queue, num_records=batch_size)

        row_columns = tf.expand_dims(rows, -1)
        columns = tf.decode_csv(row_columns, record_defaults=CSV_COLUMN_DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))

        # Remove unused columns
        for col in UNUSED_COLUMNS:
            features.pop(col)

        if shuffle:
            features = tf.train.shuffle_batch(
                features,
                batch_size,
                capacity=batch_size * 10,
                min_after_dequeue=batch_size * 2 + 1,
                num_threads=multiprocessing.cpu_count(),
                enqueue_many=True,
                allow_smaller_final_batch=True
            )
        label_tensor = parse_label_column(features.pop(LABEL_COLUMN))
        return features, label_tensor

    return _input_fn

Update: I was able to use the suggested code from the answer below; I just needed to alter it slightly to update the output alternatives in the model_fn_ops instead of just the prediction dict. However, this only works if my serving input function is coded for JSON inputs similar to this. My serving input function was previously modeled after the CSV serving input function in the Census Core Sample.

I think my problem comes from the build_standardized_signature_def function, and even more so from the is_classification_problem function it calls. With the CSV serving function the input dict length is 1, so this logic ends up using classification_signature_def, which only displays the scores (which, it turns out, are actually the probabilities). With the JSON serving input function the input dict length is greater than 1, so predict_signature_def is used instead, which includes all of the outputs.
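
For reference, here is a minimal sketch of a CSV serving input function with a key added, assuming the census sample's parse_csv helper and LABEL_COLUMN constant (the function name and the int64 key dtype are assumptions). With two entries in the receiver dict, is_classification_problem is false and predict_signature_def gets used:

def csv_serving_input_fn_with_key():
    # One serialized CSV row per instance, plus a caller-supplied key.
    csv_row = tf.placeholder(shape=[None], dtype=tf.string)
    key = tf.placeholder(shape=[None], dtype=tf.int64)

    features = parse_csv(csv_row)  # parse_csv as defined in the census sample
    features.pop(LABEL_COLUMN)
    features['key'] = key  # popped again by the wrapping model_fn shown below

    # Two receiver tensors, so the predict signature is exported.
    return input_fn_utils.InputFnOps(
        features,
        None,
        {'csv_row': csv_row, 'key': key}
    )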


2 Answers


Update: In version 1.3, the contrib estimators (e.g. tf.contrib.learn.DNNClassifier) were changed to inherit from the core estimator class tf.estimator.Estimator, which, unlike its predecessor, hides the model function as a private class member, so in the solution below you'll need to replace estimator.model_fn with estimator._model_fn.

Josh's answer points you to the Flowers sample, which is a good solution if you want to use a custom estimator. If you'd rather stick with a canned estimator (e.g. tf.contrib.learn.DNNClassifier), you can wrap it in a custom estimator that adds support for keys. (Note: I think it's likely that canned estimators will get key support when they move into core.)

KEY = 'key'

def key_model_fn_gen(estimator):
    def _model_fn(features, labels, mode, params):
        # Pop the key so the wrapped model never sees it.
        key = features.pop(KEY, None)
        # On TF 1.3+, substitute estimator._model_fn here (see the update above).
        model_fn_ops = estimator.model_fn(
            features=features, labels=labels, mode=mode, params=params)
        # `key` is a Tensor (or None), so compare against None rather than
        # relying on truthiness, which raises a TypeError for tensors.
        if key is not None:
            model_fn_ops.predictions[KEY] = key
            # This line makes it so the exported SavedModel will also return the key.
            model_fn_ops.output_alternatives[None][1][KEY] = key
        return model_fn_ops
    return _model_fn

my_key_estimator = tf.contrib.learn.Estimator(
    model_fn=key_model_fn_gen(
        tf.contrib.learn.DNNClassifier(model_dir=model_dir...)
    ),
    model_dir=model_dir
)

my_key_estimator can then be used exactly as your DNNClassifier would be, except that it will expect a feature named 'key' from your input_fns (prediction, evaluation, and training).
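
For training and evaluation, one way to supply that feature is to wrap the census generate_input_fn above; a minimal sketch, where the function name and the tf.range row ids are assumptions (since _model_fn pops the key before delegating, its value is ignored during training anyway):

def input_fn_with_key():
    # Reuse the census input fn, then attach a per-row key feature.
    features, label = generate_input_fn(filenames)()
    # Dummy within-batch ids; any int64 tensor shaped [batch_size] works here.
    features[KEY] = tf.cast(tf.range(tf.shape(label)[0]), tf.int64)
    return features, label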

EDIT2: You will also need to add the corresponding input tensor to the prediction input function of your choice. For example, a new JSON serving input fn would look like:

def json_serving_input_fn():
  inputs = # ... input_dict as before
  inputs[KEY] = tf.placeholder(dtype=tf.int64, shape=[None])
  features = # ... feature dict made from input_dict as before
  return tf.contrib.learn.InputFnOps(features, None, inputs)

(There are slight differences between 1.2 and 1.3: tf.contrib.learn.InputFnOps is replaced by tf.estimator.export.ServingInputReceiver, and padding tensors to rank 2 is no longer necessary in 1.3.)
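
A minimal sketch of the 1.3 equivalent, assuming the same INPUT_COLUMNS and KEY names as above (the function name is an assumption):

def json_serving_input_receiver_fn():
    # One placeholder per raw input feature, plus the pass-through key.
    inputs = {
        column.name: tf.placeholder(column.dtype, [None])
        for column in INPUT_COLUMNS
    }
    inputs[KEY] = tf.placeholder(tf.int64, [None])
    # In 1.3, features no longer need to be expanded to rank 2.
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)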

ML Engine will then send a tensor named 'key' with your prediction request, which will be passed through to your model and out with your predictions.

EDIT3: Modified key_model_fn_gen to tolerate missing key values. EDIT4: Added the key to predictions.

answered 2017-06-08T18:44:22.067

Good question. The Cloud ML Engine flowers sample does this by using a tf.identity op to pass a string straight from input to output. Here are the relevant lines during graph construction:

keys_placeholder = tf.placeholder(tf.string, shape=[None])
inputs = {
    'key': keys_placeholder,
    'image_bytes': tensors.input_jpeg
}

# To extract the id, we need to add the identity function.
keys = tf.identity(keys_placeholder)
outputs = {
    'key': keys,
    'prediction': tensors.predictions[0],
    'scores': tensors.predictions[1]
}

For batch prediction, you need to insert 'key': 'some_key_value' into each instance record. For online prediction, you would query the graph above with a JSON request like:

{"instances": [
    {"key": "first_key", "image_bytes": {"b64": ...}},
    {"key": "second_key", "image_bytes": {"b64": ...}}
]}
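
For batch prediction the input file is newline-delimited JSON instead, one instance object per line (b64 payloads elided as above):

{"key": "first_key", "image_bytes": {"b64": ...}}
{"key": "second_key", "image_bytes": {"b64": ...}}
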
answered 2017-06-06T15:02:36.050