python - TensorFlow - `keys` or `default_value` doesn't match the table data types

Question

(Complete beginner to Python, machine learning, and TensorFlow)

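I am trying to adapt the TensorFlow linear model tutorial from the official documentation to the abalone dataset from the UCI Machine Learning Repository. The goal is to predict an abalone's rings (age) from the other given data.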

When I run the program below, I get the following:

File "/home/lawrence/tensorflow3.5/lib/python3.5/site-packages/tensorflow/python/ops/lookup_ops.py", line 220, in lookup
(self._key_dtype, keys.dtype))
TypeError: Signature mismatch. Keys must be dtype <dtype: 'string'>, got <dtype: 'int32'>.

The error is raised in lookup_ops.py at line 220, and is documented as being raised in the following case:

    Raises:
      TypeError: when `keys` or `default_value` doesn't match the table data types.

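From debugging parse_csv(), it appears that all tensors are created with the correct types.

Can you explain what is going wrong? I believe I am following the tutorial's code logic and cannot figure this out.

Source code: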

import tensorflow as tf
import shutil

_CSV_COLUMNS = [
    'sex', 'length', 'diameter', 'height', 'whole_weight',
    'shucked_weight', 'viscera_weight', 'shell_weight', 'rings'
]

_CSV_COLUMN_DEFAULTS = [['M'], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0]]

_NUM_EXAMPLES = {
    'train': 3000,
    'validation': 1177,
}

def build_model_columns():
  """Builds a set of wide feature columns."""
  # Categorical column (sex), followed by the continuous columns
  sex = tf.feature_column.categorical_column_with_hash_bucket('sex', hash_bucket_size=1000)
  length = tf.feature_column.numeric_column('length', dtype=tf.float32)
  diameter = tf.feature_column.numeric_column('diameter', dtype=tf.float32)
  height = tf.feature_column.numeric_column('height', dtype=tf.float32)
  whole_weight = tf.feature_column.numeric_column('whole_weight', dtype=tf.float32)
  shucked_weight = tf.feature_column.numeric_column('shucked_weight', dtype=tf.float32)
  viscera_weight = tf.feature_column.numeric_column('viscera_weight', dtype=tf.float32)
  shell_weight = tf.feature_column.numeric_column('shell_weight', dtype=tf.float32)

  base_columns = [sex, length, diameter, height, whole_weight,
                  shucked_weight, viscera_weight, shell_weight]

  return base_columns

def build_estimator():
  """Build an estimator appropriate for the given model type."""
  base_columns = build_model_columns()

  return tf.estimator.LinearClassifier(
      model_dir="~/models/albones/",
      feature_columns=base_columns,
      label_vocabulary=_CSV_COLUMNS)


def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have either run data_download.py or '
      'set both arguments --train_data and --test_data.' % data_file)

  def parse_csv(value):
    print('Parsing', data_file)
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('rings')

    return features, labels

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)

  iterator = dataset.make_one_shot_iterator()
  features, labels = iterator.get_next()

  return features, labels

def main(unused_argv):
  # Clean up the model directory if present
  shutil.rmtree("/home/lawrence/models/albones/", ignore_errors=True)
  model = build_estimator()

  # Train and evaluate the model every `FLAGS.epochs_per_eval` epochs.
  for n in range(40 // 2):
    model.train(input_fn=lambda: input_fn(
        "/home/lawrence/abalone.data", 2, True, 40))

    results = model.evaluate(input_fn=lambda: input_fn(
        "/home/lawrence/abalone.data", 1, False, 40))

    # Display evaluation metrics
    print('Results at epoch', (n + 1) * 2)
    print('-' * 60)

    for key in sorted(results):
      print('%s: %s' % (key, results[key]))


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    tf.app.run(main=main)

Here is the breakdown of the dataset columns, from abalone.names:

Name            Data Type   Meas.   Description
----            ---------   -----   -----------
Sex             nominal             M, F, [or] I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years

The dataset entries appear in this order as comma-separated values, with a new line for each new entry.
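To make the row layout concrete, here is a minimal plain-Python sketch (no TensorFlow) that splits one line in the abalone.data layout into named, typed fields, roughly what parse_csv does with tf.decode_csv. The helper name parse_line and the sample row are illustrative assumptions; the column names are the ones in _CSV_COLUMNS.

```python
import csv

CSV_COLUMNS = [
    'sex', 'length', 'diameter', 'height', 'whole_weight',
    'shucked_weight', 'viscera_weight', 'shell_weight', 'rings'
]


def parse_line(line):
    """Split one comma-separated row into a features dict and an integer label."""
    values = next(csv.reader([line]))
    record = dict(zip(CSV_COLUMNS, values))
    # 'sex' stays a string, 'rings' is an integer, everything else is a float.
    features = {name: (value if name == 'sex' else float(value))
                for name, value in record.items() if name != 'rings'}
    label = int(record['rings'])
    return features, label


# Illustrative row in the abalone.data layout (hypothetical sample values).
features, label = parse_line('M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15')
print(features['sex'], features['length'], label)
# Per abalone.names, rings + 1.5 gives the age in years.
print('approximate age in years:', label + 1.5)
```

Note that the label comes out as an integer here, which matches the int32 dtype reported in the error message.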

score 1 · Accepted Answer

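You did almost everything right. The problem is in the definition of the estimator.

The task is to predict the Rings column, which is an integer, so it looks like a regression problem. But you decided to treat it as a classification task, which is also valid: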

def build_estimator():
  """Build an estimator appropriate for the given model type."""
  base_columns = build_model_columns()

  return tf.estimator.LinearClassifier(
      model_dir="~/models/albones/",
      feature_columns=base_columns,
      label_vocabulary=_CSV_COLUMNS)

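By default, tf.estimator.LinearClassifier assumes binary classification, i.e. n_classes=2. In your case, that is clearly not true, which is the first mistake. You also set label_vocabulary, which TensorFlow interprets as the set of possible values in the label column. That is why it expects the tf.string dtype. Since Rings is an integer, you don't need label_vocabulary at all.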
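The dtype check behind the error can be imitated in plain Python. The sketch below is a toy analogy, not TensorFlow's implementation: the class name ToyVocabularyTable is hypothetical, and it only mirrors the idea that a table built from string keys rejects integer keys.

```python
class ToyVocabularyTable:
    """Toy stand-in for a string-keyed lookup table (not the TensorFlow class)."""

    def __init__(self, vocabulary):
        # Map each vocabulary entry to its index, as a label vocabulary would.
        self._index = {word: i for i, word in enumerate(vocabulary)}
        self._key_type = str

    def lookup(self, keys):
        for key in keys:
            if not isinstance(key, self._key_type):
                raise TypeError(
                    'Signature mismatch. Keys must be %s, got %s.'
                    % (self._key_type.__name__, type(key).__name__))
        # Unknown keys map to -1 in this toy version.
        return [self._index.get(key, -1) for key in keys]


table = ToyVocabularyTable(['sex', 'length', 'rings'])
print(table.lookup(['rings', 'sex']))   # string keys are accepted

try:
    table.lookup([15, 7, 9])            # integer labels, like the Rings column
except TypeError as err:
    print('TypeError:', err)
```

Passing the column names as the vocabulary creates exactly this situation: the table is keyed by strings, while the labels fed to it are integers.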

Putting it all together:

def build_estimator():
  """Build an estimator appropriate for the given model type."""
  base_columns = build_model_columns()

  return tf.estimator.LinearClassifier(
    model_dir="~/models/albones/",
    feature_columns=base_columns,
    n_classes=30)
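How large n_classes needs to be depends on the labels actually present. With integer labels, the classifier expects class ids in the range [0, n_classes), so one way to choose the value is to take the largest Rings value plus one. A minimal sketch, assuming the label is the last comma-separated field of each row; the helper name min_n_classes and the sample rows are hypothetical:

```python
def min_n_classes(lines):
    """Return the smallest n_classes that covers every integer label in `lines`."""
    # The label (rings) is assumed to be the last comma-separated field.
    labels = [int(line.rsplit(',', 1)[1]) for line in lines if line.strip()]
    if min(labels) < 0:
        raise ValueError('class ids must be non-negative')
    return max(labels) + 1


# Illustrative rows in the abalone.data layout (hypothetical sample values).
sample_rows = [
    'M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15',
    'M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7',
    'F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9',
]
print(min_n_classes(sample_rows))
```

On the real file, the open file object for abalone.data can be passed in place of the sample list, since iterating over it yields one row per line.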

I suggest you also try tf.estimator.LinearRegressor, which will probably be more accurate.
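For reference, a regression variant of build_estimator might look like the sketch below. It assumes TensorFlow 1.x, where tf.estimator.LinearRegressor is available, and takes the feature columns and model directory as arguments; the function name build_regressor is hypothetical.

```python
def build_regressor(feature_columns, model_dir):
    """Build a linear regressor that predicts Rings as a continuous value."""
    # Imported lazily so this sketch can be loaded without TensorFlow installed.
    import tensorflow as tf  # assumption: TensorFlow 1.x

    # No n_classes or label_vocabulary: a regressor predicts a single number.
    return tf.estimator.LinearRegressor(
        feature_columns=feature_columns,
        model_dir=model_dir)
```

It would be called as build_regressor(build_model_columns(), model_dir) in place of build_estimator(). Evaluation then reports loss-based metrics rather than accuracy.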
