python - Tensorflow - batching issues

Question

I'm very new to TensorFlow, and I'm trying to train from my CSV file using batching.

Here is my code for reading the CSV file and building the batches:

filename_queue = tf.train.string_input_producer(
    ['BCHARTS-BITSTAMPUSD.csv'], shuffle=False, name='filename_queue')

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[0.], [0.], [0.], [0.], [0.],[0.],[0.],[0.]]
xy = tf.decode_csv(value, record_defaults=record_defaults)

# collect batches of csv in
train_x_batch, train_y_batch = \
    tf.train.batch([xy[0:-1], xy[-1:]], batch_size=100)

Here is the training code:

# initialize
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Start populating the filename queue.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)


# train my model
for epoch in range(training_epochs):
    avg_cost = 0
    total_batch = int(2193 / batch_size)

    for i in range(total_batch):
        batch_xs, batch_ys = sess.run([train_x_batch, train_y_batch])
        feed_dict = {X: batch_xs, Y: batch_ys}
        c, _ = sess.run([cost, optimizer], feed_dict=feed_dict)
        avg_cost += c / total_batch

    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))

coord.request_stop()
coord.join(threads)

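Here are my questions:

1.

My CSV file has 2193 records and my batch size is 100. What I want is this: in every epoch, start from the first record and train on 21 batches of 100 records, then one final batch of 93 records, for 22 batches in total.

However, I found that every batch has size 100, even the last one. Moreover, from the second epoch onward, it does not start from the first record.

2.

How can I obtain the number of records (2193 in this case)? Should I hard-code it, or is there a smarter way? I tried tensor.get_shape().as_list(), but it does not work for batch_xs. It just returns an empty shape [].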

score 1 · Accepted Answer

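We recently added a new API to TensorFlow called tf.contrib.data that makes it easier to solve problems like this. (The "queue runner"-based APIs make it difficult to write computations on exact epoch boundaries, because the epoch boundary gets lost.)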
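To make the lost epoch boundary concrete, here is a toy model in plain Python. It is not TensorFlow's queue implementation; `queue_style_batches` is a hypothetical helper, and the integers 0..2192 stand in for the CSV rows. It only assumes what the question observed: the input stream repeats endlessly and every batch is filled to exactly `batch_size`.

```python
import itertools


def queue_style_batches(records, batch_size):
    """Yield fixed-size batches from an endlessly repeating stream of records."""
    stream = itertools.cycle(records)
    while True:
        # Each batch takes the next batch_size items, ignoring where the file ends.
        yield list(itertools.islice(stream, batch_size))


records = list(range(2193))  # stand-in for the 2193 CSV rows
batches = queue_style_batches(records, 100)

# The question's loop runs int(2193 / 100) == 21 batches per "epoch".
first_epoch = [next(batches) for _ in range(21)]
second_epoch_first = next(batches)

print(len(first_epoch[-1]))       # size of the last batch in the first "epoch"
print(second_epoch_first[0])      # record index at which the second "epoch" starts
print(second_epoch_first[93])     # first record taken after wrapping past the end of the file
```

In this model the second "epoch" begins at record 2100 rather than record 0, and its first batch mixes the last 93 rows of one pass with the first 7 rows of the next, which matches both symptoms described in the question.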

Here's an example of how you could use tf.contrib.data to rewrite your program:

lines = tf.contrib.data.TextLineDataset("BCHARTS-BITSTAMPUSD.csv")

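def decode(line):
  # Default values, in case of empty columns. Also specifies the type of the
  # decoded result.
  record_defaults = [[0.], [0.], [0.], [0.], [0.],[0.],[0.],[0.]]
  # Parse one CSV line; the last column is returned separately as the target.
  xy = tf.decode_csv(line, record_defaults=record_defaults)
  return xy[0:-1], xy[-1:]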

decoded = lines.map(decode)

batched = decoded.batch(100)

iterator = batched.make_initializable_iterator()

train_x_batch, train_y_batch = iterator.get_next()

The training part can then become a bit simpler:

# initialize
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# train my model
for epoch in range(training_epochs):
  # Running totals for this epoch; no hard-coded record count is needed.
  total_cost = 0.0
  total_batch = 0

  # Re-initialize the iterator for another epoch.
  sess.run(iterator.initializer)

  while True:

    # NOTE: It is inefficient to make a separate sess.run() call to get each batch 
    # of input data and then feed it into a different sess.run() call. For better
    # performance, define your training graph to take train_x_batch and train_y_batch
    # directly as inputs.
    try:
      batch_xs, batch_ys = sess.run([train_x_batch, train_y_batch])
    except tf.errors.OutOfRangeError:
      break

    feed_dict = {X: batch_xs, Y: batch_ys}
    c, _ = sess.run([cost, optimizer], feed_dict=feed_dict)
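    total_cost += c
    # total_batch counts records rather than batches, since the final batch may be short.
    total_batch += batch_xs.shape[0]

  # This is a per-record average if `cost` is a sum over the batch.
  avg_cost = total_cost / total_batch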

  print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))
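As a rough plain-Python model of what the loop above sees in each epoch (not the tf.contrib.data implementation; `epoch_batches` is a hypothetical helper and the integers stand in for CSV rows), batching that stops at the end of the data gives 21 full batches and one short one. Summing the batch sizes recovers the record count without hard-coding it:

```python
def epoch_batches(records, batch_size):
    """Yield consecutive batches covering the records exactly once; the last may be short."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = list(range(2193))  # stand-in for the 2193 CSV rows

# Each call starts again from the first record, like re-initializing the iterator.
batches = list(epoch_batches(records, 100))
sizes = [len(b) for b in batches]

print(len(sizes), sizes[0], sizes[-1], sum(sizes))  # prints: 22 100 93 2193
```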

For more details on how to use the new API, see the "Importing Data" programmer's guide.
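If the record count is still needed before training starts, one option outside TensorFlow is to count the rows once up front. This is a minimal sketch, assuming the CSV has no header row; `count_records` is a hypothetical helper, and the demonstration uses a small temporary file with made-up values rather than the real data.

```python
import csv
import os
import tempfile


def count_records(path):
    """Count the non-empty rows in a CSV file by streaming through it once."""
    with open(path, newline="") as f:
        return sum(1 for row in csv.reader(f) if row)


# Demonstration on a small temporary file (made-up data, 8 columns per row).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    for i in range(5):
        writer.writerow([float(i)] * 8)
    tmp_path = tmp.name

num_records = count_records(tmp_path)
os.remove(tmp_path)
print(num_records)
```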
