python - TensorFlow - Read all examples from a TFRecords file at once?

Question

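How do you read all examples from a TFRecords file at once?

I have been using tf.parse_single_example to read individual examples, with code similar to the read_and_decode method in the fully_connected_reader example. However, I want to run the network against my entire validation dataset in one go, so I would like to load all of the examples at once.

I am not entirely sure, but the documentation seems to suggest that I can use tf.parse_example instead of tf.parse_single_example to load the whole TFRecords file at once. I cannot get this to work. I suspect it has to do with how I specify the features, but I am not sure how to state in the feature specification that there are multiple examples.

In other words, I tried something like this: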

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_example(serialized_example, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})

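This does not work, and I assume that is because the features are not expecting multiple examples at once (but again, I am not sure). [The resulting error is ValueError: Shape () must have rank 1]

Is this the right way to read all records at once? If so, what do I need to change to actually read the records? Thank you very much!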

score 24 · Accepted Answer

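Just for clarity, I have a few thousand images in a single .tfrecords file; they are 720 x 720 RGB PNG files. The labels are one of 0, 1, 2, 3.

I also tried using parse_example and could not make it work, but this solution works with parse_single_example.

The downside is that right now I have to know how many items are in each .tfrecords file, which is a bit of a bummer. If I find a better way, I will update the answer. Also, be careful about going past the number of records in the .tfrecords file: if you loop past the last record, it starts over at the first record.

The trick was to have the queue runner use a coordinator.

I left some code in here to save the images as they are being read, so that you can verify the images are correct.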

from PIL import Image
import numpy as np
import tensorflow as tf


def read_and_decode(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        # Defaults are not specified since all keys are required.
        features={
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
            'height': tf.FixedLenFeature([], tf.int64),
            'width': tf.FixedLenFeature([], tf.int64),
            'depth': tf.FixedLenFeature([], tf.int64)
        })
    # The image is stored as raw bytes; decode it to a flat uint8 vector.
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    label = tf.cast(features['label'], tf.int32)
    height = tf.cast(features['height'], tf.int32)
    width = tf.cast(features['width'], tf.int32)
    depth = tf.cast(features['depth'], tf.int32)
    return image, label, height, width, depth


def get_all_records(FILE):
    with tf.Session() as sess:
        filename_queue = tf.train.string_input_producer([FILE])
        image, label, height, width, depth = read_and_decode(filename_queue)
        # Note: tf.pack was renamed tf.stack in TensorFlow 1.0.
        image = tf.reshape(image, tf.pack([height, width, 3]))
        image.set_shape([720, 720, 3])
        init_op = tf.initialize_all_variables()
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        # Hard-coded record count; looping further wraps around to the start.
        for i in range(2053):
            example, l = sess.run([image, label])
            img = Image.fromarray(example, 'RGB')
            # The output/ directory must already exist.
            img.save("output/" + str(i) + '-train.png')

            print(example, l)
        coord.request_stop()
        coord.join(threads)


get_all_records('/path/to/train-0.tfrecords')
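
The record count that is hard-coded in the loop above can be found without TensorFlow by walking the file's framing. The following is a minimal sketch, assuming an uncompressed TFRecords file in the standard layout (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, then a 4-byte CRC of the payload). The names iter_tfrecord_payloads and count_records are illustrative, not part of any library, and the CRCs are skipped rather than verified.

```python
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw payload bytes of each record in an uncompressed TFRecords file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                return  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            footer = f.read(4)  # 4-byte CRC of the payload (not checked here)
            if len(payload) < length or len(footer) < 4:
                raise ValueError("truncated record body")
            yield payload


def count_records(path):
    """Count records by walking the framing, without parsing any payload."""
    return sum(1 for _ in iter_tfrecord_payloads(path))
```

The result of count_records could replace the literal in the range(...) call, at the cost of one extra pass over the file.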

score 12 · Accepted Answer

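To read all the data just once, you need to pass num_epochs to string_input_producer. When all the records have been read, the reader's .read method throws an error, which you can catch. Simplified example: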

import tensorflow as tf


def read_and_decode(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            'image_raw': tf.FixedLenFeature([], tf.string)
        })
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    return image


def get_all_records(FILE):
    with tf.Session() as sess:
        filename_queue = tf.train.string_input_producer([FILE], num_epochs=1)
        image = read_and_decode(filename_queue)
        # num_epochs is tracked in a local variable, so local variables
        # must be initialized too (initializer names from TensorFlow 0.12+).
        init_op = tf.group(tf.global_variables_initializer(),
                           tf.local_variables_initializer())
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        try:
            while True:
                example = sess.run([image])
        except tf.errors.OutOfRangeError as e:
            # Raised once the single epoch is exhausted.
            coord.request_stop(e)
        finally:
            coord.request_stop()
            coord.join(threads)


get_all_records('/path/to/train-0.tfrecords')

And to use tf.parse_example (which is faster than tf.parse_single_example), you need to batch the examples first, like this:

batch = tf.train.batch([serialized_example], num_examples, capacity=num_examples)
parsed_examples = tf.parse_example(batch, feature_spec)

Unfortunately, this way you need to know the number of examples beforehand.
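
For readers on a newer release, the record count does not have to be known up front. The sketch below assumes TensorFlow 2.x with eager execution (tf.data and the tf.io names are newer than the queue-based API used in this thread), and write_demo_file and read_all_labels are illustrative names. It collects every serialized record, then hands them to tf.io.parse_example as one rank-1 batch, which is the shape the ValueError in the question is asking for.

```python
import os
import tempfile

import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution enabled


def write_demo_file(path, labels):
    """Write one tf.train.Example per label (demo data only)."""
    with tf.io.TFRecordWriter(path) as writer:
        for label in labels:
            example = tf.train.Example(features=tf.train.Features(feature={
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())


def read_all_labels(path):
    """Parse every record in the file with a single parse_example call."""
    # The number of records is discovered by iterating over the dataset.
    serialized = [record.numpy() for record in tf.data.TFRecordDataset(path)]
    # parse_example expects a rank-1 batch of serialized strings.
    parsed = tf.io.parse_example(
        tf.constant(serialized, dtype=tf.string),
        features={"label": tf.io.FixedLenFeature([], tf.int64)},
    )
    return parsed["label"].numpy()


demo_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecords")
write_demo_file(demo_path, [0, 1, 2, 3])
all_labels = read_all_labels(demo_path)
```

This holds all serialized records in memory at once, which is probably acceptable for a validation set but may not be for a large training set.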

score 12 · Accepted Answer

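If you need to read all the data from a TFRecord at once, you can write a much simpler solution in just a few lines of code using tf_record_iterator:

An iterator that reads the records from a TFRecords file.

To do this, you just:

Create an example
Iterate over the records coming from the iterator
Parse each record and read each feature depending on its type

Here is an example showing how to read each type.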

import tensorflow as tf

example = tf.train.Example()
# <tfrecord_file> is a placeholder for the path to your TFRecords file.
for record in tf.python_io.tf_record_iterator(<tfrecord_file>):
    example.ParseFromString(record)
    f = example.features.feature
    v1 = f['int64 feature'].int64_list.value[0]
    v2 = f['float feature'].float_list.value[0]
    v3 = f['bytes feature'].bytes_list.value[0]
    # For bytes you might want to represent them differently (based on what they were before saving),
    # e.g. something like `np.fromstring(f['img'].bytes_list.value[0], dtype=np.uint8)`
    # Now do something with your v1/v2/v3
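
The comment about bytes features can be made concrete. This is a minimal sketch, assuming the image was stored as raw uint8 pixels in height x width x depth order (as the height, width and depth features in the first answer suggest). decode_image_bytes is a hypothetical helper, and np.frombuffer is used in place of np.fromstring, whose binary mode is deprecated in recent NumPy releases.

```python
import numpy as np


def decode_image_bytes(raw, height, width, depth):
    """Turn the bytes of a bytes_list value back into an image array."""
    flat = np.frombuffer(raw, dtype=np.uint8)
    if flat.size != height * width * depth:
        raise ValueError("byte count does not match the given dimensions")
    # The result is a read-only view of `raw`; call .copy() to modify it.
    return flat.reshape(height, width, depth)


# Round trip on a tiny fake image to show the idea.
pixels = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)
restored = decode_image_bytes(pixels.tobytes(), 2, 4, 3)
```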

score 9 · Accepted Answer

You can also use tf.python_io.tf_record_iterator to iterate over a TFRecord manually.

I tested it with the illustrative code below:

import tensorflow as tf

X = [[1, 2],
     [3, 4],
     [5, 6]]


def _int_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


def dump_tfrecord(data, out_file):
    writer = tf.python_io.TFRecordWriter(out_file)
    for x in data:
        example = tf.train.Example(
            features=tf.train.Features(feature={
                'x': _int_feature(x)
            })
        )
        writer.write(example.SerializeToString())
    writer.close()


def load_tfrecord(file_name):
    features = {'x': tf.FixedLenFeature([2], tf.int64)}
    data = []
    for s_example in tf.python_io.tf_record_iterator(file_name):
        example = tf.parse_single_example(s_example, features=features)
        data.append(tf.expand_dims(example['x'], 0))
    return tf.concat(0, data)


if __name__ == "__main__":
    dump_tfrecord(X, 'test_tfrecord')
    print('dump ok')
    data = load_tfrecord('test_tfrecord')

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        Y = sess.run([data])
        print(Y)

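Of course, you have to use your own feature specification.

The downside is that I do not know how to use multiple threads with this approach. However, the usual occasion for reading all examples is evaluating a validation dataset, which is typically not very big, so I think efficiency is unlikely to be a bottleneck.

I also ran into another issue while testing this: I had to specify the feature length. Instead of tf.FixedLenFeature([], tf.int64), I had to write tf.FixedLenFeature([2], tf.int64); otherwise an InvalidArgumentError occurred. I do not know how to avoid this.

Python: 3.4
TensorFlow: 0.12.0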
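
On the feature length issue: FixedLenFeature([]) describes exactly one value per example, while each 'x' here holds two, so the shape has to be [2]. When the length is unknown or varies between examples, VarLenFeature is the usual alternative. A minimal sketch, assuming TensorFlow 2.x (eager execution and the tf.io names), which is newer than the 0.12.0 used above:

```python
import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution enabled

# One serialized Example whose 'x' feature holds two int64 values.
serialized = tf.train.Example(features=tf.train.Features(feature={
    "x": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2])),
})).SerializeToString()

# Fixed shape: the spec must match the two stored values.
fixed = tf.io.parse_single_example(
    serialized, {"x": tf.io.FixedLenFeature([2], tf.int64)})

# Variable length: no shape is declared and the result is a SparseTensor.
variable = tf.io.parse_single_example(
    serialized, {"x": tf.io.VarLenFeature(tf.int64)})
dense = tf.sparse.to_dense(variable["x"])

# A scalar spec does not fit two values and is rejected.
try:
    tf.io.parse_single_example(
        serialized, {"x": tf.io.FixedLenFeature([], tf.int64)})
    scalar_spec_failed = False
except tf.errors.InvalidArgumentError:
    scalar_spec_failed = True
```

The trade-off is that the sparse result has to be densified (or padded when batching) before it can be used like an ordinary tensor.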

score 4 · Accepted Answer

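I do not know whether this is still an active topic, since the question is a year old, but I would like to share the best practice I know of so far.

TensorFlow has a very useful method for this kind of problem (reading or iterating over the whole input data and generating randomized training and test datasets). tf.train.shuffle_batch can generate batches from an input stream (such as reader.read()) on your behalf. For example, you can generate a batch of 1000 examples by providing an argument list like this: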

reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={
        'label': tf.FixedLenFeature([], tf.string),
        'image': tf.FixedLenFeature([], tf.string)
    }
)
record_image = tf.decode_raw(features['image'], tf.uint8)

image = tf.reshape(record_image, [500, 500, 1])
label = tf.cast(features['label'], tf.string)
min_after_dequeue = 10
batch_size = 1000
capacity = min_after_dequeue + 3 * batch_size
image_batch, label_batch = tf.train.shuffle_batch(
    [image, label], batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue
)

score 1 · Accepted Answer

Besides, if tf.train.shuffle_batch is not what you need, you can also try combining tf.TFRecordReader().read_up_to() with tf.parse_example(). Here is an example for reference:

import glob

import tensorflow as tf


def read_tfrecords(folder_name, bs):
    filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once(glob.glob(folder_name + "/*.tfrecords")))
    reader = tf.TFRecordReader()
    _, serialized = reader.read_up_to(filename_queue, bs)
    features = tf.parse_example(
        serialized,
        features={
            'label': tf.FixedLenFeature([], tf.string),
            'image': tf.FixedLenFeature([], tf.string)
        }
    )
    record_image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.reshape(record_image, [-1, 250, 151, 1])
    label = tf.cast(features['label'], tf.string)
    return image, label

score 0 · Accepted Answer

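1) But if we have several tfrecord files, how do we loop over all of them, collect every (image, label) pair, and then plot all the labels at once? 2) If we have an imbalanced dataset with 100 classes and a batch size of 64, how can we make sure that all classes are covered, say by taking batches of 128, so that at least one example from each class is selected during training?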
