I have a dataset of paths keyed by path length. The network needs to receive paths of the same length and batch them together. I'm looking for a way (with the Dataset API) to pick a path length according to the length distribution, as simple as P(length) = (number of paths for this length) / (total number of paths), and then take a batch of paths of that length. I intend to write tfrecord files (I don't think I can use the python dict directly) that store these paths in separate directories, one per length. How can I construct a queue with the Dataset API that returns batch_size tfrecords from a directory, but picks the next directory to read from according to the path-length distribution? So I figure I could construct one dataset per path length like this:
import os
import tensorflow as tf

datasets = {}
for length in path_dict.keys():
    paths_dir = '/../tfrecords/%d' % length
    filenames = [os.path.join(paths_dir, x) for x in os.listdir(paths_dir)]
    datasets[length] = tf.contrib.data.TFRecordDataset(filenames).batch(batch_size)
But how do I build a new dataset that draws its next element from one of these datasets, choosing the length according to the length distribution?
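For anyone reading along: later TF releases ship a helper that appears to do exactly this (tf.contrib.data.sample_from_datasets, later tf.data.experimental.sample_from_datasets). A minimal sketch, assuming a TF version that has the helper and that the per-length datasets above are rebuilt with tf.data.TFRecordDataset, reusing the datasets and path_dict from the snippet:

import tensorflow as tf

lengths = sorted(datasets.keys())
total_paths = sum(len(v) for v in path_dict.values())
# P(length) = (number of paths for this length) / (total number of paths)
weights = [len(path_dict[l]) / total_paths for l in lengths]

# Each draw picks one of the per-length datasets according to `weights`;
# because every per-length dataset is already batched, each element that
# comes out is a whole batch of records sharing a single path length.
mixed = tf.data.experimental.sample_from_datasets(
    [datasets[l] for l in lengths], weights=weights)

next_batch = mixed.make_one_shot_iterator().get_next()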
(EDIT: I may be going about this the wrong way, e.g. I could have a single tfrecord per length or something; feedback welcome)
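If I instead went with one tfrecord file per length (everything for a given length in a single file), building the dict would simplify; a sketch, assuming files named '<length>.tfrecord' (my own naming convention):

import os
import tensorflow as tf

tfrecords_dir = '/../tfrecords'
datasets = {}
for fname in os.listdir(tfrecords_dir):
    length = int(os.path.splitext(fname)[0])  # e.g. '12.tfrecord' -> 12
    path = os.path.join(tfrecords_dir, fname)
    # .repeat() lets a popular length be drawn more often than the file
    # has records; .batch() keeps equal-length paths together.
    datasets[length] = (tf.contrib.data.TFRecordDataset(path)
                        .repeat()
                        .batch(batch_size))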
EDIT2: In the old API I could do the following:
import os
from collections import OrderedDict

import numpy as np


def _random_filenames(tfrecords_dir, batch_size, examples):
    paths = OrderedDict()
    # tfrecords_dirs names are just the length of the paths they contain
    tfrecords_dirs = sorted(map(int, os.listdir(tfrecords_dir)))
    for d in tfrecords_dirs:
        dir_ = os.path.join(tfrecords_dir, str(d))
        files = os.listdir(dir_)
        paths[d] = sorted(os.path.join(dir_, f) for f in files)
    # Calculate path statistics
    total_paths = sum(len(v) for v in paths.values())
    print('Total paths %d' % total_paths)
    stats = [(len(v) / total_paths) for v in paths.values()]
    lengths = list(paths.keys())
    filenames = []
    for j in range(0, examples, batch_size):
        pick = int(np.random.choice(lengths, p=stats))
        filenames.extend(list(np.random.choice(paths[pick], size=batch_size)))
    return filenames
and then produce a queue from the filenames:
import tensorflow as tf


def _get_filename_queue(filenames):
    # Create a queue that produces the filenames to read.
    filename_queue = tf.train.string_input_producer(
        filenames,
        # epochs makes no sense here as we will just repeat the same filenames;
        # shuffle even less so, as we need batches of equal length
        num_epochs=None, shuffle=False)
    return filename_queue
I'm looking for a way to convert this to the new Dataset API.
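The most direct translation I can see, for reference (a sketch, not verified; batch_size=32 and examples=100000 are placeholder values, and parsing of the serialized examples is omitted): keep _random_filenames as-is and replace the queue with a dataset built over the precomputed filename list.

import tensorflow as tf

filenames = _random_filenames('/../tfrecords', batch_size=32, examples=100000)

dataset = (tf.data.Dataset.from_tensor_slices(tf.constant(filenames))
           # string_input_producer(num_epochs=None) -> repeat indefinitely
           .repeat()
           # read the records out of each file, preserving order
           .flat_map(tf.data.TFRecordDataset)
           # filenames were appended in groups of 32 per picked length, so an
           # order-preserving batch of 32 is again single-length (assuming
           # one serialized example per file)
           .batch(32))

next_batch = dataset.make_one_shot_iterator().get_next()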