python - Theano - Shared variable as function input for a large dataset

Question

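I'm new to Theano, so I apologize if this is obvious.

I'm trying to train a CNN based on the LeNet tutorial. One main difference from the tutorial is that my dataset is too large to fit in memory, so I have to load each batch during training.

The original model has this: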

train_model = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

...which doesn't work for me, because it assumes that train_set_x is fully loaded in memory.

So I switched to this:

train_model = theano.function([x,y], cost, updates=updates)

and tried to call it like this:

data, target = load_data(minibatch_index)  # load_data returns typical numpy.ndarrays for a given minibatch

data_shared = theano.shared(np.asarray(data, dtype=theano.config.floatX), borrow=True)
target_shared = T.cast(theano.shared(np.asarray(target, dtype=theano.config.floatX), borrow=True), 'int32')

cost_ij = train_model(data_shared, target_shared)

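but got:

TypeError: ('Bad input argument to theano function with name ":103" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')

So I guess I can't use a shared variable as input to a Theano function. But then, how should I proceed?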

score 5 · Accepted Answer

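All inputs to a compiled Theano function (i.e. the object returned by a call to theano.function(...)) should always be concrete values, typically scalars or numpy arrays. Shared variables are a way to wrap a numpy array and treat it like a symbolic variable, but that wrapping is not needed when the data is passed as an input.

So you should be able to skip wrapping your data and target values as shared variables and do the following instead: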

cost_ij = train_model(data, target)
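
As a rough sketch of what the per-batch loop could look like: here load_data is the loader from the question and train_model is the compiled function, while train_one_epoch, n_batches and float_dtype are hypothetical names introduced only for illustration. It assumes theano.config.floatX is float32 and that y is an int32 vector as in the tutorial. The arrays are cast before the call because, as far as I know, a compiled function rejects array inputs that would need a downcast unless it was compiled with allow_input_downcast=True.

```python
import numpy as np

def train_one_epoch(train_model, load_data, n_batches, float_dtype="float32"):
    """Call train_model once per minibatch, passing plain NumPy arrays.

    train_model: assumed to be the callable returned by theano.function([x, y], cost, ...)
    load_data:   assumed to return (data, target) for a given minibatch index
    """
    costs = []
    for minibatch_index in range(n_batches):
        data, target = load_data(minibatch_index)
        # Cast to the dtypes the symbolic inputs x and y are assumed to use.
        data = np.asarray(data, dtype=float_dtype)
        target = np.asarray(target, dtype="int32")
        # Pass the arrays directly; no theano.shared wrapping is involved.
        costs.append(float(train_model(data, target)))
    return costs
```

Only one minibatch is held in memory at a time, so the size of the full dataset no longer matters to the compiled function.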

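Note that if you are using a GPU, this means your data will reside in the computer's main memory, and each piece you pass as an input has to be copied to GPU memory separately, which adds overhead and slows things down. Also note that you have to split the data up and pass only part of it at a time; this change of approach will not let you run GPU computation over the entire dataset at once if the whole dataset does not fit in GPU memory.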
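
One possible middle ground for reducing that copy overhead is to keep shared variables sized for a chunk of several minibatches, refill them with set_value each time a new chunk is loaded, and slice minibatches out of the chunk through givens, as the original tutorial code does. The sketch below shows only the refill loop and rests on several assumptions: train_on_chunks, load_chunk, n_chunks and batches_per_chunk are hypothetical names, shared_x and shared_y stand for shared variables created once with theano.shared (for the targets, the underlying float shared variable rather than the result of T.cast), and train_model is assumed to be compiled with [index] as its input.

```python
import numpy as np

def train_on_chunks(train_model, shared_x, shared_y, load_chunk,
                    n_chunks, batches_per_chunk, float_dtype="float32"):
    """Refill shared storage once per chunk, then train on its minibatches.

    shared_x, shared_y: objects exposing set_value(), assumed to be Theano
                        shared variables referenced in train_model's givens
    load_chunk:         hypothetical loader returning (data, target) for a chunk
    """
    costs = []
    for chunk_index in range(n_chunks):
        data, target = load_chunk(chunk_index)
        # Replace the shared contents once per chunk rather than per minibatch.
        shared_x.set_value(np.asarray(data, dtype=float_dtype))
        shared_y.set_value(np.asarray(target, dtype=float_dtype))
        for index in range(batches_per_chunk):
            # train_model slices the current chunk by minibatch index.
            costs.append(float(train_model(index)))
    return costs
```

The chunk size is a trade-off: each chunk still has to fit in GPU memory, but larger chunks mean fewer transfers.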
