I want to run my Theano code on a Colab GPU, so I am trying to set the Theano flags accordingly. I tried
import os
os.environ['THEANO_FLAGS'] = """ device=cuda0,force_device=True,blas.ldflags="-L/usr/lib/ -lblas", floatX=float32, mode=FAST_RUN, lib.cnmem=.5, profile=True, CUDA_LAUNCH_BLOCKING=1 """
import theano
and
!printf """[global]\\ndevice = cuda\\nfloatX = float32\\nforce_device=True\\nmode=FAST_RUN\\nlib.cnmem=.5\\nprofile=True\\nCUDA_LAUNCH_BLOCKING=1""" > ~/.theanorc
!cat ~/.theanorc
but neither of them seems to work: according to the profiler, all ops are CPU-specific (Elemwise rather than GpuElemwise, no GpuFromHost, etc.).
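As a sanity check, I also confirmed that the Colab instance has a GPU at all and looked at which device Theano reports after import. My understanding is that THEANO_FLAGS is only read the first time theano is imported in a process, so the flags must be set before that first import; the snippet below assumes a GPU runtime is enabled under Runtime > Change runtime type.
# Verify the Colab runtime actually has a GPU attached;
# this fails on a CPU-only runtime.
!nvidia-smi

import os
# Flags must be set before theano is imported for the first time
# in this process; setting them after import has no effect.
os.environ['THEANO_FLAGS'] = 'device=cuda,floatX=float32,force_device=True'

import theano
# Should print 'cuda' (or 'cuda0') if the gpuarray backend initialized;
# 'cpu' suggests the flags did not take effect or pygpu/libgpuarray is missing.
print(theano.config.device)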
I tried this code:
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])
TS = theano.shared(input_data.astype('float32'), "training-set")
E = theano.shared(output_data.astype('float32'), "expected")
W1 = theano.shared(numpy.zeros((1, 2), dtype='float32'))
O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T)).astype('float32')
gradient = T.grad(cost=cost, wrt=W1).astype('float32')
update = [[W1, W1 - gradient * numpy.float32(0.0001)]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile=True)

for i in range(1000):
    train()

train.profile.summary()
and got the following output:
Function profiling
==================
Message: <ipython-input-20-49bdedf42dbb>:27
Time in 1000 calls to Function.__call__: 1.391292e-02s
Time in Function.fn.__call__: 7.742643e-03s (55.651%)
Time in thunks: 3.543854e-03s (25.472%)
Total compile time: 5.829549e-02s
Number of Apply nodes: 16
Theano Optimizer time: 4.293251e-02s
Theano validate time: 7.207394e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.048517e-02s
Import time 0.000000e+00s
Node make_thunk time 9.668112e-03s
Node InplaceDimShuffle{x,x}(Subtensor{int64}.0) time 1.002550e-03s
Node InplaceDimShuffle{1,0}(training-set) time 9.713173e-04s
Node InplaceDimShuffle{x,x}(Subtensor{int64}.0) time 9.384155e-04s
Node Gemm{inplace}(<TensorType(float32, matrix)>, TensorConstant{-1e-04}, Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set, TensorConstant{1.0}) time 7.627010e-04s
Node Gemm{no_inplace}(expected, TensorConstant{-1.0}, <TensorType(float32, matrix)>, training-set.T, TensorConstant{1.0}) time 7.226467e-04s
Time in all call to theano.grad() 2.316711e-01s
Time since theano import 1824.793s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
28.5% 28.5% 0.001s 5.05e-07s C 2000 2 theano.tensor.blas.Gemm
20.6% 49.1% 0.001s 1.46e-07s C 5000 5 theano.tensor.elemwise.Elemwise
18.5% 67.6% 0.001s 2.18e-07s C 3000 3 theano.tensor.elemwise.DimShuffle
12.8% 80.4% 0.000s 4.54e-07s C 1000 1 theano.tensor.elemwise.Sum
9.3% 89.7% 0.000s 1.65e-07s C 2000 2 theano.tensor.subtensor.Subtensor
6.1% 95.8% 0.000s 1.08e-07s C 2000 2 theano.compile.ops.Shape_i
4.2% 100.0% 0.000s 1.50e-07s C 1000 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
15.9% 15.9% 0.001s 5.65e-07s C 1000 1 Gemm{no_inplace}
12.8% 28.8% 0.000s 4.54e-07s C 1000 1 Sum{acc_dtype=float64}
12.6% 41.3% 0.000s 4.45e-07s C 1000 1 Gemm{inplace}
11.0% 52.3% 0.000s 1.94e-07s C 2000 2 InplaceDimShuffle{x,x}
9.3% 61.6% 0.000s 1.65e-07s C 2000 2 Subtensor{int64}
7.5% 69.1% 0.000s 2.66e-07s C 1000 1 InplaceDimShuffle{1,0}
5.8% 74.9% 0.000s 2.05e-07s C 1000 1 Elemwise{mul,no_inplace}
5.4% 80.2% 0.000s 1.91e-07s C 1000 1 Elemwise{Composite{((i0 * i1) / i2)}}
5.0% 85.3% 0.000s 1.78e-07s C 1000 1 Elemwise{Cast{float32}}
4.2% 89.5% 0.000s 1.50e-07s C 1000 1 MakeVector{dtype='int64'}
3.1% 92.6% 0.000s 1.08e-07s C 1000 1 Shape_i{0}
3.0% 95.6% 0.000s 1.08e-07s C 1000 1 Shape_i{1}
2.2% 97.8% 0.000s 7.94e-08s C 1000 1 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
2.2% 100.0% 0.000s 7.68e-08s C 1000 1 Elemwise{Sqr}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
15.9% 15.9% 0.001s 5.65e-07s 1000 3 Gemm{no_inplace}(expected, TensorConstant{-1.0}, <TensorType(float32, matrix)>, training-set.T, TensorConstant{1.0})
12.8% 28.8% 0.000s 4.54e-07s 1000 14 Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
12.6% 41.3% 0.000s 4.45e-07s 1000 13 Gemm{inplace}(<TensorType(float32, matrix)>, TensorConstant{-1e-04}, Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set, TensorConstant{1.0})
7.5% 48.8% 0.000s 2.66e-07s 1000 0 InplaceDimShuffle{1,0}(training-set)
6.1% 54.9% 0.000s 2.15e-07s 1000 7 Subtensor{int64}(Elemwise{Cast{float32}}.0, Constant{1})
5.8% 60.6% 0.000s 2.05e-07s 1000 10 Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
5.6% 66.2% 0.000s 1.97e-07s 1000 8 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
5.4% 71.6% 0.000s 1.92e-07s 1000 9 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
5.4% 77.0% 0.000s 1.91e-07s 1000 11 Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Gemm{no_inplace}.0, Elemwise{mul,no_inplace}.0)
5.0% 82.0% 0.000s 1.78e-07s 1000 5 Elemwise{Cast{float32}}(MakeVector{dtype='int64'}.0)
4.2% 86.3% 0.000s 1.50e-07s 1000 4 MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
3.2% 89.5% 0.000s 1.15e-07s 1000 6 Subtensor{int64}(Elemwise{Cast{float32}}.0, Constant{0})
3.1% 92.6% 0.000s 1.08e-07s 1000 2 Shape_i{0}(expected)
3.0% 95.6% 0.000s 1.08e-07s 1000 1 Shape_i{1}(expected)
2.2% 97.8% 0.000s 7.94e-08s 1000 15 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
2.2% 100.0% 0.000s 7.68e-08s 1000 12 Elemwise{Sqr}[(0, 0)](Gemm{no_inplace}.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
- Try the Theano flag floatX=float32
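For completeness, here is how I understand one can check whether a compiled function actually ended up on the GPU; it is adapted from the GPU test in the Theano tutorial and simply scans the compiled graph for Gpu ops. If the backend were active, I would expect GpuElemwise and HostFromGpu nodes in the toposort:
import numpy
import time
from theano import function, config, shared, tensor

vlen = 10 * 30 * 768  # 10 x #cores x #threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))

t0 = time.time()
for i in range(iters):
    r = f()
print('Looping %d times took %f seconds' % (iters, time.time() - t0))

# Plain Elemwise nodes mean the function runs on the CPU;
# on the GPU they would be GpuElemwise instead.
if numpy.any([isinstance(node.op, tensor.Elemwise) and
              ('Gpu' not in type(node.op).__name__)
              for node in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')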
Thanks in advance for your help.