I know most users pick TensorFlow or PyTorch as their modeling framework, but I am trying to port a model written in Paddle (called ernie-doc) so that it runs on Kaggle, and I suspect some GPU-related problem has come up.
!pip install -q -U paddlepaddle-gpu
import paddle
import paddle.fluid as fluid
paddle.enable_static()
# sanity-check the install, as the official guide suggests
fluid.install_check.run_check()
It runs successfully:
Running Verify Fluid Program ...
Your Paddle Fluid works well on SINGLE GPU or CPU.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid
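As an extra sanity check (my own addition, not from the ernie-doc instructions), I also confirmed that Paddle can actually see the device:

import paddle.fluid as fluid
print(fluid.is_compiled_with_cuda())       # should be True for the paddlepaddle-gpu build
print(fluid.core.get_cuda_device_count())  # should be 1 on a single-P100 Kaggle kernel

If those report a CUDA build and one device, the GPU itself should be visible.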
However, when I fit the model, things get weird:
sys.path.append(os.path.abspath("/kaggle/input/erniedoc/ernie-doc"))
from finetune.classifier import create_model, evaluate
...
print("use gpu...")
place = fluid.CUDAPlace(0)           # run everything on GPU 0
startup_prog = fluid.Program()
train_program = fluid.Program()
origin_train_program = train_program
exe = fluid.Executor(place)
exe.run(startup_prog)                # initialize parameters
init_model(args, exe, startup_prog)  # restore the pretrained/init checkpoint
...
outputs = evaluate(exe, train_program, train_pyreader, graph_vars,
train_mems_vars, tower_mems_np,
"train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
...
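(I had not instrumented the loop before the crash, but a minimal memory probe I could wrap around the evaluate call would look like this; log_gpu_memory is a hypothetical helper of mine, and the nvidia-smi query flags are standard:)

import subprocess

def log_gpu_memory(tag):
    # Query used/total GPU memory via nvidia-smi's CSV output.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"]).decode().strip()
    print(f"[{tag}] GPU memory: {out}")

That way I could at least tell whether memory grows step by step or is exhausted by one huge allocation.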
In any case, the run fails with:
RuntimeError Traceback (most recent call last)
<ipython-input-8-51b504e78714> in main(args)
163 outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
164 train_mems_vars, tower_mems_np,
--> 165 "train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
166 tower_mems_np = outputs['tower_mems_np']
167
...
/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py in _run_program(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
1230 else:
1231 self._default_executor.run_prepared_ctx(ctx, scope, False, False,
-> 1232 False)
1233 arr = scope.find_var(fetch_var_name).get_fetch_list()
1234 tensors = arr._move_to_list()
RuntimeError:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2 paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3 paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4 paddle::memory::allocation::Allocator::Allocate(unsigned long)
5 paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6 paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7 paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8 paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9 paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10 paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15 paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
16 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 432.000244MB memory on GPU 0, 15.811646GB memory has been allocated and available memory is only 89.750000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
What is going on here? The script is adapted from the official one, and it clearly did allocate some memory, so I assume the GPU is attached and the script is error-free up to this point. But why does this happen? GPU 0 has 16GB of memory and nothing else is running on it. Checking the GPU info afterwards with nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 41C P0 35W / 250W | 16191MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
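For what it's worth, the only allocator knobs I have found so far are Paddle's environment flags (these flags do exist in the Paddle docs, but I am not sure they apply to my case, and the values below are guesses):

import os
# Must be set before paddle is imported.
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.5"  # shrink the initial pool
os.environ["FLAGS_eager_delete_tensor_gb"] = "0.0"         # free temporaries eagerly
import paddle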
Should I stop some process, or do something else? Any advice would be greatly appreciated!