Environment:

Ubuntu 18.04

Python: 2.7.15rc1

GPU 0: GeForce RTX 2080 Ti

GPU 1: Quadro P1000

CUDA: 9.1.85

TensorFlow: 1.12.0

pip install magenta-gpu

If I run pip install magenta and then run melody_rnn_train, the training steps complete with my custom sequence_example file, training_melodies.tfrecord.

But when I run pip uninstall magenta, then pip install magenta-gpu, and run melody_rnn_train on the same dataset, I get a "Segmentation fault". I can see that it is trying to use GPU 0, the NVIDIA GeForce.

The command I ran is:

./.local/bin/melody_rnn_train --config=attention_rnn --run_dir=~/music/run1 --sequence_example_file=~/music/my_midi_sequence_examples/training_melodies.tfrecord --hparams="batch_size=1,rnn_layer_sizes=[64,64]" --num_training_steps=20000
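
One way to narrow this down would be to launch the same command with only one GPU visible to TensorFlow, to see whether the crash is specific to GPU 0. The following is a minimal sketch, not a confirmed fix. CUDA_VISIBLE_DEVICES is the standard CUDA environment variable for hiding devices. The helper name build_training_command is hypothetical, the binary path assumes the command above was run from the home directory, and the GPU index is assumed to follow the numbering TensorFlow reports (GPU 1 being the Quadro).

```python
import os
import subprocess


def build_training_command(gpu_index):
    """Return (argv, env) for melody_rnn_train with a single GPU visible.

    Hypothetical helper: the flags are copied from the command above;
    the ~/.local/bin location is an assumption.
    """
    env = dict(os.environ)
    # Expose only the chosen device to CUDA (and therefore to TensorFlow).
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    argv = [
        os.path.expanduser("~/.local/bin/melody_rnn_train"),
        "--config=attention_rnn",
        "--run_dir=" + os.path.expanduser("~/music/run1"),
        "--sequence_example_file=" + os.path.expanduser(
            "~/music/my_midi_sequence_examples/training_melodies.tfrecord"),
        "--hparams=batch_size=1,rnn_layer_sizes=[64,64]",
        "--num_training_steps=20000",
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_training_command(gpu_index=1)
    print(" ".join(argv))
    # Uncomment to actually launch training with only that GPU visible:
    # subprocess.call(argv, env=env)
```

If training succeeds with only the Quadro visible but still crashes with only the GeForce visible, that would point at the GPU 0 and CUDA/cuDNN combination rather than at the dataset or the Magenta install.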

The backtrace of the segmentation fault, obtained with gdb python, is as follows:

    (gdb) bt
    0  0x00007fff4631ec08 in ?? ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    1  0x00007fff4631f114 in ?? ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    2  0x00007fff45e08850 in ?? ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    3  0x00007fff45e2b452 in ?? ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    4  0x00007fff45e2c1de in ?? ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    5  0x00007fff453d2416 in ?? ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    6  0x00007fff453d317b in cudnnGetConvolutionBackwardFilterWorkspaceSize ()     from /usr/lib/x86_64-linux-gnu/libcudnn.so.7
    7  0x00007fff6184f184 in stream_executor::cuda::(anonymous namespace)::AllocateCudnnConvolutionBackwardFilterWorkspace(stream_executor::Stream*, stream_executor::cuda::(anonymous namespace)::CudnnHandle const&, stream_executor::cuda::(anonymous namespace)::CudnnTensorDescriptor const&, stream_executor::cuda::(anonymous namespace)::CudnnFilterDescriptor const&, stream_executor::cuda::(anonymous namespace)::CudnnConvolutionDescriptor const&, stream_executor::cuda::(anonymous namespace)::CudnnTensorDescriptor const&, stream_executor::dnn::AlgorithmDesc*, stream_executor::ScratchAllocator*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    8  0x00007fff6184f597 in stream_executor::cuda::(anonymous namespace)::GetCudnnConvolutionBackwardFilterAlgorithm(stream_executor::Stream*, stream_executor::cuda::(anonymous namespace)::CudnnHandle const&, stream_executor::dnn::AlgorithmConfig const&, stream_executor::cuda::(anonymous namespace)::CudnnTensorDescriptor const&, stream_executor::cuda::(anonymous namespace)::CudnnFilterDescriptor const&, stream_executor::cuda::(anonymous namespace)::CudnnConvolutionDescriptor const&, stream_executor::cuda::(anonymous namespace)::CudnnTensorDescriptor const&, stream_executor::ScratchAllocator*, stream_executor::DeviceMemory<unsigned char>*) [clone .constprop.315] ()     from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    9  0x00007fff6185c7c3 in tensorflow::Status stream_executor::cuda::CudnnSupport::DoConvolveBackwardFilterImpl<float>(stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    10 0x00007fff6185d212 in stream_executor::cuda::CudnnSupport::DoConvolveBackwardFilter(stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    11 0x00007fff617efb2c in stream_executor::Stream::ThenConvolveBackwardFilterWithAlgorithm(stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    12 0x00007fff679e088a in tensorflow::LaunchConv2DBackpropFilterOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, tensorflow::Tensor*, tensorflow::TensorFormat) ()     from .local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
    13 0x00007fff679e12d0 in tensorflow::Conv2DSlowBackpropFilterOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
    14 0x00007fff613ee911 in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    15 0x00007fff613eee32 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    16 0x00007fff61438a56 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    17 0x00007fff61438eea in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()     from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    18 0x00007fff614a81ea in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    19 0x00007fff614a7242 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from .local/lib/python2.7/site-packages/tensorflow/python/../libtensorflow_framework.so
    20 0x00007fff57c128f0 in ?? ()     from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
    21 0x00007ffff77cc6db in start_thread (arg=0x7ffde27fc700) at pthread_create.c:463
    23 0x00007ffff7b0588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The CPU variant works fine for me, but I cannot run the GPU variant because of this segmentation fault.

Can someone tell me if I missed something during setup?
