python - 在 tensorflow gpu 中训练卷积神经网络时出现“Python 已停止工作”

Question

这是我编写的一个 tensorflow 代码，用于测试只有 1 个卷积和池化层的卷积神经网络，其中只有 512 个神经元的 1 个全连接层。

我的数据集只有 2 张图片： http: //imgur.com/et1Sn1k和http://imgur.com/ZWxOGgO

当我训练我的网络窗口时会弹出“Python 已停止”（ss：http: //imgur.com/tc5jWlA）

这是我的代码：

from scipy.misc import imread
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image as Image

base_image = imread("base_image.jpg")
subject_image = imread("subject_image.jpg")

base_image = np.resize(base_image, [1024, 768, 3])
subject_image = np.resize(subject_image, [1024, 768, 3])

images = []

images.append(base_image)
images.append(subject_image)

# hyper parameters

epochs = 10
batch_size = 2
learning_rate = 0.01
n_classes = 2
import tensorflow as tf

# Model

x = tf.placeholder('float', [2, 1024, 768, 3])
y = tf.placeholder('float', [2])

weights = {
            "conv": tf.random_normal([10, 10, 3, 32]),
            "fc": tf.random_normal([-1, 512]), #7*7*64
            "out": tf.Variable(tf.random_normal([512, n_classes]))
}

conv = tf.nn.conv2d(x, filter=weights['conv'], strides=[1, 1, 1, 1], padding="SAME")
conv = tf.nn.max_pool(value=conv, ksize=[1,2,2,1], strides=[1,2,2,1], padding="SAME")

fc = tf.reshape(conv, shape=[2,-1])
fc = tf.nn.relu(tf.matmul(fc, weights['fc']))

output = tf.matmul(fc, weights['out'])

loss = tf.reduce_mean((output - y)**2)

train = tf.train.AdamOptimizer(learning_rate).minimize(loss)

sess = tf.Session()

sess.run(tf.global_variables_initializer())

for i in range(epochs):
    sess.run(train, feed_dict={x: images, y: [0, 1]})
    print(i)

输出：

2017-07-09 02:29:43.688699: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.689131: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.689504: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.689998: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.690380: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.690646: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.691117: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:43.691436: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-09 02:29:44.197766: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 960M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.35GiB
2017-07-09 02:29:44.198207: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0 
2017-07-09 02:29:44.198391: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0:   Y 
2017-07-09 02:29:44.198643: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0)
2017-07-09 02:29:44.881448: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: Dimension -1 must be >= 0
2017-07-09 02:29:44.881875: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: Dimension -1 must be >= 0
     [[Node: random_normal_1/RandomStandardNormal = RandomStandardNormal[T=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/gpu:0"](random_normal_1/shape)]]
2017-07-09 02:29:44.882604: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\framework\op_kernel.cc:1158] Invalid argument: Dimension -1 must be >= 0
     [[Node: random_normal_1/RandomStandardNormal = RandomStandardNormal[T=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/gpu:0"](random_normal_1/shape)]]
2017-07-09 02:29:45.362917: E c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:352] Loaded runtime CuDNN library: 6021 (compatibility version 6000) but source was compiled with 5105 (compatibility version 5100).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2017-07-09 02:29:45.364154: F c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\kernels\conv_ops.cc:671] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
[Finished in 22.7s with exit code 3221226505]
[shell_cmd: python -u "E:\workspace_py\convolotion_neural_net.py"]
[dir: E:\workspace_py]
[path: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp;E:\Program Files\Python 3.5\Scripts\;E:\Program Files\Python 3.5\;C:\ProgramData\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Python27\Scripts;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;E:\Program Files\MATLAB\runtime\win64;E:\Program Files\MATLAB\bin;E:\Program Files\MATLAB\polyspace\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\ProgramData\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Python27\Scripts;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;E:\Program Files\MATLAB\runtime\win64;E:\Program Files\MATLAB\bin;E:\Program Files\MATLAB\polyspace\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Users\guita\AppData\Local\Microsoft\WindowsApps;E:\Program Files\Python27\Scipts;e:\Program Files (x86)\Microsoft VS Code\bin;C:\Users\guita\AppData\Local\atom\bin]

我的电脑规格（联想 Y50）：Nvidia GTX 960m 4 GB 内存，Intel I7 4th gen，8 GB RAM

Python 3.5 + 带 GPU 的 TensorFlow

score 1 · Accepted Answer

所以我设法自己修复它。发生这种情况是因为我的 cuDNN 版本是 6.0，但 tensorflow 最好是 5.1。代码中也有一些微小的错误，但这无关紧要：p

score 0 · Accepted Answer

您的描述“闻起来”缺乏资源，很可能是内存。

如果每次运行代码时都会发生这种情况，那么问题可能出在您的代码中。
我建议您使用memory_profiler或之类的工具来分析其内存消耗heapy。

如果它只是偶尔发生，那么与您的 Python 进程同时运行的某个其他进程正在消耗如此多的资源，以至于没有足够的资源留给您的 Python 进程。

python - 在 tensorflow gpu 中训练卷积神经网络时出现“Python 已停止工作”

2 回答 2

Related

Reference