python - TensorFlow、GPU 或代码中的错误？'无法处理内核分页请求'

Question

我所做的

基本上执行这个 TF 图（通过sess.run(self.minimizer, feed_dict=...)）：

#
# Create the dataset and iterator
#
train_data_shape = (
    (None, data.get_feature_amount()),
    (None, data.get_target_amount()),
)
train_data = tf.data.Dataset.from_generator(
    data.generate_train_data,
    output_types=(config.get_dtype(), config.get_dtype()),
    output_shapes=train_data_shape,
)
train_data = train_data.repeat()
train_data = train_data.prefetch(1)
train_data = train_data.apply(tf.data.experimental.unbatch())
train_data = train_data.shuffle(self.config.get_batch_size())
train_data = train_data.batch(self.config.get_batch_size(), drop_remainder=False)
test_data_shape = (
    (data.get_test_data_amount(), data.get_feature_amount()),
    (data.get_test_data_amount(), data.get_target_amount()),
)
test_data = tf.data.Dataset.from_generator(
    data.generate_test_data,
    output_types=(config.get_dtype(), config.get_dtype()),
    output_shapes=test_data_shape,
)
test_data = test_data.repeat()
self.train_iterator = train_data.make_initializable_iterator()
self.test_iterator = test_data.make_initializable_iterator()
self.iterator_handle = tf.placeholder(tf.string, shape=[])
data_iterator = tf.data.Iterator.from_string_handle(self.iterator_handle, train_data.output_types, train_data.output_shapes)

#
# Create the structure of the DNN (hidden layers and stuff)
#
previous_layer_size = data.get_feature_amount()
target_amount = data.get_target_amount()
(self.features, self.targets) = data_iterator.get_next(name="layer_in")
x = tf.cast(tf.transpose(self.features), config.get_dtype())
x_is_sparse = True
y = self.targets
self.layers = [(None, None, self.features)]
for i, l in enumerate(config.get_hidden_layer_structure()):
    w = tf.Variable(
        tf.random_normal((previous_layer_size, l), 0, 0.05, dtype=config.get_dtype(), seed=config.get_seed()),
        name="weights_" + str(i))
    b = tf.Variable(tf.random_normal((l, 1), 0, 0.05, dtype=config.get_dtype(), seed=config.get_seed()),
                    name="bias_" + str(i))
    sum_w = tf.matmul(w, x, True, b_is_sparse=x_is_sparse)
    x_is_sparse = False
    x = tf.nn.relu(tf.add(sum_w, b), "layer_" + str(i))
    self.layers.append((w, b, x))
    previous_layer_size = l
w = tf.Variable(
    tf.random_normal((previous_layer_size, target_amount), 0, 0.05, dtype=config.get_dtype(), seed=config.get_seed()),
    name="weights_end")
b = tf.Variable(tf.random_normal((target_amount, 1), 0, 0.05, dtype=config.get_dtype(), seed=config.get_seed()),
                name="bias_end")
logit = tf.add(tf.matmul(w, x, True), b, "logit")
predictions = tf.transpose(tf.concat((tf.nn.sigmoid(logit[:1, :]), logit[1:, :]), axis=0), name="predictions")
self.layers.append((w, b, predictions))

#
# Create the evaluation of the DNN (loss/error and stuff)
#
error = tf.subtract(predictions, y, "error")
self.mae = tf.reduce_mean(tf.abs(error), axis=0, name="MAE")

#
# Create the optimization of the DNN (gradient and stuff)
#
regularization = tf.constant(0, dtype=config.get_dtype())
for w, _, _ in self.layers:
    if w is not None:
        regularization += tf.nn.l2_loss(w)
regularization *= config.get_lambda()
loss = tf.losses.sigmoid_cross_entropy(y[:, 0], logit[0, :]) * 500 + tf.reduce_sum(self.mae[1:]) + regularization

# grads_and_vars is a list of tuples (gradient, variable)
parameters = [w for w, _, _ in self.layers if w is not None] + [b for _, b, _ in self.layers if b is not None]
grads_and_vars = config.get_optimizer().compute_gradients(loss, var_list=parameters)
capped_grads_and_vars = [(tf.clip_by_value(gv[0], -5., 5.), gv[1]) for gv in grads_and_vars]
# apply the capped gradients.
self.minimizer = config.get_optimizer().apply_gradients(capped_grads_and_vars)

怎么了

有时（大约 200000 次呼叫中的 1 次）sess.run没有返回 - 它被卡住了 - 就像......冻结了。我的 python 程序当然不会自行退出。最奇怪的是：它突然占用了大量的 RAM（比如 10 GB），除非我重新启动计算机，否则我再也无法释放了。

看看这个，这很荒谬：

# ps --sort -%mem aux
用户 PID %CPU %MEM VSZ RSS TTY STAT 开始时间命令
我 13671 37.1 11.2 19321984 2764652 pts/2 Sl+ 13:02 186:10 python3 ./main.py -sp 32ki auto
我 7204 5.1 4.0 6869916 1005076 ? Sl 12:07 28:36 /snap/pycharm-professional/109/jre64/bin/java -classp
[...]

＃ 自由的
              可用的免费共享缓冲区/缓存总数
电话：24636156 16067024 6846816 313488 1722316 7896620
交换：8368124 442624 7925500

# 杀死 -15 13671

# ps --sort -%mem aux
用户 PID %CPU %MEM VSZ RSS TTY STAT 开始时间命令
我 7204 5.1 4.0 6869916 1005076 ? Sl 12:07 28:39 /snap/pycharm-professional/109/jre64/bin/java -classpath /snap/pycharm-professional/109/lib/bootstrap.jar:/snap/pycharm-professional/109/lib/extensions .jar:/snap/pycharm-pro
[...]

＃ 自由的
              可用的免费共享缓冲区/缓存总数
手机：24636156 16034456 6849972 336212 1751728 7906252
交换：8368124 437760 7930364

/var/log/syslog关于这个时间范围的sais：

1 月 9 日 15:05:38 MPC-6 内核：[10906.994646] BUG：无法在 ffffdd4803208b60 处理内核分页请求
1 月 9 日 15:05:38 MPC-6 内核：[10906.994656] IP：kfree+0x53/0x180
1 月 9 日 15:05:38 MPC-6 内核：[10906.994657] PGD 0 P4D 0
1 月 9 日 15:05:38 MPC-6 内核：[10906.994661] 糟糕：0000 [#1] SMP NOPTI
Jan 9 15:05:38 MPC-6 kernel: [10906.994663] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc aufs ip6table_filter ip6_tables iptable_filter pci_stub vboxpci(OE) vboxnetadp(OE ) vboxnetflt(OE) vboxdrv(OE) cmac rfcomm bnep binfmt_misc edac_mce_amd kvm_amd kvm irqbypass btusb btrtl crct10dif_pclmul snd_usb_audio btbcm crc32_pclmul btintel ghash_clmulni_intel snd_usbmidi_lib pcbc snd_seq_midi bluetooth snd_seq_midi_event snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi joydev aesni_intel aes_x86_64 input_leds crypto_simd snd_hda_intel glue_helper cryptd ecdh_generic snd_rawmidi snd_hda_codec snd_hda_core snd_seq snd_hwdep snd_pcm
Jan 9 15:05:38 MPC-6 kernel: [10906.994697] snd_seq_device k10temp fam15h_power snd_timer shpchp snd soundcore mac_hid sch_fq_codel cuse lm92 nct6775 hwmon_vid parport_pc ppdev lp parport ip_tables x_tables autofs4 uas usb_storage amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched (OE) hid_generic mxm_wmi amdkcl(OE) i2c_algo_bit amd_iommu_v2 drm_kms_helper pata_acpi syscopyarea sysfillrect r8169 sysimgblt fb_sys_fops mii usbhid nvme i2c_piix4 drm hid ahci pata_atiixp libahci nvme_core wmi
1 月 9 日 15:05:38 MPC-6 内核：[10906.994721] CPU：7 PID：23629 通信：kworker/u16:2 污染：G OE 4.15.0-43-generic #46-Ubuntu
1 月 9 日 15:05:38 MPC-6 内核：[10906.994723] 硬件名称：由 OEM 填写 由 OEM/970A-G/3.1 填写，BIOS P1.20 01/12/2016
1 月 9 日 15:05:38 MPC-6 内核：[10906.994812] 工作队列：kfd_restore_wq restore_process_worker [amdgpu]
1 月 9 日 15:05:38 MPC-6 内核：[10906.994815] RIP：0010:kfree+0x53/0x180
1 月 9 日 15:05:38 MPC-6 内核：[10906.994816] RSP：0018：ffffb837083c3a70 EFLAGS：00010282
1 月 9 日 15:05:38 MPC-6 内核：[10906.994818] RAX：ffff89dbe11420d0 RBX：ffffb8370822d000 RCX：0000000000008600
1 月 9 日 15:05:38 MPC-6 内核：[10906.994819] RDX：00000000ffffffe4 RSI：ffff89dc56a78fd0 RDI：0000762940000000
1 月 9 日 15:05:38 MPC-6 内核：[10906.994821] RBP：ffffb837083c3a88 R08：000000000000ec00 R09：ffff89d928a2ec70
1 月 9 日 15:05:38 MPC-6 内核：[10906.994822] R10：ffffdd4803208b40 R11：0000000000000d00 R12：ffffb8370822cf60
1 月 9 日 15:05:38 MPC-6 内核：[10906.994823] R13：ffffffffc03cece4 R14：0000000000000083 R15：0000000000000200
1 月 9 日 15:05:38 MPC-6 内核：[10906.994825] FS：0000000000000000(0000) GS:ffff89dc7edc0000(0000) knlGS:0000000000000000
1 月 9 日 15:05:38 MPC-6 内核：[10906.994826] CS：0010 DS：0000 ES：0000 CR0：0000000080050033
1 月 9 日 15:05:38 MPC-6 内核：[10906.994828] CR2：ffffdd4803208b60 CR3：00000004ec276000 CR4：00000000000406e0
1 月 9 日 15:05:38 MPC-6 内核：[10906.994829] 调用跟踪：
1 月 9 日 15:05:38 MPC-6 内核：[10906.994875] amdgpu_vram_mgr_new+0x314/0x370 [amdgpu]
1 月 9 日 15:05:38 MPC-6 内核：[10906.994882] amdttm_bo_mem_space+0x2ea/0x470 [amdttm]
1 月 9 日 15:05:38 MPC-6 内核：[10906.994887] amdttm_bo_validate+0xbf/0x140 [amdttm]
1 月 9 日 15:05:38 MPC-6 内核：[10906.994931]？amdgpu_vm_bo_update+0x3fc/0x7e0 [amdgpu]
1 月 9 日 15:05:38 MPC-6 内核：[10906.994982] amdgpu_amdkfd_bo_validate+0x82/0x160 [amdgpu]
1 月 9 日 15:05:38 MPC-6 内核：[10906.995034] amdgpu_amdkfd_gpuvm_restore_process_bos+0x273/0x4c0 [amdgpu]
1 月 9 日 15:05:38 MPC-6 内核：[10906.995038]？irq_work_queue+0x99/0xa0
1 月 9 日 15:05:38 MPC-6 内核：[10906.995040]？控制台解锁+0x2e5/0x4e0
1 月 9 日 15:05:38 MPC-6 内核：[10906.995090] restore_process_worker+0x52/0x1f0 [amdgpu]
1 月 9 日 15:05:38 MPC-6 内核：[10906.995092] process_one_work+0x1de/0x410
1 月 9 日 15:05:38 MPC-6 内核：[10906.995094] worker_thread+0x32/0x410
1 月 9 日 15:05:38 MPC-6 内核：[10906.995097] kthread+0x121/0x140
1 月 9 日 15:05:38 MPC-6 内核：[10906.995098]？process_one_work+0x410/0x410
1 月 9 日 15:05:38 MPC-6 内核：[10906.995100]？kthread_create_worker_on_cpu+0x70/0x70
1 月 9 日 15:05:38 MPC-6 内核：[10906.995103]？do_syscall_64+0x73/0x130
1 月 9 日 15:05:38 MPC-6 内核：[10906.995105]？系统退出+0x17/0x20
1 月 9 日 15:05:38 MPC-6 内核：[10906.995108] ret_from_fork+0x22/0x40
1 月 9 日 15:05:38 MPC-6 内核：[10906.995109] 代码：00 80 49 01 da 0f 82 39 01 00 00 48 c7 c7 00 00 00 80 48 2b 3d 37 46 20 01 49 01 fa 49 c1 ea 0c 9 c1 e2 06 4c 03 15 15 46 20 01 8b 42 20 48 8d 50 ff a8 01 4c 0f 45 d2 49 8b 52 20 48 8d 42
1 月 9 日 15:05:38 MPC-6 内核：[10906.995133] RIP：kfree+0x53/0x180 RSP：ffffb837083c3a70
1 月 9 日 15:05:38 MPC-6 内核：[10906.995134] CR2：ffffdd4803208b60
1 月 9 日 15:05:38 MPC-6 内核：[10906.995136] ---[ 结束跟踪 49a520b043b99039 ]---

我的规格

操作系统是linux
GPU 是AMD Radeon RX Vega 64
CPU是AMD FX-8320
RAM 是1333 DDR-3 的 2 倍 4GB 和 2 倍 8GB
主板是华擎 970A-G 3.1
Tensorflow 是我通过pip3 install tensorflow-rocm（版本 1.12.0）安装时得到的

在我 100% 确定这是 TF 的错误而不是我做错了什么之前，我不想在GitHub 上报告错误。

python - TensorFlow、GPU 或代码中的错误？'无法处理内核分页请求'

我所做的

怎么了

我的规格

0 回答 0

Related

Reference