2

有时,我们的一个 Linux MySQL 数据库服务器会挂起一段时间,从而产生一长串活动查询。10-20 分钟后,服务恢复正常。

这发生在 kern.log 中:

May 14 13:58:05 edu02 kernel: [2375521.716786] INFO: task jbd2/md2-8:1419 blocked for more than 120 seconds.
May 14 13:58:05 edu02 kernel: [2375521.716870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 13:58:05 edu02 kernel: [2375521.716953] jbd2/md2-8    D ffff88043b5a23f0     0  1419      2 0x00000000
May 14 13:58:05 edu02 kernel: [2375521.716958]  ffff88043b5a23f0 0000000000000046 00000000000000ff ffff88043b4e69c0
May 14 13:58:05 edu02 kernel: [2375521.716963]  00000000ffffffff ffffffff8118bac3 000000000000f9e0 ffff88043d099fd8
May 14 13:58:05 edu02 kernel: [2375521.716967]  00000000000157c0 00000000000157c0 ffff88043a411cc0 ffff88043a411fb8
May 14 13:58:05 edu02 kernel: [2375521.716971] Call Trace:
May 14 13:58:05 edu02 kernel: [2375521.716981]  [<ffffffff8118bac3>] ? generic_make_request+0x299/0x2f9
May 14 13:58:05 edu02 kernel: [2375521.716988]  [<ffffffff810168ec>] ? read_tsc+0xa/0x20
May 14 13:58:05 edu02 kernel: [2375521.716994]  [<ffffffff810757f6>] ? timekeeping_get_ns+0xe/0x2e
May 14 13:58:05 edu02 kernel: [2375521.716999]  [<ffffffff8111a5f2>] ? sync_buffer+0x0/0x40
May 14 13:58:05 edu02 kernel: [2375521.717004]  [<ffffffff8130b4a2>] ? io_schedule+0x73/0xb7
May 14 13:58:05 edu02 kernel: [2375521.717007]  [<ffffffff8111a62d>] ? sync_buffer+0x3b/0x40
May 14 13:58:05 edu02 kernel: [2375521.717010]  [<ffffffff8130b9af>] ? __wait_on_bit+0x41/0x70
May 14 13:58:05 edu02 kernel: [2375521.717014]  [<ffffffff8111a5f2>] ? sync_buffer+0x0/0x40
May 14 13:58:05 edu02 kernel: [2375521.717017]  [<ffffffff8130ba49>] ? out_of_line_wait_on_bit+0x6b/0x77
May 14 13:58:05 edu02 kernel: [2375521.717021]  [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 13:58:05 edu02 kernel: [2375521.717035]  [<ffffffffa0138a22>] ? jbd2_journal_commit_transaction+0xbe5/0x115b [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717043]  [<ffffffffa013e8c1>] ? kjournald2+0xbe/0x206 [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717046]  [<ffffffff810661e2>] ? autoremove_wake_function+0x0/0x2e
May 14 13:58:05 edu02 kernel: [2375521.717052]  [<ffffffffa013e803>] ? kjournald2+0x0/0x206 [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717055]  [<ffffffff81065f15>] ? kthread+0x79/0x81
May 14 13:58:05 edu02 kernel: [2375521.717058]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
May 14 13:58:05 edu02 kernel: [2375521.717062]  [<ffffffff81065e9c>] ? kthread+0x0/0x81
May 14 13:58:05 edu02 kernel: [2375521.717064]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
May 14 13:58:05 edu02 kernel: [2375521.717080] INFO: task mysqld:5879 blocked for more than 120 seconds.
May 14 13:58:05 edu02 kernel: [2375521.717132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 13:58:05 edu02 kernel: [2375521.717215] mysqld        D ffff88000facfb30     0  5879   2279 0x00000000
May 14 13:58:05 edu02 kernel: [2375521.717219]  ffff88043e46f300 0000000000000082 ffff88043c703d08 ffff88043c703d04
May 14 13:58:05 edu02 kernel: [2375521.717223]  0000000000000008 ffffffffa017942a 000000000000f9e0 ffff88043c703fd8
May 14 13:58:05 edu02 kernel: [2375521.717226]  00000000000157c0 00000000000157c0 ffff88043ad09cc0 ffff88043ad09fb8
May 14 13:58:05 edu02 kernel: [2375521.717230] Call Trace:
May 14 13:58:05 edu02 kernel: [2375521.717247]  [<ffffffffa017942a>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717256]  [<ffffffffa0179350>] ? __ext4_journal_stop+0x63/0x69 [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717264]  [<ffffffffa01634de>] ? ext4_da_writepages+0x4e2/0x65c [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717269]  [<ffffffff810be235>] ? sync_page+0x0/0x46
May 14 13:58:05 edu02 kernel: [2375521.717272]  [<ffffffff8130b4a2>] ? io_schedule+0x73/0xb7
May 14 13:58:05 edu02 kernel: [2375521.717275]  [<ffffffff810be276>] ? sync_page+0x41/0x46
May 14 13:58:05 edu02 kernel: [2375521.717278]  [<ffffffff8130b9af>] ? __wait_on_bit+0x41/0x70
May 14 13:58:05 edu02 kernel: [2375521.717282]  [<ffffffff810be3fa>] ? wait_on_page_bit+0x6b/0x71
May 14 13:58:05 edu02 kernel: [2375521.717285]  [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 13:58:05 edu02 kernel: [2375521.717289]  [<ffffffff810c6222>] ? pagevec_lookup_tag+0x1a/0x21
May 14 13:58:05 edu02 kernel: [2375521.717293]  [<ffffffff810bebda>] ? wait_on_page_writeback_range+0x69/0x11b
May 14 13:58:05 edu02 kernel: [2375521.717297]  [<ffffffff810bed00>] ? __filemap_fdatawrite_range+0x4b/0x54
May 14 13:58:05 edu02 kernel: [2375521.717301]  [<ffffffff810bed4c>] ? filemap_write_and_wait_range+0x43/0x52
May 14 13:58:05 edu02 kernel: [2375521.717305]  [<ffffffff81118106>] ? vfs_fsync_range+0x55/0x9e
May 14 13:58:05 edu02 kernel: [2375521.717308]  [<ffffffff811181ce>] ? do_fsync+0x28/0x39
May 14 13:58:05 edu02 kernel: [2375521.717311]  [<ffffffff811181fd>] ? sys_fsync+0xb/0x10
May 14 13:58:05 edu02 kernel: [2375521.717315]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
May 14 14:04:05 edu02 kernel: [2375881.716569] INFO: task mysqld:20799 blocked for more than 120 seconds.
May 14 14:04:05 edu02 kernel: [2375881.716626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 14:04:05 edu02 kernel: [2375881.716710] mysqld        D 0000000000000000     0 20799   2279 0x00000000
May 14 14:04:05 edu02 kernel: [2375881.716716]  ffff88043e46d640 0000000000000082 0000000000000000 ffff8800839be930
May 14 14:04:05 edu02 kernel: [2375881.716720]  ffff88043b4e4680 ffff88043b4e4680 000000000000f9e0 ffff88009592ffd8
May 14 14:04:05 edu02 kernel: [2375881.716724]  00000000000157c0 00000000000157c0 ffff880014059cc0 ffff880014059fb8
May 14 14:04:05 edu02 kernel: [2375881.716729] Call Trace:
May 14 14:04:05 edu02 kernel: [2375881.716740]  [<ffffffff81066103>] ? bit_waitqueue+0x10/0xa0
May 14 14:04:05 edu02 kernel: [2375881.716766]  [<ffffffffa01371ba>] ? do_get_write_access+0x22c/0x452 [jbd2]
May 14 14:04:05 edu02 kernel: [2375881.716771]  [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 14:04:05 edu02 kernel: [2375881.716777]  [<ffffffffa0137402>] ? jbd2_journal_get_write_access+0x22/0x33 [jbd2]
May 14 14:04:05 edu02 kernel: [2375881.716796]  [<ffffffffa017fee6>] ? __ext4_journal_get_write_access+0x4e/0x56 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716804]  [<ffffffffa015fe4c>] ? ext4_reserve_inode_write+0x37/0x73 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716811]  [<ffffffffa015fec3>] ? ext4_mark_inode_dirty+0x3b/0x1c4 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716820]  [<ffffffffa017942a>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716826]  [<ffffffff8130b4c2>] ? io_schedule+0x93/0xb7
May 14 14:04:05 edu02 kernel: [2375881.716833]  [<ffffffffa016016e>] ? ext4_dirty_inode+0x30/0x46 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716840]  [<ffffffff811148f1>] ? __mark_inode_dirty+0x25/0x14a
May 14 14:04:05 edu02 kernel: [2375881.716844]  [<ffffffff8110b958>] ? touch_atime+0x10a/0x133
May 14 14:04:05 edu02 kernel: [2375881.716850]  [<ffffffff810bf8f0>] ? generic_file_aio_read+0x499/0x536
May 14 14:04:05 edu02 kernel: [2375881.716855]  [<ffffffff810f9c45>] ? do_sync_read+0xce/0x113
May 14 14:04:05 edu02 kernel: [2375881.716858]  [<ffffffff810661e2>] ? autoremove_wake_function+0x0/0x2e
May 14 14:04:05 edu02 kernel: [2375881.716862]  [<ffffffff81105a05>] ? user_path_at+0x52/0x79
May 14 14:04:05 edu02 kernel: [2375881.716866]  [<ffffffff810fa668>] ? vfs_read+0xa6/0xff
May 14 14:04:05 edu02 kernel: [2375881.716869]  [<ffffffff810fa77d>] ? sys_read+0x45/0x6e
May 14 14:04:05 edu02 kernel: [2375881.716873]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

堆栈跟踪使我认为它与文件系统/日志或磁盘相关,但我不能确定。当驱动器发生故障时,dmesg 中通常会出现其他错误,并且服务器中的两个驱动器的 SMART 状态都很好。

什么会导致这种挂起?

4

2 回答 2

0

嗯,这是一个警告,一个调试功能。您的内核启用了 CONFIG_DETECT_HUNG_TASK 功能。根据 CONFIG_DETECT_HUNG_TASK 描述(取自内核配置选项帮助部分 http://lxr.linux.no/linux+ */lib/Kconfig.debug#L264):

 "When a hung task is detected, the kernel will print the  current stack trace
 (which you should report), but the task will stay in uninterruptible state."

而且,这就是原因。您几乎没有进程在不间断状态下停留超过 120 秒。如果要清理它,只需执行以下操作:

          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
于 2013-05-15T15:45:48.303 回答
0

升级到更新的内核(Debian Wheezy)神奇地解​​决了这个问题。

于 2013-07-09T21:55:48.970 回答