有时,我们的一个 Linux MySQL 数据库服务器会挂起一段时间,从而产生一长串活动查询。10-20 分钟后,服务恢复正常。
这发生在 kern.log 中:
May 14 13:58:05 edu02 kernel: [2375521.716786] INFO: task jbd2/md2-8:1419 blocked for more than 120 seconds.
May 14 13:58:05 edu02 kernel: [2375521.716870] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 13:58:05 edu02 kernel: [2375521.716953] jbd2/md2-8 D ffff88043b5a23f0 0 1419 2 0x00000000
May 14 13:58:05 edu02 kernel: [2375521.716958] ffff88043b5a23f0 0000000000000046 00000000000000ff ffff88043b4e69c0
May 14 13:58:05 edu02 kernel: [2375521.716963] 00000000ffffffff ffffffff8118bac3 000000000000f9e0 ffff88043d099fd8
May 14 13:58:05 edu02 kernel: [2375521.716967] 00000000000157c0 00000000000157c0 ffff88043a411cc0 ffff88043a411fb8
May 14 13:58:05 edu02 kernel: [2375521.716971] Call Trace:
May 14 13:58:05 edu02 kernel: [2375521.716981] [<ffffffff8118bac3>] ? generic_make_request+0x299/0x2f9
May 14 13:58:05 edu02 kernel: [2375521.716988] [<ffffffff810168ec>] ? read_tsc+0xa/0x20
May 14 13:58:05 edu02 kernel: [2375521.716994] [<ffffffff810757f6>] ? timekeeping_get_ns+0xe/0x2e
May 14 13:58:05 edu02 kernel: [2375521.716999] [<ffffffff8111a5f2>] ? sync_buffer+0x0/0x40
May 14 13:58:05 edu02 kernel: [2375521.717004] [<ffffffff8130b4a2>] ? io_schedule+0x73/0xb7
May 14 13:58:05 edu02 kernel: [2375521.717007] [<ffffffff8111a62d>] ? sync_buffer+0x3b/0x40
May 14 13:58:05 edu02 kernel: [2375521.717010] [<ffffffff8130b9af>] ? __wait_on_bit+0x41/0x70
May 14 13:58:05 edu02 kernel: [2375521.717014] [<ffffffff8111a5f2>] ? sync_buffer+0x0/0x40
May 14 13:58:05 edu02 kernel: [2375521.717017] [<ffffffff8130ba49>] ? out_of_line_wait_on_bit+0x6b/0x77
May 14 13:58:05 edu02 kernel: [2375521.717021] [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 13:58:05 edu02 kernel: [2375521.717035] [<ffffffffa0138a22>] ? jbd2_journal_commit_transaction+0xbe5/0x115b [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717043] [<ffffffffa013e8c1>] ? kjournald2+0xbe/0x206 [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717046] [<ffffffff810661e2>] ? autoremove_wake_function+0x0/0x2e
May 14 13:58:05 edu02 kernel: [2375521.717052] [<ffffffffa013e803>] ? kjournald2+0x0/0x206 [jbd2]
May 14 13:58:05 edu02 kernel: [2375521.717055] [<ffffffff81065f15>] ? kthread+0x79/0x81
May 14 13:58:05 edu02 kernel: [2375521.717058] [<ffffffff81011baa>] ? child_rip+0xa/0x20
May 14 13:58:05 edu02 kernel: [2375521.717062] [<ffffffff81065e9c>] ? kthread+0x0/0x81
May 14 13:58:05 edu02 kernel: [2375521.717064] [<ffffffff81011ba0>] ? child_rip+0x0/0x20
May 14 13:58:05 edu02 kernel: [2375521.717080] INFO: task mysqld:5879 blocked for more than 120 seconds.
May 14 13:58:05 edu02 kernel: [2375521.717132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 13:58:05 edu02 kernel: [2375521.717215] mysqld D ffff88000facfb30 0 5879 2279 0x00000000
May 14 13:58:05 edu02 kernel: [2375521.717219] ffff88043e46f300 0000000000000082 ffff88043c703d08 ffff88043c703d04
May 14 13:58:05 edu02 kernel: [2375521.717223] 0000000000000008 ffffffffa017942a 000000000000f9e0 ffff88043c703fd8
May 14 13:58:05 edu02 kernel: [2375521.717226] 00000000000157c0 00000000000157c0 ffff88043ad09cc0 ffff88043ad09fb8
May 14 13:58:05 edu02 kernel: [2375521.717230] Call Trace:
May 14 13:58:05 edu02 kernel: [2375521.717247] [<ffffffffa017942a>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717256] [<ffffffffa0179350>] ? __ext4_journal_stop+0x63/0x69 [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717264] [<ffffffffa01634de>] ? ext4_da_writepages+0x4e2/0x65c [ext4]
May 14 13:58:05 edu02 kernel: [2375521.717269] [<ffffffff810be235>] ? sync_page+0x0/0x46
May 14 13:58:05 edu02 kernel: [2375521.717272] [<ffffffff8130b4a2>] ? io_schedule+0x73/0xb7
May 14 13:58:05 edu02 kernel: [2375521.717275] [<ffffffff810be276>] ? sync_page+0x41/0x46
May 14 13:58:05 edu02 kernel: [2375521.717278] [<ffffffff8130b9af>] ? __wait_on_bit+0x41/0x70
May 14 13:58:05 edu02 kernel: [2375521.717282] [<ffffffff810be3fa>] ? wait_on_page_bit+0x6b/0x71
May 14 13:58:05 edu02 kernel: [2375521.717285] [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 13:58:05 edu02 kernel: [2375521.717289] [<ffffffff810c6222>] ? pagevec_lookup_tag+0x1a/0x21
May 14 13:58:05 edu02 kernel: [2375521.717293] [<ffffffff810bebda>] ? wait_on_page_writeback_range+0x69/0x11b
May 14 13:58:05 edu02 kernel: [2375521.717297] [<ffffffff810bed00>] ? __filemap_fdatawrite_range+0x4b/0x54
May 14 13:58:05 edu02 kernel: [2375521.717301] [<ffffffff810bed4c>] ? filemap_write_and_wait_range+0x43/0x52
May 14 13:58:05 edu02 kernel: [2375521.717305] [<ffffffff81118106>] ? vfs_fsync_range+0x55/0x9e
May 14 13:58:05 edu02 kernel: [2375521.717308] [<ffffffff811181ce>] ? do_fsync+0x28/0x39
May 14 13:58:05 edu02 kernel: [2375521.717311] [<ffffffff811181fd>] ? sys_fsync+0xb/0x10
May 14 13:58:05 edu02 kernel: [2375521.717315] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
May 14 14:04:05 edu02 kernel: [2375881.716569] INFO: task mysqld:20799 blocked for more than 120 seconds.
May 14 14:04:05 edu02 kernel: [2375881.716626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 14 14:04:05 edu02 kernel: [2375881.716710] mysqld D 0000000000000000 0 20799 2279 0x00000000
May 14 14:04:05 edu02 kernel: [2375881.716716] ffff88043e46d640 0000000000000082 0000000000000000 ffff8800839be930
May 14 14:04:05 edu02 kernel: [2375881.716720] ffff88043b4e4680 ffff88043b4e4680 000000000000f9e0 ffff88009592ffd8
May 14 14:04:05 edu02 kernel: [2375881.716724] 00000000000157c0 00000000000157c0 ffff880014059cc0 ffff880014059fb8
May 14 14:04:05 edu02 kernel: [2375881.716729] Call Trace:
May 14 14:04:05 edu02 kernel: [2375881.716740] [<ffffffff81066103>] ? bit_waitqueue+0x10/0xa0
May 14 14:04:05 edu02 kernel: [2375881.716766] [<ffffffffa01371ba>] ? do_get_write_access+0x22c/0x452 [jbd2]
May 14 14:04:05 edu02 kernel: [2375881.716771] [<ffffffff81066210>] ? wake_bit_function+0x0/0x23
May 14 14:04:05 edu02 kernel: [2375881.716777] [<ffffffffa0137402>] ? jbd2_journal_get_write_access+0x22/0x33 [jbd2]
May 14 14:04:05 edu02 kernel: [2375881.716796] [<ffffffffa017fee6>] ? __ext4_journal_get_write_access+0x4e/0x56 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716804] [<ffffffffa015fe4c>] ? ext4_reserve_inode_write+0x37/0x73 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716811] [<ffffffffa015fec3>] ? ext4_mark_inode_dirty+0x3b/0x1c4 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716820] [<ffffffffa017942a>] ? ext4_journal_start_sb+0xd4/0x10e [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716826] [<ffffffff8130b4c2>] ? io_schedule+0x93/0xb7
May 14 14:04:05 edu02 kernel: [2375881.716833] [<ffffffffa016016e>] ? ext4_dirty_inode+0x30/0x46 [ext4]
May 14 14:04:05 edu02 kernel: [2375881.716840] [<ffffffff811148f1>] ? __mark_inode_dirty+0x25/0x14a
May 14 14:04:05 edu02 kernel: [2375881.716844] [<ffffffff8110b958>] ? touch_atime+0x10a/0x133
May 14 14:04:05 edu02 kernel: [2375881.716850] [<ffffffff810bf8f0>] ? generic_file_aio_read+0x499/0x536
May 14 14:04:05 edu02 kernel: [2375881.716855] [<ffffffff810f9c45>] ? do_sync_read+0xce/0x113
May 14 14:04:05 edu02 kernel: [2375881.716858] [<ffffffff810661e2>] ? autoremove_wake_function+0x0/0x2e
May 14 14:04:05 edu02 kernel: [2375881.716862] [<ffffffff81105a05>] ? user_path_at+0x52/0x79
May 14 14:04:05 edu02 kernel: [2375881.716866] [<ffffffff810fa668>] ? vfs_read+0xa6/0xff
May 14 14:04:05 edu02 kernel: [2375881.716869] [<ffffffff810fa77d>] ? sys_read+0x45/0x6e
May 14 14:04:05 edu02 kernel: [2375881.716873] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
堆栈跟踪使我认为它与文件系统/日志或磁盘相关,但我不能确定。当驱动器发生故障时,dmesg 中通常会出现其他错误,并且服务器中的两个驱动器的 SMART 状态都很好。
什么会导致这种挂起?