linux-kernel - Workqueue hangs while processing blk_mq_run_work_fn

Question

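A workqueue hangs on my board (ARM Linux). At first the board could be reached over ssh. Later, the ssh connection was still established, but I never got a shell prompt. I captured some information with sysrq; part of the sysrq output looks like this: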

kworker/0:2H    R  running task        0 19783      2 0x00000028
Workqueue: kblockd blk_mq_run_work_fn
Call trace:
 __switch_to+0xf4/0x120
 __schedule+0x248/0x460
 preempt_schedule_common+0x24/0x4c
 preempt_schedule+0x28/0x30
 _raw_spin_unlock_irqrestore+0x30/0x4c
 __wake_up_common_lock+0x88/0xc4
 __wake_up+0x14/0x1c
 wake_up_bit+0x78/0xa0
 end_buffer_read_sync+0x44/0xa4
 end_bio_bh_io_sync+0x30/0x60
 bio_endio+0xdc/0x110
 blk_update_request+0xb8/0x250
 mtd_blktrans_work+0xdc/0x1a0
 mtd_queue_rq+0x50/0x84
 blk_mq_dispatch_rq_list+0xa8/0x43c
 blk_mq_do_dispatch_sched+0x78/0x110
 blk_mq_sched_dispatch_requests+0x118/0x190
 __blk_mq_run_hw_queue+0xc4/0x114
 blk_mq_run_work_fn+0x1c/0x24
 process_one_work+0x1c8/0x324
 worker_thread+0x68/0x3ac
 kthread+0x13c/0x150
 ret_from_fork+0x10/0x1c
ipc_Session2    D    0  8552   8441 0x00000000
Call trace:
 __switch_to+0xf4/0x120
 __schedule+0x248/0x460
 schedule+0x40/0xe0
 squashfs_cache_get+0x2f8/0x340
 squashfs_get_datablock+0x1c/0x24
 squashfs_readpage_block+0x34/0x90
 squashfs_readpage+0x240/0x27c
 read_pages.isra.0+0x118/0x180
 __do_page_cache_readahead+0x19c/0x1c0
 do_sync_mmap_readahead+0xcc/0x174
 filemap_fault+0x548/0x6e0
 __do_fault+0x38/0xfc
 do_fault+0xb4/0x1b0
 handle_pte_fault+0x68/0x19c
 __handle_mm_fault+0xcc/0x120
 handle_mm_fault+0x8c/0xd4
 do_page_fault+0x11c/0x3e0
 do_translation_fault+0xa4/0xb0
 do_mem_abort+0x3c/0xa0
 do_el0_ia_bp_hardening+0x3c/0xb0
 el0_ia+0x18/0x1c
ipc_Session3    D    0  8598   8441 0x00000000
Call trace:
 __switch_to+0xf4/0x120
 __schedule+0x248/0x460
 schedule+0x40/0xe0
 squashfs_cache_get+0x2f8/0x340
 squashfs_get_datablock+0x1c/0x24
 squashfs_readpage_block+0x34/0x90
 squashfs_readpage+0x240/0x27c
 read_pages.isra.0+0x118/0x180
 __do_page_cache_readahead+0x19c/0x1c0
 do_sync_mmap_readahead+0xcc/0x174
 filemap_fault+0x548/0x6e0
 __do_fault+0x38/0xfc
 do_fault+0xb4/0x1b0
 handle_pte_fault+0x68/0x19c
 __handle_mm_fault+0xcc/0x120
 handle_mm_fault+0x8c/0xd4
 do_page_fault+0x11c/0x3e0
 do_translation_fault+0xa4/0xb0
 do_mem_abort+0x3c/0xa0
 do_el0_ia_bp_hardening+0x3c/0xb0
 el0_ia+0x18/0x1c
...
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    pending: vmstat_shepherd
workqueue events_power_efficient: flags=0x80
  pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
    pending: phy_state_machine, neigh_periodic_work, do_cache_clean
workqueue mm_percpu_wq: flags=0x8
  pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    pending: vmstat_update
workqueue writeback: flags=0x4a
  pwq 4: cpus=0-1 flags=0x4 nice=0 active=2/256 refcnt=4
    in-flight: 8294:wb_workfn wb_workfn
workqueue kblockd: flags=0x18                                        
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256 refcnt=3      
    in-flight: 19783:blk_mq_run_work_fn
    pending: blk_mq_run_work_fn
workqueue mmc_complete: flags=0x18
  pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
    pending: mmc_blk_mq_complete_work
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=21394s workers=3 idle: 6724 1804  
pool 4: cpus=0-1 flags=0x4 nice=0 hung=0s workers=3 idle: 19890 12972

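As shown above, pool 1 has been hung for about 5.9 hours (21394 s). The stuck work item is probably blk_mq_run_work_fn (most likely) or mmc_blk_mq_complete_work. In addition, many threads and processes are in the D state, as shown here: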

/usr/bin# top
Mem: 487160K used, 12976K free, 1172K shrd, 9344K buff, 51200K cached
CPU:   0% usr  54% sys   0% nic   0% idle  45% io   0% irq   0% sirq
Load average: 90.99 90.18 88.74 5/226 30760
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
 8445  8441 root     D     458m  94%  50% /usr/bin/app
30760 29801 root     R     3300   1%   5% top
 8444  8441 root     D     9688   2%   0% /usr/bin/Daemon
  329     1 root     S     3724   1%   0% /sbin/logd -S 1024
  384     1 root     S     3440   1%   0% /usr/sbin/crond -f -c /etc/crontabs -
  205     1 root     S     3440   1%   0% /bin/ash --login
29592     1 root     D     3440   1%   0% -ash
 8939     1 root     D     3440   1%   0% -ash
10680     1 root     D     3440   1%   0% -ash
 9210     1 root     D     3440   1%   0% -ash
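
To double-check the numbers I am reading off these dumps, here is a minimal Python sketch that extracts the hung time per worker pool and lists the D-state tasks. It only parses the text pasted above. The function names and the 30-second threshold are my own choices for illustration, not anything defined by the kernel.

```python
import re

# "pool" lines copied from the sysrq dump above.
POOL_LINES = """\
pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=21394s workers=3 idle: 6724 1804
pool 4: cpus=0-1 flags=0x4 nice=0 hung=0s workers=3 idle: 19890 12972
"""

# Process rows copied from the top output above.
TOP_LINES = """\
 8445  8441 root     D     458m  94%  50% /usr/bin/app
30760 29801 root     R     3300   1%   5% top
 8444  8441 root     D     9688   2%   0% /usr/bin/Daemon
  329     1 root     S     3724   1%   0% /sbin/logd -S 1024
  384     1 root     S     3440   1%   0% /usr/sbin/crond -f -c /etc/crontabs -
  205     1 root     S     3440   1%   0% /bin/ash --login
29592     1 root     D     3440   1%   0% -ash
 8939     1 root     D     3440   1%   0% -ash
10680     1 root     D     3440   1%   0% -ash
 9210     1 root     D     3440   1%   0% -ash
"""

POOL_RE = re.compile(r"^pool (\d+):.*\bhung=(\d+)s\b")


def hung_pools(text, threshold_s=30):
    """Return {pool_id: hours_hung} for pools hung longer than threshold_s.

    threshold_s is an arbitrary cutoff chosen for this sketch.
    """
    result = {}
    for line in text.splitlines():
        m = POOL_RE.match(line.strip())
        if m and int(m.group(2)) > threshold_s:
            result[int(m.group(1))] = int(m.group(2)) / 3600
    return result


def d_state_tasks(text):
    """Return (pid, command) for rows whose STAT column starts with 'D'."""
    tasks = []
    for line in text.splitlines():
        # Columns: PID PPID USER STAT VSZ %VSZ %CPU COMMAND
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[0].isdigit() and fields[3].startswith("D"):
            tasks.append((int(fields[0]), fields[7]))
    return tasks


if __name__ == "__main__":
    for pool_id, hours in hung_pools(POOL_LINES).items():
        print(f"pool {pool_id}: hung for {hours:.1f} h")
    for pid, cmd in d_state_tasks(TOP_LINES):
        print(f"D state: pid {pid} ({cmd})")
```

On the data above this reports only pool 1 as hung (21394 / 3600 ≈ 5.9 hours) and six tasks in the D state.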

Can anyone tell me why this happens and how to deal with it? Thanks.
