/var/log/messages had a series of following messages before the system (RHEL 7) became unresponsive:
INFO: task [process]:[pid] blocked for more than 120 seconds
What does the Message mean
In kernel version 2.6.18-194 of Red Hat Enterprise Linux 5.5, the Detect Hung Task kernel thread (khungtaskd) was added, which provides the ability to detect tasks stuck in D-state ( Uninterruptible Sleep (UN) ) longer than a specified time period (120 seconds by default) and results in following type of message in system log file ( i.e /var/log/messages ).
: blocked for more than 120 seconds"
The khungtaskd kernel thread monitors the process states and checks if there are any processes stuck in uninterruptible state for the period set in “kernel.hung_task_timeout_secs” sysctl parameter (default is 120 seconds). If so it will log the message along with a call trace of the process that is blocked. The hung task watchdog will by default only report the first 10 instances of detected hung tasks and then will disable reporting. This is to prevent the system log from being flooded with messages. This means that even though no more blocked task messages are being reported there may still be processes becoming stuck for long periods.
These messages typically mean that the system is experiencing disk or memory congestion and processes are being starved of available resources. These messages serve as a warning that something may not be operating optimally. They do not necessarily indicate a serious problem and any blocked processes should eventually proceed when the system recovers.
To troubleshoot further, generate a vmcore and engage Red Hat Support for analysis of the vmcore. If the hung task is a third-party application, please engage the application vendor in the issue meanwhile. You can look for the below messages in the /var/log/messages file to confirm:
INFO: task syslogd:2643 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syslogd D ffff81000237eaa0 0 2643 1 2646 2634 (NOTLB) ffff8101352c3d88 0000000000000086 ffff8101352c3d98 ffffffff80063ff8 0000000000001000 0000000000000009 ffff81013d2c57e0 ffff810102ac1820 0000340b30708992 0000000000000571 ffff81013d2c59c8 000000010000089f Call Trace: [
] thread_return+0x62/0xfe [ ] :jbd:log_wait_commit+0xa3/0xf5 [ ] autoremove_wake_function+0x0/0x2e [ ] :jbd:journal_stop+0x1cf/0x1ff [ ] __writeback_single_inode+0x1e9/0x328 [ ] do_readv_writev+0x26e/0x291 [ ] sync_inode+0x24/0x33 [ ] :ext3:ext3_sync_file+0xc9/0xdc [ ] do_fsync+0x52/0xa4 [ ] __do_fsync+0x23/0x36 [ ] tracesys+0xd5/0xe0
Capture the following data while the issue is occurring:
# uname -a > /tmp/uname.out # ifconfig > /tmp/ifcongfig.out # top -n 5 -b > /tmp/top.out # vmstat 1 50 > /tmp/vm.out # iostat -x 2 10 > /tmp/io.out # ps aux > /tmp/ps.out # ps auxH > /tmp/psh.out # sar -A > /tmp/sar.out # free > /tmp/free.out # lsof > /tmp/lsof.out
# tar -cjvf outputs.tar.bz2 /tmp/*.out # tar -cjvf message.tar.bz2 /var/log/message*
Check if it’s using remote filesystem such as NFS which can cause of delay when there’s high number of requests on NFS server:
$ grep MYAPP lsof | grep oracle MYAPP 7096 1062 cwd DIR 0,18 0 558301 /oracle/prd/fs_ne/inst/conc/log ...
$ grep nfs mount remotehost@:/ on /oracle/prd/fs_ne/inst type nfs (rw,nosuid)
Disabling the messages
If the hung task messages are known to be erroneous, they can be disabled. This is not advised, and it will not prevent any hangs. To disable the messages, use
# sysctl kernel.hung_task_timeout_secs=0
If it causes the crash and the causes are known to be erroneous, please disable it by putting the below in /etc/sysctl.conf. Once have the below value in the file, please run ‘sysctl -p’ to apply the change.
# vi /etc/sysctl.conf kernel.hung_task_panic = 0
# sysctl -p