Ubuntu OS/kernel logging - system freeze/hang

puneet336

We are encountering an issue where a server becomes completely unresponsive, with no clear indication of what caused the hang. For example:

/var/log/syslog:

Jun 12 15:45:02 k-w16 weka-agent[12000]: DEBUG: requested_actions.d:189 <17422> [REQ_ACTION] Monitoring requested action needed info for container client: has_lease=true, action=NONE, state=ACTIVE, has_requested_action_failure=false, is_inactive=false
Jun 12 20:38:14 k-w16 kernel: Linux version 5.15.0-1063-nvidia (buildd@lcy02-amd64-007) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Fri Aug 9 17:13:45 UTC 2024 (Ubuntu 5.15.0-1063.64-nvidia 5.15.160)
Jun 12 20:38:14 k-w16 kernel: Command line: BOOT_IMAGE=images/d-os-6.3-h10-image/vmlinuz initrd=images/d-os-6.3-h10-image/initrd cgroup_enable=memory swapaccount=1 nouveau.blacklist=yes nouveau.modeset=0 iommu=pt namespace.unpriv_enable=1 user_namespace.enable=1 systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller intel_cpufreq.enable=1 intel_pstate=active crashkernel=8G console=tty0 ip=10.67.32.164:10.67.32.100:10.67.33.254:255.255.254.0 BOOTIF=01-58-a2-e1-76-03-98
Jun 12 20:38:14 k-w16 kernel: KERNEL supported cpus:


/var/log/kern.log:

Jun 12 15:41:54 k-w16 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali9d766d49d92: link becomes ready
Jun 12 15:42:07 k-w16 kernel: wekafsio: [__wmp_add_recovery_inode:134]R[5:0xf8e07cea16470144] VC(i=677,d=c0:8,a=d) o=1/1w N[1249120_35832653_10112184.wav] sz=0x8802c m=100644 st[] |wekafs_file_open:366] recovery_inodes_count=4 le(0)
Jun 12 20:38:14 k-w16 kernel: Linux version 5.15.0-1063-nvidia (buildd@lcy02-amd64-007) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Fri Aug 9 17:13:45 UTC 2024 (Ubuntu 5.15.0-1063.64-nvidia 5.15.160)
Jun 12 20:38:14 k-w16 kernel: Command line: BOOT_IMAGE=images/d-os-6.3-h10-image/vmlinuz initrd=images/d-os-6.3-h10-image/initrd cgroup_enable=memory swapaccount=1 nouveau.blacklist=yes nouveau.modeset=0 iommu=pt namespace.unpriv_enable=1 user_namespace.enable=1 systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller intel_cpufreq.enable=1 intel_pstate=active crashkernel=8G console=tty0 ip=10.167.32.164:10.167.32.100:10.167.33.254:255.255.254.0 BOOTIF=03-88-a2-e1-76-03-98

As shown above, there are no log entries between 15:45:02 and the subsequent boot at 20:38:14. During this period the node was completely unresponsive: we could not SSH into the server, and the console showed nothing useful.
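
To put a number on the silent window, here is a minimal sketch that computes the gap between the last syslog entry and the first boot message. It assumes GNU date (as shipped with Ubuntu) and that both timestamps fall in the same year; the function name gap_seconds is just an illustrative helper.

```shell
#!/bin/sh
# Seconds between two syslog-style timestamps.
# Assumes GNU date; the year is not in the log, so the current year is used for both.
gap_seconds() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $((end - start))
}

# Last syslog entry before the hang and first kernel message of the next boot.
gap=$(gap_seconds "Jun 12 15:45:02" "Jun 12 20:38:14")
printf 'silent for %d s (%02d:%02d:%02d)\n' \
    "$gap" $((gap / 3600)) $((gap % 3600 / 60)) $((gap % 60))
```

That is a window of roughly 4 hours 53 minutes with nothing logged, far longer than the 120-second hung-task timeout.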

After the reboot, we were also unable to find any relevant messages in the system logs that would help identify the cause of the hang.

We have the following settings:

root@k-w16:~# cat /proc/sys/kernel/hung_task_panic
1
root@k-w16:~# cat /proc/sys/kernel/sysrq
0
kernel.hung_task_timeout_secs = 120
Given that kernel.hung_task_panic=1, we expected to see a hung-task stack trace or kernel panic information if tasks were blocked for longer than the configured timeout. However, no such information was captured in the logs.
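
For comparing nodes, a small sketch like the one below could dump the hang- and panic-related settings in one go. The three knobs shown above come from our configuration; the rest of the list (softlockup_panic, hardlockup_panic, nmi_watchdog, watchdog_thresh, panic, panic_on_oops, unknown_nmi_panic) is only my assumption about which other entries under /proc/sys/kernel might be relevant, and some may not exist on a given kernel build. The SYSCTL_DIR variable and the show_knob helper are illustrative names.

```shell
#!/bin/sh
# Print the current value of each hang/panic-related kernel setting.
# SYSCTL_DIR can be overridden, which is only useful for trying the script out.
SYSCTL_DIR="${SYSCTL_DIR:-/proc/sys/kernel}"

show_knob() {
    # $1 = file name under $SYSCTL_DIR
    if [ -r "$SYSCTL_DIR/$1" ]; then
        printf '%s = %s\n' "kernel.$1" "$(cat "$SYSCTL_DIR/$1")"
    else
        # Some knobs depend on kernel config options and may be absent.
        printf '%s = (not available)\n' "kernel.$1"
    fi
}

for knob in hung_task_panic hung_task_timeout_secs sysrq \
            softlockup_panic hardlockup_panic nmi_watchdog \
            watchdog_thresh panic panic_on_oops unknown_nmi_panic; do
    show_knob "$knob"
done
```

Running this on an affected and an unaffected server and diffing the output would at least rule out a configuration difference between them.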

I am primarily trying to understand whether there are additional boot/kernel settings (sysctl) or debugging mechanisms (SysRq, NMI watchdog, etc.) that should be enabled to capture diagnostic information during such hangs, so that I get some hint of where to look for more details (GPU driver, network driver, etc.), much like an exception handler printing a stack trace before crashing.

From the system console, Ctrl+Alt+Delete makes the system power down, but Alt+F2 is unresponsive.
This issue occurred on 3 out of 8 servers today within a span of 1 hour.
 

