Random [HW] crashes on Debian server. At wits' end

HatAM

New Member
Joined
Jan 9, 2025
I am at my wits' end.
We have a small compute server running Debian stable. I think we were on the .37 kernel build last time I checked.
AMD X9900, 192GB RAM, Asus TUF B650M motherboard.

We get persistent crashes roughly once a week to once a fortnight.
Symptoms:
  • System is powered on (lights, fans, etc.) but unresponsive (won't wake up to keyboard or mouse, doesn't output anything over DP).
  • Seems to happen with absolutely no warning
    • journalctl never logs any problems whatsoever. Usually the last log is some standard UFW block message.
    • System is generally not under strain when it happens (I confirmed this with sar logging: CPU is generally idling along, RAM usage <20%)
  • There doesn't seem to be a pattern as to the time. Could be overnight, could be the middle of the day. Someone might be working on it, or it may be running headless.
Things I have tried:
  • Updated the motherboard firmware
  • Changed the GRUB config to not allow USB low-power modes
  • Removed the USB-ethernet dongle and switched to a PCIe expansion card
  • Swapped the motherboard for a later model (B850M)
  • Changed PSU (like-for-like replacement, 850W, should be ample headroom for a GPU-less build)
  • Installed 950W UPS
  • Ran memtest86+ for 60+ hrs. No errors.
My feeling is that this must be hardware, given that (1) there are no logs whatsoever, (2) the system doesn't seem to be under stress when it crashes, and (3) it ends up in this weird semi-locked state. But I can't think of what else it could be, unless we've got a bung CPU?

Wtf is going on!? All help appreciated.
 


Try checking the iLO or IPMI system logs, or whatever the equivalent is for the hardware you are using.
 
It's all running on prosumer hardware, unfortunately. The motherboard in question is an Asus TUF Gaming B850M and doesn't have a BMC.
 
Is it a crash or a freeze?

Briefly, AIUI, a crash is when the entire system stops running unexpectedly and exits. It may cause a reboot or kernel panic.

A freeze on the other hand, again AIUI, is when the system or a program becomes unresponsive, but the machine's still running. The keyboard and mouse become unresponsive, so there's nothing much that can be done. If it's in GUI mode, maybe Ctrl+Alt+F# can recover the system. Usually a freeze doesn't get logged.

Looks like you may be experiencing a freeze.

Intermittent problems like this are often hardware related as you mentioned, though it can be buggy software.

RAM looks to be okay.

Perhaps consider testing the hard drive with smartmontools and running a badblocks check.

The CPUs can be checked with the cpupower tools to see whether they are all running within their expected ranges.

Consider altering the CPU governor, say from performance to powersave, or the reverse, depending on the governors available. I recall a time when switching the governor from powersave to performance eliminated an intermittent issue. Can't say whether it would in your case.
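
Before changing anything, it may help to see which governor each core is currently using. Below is a minimal Python sketch, assuming the standard Linux cpufreq sysfs layout under /sys/devices/system/cpu (on some systems or VMs the cpufreq directory is absent). The function names read_governors and summarize are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    Assumes the usual layout: <root>/cpuN/cpufreq/scaling_governor.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def summarize(governors):
    """Count how many CPUs are using each governor."""
    return Counter(governors.values())


if __name__ == "__main__":
    governors = read_governors()
    if not governors:
        print("No cpufreq information found under sysfs.")
    for governor, count in summarize(governors).items():
        print(f"{governor}: {count} CPU(s)")
```

If every core reports the same governor, switching it (for example with cpupower frequency-set -g performance, run as root) is a cheap experiment to run for a couple of weeks.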

Ensure all current firmware and microcode are installed.

Perhaps ensure the latest BIOS/UEFI is installed, so that the firmware side is covered.

Although the PSU situation seems to be covered, fluctuations in PSU power outputs can cause disruptions to running systems.
 
"If it's in GUI mode, maybe Ctrl+Alt+F# can recover the system."

If it's truly frozen (rather than just slowed by resource usage and barely accepting input), then that won't work.

I'm being a bit pedantic here, but not without cause.

An easy way to tell if your computer is completely frozen is to press the NumLock or Caps Lock key. If that toggles the light on the keyboard, the computer is not completely frozen. The indicator lights aren't driven by the keyboard on its own. The key press sends a signal to the OS, which returns a signal that turns on the light. This is still the case with modern computers, and it's a good way to see if the computer is completely frozen.

If the lights come on, there's a chance of some system activity, such as the attempt to enter a TTY mentioned above. You are also in a position to use the magic SysRq keys (REISUB, or 'busier' backwards) to perform a graceful system shutdown and reboot.
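
One caveat with REISUB: the kernel only honours the SysRq functions that are enabled in /proc/sys/kernel/sysrq, which holds either 0 (all off), 1 (all on), or a bitmask. Here is a small Python sketch that decodes that value for the six REISUB keys, using the bit values from the kernel's SysRq documentation. The helper name allowed_reisub is made up for illustration, and your own setting may differ from any example value, so read the real one from the file.

```python
# Bit each REISUB key depends on, per the kernel SysRq documentation:
#   4 = keyboard control, 16 = sync, 32 = remount read-only,
#   64 = signalling of processes, 128 = reboot/poweroff.
REISUB_BITS = {
    "R": 4,    # take the keyboard out of raw mode
    "E": 64,   # send SIGTERM to all processes
    "I": 64,   # send SIGKILL to all processes
    "S": 16,   # sync filesystems
    "U": 32,   # remount filesystems read-only
    "B": 128,  # reboot immediately
}


def allowed_reisub(mask):
    """Return {key: True/False} for a given kernel.sysrq value.

    A value of exactly 1 enables every function; anything else
    is treated as a bitmask of allowed functions.
    """
    return {key: mask == 1 or bool(mask & bit)
            for key, bit in REISUB_BITS.items()}


if __name__ == "__main__":
    with open("/proc/sys/kernel/sysrq") as f:
        current = int(f.read().strip())
    print(f"kernel.sysrq = {current}")
    for key, ok in allowed_reisub(current).items():
        print(f"  Alt+SysRq+{key}: {'enabled' if ok else 'disabled'}")
```

If E and I show as disabled, the S, U and B steps alone still give a sync, read-only remount and reboot, which is the part that protects the filesystems.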
 

