steps to debug a system crash with sudden black screens and monitors switching off

MrWolf72

Hi all,
first post here.
I've been using Linux all my life, even though I can't really say I am an expert.
Generally, when I need to fix an error, I google whatever error report I get and follow the instructions from whatever post I find online in a pretty much copy-paste fashion. Somehow, kicking and sweating, I manage to fix my problem in the end.
Even though I realize Linux is more work, I prefer it to any other system any day because of the flexibility it provides. In other words, I love Linux.

The issue I'm having is a bit different from the ones I've had in the past.
Sometimes (say between 0 and 3 times a day) my system does the following:
- both screens go blank and switch off, as if the computer had been switched off.
- the computer stays on (I can hear the fan going and the power LED is still on).
- of course, with the screens off, I cannot tell anything about what happened (is the mouse still working? keyboard? desktop? no idea).

One important thing I noticed is that this happens completely independently of whether I am actually using the computer. Sometimes I am only staring at the screen reading some document, with the computer pretty much idle and no mouse or keyboard input, and it happens.
This is confirmed by the fact that I sometimes leave my workstation on at night (in that case it is crunching stuff), and the next morning the computer is in the state above. Other times the computer stays on the whole night with no problem and I can resume working the next morning.

Note that this has been happening only on a new workstation that I just bought. It came with Windows installed; I installed Linux myself, and it has been happening ever since.

When this happens, the only way I know to regain control of the workstation is to hard reset the whole thing, and I really mean the big on/off switch at the back of the workstation near the power plug, not the regular power button on the case.

I googled a lot for Linux issues like "sudden crash", "screen off", "crash out of the blue", etc., but I couldn't find much that matches what I describe above. Maybe I didn't ask Google the right question.

Has anybody experienced something like this, and/or can anybody give me a clue on how to debug this?

This is what I get if I run `inxi -Fxz`

Code:
(base) [~/Desktop]$ inxi -Fxz
System:    Kernel: 5.11.0-49-generic x86_64 bits: 64 compiler: gcc v: 10.2.1 Desktop: Budgie 10.5.2
           Distro: Ubuntu 21.04 (Hirsute Hippo)
Machine:   Type: Desktop Mobo: ASRock model: X570 Creator serial: <filter> UEFI: American Megatrends v: P3.50 date: 04/21/2021
Battery:   Device-1: hidpp_battery_0 model: Logitech Wireless Mouse MX Master charge: 55% (should be ignored)
           status: Discharging
           Device-2: hidpp_battery_1 model: Logitech K800 charge: 70% (should be ignored) status: Discharging
CPU:       Info: 16-Core model: AMD Ryzen 9 5950X bits: 64 type: MT MCP arch: Zen 3 rev: 0 L2 cache: 8 MiB
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 217198
           Speed: 4012 MHz min/max: 2200/3400 MHz boost: enabled Core speeds (MHz): 1: 4012 2: 3603 3: 3586 4: 3214 5: 3592
           6: 3199 7: 3590 8: 4244 9: 4191 10: 3590 11: 4266 12: 3590 13: 3590 14: 4523 15: 3593 16: 3589 17: 3579 18: 3574
           19: 4660 20: 3579 21: 3669 22: 3590 23: 3701 24: 3785 25: 3590 26: 3592 27: 3591 28: 4128 29: 2878 30: 3594
           31: 3588 32: 3591
Graphics:  Device-1: NVIDIA GA102 [GeForce RTX 3090] driver: nvidia v: 470.86 bus ID: 32:00.0
           Display: x11 server: X.Org 1.20.11 driver: loaded: nvidia resolution: 1: 4096x2160~60Hz 2: 3840x2160~60Hz
           OpenGL: renderer: NVIDIA GeForce RTX 3090/PCIe/SSE2 v: 4.6.0 NVIDIA 470.86 direct render: Yes
Audio:     Device-1: NVIDIA GA102 High Definition Audio driver: snd_hda_intel v: kernel bus ID: 32:00.1
           Device-2: Advanced Micro Devices [AMD] Starship/Matisse HD Audio vendor: ASRock driver: snd_hda_intel v: kernel
           bus ID: 34:00.4
           Device-3: Logitech HD Pro Webcam C920 type: USB driver: snd-usb-audio,uvcvideo bus ID: 7-1:2
           Sound Server: ALSA v: k5.11.0-49-generic
Network:   Device-1: Intel I211 Gigabit Network vendor: ASRock driver: igb v: kernel port: b000 bus ID: 27:00.0
           IF: enp39s0 state: down mac: <filter>
           Device-2: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel port: b000 bus ID: 28:00.0
           IF: wlp40s0 state: up mac: <filter>
           Device-3: Aquantia AQC107 NBase-T/IEEE 802.3bz Ethernet [AQtion] vendor: ASRock driver: atlantic v: kernel
           port: a000 bus ID: 2d:00.0
           IF: enp45s0 state: down mac: <filter>
Bluetooth: Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus ID: 5-6:2
           Report: ID: hci0 state: up running pscan bt-v: 3.0 lmp-v: 5.2 address: <filter>
Drives:    Local Storage: total: 9.1 TiB used: 1.97 TiB (21.7%)
           ID-1: /dev/nvme0n1 vendor: Western Digital model: WDS200T1X0E-00AFY0 size: 1.82 TiB temp: 46.9 C
           ID-2: /dev/sda vendor: Samsung model: SSD 870 EVO 2TB size: 1.82 TiB
           ID-3: /dev/sdb vendor: Western Digital model: WD4005FZBX-00K5WB0 size: 3.64 TiB
           ID-4: /dev/sdc type: USB vendor: Western Digital model: WD My Book 1140 size: 1.82 TiB
Partition: ID-1: / size: 90.81 GiB used: 56.43 GiB (62.1%) fs: ext4 dev: /dev/nvme0n1p4
           ID-2: /boot/efi size: 99.2 MiB used: 30.2 MiB (30.4%) fs: vfat dev: /dev/nvme0n1p2
           ID-3: /home size: 791.52 GiB used: 48.06 GiB (6.1%) fs: ext4 dev: /dev/nvme0n1p6
Swap:      ID-1: swap-1 type: partition size: 29.8 GiB used: 0 KiB (0.0%) dev: /dev/nvme0n1p5
Sensors:   System Temperatures: cpu: 45.1 C mobo: N/A gpu: nvidia temp: 39 C
           Fan Speeds (RPM): N/A gpu: nvidia fan: 30%
Info:      Processes: 561 Uptime: 1h 13m Memory: 125.54 GiB used: 4.35 GiB (3.5%) Init: systemd runlevel: 5 Compilers:
           gcc: 10.3.0 Packages: 2251 Shell: Bash v: 5.1.4 inxi: 3.3.01
 
The first port of call is trawling through the logs. What do they say that may be relevant?
journalctl -b
journalctl -b -x -p 3

You can use a viewer of your choice for these logs, but I've used vi here:

vi /var/log/syslog

Depending on where your Xorg.0.log lives:

vi ~/.local/share/xorg/Xorg.0.log
vi /var/log/Xorg.0.log
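
One caveat on the journalctl commands above: `journalctl -b` with no argument shows the current boot, and after a hard reset that is the boot that came after the crash. The part worth reading is usually the tail of the previous boot (`-b -1`, `-b -2`, and so on, as listed by `journalctl --list-boots`). A minimal sketch follows. The function names `tail_of_boot` and `kernel_tail_of_boot` are made up for illustration, and the 50-line default and the `warning` priority are assumptions, not something prescribed anywhere in this thread.

```shell
# Show the last N lines (default 50) of one boot's journal,
# at priority "warning" or worse.
#   $1: boot offset as used by journalctl -b (0 = current, -1 = previous, ...)
#   $2: optional number of lines
tail_of_boot() {
    journalctl -b "$1" -p warning -n "${2:-50}" --no-pager
}

# Same idea, but kernel messages only (-k), which is where driver and
# hardware complaints normally end up.
kernel_tail_of_boot() {
    journalctl -k -b "$1" -n "${2:-50}" --no-pager
}
```

For example, `tail_of_boot -1` prints the last warnings and errors logged before the most recent reboot. If the machine froze hard, the final messages may never have reached the disk, so an empty tail does not prove that nothing happened.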
 
Yesterday I had 2 crashes that forced me to reboot, at boots -2 and -1 below:
Code:
> journalctl --list-boots
...
...
  -2 1b92230aec4c4e8eb4916af7eff5c53b Sat 2022-04-09 20:56:05 PDT—Sat 2022-04-09 21:17:01 PDT
  -1 d3a4f101be314be2ac19dbb019f4ed1d Sat 2022-04-09 21:19:43 PDT—Sat 2022-04-09 23:50:15 PDT
   0 1afdfb10f4844c458aadcd7b3d24b38e Sun 2022-04-10 10:25:57 PDT—Sun 2022-04-10 15:13:45 PDT

After the crash I restarted the machine right away.
Looking at session -2, I imagine I should be looking at something that happened a few minutes before the start of session -1, so around 21:18-ish, correct?
I don't see anything relevant around that time when running this command. Am I wrong?

Code:
> journalctl -b -2 -x -p 3
-- Journal begins at Fri 2021-10-22 08:38:34 PDT, ends at Sun 2022-04-10 15:33:57 PDT. --
Apr 09 20:55:53 mother kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.LPC0.EC0], AE_NOT_FOUND (20201113/dswload2-162)
Apr 09 20:55:53 mother kernel: ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20201113/psobject-220)
Apr 09 20:55:53 mother kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.GPP1], AE_NOT_FOUND (20201113/dswload2-162)
Apr 09 20:55:53 mother kernel: ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20201113/psobject-220)
Apr 09 20:55:53 mother kernel: pcieport 0000:04:04.0: pciehp: Hotplug bridge without secondary bus, ignoring
Apr 09 20:55:53 mother kernel: sd 6:0:0:0: [sdc] No Caching mode page found
Apr 09 20:55:53 mother kernel: sd 6:0:0:0: [sdc] Assuming drive cache: write through
Apr 09 20:55:53 mother kernel: scsi 6:0:0:1: Wrong diagnostic page; asked for 1 got 8
Apr 09 20:55:53 mother kernel: scsi 6:0:0:1: Failed to get diagnostic page 0x1
Apr 09 20:55:53 mother kernel: scsi 6:0:0:1: Failed to bind enclosure -19
Apr 09 20:55:53 mother kernel:
Apr 09 20:55:59 mother lightdm[2154]: PAM unable to dlopen(pam_kwallet.so): /lib/security/pam_kwallet.so: cannot open shared object file: No such file or directory
Apr 09 20:55:59 mother lightdm[2154]: PAM adding faulty module: pam_kwallet.so
Apr 09 20:55:59 mother lightdm[2154]: PAM unable to dlopen(pam_kwallet5.so): /lib/security/pam_kwallet5.so: cannot open shared object file: No such file or directory
Apr 09 20:55:59 mother lightdm[2154]: PAM adding faulty module: pam_kwallet5.so
Apr 09 20:56:00 mother lightdm[2383]: PAM unable to dlopen(pam_kwallet.so): /lib/security/pam_kwallet.so: cannot open shared object file: No such file or directory
Apr 09 20:56:00 mother lightdm[2383]: PAM adding faulty module: pam_kwallet.so
Apr 09 20:56:00 mother lightdm[2383]: PAM unable to dlopen(pam_kwallet5.so): /lib/security/pam_kwallet5.so: cannot open shared object file: No such file or directory
Apr 09 20:56:00 mother lightdm[2383]: PAM adding faulty module: pam_kwallet5.so
Apr 09 20:56:05 mother lightdm[2383]: gkr-pam: unable to locate daemon control file


Looking at syslog.1 in the time frame before 21:19, I see this. I'm not sure how to interpret it. Do you see anything wrong?

Code:
> vi /var/log/syslog.1
...
Apr  9 21:01:44 mother systemd[1]: systemd-timedated.service: Succeeded.
Apr  9 21:01:54 mother budgie-wm.desktop[2841]: Window manager warning: WM_TRANSIENT_FOR window 0x42001f3 for 0x42001fa window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.
Apr  9 21:02:00 mother budgie-wm.desktop[2841]: Window manager warning: WM_TRANSIENT_FOR window 0x4200106 for 0x420010d window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.
Apr  9 21:03:04 mother budgie-wm.desktop[2841]: message repeated 13 times: [ Window manager warning: WM_TRANSIENT_FOR window 0x4200106 for 0x420010d window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.]
Apr  9 21:03:09 mother budgie-wm.desktop[2841]: Window manager warning: WM_TRANSIENT_FOR window 0x420029e for 0x42002a5 window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.
Apr  9 21:03:18 mother budgie-panel[2848]: wnck_window_is_skip_pager: assertion 'WNCK_IS_WINDOW (window)' failed
Apr  9 21:03:18 mother budgie-panel[2848]: wnck_window_is_skip_tasklist: assertion 'WNCK_IS_WINDOW (window)' failed
Apr  9 21:03:18 mother budgie-panel[2848]: wnck_window_get_geometry: assertion 'WNCK_IS_WINDOW (window)' failed
Apr  9 21:04:35 mother systemd[1]: Starting Ubuntu Advantage Timer for running repeated jobs...
Apr  9 21:04:36 mother systemd[1]: ua-timer.service: Succeeded.
Apr  9 21:04:36 mother systemd[1]: Finished Ubuntu Advantage Timer for running repeated jobs.
Apr  9 21:04:52 mother budgie-wm.desktop[2841]: Window manager warning: WM_TRANSIENT_FOR window 0x4200106 for 0x420010d window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.
Apr  9 21:10:52 mother budgie-wm.desktop[2841]: message repeated 11 times: [ Window manager warning: WM_TRANSIENT_FOR window 0x4200106 for 0x420010d window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.]
Apr  9 21:10:58 mother systemd[1]: Starting Cleanup of Temporary Directories...
Apr  9 21:10:58 mother systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Apr  9 21:10:58 mother systemd[1]: Finished Cleanup of Temporary Directories.
Apr  9 21:17:01 mother CRON[267588]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
## I imagine this is where I had to reboot
Apr  9 21:19:46 mother systemd-modules-load[581]: Inserted module 'lp'
Apr  9 21:19:46 mother systemd-modules-load[581]: Inserted module 'ppdev'
Apr  9 21:19:46 mother kernel: [    0.000000] Linux version 5.11.0-49-generic (buildd@lcy02-amd64-054) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #55-Ubuntu SMP Wed Jan 12 17:36:34 UTC 2022 (Ubuntu 5.11.0-49.55-generic 5.11.22)
Apr  9 21:19:46 mother systemd-modules-load[581]: Inserted module 'parport_pc'
Apr  9 21:19:46 mother kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-49-generic root=UUID=ed3952ae-5b6e-4bd6-bb8b-e71e4c775fe2 ro quiet splash crashkernel=512M-:192M vt.handoff=7
Apr  9 21:19:46 mother kernel: [    0.000000] KERNEL supported cpus:
Apr  9 21:19:46 mother kernel: [    0.000000]   Intel GenuineIntel
Apr  9 21:19:46 mother kernel: [    0.000000]   AMD AuthenticAMD
Apr  9 21:19:46 mother kernel: [    0.000000]   Hygon HygonGenuine
Apr  9 21:19:46 mother kernel: [    0.000000]   Centaur CentaurHauls
Apr  9 21:19:46 mother systemd-modules-load[581]: Inserted module 'msr'
...

Looking at /var/log/Xorg.0.log, I can't find a timestamp that I can interpret. Is this file generated during a crash? I am not sure how to read it.
The file is 1384 lines long, and the whole second half looks like this. Should I post the whole thing, or look for something in particular?

Code:
[ 17632.494] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-11ms), your system is too slow
[ 17658.656] (EE) client bug: timer event17 debounce: scheduled expiry is in the past (-12ms), your system is too slow
[ 17658.656] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-25ms), your system is too slow
[ 17748.009] (EE) event18 - Logitech K800: client bug: event processing lagging behind by 11ms, your system is too slow
[ 17831.649] (EE) event18 - Logitech K800: client bug: event processing lagging behind by 34ms, your system is too slow
[ 17873.001] (EE) client bug: timer event17 debounce: scheduled expiry is in the past (-6ms), your system is too slow
[...]
[ 18850.881] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-2ms), your system is too slow
[ 18866.757] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-11ms), your system is too slow
[ 19171.897] (EE) event17 - Logitech MX Master: client bug: event processing lagging behind by 19ms, your system is too slow
[ 19174.071] (EE) client bug: timer event17 debounce: scheduled expiry is in the past (-8ms), your system is too slow
[ 19174.071] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-21ms), your system is too slow
[ 19197.388] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-9ms), your system is too slow
[ 19235.402] (EE) event17 - Logitech MX Master: client bug: event processing lagging behind by 28ms, your system is too slow
[ 19259.226] (EE) event17 - Logitech MX Master: client bug: event processing lagging behind by 16ms, your system is too slow
[ 19263.038] (EE) event17 - Logitech MX Master: client bug: event processing lagging behind by 15ms, your system is too slow
[ 19263.818] (EE) event17 - Logitech MX Master: client bug: event processing lagging behind by 15ms, your system is too slow
[ 19263.818] (EE) event17 - Logitech MX Master: WARNING: log rate limit exceeded (5 msgs per 60min). Discarding future messages.
[ 19302.207] (EE) client bug: timer event17 debounce: scheduled expiry is in the past (-16ms), your system is too slow
[ 19302.207] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-29ms), your system is too slow
[ 19394.277] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-6ms), your system is too slow
[ 19433.379] (EE) client bug: timer event17 debounce: scheduled expiry is in the past (-8ms), your system is too slow
[ 19433.379] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-21ms), your system is too slow
[...]
[ 19992.812] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-13ms), your system is too slow
[ 20115.922] (EE) client bug: timer event17 debounce: scheduled expiry is in the past (-15ms), your system is too slow
[ 20115.922] (EE) client bug: timer event17 debounce short: scheduled expiry is in the past (-28ms), your system is too slow
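
On the timestamps in the listing above: as far as I know, the bracketed number that Xorg writes is not a wall-clock time but a count of seconds on the kernel's monotonic clock, which on Linux roughly means seconds since boot. Under that assumption, adding the number to the boot time gives an approximate wall-clock time. A minimal sketch (the function name `xorg_ts_to_date` is made up here, and `date -d @...` assumes GNU date):

```shell
# Convert an Xorg.0.log timestamp to an approximate wall-clock time.
#   $1: boot time in seconds since the epoch
#   $2: bracketed value from the log, e.g. 17632.494
# Assumption: the log value counts seconds since boot (monotonic clock).
xorg_ts_to_date() {
    boot_epoch=$1
    offset=${2%.*}    # drop the fractional part; shell arithmetic is integer-only
    date -d "@$((boot_epoch + offset))" '+%Y-%m-%d %H:%M:%S'
}

# Usage on the running system (uptime -s prints the boot time):
#   xorg_ts_to_date "$(date -d "$(uptime -s)" +%s)" 17632.494
```

This only lines up for the boot that wrote the log. Xorg.0.log is rewritten each time the X server starts and the previous one is kept as Xorg.0.log.old, so after a crash and reboot it is the .old file that covers the crashed session, and its offsets have to be added to the start time of that earlier boot.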

I am not really sure what to make of this. Do you see anything wrong?

Thank you so much for helping on this.
 
The budgie window manager, or parts of it, looks like it is causing some grief. One way to check that is to use a different window manager altogether and see what ensues.

To separate a GUI issue from a video card issue, one way of checking is to boot to a text prompt and see if the screens still "blank off" in that state. If they do blank even from a text prompt, then you may be looking at the video card drivers. It looks like you're running the nvidia driver. I don't have experience with it, but if it gets to the point of being a driver issue, you can uninstall the nvidia one, use the nouveau driver instead, and see if that resolves the issue. Actually, if you uninstall the nvidia driver, nouveau may be picked up automatically, as long as it hasn't been blacklisted in /etc somewhere.

If you do test the issue by booting to the text prompt, you can still surf the net with text browsers such as elinks, w3m or lynx whilst you're waiting to see if the blanking happens. An overnight test, leaving the machine at the text prompt, might be the way to go on that one.
 
Interesting, thank you @NorthWest.

I assume the lines that caught your attention are the ones below from syslog.1, correct?
If so, would the fact that these log lines have been recorded several minutes before the crash rule this possibility out? Or might the issue accumulate and only blow up after several errors, which then would not be reported in the log because the machine is already dead?

Apr 9 21:02:00 mother budgie-wm.desktop[2841]: Window manager warning: WM_TRANSIENT_FOR window 0x4200106 for 0x420010d window override-redirect is an override-redirect window and this is not correct according to the standard, so we'll fallback to the first non-override-redirect window 0x4200032.

I'd like to try leaving the machine at a text prompt overnight first, as it seems the easiest test so far.
To perform this test, is it enough to switch to text mode from a regular X session using Ctrl-Alt-F6, or do you think it is a more reliable test if I actually reboot into text mode?
 
Personally, I'd reboot to the text prompt for the test. That way you exclude the possibly confounding variable of X.
"would the fact that these log lines have been recorded several minutes before the crash rule this possibility out"
I would not think so, because the code for these things can be quite complex, with internal loops and time delays which may or may not be active, but I simply don't know because I haven't read the code. From the user's point of view, I think the test of using another window manager could be informative. There are so many window managers now which are really light, so they shouldn't tax the system, and it is easy to switch from one to the other.
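
For the reboot-to-text-prompt test discussed above, here is a minimal sketch for a systemd-based install such as the Ubuntu 21.04 shown in the inxi output: the default boot target can be switched to the non-graphical one and back. The function names are made up for illustration; `multi-user.target` and `graphical.target` are the standard systemd targets.

```shell
# Boot to a text console from the next reboot on (no display manager, no X).
boot_to_text() {
    sudo systemctl set-default multi-user.target
}

# Restore the graphical login once the test is done.
boot_to_gui() {
    sudo systemctl set-default graphical.target
}

# Show which target is currently the default.
current_default() {
    systemctl get-default
}
```

Run `boot_to_text`, reboot, log in at the prompt and leave the machine overnight. `boot_to_gui` followed by another reboot undoes the change.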
 
