If the cause may be hardware, then it's worth checking the hardware. Intermittent issues like those described in post #1 are often hardware related.
Before checking hardware, it's worth bringing the system fully up to date to see whether the problems still persist. That should include the latest firmware as well as the kernel and packages, and also the latest BIOS update.
If the system isn't fully updated, one may end up trying to fix problems that those updates have already fixed. If problems persist on a fully upgraded system, some of the following can be looked at, mostly using root or sudo privileges. Note that none of the following changes anything; it just gathers information. It's all run in a terminal, which is an efficient way to do this because the output appears on screen immediately, to be read or copied for pasting.
First check the logs for errors:
The options mean: -b (this boot); -x (explanation of errors); -p 3 (message priority, 3 = error). Make a note of the errors: each one really needs an explanation, though not all of them necessarily affect how the system functions.
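The command line itself seems to have dropped out of the post. Going by the options described, it was presumably journalctl; the following is a sketch under that assumption, wrapped in a hypothetical function name (show_boot_errors). The --no-pager option is an addition here so that the output goes straight to the terminal for copying.

```shell
# Assumed reconstruction: the original command is not in the post.
# -b: current boot; -x: add explanatory catalog text;
# -p 3: priority "err" and anything more severe
show_boot_errors() {
    journalctl -b -x -p 3 --no-pager
}

# usage (in a root shell):
#   show_boot_errors
```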
Check for missing firmware:
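The command for this step is missing from the post. One common approach, offered here as an assumption rather than the author's own command, is to search the kernel log for firmware messages. The function name firmware_lines is hypothetical.

```shell
# Hypothetical filter: keep only lines that mention firmware (case-insensitive)
firmware_lines() {
    grep -i 'firmware'
}

# usage (in a root shell):
#   dmesg | firmware_lines
```

A line reporting a failed firmware load normally names the file that could not be found, which helps in working out which package is missing.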
Check for missing microcode:
Code:
dmesg | grep -i microcode
Check for CPU vulnerabilities:
Observe the section on "Vulnerabilities" at the bottom of the output. If there are no issues, the entries there read like "Not affected" or "Mitigation".
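The command for this check is missing from the post; a section at the bottom of the output listing vulnerabilities suggests lscpu, but that's an assumption. The same information can be read from sysfs, and the sketch below (with a hypothetical function name, needs_attention) prints only the entries that are neither "Not affected" nor "Mitigation".

```shell
# Hypothetical filter: drop entries whose status is "Not affected" or
# "Mitigation", leaving anything that may need a closer look
needs_attention() {
    grep -Ev ':[[:space:]]*(Not affected|Mitigation)'
}

# usage:
#   grep -r . /sys/devices/system/cpu/vulnerabilities/ | needs_attention
```

No output means every entry is reported as not affected or mitigated.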
Check the drives:
Code:
smartctl -x /dev/<device>
One can get the device name from the output of lsblk, for example:
Code:
[~]$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sr0          11:0    1  1024M  0 rom
nvme0n1     259:0    0 465.8G  0 disk
├─nvme0n1p1 259:1    0   476M  0 part /boot/efi
├─nvme0n1p2 259:2    0  14.9G  0 part [SWAP]
└─nvme0n1p3 259:3    0 450.4G  0 part /
The device name is the entry whose TYPE is "disk" in the output, prefixed with /dev/, so in this case it would be:
/dev/nvme0n1
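Picking out the disk can also be left to the machine. A small sketch, assuming lsblk from util-linux; the function name disk_devices is hypothetical:

```shell
# Hypothetical filter: read "NAME TYPE" pairs and print /dev/<name> for whole disks
disk_devices() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# usage: -d lists whole devices only, -n drops the header, -o selects the columns
#   lsblk -d -n -o NAME,TYPE | disk_devices
```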
In the output of the smartctl command, check the "Critical Warning" value, the temperatures, and whether the overall health test result is PASSED.
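To pick those items out of the long smartctl output, one could filter it. This is only a sketch: the function name smart_summary is hypothetical, and the patterns assume the wording smartctl typically prints, which varies between drive types.

```shell
# Hypothetical filter: keep the health result, critical warning and temperature lines
smart_summary() {
    grep -Ei 'overall-health|critical warning|temperature'
}

# usage (in a root shell), with the device name taken from lsblk:
#   smartctl -x /dev/nvme0n1 | smart_summary
```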
Check the memory with the package memtest86+. One can run it from a live disk or a rescue disk, which is how it's done here. On BIOS systems one can install the package and run it on the next boot by selecting it from the grub menu. It doesn't always appear in the grub menu on UEFI systems. I think it's easier to run from a live disk.
Check the filesystem. This has to be done on an unmounted filesystem. Running fsck from a live disk is safe, and is the way it's done here. To check the various partitions, get their names from the lsblk output above. For example, the root partition is the one mounted at /, so its name is /dev/nvme0n1p3. To check the filesystem on it run:
Each partition can be checked. The output on screen will show if the filesystem is "clean" or something else.
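The fsck command line for this step is missing from the post. The sketch below is an assumption, not the author's command: it refuses to run on a partition that is still mounted, and passes -n so that nothing is changed (on ext filesystems this opens the filesystem read-only and answers "no" to every question; other checkers may treat the option differently). The function names is_mounted and check_fs are hypothetical.

```shell
# Hypothetical helper: succeed if the device appears in the mount table
# $1 = device, $2 = mount table file (defaults to /proc/mounts)
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical wrapper: only run a report-only fsck on an unmounted partition
check_fs() {
    if is_mounted "$1" "${2:-/proc/mounts}"; then
        echo "$1 is mounted, not checking" >&2
        return 1
    fi
    fsck -n "$1"
}

# usage (from a live disk, in a root shell):
#   check_fs /dev/nvme0n1p3
```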
Check the overall temperatures in the system by running:
The sensors command is in the package: lm-sensors. Install it if it's not installed.
In particular check that the temperatures are within the ranges shown in the output.
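The command itself is missing from the post; from the text it is the plain sensors command. As a sketch of the range check, the raw output of sensors -u can be scanned for temperature readings above their reported maximum. The function name temps_over_max is hypothetical, and it assumes each reading (temp*_input) is listed before its limit (temp*_max).

```shell
# Hypothetical filter for "sensors -u" output: report temperature readings
# that are higher than the maximum listed for the same sensor
temps_over_max() {
    awk '
        # unindented line ending in a colon is a sensor label, e.g. "Core 0:"
        /^[^ ].*:$/            { label = $0; sub(/:$/, "", label); cur = "" }
        # remember the current reading
        /^ +temp[0-9]+_input:/ { cur = $2 }
        # compare it with the maximum when one is given
        /^ +temp[0-9]+_max:/   { if (cur != "" && cur + 0 > $2 + 0) print label ": " cur " exceeds max " $2 }
    '
}

# usage:
#   sensors -u | temps_over_max
```

No output means no reading was above its listed maximum; sensors that report no maximum are not compared.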
There are other checks, but the above is a basic start to gather info.