How to debug persistent crashing ?

Usjes · Jan 7, 2023

Hi,

I have been running Ubuntu on an old Dell Latitude E6410 for a number of years and since the start it has been exhibiting crashes. The symptoms are always the same:
- It invariably happens when I am browsing online in Firefox
- The disk LED will start to flicker, indicating disk access (it's an old hard disk, not an SSD)
- This flickering will become more intense until it is pretty much constant, and as it intensifies the mouse becomes more unresponsive
- Initially the mouse will move the cursor but it won't respond to clicks (e.g. try clicking on the X to close the browser, as I'm almost certain it is something that the browser is doing that is causing this constant disk access)
- I have mapped the Ctrl-Alt-M key combo to xkill, and IF I am quick enough when the flashing starts it will respond to the xkill command and I can kill the browser and the problem is resolved.
- 9 times out of 10 though I will not be quick enough and then the machine is basically frozen, it will remain there for hours apparently trying to access the disk and will be unresponsive to keystrokes or mouse clicks. The only way to resolve the problem at this point is to power cycle the laptop.

This crashing happens about once a week. I'm currently running 18.04.6 LTS, but this has been going on for years (back as far as 12.04.6 LTS, I think), so I'm sure there is no point in upgrading to a newer release of Ubuntu.
So I have two questions:
1. Does anyone know how I would debug this to figure out what the problem is ?
2. If not, is there any way of running Firefox in some sort of 'sandbox' mode such that when it crashes I can easily open a terminal and just kill Firefox? As it is, when it goes crazy it seems to lock up the whole system. Is there any way I can run it that avoids it taking over all the system resources? This was one of the major advantages of Linux vs Windows when I used to run Linux years ago; if there was a problem with a program in Windows it took the whole system down with a BSOD, whereas with Linux I would just open a terminal and issue the ps and kill <pid> commands. It seems now Linux (or Ubuntu at any rate) has contracted the craptastic Windows behaviour.

On the subject of trying to debug it, I did create a simple shell script (crashLogging.sh, attached) to try to see what is going on when the problem arises. I have also attached the outputs (log1.txt, log2.txt) from when the problem happens. It shows that ~100% of the CPU is occupied with 'wa, IO-wait : time waiting for I/O completion'. Does this give any further clue to the root cause of the problem, or can anyone suggest any further experiments to debug it? Doing a grep on the Cpu activity, you can see that all of a sudden the wa % ramps up to near 100% and then just stays there indefinitely.
log2.txt shows a baseline (20 captures of the top command when there is no problem):
grep "Cpu" log2.txt
%Cpu(s): 12.3 us, 2.2 sy, 0.0 ni, 84.9 id, 0.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 20.4 us, 4.0 sy, 0.0 ni, 75.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 7.4 us, 2.3 sy, 0.0 ni, 89.5 id, 0.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 3.1 us, 1.2 sy, 0.0 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 1.9 us, 0.8 sy, 0.0 ni, 97.1 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 4.7 us, 1.3 sy, 0.0 ni, 93.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 4.2 us, 1.2 sy, 0.0 ni, 94.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 5.6 us, 2.3 sy, 0.0 ni, 92.0 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 6.4 us, 1.8 sy, 0.0 ni, 91.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 3.8 us, 1.6 sy, 0.0 ni, 94.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 6.2 us, 1.9 sy, 0.0 ni, 91.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 25.9 us, 4.0 sy, 0.0 ni, 67.2 id, 2.6 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu(s): 16.1 us, 4.2 sy, 0.0 ni, 77.3 id, 2.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 15.7 us, 3.4 sy, 0.0 ni, 78.3 id, 2.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 10.5 us, 2.8 sy, 0.0 ni, 81.2 id, 5.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 21.5 us, 4.2 sy, 0.0 ni, 73.4 id, 0.7 wa, 0.0 hi, 0.2 si, 0.0 st
%Cpu(s): 16.6 us, 3.8 sy, 0.0 ni, 79.5 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 28.7 us, 4.2 sy, 0.0 ni, 65.1 id, 1.8 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 8.1 us, 2.6 sy, 0.0 ni, 89.1 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 12.2 us, 2.8 sy, 0.0 ni, 84.8 id, 0.1 wa, 0.0 hi, 0.1 si, 0.0 st
Then about 80 s into log1 the CPU 'wa' % starts to ramp up, indicating the start of the problem:
grep "Cpu" log1.txt
%Cpu(s): 12.3 us, 2.2 sy, 0.0 ni, 84.9 id, 0.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 9.5 us, 2.2 sy, 0.0 ni, 88.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 40.4 us, 5.8 sy, 0.0 ni, 52.2 id, 1.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 36.2 us, 6.9 sy, 0.0 ni, 49.3 id, 7.4 wa, 0.0 hi, 0.2 si, 0.0 st
%Cpu(s): 28.1 us, 5.0 sy, 0.1 ni, 43.4 id, 23.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 16.7 us, 3.7 sy, 0.0 ni, 39.8 id, 39.8 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 4.1 us, 1.3 sy, 0.0 ni, 35.8 id, 58.8 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 2.3 us, 1.5 sy, 0.0 ni, 34.1 id, 62.0 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 0.7 us, 0.7 sy, 0.0 ni, 1.8 id, 96.8 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 1.2 us, 1.7 sy, 0.0 ni, 10.1 id, 87.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 0.8 us, 1.5 sy, 0.0 ni, 5.8 id, 91.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 0.6 us, 1.1 sy, 0.0 ni, 4.7 id, 93.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 1.0 us, 1.4 sy, 0.0 ni, 17.2 id, 80.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 0.3 us, 0.9 sy, 0.0 ni, 0.2 id, 98.6 wa, 0.0 hi, 0.1 si, 0.0 st
%Cpu(s): 0.9 us, 1.7 sy, 0.0 ni, 0.1 id, 97.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 0.6 us, 3.2 sy, 0.0 ni, 5.9 id, 90.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu(s): 0.9 us, 1.5 sy, 0.0 ni, 6.8 id, 90.8 wa, 0.0 hi, 0.0 si, 0.0 st
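
To find where the ramp starts without reading every line, the 'wa' column can be pulled out with awk. This is only a sketch: extract_wa and first_wa_spike are made-up helper names, and it assumes the %Cpu(s) layout shown above, where each number is followed by its label.

```shell
# Hypothetical helpers for a saved top log such as log1.txt.
# Assumes each "%Cpu(s)" line has the layout shown above, e.g. "... 96.8 wa, ...".

# Print the iowait ("wa") percentage from every %Cpu(s) line.
# Reads the named files, or standard input if none are given.
extract_wa() {
    awk '/^%Cpu/ { for (i = 2; i <= NF; i++) if ($i == "wa,") print $(i - 1) }' "$@"
}

# Print the 1-based index of the first sample whose wa exceeds a threshold.
# Prints nothing if no sample crosses it.
first_wa_spike() {
    threshold=$1
    shift
    extract_wa "$@" | awk -v t="$threshold" '($1 + 0) > (t + 0) { print NR; exit }'
}

# Example (not run here): first_wa_spike 20 log1.txt
```

With the log1.txt lines above, a threshold of 20 points at the fifth sample (23.4 wa).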

Anyone have any suggestions or is there any other information in these logs that would explain what is going on ? Thanks,

Usjes

Brickwizard · Jan 7, 2023

It's not unusual for old-style hard drives to crash (I have had several over the years).
Test your hard drive with smartctl or similar.

MattWinter · Jan 7, 2023

What is your ram usage like when this is going on? Is it possibly paging? Can you run htop while running the browser? While the browser is freezing the machine, can you switch to a different terminal (ctrl-alt-F1 through F7), login, and kill the browser process from that terminal?
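
One quick way to answer the paging question is to look at how much RAM and swap the kernel reports. A rough sketch, assuming a Linux /proc/meminfo; meminfo_kb is just an illustrative helper name:

```shell
# Hypothetical helper: print one field (in kB) from /proc/meminfo,
# or from another file in the same format given as the second argument.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Example (not run here):
#   meminfo_kb MemAvailable   # RAM still available to applications
#   meminfo_kb SwapTotal      # 0 means no swap is active at all
#   meminfo_kb SwapFree
# "free -h" and "swapon --show" report the same information in a friendlier form.
```

If SwapTotal comes back as 0, the machine has nothing to page out to, which can make a memory-hungry browser stall everything once RAM is nearly full.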

kc1di · Jan 7, 2023

Often times the problem is Firefox's hardware acceleration setting. You can try disabling that under Settings> performance settings
by unchecking the box "Use recommended performance settings", then unchecking the box that says "Use hardware acceleration when available".

JasKinasis · Jan 7, 2023

The CPU usage and memory usage don't seem to be a problem.
It looks to me like the processes are having to wait for existing disk read/write operations to finish before they can do their thing. So it's probably where Firefox is caching web content; I can't be sure whether the problem is with reads or writes.

So it may well be the hard drive is on its way out. I'd second @Brickwizard's suggestion to check your hard drive for SMART errors. It might be in a pre-fail state.

osprey · Jan 7, 2023

Usjes asked:

is there anyway of running Firefox in some sort of 'sandbox' mode such that when it crashes I can easily open a terminal and just kill Firefox.

There's a good chance firefox is running in a namespace which is a form of container. When it's running, run:

Code:

lsns -t net

and see if firefox is there.

To kill firefox, if little else works, get its PID from the output of the ps command, and kill the PID, for example:

Code:

[flip@flop ~]$ ps aux | grep -i firefox
flop         1181 15.2  7.9 3550296 646764 pts/0  Sl   07:56  12:55 firefox-esr
<snip>

Then:

Code:

kill 1181

If that fails, you can use "kill -9 1181", but it's very unusual in my experience that firefox isn't killed easily.
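
The kill-then-kill -9 sequence can be wrapped in a small function. A minimal sketch; terminate is a made-up name and the 5-second default grace period is an arbitrary choice:

```shell
# Hypothetical helper: ask a process to exit with SIGTERM, then fall back
# to SIGKILL if it is still there after a grace period (default 5 seconds).
terminate() {
    pid=$1
    grace=${2:-5}
    # Nothing more to do if the signal cannot be delivered (e.g. already gone).
    kill "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        # kill -0 sends no signal; it only checks that the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    kill -9 "$pid" 2>/dev/null
    return 0
}

# Example (not run here):
#   terminate 1181
#   terminate "$(pgrep -o firefox)"   # oldest process whose name matches "firefox"
```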

All in all though, it sounds like the hard drive is on the way out.

Usjes · Jan 8, 2023

Thanks for the suggestions, I'll try some of these tomorrow. But some immediate reactions:
1. While the browser is freezing the machine, can you switch to a different terminal (ctrl-alt-F1 through F7), login, and kill the browser process from that terminal?
- I can't imagine this will work, as it will be no different from my ctrl-alt-M, which I have mapped to xkill. When the problem happens the machine becomes unresponsive to keystrokes, so 'ctrl-alt-F1->F7' simply won't have any effect.
2. Using ps and kill <PID> won't work for the same reason.
3. When this initially started happening I did also think myself that the hard disk might be failing, but I would emphasize that this has been happening for literally years. I'm not sure how long, but from the earliest documents on this disk I can say that it has been happening since at least 2016. If a disk was on its last legs, would it really hobble along for 6 years without failing completely?
I guess the other thing to mention is that it is a dual-boot setup with a Windows 7 install. I have never seen any problems with Windows 7, although I boot that literally once a year to file a tax return, so I guess I would have to wait 7 years to see a failure if it happens roughly once a week.

KGIII · Jan 8, 2023

kc1di said:
You can try disabling that under Settings> performance settings

^ THIS is the first thing I'd try. LOL It's often me who's giving that suggestion.

But, if it has been happening for 6 years, that might not be the cause. So, another thing to try is Firefox in 'safe mode' - with no extensions added to it.

digitaltrails · Jan 8, 2023

I would start by looking at the logs.

Have you checked the logs using journalctl or looked in /var/log/ or at ~/.local/share/sddm/xorg-session.log (I'm not exactly sure where ubuntu keeps some of its log files). If it uses journalctl, then journalctl --boot -1 after a reboot should be worth a look.
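
If you do go through the logs, a filter for the usual suspects saves some scrolling. This is a sketch only: find_io_or_oom is a made-up name, and the patterns are typical kernel message fragments that I am assuming, not strings taken from this machine's logs.

```shell
# Hypothetical helper: keep only lines that look like disk I/O errors,
# OOM-killer activity or hung-task warnings. Reads files, or stdin if none given.
# Like grep, it returns a non-zero status when nothing matches.
find_io_or_oom() {
    grep -E -i 'I/O error|out of memory|oom-killer|blocked for more than' "$@"
}

# Example (not run here), kernel messages from the previous boot:
#   journalctl -k -b -1 | find_io_or_oom
# Previous boots are only available if the journal is stored persistently
# (typically under /var/log/journal); otherwise try /var/log/kern.log*.
```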

As others have suggested, you might first try Firefox without any acceleration or addons. You might also try a different browser for a while.

The symptoms are typical of running out of RAM, but these days Linux is normally configured so that the OOM (Out of Memory) Killer eventually kicks in, so perhaps it isn't the cause in this case.

Being stuck on an I/O to a bad part of the disk/SSD might also result in these symptoms. I would expect that to be logged. I would also expect smartctl to report any issues. Assuming your problem drive is /dev/sda, as root, you could do:

Code:

smartctl -a /dev/sda

And smartctl can be used to start self-tests on drives.
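
When reading the attribute table, the raw values of a handful of attributes are usually the first thing to look at. A sketch that picks them out; smart_raw is a made-up helper, and it assumes the ten-column table that smartctl -A prints for ATA drives, with the raw value starting in column 10.

```shell
# Hypothetical helper: print the raw value of one SMART attribute, reading
# "smartctl -A" style output from files or stdin. Assumes the attribute name
# is in column 2 and the raw value starts in column 10.
smart_raw() {
    name=$1
    shift
    awk -v n="$name" '$2 == n { print $10 }' "$@"
}

# Example (not run here; needs root and the smartmontools package):
#   smartctl -A /dev/sda > attrs.txt
#   for a in Reallocated_Sector_Ct Current_Pending_Sector \
#            Offline_Uncorrectable Reported_Uncorrect; do
#       printf '%s = %s\n' "$a" "$(smart_raw "$a" attrs.txt)"
#   done
# Starting a short self-test and reading back the self-test log:
#   smartctl -t short /dev/sda
#   smartctl -l selftest /dev/sda
```

Non-zero raw values for those four attributes are generally worth a closer look, though how raw values are encoded is vendor-specific.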

Some memory usage patterns can trigger bad RAM, but I would just expect the system to fall over. Experimenting with stress-ng to push hard on the machine might be worth a go. BIOS RAM timings and overclocking might also be the source of RAM issues.

As others have suggested, when the problem begins you could see if you can get onto a text console (Alt-Ctrl-F1) and take a look at the logs etc.

Deleted member 108694 · Jan 9, 2023

You can also use the Disks utility to check your disk (SMART Data and Self Tests); you will need a Live USB for this.

Usjes · Jan 14, 2023

Finally got a chance to run smartctl. It seems to run ~instantaneously so I guess it is just reporting data that has already been collected rather than running a test? It looks like the most relevant line is:
SMART overall-health self-assessment test result: PASSED
Does this mean the disk is ok ? Or are there more details that need to be taken into account ?
Full output:

sudo smartctl /dev/sda -a
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-136-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Momentus 7200.4
Device Model: ST9250410AS
Serial Number: 5VG7WJ8R
LU WWN Device Id: 5 000c50 02a94928f
Firmware Version: D005SDM1
User Capacity: 250,059,350,016 bytes [250 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Sat Jan 14 15:26:43 2023 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 53) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
1   Raw_Read_Error_Rate     0x000f   117   087   006    Pre-fail  Always   -           139074299
3   Spin_Up_Time            0x0003   100   099   085    Pre-fail  Always   -           0
4   Start_Stop_Count        0x0032   096   096   020    Old_age   Always   -           4236
5   Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
7   Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always   -           13564005736
9   Power_On_Hours          0x0032   082   055   000    Old_age   Always   -           16067
10  Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
12  Power_Cycle_Count       0x0032   097   037   020    Old_age   Always   -           3971
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           6824
188 Command_Timeout         0x0032   100   098   000    Old_age   Always   -           131138
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   062   053   045    Old_age   Always   -           38 (Min/Max 21/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always   -           216
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always   -           544
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always   -           560869
194 Temperature_Celsius     0x0022   038   047   000    Old_age   Always   -           38 (0 12 0 0 0)
195 Hardware_ECC_Recovered  0x001a   048   030   000    Old_age   Always   -           139074299
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           54980 (19 253 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline  -           3783033231
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline  -           2793996618
254 Free_Fall_Sensor        0x0032   001   001   000    Old_age   Always   -           93

SMART Error Log Version: 1
ATA Error Count: 4855 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4855 occurred at disk power-on lifetime: 12198 hours (508 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 f9 4a 06 00 Error: UNC at LBA = 0x00064af9 = 412409

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 20 20 ad 0b 49 00 00:20:36.620 READ FPDMA QUEUED
61 00 08 68 76 0b 49 00 00:20:36.619 WRITE FPDMA QUEUED
61 00 08 b0 85 0a 49 00 00:20:36.618 WRITE FPDMA QUEUED
61 00 08 f8 6a 0a 49 00 00:20:36.615 WRITE FPDMA QUEUED
61 00 20 e0 ac db 4f 00 00:20:36.614 WRITE FPDMA QUEUED

Error 4854 occurred at disk power-on lifetime: 12198 hours (508 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 f9 4a 06 00 Error: UNC at LBA = 0x00064af9 = 412409

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 20 e0 6a 0a 49 00 00:20:34.055 READ FPDMA QUEUED
61 00 08 10 23 0a 49 00 00:20:34.054 WRITE FPDMA QUEUED
61 00 08 d8 e6 08 49 00 00:20:34.054 WRITE FPDMA QUEUED
61 00 08 a0 9a 07 49 00 00:20:34.053 WRITE FPDMA QUEUED
61 00 08 38 4a 07 49 00 00:20:34.052 WRITE FPDMA QUEUED

Error 4853 occurred at disk power-on lifetime: 12198 hours (508 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 f9 4a 06 00 Error: UNC at LBA = 0x00064af9 = 412409

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 20 80 67 04 49 00 00:20:28.898 READ FPDMA QUEUED
61 00 08 88 e4 03 49 00 00:20:28.898 WRITE FPDMA QUEUED
61 00 08 d8 cd 5b 40 00 00:20:28.896 WRITE FPDMA QUEUED
61 00 08 20 58 5e 40 00 00:20:28.895 WRITE FPDMA QUEUED
61 00 08 e0 cd 5b 40 00 00:20:28.895 WRITE FPDMA QUEUED

Error 4852 occurred at disk power-on lifetime: 12198 hours (508 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 f9 4a 06 00 Error: UNC at LBA = 0x00064af9 = 412409

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 08 03 0e 49 00 00:20:26.327 READ FPDMA QUEUED
61 00 08 73 34 af 47 00 00:20:26.260 WRITE FPDMA QUEUED
61 00 08 6b 34 af 47 00 00:20:26.260 WRITE FPDMA QUEUED
61 00 08 63 34 af 47 00 00:20:26.259 WRITE FPDMA QUEUED
61 00 10 30 b8 98 48 00 00:20:26.259 WRITE FPDMA QUEUED

Error 4851 occurred at disk power-on lifetime: 12198 hours (508 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 f9 4a 06 00 Error: WP at LBA = 0x00064af9 = 412409

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 08 30 42 98 48 00 00:20:21.243 WRITE FPDMA QUEUED
61 00 05 98 49 24 40 00 00:20:21.243 WRITE FPDMA QUEUED
61 00 08 38 e6 5f 40 00 00:20:21.243 WRITE FPDMA QUEUED
61 00 08 e8 01 41 45 00 00:20:21.242 WRITE FPDMA QUEUED
61 00 01 00 34 af 47 00 00:20:21.242 WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 37799 -
# 2 Short offline Completed: read failure 90% 12195 412409
# 3 Short offline Completed without error 00% 1 -
# 4 Short offline Completed without error 00% 0 -
# 5 Short offline Completed without error 00% 0 -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

digitaltrails · Jan 14, 2023

It would help if you code quoted log text to make it easier for us to read.

Others more experienced with SMART may well comment, but in the meantime: I've never seen any errors or failed tests logged on my current rotating drives, which are several years old. But I see you have:

Code:

Error 4855 occurred at disk power-on lifetime: 12198 hours (508 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 f9 4a 06 00 Error: UNC at LBA = 0x00064af9 = 412409

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 20 20 ad 0b 49 00 00:20:36.620 READ FPDMA QUEUED
61 00 08 68 76 0b 49 00 00:20:36.619 WRITE FPDMA QUEUED
61 00 08 b0 85 0a 49 00 00:20:36.618 WRITE FPDMA QUEUED
61 00 08 f8 6a 0a 49 00 00:20:36.615 WRITE FPDMA QUEUED
61 00 20 e0 ac db 4f 00 00:20:36.614 WRITE FPDMA QUEUED

The commands leading up to it include some WRITE commands.

If I understand correctly, the summary lists that 6,824 of these have occurred:

Code:

187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 6824

You can google the error messages. UNC is an uncorrectable error, which means the drive was not able to correct the data in a block it tried to read.
The drive also failed a self-test at some time in the past:

Code:

# 2 Short offline Completed: read failure 90% 12195 412409

You can use smartctl to initiate tests; take a look at the man page.

Personally, I would be uncomfortable continuing to depend on this drive without an explanation for these errors.

I had some errors on a new Samsung SSD, but after some googling I found they were due to firmware limitations, and disabling NCQ got rid of them.

Sometimes drives have a few errors, automatically map out the bad blocks, replacing them with spares, and then happily keep going.

I would replace the drive with an SSD. An SSD of around 250 GB can be purchased for US$30. That would make the old system fly.

Usjes · Oct 13, 2024

I did finally get to the bottom of this, so here it is in case anyone else finds themselves in the same situation. The problem was that the Ubuntu install somehow messed up the swap partition. The clue is in the logs I posted originally, so maybe nobody bothered looking at them:

Notice how in all the saved outputs from the top command (in log1/2.txt that I attached to the original post) the Swap size is listed as '0 total'. When installing Ubuntu I left 4 GB for swap space to match my 4 GB of RAM. So it seems that when free RAM dropped below ~128 MB, Ubuntu would start trying to write to the non-existent swap space and, after failing to do so a couple of million times, would continue trying a couple of billion times more, locking up the system until I power cycled. When installing Ubuntu I created the swap partition on sda6, so for me the solution was:
sudo mkswap /dev/sda6
sudo swapon /dev/sda6

Now when it runs low on RAM I see the same behaviour: the %Cpu value for 'wa' in the top command ramps up towards 100% and applications freeze. However, now it is temporary; it presumably succeeds in writing data to swap, the 'wa' value falls back towards zero and the applications recover.
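
One caveat for anyone copying this fix: swapon only enables the partition until the next reboot, so the swap space also needs an entry in /etc/fstab to come back automatically. A rough sketch for checking that; has_swap_entry is a made-up helper, and /dev/sda6 is simply the partition on this laptop, so substitute your own.

```shell
# Hypothetical helper: succeed if an fstab-format file has an active
# (non-comment) swap entry. Defaults to /etc/fstab.
has_swap_entry() {
    awk '$1 !~ /^#/ && $3 == "swap" { found = 1 } END { exit !found }' "${1:-/etc/fstab}"
}

# Example (not run here):
#   swapon --show                       # lists active swap areas; empty if none
#   has_swap_entry || echo "no swap entry in /etc/fstab"
# A typical fstab line for the partition used above (an assumption; using
# UUID=... from "blkid /dev/sda6" instead of the device name also works):
#   /dev/sda6  none  swap  sw  0  0
```

Note that mkswap writes a new UUID to the partition, so an existing fstab entry that refers to the old UUID would need updating as well.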

Condobloke · Oct 13, 2024

@Usjes, top marks to you for hanging in there !
