Seagate Exos X20 20TB intermittent startup/detection issue across multiple PCs - not sure if firmware, interface, or power-up problem

kibasnowpaw

I am trying to figure out what is actually wrong with this drive and whether anyone here has seen the same thing before.

The drive is:

Seagate Exos X20 20TB
Model: ST20000NM007D
Part Number: 3DJ103-011
Firmware: SN01
Serial: ZVT5R78G

I got it from a friend because he could not figure it out either. He had the same type of issue on Windows 10 before I got it, so this did not start on Linux.

The problem is random startup/detection failure. Sometimes the drive works normally for days, passes tests, handles large transfers, and looks fine. Other times, after the PC has been powered off for a while, the drive does not come up properly at boot.

What makes this annoying is that it is not a dead drive in the usual sense. It does not behave like a drive with obvious bad sectors or a drive that is constantly throwing read/write errors.

What I have tested so far:

  • different SATA data cables
  • different SATA power connectors / power leads
  • different PCs
  • full format to ext4
  • large data transfers to and from the drive
  • SeaTools short test
  • SMART checks with smartctl
  • hdparm feature check

What I found:

SMART media-side values still look clean:

  • Reallocated_Sector_Ct = 0
  • Current_Pending_Sector = 0
  • Offline_Uncorrectable = 0

The SeaTools short self-test passes.

Once the drive is detected and running, it can move large amounts of data without losing files. I have already used it for moving data back and forth while reformatting another HDD to ext4, and I have not had data loss.

But I still had one real failure myself:

My PC had been running for about 3 days without problems. Then I shut it fully off for around 10 hours. On the next boot the drive did not come up properly. Once it was detected again, it worked normally.
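When a cold boot fails like this, the kernel log from that boot usually records whether the SATA link came up at all, which helps separate a link problem from a drive that never answered. A minimal sketch of a filter for that, assuming the usual libata message wording; the function name `ata_link_events` is made up for illustration:

```shell
# Keep only libata lines about link bring-up, resets, and failed identification.
# Reads a kernel log on stdin.
ata_link_events() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: .*(SATA link|COMRESET|hard resetting link|link is slow to respond|failed to IDENTIFY)'
}

# Usage on a systemd system (current boot; use "-b -1" for the previous boot):
#   journalctl -k -b | ata_link_events
# or, without journald:
#   dmesg | ata_link_events
```

Lines such as "SATA link down" or "COMRESET failed" on the drive's port would point at link initialization; no output at all for that port would suggest the drive never responded.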

The SMART value that bothers me is this:

UDMA_CRC_Error_Count was already 114 when I got the drive from my friend.
After about a week of my own testing it has gone to 116.

So I am not saying all 116 happened on my system. I am only saying it increased by 2 while I had it.
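To see exactly when the counter moves, the raw value can be pulled out of the `smartctl -A` attribute table and noted after each boot. A small sketch, assuming the standard ten-column table layout where the raw value is the last field; `crc_count` is a hypothetical helper name and `/dev/sdX` is a placeholder:

```shell
# Print the raw UDMA_CRC_Error_Count from `smartctl -A` output read on stdin.
crc_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Usage (needs root; replace /dev/sdX with the real device):
#   smartctl -A /dev/sdX | crc_count
# Appending a dated line per boot makes increases easy to spot:
#   echo "$(date -Is) $(smartctl -A /dev/sdX | crc_count)" >> crc.log
```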

That makes me think this is more of an interface / link / startup problem than a normal media failure.

I also checked hdparm and got this:

  • Power-Up In Standby feature set
  • SET_FEATURES required to spinup after power up

I do not know if that is relevant or just normal enterprise drive behavior, but I mention it because the problem on this drive does seem tied to power-up/startup.
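One detail worth checking here is whether the feature is actually enabled or only supported. In `hdparm -I` output, enabled features carry a `*` in the column before the feature name. A rough sketch that reports which case applies, assuming that layout; `puis_state` is a hypothetical helper name:

```shell
# Report whether Power-Up In Standby is enabled, given `hdparm -I` output on stdin.
# An enabled feature has "*" as the first field on its line.
puis_state() {
    awk '/Power-Up In Standby feature set/ {
             found = 1
             print (($1 == "*") ? "enabled" : "supported, not enabled")
         }
         END { if (!found) print "not reported" }'
}

# Usage (needs root; replace /dev/sdX with the real device):
#   hdparm -I /dev/sdX | puis_state
```

If it reports "enabled", the drive waits for the host to tell it to spin up, so a controller or BIOS that does not do that reliably could look like a random detection failure.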

Another thing that caught my eye: this exact variant is 3DJ103-011 with SN01, and it does not seem to match the public firmware listings I found for other ST20000NM007D variants like 3DJ103-006 / 3DJ103-720 / 3DJ103-790 that show newer branches like SN03/SN06. That makes me wonder if this specific revision has some odd firmware behavior.

One person on a Danish hardware forum suggested Linux SATA power management, specifically the med_power_with_dipm / medium_power angle. I do not fully buy that as the root cause, because the same problem was already present on Windows 10 before I got the drive. I still tried checking that path on my Linux system, but my controller/kernel will not let me change link_power_management_policy at runtime. I only get Operation not supported or I/O error, so I cannot really confirm or rule that theory out on this machine.
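Even when writing the policy fails, reading it normally still works and at least shows which policy each SATA host is running. A minimal sketch, assuming the standard sysfs location under `/sys/class/scsi_host`; the function name `show_lpm_policies` and its optional base-directory argument are made up for illustration:

```shell
# Print "<host> <policy>" for every host that exposes an LPM policy file.
# The optional first argument overrides the sysfs base directory.
show_lpm_policies() {
    base="${1:-/sys/class/scsi_host}"
    for f in "$base"/host*/link_power_management_policy; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}

# Usage:
#   show_lpm_policies
```

A host that already reports max_performance is not using link power management, which would make the LPM theory less likely for a drive on that port.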

My current system is:

Ubuntu Resolute Raccoon dev branch
Kernel 7.0.0-13-generic
KDE Plasma on X11
Intel i7-6850K
NVIDIA RTX 2070 SUPER

At this point my own conclusion is:

The drive does not look dead on the media side.
But it also does not look fully healthy or trustworthy.
It feels more like an intermittent startup / interface / firmware / PCB issue than a classic bad-sector failure.

I have sent an email to Seagate support asking whether SN01 is correct for 3DJ103-011 and whether there is any known issue or firmware update for this variant, but I am not expecting much and I would not be surprised if the answer is just “go use the contact page”.

So I am asking here:

Has anyone seen this exact kind of behavior on an Exos X20?
Especially a drive that looks fine in SMART and under load, but randomly fails to come up properly after power-off?
And does this sound more like firmware/startup behavior, SATA link initialization, PCB/controller trouble, or something else?

If needed I can post full SMART output as well.
 


Thanks, this may actually have been the most useful lead so far.

I went through that article and tried the SeaChest route on the drive itself instead of chasing Linux runtime LPM any further.

What I did:

First I scanned the drives with SeaChest and confirmed the problem drive was /dev/sg0:

ST20000NM007D-3DJ103 / ZVT5R78G / SN01

Then I checked the current power state/features with:

Code:
SeaChest_PowerControl -d /dev/sg0 -i
SeaChest_PowerControl -d /dev/sg0 --showEPCSettings

Before changing anything, the drive showed:

  • EPC [Enabled]
  • active EPC timers on Idle A and Idle B

The EPC timer table looked like this before:

  • Idle A = *1
  • Idle B = *1200
  • Idle C = 0
  • Standby Z = 0

So based on that article, I then did:

Code:
SeaChest_PowerControl -d /dev/sg0 --powerBalanceFeature disable
SeaChest_PowerControl -d /dev/sg0 --EPCfeature disable

Both completed successfully.

After that I checked it again with the same commands.

Now it shows:

  • EPC is still listed as a supported feature, but it is no longer shown as [Enabled]
  • Current Timer is now 0 for all EPC states:
    • Idle A = 0
    • Idle B = 0
    • Idle C = 0
    • Standby Z = 0

So that part at least looks like it changed correctly.

That does not prove this was the whole problem, but it is the first thing I have changed that actually affects the drive itself in a meaningful way.

For context, I also tried the Linux ALPM angle before this, but on my system all attempts to change link_power_management_policy just returned Operation not supported or I/O error, so I could not properly test that path at runtime on this machine.

At this point there is not much more I can do except watch it.

So now I am basically just going to keep an eye on:

  • whether it fails another cold boot after many hours powered off
  • whether UDMA_CRC_Error_Count goes beyond 116
  • whether it now behaves normally over time after disabling Power Balance and EPC

So thanks again, because that article gave me something concrete to test instead of just guessing.
 

