How to recover "dead" disk drives (sometimes).

dos2unix

Warning: This is mostly for advanced users. Some of these commands can fix a broken drive; others can break a working one. I'm not responsible for broken drives, so run them at your own risk. There is no recovery from running some of these commands the wrong way, or on the wrong device. Moderators: if this is too dangerous, delete it.
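One way to lower the "wrong device" risk is to put a guard in front of anything destructive. This is a minimal sketch in POSIX sh. The name require_safe_target is hypothetical, and the mount check only looks at /proc/mounts, so it will not notice a disk that is in use by LVM, md RAID or swap.

```shell
#!/bin/sh
# Hypothetical guard: refuse anything that is not an unmounted block device,
# then make the operator type the device path back before continuing.
require_safe_target() {
    dev="$1"
    # Must exist and be a block device (e.g. /dev/sdX, /dev/nvme0n1).
    if [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    # Refuse if the device or one of its partitions appears in /proc/mounts.
    if grep -q "^$dev" /proc/mounts; then
        echo "refusing: $dev or one of its partitions is mounted" >&2
        return 1
    fi
    # Show model and serial so you can confirm it is the disk you think it is.
    lsblk -o NAME,SIZE,MODEL,SERIAL "$dev"
    printf 'Type the device path again to continue: '
    read -r answer
    [ "$answer" = "$dev" ]
}
```

Usage: `require_safe_target /dev/sdX && sgdisk --zap-all /dev/sdX`. The typed confirmation is deliberate; it forces you to read the lsblk line before the command runs.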

sgdisk


Part of the gdisk package, sgdisk is the scriptable command-line version of GPT fdisk. Where older versions of fdisk get confused by GPT, sgdisk speaks native GPT.


Print partition table, non-destructive, always start here:
Code:
sgdisk -p /dev/sdX


Verify GPT integrity:
Code:
sgdisk --verify /dev/sdX


Back up the partition table to a file; do this before anything else:
Code:
sgdisk --backup=/root/sdX_partition_table.bak /dev/sdX


Restore from backup:
Code:
sgdisk --load-backup=/root/sdX_partition_table.bak /dev/sdX
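If you do this often, a small wrapper keeps the backups timestamped and adds a plain-text listing next to the binary file. A sketch under assumptions: gpt_backup and gpt_backup_path are hypothetical names, and the target directory should live on a different, healthy disk, not the one you are rescuing.

```shell
#!/bin/sh
# Hypothetical helper: build a backup file name from device, directory and timestamp.
gpt_backup_path() {
    # $1 = device, $2 = directory, $3 = timestamp string
    printf '%s/%s_partition_table_%s.bak\n' "$2" "$(basename "$1")" "$3"
}

# Hypothetical helper: binary sgdisk backup plus a readable text copy.
gpt_backup() {
    dev="$1"
    dir="${2:-/root}"
    out="$(gpt_backup_path "$dev" "$dir" "$(date +%Y%m%d-%H%M%S)")"
    # Binary backup that sgdisk --load-backup can restore later.
    sgdisk --backup="$out" "$dev" || return 1
    # Plain-text listing, so the layout is still readable if the .bak is lost.
    sgdisk -p "$dev" > "${out%.bak}.txt"
    echo "$out"
}
```

Usage: `gpt_backup /dev/sdX /mnt/usbstick` prints the path of the file you would later pass to `--load-backup`.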


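Attempt to recover a damaged GPT by rebuilding the primary header and table from the secondary copy. sgdisk has no single flag for this, so use the recovery menu in interactive gdisk: r, then b (use backup GPT header) and c (load backup partition table), then v to verify and w to write:
Code:
gdisk /dev/sdX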


Move secondary GPT to end of disk, useful after a resize:
Code:
sgdisk -e /dev/sdX


Zap everything, the nuclear option:
Code:
sgdisk --zap-all /dev/sdX


Clone partition table from one disk to another, then randomize GUIDs on the target:
Code:
sgdisk --replicate=/dev/sdY /dev/sdX
Code:
sgdisk --randomize-guids /dev/sdY
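The argument order of --replicate is easy to get backwards: the option names the target, and the last argument is the source. A minimal sketch of a wrapper that makes the direction explicit; clone_gpt, run and the DRY_RUN variable are hypothetical names, and DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Print commands when DRY_RUN=1, otherwise execute them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: copy the GPT from SOURCE to TARGET, then give TARGET new GUIDs.
clone_gpt() {
    src="$1"
    dst="$2"
    if [ "$src" = "$dst" ]; then
        echo "source and target are the same device" >&2
        return 1
    fi
    # --replicate= takes the TARGET; the trailing argument is the SOURCE.
    run sgdisk --replicate="$dst" "$src" || return 1
    run sgdisk --randomize-guids "$dst"
}
```

Usage: `DRY_RUN=1 clone_gpt /dev/sdX /dev/sdY` shows exactly what would run; drop DRY_RUN once the direction looks right.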


When it saves you: a corrupted primary GPT header while the secondary GPT is intact, misaligned partition tables after a sector-size change such as 512e to 4Kn, or a partition table lost after a dd mishap.




hdparm


The Swiss Army knife for ATA/SATA drives. It is of little use on NVMe (use nvme-cli there), but still essential for spinning rust and older SSDs.


Drive identity, model, firmware, supported features:
Code:
hdparm -I /dev/sdX


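Read speed benchmark. -t times reads from the disk itself (hdparm flushes the buffer cache first), while -T times cached reads, which mostly measures memory speed: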
Code:
hdparm -tT /dev/sdX


Check power state:
Code:
hdparm -C /dev/sdX


Check APM level:
Code:
hdparm -B /dev/sdX


Disable APM entirely (255) for maximum performance. Some drives reject 255; in that case 254 is the highest-performance APM level:
Code:
hdparm -B 255 /dev/sdX
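The -B value is a range, not an on/off switch. This sketch decodes a level according to the ranges documented in the hdparm man page (1 to 127 allow spin-down, 128 to 254 do not, 254 is highest performance, 255 disables APM); apm_describe is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: explain what an hdparm -B level means.
apm_describe() {
    level="$1"
    if [ "$level" -ge 1 ] && [ "$level" -le 127 ]; then
        echo "power saving, spin-down allowed"
    elif [ "$level" -ge 128 ] && [ "$level" -le 253 ]; then
        echo "no spin-down"
    elif [ "$level" -eq 254 ]; then
        echo "no spin-down, maximum performance"
    elif [ "$level" -eq 255 ]; then
        echo "APM disabled"
    else
        echo "invalid level" >&2
        return 1
    fi
}
```

Usage: `apm_describe 128` before setting a level, or pair it with the current value that `hdparm -B /dev/sdX` reports.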


Disable spindown:
Code:
hdparm -S 0 /dev/sdX
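If you want a timeout instead of disabling spindown, -S uses an encoded value rather than seconds. A sketch that converts seconds into that encoding, rounding up, based on the hdparm man page: 1 to 240 are multiples of 5 seconds, 241 to 251 are 1 to 11 units of 30 minutes. The helper name standby_value is hypothetical, and it deliberately ignores the special values 252 to 255.

```shell
#!/bin/sh
# Hypothetical helper: seconds -> hdparm -S value (0 disables the timeout).
standby_value() {
    secs="$1"
    if [ "$secs" -lt 0 ] || [ "$secs" -gt 19800 ]; then
        echo "timeout out of range for hdparm -S" >&2
        return 1
    elif [ "$secs" -eq 0 ]; then
        echo 0
    elif [ "$secs" -le 1200 ]; then
        # Up to 20 minutes: steps of 5 seconds.
        echo $(( (secs + 4) / 5 ))
    else
        # Up to 5.5 hours: steps of 30 minutes, encoded as 240 + units.
        echo $(( 240 + (secs + 1799) / 1800 ))
    fi
}
```

Usage: `hdparm -S "$(standby_value 1800)" /dev/sdX` passes 241, a 30-minute timeout.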


Disable power-up-in-standby:
Code:
hdparm -s 0 /dev/sdX


Check ATA security lock status:
Code:
hdparm -I /dev/sdX | grep -i security


Unlock a locked drive:
Code:
hdparm --security-unlock PASSWORD /dev/sdX


Disable ATA security entirely:
Code:
hdparm --security-disable PASSWORD /dev/sdX


If hdparm -I reports the drive as frozen, it will refuse the security commands. Suspend the machine to RAM and resume it; on some systems that clears the frozen state without a full power cycle. Then retry the unlock.
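To check the flags from a script rather than by eye, you can parse the security section of hdparm -I. This sketch assumes the usual hdparm layout, where a clear flag prints as "not" followed by a tab and the flag name, and a set flag prints as the bare flag name after leading whitespace; security_flag_set is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `hdparm -I` output on stdin and succeed
# if the given security flag (frozen, locked, enabled) is set.
security_flag_set() {
    flag="$1"
    # Matches "<tabs>frozen" but not "<tab>not<tab>frozen".
    grep -Eq "^[[:space:]]+${flag}[[:space:]]*\$"
}
```

Usage: `hdparm -I /dev/sdX | security_flag_set frozen && echo "frozen: suspend and resume first"`.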


ATA Secure Erase, useful for restoring write performance on a degraded SSD. It wipes the entire drive, and the drive must not be frozen:
Code:
hdparm --user-master u --security-set-pass TEMPPASS /dev/sdX
Code:
hdparm --user-master u --security-erase TEMPPASS /dev/sdX


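The firmware erases all user data and, on most SSDs, marks every block as free again, which can dramatically restore degraded write performance. What physically happens to the NAND cells is up to the drive's firmware.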




Imaging First, Always


Before doing anything else, get an image off the drive if it is still readable at all, and write that image to a different, healthy disk. GNU ddrescue is the king here.


Pass 1, fast pass: -n skips the slow scraping phase so you get the easy data first:
Code:
ddrescue -d -n /dev/sdX /mnt/rescue/image.img /mnt/rescue/image.map


Pass 2, trim and scrape the bad areas, retrying bad sectors up to 3 times:
Code:
ddrescue -d -r3 /dev/sdX /mnt/rescue/image.img /mnt/rescue/image.map


Pass 3, read in reverse (-R), which can recover sectors that failed when read forwards:
Code:
ddrescue -d -r3 -R /dev/sdX /mnt/rescue/image.img /mnt/rescue/image.map
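The passes can be chained in one function, all sharing the same image and mapfile so each run picks up where the last one stopped. A sketch, assuming -n on the first pass to skip scraping; rescue_passes and DRY_RUN are hypothetical names, and DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Hypothetical wrapper: run three ddrescue passes against one image/mapfile pair.
rescue_passes() {
    src="$1"
    img="$2"
    map="$3"
    for opts in "-d -n" "-d -r3" "-d -R -r3"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ddrescue $opts $src $img $map"
        else
            # $opts is left unquoted on purpose so it splits into separate flags.
            ddrescue $opts "$src" "$img" "$map" || return 1
        fi
    done
}
```

Usage: `rescue_passes /dev/sdX /mnt/rescue/image.img /mnt/rescue/image.map`. If a pass is interrupted, rerunning the function is safe, because ddrescue reads the mapfile and skips areas already marked finished.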


The mapfile is critical: it lets you resume interrupted rescues and tells ddrescue which areas still need work, so never skip it. dd_rescue (an older, separate tool; note the underscore) and dcfldd are alternatives, but GNU ddrescue handles failing drives better because of that mapfile.
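The mapfile is plain text, so you can see how much is left between passes. This sketch assumes the layout described in the GNU ddrescue manual: '#' comment lines, one status line, then "pos size status" lines with hex values, where + means finished and - means bad sector. map_summary is a hypothetical name; ddrescuelog, which ships with GNU ddrescue, can also report this (for example `ddrescuelog -t image.map`).

```shell
#!/bin/sh
# Hypothetical helper: sum bytes per block status in a GNU ddrescue mapfile.
map_summary() {
    finished=0 bad=0 other=0
    while read -r pos size status; do
        # Skip comments and blank lines.
        case "$pos" in '#'*|'') continue ;; esac
        # The status line has no hex size field; skip it too.
        case "$size" in 0x*|0X*) ;; *) continue ;; esac
        bytes=$(( $size ))
        case "$status" in
            +) finished=$((finished + bytes)) ;;
            -) bad=$((bad + bytes)) ;;
            *) other=$((other + bytes)) ;;  # ? * / : not tried, trimmed or scraped yet
        esac
    done < "$1"
    echo "finished=$finished bad=$bad other=$other"
}
```

Usage: `map_summary /mnt/rescue/image.map`. A shrinking "other" total between passes is a good sign that another pass is worth the wear.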




Filesystem Repair


ext4, force check, verbose, answer yes to all fixes (unmount the filesystem first):
Code:
e2fsck -fvy /dev/sdX1


ext4 with an alternate superblock if the primary is gone (32768 is the first backup on filesystems with 4 KiB blocks):
Code:
e2fsck -fvy -b 32768 /dev/sdX1


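Find alternate superblock locations without writing anything. The -n is what keeps this safe (without it, mke2fs formats the partition), and the locations it prints are only right if the filesystem was created with the same parameters: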
Code:
mke2fs -n /dev/sdX1
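The backup locations follow a simple rule, which also explains the 32768 above. Assuming 4 KiB blocks (so 32768 blocks per group) and the default sparse_super layout, backups live in group 1 and in groups that are powers of 3, 5 or 7. This sketch prints those block numbers; is_power_of and ext4_backup_superblocks are hypothetical names, and filesystems with 1 KiB blocks or sparse_super2 use different locations.

```shell
#!/bin/sh
# Succeed if $1 is a power of $2 (1 counts as base^0).
is_power_of() {
    n="$1"; base="$2"
    while [ $((n % base)) -eq 0 ]; do
        n=$((n / base))
    done
    [ "$n" -eq 1 ]
}

# Hypothetical helper: backup superblock blocks for groups 1..N-1,
# ASSUMING 4 KiB blocks and the sparse_super layout.
ext4_backup_superblocks() {
    groups="$1"
    blocks_per_group=32768   # 8 * 4096: one bitmap block covers 32768 blocks
    out=""
    g=1
    while [ "$g" -lt "$groups" ]; do
        if is_power_of "$g" 3 || is_power_of "$g" 5 || is_power_of "$g" 7; then
            out="$out $((g * blocks_per_group))"
        fi
        g=$((g + 1))
    done
    echo "${out# }"
}
```

Each value can be tried in turn with `e2fsck -fvy -b 98304 -B 4096 /dev/sdX1`, where -B tells e2fsck which block size to assume.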


XFS standard repair:
Code:
xfs_repair /dev/sdX1


XFS zero log, a last resort when mounting cannot replay the log. It loses in-flight transactions but gets the filesystem mountable:
Code:
xfs_repair -L /dev/sdX1


Btrfs check, read-only unless you add --repair:
Code:
btrfs check /dev/sdX1


Btrfs zero log:
Code:
btrfs rescue zero-log /dev/sdX1


Btrfs restore, pulls files even from an unmountable filesystem:
Code:
btrfs restore /dev/sdX1 /mnt/recovered/


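NTFS fix for Windows drives. ntfsfix only repairs a few basic inconsistencies, resets the journal and schedules a consistency check for the next Windows boot; run chkdsk /f from Windows for a full repair: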
Code:
ntfsfix /dev/sdX1
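Each of these tools also has a look-but-don't-touch mode, which is worth running before the fixing variants above. A sketch that picks the no-write command for a filesystem type; readonly_check_cmd is a hypothetical name, and it only prints the command so you can review it first.

```shell
#!/bin/sh
# Hypothetical helper: print a no-write check command for a filesystem type.
readonly_check_cmd() {
    fstype="$1"; dev="$2"
    case "$fstype" in
        ext2|ext3|ext4) echo "e2fsck -fn $dev" ;;             # -n: open read-only, answer no
        xfs)            echo "xfs_repair -n $dev" ;;           # -n: no-modify mode
        btrfs)          echo "btrfs check --readonly $dev" ;;  # the default, made explicit
        ntfs)           echo "ntfsfix -n $dev" ;;              # -n: report only, write nothing
        *) echo "no read-only check known for '$fstype'" >&2; return 1 ;;
    esac
}
```

Usage: `readonly_check_cmd "$(blkid -o value -s TYPE /dev/sdX1)" /dev/sdX1`, then run the printed command.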




NVMe


Install nvme-cli on RHEL/Fedora:
Code:
dnf install nvme-cli


SMART equivalent health log:
Code:
nvme smart-log /dev/nvme0


Error log:
Code:
nvme error-log /dev/nvme0


Identify controller:
Code:
nvme id-ctrl /dev/nvme0


Identify namespace:
Code:
nvme id-ns /dev/nvme0n1


Sanitize with block erase (--sanact=2), the NVMe counterpart to ATA secure erase. This destroys all data on the drive:
Code:
nvme sanitize /dev/nvme0 --sanact=2


Watch Critical Warning, Available Spare, and Percentage Used in the smart-log output. Those three tell you most of what you need to know about NVMe health at a glance.
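Those three fields are easy to pull out of the text output. This sketch assumes the field names printed by recent nvme-cli versions (critical_warning, available_spare, available_spare_threshold, percentage_used), which may differ in yours; nvme_health is a hypothetical name. It exits non-zero if a warning bit is set or the spare has dropped below its threshold.

```shell
#!/bin/sh
# Hypothetical helper: summarise `nvme smart-log` text output read on stdin.
nvme_health() {
    awk -F':' '
        {
            key = $1; gsub(/[ \t]/, "", key)
            val = $2; gsub(/[ \t%]/, "", val)
        }
        key == "critical_warning"          { warn = val }
        key == "available_spare"           { spare = val }
        key == "available_spare_threshold" { thresh = val }
        key == "percentage_used"           { used = val }
        END {
            printf "critical_warning=%s available_spare=%s%% percentage_used=%s%%\n", warn, spare, used
            # Fail if any warning bit is set (decimal or hex) or spare < threshold.
            bad = (warn != "0" && warn !~ /^0x0+$/) || (spare + 0 < thresh + 0)
            exit bad
        }'
}
```

Usage: `nvme smart-log /dev/nvme0 | nvme_health || echo "check this drive"`.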




SMART via smartmontools


Quick health check:
Code:
smartctl -H /dev/sdX


Full attribute dump:
Code:
smartctl -a /dev/sdX


Short self-test, about two minutes:
Code:
smartctl -t short /dev/sdX


Long self-test, hours on large drives:
Code:
smartctl -t long /dev/sdX


Poll self-test results:
Code:
smartctl -l selftest /dev/sdX


USB drives with SAT bridge:
Code:
smartctl -d sat -a /dev/sdX


USB drives with JMicron bridge chips:
Code:
smartctl -d usbjmicron -a /dev/sdX


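Key SMART attributes to watch (IDs in hex; smartctl prints them in decimal): 05 (5) is Reallocated Sector Count, and any nonzero value is a warning sign that the drive has already remapped bad sectors. C5 (197) is Current Pending Sector, unstable sectors waiting to be remapped. C6 (198) is Uncorrectable Sector Count (Offline_Uncorrectable in smartctl). BB (187) is Reported Uncorrectable Errors, mostly seen on Seagate drives. 01 (1) is Raw Read Error Rate; its raw value is vendor-specific (huge on Seagate drives), so compare the normalized value against the threshold before panicking.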
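To check those attributes across several drives, you can filter the attribute table. This sketch assumes the standard ATA table printed by smartctl -A (or -a), where the raw value is the 10th column; smart_watch is a hypothetical name. It prints any of IDs 5, 187, 197 and 198 with a nonzero raw value and exits 1 if it found one.

```shell
#!/bin/sh
# Hypothetical helper: flag nonzero raw values for the key SMART attributes.
smart_watch() {
    awk '
        $1 == 5 || $1 == 187 || $1 == 197 || $1 == 198 {
            if ($10 + 0 > 0) { print $1, $2, $10; found = 1 }
        }
        END { exit found }'
}
```

Usage: `for d in /dev/sd?; do echo "$d"; smartctl -A "$d" | smart_watch; done`.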




USB Drive Resurrection


USB drives are their own special pain. The bridge controller chip matters as much as the NAND itself.


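Identify the USB bridge controller. The vendor:product ID in the plain lsusb listing usually identifies the bridge chip, and lsusb -t shows whether it runs under the uas or usb-storage driver:
Code:
lsusb
Code:
lsusb -t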


Install f3 to fight fake-capacity drives:
Code:
dnf install f3


Fast probe:
Code:
f3probe /dev/sdX


Full write test, fills the free space on the mounted drive with test files:
Code:
f3write /mnt/usbdrive


Full read verify:
Code:
f3read /mnt/usbdrive


Interactive partition table recovery with testdisk:
Code:
testdisk /dev/sdX


File carving with photorec, ignores the filesystem entirely and finds files by signature:
Code:
photorec /dev/sdX


For completely dead USB drives whose controller has bricked itself, the last resort is NAND chip-off recovery: you physically desolder the flash chip, read it directly, and then reconstruct the controller's data layout. That is lab territory and requires specialized hardware, but it exists and it can work when everything else fails.




Resurrection Priority Order


  1. SMART check. Is the hardware even talking?
  2. ddrescue image. Preserve what you have before anything else.
  3. sgdisk --verify or testdisk. Is the partition table intact?
  4. fsck, xfs_repair, or btrfs rescue. Hit the filesystem layer.
  5. photorec or foremost. File carving if the filesystem is completely gone.
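  6. Mount read-only regardless and see what you can see. noload (skip journal replay) only works on ext3/ext4; use norecovery for XFS and drop it for other filesystems: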
    Code:
    mount -o ro,noatime,noload /dev/sdX1 /mnt/recovery
  7. ATA secure erase or NVMe sanitize. Last resort for SSD performance recovery.

The cardinal rule: never write to a dying drive until you have an image. Every write attempt on a marginal drive risks turning unrecovered sectors into permanent losses.
 


Looks like a good article, Ray :)

However, I think I would probably move

The cardinal rule: never write to a dying drive until you have an image. Every write attempt on a marginal drive risks turning unrecovered sectors into permanent losses.

to the top, and then feature the image part.

Just a suggestion

Chris
 

