# SSD deterioration

## lockie

Hello.

About a year ago I got myself a new rig with two shiny SSDs: a Kingston NVMe for the home partition and a 240 GB Kingston A400 for the root/EFI partitions. Recently I've noticed visible lag launching applications (e.g. mpv, pcmanfm, the clang compiler) which disappears on the second launch; measurements show that it is pretty bad:

```

$ /usr/bin/time -v clang-10 --version

clang version 10.0.1

Target: x86_64-pc-linux-gnu

Thread model: posix

InstalledDir: /usr/lib/llvm/10/bin

        Command being timed: "clang-10 --version"

        User time (seconds): 0.07

        System time (seconds): 0.02

        Percent of CPU this job got: 0%

        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.73

        Average shared text size (kbytes): 0

        Average unshared data size (kbytes): 0

        Average stack size (kbytes): 0

        Average total size (kbytes): 0

        Maximum resident set size (kbytes): 43976

        Average resident set size (kbytes): 0

        Major (requiring I/O) page faults: 441

        Minor (reclaiming a frame) page faults: 2003

        Voluntary context switches: 354

        Involuntary context switches: 11

        Swaps: 0

        File system inputs: 88840

        File system outputs: 0

        Socket messages sent: 0

        Socket messages received: 0

        Signals delivered: 0

        Page size (bytes): 4096

        Exit status: 0

$ /usr/bin/time -v clang-10 --version  # the second launch

clang version 10.0.1

Target: x86_64-pc-linux-gnu

Thread model: posix

InstalledDir: /usr/lib/llvm/10/bin

        Command being timed: "clang-10 --version"

        User time (seconds): 0.00

        System time (seconds): 0.00

        Percent of CPU this job got: 42%

        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02

        Average shared text size (kbytes): 0

        Average unshared data size (kbytes): 0

        Average stack size (kbytes): 0

        Average total size (kbytes): 0

        Maximum resident set size (kbytes): 43696

        Average resident set size (kbytes): 0

        Major (requiring I/O) page faults: 0

        Minor (reclaiming a frame) page faults: 2275

        Voluntary context switches: 2

        Involuntary context switches: 7

        Swaps: 0

        File system inputs: 0

        File system outputs: 0

        Socket messages sent: 0

        Socket messages received: 0

        Signals delivered: 0

        Page size (bytes): 4096

        Exit status: 0

```

I mean, yeah, clang loads a couple hundred megabytes' worth of shared libraries, but ten seconds?..

I'm also quite weirded out by the excessive number of major page faults on the first launch; is that normal?
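For what it's worth, the cold-vs-warm difference can be reproduced on demand by dropping the page cache between runs; this is a standard kernel knob (root required), not anything clang-specific:

```
# Flush dirty pages, then drop the clean page cache, dentries and inodes,
# so the next launch has to read everything from disk again (needs root)
sync
echo 3 > /proc/sys/vm/drop_caches
/usr/bin/time -v clang-10 --version   # cold-cache run: many major faults
/usr/bin/time -v clang-10 --version   # warm-cache run: served from RAM
```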

Measuring the disk's performance with synthetic tests also shows a speed problem: it is an order of magnitude slower than advertised:

```

$ sudo hdparm -Ttv /dev/sda3

/dev/sda3:

 multcount     =  1 (on)

 IO_support    =  1 (32-bit)

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 29185/255/63, sectors = 405305344, start = 63555584

 Timing cached reads:   28352 MB in  2.00 seconds = 14192.68 MB/sec

 Timing buffered disk reads: 152 MB in  3.11 seconds =  48.92 MB/sec

```

To compare, here's the NVMe one (which suffers no such problems):

```

$ sudo hdparm -Ttv /dev/nvme0n1p1

/dev/nvme0n1p1:

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 953868/64/32, sectors = 1953521664, start = 2048

 Timing cached reads:   29320 MB in  2.00 seconds = 14677.48 MB/sec

 Timing buffered disk reads: 3460 MB in  3.00 seconds = 1152.79 MB/sec

```

I've also tried benchmarking with the GNOME Disks tool, and the speed graph looks... most weird: https://imgur.com/a/eBjXTBW . For the first few seconds the speed is the same miserable 50 MB/s.

But SMART reports nothing bad, still 90% life left:

```

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x0032   100   100   000    Old_age   Always       -       0

  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4320

 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       568

148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0

149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0

167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0

168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0

169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       17

170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/15

172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0

173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       123 (Average 97)

181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0

182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0

192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       45

194 Temperature_Celsius     0x0022   033   042   000    Old_age   Always       -       33 (Min/Max 23/42)

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0

218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       0

231 SSD_Life_Left           0x0000   090   090   000    Old_age   Offline      -       90

233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       9923

241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       8618

242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       2619

244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       97

245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       123

246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       352992

```

The root partition is also only 40% full. I do have a swap partition on this drive, but with sysctl vm.swappiness set to 1 it is barely ever used.

So this does not look like faulty hardware. Or maybe it is, and I just have to replace it with a new one (probably from another vendor)? Might I be missing some important detail here? Any input or insight appreciated.

----------

## NeddySeagoon

lockie,

How do you trim it?

If all the free space has been written, the write speed drops alarmingly, as the drive has to do a very slow erase before it can write anything new.

This does not apply to swap space as swapon does a trim.

```
fstrim -av
```

will tell the drive about all the free space that is not in use, so the drive can erase it.

The first time it's run, it will report all the free space. It's up to the drive not to erase already-erased space.

You can also use the discard mount option, but that may be bad for drive life. 

```
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       8618

242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       2619 
```

That's a lot of writes that are never being read. Is /var/tmp/portage there?

----------

## lockie

Hey NeddySeagoon,

Thanks for the advice!

I've run fstrim command and it gave me this:

```
# fstrim -av

/home: 476,6 GiB (511736147968 bytes) trimmed on /dev/mapper/home

/boot: 500,6 MiB (524935168 bytes) trimmed on /dev/sda1

/: 118,5 GiB (127194042368 bytes) trimmed on /dev/sda3

```

Still, even after rebooting the box, the problem with low read speed persists; launching clang still takes 10 seconds.

I do have discard in mount options:

```
$ mount | grep sda3

/dev/sda3 on / type ext4 (rw,noatime,discard)

```

I was under the impression it should be fine.

 *Quote:*   

> Is /var/tmp/portage there?

 

Yeah, and I'm a pretty obsessive updater, running a full update once a week or so   :Embarassed:  I've also been using ccache; I recently dropped it though, seeing the Life Left percentage dropping.

----------

## NeddySeagoon

lockie,

Trim/discard is not an instant thing.

It informs the SSD about free space that it can erase any time it wants to.

Some SSD firmware takes it as an immediate command. That's bad for SSD life, as it leads to excessive write amplification.

Other SSDs just make a note and do the discard later, when they work out it's really required.

You can't tell how any particular model works.

I've stopped using discard and made fstrim a part of my update routine.
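That periodic approach can be automated; util-linux ships a weekly trim timer on systemd boxes (a sketch, assuming systemd is in use; OpenRC users would schedule `fstrim -av` from cron instead):

```
# Enable the weekly trim timer shipped with util-linux (systemd only)
systemctl enable --now fstrim.timer
systemctl list-timers fstrim.timer    # verify it is scheduled
```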

How much RAM do you have?

You may want to put /var/tmp/portage into tmpfs to avoid the writes that will never be read.
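A minimal /etc/fstab sketch for that, sized assuming the 32 GiB of RAM mentioned later in the thread; the size and mount options here are a judgment call, not from the original post:

```
# /etc/fstab: build in RAM so compile artifacts never touch the SSD
tmpfs  /var/tmp/portage  tmpfs  size=16G,uid=portage,gid=portage,mode=775,noatime  0 0
```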

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 

231 SSD_Life_Left           0x0000   090   090   000    Old_age   Offline      -       90 
```

That's 90% of the SSD write life left. 

In a year, you have used 10% of the drive's write endurance. It will last another 9 years.

You will likely have replaced it before it wears out.
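The arithmetic behind that estimate, as a quick shell sketch; it is a purely linear extrapolation, and real wear depends on future write patterns:

```
# Linear extrapolation from SMART attribute 231 (SSD_Life_Left)
life_left=90          # percent of write endurance remaining
age_years=1           # approximate drive age
used=$((100 - life_left))
echo "projected years remaining: $((life_left * age_years / used))"  # prints 9
```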

----------

## lockie

Thanks for clarifications on trim NeddySeagoon!

----------

## Hu

Even if we assume the drive is suffering massive write amplification, shouldn't that have little or no effect on a read-heavy workload, like that hdparm test or the startup of clang, which should mostly be reading shared libraries in /usr?

----------

## NeddySeagoon

Hu,

True. 

I had lockfiles being written in mind, but if they exist at all, they should be in tmpfs.

----------

## mike155

@Hu: you're probably right!

 *Quote:*   

> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:10.73 

 

10 seconds? Looks like a socket timeout.

@lockie: you could use strace to find out what's going on here...

```
strace -o /tmp/strace.log -f -t clang-10 --version
```

----------

## lockie

Hey @mike155, nice idea, thanks! Here's the result of running strace: https://pastebin.com/B4CVB2hd . I can't see anything criminal in there, except those few mprotect calls which take a couple of seconds...

 *Quote:*   

> Looks like a socket timeout.
> 
> 

 

Yeah right, except that clang seemingly does not access any sockets   :Confused: 

----------

## mike155

 *lockie wrote:*   

> Hey @mike155, nice idea, thanks! Here's the result of running strace: https://pastebin.com/B4CVB2hd . I can't see anything criminal in there, except those few mprotect calls which take a couple of seconds...
> 
>  *Quote:*   Looks like a socket timeout.
> 
>  
> ...

 

I agree.

The lines below are interesting:

```
7202  07:18:43 mprotect(0x7f25bb285000, 16384, PROT_READ) = 0

7202  07:18:43 mprotect(0x7f25bb2b5000, 4096, PROT_READ) = 0

7202  07:18:46 mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f25b9224000

7202  07:18:46 mprotect(0x7f25becef000, 3149824, PROT_READ) = 0

7202  07:18:48 mprotect(0x7f25c1c6e000, 1638400, PROT_READ) = 0

7202  07:18:48 mprotect(0x55fc1f062000, 4096, PROT_READ) = 0
```

Something happens there. It's either related to memory management (swap?) or it's not related to system calls at all.

Please redo the test and use options '-tt' and '-T':

```
strace -o /tmp/strace.log -f -tt -T clang-10 --version
```

----------

## lockie

 *Quote:*   

> Please redo the test and use options '-tt' and '-T':

 

Sure, here's the result: https://pastebin.com/acpCgNCQ

 *Quote:*   

> It's either related memory management (swap?) or it's not related to system calls at all.

 

Not sure those are swaps; calling time with -v shows "Swaps: 0".

I also tried running this under a profiler, and a lot of time was spent by clang in the guts of the dynamic loader, erm, loading libraries.
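One way to cross-check that without a full profiler: glibc's dynamic loader can report its own timing via the LD_DEBUG environment variable (glibc-specific; shown here with ls as a stand-in, but substituting clang-10 would probe the slow launch directly):

```
# Ask glibc's runtime linker to time itself; the stats go to stderr
LD_DEBUG=statistics ls / > /dev/null 2> /tmp/ld-debug.log
grep 'total startup time in dynamic loader' /tmp/ld-debug.log
```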

----------

## eccerr0r

Just to rule out other stuff: what kernel are you using? Changed recently? And what SATA driver (hopefully not a generic one or using emulation?) and IO scheduler (more in case there's a bug)?

Still weird that it wants to page out, when you should have plenty of RAM?

----------

## lockie

@eccerr0r, the kernel is zen-sources 5.13.13, installed in September (but I reckon I've seen those disk lags for a few months now, so it might be the culprit).

Not sure how to peek at my SATA driver; here's the kernel config grepped for "SATA":

```
zcat /proc/config.gz | grep SATA

CONFIG_SATA_HOST=y

# CONFIG_SATA_ZPODD is not set

# CONFIG_SATA_PMP is not set

CONFIG_SATA_AHCI=y

CONFIG_SATA_MOBILE_LPM_POLICY=0

# CONFIG_SATA_AHCI_PLATFORM is not set

# CONFIG_SATA_INIC162X is not set

# CONFIG_SATA_ACARD_AHCI is not set

# CONFIG_SATA_SIL24 is not set
```

The IO scheduler is the all new and shiny Kyber; this definitely might be a problem too:

```
cat /sys/block/sda/queue/scheduler

[kyber] none

```
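For what it's worth, the scheduler can be switched per device at runtime through the same sysfs file, so experimenting doesn't need a kernel rebuild (the available names depend on what your kernel has built in):

```
# Brackets mark the active scheduler; echo a listed name to switch
cat /sys/block/sda/queue/scheduler
echo none > /sys/block/sda/queue/scheduler   # needs root
```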

Yeah, I do have RAM to spare: 32 gigs, and I've almost never seen swapping on this rig.

----------

## lockie

All right, I tried compiling in BFQ and turning it on for the disk in question; no luck, gnome-disks shows the same picture as before: for the first few seconds the speed is basically crap, then it starts jumping, trying to reach the advertised speed (failing miserably though): https://i.imgur.com/gjHImtb.png

I'll also try a stock kernel, but my bet is the hardware just went kaboom. Not exactly sure why though; using ccache with portage on the SSD was obviously a bad idea, but was it so bad that it managed to ruin the disk in about a year?..

----------

## NeddySeagoon

lockie,

If something is getting flushed to swap, turn off swap for the test.

That does not prevent paging; it only stops dynamically allocated RAM being saved to disk. 

You tested with 

```
hdparm -Ttv /dev/sda3
```

Do the other partitions return the same result?

What about the entire device?

Partitions on an SSD are mythical. The concept of a partition as a contiguous sequence of blocks does not physically exist.

Rather like addressing an HDD using the CHS method. It works, but it's mythical.

What about testing with dd and bs=4M ?

Don't make a mess of that. It can ruin your whole install.

What else is using the drive?

Can you test from some boot media?

----------

## eccerr0r

I have much less RAM on some machines and swap to SSD; so far they still work.  Luckily your drive is giving statistics on your usage. I have one 32G SSD (mPCIe) whose remaining endurance I'm still trying to figure out, no dice.  It still works, but I'm sure I've put hundreds of erase cycles over the whole disk already, due to its size.

It only tops out at around 70MB/sec read and write, though it's consistent, unlike what you're seeing; the seek time still makes it faster than a mechanical disk.  With it, too, I sometimes see bad behavior, but specifically on writes only.  When I fill the disk up past 70-80% and it needs to start garbage collection, it gets real slow, down to 3MB/sec writes.  Reads aren't affected until they are blocked by an interspersed barrier write.  This SSD also does not support trim.

But agreed can't rule out a failing disk but need to rule out other possibilities too.

----------

## lockie

 *Quote:*   

> Do the other partitions return the same result?

 

Weirdly enough, I'm getting adequate speed on the 512M EFI boot partition (formatted to FAT32, of course):

```
# hdparm -Ttv /dev/sda1

/dev/sda1:

 multcount     =  1 (on)

 IO_support    =  1 (32-bit)

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 29185/255/63, sectors = 1048576, start = 2048

 Timing cached reads:   25820 MB in  2.00 seconds = 12923.76 MB/sec

 Timing buffered disk reads: 512 MB in  1.14 seconds = 449.38 MB/sec

```

But then there are those crappy speeds for the other partitions (swap and root):

```
# hdparm -Ttv /dev/sda2

/dev/sda2:

 multcount     =  1 (on)

 IO_support    =  1 (32-bit)

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 29185/255/63, sectors = 62504960, start = 1050624

 Timing cached reads:   29560 MB in  2.00 seconds = 14798.65 MB/sec

 Timing buffered disk reads:  42 MB in  3.13 seconds =  13.41 MB/sec

# hdparm -Ttv /dev/sda3

/dev/sda3:

 multcount     =  1 (on)

 IO_support    =  1 (32-bit)

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 29185/255/63, sectors = 405305344, start = 63555584

 Timing cached reads:   29678 MB in  2.00 seconds = 14858.73 MB/sec

 Timing buffered disk reads: 154 MB in  3.07 seconds =  50.17 MB/sec

```

The results for the whole drive are, erm, an average of those:

```
# hdparm -Ttv /dev/sda

/dev/sda:

 multcount     =  1 (on)

 IO_support    =  1 (32-bit)

 readonly      =  0 (off)

 readahead     = 256 (on)

 geometry      = 29185/255/63, sectors = 468862128, start = 0

 Timing cached reads:   25580 MB in  2.00 seconds = 12803.34 MB/sec

 Timing buffered disk reads: 542 MB in  3.11 seconds = 174.43 MB/sec

```

I'll run a test from some bootable liveUSB when I have spare time; thanks for the suggestion @NeddySeagoon!

----------

## NeddySeagoon

lockie,

That's very odd. hdparm does raw reads of the block device you point it at. Any filesystem will be ignored.

I'm not sure how it does it, or the block size it uses. With a small block size, the results will be horrible because of all the syscalls.

Hence dd with a 4M block size.

If you do 

```
swapoff

swapon /dev/...
```

does that change the speed?

It forces a trim on all of swap, but that should only help write speeds. 

I'm still interested in the speeds that dd reports, since I know what that does.

```
dd if=/dev/sda of=/dev/null bs=4M
```

You can add count=1024 to test over 4GiB and stop.
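Since the suspicion is that different regions of the drive behave differently, skip= can move the read window further into the device, and iflag=direct bypasses the page cache so repeat runs aren't served from RAM (a read-only sketch; still, double-check the device name before running):

```
# Read 512 MiB starting 10 GiB into the device, uncached (needs root)
dd if=/dev/sda of=/dev/null bs=4M count=128 skip=2560 iflag=direct
```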

----------

## eccerr0r

I think for statistical checks you shouldn't count the small EFI partition, or at least you must weight it heavily.  hdparm measures how much data it can pull in a fixed amount of time (~3 seconds), and if it runs out of data by hitting the end of the device, it compensates by extrapolation.

This SSD is a TLC drive, or denser.  And it appears that its wear leveling is not great either.  I'd be tempted to say you may need to replace the disk at some point, but given the data presented here I'd avoid buying this particular drive...

Another weird behavior I've seen - partitioning an SSD, even on erase boundaries, seems to lead to odd behavior in terms of garbage collection.  I haven't done in depth study of this phenomenon but it seems to behave strangely.  My current SSD arsenal is a 32G, a 180G, a 240G, and a 256G.  I'm exhibiting really strange write slowdowns on the 240G of which I have many partitions due to dual boot.  The other drives are one partition disks and getting full speed all the way, with the exception of getting slowdowns when the 32G disk goes past 70% utilization.  Again weird but probably not relevant to this discussion.

Incidentally, I wonder if write fragmentation is something that needs to be cleaned up once in a while on disks that perform poor wear leveling: copy everything off, erase the disk, and copy it back.  That wastes write endurance, but might give a performance boost until it gets clogged up once more...

----------

## lockie

So I've booted from a liveUSB, only to witness the same pathetic read speeds. I've also done the dd test, the results of which are quite weird: the speed at the beginning of the disk is better than further in:

```
$ dd if=/dev/sda of=/dev/null bs=4M count=128

128+0 records in

128+0 records out

536870912 bytes (537 MB, 512 MiB) copied, 1.17396 s, 457 MB/s

$ dd if=/dev/sda of=/dev/null bs=4M count=256

256+0 records in

256+0 records out

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 40.3575 s, 26.6 MB/s

```

Anyway, I googled around and found a reddit thread mentioning a firmware bug in my drive model, so I grabbed an old rig with a Windows dualboot and tried flashing new firmware, only to find out that the drive already has the latest firmware version and there's nothing to update. So I freaked out, went to the nearest computer shop and bought another $35 SSD, branded Goodram this time (I mean, it literally cannot be bad, right?   :Laughing:  ). Even while cloning data with Clonezilla I was getting ridiculous speeds, around 600M/min at the beginning and 900M/min at the end of the disk. Anyway, yeah, avoid Kingston A400 drives, just in case.

Thanks for all your help guys!

----------

## eccerr0r

After you back up the drive, I'm curious what happens if you wipe the disk and reuse it.  I don't know if you should just mkfs it again or dump /dev/zero to it first, but it would be interesting...

----------

## NeddySeagoon

eccerr0r,

Making a filesystem should force a trim of the entire space.

Writing zeros is not the same as erased; it just consumes one more erase cycle.
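For returning a whole device to the erased state without spending an erase cycle, util-linux's blkdiscard issues a discard over the entire device in one shot; it destroys all data, so only after a full backup (sdX below is a placeholder, not a device from this thread):

```
# Trim the entire device in one pass; THIS DESTROYS ALL DATA (needs root)
blkdiscard /dev/sdX
```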

----------

## eccerr0r

...also assuming the drive doesn't support, or inefficiently supports, trim.

The drive doesn't know what is valid and invalid data; rewriting all blocks will at least reprogram most of the LRU data, so it could affect its block allocation strategy....

----------

## NeddySeagoon

eccerr0r,

Agreed but how does that affect the read speed?

----------

## eccerr0r

Don't know.  It could actually be hitting weak spots and retrying reads that hit... an ECC error... and slowing things down.  Or maybe there happened to be an interspersed write during the read and it had to reverse direction for a bit.  Don't know...

----------

