# How come EXT4 slows my SSD so much?

## RayDude

```
server /mnt/backup/root # hdparm -tT /dev/nvme0n1

/dev/nvme0n1:

 Timing cached reads:   22260 MB in  2.00 seconds = 11144.34 MB/sec

 Timing buffered disk reads: 8146 MB in  3.00 seconds = 2715.05 MB/sec

server /mnt/backup/root # hdparm -tT /dev/nvme0n1p4

/dev/nvme0n1p4:

 Timing cached reads:   20114 MB in  2.00 seconds = 10068.90 MB/sec

 Timing buffered disk reads: 3356 MB in  3.00 seconds = 1118.66 MB/sec
```

This bugs me. I mean, I really don't notice the performance difference, but it seems wrong for ext4 to create such incredible overhead.

Is this normal? Is this expected?

----------

## Ant P.

Is the partition correctly aligned?

----------

## tmcca

I was going to say the same thing: make sure it is aligned. Also, use fstrim instead of the discard mount option on root. You can use discard on /boot; I think that is the correct approach.

How did you partition the drive? Did you use parted?
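The difference tmcca describes would look something like this in /etc/fstab (device paths, mount points and options here are only illustrative, not RayDude's actual configuration):

```
# /etc/fstab -- no 'discard' option on the root filesystem;
# free space is trimmed periodically with fstrim instead
/dev/nvme0n1p4  /      ext4  noatime          0 1
# 'discard' on a rarely-written partition like /boot is cheap
/dev/nvme0n1p2  /boot  vfat  noatime,discard  0 2
```

Periodic trimming can then be run by hand (`fstrim -v /`) or scheduled, e.g. via systemd's `fstrim.timer` or a cron job.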

----------

## mike155

Is ext4's lazy inode table zeroing still running?  See: 'man mkfs.ext4', option 'lazy_itable_init'.
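On recent kernels the zeroing runs in a kernel thread named ext4lazyinit, so one quick check is to look for that thread; the mkfs line below is only a sketch of how to avoid the background zeroing when creating a fresh filesystem:

```
# any output here means lazy inode-table zeroing is still in progress
ps -e | grep ext4lazyinit

# to avoid it entirely, zero the tables at mkfs time instead
# (DESTRUCTIVE -- only when creating a new filesystem):
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/nvme0n1p4
```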

----------

## NeddySeagoon

RayDude,

```
# hdparm -tT /dev/nvme0n1
```

does raw sequential reads from the block device.

The contents of the blocks read are ignored. That is, the read speed returned by `hdparm -tT` does not depend on the filesystem, if any.

----------

## RayDude

Thanks for the quick replies.

I used gparted to partition the disk so the alignment should be correct.

I'll put fstrim on root and see if that makes a difference.

I'll check the lazy itable feature as well.

----------

## NeddySeagoon

RayDude,

fstrim is about erasing used but free space in good time before you want to reuse it.

It will make no difference to the read speed.

Boot from a liveCD and rerun the tests when you are sure the partitions are not in use.

Don't even mount them.

----------

## RayDude

 *NeddySeagoon wrote:*   

> RayDude,
> 
> fstrim is about erasing used but free space in good time before you want to reuse it.
> 
> It will make no difference to the read speed.
> ...

 

Thanks Neddy, I'll try that.

----------

## Naib

What is the IO scheduler being used?

----------

## Zucca

If you want to test filesystem performance, use some other tool, such as fio.

As Neddy said, hdparm "skips" the filesystem. You can test disk performance with hdparm, or (apparently) partition performance. As to why partition reads are that much slower on an SSD, I have no clue. It would make sense if it were an HDD you were testing...

Maybe it's about the IO scheduler as Naib was questioning.

I want to see how this ends up...
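A minimal fio invocation for a sequential read test through the filesystem might look like this (file-based and non-destructive; the directory, block size, file size and runtime are only examples):

```
fio --name=seqread --directory=/mnt/test --rw=read --bs=1M \
    --size=1G --direct=1 --ioengine=libaio --runtime=30
```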

----------

## Naib

Also note that hdparm expects PATA/SATA-type devices; NVMe is not that, so it might mis-report. nvme-tools provides a means to do block reads.

----------

## Pearlseattle

 *RayDude wrote:*   

> 
> 
> ```
> server /mnt/backup/root # hdparm -tT /dev/nvme0n1
> 
> ...

 

I thought that the tests done by hdparm did not involve the specific filesystem used on the partition at all?

----------

## Hu

That is what NeddySeagoon and Zucca both said, yes.  The hdparm tests should be usable even on a device with no filesystem at all.

RayDude: please post the actual alignment so we can review whether the alignment is correct.  The smartctl -a output could also be interesting.  Hide any identifying data (such as serial numbers).  We only need general model information.

----------

## RayDude

Update: I ran hdparm from a system-restore boot flash on an unmounted /dev/nvme0n1p4 and got the same results.

Thanks for telling me about fio, I'll try it.

I just checked and my kernel is configured for no IO scheduler. How is that possible?

There are three choices: MQ deadline, Kyber, and BFQ. Which should I select?

What does it use if none is selected? I seriously wonder how I did this...

Update: none is apparently good for NVMe: https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers

Edit: since I'm using a RAID6 array, it looks like I should use deadline...

----------

## mike155

 *RayDude wrote:*   

> I just checked and my kernel is configured for no IO Scheduler. How is that possible? 

 

"none" (aka "noop") is the correct scheduler to use for NVMe disks. 

See: https://stackoverflow.com/questions/27664334/selecting-the-right-linux-i-o-scheduler-for-a-host-equipped-with-nvme-ssd

----------

## Pearlseattle

 *Quote:*   

> Edit: since I'm using a raid6 array, it looks like I should use deadline...

 

What do you mean, RayDude? I think that you previously posted tests done directly against an NVMe device and not against a RAID...

----------

## RayDude

 *Pearlseattle wrote:*   

>  *Quote:*   Edit: since I'm using a raid6 array, it looks like I should use deadline... 
> 
> What do you mean, RayDude? I think that you previously posted tests done directly against an NVMe device and not against a RAID...

 

The system boots off an NVMe drive, but has a RAID6 array. To optimize the kernel for both the NVMe and the RAID6 array, it's best for me to use the deadline I/O scheduler. deadline doesn't slow the NVMe much, but it improves the performance of the HD array.

----------

## RayDude

Here's the partition table, according to parted:

```
server ~ # parted /dev/nvme0n1

GNU Parted 3.2

Using /dev/nvme0n1

Welcome to GNU Parted! Type 'help' to view a list of commands.

(parted) p                                                                

Model: Unknown (unknown)

Disk /dev/nvme0n1: 1000GB

Sector size (logical/physical): 512B/512B

Partition Table: gpt

Disk Flags: 

Number  Start   End     Size    File system     Name      Flags

 1      1049kB  3146kB  2097kB                  BIOSBOOT  bios_grub

 2      3146kB  213MB   210MB   fat16           EFI       msftdata

 3      213MB   8803MB  8590MB  linux-swap(v1)  SWAP

 4      8803MB  1000GB  991GB   ext4            SERVER
```

Here's smartctl -a:

```
server ~ # smartctl -a /dev/nvme0n1

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.1.5-gentoo] (local build)

Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Model Number:                       CT1000P1SSD8

Serial Number:                      XXXXXXXXXXX

Firmware Version:                   P3CR010

PCI Vendor/Subsystem ID:            0xc0a9

IEEE OUI Identifier:                0x000000

Controller ID:                      1

Number of Namespaces:               1

Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]

Namespace 1 Formatted LBA Size:     512

Local Time is:                      Sun Jun  2 10:06:43 2019 PDT

Firmware Updates (0x14):            2 Slots, no Reset required

Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test

Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp

Maximum Data Transfer Size:         32 Pages

Warning  Comp. Temp. Threshold:     70 Celsius

Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States

St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat

 0 +     9.00W       -        -    0  0  0  0        5       5

 1 +     4.60W       -        -    1  1  1  1       30      30

 2 +     3.80W       -        -    2  2  2  2       30      30

 3 -   0.0500W       -        -    3  3  3  3     1000    1000

 4 -   0.0040W       -        -    4  4  4  4     6000    8000

Supported LBA Sizes (NSID 0x1)

Id Fmt  Data  Metadt  Rel_Perf

 0 +     512       0         0

=== START OF SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)

Critical Warning:                   0x00

Temperature:                        40 Celsius

Available Spare:                    100%

Available Spare Threshold:          10%

Percentage Used:                    0%

Data Units Read:                    2,925,784 [1.49 TB]

Data Units Written:                 3,735,578 [1.91 TB]

Host Read Commands:                 16,841,519

Host Write Commands:                25,212,969

Controller Busy Time:               844

Power Cycles:                       12

Power On Hours:                     198

Unsafe Shutdowns:                   2

Media and Data Integrity Errors:    0

Error Information Log Entries:      0

Warning  Comp. Temperature Time:    0

Critical Comp. Temperature Time:    0

Temperature Sensor 1:               41 Celsius

Temperature Sensor 2:               39 Celsius

Temperature Sensor 5:               59 Celsius

Error Information (NVMe Log 0x01, max 256 entries)

No Errors Logged
```

Thanks for your help, everyone!

----------

## molletts

 *RayDude wrote:*   

> The system boots off an NVMe drive, but has a RAID6 array. To optimize the kernel for both the NVMe and the RAID6 array, it's best for me to use the deadline I/O scheduler. deadline doesn't slow the NVMe much, but it improves the performance of the HD array.

 

You can use different schedulers on different devices if you like.

Put a line like this into /etc/udev/rules.d/10-ioscheduler.rules:

```
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
```

and the system should automatically use the noop scheduler for all NVMe devices and whatever you select as the default scheduler (e.g. deadline) for all other devices.

You can check which is being used for each device with something like:

```
cat /sys/block/nvme0n1/queue/scheduler
```

substituting the device name as appropriate. It will show a list of available schedulers with the selected one bracketed.

(If you want to try out different schedulers, you can also echo the name of a scheduler that is available in your kernel to the file to change it on the fly.)

Hope this helps,

Stephen

----------

## Anon-E-moose

hdparm works on devices, not partitions, and (I don't think) arrays.

If you want filesystem performance, then something like iozone would be more what you need.

Edit to add: I'm not sure why there's a performance difference in your first post; it should make no difference whether you point to the whole device or a partition of it, because it talks to the controller either way (if I'm not mistaken).

https://ssd.userbenchmark.com/SpeedTest/607339/CT1000P1SSD8

If running in an NVMe/PCIe Gen3 x4 slot, the device is supposed to hit ~2000 MB/s for reads and ~1700 MB/s for writes.

If it's not a Gen3 slot, it will be slower, especially if that slot is shared with other cards, which is common on many motherboards.
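One way to check the negotiated link is with lspci (the bus address below is only an example; it can be found with `lspci | grep -i nvme`):

```
# LnkCap is what the device supports, LnkSta is what was negotiated
lspci -s 01:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
```

A Gen3 x4 link shows up as Speed 8GT/s, Width x4; a slower speed or narrower width would explain lower throughput.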

----------

## Hu

Please post the partition table without rounding.  sgdisk --print can do this.  The parted output is not clear whether the partitions are aligned to any of the commonly important boundaries.

----------

## RayDude

 *Hu wrote:*   

> Please post the partition table without rounding.  sgdisk --print can do this.  The parted output is not clear whether the partitions are aligned to any of the commonly important boundaries.

 

Update: found it:

```
server ~ # sgdisk --print /dev/nvme0n1

Disk /dev/nvme0n1: 1953525168 sectors, 931.5 GiB

Model: CT1000P1SSD8                            

Sector size (logical/physical): 512/512 bytes

Disk identifier (GUID): 1A547616-F8A0-485F-B15F-B6723E76FF7C

Partition table holds up to 128 entries

Main partition table begins at sector 2 and ends at sector 33

First usable sector is 34, last usable sector is 1953525134

Partitions will be aligned on 2048-sector boundaries

Total free space is 3437 sectors (1.7 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name

   1            2048            6143   2.0 MiB     EF02  BIOSBOOT

   2            6144          415743   200.0 MiB   0700  EFI

   3          415744        17192959   8.0 GiB     8200  SWAP

   4        17192960      1953523711   923.3 GiB   8300  SERVER

```

I can't find sgdisk...

How about this output from fdisk:

```
server ~ # fdisk /dev/nvme0n1

Welcome to fdisk (util-linux 2.33.2).

Changes will remain in memory only, until you decide to write them.

Be careful before using the write command.

Command (m for help): p

Disk /dev/nvme0n1: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors

Disk model: CT1000P1SSD8                            

Units: sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disklabel type: gpt

Disk identifier: 1A547616-F8A0-485F-B15F-B6723E76FF7C

Device            Start        End    Sectors   Size Type

/dev/nvme0n1p1     2048       6143       4096     2M BIOS boot

/dev/nvme0n1p2     6144     415743     409600   200M Microsoft basic data

/dev/nvme0n1p3   415744   17192959   16777216     8G Linux swap

/dev/nvme0n1p4 17192960 1953523711 1936330752 923.3G Linux filesystem
```
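Those start sectors can be checked for 1 MiB alignment directly: with 512-byte sectors, a partition is 1 MiB-aligned when its start sector is a multiple of 2048.

```
#!/bin/sh
# start sectors taken from the fdisk output above
for start in 2048 6144 415744 17192960; do
    if [ $((start % 2048)) -eq 0 ]; then
        echo "$start: aligned"
    else
        echo "$start: NOT aligned"
    fi
done
```

All four starts are exact multiples of 2048, so the partitions are 1 MiB-aligned and alignment can be ruled out as the cause.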

----------

