# bad SATA/NVMe write performance (IO scheduler?)

## hintegerha

I have a *very* strange problem which has kept me busy for several days already. I am running kernel 4.19.86 on an H370 mobo with both an NVMe card (PCIe) and an md RAID 5 array of 3 disks connected via SATA 6G. The disks are OK, since they were taken from a system that operated without any problems before I upgraded the mobo.

Whenever I use more than 4G of RAM, write performance on *both* NVMe and SATA gets very slow. Read performance (via hdparm) is still normal.

As the setup of this system was quite a lot of work and I have to run some proprietary x86 binaries (old SCADA software), I migrated to NVMe only and stuck with the x86 architecture; RAM was upgraded from 3G to 16G. Going to x86_64 would be my last option.

Any idea?

EDIT: I tried with an x86_64 kernel; there the degradation is even worse.

regards

Last edited by hintegerha on Mon Jan 13, 2020 7:49 am; edited 7 times in total

----------

## NeddySeagoon

hintegerha,

Welcome to Gentoo.

With a 32 bit install and 16G of RAM there will be a lot of data copying going on.

The system can only address, at most, 4G RAM without some jiggery pokery that reduces performance.

Can you use a multilib install?

That's a 64 bit install that also supports 32 bit programs.

You get a 64 bit kernel, so it can see all of RAM and at the same time, will run 32 bit applications.

You can have a pure 64 bit install too but that's not yet the default.

Multilib gives you two copies of some things. A 64 bit copy and a 32 bit copy. That's the price of supporting both ABIs.

----------

## hintegerha

 *NeddySeagoon wrote:*   

> hintegerha,
> 
> Welcome to Gentoo.
> 
> With a 32 bit install and 16G of RAM there will be a lot of data copying going on.
> ...

 

Thx for your reply. I *am* aware of the additional effort to map the pages, and multilib would be an option, but still a PITA, as it means reinstalling from scratch. I do have another 8G 32-bit machine (a VM, but this shouldn't make a difference) which performs quite well, but it does not have the "dual" storage channels AHCI/SATA and NVMe. I also have another machine with basically the same setup, except for the md array, i.e. only a "single" storage path via NVMe, and there no degradation of write speed is present. What makes me go crazy is that with write operations on both NVMe and SATA with the 64G HIGHMEM kernel (and 16G of physical RAM installed), iotop finally drops to below 1 MB/s.

When doing large file transfers (via SMB) onto the md array, this takes AGES: e.g. a VMware backup via Acronis writing a 40G archive would not finish within a day, while it finished in under 5 minutes with the old mobo with "only" 3GB of RAM!

Changing the kernel to HIGHMEM_4G makes both the NFS and Samba shares respond fast, on writing as well as reading. With the HIGHMEM_64G kernel, read speed stays high, but write speed drops horribly after some GBs have been written, and only a reboot brings the write speed back up.

PS: I already tried some benchmarks by booting the x86_64 kernel from the minimal install CD, resulting in (forever) blocking hdparm -t commands and having to resync the md array, so I'm not confident this would fix my issue.

PPS: And why is the slowdown only present when writing? I can read data from my SMB share (the md array) over the 1 Gb/s net at ~100 MB/s; more is not possible with a single Gb/s NIC. Are there issues with Intel H370 AHCI controllers? I triple-checked the mobo setup: the NVMe is shared with SATA port 2 *only* if the M.2 slot is used as SATA, but I have a PCIe M.2 SSD.

Another option I see is limiting RAM via the kernel command line: mem=8G. Would that change anything? Nevertheless, no single process needs/uses more than 4G, except the kernel for buffering/caching, but that's beyond my influence. "Only" the samba/NFS daemons, a small BIND, vsftpd, the XFCE DE and a small PostgreSQL DB (small memory footprint with no transactions at all; only a test system) are running.
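For reference, mem= goes on the kernel command line; a minimal sketch assuming a GRUB2 setup (file paths are the usual defaults, adjust for your bootloader):

```shell
# /etc/default/grub -- assumed GRUB2 layout, adjust for your bootloader.
# Limit the kernel to the first 8 GiB of RAM:
GRUB_CMDLINE_LINUX_DEFAULT="mem=8G"
```

Then regenerate the config with `grub-mkconfig -o /boot/grub/grub.cfg` and reboot; `cat /proc/cmdline` shows whether the parameter took effect.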

----------

## hintegerha

Setting mem=8G in a 32-bit 64G_HIGHMEM-enabled kernel brings ksoftirqd page allocation failures - WTF. It looks like when lots of packets are transmitted on the Intel NIC with the Intel e1000e driver v3.6.0, the kernel cannot allocate memory fast enough; kswapd load goes up significantly as well. But now, with nearly 50G written to the md array, there are no more page allocation failures. Strange. The system is definitely more usable than with the full 16G of RAM, but definitely not at its best. I'm afraid I'll *have* to go through a reinstall of x86_64 multilib from scratch:

```
[ 5467.984276] ksoftirqd/0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
[ 5467.984277] ksoftirqd/0 cpuset=/ mems_allowed=0
[ 5467.984281] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G           O      4.19.86-gentoo #1
[ 5467.984282] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M-ITX/ac, BIOS P4.10 05/08/2019
[ 5467.984283] Call Trace:
[ 5467.984288]  dump_stack+0x4f/0x63
[ 5467.984290]  warn_alloc+0x7b/0xeb
[ 5467.984291]  __alloc_pages_nodemask+0xa4c/0xaaa
[ 5467.984294]  ? tcp_gro_receive+0x1af/0x207
[ 5467.984296]  page_frag_alloc+0x4a/0xcf
[ 5467.984297]  __netdev_alloc_skb+0x6d/0xbe
[ 5467.984301]  e1000_alloc_rx_buffers+0x6b/0x1a2 [e1000e]
[ 5467.984305]  ? e1000_clean_rx_ring+0x240/0x240 [e1000e]
[ 5467.984307]  e1000_clean_rx_irq+0x27d/0x2c9 [e1000e]
[ 5467.984309]  ? e1000_clean_jumbo_rx_irq+0x452/0x452 [e1000e]
[ 5467.984312]  e1000e_poll+0x60/0x1ee [e1000e]
[ 5467.984314]  net_rx_action+0xc8/0x238
[ 5467.984316]  __do_softirq+0xd5/0x1f2
[ 5467.984318]  run_ksoftirqd+0x15/0x1f
[ 5467.984320]  smpboot_thread_fn+0x116/0x12a
[ 5467.984321]  kthread+0xed/0xf2
[ 5467.984323]  ? sort_range+0x18/0x18
[ 5467.984323]  ? kthread_park+0x83/0x83
[ 5467.984325]  ret_from_fork+0x2e/0x38
[ 5467.984326] Mem-Info:
[ 5467.984329] active_anon:371678 inactive_anon:158881 isolated_anon:0
                active_file:201035 inactive_file:569393 isolated_file:0
                unevictable:1892 dirty:0 writeback:0 unstable:0
                slab_reclaimable:27819 slab_unreclaimable:16693
                mapped:108452 shmem:124564 pagetables:4456 bounce:0
                free:63149 free_pcp:1255 free_cma:0
[ 5467.984331] Node 0 active_anon:1486712kB inactive_anon:635524kB active_file:804140kB inactive_file:2277572kB unevictable:7568kB isolated(anon):0kB isolated(file):0kB mapped:433808kB dirty:0kB writeback:0kB shmem:498256kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 5467.984334] DMA free:2952kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:1352kB inactive_file:1576kB unevictable:0kB writepending:16kB present:15984kB managed:8240kB mlocked:0kB kernel_stack:48kB pagetables:268kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 5467.984334] lowmem_reserve[]: 0 735 5938 5938
[ 5467.984338] Normal free:1252kB min:3452kB low:4312kB high:5172kB active_anon:0kB inactive_anon:0kB active_file:70476kB inactive_file:69496kB unevictable:0kB writepending:88kB present:890872kB managed:758176kB mlocked:0kB kernel_stack:5872kB pagetables:17556kB bounce:0kB free_pcp:2696kB local_pcp:416kB free_cma:0kB
[ 5467.984339] lowmem_reserve[]: 0 0 41620 41620
[ 5467.984342] HighMem free:248392kB min:512kB low:6612kB high:12712kB active_anon:1486600kB inactive_anon:635524kB active_file:732020kB inactive_file:2206560kB unevictable:7568kB writepending:0kB present:5327424kB managed:5327424kB mlocked:7568kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:2324kB local_pcp:452kB free_cma:0kB
[ 5467.984343] lowmem_reserve[]: 0 0 0 0
[ 5467.984344] DMA: 21*4kB (UE) 34*8kB (UME) 20*16kB (UME) 21*32kB (UME) 3*64kB (UM) 1*128kB (U) 3*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 2948kB
[ 5467.984351] Normal: 302*4kB (UME) 27*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1424kB
[ 5467.984356] HighMem: 15271*4kB (ME) 9356*8kB (ME) 2680*16kB (ME) 1313*32kB (ME) 310*64kB (M) 59*128kB (ME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 248220kB
[ 5467.984363] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5467.984363] 897635 total pagecache pages
[ 5467.984365] 912 pages in swap cache
[ 5467.984365] Swap cache stats: add 14544, delete 13632, find 1482/2010
[ 5467.984366] Free swap  = 4032764kB
[ 5467.984366] Total swap = 4080636kB
[ 5467.984367] 1558570 pages RAM
[ 5467.984367] 1331856 pages HighMem/MovableOnly
[ 5467.984367] 35110 pages reserved
[ 5467.984376] ksoftirqd/0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
[ 5467.984376] ksoftirqd/0 cpuset=/ mems_allowed=0
[ 5467.984378] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G           O      4.19.86-gentoo #1
[ 5467.984378] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M-ITX/ac, BIOS P4.10 05/08/2019
[ 5467.984379] Call Trace:
```

....

[Moderator edit: added [code] tags to preserve output layout. -Hu]

Last edited by hintegerha on Sat Dec 28, 2019 6:48 pm; edited 1 time in total

----------

## Hu

Another option would be to use a 64-bit kernel and keep the user programs as x86, ignoring the multilib support entirely.  This is a bit harder to maintain, but is less disruptive to try out, especially if you just want to test how it performs.

----------

## mike155

Does this article discuss your issue? 

You could try the proposed solution (2G/2G split).

----------

## hintegerha

I'm currently running a backup of my 32-bit installation under the x86_64 architecture (booted from the minimal install CD), and even here I/O is strange; I highly suspect some kernel bug. I'm rsyncing my current NVMe root to the md array, resulting in long on-and-off pauses between progress updates, as if no reading or writing were being done at all. No hints from dmesg. I then tried a sync, waited for several minutes, and then got this in the kernel dmesg output:

```

[ 2824.302050] INFO: task sync:11976 blocked for more than 120 seconds.
[ 2824.302051]       Not tainted 4.19.86-gentoo #1
[ 2824.302052] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2824.302053] sync            D    0 11976  11811 0x00000000
[ 2824.302054] Call Trace:
[ 2824.302060]  ? __schedule+0x626/0x691
[ 2824.302062]  ? __queue_work+0x255/0x2ae
[ 2824.302064]  schedule+0x65/0x6e
[ 2824.302066]  wb_wait_for_completion+0x51/0x7a
[ 2824.302068]  ? wait_woken+0x6a/0x6a
[ 2824.302070]  sync_inodes_sb+0xb1/0x239
[ 2824.302071]  ? __ia32_sys_tee+0x11/0x11
[ 2824.302072]  iterate_supers+0x67/0xad
[ 2824.302074]  ksys_sync+0x3b/0x9f
[ 2824.302075]  __ia32_sys_sync+0x5/0x8
[ 2824.302077]  do_syscall_64+0x57/0xf7
[ 2824.302079]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2824.302080] RIP: 0033:0x7ff84c10f407
[ 2824.302083] Code: Bad RIP value.
[ 2824.302084] RSP: 002b:00007ffc1491d688 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[ 2824.302086] RAX: ffffffffffffffda RBX: 00007ffc1491d7b8 RCX: 00007ff84c10f407
[ 2824.302086] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007ff84c1a7302
[ 2824.302087] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 2824.302087] R10: 000055b51544f98f R11: 0000000000000206 R12: 0000000000000000
[ 2824.302088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

```

As I get quite normal write performance using a 32-bit kernel limited to 8G, I guess the hardware is OK... Is there anything on the minimal install CD that I can use to trace down what the kernel is waiting for when writing out the dirty pages?
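A minimal sketch of what the install CD already provides for this, using only standard /proc interfaces (the SysRq part is commented out since it needs root and CONFIG_MAGIC_SYSRQ):

```shell
# Watch the dirty/writeback page counters while the stall happens:
grep -E '^(nr_dirty|nr_writeback) ' /proc/vmstat

# To see *where* blocked tasks are stuck, dump their stacks to the
# kernel log ('w' = show blocked tasks; needs root and CONFIG_MAGIC_SYSRQ):
#   echo w > /proc/sysrq-trigger
#   dmesg | tail -n 50
```

Running the grep in a loop during a transfer shows whether dirty pages pile up and stop being written back.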

[Moderator edit: added [code] tags to preserve output layout. -Hu]

----------

## hintegerha

 *mike155 wrote:*   

> Does this article discuss your issue? 
> 
> You could try the proposed solution (2G/2G split).

 

I don't think that matches my case. I see the slowdown also when running a 64-bit kernel (with the full 16GB of RAM; I did not try to limit it here yet). Are there AHCI controllers that can address only 8G of RAM via DMA? Especially when syncing large files (from NVMe to the md array) I see transfer rates of ~5 MB/s. But I guess the manufacturer wouldn't state that a maximum of 32GB of RAM is addressable if the board could address only 8GB via DMA and had to copy pages with the CPU for the rest. BTW, the board is an ASRock H370M-ITX/AC.

----------

## hintegerha

Tried the 64-bit kernel from the minimal install CD with mem=8G: the same disaster as with the 64-bit kernel with the full 16G of RAM and the 32-bit kernel with the full 16G of RAM. Something is seriously broken, but I have no clue what or where. I'll have to stick to my 32-bit kernel with RAM forced to 8G and "waste" the other 8G. That's the *only* way my hardware is capable of running as an efficient file server. Strange.

----------

## hintegerha

My conclusion running Gentoo on an ASRock H370M-ITX/AC mobo with 16GB of RAM and an NVMe installation:

x86 HIGHMEM_64G with no memory limitation: horrible SATA/NVMe disk I/O performance (mostly write) after a few GB written. This applies only to my file server (cifs modules loaded, ~512k); on another station (same H/W, but no SATA md array and no nfs modules) disk I/O performance is normal. I guess there is enough kernel memory there for the mem_map[] structure (256MB in my case for 16GB of RAM, if I interpret https://flaterco.com/kb/PAE_slowdown.html correctly); on the x86 file server there isn't, so I have to force memory down to 8GB, resulting in a 128MB mem_map[].

x86 HIGHMEM_64G with mem=8G: I/O performance normal, but obviously only 50% of the installed RAM is used.

x86_64: I couldn't get usable disk I/O performance with either unlimited or limited memory, unlike with the x86 installation. Something is broken here. As posted before, I booted the minimal install x86_64 CD, tested an rsync from the NVMe root to the md RAID 5 array, and got horrible write performance below 5-10MB/s (I only checked large files).
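The mem_map[] figures above can be reproduced with quick shell arithmetic, assuming 4 KiB pages and the 64-byte struct page the linked article uses (the real per-page size depends on the kernel configuration):

```shell
# One struct page per 4 KiB physical page; on a 32-bit kernel the whole
# mem_map[] array must live in lowmem.
ram_gib=16        # installed RAM in GiB
struct_page=64    # assumed bytes per struct page (see lead-in)
pages=$(( ram_gib * 1024 * 1024 * 1024 / 4096 ))
echo "$(( pages * struct_page / 1024 / 1024 )) MiB"   # 16 GiB -> 256 MiB
```

With ram_gib=8 the same arithmetic gives the 128MB figure mentioned above.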

As I'm planning to replace my x86_64 home server with this mobo as well, I will definitely run into this problem again; the question is how can I trace it? Should I file a kernel bug?

I guess the subject should be changed to reflect the mobo and to state that x86_64 is also causing issues, but that's not possible (at least for me).

Is anybody out there using an ASRock H370M-ITX/AC as a file server (NVMe as root device and an md RAID 5 array as storage) running Gentoo x86_64?

Update: now I got the following error (I can still use the system). What is this?

```
[ 8315.096916] ------------[ cut here ]------------
[ 8315.096918] memremap attempted on mixed range 0x000000000009d000 size: 0x1000
[ 8315.096924] WARNING: CPU: 3 PID: 13589 at kernel/iomem.c:81 memremap+0x5a/0x140
[ 8315.096925] Modules linked in: nfsd nfs_acl auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc ext4 jbd2 hid_logitech_hidpp hid_logitech_dj raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_mod binfmt_misc input_leds led_class usbhid iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass rtc_cmos i915 iosf_mbi drm_kms_helper syscopyarea sysfillrect sysimgblt snd_hda_intel xhci_pci snd_hda_codec fb_sys_fops xhci_hcd e1000e(O) snd_hda_core drm i2c_i801 snd_pcm igb usbcore intel_gtt snd_timer ahci ptp snd i2c_algo_bit pps_core i2c_core agpgart libahci usb_common video backlight button
[ 8315.096953] CPU: 3 PID: 13589 Comm: lshw Tainted: G           O      4.19.86-gentoo #1
[ 8315.096954] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M-ITX/ac, BIOS P4.10 05/08/2019
[ 8315.096956] EIP: memremap+0x5a/0x140
[ 8315.096957] Code: 00 83 7d e8 02 75 2a 80 3d 51 7c 81 c0 00 0f 85 c0 00 00 00 8d 45 ec 51 50 68 84 94 71 c0 c6 05 51 7c 81 c0 01 e8 77 05 f6 ff <0f> 0b e9 a0 00 00 00 31 c0 f6 c3 01 74 6e 83 7d e8 00 74 13 8b 45
[ 8315.096958] EAX: 00000041 EBX: 00000001 ECX: f315fac8 EDX: 00000006
[ 8315.096959] ESI: 0009d400 EDI: 00000000 EBP: c4e8de9c ESP: c4e8de74
[ 8315.096960] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
[ 8315.096960] CR0: 80050033 CR2: bf9e4fc8 CR3: 2db24000 CR4: 003406f0
[ 8315.096961] Call Trace:
[ 8315.096964]  xlate_dev_mem_ptr+0x19/0x2b
[ 8315.096967]  read_mem+0x90/0x139
[ 8315.096969]  ? __fsnotify_parent+0xa8/0xb2
[ 8315.096970]  ? write_mem+0xef/0xef
[ 8315.096972]  __vfs_read+0x22/0x101
[ 8315.096974]  ? security_file_permission+0x7a/0x89
[ 8315.096975]  ? rw_verify_area+0xa0/0xf8
[ 8315.096976]  vfs_read+0x8c/0x10e
[ 8315.096977]  ksys_read+0x42/0x87
[ 8315.096979]  sys_read+0x11/0x13
[ 8315.096980]  do_int80_syscall_32+0x50/0xd3
[ 8315.096982]  entry_INT80_32+0xca/0xca
[ 8315.096983] EIP: 0xb7f82092
[ 8315.096984] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 2c 00 00 00 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 8315.096985] EAX: ffffffda EBX: 00000003 ECX: bf9e50f8 EDX: 00000400
[ 8315.096986] ESI: bfa250f9 EDI: bfa250fe EBP: 00000000 ESP: bf9e507c
[ 8315.096986] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[ 8315.096988] ---[ end trace 56c161f79a25f35a ]---

```

regards

[Moderator edit: added [code] tags to preserve output layout. -Hu]

Last edited by hintegerha on Tue Jan 07, 2020 10:39 am; edited 1 time in total

----------

## NeddySeagoon

hintegerha,

You can edit the topic title by editing your first post in the topic. The title is there.

Your error ... 

```
[ 8315.096916] ------------[ cut here ]------------ 
```

is a kernel Oops. Something went wrong, but the kernel was able to recover, unlike a panic, where recovery is not possible.

```
[ 8315.096953] CPU: 3 PID: 13589 Comm: lshw Tainted: G O 4.19.86-gentoo #1 
```

The problem was on CPU core 3, process ID 13589. Tainted means that you have one or more out-of-tree kernel modules loaded.

The rest of the output is the machine state which means nothing to me.

----------

## hintegerha

 *NeddySeagoon wrote:*   

> 
> 
> Your error ... 
> 
> ```
> ...

 

Yeah, I thought it was something like that, but what does:

```
memremap attempted on mixed range 0x000000000009d000 size: 0x1000
```

mean? Is this related to PAE?

 *NeddySeagoon wrote:*   

> 
> 
> ```
> [ 8315.096953] CPU: 3 PID: 13589 Comm: lshw Tainted: G O 4.19.86-gentoo #1 
> ```
> ...

 

The only non-stock module is the Intel e1000e NIC driver (latest version 3.6.0, which shouldn't cause any problems); I used it already in the previous hardware/software setup. Unfortunately the process with the given PID is not running any more... I guess it was smbd, as I was doing a backup to the Samba share...

----------

## Hu

According to the output, the process was lshw.

The kernel has had an in-tree e1000e driver for several years now.

----------

## msst

Just to be sure and to exclude the weird gotchas that faulty RAM can cause: did you run an overnight memtest on the system to be sure the RAM works under load?

I had faulty RAM once, and the problems it caused were fairly cryptic and erratic...

The consistently low write performance points to a more fundamental problem, though. But the kernel oopses are weird and could come from RAM problems.

----------

## Blind_Sniper

What NVMe card do you use?

----------

## hintegerha

 *Blind_Sniper wrote:*   

> what  NVME card do you use?

 

I have a Silicon Power P34A80 512 GB M.2 PCIe 3.0 x4 (3400/2700 MB/s read/write).

----------

## hintegerha

 *msst wrote:*   

> Just to be sure and to exclude possible weird gotchas which are possible with faulty RAM: Did you run an overnight memtest on the system to be sure the RAM works under load?
> 
> I have had a faulty RAM once and the problems encountered with it can be fairly cryptic and erratic...
> 
> The consistent low write performance points to a more principle problem though. But the kernel oops are weird and could come from RAM problems.

 

You're right; I just discovered that the machine was at load > 20 while doing exactly nothing, and all backups failed. WTF. I will definitely run a memtest next time I'm in front of that station... but as I see no data loss, just horrible slowness, I guess the problem lies somewhere else... maybe it is also the mainboard... I wasn't able to test/simulate this on my home server yet. I'm still stunned at how weird this is and refuse to change a running system (I'm talking about my home server now, not the one I was referring to in my post).

Last edited by hintegerha on Sun Jan 05, 2020 5:40 pm; edited 1 time in total

----------

## hintegerha

 *Hu wrote:*   

> According to the output, the process was lshw.
> 
> The kernel has an e1000e driver in-tree, as of several years ago.

 

Yes, I know, and despite this fact I had been running the Intel e1000e driver for several years without *any* problem.

----------

## hintegerha

 *msst wrote:*   

> Just to be sure and to exclude possible weird gotchas which are possible with faulty RAM: Did you run an overnight memtest on the system to be sure the RAM works under load?
> 
> I have had a faulty RAM once and the problems encountered with it can be fairly cryptic and erratic...
> 
> The consistent low write performance points to a more principle problem though. But the kernel oops are weird and could come from RAM problems.

 

The RAM (all 16G) is OK: no overnight test, but at least one pass (single CPU), which ran for about 1 hr.

----------

## hintegerha

Changed now to HIGHMEM_4G, but an absolutely idle system (lightdm X login screen; idle samba/nfsd, idle apache, postgreSQL and mysql) still shows a load of ~1. How can I find out what the system is waiting for? The load should drop to almost zero, right? At least backups can be done to the samba share (but the load goes up to at least 6-7 while doing so).

NVMe read performance is better, ~1000MB/s (while the specs say >3000MB/s). Is NVMe already mature and stable?

md (RAID 5) read performance ~200MB/s: OK, given the fact that these are slow 2.5" hard drives @ 5900rpm.

iostat shows no write activity and 100% CPU idle:

```

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    0,00    0,00    0,00  100,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
sdc               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
md0               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00    0,00    0,00   0,00   0,00
```

vmstat shows basically no swapping, and CPU id=100, i.e. absolutely idle:

```

 vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  11264 202832   1888 1101856    0    0   215   171  144  109  0  0 96  4  0
 0  0  11264 202572   1888 1101872    0    0     0     0  529  306  0  0 100  0  0
 0  0  11264 202572   1888 1101872    0    0     0     0  525  309  0  0 100  0  0
 0  0  11264 212616   1888 1092200    0    0     0     0  575  333  0  0 100  0  0
 0  0  11264 212652   1888 1092132    0    0     0     4  528  312  0  0 100  0  0
 0  0  11264 212652   1888 1092164    0    0     0     0  535  322  0  0 100  0  0
 0  0  11264 212652   1888 1092164    0    0     0     0  520  292  0  0 100  0  0
 0  0  11264 212652   1888 1092164    0    0     0     0  535  328  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  539  300  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  529  297  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  528  322  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  538  330  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  530  332  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  530  309  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  538  315  0  0 100  0  0
 0  0  11264 212668   1888 1092164    0    0     0     0  534  324  0  0 100  0  0
```

but look:

```
cat /proc/loadavg
1.01 1.06 1.02 1/319 12913
```

 and

```
 uptime
 22:43:24 up 12:04,  1 user,  load average: 1,03, 1,05, 1,01
```

What the hell is going on here?
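One hint: the load average counts tasks in uninterruptible (D-state) sleep as well as runnable ones, so a 100% idle CPU with a load of ~1 usually means one task is permanently stuck waiting on I/O. A quick check (plain procps, nothing Gentoo-specific):

```shell
# List tasks in uninterruptible sleep (state D) -- these are counted
# in the load average even though they consume no CPU:
ps -eo state,pid,comm | awk '$1 == "D"'

# For a PID found above, show where in the kernel it is waiting
# (needs root; <pid> is a placeholder):
#   cat /proc/<pid>/stack
```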

----------

## hintegerha

Oops, this time I discovered an MCE error, for the first time ever (while the load increased to >3 despite the system being idle)!

```

[77578.962570] mce: [Hardware Error]: Machine check events logged
[77578.962573] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: 9000004000010005
[77578.962574] mce: [Hardware Error]: TSC da7c50864b82 
[77578.962577] mce: [Hardware Error]: PROCESSOR 0:906eb TIME 1578381101 SOCKET 0 APIC 4 microcode b4

```

I checked the MB manufacturer's page and found out that a new BIOS with newer microcode was released today. I will give it a try. The MB is not listed as Linux compatible, but I never thought that this could really matter... we'll see. Nevertheless, on the other machine (not a file server) everything looks OK; there the load is ~0.3 with an active X session with xscreensaver running, which is definitely normal.

BTW, I'm still posting from this machine!!! Really strange.

EDIT: flashed the latest BIOS; no MCE until now (which doesn't mean a lot, since I had been running the system for several days before, with bad I/O behaviour, but at least running). The load is now below 1 (which was the case even before the BIOS update, right after reboot), but after several hours the load is at >2 without doing ANYTHING: no swapping, only a little I/O, and I have no clue what makes the load go up while the system is doing nothing.

----------

## hintegerha

Did some more tests and discovered this (Ubuntu, but basically the same issue): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1333294

My experience is that once write speed is throttled, a

```
echo 3 > /proc/sys/vm/drop_caches
```

(1 would probably be sufficient) brings back the original write speed, but if dirty pages increase quickly, writes will soon become slow again.

The above article mentions setting highmem_is_dirtyable to 1 as a permanent solution (not verified yet). From the docs:

```
Official reference

Available only for systems with CONFIG_HIGHMEM enabled (32b systems).

This parameter controls whether the high memory is considered for dirty writers throttling. This is not the case by default which means that only the amount of memory directly visible/usable by the kernel can be dirtied. As a result, on systems with a large amount of memory and lowmem basically depleted writers might be throttled too early and streaming writes can get very slow.

Changing the value to non zero would allow more memory to be dirtied and thus allow writers to write more data which can be flushed to the storage more effectively. Note this also comes with a risk of pre-mature OOM killer because some writers (e.g. direct block device writes) can only use the low memory and they can fill it up with dirty data without any throttling.
```
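For anyone wanting to try it, the runtime toggle is a one-liner (untested here, as said above; the sysctl only exists on 32-bit HIGHMEM kernels):

```shell
# Allow highmem pages to count toward the dirty limits (runtime change):
sysctl vm.highmem_is_dirtyable=1

# Persist it across reboots:
echo 'vm.highmem_is_dirtyable = 1' >> /etc/sysctl.conf
```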

I'm definitely affected by this write throttling. What is strange is the fact that I was also hit when copying data from NVMe to the md array with a 64-bit kernel (booted from the x86_64 minimal install CD, if I remember correctly; I definitely have to rerun this test!).

----------

## hintegerha

Just ran another test: this write speed throttling is definitely also taking place under a 64-bit kernel. Is there any way to disable this bug/feature, which is absolutely unnecessary on e.g. a file server? I guess it is one of the vm.dirty_* settings?
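It is indeed the vm.dirty_* family that controls this throttling; a sketch of the knobs (the byte values are arbitrary examples, not recommendations):

```shell
# Show the current thresholds (percent of dirtyable memory):
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Or set absolute limits instead (examples only; needs root):
sysctl vm.dirty_background_bytes=268435456   # start background writeback at 256 MiB dirty
sysctl vm.dirty_bytes=1073741824             # hard-throttle writers at 1 GiB dirty
```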

Or is this cgroups-related? At home I have a 64-bit server which can write constantly from SATA to an md RAID 5 array at >150MB/s, even though in some tests rsync freezes as well, once buffers grow to more than 50% of memory (16G total RAM).

This looks I/O-scheduler-related. I tried some of the available schedulers in my 64-bit kernel, but didn't find a good setting yet. In nearly every test scenario the I/O gets suspended at a certain data transfer load; dropping the caches brings back "normal" transfer rates. But that of course is not the optimal solution...
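For the record, the scheduler can be inspected and switched per device at runtime; which names are offered depends on the kernel configuration (a 4.19 blk-mq kernel typically offers mq-deadline, bfq, kyber and none):

```shell
# The active scheduler is shown in brackets:
cat /sys/block/sda/queue/scheduler
# e.g.: [mq-deadline] kyber bfq none

# Switch at runtime (root; per device, takes effect immediately):
echo bfq > /sys/block/sda/queue/scheduler
```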

----------

