# Filesystem for SSD?

## The_Great_Sephiroth

I am ordering a 2TB SSD (SATA) for my new laptop tomorrow. I have never run a work system on an SSD before. After doing a lot of reading I believe I have narrowed it down to BTRFS or F2FS. BTRFS will not do duplicate data mode or anything due to many SSDs doing that on their level, so no real benefit there. F2FS seems to be what I need, but can I use it? I know /boot would need to be BTRFS, ext4, or something GRUB can use. What about the rest of the system? It was designed for this task and I assume it would do it well, but I have zero experience with it.

Also, would F2FS work for the partition I use for VM storage (VirtualBox)? I know I can tell the guest OS that the disk is an SSD, but could I host ISO images and various virtual disk images on F2FS? I cannot host ISO images on BTRFS for whatever reason.

----------

## mike155

I use ext4 on my SSDs and I'm happy with it.

----------

## Jaglover

I converted my desktop root filesystem to F2FS some time ago. No complaints. F2FS can be used to host swap files, I think this answers your question about images.

----------

## Perfect Gentleman

@Jaglover, could you post kernel setting for f2fs?

----------

## The_Great_Sephiroth

Mike, I ruled out ext4 because it is a journaled system. I love it on an HDD and while it supports TRIM, it was built before SSDs were a thing and TRIM and the like are compatibility enhancements. BTRFS and F2FS are built for SSDs.

Jaglover, thank you for the info. I believe that this is the route I will be taking. I will have an eSATA or USB3 HDD formatted with BTRFS DUP profile to serve as a backup. Thank you!

----------

## fedeliallalinea

*Perfect Gentleman wrote:*

> @Jaglover, could you post kernel setting for f2fs?

https://wiki.gentoo.org/wiki/F2FS#Kernel

----------

## Perfect Gentleman

*fedeliallalinea wrote:*

> *Perfect Gentleman wrote:* @Jaglover, could you post kernel setting for f2fs?
>
> https://wiki.gentoo.org/wiki/F2FS#Kernel

No, I mean I want his settings.

----------

## Jaglover

These are mine. I decided I do not want extra security for this box.

```
CONFIG_F2FS_FS=y
CONFIG_F2FS_STAT_FS=y
CONFIG_F2FS_FS_XATTR=y
CONFIG_F2FS_FS_POSIX_ACL=y
# CONFIG_F2FS_FS_SECURITY is not set
# CONFIG_F2FS_CHECK_FS is not set
# CONFIG_F2FS_FAULT_INJECTION is not set
```

----------

## The_Great_Sephiroth

I wish Windows supported a flash-oriented filesystem. Oh wait, I have WinBTRFS and it works great. I just bought a 1TB M.2 chip for my gaming rig running 7 Pro 64bit. Since NTFS and exFAT are not exactly flash-friendly, I am formatting it with BTRFS!

I am also now testing F2FS on the 16GB USB drive I use on my old laptop for portage. Seems to be better than ext2. Like, maybe it's a hair faster?

----------

## eccerr0r

A modern SSD's wear leveling has no problem dealing with NTFS/FAT/journalling filesystems.  Even trim is not explicitly needed, though it can help reduce the number of unexpected slowdowns -- then again, most people do not even write software that manages RAM with explicit GC calls, which is effectively what trim does for flash.

I can't say the wear leveling is good inside all SD/(e)MMC and USB media, so those could (or will) benefit from F2FS, though consistency problems from sudden media removal also need to be considered.  Journalling has saved me from inconsistencies many times, and I continue to use ext3/ext4 on SSDs.  For flash cards like SD/USB media the jury's still out, and I may have to switch to f2fs after having several wear out prematurely.  On MTD devices F2FS is clearly mandatory - in fact F2FS was intended for devices that have limited or no wear leveling, unlike SSDs.

I don't see any advantage of using ext2fs on an SSD, except possibly if one wants to minimize RAM footprint.

Oh and yes this SSD I have in the system in my sig is quite old now, even with ext4fs... and barely shows any wear in SMART.  In fact the only SMART field that's showing any (significant) wear is the power cycle field...turning the machine on and off every day...

----------

## The_Great_Sephiroth

I have seen NTFS with journaling kill disks fast. Workstations used 40 hours a week, mind you, but they died faster than I believe they should have. Since my internal SSD will not be removed and the laptop has a battery if I lose AC power, I do not worry about sudden power loss. Also, Plasma is configured to hibernate at 15%, so I SHOULD be safe even on battery.

What it boils down to is that I want an FS that was designed with solid-state media being considered.

----------

## NeddySeagoon

The_Great_Sephiroth,

Don't lump all SSDs together. 

They have diversified since the days of the anecdotal evidence that says a Gentoo install could kill a new SSD before it was complete.

15 years ago, using a special filesystem was a good idea, but such filesystems mostly were not available then.

Now that they are, they are no longer needed on SSDs.

Other solid-state storage devices still benefit, though.

----------

## eccerr0r

*The_Great_Sephiroth wrote:*

> I have seen NTFS with journaling kill disks fast.

You probably saw only one or two specific brands or models where NTFS killed a disk, and maybe it's only one example.  But I can't fault you for being once bitten, twice shy - which will prohibit you from taking full advantage of what's available now.

Do keep in mind that a kernel panic or other system hang is no different than power loss to a filesystem.  Gone are the days I have to slowly go through a zillion fsck errors, nowadays I only have to clean up a few inconsistencies here and there when things go really awry.

----------

## The_Great_Sephiroth

So what you're saying is that they'll last as long as magnetic despite usage? That is just hard to believe. We go through SSDs faster than magnetics at our datacenter. This is another thing driving my decision to go with something designed for or with flash in mind. I mean writing a journal takes a LOT of IO. Every little change results in a write. I just fail to understand how writing less to the disk will not prolong its life.

Heck, I still have a functioning Vendex Headstart 286 computer. It has a 30MB IDE disk in it. I can boot DOS and 3.11 right now. I doubt SSDs are there yet. Not trying to be stubborn here, maybe I am missing something?

----------

## mike155

*Quote:*

> I mean writing a journal takes a LOT of IO.

Do you really think that filesystem developers don't know what they are doing? :)

Below is the smartctl output of my main SSD:

```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       57357
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       71
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       30
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       3922780857
```

It's a 256GB Samsung SSD 840 PRO Series from 2013. It has been running for 57,000 hours. In 2013, I formatted the partitions with ext4 and I've never had any problems!

2 TB were written to the drive and wear leveling lifetime is still at 99%. That means that I will be able to use this drive for another 500 years before wear level lifetime will be down to 0%.
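
If you want to check my arithmetic: Total_LBAs_Written counts 512-byte sectors on a drive like this, so the conversion is a one-liner. A quick sketch (the 512-byte sector size is an assumption worth confirming in your own smartctl output):

```shell
# Convert SMART Total_LBAs_Written to GiB.
# Assumes 512-byte logical sectors, which is typical for SATA drives.
lbas=3922780857                       # raw value from the output above
bytes=$((lbas * 512))
gib=$((bytes / 1024 / 1024 / 1024))
echo "${gib} GiB written"             # roughly 1.9 TiB, i.e. the ~2 TB above
```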

Of course, if you intend to run a program like

```
while true
do
   dd if=/dev/urandom of=/dev/sda bs=100M
done
```

you might be able to destroy your SSD within a couple of months. :)

----------

## Hu

In theory, fewer writes are better than more writes.  In practice, reducing writes requires trade-offs somewhere, whether that is in increased filesystem complexity trying to level wear at the OS level, in decreased durability by trying to wait longer to issue a write (thus increasing the window where a crash can lose the data),  or in decreased reliability by forgoing a journal and hoping you never have an event where a journal would be useful.  I don't have the data to dispute your experience with your datacenter.  Perhaps your system usage pattern routes more writes to the SSDs than to the magnetic drives.  Perhaps your SSDs were bought from a lower quality vendor, and your magnetic drives are high quality enterprise drives.  Perhaps the local environment (temperature control, etc.) is harsher on SSDs than on magnetic drives.  Perhaps you're just unlucky.

For the sake of argument, let's suppose that your SSDs die meaningfully sooner than your magnetic drives and that the only variable you can control is the type of filesystem used.  Let's suppose further that you can find a filesystem that measurably prolongs SSD life relative to the "default filesystem" that most people will recommend.  How much work are you saving yourself by extending those lifetimes?  How much does it help your IT budget to extend those lifetimes?  In contrast, how much work are you creating for yourself with this research project, and how much ongoing maintenance overhead will you incur by using a less common filesystem, which may have quirks you will need to discover in the field?  You don't have to post the answers to those questions here (some of those answers may even be proprietary or confidential to the company, if they reveal too much about how you use the drives), but try to identify answers to them, even if you can't share them with us.

----------

## Anon-E-moose

I also have a samsung 840 evo 250g, from Dec 2014.

```
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       65747
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       256
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       11
...
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9089716564
```

Used to run reiser (3) with separate partitions for /, home, portage and a large catch-all partition, but have since moved it to all one partition (btrfs). 

But even with that it's only 4 1/2 tb of writes and disk still has a long way to go.

----------

## eccerr0r

Yes today's SSDs will typically last as long if not longer than hard drives.  I think it's hard to believe for you because you had one bad batch of SSDs.  Also do not equate SSD with SD/MMC media, a lot of these could be low quality media or controllers that will fail quickly; also do not equate with SSFDC or other MTD devices as they don't even do wear leveling - for any sort of lifetime, these indeed do require f2fs.

Yes, you are indeed right that writing less does make the disk last "longer", but if you want to make the disk last 200 years longer, isn't that a bit ridiculous?  Will you care in the year 2119?  I sure won't care if my SSD fails in 2119; other things much more important to me will fail well before then.  Also, how many mechanical hard drives have you had fail already?

Yes, the journal does take more I/O.  But it's not doubling I/O - that would hurt performance on mechanical hard drives too.  It's just a few more blocks per set of blocks to make sure your writes are atomic instead of landing whenever the drive feels like it.

I didn't bother posting my SMART data for lifetime because it still says 100% even after the years I've had it, ran Gentoo on it the whole time.  I suspect it will hit 99% soon now, but not exactly sure when because the disk uses compression which extends life even more.  I've also worked with another smaller SSD (128GB) for longer, and that had its counter tick down to 99% finally, and I think even after a decade of use, still has a bit before hitting the 98% life remaining.  I think I even made a poll at one point to show most people who use Gentoo on their SSDs don't see that much wear on their disks. 

In any case you should still be backing up your SSD.  True they are not infallible, but I'd be surprised to see them fail before mechanical hard drives given similar conditions.

----------

## krumpf

Killing a SSD is doable, but it takes time

Tech Report did a SSD Endurance Experiment in 2013-2015, on Windows (so I guess only NTFS was used)

The "winner" died after writing more than 2 PetaBytes

----------

## dmpogo

My first SSD, an early Samsung in a Lenovo X300 laptop from 2008, lasted all the way until the laptop power supply died in 2018 (that is, the drive is still fine).

With all the Gentoo updates etc. I was running ext2 on it, as far as I remember.  Of course, this regime is not the one where the disk is completely full and constantly erased and rewritten with scratch data,

as disks in some data centers might be.

----------

## NeddySeagoon

krumpf,

Things have moved on since then, but the trade-off is in bits per memory cell in the SSD.

Every time memory cell life doubles (approximately), vendors add another bit per cell, so drive life does not increase, but you only get half the number of physical memory cells in your SSD.

That's good for the selling price, so we see prices tumble.

Original drives were one bit per cell, so each cell only stored a 1 or a 0.

Three bits per cell is common now, so each memory cell needs to store eight different voltage levels.

Yep, the storage system in an SSD is essentially an analogue device. It's a capacitor.

With lots of writes, the cells become 'leaky' so the different voltage levels tend to fall, so a cell that stored 111b reads back as 110b. Oops.

As the number of writes increases, the rate of leakage increases. 

SSD vendors have a pretty good grip of SSD lifetimes or they could not keep increasing the storage density like this and at the same time, keep lifetimes fairly constant and make money.

It does not follow that a lower bits per cell guarantees a longer cell life. There are other factors in the memory cell physics too.

----------

## krinn

*NeddySeagoon wrote:*

> With lots of writes, the cells become 'leaky' so the different voltage levels tend to fall, so a cell that stored 111b reads back as 110b. Oops.
>
> As the number of writes increases, the rate of leakage increases.

That's what I have read: it's not the number of writes you do because of journaling, it's the number of writes done to the same cell. With a journaling fs and a "smart" SSD, the controller tracks this and won't rewrite the same cell over and over.

I think early SSD controllers lacked this and, as such, kept writing to the same cells over and over, killing the SSD.

----------

## NeddySeagoon

krinn,

It's just as you say. Early SSDs were also much smaller, so wear levelling, even if it was present, didn't have as much space to work in.

To be pedantic, it's not writes that wear an SSD. It's the erase beforehand.

It matters as the erase block size is much larger than the write block size, which is what leads to write amplification, which in turn leads to erase amplification.
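
Putting illustrative numbers on that worst case makes it concrete. A sketch (the 4 KiB write block and the 64-blocks-per-erase-block geometry are assumptions for the example, not figures from any real drive):

```shell
# Worst-case write amplification: the host rewrites one write block,
# but the drive must relocate the other 63 live blocks and then erase
# the whole erase block. Sizes here are illustrative only.
write_block=4096                              # bytes the host asked to write
blocks_per_erase=64
erase_block=$((blocks_per_erase * write_block))
wa=$((erase_block / write_block))
echo "host wrote ${write_block} B, drive cycled ${erase_block} B: ${wa}x amplification"
```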

----------

## Zucca

*NeddySeagoon wrote:*

> Three bits per cell is common now, so each memory cell needs to store eight different voltage levels.
>
> Yep, the storage system in an SSD is essentially an analogue device. Its a capacitor.

Isn't TLC (triple level cell) a cell that stores the data using three different voltage levels? Yielding 00, 01, and 10 as the possible binary configurations? Or is the data density really that high nowadays? :o

----------

## NeddySeagoon

Zucca,

Reading around the web, I'm no longer sure. Triple and Quad level SSD cells are referred to as three and four bits per cell.

A single bit requires two levels ...

It doesn't affect the idea I was trying to get across. We don't get longer lives out of our SSDs, we get lower cost and higher storage density.

Further, the SSD life extending technologies, in SSDs at least, have moved into the controller.

If you have raw FLASH access, then life management is still your own problem.

----------

## mike155

*Zucca wrote:*

> Isn't TLC (Triple level cell) a cell that stores the data using three different voltage levels? Yielding to 00, 01, and 10 of possible binary configurations? Or is the data density really that high nowdays?

Don't be fooled by the strange names. 

A single-level cell (SLC) can store 1 bit per cell.

A multi-level cell (MLC) can store 2 bits per cell.

A triple-level cell (TLC) can store 3 bits per cell.

A quad-level cell (QLC) can store 4 bits per cell.

See: https://en.wikipedia.org/wiki/Triple-level_cell

The guys who invented those names really messed it up. :roll:
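
The arithmetic behind the names is at least simple: n bits per cell means 2^n voltage levels the cell must distinguish, which is why three bits per cell needs eight levels:

```shell
# Distinguishable voltage levels grow as 2^(bits per cell).
for bits in 1 2 3 4; do
  echo "${bits} bit(s)/cell -> $((1 << bits)) voltage levels"
done
```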

----------

## The_Great_Sephiroth

The drive I am getting is TLC, which seems to be the most common now. I also have a QLC M.2 on my desk I have yet to put into my gaming rig. I doubt very seriously that either of these would offer direct flash access. As such I suppose I will do a read and write speed test on the rig using BTRFS and NTFS and the faster one wins. My laptop however, is another story.

So this wealth of information brings up another question. With systems like F2FS, what would they be good for now? USB sticks? I was planning on continuing to use an external disk for portage. Would F2FS benefit me there? I will likely use BTRFS on the SSD if F2FS will not benefit me in any way. I love BTRFS and am very familiar with it, and it does have TRIM support. The downside is that BTRFS does not let you use DUP profiles on flash disks, since in theory the disk stores backup copies of cells. This kind of scares me, as bit-rot might occur and a backup is my only solution. My backups are on BTRFS RAID10 mechanical disks though, so no bit-rot there!

Finally, our SSDs and mechanical disks are enterprise. The mechanicals run around $150 a piece and the SSDs are WELL beyond that. Some of the enterprise SAS disks are over $1,000 a piece! Crazy costs.

----------

## eccerr0r

It is indeed disgusting that they are playing games with longevity, even though the drives still tend to outlast mechanical disks.

It all depends on what you're planning to do with the disk.  If you're planning on continued data content churn, expect the disk to last a shorter amount of time.

In any case, the filesystem will have a negligible effect on disk longevity.  Not saying it has no effect, but the large data volume churn is the much, much larger portion of wear on SSDs.

BTW, unless you're running

# while true; do emerge -e @world; done

you're not really doing much writing to the disk.  Likely you do your update once in a while, and the binaries sit static on your disk, safe from erase cycles, and your disk lasts longer.

I wonder if I should hug my mere MLC SSDs as they're becoming more rare these days though technically the amount of data written to it before failure will not be that much larger than a newer, larger SSD with fewer erase cycles...

----------

## msst

*Quote:*

> So what you're saying is that they'll last as long as magnetic despite usage? That is just hard to believe.

What I can contribute is

1. I switched from using mechanical disks for the root to SSD some 6-7 years ago, but kept using mechanical disks on my NAS for bulk data storage.

2. I have occasionally lost mechanical drives, at least 6-7 in total before that switch. I have not lost a single SSD so far. Statistically I should have lost 1-2 if they only lasted as long as mechanical ones.

3. This is based on consumer use. But pretty heavy one including gentoo compiling. Still 2. holds true.

4. I also have a small server running 24/7. The root drive is a SSD. The bulk is a 4 mechanical drive RAID10 for NAS usage. No losses so far.

5. I am using mostly BTRFS now, ext4 earlier. Both seem to give no problems.

My conclusion so far:

I am very sure that for consumer use, even quite heavy one, SSDs are much superior in all aspects except price. For a server that is heavily used I don't want to comment. But for a small server the same seems to hold. No special FS is required any more.

*Quote:*

> With systems like F2FS, what would they be good for now? USB sticks?

Exactly. But only if you tend to use them as a system disk. Those devices are potentially not built for as many random write cycles as SSDs get and are designed for.

----------

## eccerr0r

Also, if it wasn't clear from all the posts: the more levels per cell, the easier the cells are to wear out, because the noise ratio increases with the number of levels.

Typical guaranteed/warranted erase endurance that I've seen:

SLC: 10,000 to 1 million (SSFDC typically rate 100,000 cycles, and I think things like the 24C16 can do 1 million.  Yes SSFDC maxes out at 128MB and the 24C16 is 2KB, and yes K as in Kilo)

MLC (2 bits): 3000

TLC (3 bits): 1000

QLC (4 bits): ??? 300?

But the life is much more than just these numbers.  The larger the disk is relative to your churn rate, the longer before the drive needs to be replaced - this is the "dirty advertising" I alluded to.  However, for consumer use the advertisers are completely correct, because the churn rate of consumer use is exceedingly low.  About the only typical consumer workload that churns heavily is surveillance video, but even that is somewhat limited by compression.

Though I have a measly 180G MLC disk here, I'm lucky I haven't totally filled it yet.  Soon though - I think the disk will be filled to the gills well before I use up even 5% of my wear cycles...
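
Those per-cell cycle ratings turn into a rough endurance budget: capacity times rated P/E cycles, divided by write amplification. A back-of-the-envelope sketch (the drive size, cycle count, and write-amplification factor here are assumed round numbers, not datasheet values):

```shell
# Rough host-write budget before the rated erase cycles are spent.
capacity_gb=250   # e.g. a 250 GB TLC drive (assumed)
pe_cycles=1000    # TLC figure from the list above
wa=2              # assumed average write amplification
tbw=$((capacity_gb * pe_cycles / wa / 1000))
echo "~${tbw} TB of host writes"
```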

----------

## The_Great_Sephiroth

Well my 2TB disk arrived today. Shame I just built the OS on the old mechanical 1TB disk. I suppose I could clone it over, but I probably shouldn't be using DUP on an SSD.

Anyway, an earlier post here mentioned no long checks after a power failure or kernel crash. I cannot speak for Linux as much as Windows, but when a Windows box goes down unexpectedly, you better run a check on that NTFS volume. I know it will boot and run fine, but there ARE issues. 99% of the time after a power failure when I run chkdsk on the NTFS disk, it finds a few minor things. I have even had situations where a client in a bad area would have repeated power failures and keep running the system. Eventually it got slow as heck and had issues and crashes. A chkdsk found and fixed TONS of things, the system started flying again, and all was good. In other words, the journal is nice for a fast recovery, but it is NOT the same as a proper check. Heck, I have had a BTRFS system go down and when I manually ran btrfsck on the volume it found errors.

What I am getting at is that whether or not a journal allows me to get right back up without a check, I need to run a check anyway. We're not beyond Windows 98SE yet. We still need to run a proper check after an unplanned shutdown or crash. At least in Windows!

*EDIT*

Forgot to mention that I am going with F2FS. Look at the article linked below. This is a recent test and on SATA3 SSDs F2FS is in the lead. Maybe because it is designed for this job? Either way, unless I am mistaken GRUB added support for F2FS in 2.04, so once I check that I believe I will be off and running. Worst case scenario is that if I do not like it, I can always format and start again! Also, in the six or seven years I have used BTRFS on my old laptop, I have killed it once. I accidentally did not snap my battery in and while using it, the battery fell out. The check on my volumes only took a minute or two and all was good. I am not worried about losing power.

Ext4, BTRFS, XFS, and F2FS on SSD tests

----------

## NeddySeagoon

The_Great_Sephiroth,

Beware fsck or whatever Windows calls it.

On a damaged filesystem, it makes guesses at what the metadata should look like and it does this with complete disregard for any user data on the filesystem.

It has to do this because the filesystem metadata is incomplete.

Fixing the metadata allows mount to work. What you find after a fsck and mount is anyone's guess.

In a production environment, when fsck reports errors, reach for your backups.

Good luck putting all the fragments in lost+found together again.

Windows does the same thing but puts recovered fragments in the root directory.

----------

## mike155

Whatever filesystem you choose, pay attention to TRIM. You can choose between 'continuous trim' and 'periodic trim'. Some drives don't support 'continuous trim'. Most experts recommend 'periodic trim'.

A ready-to-run cron/timer script for periodic trim comes with util-linux. Just activate it in your crontab or using systemctl.
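
On a systemd box activating it is one command; on a cron-based system it is one line. Both below are a sketch to adapt rather than a drop-in config (the cron schedule and path are illustrative):

```shell
# systemd: enable the weekly fstrim.timer that ships with util-linux
systemctl enable --now fstrim.timer

# cron alternative: an illustrative /etc/crontab entry, Sundays at 03:00
# 0 3 * * 0  root  /sbin/fstrim -a
```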

----------

## msst

*Quote:*

> In other words, the journal is nice for a fast recovery, but it is NOT the same as a proper check. Heck, I have had a BTRFS system go down and when I manually ran btrfsck on the volume it found errors.

Duh, ever since I started running journalling filesystems back in the ext4 days, I have basically not run a fsck on the same disk for years. With exactly zero negative effects. So for me the journalling works more than satisfactorily. The fsck after an unclean shutdown was a pest in the non-journalling days. I still remember it vividly!

And I think for BTRFS it is strongly discouraged to even run a manual btrfsck unless it is absolutely necessary because the filesystem does not mount. All the btrfs filesystems I run have been working flawlessly for a few years now. Though there were serious problems earlier in the BTRFS development cycle, which I luckily skipped.

I get a bit the impression you are seeing problems where realistically none are. Is that possible?

----------

## The_Great_Sephiroth

*msst wrote:*

> Duh, ever since I run journalling filesystems from the ext4 days back I have basically not run a fsck for years on the same disk.

That is a lack of understanding, and not smart. Just because the journal allows it to start does NOT mean it is good to go after a failure. I have used ext4 for years before BTRFS on desktops and when power went out the system WOULD come up, but after we got in we'd run fsck. 90% of the time or more it would find simple issues. Not damaged files or anything crazy, but normally incorrect inode counts and the like. It corrected them. We never lost data.

The same can be said for NTFS. After a crash (BSOD) or power failure we normally ran chkdsk and a good chunk of the time it found invalid indexes. Again, not damaged files or missing files as Neddy suggested, but issues with the filesystem. Things like free space being marked as allocated.

So yes, a journal is nice but it is NOT a substitute for proper maintenance when needed. I run chkdsk on my gaming rig monthly and while it never finds anything, if it did that could indicate a failing disk or some other issue I could handle. Same for Linux. I am almost exclusively BTRFS now, but once in a while I run an fsck as a maintenance procedure. Again, never lost data doing this.

----------

## NeddySeagoon

The_Great_Sephiroth,

Running fsck for advice is harmless. Letting it change things is not so harmless.

Take your own example.  *The_Great_Sephiroth wrote:*

> Things like free space being marked as allocated.

You have no idea if fsck made the right decision  ... until one day that free space that wasn't really free is allocated and written to.

That's what backups are for.

----------

## The_Great_Sephiroth

Neddy, you are correct. Backups, specifically those on ReFS or BTRFS, are critical. But running chkdsk or fsck in read-only mode is what I meant. Even then, when I have had to allow it to repair something, I have never lost data. Of course it is RARE that I have needed to run it in repair mode, normally only after said power failure.

Also, I tend to believe that the fsck devs (and chkdsk devs) would be able to do better than "guessing" what data went where. I have seen extreme cases where chkdsk made a bunch of chk files, but 99% of the time it does not do such a thing. Is this documented somewhere? I would like to read about it because I assumed that the repair tool would repair minor issues, not make them worse.

Oh and I do have another question. Since the new disk is an SSD, regardless of the FS I choose, has the in-kernel TRIM support improved since 2014? I have been reading older threads about using "-o discard" when mounting so Linux, like Windows, will TRIM each time a file is deleted. What I read was that the TRIM support in our kernel was very poor compared to Windows and using it like that would be slower than not using TRIM at all. I know I can manually run fstrim once in a while, but what about leaving it on like Windows? Has the support and speed improved after five years?

----------

## NeddySeagoon

The_Great_Sephiroth,

fsck has to guess. The filesystem metadata is inconsistent. There may be reasons to prefer one course of action over another, in which case it's a weighted guess.

It's still a guess though.

Trim, however it's done, is not a command to the SSD to do something. It's just advice to the drive that those blocks no longer contain valid data and can be erased.

What the drive does and when the drive does it is up to the drive.

Suppose the erase block size contains 64 write blocks. A write block is the smallest region the drive can write and an erase block is the smallest region it can erase.

If an entire erase block is trimmed, the drive can do the erase straight away if it wants to. That's the no brainer case.

Suppose only one write block in an erase block is trimmed. The other 63 are in use.

The drive has to copy the used data to a fresh erase block preserving the 63 in use blocks before it can erase the erase block that contained the single write block that had been trimmed.

That's a very bad thing. It's called write amplification.  In a few minutes, a few more write blocks may be trimmed ...

Then there are all the cases in between, where some write blocks are erased, some in use and some trimmed (but not yet erased).

When the drive actually performs the erase depends on lots of factors. It's a complex part of the drive firmware.

So much so that there are drives in the wild that incorrectly erase LBA 0 from time to time.

Whether you use -o discard or fstrim in a cron job is hotly debated. They are both advice to the drive, not a command.

----------

## The_Great_Sephiroth

I do not believe that it is advice. It actually makes it do the read-update-write cycle instantly. You may test this on a USB flash stick. Mount it with discard and write a bunch of files. Unmount, mount it without discard, write those files again and it is MUCH faster. Then issue fstrim and watch the busy light on the stick go bonkers.

What I am getting at is this: mounting with discard makes writes slow due to the constant trimming. Mounting without discard vastly increases write speed, but no TRIM command is ever issued, so the user must schedule it (desktop) or run it manually once in a while (laptop). I do understand how erase blocks work, and normally the smallest erase block size is 4MiB nowadays. This is why I align my first partition after the first 4MiB.
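
Checking that alignment is simple arithmetic once you have a partition's start sector from `fdisk -l`. A sketch, assuming 512-byte sectors and the 4 MiB erase block mentioned above (both worth verifying for your particular drive):

```shell
# Is the partition's start sector aligned to a 4 MiB erase block?
# Assumes 512-byte sectors; start_sector would come from `fdisk -l`.
start_sector=8192
sectors_per_4mib=$((4 * 1024 * 1024 / 512))   # 8192 sectors per 4 MiB
if [ $((start_sector % sectors_per_4mib)) -eq 0 ]; then
  echo "aligned"
else
  echo "misaligned"
fi
```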

I may be misunderstanding something, but the documentation I read on Linux mount option discard and the Linux fstrim command states what I just posted. If I am incorrect can you point me to updated documentation? I am not against reading and updating my opinions.

----------

## NeddySeagoon

The_Great_Sephiroth,

I believe what you say for USB sticks and any flash device without wear levelling.

It's interesting that you say the busy light comes on on USB sticks. The USB interface is not busy during the garbage collection caused by trim.

I have a real SSD connected to my Raspberry Pi via a USB/SATA bridge. The bridge does not support trim, so every now and again I plug the SSD into my PC, run fstrim, then disconnect the SSD.

Rebooting the Pi takes between 5 and 10 minutes because the SSD does not come ready until the erases are complete. At least, I think that's what the delay is. It only happens after running fstrim in the PC.

I need to flash the USB/SATA bridge to get it to support trim.

----------

## mike155

 *Quote:*   

> the smallest erase block size is 4MiB nowadays. This is why I align my first partition after the first 4MiB.

 

Please remember that SSDs re-arrange data written to the SSD. On SSDs, there's no static 1:1 mapping between logical blocks and physical blocks. 

For that reason, it probably doesn't make sense to use 4 MiB boundaries for partitions.

Please look at this excellent article that explains how SSDs work internally: http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf

----------

## msst

Way too complicated discussion here for me, the only thing I can contribute is

 *Quote:*   

> We never lost data.
> 
> 

 

that I also have not lost any data since I stopped running fsck, which was when I switched to a journalling fs. And I would bet quite a few people have stopped running fsck since then. Should we not have heard something by now if ext4's journalling mode caused trouble without regular fsck runs?

----------

## eccerr0r

How many USB sticks support trim?  Is this common now?

It was a godsend that USB sticks supported some sort of wear leveling, but trim as well?

----------

## The_Great_Sephiroth

My old 16GB sticks do not support trim. I had a 32GB USB3 stick that claimed to, but I need to back it up before I experiment with that!

Every technical document that I have ever read on solid-state media says to align to the erase block size. I am mobile on my phone now but will post countless documents as to why you SHOULD do this when I get home. It has nothing to do with some 1-to-1 mapping.

Neddy, I will play with my SSD and test these options before installing the OS and post the results.

----------

## eccerr0r

I think the correct wording is that you shouldn't straddle a filesystem write block across an erase block boundary. Other than that, it doesn't really matter that a filesystem straddles a boundary. On the other hand, if you can avoid block 0, which includes the partition table, you can increase the longevity of an SSD with poor wear-levelling logic by simply partitioning out bad areas until block 0 no longer records properly. This, however, should not be the case or needed for modern SSDs.

The block mapping that Neddy is referring to is internal to the SSD, for example my Sandforce SSD.  For some reason I am sure I have exceeded 1% of its erase cycles by now as witnessed by the number of bytes I've written to it,  but due to compression, it has used fewer erase cycles due to writing fewer erase blocks than what would have normally been used to handle all the writes completed.

Note that you will not likely see any speed timing differences until you've written all blocks of the drive at least once as this will finally force the need for an erase cycle.

----------

## NeddySeagoon

eccerr0r,

Most SSDs are over provisioned. That is, they have a pool of spare erase blocks that are not included in the user space of the drive.

These blocks are included in wear levelling and so on.

That means the drive swaps out an erase block needing to be erased with one from the over-provisioned pool, so the erase-time penalty is much reduced, at least until the drive can't erase blocks in the over-provisioned pool fast enough to keep up with the demand generated by writes.

----------

## eccerr0r

Over-provisioning is the reason for yet more strange behaviour... Yes, the drive needs to keep erased blocks ready for use. They are not technically "spare" but rather part of the pool of ready blocks.

Which brings up another oddity: whether "spare" blocks are really spares or just part of an over-counted pool from the start. "Spare" blocks on an SSD don't make much sense; replacing a worn-out block with an unused block isn't much of a service when all the other blocks on the disk are equally likely to go soon. They are probably better off as part of the "clean"/ready-to-write pool shared with the rest of the "main" blocks, so that all blocks wear equally.

My 180G SSD would seem to be overprovisioned at some ridiculous percentage, so far I've theoretically gone through ~62 erase cycles based on data written, but still at 0% wearout.  As it uses MLC storage, it should sustain 3000 cycles and 62 erase cycles is much more than 1%.  If the chips are 5000 cycle it'd still be over 1% usage which should show up in SMART...
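The endurance arithmetic above can be checked quickly; the 3000-cycle MLC rating and the ~62 full-drive writes are the figures from the post:

```shell
CYCLES=62      # approximate full-drive writes, from bytes written in SMART
RATED=3000     # rated P/E cycles for MLC NAND
USED_PCT=$(( CYCLES * 100 / RATED ))
echo "${USED_PCT}% of rated endurance consumed"
```

That comes to about 2%, well above the 1% threshold, which is why a SMART wearout reading still at 0% suggests the drive is writing far fewer erase blocks than the raw byte count implies (over-provisioning, compression, or both).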

----------

## szatox

You can limit the "visible" size of some SSD's with hdparm ( -N p<number> - destructive for data on that disk)

This creates that second, hidden pool, which can be trimmed in the background - by firmware - without an explicit hint from the OS.

So, if you do that, you get a visible "readable" pool, and a hidden "writable" pool. So, you read data from "visible" pool that actually holds your data, you write to a block from hidden pool (swapping that block into readable pool in the process), and the former readable block gets trimmed into that hidden pool.

I guess losing a few blocks simply reduces the size of your hidden pool in this case. Once you run out of empty blocks in that hidden pool, you'll see write performance degrade due to forced in-line wipes.
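A sketch of that hdparm workflow (the sector counts are made-up examples, and the clip command is destructive, so both device commands are left commented out):

```shell
# Show the visible vs. native max sector count (read-only, safe):
#   hdparm -N /dev/sdX
# Permanently clip the visible size -- DESTROYS DATA on that drive;
# the leading "p" makes the setting survive a power cycle:
#   hdparm -N p468862128 /dev/sdX

# How much extra hidden pool would the clip above create?
NATIVE=500118192     # sectors reported as native max (example value)
CLIPPED=468862128    # sectors visible after the clip (example value)
echo "$(( (NATIVE - CLIPPED) * 100 / NATIVE ))% of the drive added to the hidden pool"
```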

----------

## Goverp

My conclusion from similar discussions about wear levelling and the internals of SSD memory is to just use F2FS and hope the authors (AFAIR someone in Samsung) have taken it all into account. As they make a lot of the stuff, they ought to get it right.  Am I deluding myself?

----------

## NeddySeagoon

Goverp,

That's rather like assuming Boeing get it right with things like the 737 Max.

Well, the anecdotal evidence about some SSDs randomly erasing LBA 0 (the partition table) was happening to Samsung drives.

F2FS does a lot of the things that SSD firmware does now, including wear levelling. Why do it twice?

----------

## Fitzcarraldo

 *The_Great_Sephiroth wrote:*   

> Mike, I ruled out ext4 because it is a journaled system.

 

It is possible to create an ext4 partition without journalling:

```
mkfs.ext4 -O ^has_journal /dev/sda2
```

It is also possible to disable (and enable) journalling in an existing unmounted ext4 partition:

```
tune2fs -O ^has_journal /dev/sda2
```

If a user does want ext4 to journal, it is even possible to configure ext4 to place the journal on a different device/partition.

So ruling out using ext4 on an SSD for reasons of journalling is not really a valid argument, because journalling is optional in ext4.

```
# dmesg | grep EXT4

[    1.947113] EXT4-fs (sda2): mounted filesystem without journal. Opts: (null)

[    2.063456] EXT4-fs (sda2): mounted filesystem without journal. Opts: (null)

[    5.351241] EXT4-fs (sda2): re-mounted. Opts: (null)

# blkid

/dev/sda1: UUID="49CA-4AC5" TYPE="vfat" PARTUUID="22be1d7f-92b3-4c4f-b2d3-d4d160a5139f"

/dev/sda2: UUID="85440e0c-2335-4746-97be-7ad3efc9a43b" TYPE="ext4" PARTUUID="f50d5459-b956-f348-b23b-224b127bbb6b"

/dev/sda3: UUID="bf467dd7-9d64-4b0a-a732-50bfa16f7c97" TYPE="swap" PARTUUID="047bbada-08c5-c142-b604-dfe874d65cf3"
```

----------

## The_Great_Sephiroth

Lots of detailed info here. I am now reading about things I had no clue about. I did not realize that as of 2015 some SSDs could start losing data in as little as seven days without power, or that SLC and MLC were more reliable than TLC with respect to bit-rot. My work laptop gets used enough, but I am concerned about bit-rot now. While F2FS is light-years faster than BTRFS, ext4, or other systems on an SSD, only BTRFS offers bit-rot protection. However, BTRFS will not allow DUP mode on an SSD, so there is no protection. The SSD is supposed to check for data corruption, but what if there is corruption? Does the SSD keep a good copy? If not, do I just lose data and resort to restoring backups?

The more I read the more I regret purchasing an SSD over an HDD. Data integrity is God here. SSDs seem to have so much against them, but everybody loves speed. I am seriously considering never using an SSD again. At least until they are as reliable as an HDD. I can do DUP mode on an HDD and never lose a single bit of data. What do I do on an SSD to prevent bit-rot?

*EDIT*

I forgot to mention that while I considered making two identical-sized partitions on the SSD and doing BTRFS RAID1 with them, I have chosen not to because I am fairly sure that most SSDs will see identical data and just reference the data for the second copy. This means that if the actual copy on the SSD is corrupted, both copies are since only one exists. Am I correct here?

----------

## Fitzcarraldo

I too am wary about the UBER (Uncorrectable Bit Error Rate) in SSDs. Has much changed in the last three or four years with SSD technology? The last two articles I read on SSD reliability were both from 2016 (see links below). Have you seen any recent articles on the latest SSDs?

CyberStreams - 2016’s SSD (Solid State Drive) Reliability Report

 *Quote:*   

> Overall, SSD flash drives experience significantly lower replacement rates (within their rated lifetime) than hard disk drives. The only catch is that they experience significantly higher rates of uncorrectable errors than hard disk drives.

 

ZDNet - SSD reliability in the real world: Google's experience (February 25, 2016)

 *Quote:*   

> KEY CONCLUSIONS
> 
> • Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number.
> 
> • Good news: Raw Bit Error Rate (RBER) increases slower than expected from wearout and is not correlated with UBER or other failures.
> ...

 

 *Quote:*   

> But it isn't all good news. SSD UBER rates are higher than disk rates, which means that backing up SSDs is even more important than it is with disks. The SSD is less likely to fail during its normal life, but more likely to lose data.

 

 *The_Great_Sephiroth wrote:*   

> *EDIT*
> 
> I forgot to mention that while I considered making two identical-sized partitions on the SSD and doing BTRFS RAID1 with them, I have chosen not to because I am fairly sure that most SSDs will see identical data and just reference the data for the second copy. This means that if the actual copy on the SSD is corrupted, both copies are since only one exists. Am I correct here?

 

I don't use BTRFS, but RAID1 does not protect against data corruption. RAID1 is not a backup solution.

----------

## Anon-E-moose

I've been running a samsung 840 evo since Dec 2014 and have had 0 errors of any kind. 

I have several 850's that are a few years newer, but still the same reliability. 

I still backup my drives onto a mirrored raid, and one offline drive but that's just standard precaution IMO.

Unless you're buying el cheapo ssd, then I wouldn't worry about it. 

Note the above is standard ssd, not the newer nvme style ssds. I have little experience with those, (only have one in a notebook and it gets minimal use).

----------

## mike155

 *The_Great_Sephiroth wrote:*   

> I did not realize that as of 2015 some SSDs could start losing data in as little as seven days without power. 

 

I've never heard about that and it's hard to believe. Please post a link.

----------

## NeddySeagoon

The_Great_Sephiroth,

Provided you can detect a bit error(s) you can guard against them.

Undetected errors result in data corruption. 

When an error is detected, one of two things happen.

a) the error is corrected, as happens with ECC RAM.

b) the error is detected but nothing can be done except flag it (fail the read)

For the sake of completeness,  ECC RAM can detect and correct all single bit errors and detect but not correct two bit errors.

For higher error counts, anything can happen.

In the case of your raid1, a failed read on one copy would result in the other copy being read.

mdadm would rewrite the failed read too, so it was good next time.

Read Partial-response maximum-likelihood and be afraid for your data on rotating rust.

Be very afraid. :)

SSDs do error correction too. The causes and consequences of errors reading SSDs and magnetic hard drives are quite different, so different error correction techniques are required.

If you want to go the raid1 route, you need 3 mirrors. With only 2 mirrors when one fails, you no longer have a backup.

Raid is not backup anyway, but you are not using raid here as a backup in the conventional sense; it's to ensure data integrity.

----------

## msst

I do think you are exaggerating and creating some sort of unnecessary panic for yourself here, but if

 *Quote:*   

> Data integrity is God here.

 

 *Quote:*   

> but I am concerned about bit-rot now

 

Then use BTRFS RAID 1 or 10 with regular scrubbing, and in addition use borgbackup for incremental, independent backups. Problem solved. If there was any.

----------

## Anon-E-moose

 *msst wrote:*   

> I do think you are ... creating some sort of unnecessary panic for yourself here

 

+++

It's computer equipment, something could always happen, whether ssd or hdd, I've had several hdd's (over the years) refuse to spin up or have bad sectors appear out of the blue. That's why you do backups. I still use hdds for backup media and general media storage. I use the ssd as the main drive in both my desktop and laptop, because it's a lot faster. But I still do backups of both the ssd's and the hdd's. There are no guarantees in life, except that there are no guarantees.

----------

## szatox

 *Quote:*   

> In the case of your raid1, a failed read on one copy would result in the other copy being read.
> 
> mdadm would rewrite the failed read too, so it was good next time

 I'd be careful with that.

I've tested mdraid 6 against data corruption on 1 out of 4 drives. Scrubbing it simply recalculated all checksums destroying the test data.

So, if the hard drive reports a read error, you're fine. But there are no sanity checks against single strip corruption.

----------

## Fitzcarraldo

 *NeddySeagoon wrote:*   

> In the case of your raid1, a failed read on one copy would result in the other copy being read.
> 
> mdadm would rewrite the failed read too, so it was good next time. 

 

A failed read on one of the drives. But in the case of a successful read of corrupt data on one of the RAID1 drives, the RAID software would not know which of the two is correct data. Or is that impossible with SSDs? In the case of using mdadm with HDDs it is possible to 'scrub' a RAID1 array to check for such errors: https://wiki.archlinux.org/index.php/RAID#Scrubbing

 *Arch Linux Wiki wrote:*   

> It is good practice to regularly run data scrubbing to check for and fix errors. Depending on the size/configuration of the array, a scrub may take multiple hours to complete.
> 
> To initiate a data scrub:
> 
> ```
> ...

 

 *Arch Linux Wiki wrote:*   

> Note: Users may alternatively echo repair to /sys/block/md0/md/sync_action but this is ill-advised since if a mismatch in the data is encountered, it would be automatically updated to be consistent. The danger is that we really do not know whether it is the parity or the data block that is correct (or which data block in case of RAID1). It is luck-of-the-draw whether or not the operation gets the right data instead of the bad data.

 

 *Arch Linux Wiki wrote:*   

> Due to the fact that RAID1 and RAID10 writes in the kernel are unbuffered, an array can have non-0 mismatch counts even when the array is healthy. These non-0 counts will only exist in transient data areas where they do not pose a problem. However, we cannot tell the difference between a non-0 count that is just in transient data or a non-0 count that signifies a real problem. This fact is a source of false positives for RAID1 and RAID10 arrays. It is however still recommended to scrub regularly in order to catch and correct any bad sectors that might be present in the devices. 

 

----------

## NeddySeagoon

Fitzcarraldo,

Now we are down in the weeds of error detection and correction.

----------

## The_Great_Sephiroth

The link to the information about temperature and being able to lose data in seven days is posted here. Also, BTRFS RAID1 does normally prevent bit-rot in that if one disk rots, it realizes this and corrects from the other. This does not help with RAID1 through mdadm, lvm, or any other normal means.

At this point I believe I will go with F2FS simply because it is faster than BTRFS many, many times over and BTRFS cannot prevent bit-rot on a modern SSD. I have to hope that the SSD (a WD Blue in this case) is good enough to detect and correct errors, and back up to my BTRFS RAID10 system weekly. My biggest concern is that I save a file on Tuesday and when I back it up Saturday I am backing up a corrupted file. What I may do is see if I can somehow hook something in so that whenever a file in my home directory is written, a SHA-1 sum is calculated and stored in a file with the same name and a .sha1 extension. Then I could EASILY write a backup script which only backs up files that match their SHA-1 sum and informs me at the end which ones, if any, do not match, so I can make the call on backing them up.
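A minimal sketch of that idea, assuming GNU coreutils (sha1sum, cp --parents); the SRC and DEST paths are placeholders:

```shell
SRC=${SRC:-$HOME/docs}
DEST=${DEST:-/mnt/backup}

# Refresh the stored checksum sitting next to every data file.
record_sums() {
    find "$SRC" -type f ! -name '*.sha1' | while read -r f; do
        sha1sum "$f" > "$f.sha1"
    done
}

# Copy only files whose current hash still matches the stored one;
# report anything that no longer matches so a human can decide.
backup_verified() {
    find "$SRC" -type f ! -name '*.sha1' | while read -r f; do
        if sha1sum -c --status "$f.sha1"; then
            cp --parents "$f" "$DEST"
        else
            echo "MISMATCH (possible rot): $f" >&2
        fi
    done
}
```

Calling record_sums from an editor hook or a daily cron job, and backup_verified at backup time, would give the Tuesday-to-Saturday window described above a tripwire of its own.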

----------

## Anon-E-moose

WD Blue's aren't really enterprise class drives, whether ssd or hdd. 

 *Quote:*   

> My biggest concern is that I save a file on Tuesday and when I back it up Saturday I am backing up a corrupted file

 

You're no more likely to have that happen on an ssd than an hdd, or with btrfs vs ext4 vs f2fs.

----------

## NeddySeagoon

The_Great_Sephiroth,

```
$ eix tripwire

* app-admin/tripwire

     Available versions:  2.4.3.7 {libressl selinux ssl static +tools}

     Homepage:            http://www.tripwire.org/

     Description:         Open Source File Integrity Checker and IDS
```

----------

## mike155

 *The_Great_Sephiroth wrote:*   

> The link to the information about temperature and being able to lose data in seven days is posted here.

 

Very interesting link. Thanks! Especially the JEDEC Powerpoint presentation (link inside the ExtremeTech presentation).

Do you think that the tables in the ExtremeTech article show that a SSD loses data after 52 weeks @ 30°C or after 14 weeks @ 40°C? That would be wrong! 

I'm still trying to figure out what those numbers really mean. I will post an explanation as soon as I understand them.

----------

## The_Great_Sephiroth

Neddy, I will look into tripwire ASAP, thank you!

Moose, that is fairly false. I currently use BTRFS in DUP mode on an HDD. To have the same problem, not one but TWO sectors would need to fail or rot, and they would have to be the EXACT sectors holding the two copies of the file's data. Those odds are astronomically low. Running BTRFS in single mode, I agree: that same data could be corrupted by a single sector failing or rotting. Your statement is true for ext4, however. BTRFS DUP mode puts two copies of the metadata and two copies of the data for each file on the disk, and not in adjoining sectors either. This is why I fell in love with BTRFS. Sadly, every recent test (some as recent as September this year) shows BTRFS far behind in performance on regular SATA SSDs, likely due to CoW. Ext4, XFS, and F2FS all beat it. F2FS is regularly the fastest SATA SSD filesystem, with ext4 right behind it, normally tied. However, in testing, ext4 wears out flash media faster than F2FS, so at the same speed F2FS extended the life of some SSDs by almost 70%. But none of these systems have the bit-rot protection I am used to.

I guess it is like this. I have not had ANY data corruption or loss since switching to BTRFS and now I have to give that up because I went with a physical media that defeats BTRFS in several ways. It's like wearing pants your entire life and then having to go to work in only your underpants. You're a tad nervous about it!

*EDIT*

I forgot to ask. What scheduler would be best for an SSD? I have read that deadline may be good for a SATA drive, but not an NVMe drive. In fact, that is echoed on the Debian Wiki.

----------

## tholin

 *The_Great_Sephiroth wrote:*   

> The link to the information about temperature and being able to lose data in seven days is posted here.

 

The article says "In worst-case scenarios or high storage temps, the data on an enterprise drive can start to fail within seven days". The slides go on to state that the worst-case scenarios involve completely depleting the P/E cycles of the chips before the retention test. That's important because higher P/E cycle counts mean decreased retention time. The SSD I use on my Gentoo desktop has been in use for 6 years 9 months and I've depleted 8% of the rated P/E cycles. Under normal use you'll never get close to the rated P/E cycles. The worst-case storage temperature in that test was 55°C, which is also bad for retention.

That article was shared a lot some years back because of the click bait headline but under normal conditions SSDs retain data for many years.

----------

## mike155

 *tholin wrote:*   

> That article was shared a lot some years back because of the click bait headline but under normal conditions SSDs retain data for many years.

 

I completely agree. 

There's a good explanation at: https://www.anandtech.com/show/9248/the-truth-about-ssd-data-retention

 *Anandtech wrote:*   

> All in all, there is absolutely zero reason to worry about SSD data retention in typical client environment. Remember that the figures presented here are for a drive that has already passed its endurance rating, so for new drives the data retention is considerably higher, typically over ten years for MLC NAND based SSDs. 

 

@The_Great_Sephiroth: stop worrying and enjoy your SSD   :Smile: 

----------

## eccerr0r

 *mike155 wrote:*   

> @The_Great_Sephiroth: stop worrying and enjoy your SSD  :)

 

... and stop spreading misinformation about ext4fs journaling reducing SSD life by 70%.

Just because it does so for MTD devices with no/poor wear leveling doesn't mean it's the same for all SSDs.

----------

## The_Great_Sephiroth

I never said ext4 with journaling reduced life by 70%. I said that in multiple tests which I have been following F2FS has extended life by up to about 70%. This includes testing against BTRFS, XFS, ZFS, and others. I am at work now but when I get home I will link said tests. Results do vary slightly but all show extended life with the filesystem designed for SSDs.

----------

## mysterious

f2fs is awesome and I use it for the root file system. However, you should be aware that there is an issue with sys-fs/f2fs-tools-1.12.0-r1. Boot fails with this version of f2fs tools. The fix for this issue was to revert back to version 1.11.0. Recently, "unstable" version sys-fs/f2fs-tools-1.13.0 has been released. However, I didn't try it yet.

Here is a link to a forum post discussing the 1.12.0-r1 boot failure: sys-fs/f2fs-tools-1.12.0: breaks boot on f2fs root

----------

## The_Great_Sephiroth

I am on a break and using my Galaxy Tab A 8 to write this. I have a Bluetooth keyboard here, but forgive me if I have any typos, I am not as good on the small keyboard as I am on a normal one.

Here is the thread on Reddit where a user tested filesystems on four SD cards and F2FS showed increased life up to about 70%. I read this one last. Let me find the other tests and I will update this post with links as well. Just give me some time as I am not on my laptop or desktop!

*UPDATE*

Here is a slightly older review of F2FS in comparison with ext4. Note that the big thing here is speed difference.

----------

## NeddySeagoon

The_Great_Sephiroth,

SD cards are not SSDs.

----------

## mike155

 *Quote:*   

> a user tested filesystems on four SD cards and F2FS showed increased life up to about 70%. 

 

If I had a SD card, I would also prefer F2FS over ext4.  :Smile: 

But we are talking about SSDs, aren't we? SSDs and SD cards are entirely different.

----------

## The_Great_Sephiroth

Yes, but I have read about how F2FS works with the FTL to improve both performance and life. I just updated my last post and am reviewing browser history to find more of the good reading material I have been going through. I use F2FS on a 16GB USB stick for portage on my old laptop as it stands and it IS faster than ext4 was on the same stick.

*EDIT*

So what about the scheduler? I am under the impression that I should use deadline for an SSD, not CFQ since CFQ was designed to minimize head movement, or so I have read.

----------

## Fitzcarraldo

 *mike155 wrote:*   

>  *Quote:*   a user tested filesystems on four SD cards and F2FS showed increased life up to about 70%.  
> 
> If I had a SD card, I would also prefer F2FS over ext4. 
> 
> But we are talking about SSDs, aren't we? SSDs and SD cards are entirely different.

 

True, but Samsung also targeted SSDs with F2FS:

 *LWN, October 10, 2012 wrote:*   

> Unlike jffs2 and logfs, f2fs is not targeted at raw flash devices, but rather at the specific hardware that is commonly available to consumers — SSDs, eMMC, SD cards, and other flash storage with an FTL (flash translation layer) already built in.

 

 *Joo-Young Hwang, Samsung, Feb 22, 2013 wrote:*   

> • NAND Flash-based Storage Devices
> 
> –SSD for PC and server systems
> 
> –eMMCfor mobile systems
> ...

 

https://elinux.org/images/1/12/Elc2013_Hwang.pdf

----------

## mike155

Fitzcarraldo, thanks for the document. 

The tests shown in the document (starting in page 19) were performed on a eMMC card, not on a SSD.

That's also what the summary (last page) says: F2FS is optimized for mobile flash storages - CAN also work on SSDs.

There's absolutely no doubt that... 

F2FS is a good file system for SD cards and USB sticks, almost certainly superior to ext4.

F2FS also works on a SSD

But I haven't seen a single proof or at least evidence yet for the statement that F2FS is better than ext4 on SSDs for the average user.  

The special feature of a SSD is the controller - which in fact is a computer of its own. The controller does all the magic that is required to get rid of all the idiosyncrasies and disadvantages that raw flash chips have. And that's why it's so important to buy a SSD with an excellent controller - and not a cheap one that in fact isn't any better than a SD card.

In case you haven't done yet, please read the textbook article that explains how SSDs work internally: http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf

----------

## Fitzcarraldo

mike155,

For tests on SSDs (SATA and PCIe using Ubuntu 12.10 Server) comparing F2FS, ext4, Btrfs and NILFS2, see e.g. the following Samsung paper from the Proceedings of the 13th USENIX Conference on File and Storage Technologies (2015):  https://www.usenix.org/system/files/conference/fast15/fast15-paper-lee.pdf

You could argue that the tests in that paper do not represent an average user (whatever that is), but their results look interesting enough to consider using F2FS on an SSD instead of ext4 (which I use for my HDDs -- laptops, desktops, HTPC and server -- and really like). I'm sure ext4 works fine on modern SSDs, as you and some others have posted in this thread, but I don't think anyone can tell The_Great_Sephiroth categorically that he should consider ext4 over F2FS. From the various articles and posts I've seen, F2FS looks like a reasonable choice (perhaps a better choice, who knows) for an SSD if he wants to try it.

----------

## axl

so little love for xfs  :Sad: (

----------

## axl

Well, in all my years as a Linux user and a Gentoo user, I'll tell you: the thing I've had the fewest problems with, EVER, of any component of any system, is XFS.

I use XFS extensively on everything. New raspberry PI4... great... let me recompile the kernel with XFS in it... and other goodies. But the point is, XFS was always a starting point of mine. And for me, xfs didn't kill any drive. In fact it saved some data for me. 

It's one of those things you don't think about, but was always there. IN EVERYTHING.

----------

## axl

Tip: try blktrace and iowatcher (I think that's the name of the new utility to map it out; the older one still in portage doesn't work anymore, not that it's complicated to build a new one). The point is, if you enable CONFIG_BLK_DEV_IO_TRACE=y in the kernel and use blktrace, you can map out when something reads and writes, and can make really cool graphs.  :Smile:

stuff like this: https://www.youtube.com/watch?v=UQ12PS5x53U

Anyway... that's a cool toy to play with. and yes, that is my video, reading through my ssd. in march 2017. 

Wouldn't you think, if you feared that stuff gets overwritten too often and you feared your fs, that you'd want to map out what it's doing?

I was curious. So I tested. Funny enough I remained with XFS. 

Make up your own test.  :Smile: 

PS actually I wasn't reading the disk in the test. I don't remember what I was doing. But there's clear reading and writing.

----------

## axl

 *axl wrote:*   

> https://www.youtube.com/watch?v=UQ12PS5x53U

 

btw, the speed scale which often hits 1000MB/s is from an intel 750 drive. I know that because that was my top dawg in the farm. 

but now I have 2 900 series running around. and I didn't graph them yet. 

I dont know what people consider a ssd these days... but I love mine:

```
[root@sanziana:~]# lspci|grep memory

b3:00.0 Non-Volatile memory controller: Intel Corporation Optane SSD 900P 
```

there she is. in pci near the cpu. I loved the 750 but the 900 is way freaking better.

----------

## tholin

 *The_Great_Sephiroth wrote:*   

> So what about the scheduler? I am under the impression that I should use deadline for an SSD, not CFQ since CFQ was designed to minimize head movement, or so I have read.

 

CFQ was a legacy I/O scheduler; it was removed along with the rest of the legacy block layer in kernel 5.0.

----------

## Goverp

AFAIR the current default scheduler is "none", which is supposedly good for SSD's and their ilk (there are no head movements blocking IO).

Otherwise current kernels have a choice of bfq and mq-deadline. IIUC, as a gross oversimplification, mq-deadline is best for servers (best throughput) and bfq for desktops (it guarantees some bandwidth for interactive work).
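For anyone wanting to check or change this on a running system, the active scheduler is the one shown in brackets in sysfs. A small sketch (the device name is an example, and the commented lines need root):

```shell
# Pure helper: pull the bracketed (active) scheduler out of the sysfs line.
active_sched() { sed 's/.*\[\(.*\)\].*/\1/'; }

# On a real system:
#   cat /sys/block/sda/queue/scheduler          # e.g. "[mq-deadline] none"
#   echo bfq > /sys/block/sda/queue/scheduler   # switch at runtime (root)

echo "[mq-deadline] none bfq" | active_sched
```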

----------

## The_Great_Sephiroth

Thank you, Goverp. I am on the current stable kernel, which is 4.19.72 unless I am mistaken. When Gentoo moves to 5.x, I will be there too! I try to stay in-tree as often as possible. Let Portage deal with my software so I don't have a bunch of manual things to deal with.

Axl, I did consider XFS in the beginning, but for my use-cases it is not up to snuff. On mechanical disks it does not do bit-rot protection or checksums, so while it may be fine, it is not for me; BTRFS and/or ZFS are. I am only now looking into SSD filesystems and I have settled on F2FS due to the speed improvements it offers as well as the extended lifespan of said SSD.

----------

## axl

 *The_Great_Sephiroth wrote:*   

> Axl, I did consider XFS in the beginning, but for my use-cases it is not up to snuff. On mechanical disks it does not do bit-rot protection or checksums, so while it may be fine, it is not for me, BTRFS and/or ZFS are. I am only now looking into SSDs filesystems and I have settled on F2FS due to the speed improvements it offers as well as extended lifespan of said SSD.

 

XFS does now. mkfs.xfs -m crc=1 gives a different fs from -m crc=0; this was implemented in the last few years. They also added xfs_fsr, a defrag utility, which is completely lovely.

PS 

CONFIG_XFS_ONLINE_SCRUB=y

CONFIG_XFS_ONLINE_REPAIR=y

They show up as new options under XFS in the kernel. But XFS does have checksums now.

----------

## axl

 *The_Great_Sephiroth wrote:*   

> I am only now looking into SSDs filesystems and I have settled on F2FS due to the speed improvements it offers as well as extended lifespan of said SSD.

 

Well, I am speaking from an SSD perspective. I have valuable NVMe drives: 3 Intel 750s and 2 Optanes. My personal property, not the place where I work, nothing like that. MY OWN. And I totally entrusted them to XFS. I am talking about the older 750 drives; I think they are about 4 years old, and still at 99% health. And there was no reason to doubt XFS for the new ones, which have only been working for a few weeks.

I totally trust XFS, for both mechanical disks and SSDs. I was never tempted to test F2FS. Not even once. Why would I?

As far as I know, F2FS was created by Samsung for MMC cards (to replace vfat/exfat), and some people just ran with it. There's nothing to suggest it's a good idea to keep a rootfs on it. At least I haven't seen anything.

----------

## eccerr0r

Given the similar-sounding names "eMMC" and "MMC", I suspect the former is simply an on-board ("embedded") version of the latter. I don't know about the wear-leveling reliability of eMMC (as far as I know it's fairly good, as witnessed by the longevity of phones, tablets, etc. that use it for primary/root storage), but the removable variants of MMC / SD / USB flash sticks leave something to be desired, and f2fs will benefit these.

I'll need to set up another, more recent Gentoo install on a flash stick (SD/USB) and see how long it lasts before it fails with f2fs... past experience says it's a bad idea to run Gentoo off removable flash media on ext2, though I do have one Lexar-branded USB flash stick that has an old Gentoo install on it. I dared not emerge @world on it without a PORTAGE_TMPDIR on some other media whose erase cycles I don't need to worry about.
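Pointing the build scratch space elsewhere is a one-line make.conf change; the HDD path below is just an example:

```
# /etc/portage/make.conf -- keep Portage's build churn off the flash stick
PORTAGE_TMPDIR="/mnt/hdd/tmp"
```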

----------

## The_Great_Sephiroth

I have a 1GB Lexar Firefly drive with FAT32 on it THAT STILL WORKS. How it works is beyond me. I used it new in 2006. This drive is insane. I do sha1sum files on it now and it still seems to work!
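Checksumming like that needs nothing beyond coreutils: record the sums once, then re-verify later to catch silent corruption. A sketch (paths are just examples):

```shell
# Record a checksum for a file, then verify it later.
mkdir -p /tmp/stick-demo
printf 'some data\n' > /tmp/stick-demo/file.bin
sha1sum /tmp/stick-demo/file.bin > /tmp/stick-demo/file.sha1

# Re-running the check reports OK while the data is intact:
sha1sum -c /tmp/stick-demo/file.sha1   # prints: /tmp/stick-demo/file.bin: OK
```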

----------

## Fitzcarraldo

Coincidentally, the laptop HDD partition in which I store all my data files is nearly full and I had been wondering what to do for several months. I was not looking forward to replacing the 750GB HDD with a higher-capacity HDD or SSD and having to re-install Gentoo. The 13.3-inch laptop does not have a second drive bay but it does have two mSATA slots for SSDs, so I have just bought a 256GB mSATA SSD module (INDMEM Global Corp., Model: DMMS, 256GB mSATA 3D NAND MLC Flash SSD SATA 3) to provide an extra partition for my data. I installed the SSD today and have formatted it as F2FS to see how the filesystem performs. The SSD will not be as heavily used as the HDD (which uses ext2 on the boot partition, ext4 on the root partition and on the home partition, and NTFS on a partition for data), but it will still be interesting to see how the SSD performs over time using F2FS. I will be mainly copying photos and documents to it, and viewing and editing those files. I will also configure a Cloud client to sync the directories on that drive with my Cloud server.
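For anyone doing the same, an fstab entry for such an F2FS data partition might look like this (device, mount point, and options are assumptions, not Fitzcarraldo's actual setup):

```
# /etc/fstab -- hypothetical F2FS data partition; noatime avoids needless writes
/dev/sdb1   /mnt/data   f2fs   defaults,noatime   0 0
```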

----------

## DigitalCorpus

 *The_Great_Sephiroth wrote:*   

> So what about the scheduler? I am under the impression that I should use deadline for an SSD, not CFQ since CFQ was designed to minimize head movement, or so I have read.

 

I generally don't like Phoronix's benchmark suite/results because they're more indicative of enterprise-ish workloads and not enthusiast workloads. That said, they have good stuff sometimes:

Kernel 4.17 Schedulers on [Optane NVMe] SSD

Kernel 4.19 Schedulers on [SATA] SSD

Kernel 4.20 Schedulers on [NVMe] SSD

Kernel 5.0 Schedulers on [120 GB NVMe] SSD

However, I can't find the exported "low_latency" variable in /sys or /proc

Edit: BFQ wasn't selected by default, so nvm...

----------

## 389292

I've been running my mining rig from a USB3 flash drive with ext4 for almost 2 years now, and it keeps going. No RO optimizations or anything like that, just a regular install.

I was quite surprised, actually. My first SSD died after only 1.5 years; it was the infamous OCZ Agility. So it depends on many factors, not just your FS.

My current SSD is running ext4 because journaling is actually useful to me due to frequent blackouts.

----------

## Zucca

 *eccerr0r wrote:*   

> Given the similar-sounding names "eMMC" and "MMC", I suspect the former is simply an on-board ("embedded") version of the latter.

 That is correct.

----------

