# LVM RAID questions...

## Zucca

So. First time for everything - my server's btrfs array has some errors. The data is still there and I can copy it off using normal means.

I'm now in the process of removing a single 3TB disk from the array (btrfs is doing its balance to ensure my files still have two copies).

I could format a new btrfs onto that 3TB disk. Then move another disk from the original btrfs array into the new one. Then copy files over from the old array to the new. Finally, reformat and move the rest of the disks from the old array into the new. Bang. Straightforward and plain simple: that's what btrfs is; easy to use, but slow (although speed isn't an issue here).

Since LVM(2) can use mdraid functionality under the hood...

I could, on the other hand, use LVM. Add the single 3TB disk as the base of a new vg. Then remove another disk from the btrfs array and add it to the vg as well. Then create some lvs for /, swap, /home, maybe /var... Then copy files over, and finally add the rest of the disks to the vg.
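Roughly, I mean something like this (a sketch; device and vg names are just placeholders):

```
# build the new vg one freed disk at a time
pvcreate /dev/sdX            # the freed 3TB disk
vgcreate newvg /dev/sdX
# after btrfs has released another disk:
pvcreate /dev/sdY
vgextend newvg /dev/sdY
# carve out some lvs:
lvcreate -L 20G  -n root newvg
lvcreate -L 8G   -n swap newvg
lvcreate -L 200G -n home newvg
```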

At this stage, can I change the RAID layout of my lvs? For example, can I change / to RAID6, swap to RAID5, /home to RAID10... without copying files back and forth? Are these actions "online", meaning I can perform them while the lvs are in use?

I think I have more questions too, but let's start with these.

Thank you.

----------

## NeddySeagoon

Zucca,

```
man lvconvert
```

That sort of thing?
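Something along these lines (an untested sketch; vg/lv names are made up, and some conversions need an intermediate step, so check the man page for your version):

```
# convert an existing lv to raid6 in place (needs enough free space on extra PVs)
lvconvert --type raid6 --stripes 3 myvg/mylv
# the conversion runs while the lv stays online
```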

----------

## Zucca

 *NeddySeagoon wrote:*   

> 
> 
> ```
> man lvconvert
> ```
> ...

 Looking into that.

UPDATE: My entire btrfs array went read-only. This is getting increasingly difficult...   :Mad: 

----------

## Zucca

Now that I would need to reformat the entire five disk array at once, I might as well go full lvm raid.

I'm thinking of using ext4 and xfs. Or do people use other filesystems on lvm?

Opinions and experiences?

----------

## Hu

I use ext4 on LVM because ext* is traditional.  I have no complaints about it, but neither do I have any arguments against using other well known filesystems on LVM.

----------

## Zucca

Apparently ext4 has encryption support.

On the other hand, xfs has loads of performance tuning options, including striped allocation, which ensures that data allocations are aligned with the stripe unit. This increases throughput.

Do any of these well-supported, single-disk filesystems have compression support? zstd would be nice. I think f2fs does, but I'm still running spinning rust on this box.

Right now I'm leaning towards a mix of ext4 and xfs. Although sticking with a single fs (at least for the directories the system needs to boot) would ease the job of constructing the kernel and initramfs.

----------

## Zucca

I've been rtfm'ing here and there and also looked at the tables on Wikipedia (which may not be up to date).

HAMMER(2) is one reason I'd like to try out DragonFly BSD someday.

Anyways. It looks like jfs is one option too, although there seems to be a fragmentation issue. That is, there's no way to do online defragmentation of jfs on Linux, afaik.

Oh well.

I have a few days off from work, starting tomorrow. I'll decide the layout of my filesystems then.

----------

## wwdev16

I'm curious to know how things have worked out, and if you have used any encryption facilities.

I stopped using lvm because I couldn't get a server to not hang during shutdown with: md raid -> luks on whole device -> lvm pv -> lvm vg -> / as lv, due to / still being mounted ro. So I am wondering if you get a clean shutdown or reboot.

As far as my experience goes:

xfs is usually not a problem unless the kernel fs driver has a bug, like in the recent past. IIRC it might have been 5.11.

initramfs generators (e.g. dracut) become hard to use with more layers and with non-typical usage, like having the luks device key on a flash drive that is only present during boot, or having a gpt partition table on top of a luks device.

----------

## Zucca

 *wwdev16 wrote:*   

> I'm curious to know how things have worked out, and if you have used any encryption facilities.

 I've been very busy with all the other things atm.

The process so far: I have made an almost complete copy of my files onto several other hard disks. That's 3-4 TiB of data. It took a while to move things around. I have backups of the most important things (photos, custom scripts, /etc), but I thought moving the whole system would be easier, although maybe more time-consuming.

Next step is creating a vg with five (or six) pvs in it.

As for encryption: none. I might later utilize EncFS, but that's all.

I plan to use custom initramfs.

----------

## Zucca

I'm creating lvm raid6 by running:

```
lvcreate --type raid6 -L 65G -i 3 -n OS_raid6 zelan_vg
```

... and now the raid array is being created/synced. While this uses mdraid underneath, /proc/mdstat doesn't show the progress as I was expecting.

I couldn't find any lv* command either to show the progress/status. Does anyone know if it's possible to see the progress of an lvm raid sync?

----------

## Zucca

 *Zucca wrote:*   

> Does anyone know if it's possible to see the progress of lvm raid sync?

 

```
lvs
```

Now I know. :)
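To be precise, the Cpy%Sync column shows the progress. For a more focused view (field names as listed in the lvs man page):

```
# show raid sync progress for the vg
lvs -a -o name,segtype,sync_percent,devices zelan_vg
```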

----------

## pjp

Is RAID a solution worth using for your use case? I'm not a fan of it in my home systems, since they lack some of the niceties that otherwise make it useful:

 *Zucca wrote:*   

> Many times I've been doing stupid things, like pulling wrong drive out,

  And that is exactly my biggest concern whenever I'm using, or thinking about using, RAID in a home system. Or at least a system without stable hardware paths to identify a drive, and preferably the ability to turn on an LED for the drive in question (which also preferably happens automatically).

<rant>I get why those are "enterprise" features, but in 2021 it doesn't seem like they should still be prohibitively expensive in all but the lowest-end hardware. It is the more important version of color-coding mouse / keyboard ports.</rant>

----------

## Zucca

pjp,

I do have LEDs on the hotswap cage, although those are activity LEDs. One can still work out which drive is which by running: 

```
dd if=/dev/foo of=/dev/null
```

... and  checking which led is lit constantly.  :Wink: 

I also have udev rules which create symlinks for each drive based on its hardware address. I need to (yet again) double-check their state. I remember eudev and systemd-udev had their differences, and it was a little "hackier" to implement with eudev.
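For the record, the rules are along these lines (a sketch; the ata port numbers are examples and differ per board):

```
# /etc/udev/rules.d/59-disk-bays.rules (sketch)
# KERNELS matches the parent ata port, giving a stable per-slot symlink
KERNEL=="sd?", SUBSYSTEM=="block", KERNELS=="ata1", SYMLINK+="bay/1"
KERNEL=="sd?", SUBSYSTEM=="block", KERNELS=="ata2", SYMLINK+="bay/2"
```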

----------

## wwdev16

zucca, 

If you end up building a raid again, you may want to look at md raid. It doesn't need udev rules to get a stable device name and it can easily be used in an initramfs.

```
# /etc/mdadm.conf
DEVICE /dev/sd*
ARRAY /dev/md0  metadata=1.2 UUID=<raid-uuid> name=any:md0
MAILFROM mdadm@example.com
MAILADDR admin@example.com
```

This is all I have for the file server. All an initramfs needs is a copy of /etc/mdadm.conf and then to execute /sbin/mdadm --assemble --scan. IIRC dracut can do this.

----------

## Zucca

wwdev16,

The problem isn't stable device names, but rather mapping each physical drive to a drive slot.

```
# ls -lhF /dev/pool/
total 0
lrwxrwxrwx 1 root root 8 May 27 11:39 cache -> ../dm-38
lrwxrwxrwx 1 root root 8 May 27 13:22 home -> ../dm-49
lrwxrwxrwx 1 root root 8 May 27 11:36 root -> ../dm-10
lrwxrwxrwx 1 root root 8 May 27 17:36 storage -> ../dm-82
lrwxrwxrwx 1 root root 8 May 27 11:35 swap -> ../dm-21
lrwxrwxrwx 1 root root 8 May 27 11:40 var -> ../dm-32
lrwxrwxrwx 1 root root 8 May 27 17:35 vault -> ../dm-71
lrwxrwxrwx 1 root root 8 May 27 13:20 zucca -> ../dm-60
```

I do have stable device names for each lv, so that's not a problem. I could also rely on UUIDs.

I have made udev rules earlier which create symlinks to device nodes based on which SATA port the drive in question is connected to.

Of course those don't work with SATA port multipliers (Those dirty things! Who uses them?)  :Wink: 

----------

## szatox

 *Quote:*   

>  Of course those don't work with SATA port multipliers (Those dirty things! Who uses them?) 

 I guess it's those people who don't know that SAS controllers can speak to SATA drives, and do it better than SATA controllers.

I haven't done this myself (yet), but once I run out of disk slots, I'm definitely going to give this setup a shot.

It doesn't work the other way around; you can't connect a SATA controller to a SAS drive, and the plugs are keyed so there's no doubt.

----------

## pjp

Zucca,

I didn't want to derail the thread too much, but as it is at least somewhat related, what dock/cages are you using for hotswap? Some of the stuff I've looked at seemed of questionable quality.

----------

## Zucca

 *pjp wrote:*   

> what dock/cages are you using for hotswap? Some of the stuff I've looked at seemed of questionable quality.

 A Silverstone 5x3.5" SATA/SAS hotswap cage. At least it hasn't failed on me. :P Also the body of the cage is quite thick aluminium.

I've always expected Silverstone to manufacture (or stick their branding on) good products. I have one of their PSUs, which is manufactured by Super Flower, so they didn't skimp on that either.

----------

## pjp

Thanks, I was probably being cheap when I looked at theirs (well, cheap without a recommendation).

I have my drives nicely spaced, so I'm not sure I want them that close together. But I'm wondering if I could use the single-drive version internally. Nope, I'm thinking of a different case with easier side access. I guess it'll have to go on the shopping list if I ever find a suitable "NAS" case.

----------

## Zucca

I now have all the things running... although I need to tune my initramfs init script a little more.

Having moved away from btrfs-based storage, I now need some way to do snapshots, or something that'll save me from accidental "rm" actions etc...

----------

## wwdev16

Does CONFIG_DM_SNAPSHOT help at all?

----------

## szatox

Zucca, something like LVM? Tends to work pretty well.

Until your thin pool overflows and explodes, which is unrecoverable. Honestly.... Why doesn't it just freeze when running out of space?

Anyway, full snapshots are an option too. If they overflow, you only lose that one snapshot and not the whole pool. Obviously, less efficient in terms of performance and storage used, but much safer.

----------

## Zucca

Ewww. lvm snapshotting sounds half done. :E

I think I'll resort to rdiff-backup or similar which has "history" function.

----------

## pjp

Not really half-done; they're intended to be temporary. VMware snapshots are short-term use as well.

Backup: snapshot, backup from the snapshot, delete snapshot.

Upgrade: snapshot, upgrade, verify upgrade, rollback or delete snapshot.

I've only tested the rollback part once in a VM. The only problem I recall was that /boot was not part of LVM, so kernel changes would have needed to be handled outside of the rollback feature (I forget the command; it's called something else).
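The backup case, as a sketch (volume names, sizes and paths are made up):

```
# snapshot, back up from the snapshot, delete the snapshot
lvcreate -s -L 5G -n root_snap vg/root
mount -o ro /dev/vg/root_snap /mnt/snap
rsync -a /mnt/snap/ /backup/root/
umount /mnt/snap
lvremove -y vg/root_snap
```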

----------

## szatox

It's lvconvert --merge.
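As a sketch (names made up):

```
# roll the origin back to the snapshot state
lvconvert --merge vg/root_snap
# if the origin is mounted, the merge is deferred until it is next activated
```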

Anyway, yeah, I do use lvm and snapshots. Still, blowing all your data up just because you've run out of extents is something I'd classify as a critical failure. It shouldn't happen on persistent storage, regardless of how long you're going to keep that snap. 

Full snaps are OK (as long as you can afford the performance hit).

----------

## pjp

 *szatox wrote:*   

> it's lvconvert --merge.

  I saw that not long ago too, while looking for some other info. Not "intuitive" IMO.

 *szatox wrote:*   

> Anyway, yeah, I do use lvm and snapshots. Still, blowing all your data up just because you've ran out of extents is something I'd classify as a critical failure. It shouldn't happen on a persistent storage, regardless of how long you're going to keep that snap. 
> 
> Full snaps are OK (as long as you can afford the performance hit).

  Does it damage the original volume, or only the thin snapshot?

Thin provisioning is always a risk because you're gambling that you'll have the space. Maybe it's the default "I know what I'm doing" option, somewhat like rm -rf being able to wipe out everything from / down. 

 *man lvmthin wrote:*   

> thin provisioned LV is given a virtual size, and can then be much larger than physically available storage.

  Seems like a design bug. I would think "disk full" wouldn't be particularly difficult to implement. And a "reserve %" option would seem consistent with existing fs features.
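Though, skimming lvm.conf, there do seem to be knobs in the activation section that approximate a reserve, by auto-extending the pool before it fills (values below are examples):

```
# /etc/lvm/lvm.conf, activation section
thin_pool_autoextend_threshold = 80   # act when the pool is 80% full
thin_pool_autoextend_percent = 20     # grow it by 20% of its current size
```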

----------

## szatox

True, it's not very intuitive. Gotta peek into the manual every time. Not that I do that very often... Still, it's one of those cases where you either know in advance what you're looking for, or you're not likely to recognize it when you see it, so even the manual won't really get you started.

 *Quote:*   

> Does it damage the original volume, or only the thin snapshot? 

 Yes.

The original volume is part of the thin pool, so when the pool blows up, you lose your original volume too. This is different from full snapshots, which act as a copy-on-write overlay rather than an alternative table of pointers to specific chunks in a shared pile of data in various states.

In fact, when I tested it, the damage was even bigger: the whole volume group got corrupted (though I can't recall in what way exactly). I was able to recover the other volumes by manually removing that thin pool from the LVM metadata backup and then restoring it, but I couldn't find any way to repair the thin pool itself.

 *Quote:*   

> 
> 
>  *Quote:*   thin provisioned LV is given a virtual size, and can then be much larger than physically available storage.	 
> 
> Seems like a design bug. I would think "disk full" wouldn't be particularly difficult to implement.

  This itself is not a bug. The fact that it causes data loss definitely is.

Filesystems usually have a lot of gaps inside, so storing only the blocks that actually contain data makes sense. When you run a cloud or anything with a SAN, you can take advantage of that and overcommit resources to cut costs, because all those machines won't always be using the full capacity of their drives.

I used to work for a kraken (it was big enough to have its tentacles wrapped all around the globe); we were overcommitting storage by up to 30%. According to the company guidelines, it was safe.

Well... I did this some time ago, so... who knows? Maybe it got fixed in the meantime? Either way, stress-test the components you want to use BEFORE you shoot yourself in the foot.

Say, I like the idea of nilfs2. It seems to be the perfect FS for storing backups taken with rsync. Except that I managed to force it into read-only mode. Irrevocably. By filling it up to its full capacity.

----------

## Zucca

I thought of this a long time ago: having /etc on a separate partition makes sense if you want to snapshot it (for user-error recovery). But since fstab resides there, it will require a (custom) initramfs.

Although... doing rdiff-backups (for example) of /etc is much simpler to implement, not to mention less prone to breakage. But it is possible... and this is Gentoo. ;) Have it your own way.
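The rdiff-backup variant is basically a one-liner per run (paths are examples; restore syntax as in the classic rdiff-backup versions):

```
# keep /etc with a reversible history
rdiff-backup /etc /var/backups/etc-history
# restore a single file as it was three days ago:
rdiff-backup -r 3D /var/backups/etc-history/fstab /tmp/fstab.3daysago
```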

----------

## pjp

 *szatox wrote:*   

> The original volume is a part of the thin pool, so when the pool blows up, you lose your original volume too.

  OK, that's really bad. Thinking that only the snapshot would be lost, I didn't consider it to be too bad a problem, provided it was a documented outcome. But losing all of it does not seem like a good design, IMO. I've only used the standard snapshots and never looked into thin snapshots, given "normal" concerns about over-provisioning capacity.

 *szatox wrote:*   

> This is different from full snapshots which act as a copy-on-write overlay rather than an alternative table of pointers to specific chunks in a shared pile of data in various states.
> 
> In fact, when I tested it, the damage was even bigger, the whole volume group got corrupted (though I can't recall in what way exactly). I was able to recover the other volumes by manually removing that thin pool from LVM headers backup and then restoring it, but I couldn't find any way to repair the thin pool itself.

  Until thin pools are improved, I'll definitely consider them unusable. 

 *szatox wrote:*   

> This itself is not a bug. The fact that it causes data loss definitely is.
> 
> Filesystems usually have a lot of gaps inside, so storing only the blocks that actually contain data makes sense. When you run a cloud or anything with SAN, you can take advantage of that and overcommit resourses to cut costs - because all those machines won't always be using full capacity of their drives.
> 
> I used to work for a kraken (it was big enough to have it's tentacles wrapped all around the globe), we were overcommiting storage by up to 30%. According to the company guidelines, it was safe.

  Yeah, I meant the data loss was a bug. I've worked with several NAS / SAN storage solutions, so I'm familiar with overcommitting / overprovisioning storage allocations. Yes, it is "safe" from the standpoint of not LVMing your data into nothingness, but it was still a concern, especially as the NAS / SAN solutions aged and actual usage increased. It works well right up until you run out of actual storage and departments start asking questions about why they don't get to use all of the storage they "paid" to have.

----------

## pjp

 *Zucca wrote:*   

> I've thought this a long time ago, but having /etc or separate partition makes sense if you want to snapshot it (for user error recovery). But since fstab resides there, it will require (custom) initramfs.
> 
> Although... doing rdiff-backups (for example) of /etc is much more simpler to implement, not to mention not as prone to breakage. But it is possible... and this is Gentoo. ;) Have it your own way.

  fstab doesn't concern me too much; I rarely change it. However, it seems I have to mess with /etc/portage quite a bit. My long-term plan is to try using a read-only system and make changes to the build environment rather than the running system. But that dream may be dead with the New Way of Doing Things that prefers forcing use of the latest versions rather than security updates (the general trend, not specifically Gentoo).

----------

