# LVM børked; how do I rebuild? [SOLVED]

## ExecutorElassus

Okay, several problems:

one of the drives in a RAID5 array choked during some heavy writing, and I had to shut down dirty (ie, no unmounting; just had to switch off)

as a result, that array - which holds the LVM for all of my extended partitions, including /usr, /var, and all my userland storage - will not start.

`# cat /proc/mdstat`:

```
Personalities [linear] [raid0] [raid1] [raid6] [raid5] [raid4]

md127 : inactive sda4[0] sdc4[3]

              1931841384 blocks super 1.2

md1 : active raid1 sdc1[2] sda1[0] sdb1[1]

          97536 blocks [3/3] [UUU]

md3 : active raid1 sdc3[2] sda3[0] sdb3[1]
```

You will notice several problems with md127:

sdc4 has a role number ([3]) higher than the number of devices in the array, so it won't act as active. How do I change that?

I cannot fail or remove sdc4 from that array. I get errors about "device or resource busy"

also, 'md127' kinda sucks for an array name   :Confused: 

I cannot add sdb4 to md127. Doing so returns:

```
mdadm: add new device failed for /dev/sdb4 as 4: invalid argument
```

In short: how do I rebuild the array superblock for both /dev/sdb4 and /dev/sdc4, and then rebuild the array? (As a bonus, it'd be nice to rename it to something other than 'livecd:4' and give it a more sensible array number while I'm at it.) Also, I do not have access to nano, or less, on this machine. Is it possible to do all this with mdadm?

Thanks for the help.

EE

UPDATE: 'cat /proc/mdstat' now returns:

```
md127 : inactive sdb4[4](S) sda4[0] sdc4[4](S)

                       2897762076 blocks super 1.2
```

suggesting that all drives are now in the array but have the wrong role numbers. 'mdadm -E /dev/sdc4' gives

```
Array State : A..
```

as the last line, while the other two are now both A.A. Trying to start the array now fails with 'I/O Error'.

UPDATE 2 - Message to the Future

As per XKCD:

Dear people from the future, here's what we've learned:

1) If one drive in a RAID array goes down, start the array degraded, back up your data, and then proceed. Do NOT force the raid controller to assume everything is correct unless you are sure it is. I made this mistake, and then the fsck on the next startup relocated files and nodes all over the place. I still have directories re-designated as files and vice-versa.

2) glibc breaking - in my case, by using tinderbox to overwrite my glibc emerge with an older version - is one of the worst ways to break a system. DON'T DO IT. 

3) the udev-182 upgrade will render your system nonfunctional on the next boot if you don't take steps to pre-load /usr and /var if they are on separate partitions (or RAID or LVM). 

4) NeddySeagoon - the one guy who replied to my query, and walked me through everything to repair my system - is a gentleman and a scholar. Buy him a beer for me if you run into him. He also wrote up a nice guide to help with the migration to udev-182. The migration is tricky! YMMV, etc etc.

5) build mdadm, lvm[2], and busybox with the package-specific "static" USE flag, because you may need to use them before udev starts and before the libraries they would otherwise link against are mounted.

6) Always have a SystemRescueCD (or some other liveCD) on hand, and a drive you can use to boot it.

7) make sure you understand the difference between UUIDs of a block device and of a filesystem. Different parts of your startup sequence (say, your /etc/fstab entries vs your root= entry in grub.conf vs. Neddy's initrd script) require one or the other.

Last edited by ExecutorElassus on Sun May 06, 2012 11:04 pm; edited 1 time in total
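Point 5 above can be sketched as a small helper. This is a minimal sketch under assumptions: `add_static_use` is a made-up name, the default path assumes `/etc/portage/package.use` is a single file (on some systems it is a directory, in which case pass a file inside it), and the three package atoms are the usual Gentoo categories for these tools.

```shell
# Sketch: append per-package "static" USE flags to a package.use file.
# The default path is an assumption; pass another file as the first argument.
add_static_use() {
    use_file="${1:-/etc/portage/package.use}"
    for atom in sys-fs/mdadm sys-fs/lvm2 sys-apps/busybox; do
        # Skip atoms that already carry a static entry so reruns add nothing.
        if ! grep -qs "^${atom}[[:space:]].*static" "$use_file"; then
            printf '%s static\n' "$atom" >> "$use_file"
        fi
    done
}
```

After calling `add_static_use`, the three packages still have to be rebuilt (for example with `emerge --oneshot`) before the static binaries exist.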

----------

## NeddySeagoon

ExecutorElassus,

I know your pain.  I had to replace a drive in my raid5 that houses LVM for my KVMs and the bare metal hardware system.

During the resync one of the 'good' drives got kicked out of the array with hard read errors, resulting in I/O errors and interface resets. This left me with a five spindle raid5 with only 3 active drives.

Anyway enough of my woes.

Please post the full mdadm -E output for each partition in the raid set giving you issues.

You should be able to stop the partially assembled raid and force assembly with the partitions you give mdadm.  How successful that's likely to be depends on the event count on each contributing partition.

The down side is that the raid will assemble but there is no way to detect and correct damaged data caused by the raid elements being out of sync.

You only get one go at this unless you have enough space to make images of the partitions involved so you have an undo.

Right now the preferred minor number of the raid set is the least of your problems.
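Comparing event counts across members, as described above, could look roughly like this. It is a sketch rather than a tested recovery procedure: `events_of` and `rank_members` are hypothetical helper names, `/mnt/spare` is a placeholder for wherever there is room for images, and the device names in the comments are the ones from this thread.

```shell
# Print the Events counter from `mdadm -E` output read on stdin.
events_of() {
    awk -F' : ' '/^[[:space:]]*Events/ { print $2; exit }'
}

# Print "<events> <device>" for each member, highest count first (needs root).
rank_members() {
    for dev in "$@"; do
        printf '%s %s\n' "$(mdadm -E "$dev" | events_of)" "$dev"
    done | sort -rn
}

# Typical use, shown as comments so nothing runs by accident:
#   rank_members /dev/sda4 /dev/sdb4 /dev/sdc4
#   dd if=/dev/sda4 of=/mnt/spare/sda4.img bs=1M   # image each member first, for an undo
#   mdadm --stop /dev/md127
#   mdadm --assemble --force /dev/md127 /dev/sda4 /dev/sdc4
```

Members whose counts are equal or very close are the ones most likely to assemble with little damage; a member that lags far behind is the one to leave out.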

----------

## ExecutorElassus

sigh. So, since last night, I deleted the array, re-made it with --assume-clean, rebooted, dropped and re-added partitions, re-synced, rebooted again and ran fsck (which cleared a whole bunch of "illegal block"s from the inodes), and a few more repetitions of the same steps. Anyway, here are my current outputs (thank heavens I can at least ssh into the box now!)

```
# mdadm -E /dev/sda4

/dev/sda4:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x0

     Array UUID : d42e5336:b75b0144:a502f2a0:178afc11

           Name : domo-kun:carrier  (local to host domo-kun)

  Creation Time : Wed Apr 11 02:10:50 2012

     Raid Level : raid5

   Raid Devices : 3

 Avail Dev Size : 1931841384 (921.17 GiB 989.10 GB)

     Array Size : 3863681024 (1842.35 GiB 1978.20 GB)

  Used Dev Size : 1931840512 (921.17 GiB 989.10 GB)

    Data Offset : 2048 sectors

   Super Offset : 8 sectors

          State : clean

    Device UUID : f7f1d49b:a0272bc3:c46251a2:e0502319

    Update Time : Wed Apr 11 21:34:23 2012

       Checksum : 3823edf3 - correct

         Events : 16936

         Layout : left-symmetric

     Chunk Size : 512K

   Device Role : Active device 0

   Array State : AAA ('A' == active, '.' == missing)

```

```
# mdadm -E /dev/sdb4

/dev/sdb4:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x0

     Array UUID : d42e5336:b75b0144:a502f2a0:178afc11

           Name : domo-kun:carrier  (local to host domo-kun)

  Creation Time : Wed Apr 11 02:10:50 2012

     Raid Level : raid5

   Raid Devices : 3

 Avail Dev Size : 1931841384 (921.17 GiB 989.10 GB)

     Array Size : 3863681024 (1842.35 GiB 1978.20 GB)

  Used Dev Size : 1931840512 (921.17 GiB 989.10 GB)

    Data Offset : 2048 sectors

   Super Offset : 8 sectors

          State : clean

    Device UUID : b4009981:9f2fb7a3:0627bfaa:066872ba

    Update Time : Wed Apr 11 21:34:48 2012

       Checksum : a830e033 - correct

         Events : 16936

         Layout : left-symmetric

     Chunk Size : 512K

   Device Role : Active device 1

   Array State : AAA ('A' == active, '.' == missing)

```

```
 # mdadm -E /dev/sdc4

/dev/sdc4:

          Magic : a92b4efc

        Version : 1.2

    Feature Map : 0x0

     Array UUID : d42e5336:b75b0144:a502f2a0:178afc11

           Name : domo-kun:carrier  (local to host domo-kun)

  Creation Time : Wed Apr 11 02:10:50 2012

     Raid Level : raid5

   Raid Devices : 3

 Avail Dev Size : 1931841384 (921.17 GiB 989.10 GB)

     Array Size : 3863681024 (1842.35 GiB 1978.20 GB)

  Used Dev Size : 1931840512 (921.17 GiB 989.10 GB)

    Data Offset : 2048 sectors

   Super Offset : 8 sectors

          State : clean

    Device UUID : 4a8d21e3:15026b07:bfacaedc:b5160599

    Update Time : Wed Apr 11 21:35:09 2012

       Checksum : 7def73bb - correct

         Events : 16936

         Layout : left-symmetric

     Chunk Size : 512K

   Device Role : Active device 2

   Array State : AAA ('A' == active, '.' == missing)

```

```
# mdadm --detail /dev/md127

/dev/md127:

        Version : 1.2

  Creation Time : Wed Apr 11 02:10:50 2012

     Raid Level : raid5

     Array Size : 1931840512 (1842.35 GiB 1978.20 GB)

  Used Dev Size : 965920256 (921.17 GiB 989.10 GB)

   Raid Devices : 3

  Total Devices : 3

    Persistence : Superblock is persistent

    Update Time : Wed Apr 11 21:35:33 2012

          State : clean 

 Active Devices : 3

Working Devices : 3

 Failed Devices : 0

  Spare Devices : 0

         Layout : left-symmetric

     Chunk Size : 512K

           Name : domo-kun:carrier  (local to host domo-kun)

           UUID : d42e5336:b75b0144:a502f2a0:178afc11

         Events : 16936

    Number   Major   Minor   RaidDevice State

       0       8        4        0      active sync   /dev/sda4

       3       8       20        1      active sync   /dev/sdb4

       2       8       36        2      active sync   /dev/sdc4

```

For a while, sdb4 and sdc4 were both marked as "spare" and the array wasn't starting. I ran 'mdadm --zero-superblock' and added one, failed, then deleted the array and rebuilt it. It re-synced the drives after a reboot (which dropped sdb4 into "md126" as a sole member).

Ugh. It's kinda messy. Honestly, there's not much on those drives I really care about (most of it is replaceable, and I have backups), but there are a few things I'd hate to lose completely.

How hosed am I?

Thanks,

EE

----------

## NeddySeagoon

ExecutorElassus,

The array is assembled and assumed clean - the information I was after has gone.

Find the stuff you want - copy it off then remake the filesystem and restore from backups. There is no way to detect any data integrity issues any longer.

With a 3 spindle raid set, you have three ways to bring it up in degraded mode to look around. That's mostly harmless, as with a drive missing it won't resync.

Your best bet would have been the two drives with the closest `Events :` count.
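The three degraded combinations for this array can be listed mechanically. The sketch below only prints candidate commands and runs nothing against the disks; `degraded_cmd` is a made-up helper, and `--readonly` is an extra precaution added here, not something the post above asks for.

```shell
# Build (and only print) the mdadm command that starts a 3-member RAID5
# degraded from two chosen members.
degraded_cmd() {
    md="$1"; shift
    # --run starts the array even though a member is missing;
    # --readonly asks md not to write to the array or start a resync.
    printf 'mdadm --assemble --run --readonly %s' "$md"
    printf ' %s' "$@"
    printf '\n'
}

# The three two-member combinations of sda4, sdb4 and sdc4:
degraded_cmd /dev/md127 /dev/sda4 /dev/sdb4
degraded_cmd /dev/md127 /dev/sda4 /dev/sdc4
degraded_cmd /dev/md127 /dev/sdb4 /dev/sdc4
```

Only one combination can be active at a time, so the array has to be stopped with `mdadm --stop /dev/md127` before trying the next pair.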

----------

## ExecutorElassus

sigh. I had a feeling.

Okay. On that array are the following:

/usr

/usr/portage

/usr/portage/distfiles

/opt

/tmp

/var

/var/tmp

/home

and nine partitions, totalling about 2TB of storage space

What is the best way to recover from that? Am I going to have to boot into a liveCD, re-do the partition table, and re-install? I do not seem able to re-emerge anything, as emerge fails to create /var/tmp/portage/[package]/work

double sigh. Is there a short list of essential files that I need to back up to enable a recovery? /etc/make.conf, the world file, /etc/fstab, grub.conf all come to mind. Any others?

Should I just start drinking and crying now?

----------

## NeddySeagoon

ExecutorElassus,

```
/usr/portage 

/usr/portage/distfiles
```

are expendable - you can get them from the web.

/tmp is wiped at reboot, so it doesn't matter either.  Consider putting /tmp in RAM.

/usr and /opt will be rebuilt on reinstall, unless you have some manually installed packages there.

/home is all of your user files - which I hope you have backed up.

Salvage your /var/lib/portage/world file.  Salvage /etc. Then reinstall.

You cannot reuse /etc directly but you can reuse make.conf,  and peek at /etc/conf.d/ for settings

After you have untarred the stage3, reinstate  /var/lib/portage/world then do 

```
emerge -e world
```

and all your apps will be rebuilt.

If you have some way of validating your raid and fixing the damage, this can be avoided.
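The salvage step could be done in one archive. A minimal sketch, assuming GNU tar; `salvage` is a hypothetical helper name, and the destination path is whatever spare disk or network mount is available.

```shell
# Archive /etc and the world file from a damaged system's root.
salvage() {
    root="${1:-/}"   # root of the damaged system (e.g. where it is mounted from a liveCD)
    out="$2"         # destination tarball, ideally on a different disk
    # -p records permissions; -C makes the stored paths relative to $root.
    tar -cpf "$out" -C "$root" etc var/lib/portage/world
}
```

For example, `salvage / /mnt/backup/salvage.tar` on the running system, where `/mnt/backup` is a placeholder. Listing the result with `tar -tf` before wiping anything is a cheap sanity check.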

----------

## ExecutorElassus

Indeed, I seem to have been incorrect about the partition tables: they still exist. I should probably copy over /etc/mdadm.conf as well, yes?

I'm able to mount and copy over all my /home sub-partitions (and since they're mostly just music files I already backed up, or small documents, that isn't hard). 

Okay, so here's the thing: what is the process for reinstall? Just download the tarball and unpack it? Remember, I can't emerge things on the current system: I get errors about being unable to create the working directories in /var/tmp.

world, make.conf, package.use package.mask, and other similar files are all backed up. 

So, I guess my basic question is this: is there a way to re-install without booting a liveCD and chrooting in? Can I just wipe /var/tmp and reformat it to start emerging things?

thanks again for your help,

EE

----------

## NeddySeagoon

ExecutorElassus,

I would remake all of the filesystems, which means chrooting in.

I run lvm2 over raid5 and I don't have a mdadm.conf file. It can't be used anyway as I need the raid assembled to mount root, so it's explicit `mdadm -A` calls in the initrd.

Preserve it if it gives you a warm fuzzy feeling.

You say you have managed to salvage things - that may mean that there is little or no damage - there is just no way to tell without comparing against a backup.

Do you feel lucky?

If so, poke about and see what's broken and what's fixable. It's just possible that everything is ok.

What are the permissions on  /var /var/tmp and /var/tmp/portage?

Is /var mounted rw ?  

I suspect so or you would have got errors during boot.

Is /var full?
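Those three questions can be answered in one pass per directory. A rough sketch, assuming GNU `stat`; `check_dir` is a made-up helper name.

```shell
# Report mode/owner, fill level and writability for one directory.
check_dir() {
    dir="$1"
    # Octal mode, owner:group and name (GNU stat format string).
    stat -c '%a %U:%G %n' "$dir"
    # -P keeps each filesystem on one line; field 5 is the use percentage.
    df -P "$dir" | awk 'NR==2 { print "use:", $5, "mounted on", $6 }'
    if [ -w "$dir" ]; then echo "writable: yes"; else echo "writable: no"; fi
}
```

Run it as `for d in /var /var/tmp /var/tmp/portage; do check_dir "$d"; done`. On a healthy system `/var/tmp` normally shows mode 1777, and the mount options (rw or ro) for each filesystem can be read from `/proc/mounts`.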

----------

## ExecutorElassus

okay, lemme back up a bit, because maybe I'm not as badly off as I thought.

first, the "børked" of the subject was from the array freezing when I ran too many simultaneous rw-intensive operations on it (ie, shredding a big file while copying dozens over from another, while watching a large video file, and, uh, maybe a couple other things). So the drive didn't so much fail as get booted out of the array when I did a cold restart. I doubt I lost much (if any) data.

Honestly, I'm (mostly) confident that the data itself is okay (and the stuff that might not be is statistically going to be program files that I replace when I reinstall them, maybe?). So, uh, I feel lucky?

/var and /var/tmp are both mounted rw, and both about 20% full. The build.log reads:

```
* Package:    x11-base/xorg-server-1.12.0-r1

 * Repository: gentoo

 * Maintainer: x11@gentoo.org

 * USE:        amd64 elibc_glibc ipv6 kernel_linux multilib udev userland_GNU xorg

 * FEATURES:   ccache sandbox

install: invalid option -- 'm'

Try `install --help' for more information.

 * ERROR: x11-base/xorg-server-1.12.0-r1 failed (unpack phase):

 *   Failed to create dir '/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work'

 * 

 * Call stack:

 *            ebuild.sh, line 701:  Called ebuild_main 'unpack'

 *   phase-functions.sh, line 955:  Called dyn_unpack

 *   phase-functions.sh, line 243:  Called die

 * The specific snippet of code:

 *              install -m${PORTAGE_WORKDIR_MODE:-0700} -d "${WORKDIR}" || die "Failed to create dir '${WORKDIR}'"

 * 

 * If you need support, post the output of 'emerge --info =x11-base/xorg-server-1.12.0-r1',

 * the complete build log and the output of 'emerge -pqv =x11-base/xorg-server-1.12.0-r1'.

 * The complete build log is located at '/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/temp/build.log'.

 * The ebuild environment file is located at '/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/temp/environment'.

 * S: '/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0'

```

I get similar errors if I try to emerge nvidia-drivers (with the correct directory, of course). I cannot run 'startx' because I get an error about a missing /root/.serverauth.XXXXXXX file. 

So, it seems the drives themselves are maybe okay (the services all start up), just no ability to install things. How should I proceed?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Ok, so let's see if we can fix it.

Does /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/temp/build.log exist?

If so please put it on a pastebin.

Sight of your emerge --info would be good too.

If you do  

```
mount /dev/shm /var/tmp/portage
```

does emerge work?

With over 2G RAM you can emerge most things like this. It just puts the portage build space in RAM.   If you don't have a lot of RAM builds will fail because  /var/tmp/portage gets full.

However that will get you a different error.

Can you remove the content of  /var/tmp/portage/  ?

It's only needed while an emerge is in progress.

----------

## ExecutorElassus

what I posted in the previous post was the entirety of that build.log. 

emerge --info gives:

```
# emerge --info

Portage 2.1.10.51 (default/linux/amd64/10.0, gcc-4.5.3, glibc-2.14.1-r2, 3.3.0-gentoo x86_64)

=================================================================

System uname: Linux-3.3.0-gentoo-x86_64-AMD_Phenom-tm-_9950_Quad-Core_Processor-with-gentoo-2.1

Timestamp of tree: Thu, 05 Apr 2012 20:45:01 +0000

ccache version 3.1.7 [enabled]

app-shells/bash:          4.2_p24

dev-java/java-config:     2.1.11-r3

dev-lang/python:          2.7.2-r3, 3.1.4-r3, 3.2.2-r1

dev-util/ccache:          3.1.7

dev-util/cmake:           2.8.7-r5

dev-util/pkgconfig:       0.26

sys-apps/baselayout:      2.1

sys-apps/openrc:          0.9.9.3

sys-apps/sandbox:         2.5

sys-devel/autoconf:       2.13, 2.68

sys-devel/automake:       1.9.6-r3, 1.10.3, 1.11.3

sys-devel/binutils:       2.22-r1

sys-devel/gcc:            4.4.6-r1, 4.5.3-r2

sys-devel/gcc-config:     1.6

sys-devel/libtool:        2.4.2

sys-devel/make:           3.82-r3

sys-kernel/linux-headers: 3.3 (virtual/os-headers)

sys-libs/glibc:           2.14.1-r2

Repositories: gentoo pd-overlay x-portage

ACCEPT_KEYWORDS="amd64 ~amd64"

ACCEPT_LICENSE="* -@EULA dlj-1.1 Mendeley-EULA"

CBUILD="x86_64-pc-linux-gnu"

CFLAGS="-march=athlon64 -O2 -pipe"

CHOST="x86_64-pc-linux-gnu"

CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt /var/lib/hsqldb"

CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"

CXXFLAGS="-march=athlon64 -O2 -pipe"

DISTDIR="/usr/portage/distfiles"

EMERGE_DEFAULT_OPTS="--autounmask=n"

FEATURES="assume-digests binpkg-logs ccache distlocks ebuild-locks fixlafiles news parallel-fetch protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"

FFLAGS=""

GENTOO_MIRRORS="ftp://gentoo.lagis.at/ http://gentoo.mirror.dkm.cz/pub/gentoo/ http://ftp.fi.muni.cz/pub/linux/gentoo/ http://gentoo.mirror.web4u.cz/ ftp://gentoo.mirror.web4u.cz/ ftp://ftp.klid.dk/gentoo/ http://mirror.uni-c.dk/pub/gentoo/ ftp://ftp.spline.inf.fu-berlin.de/mirrors/gentoo/ http://mirror.netcologne.de/gentoo/ ftp://ftp.wh2.tu-dresden.de/pub/mirrors/gentoo ftp://ftp.join.uni-muenster.de/pub/linux/distributions/gentoo http://gentoo.mneisen.org/ http://de-mirror.org/distro/gentoo/ ftp://ftp.uni-erlangen.de/pub/mirrors/gentoo ftp://ftp.tu-clausthal.de/pub/linux/gentoo/ http://ftp.spline.inf.fu-berlin.de/mirrors/gentoo/ ftp://ftp-stud.hs-esslingen.de/pub/Mirrors/gentoo/ ftp://de-mirror.org/distro/gentoo/ http://linux.rz.ruhr-uni-bochum.de/download/gentoo-mirror/ http://ftp6.uni-erlangen.de/pub/mirrors/gentoo ftp://linux.rz.ruhr-uni-bochum.de/gentoo-mirror/ ftp://mirror.netcologne.de/gentoo/ ftp://ftp6.uni-erlangen.de/pub/mirrors/gentoo ftp://ftp6.uni-muenster.de/pub/linux/distributions/gentoo ftp://sunsite.informatik.rwth-aachen.de/pub/Linux/gentoo http://ftp.uni-erlangen.de/pub/mirrors/gentoo http://ftp-stud.hs-esslingen.de/pub/Mirrors/gentoo/ ftp://ftp.ipv6.uni-muenster.de/pub/linux/distributions/gentoo ftp://gentoo.inf.elte.hu/ http://gentoo.inf.elte.hu/ http://ftp.heanet.ie/pub/gentoo/ ftp://ftp.heanet.ie/pub/gentoo/ ftp://ftp.df.lth.se/pub/gentoo/ http://mirror.switch.ch/ftp/mirror/gentoo/ ftp://mirror.switch.ch/mirror/gentoo/ http://gentoo.kiev.ua/ftp/"

LANG="en_US.utf8"

LDFLAGS="-Wl,-O1 -Wl,--as-needed"

LINGUAS="en en_US.utf8 de de_DE.utf8"

MAKEOPTS="-j5"

PKGDIR="/usr/portage/packages"

PORTAGE_CONFIGROOT="/"

PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"

PORTAGE_TMPDIR="/var/tmp"

PORTDIR="/usr/portage"

PORTDIR_OVERLAY="/var/lib/layman/pd-overlay /usr/local/portage"

SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage"

USE="X Xaw3d a52 aac acl acpi aim alsa amd64 apm audiofile bash-completion berkdb bzip2 cairo cddb cdinstall cdparanoia cdr clamav cli consolekit cracklib crypt css cups curl curlwrappers cxx dbus directfb dri dvd dvdr dvdread encode ffmpeg fftw firefox flac fortran ftp gdbm geoip gif gimp glut gpm graphite gstreamer gtk hddtemp iconv icq ieee1394 imagemagick imap imlib ipv6 jack java java6 javascript jikes joystick jpeg kde kde4 lame latex ldap libsamplerate libwww lm_sensors mad matroska mmx modules motif mp3 mpeg mplayer mudflap multilib ncurses nls nptl nptlonly nsplugin offensive ogg openal opengl openmp openssl oscar pam pcre pdf perl png policykit posix pppd python qt3support qt4 quicktime raw readline rss scanner session sndfile sockets speex spell sse sse2 ssl suid svg symlink sysfs syslog tcl tcpd tetex theora threads tidy tiff tk translucency truetype udev unicode usb videos vorbis wmf wxwindows x264 xcomposite xetex xine xml xorg xpm xscreensaver xulrunner xv xvid zlib" ALSA_CARDS="hda-intel usb-audio" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf 
superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en en_US.utf8 de de_DE.utf8" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"

Unset:  CPPFLAGS, CTARGET, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON

```

(sorry for the wall-o).

```
#mount /dev/shm /var/tmp/portage

mount: /dev/shm is not a block device
```

I emptied out /var/tmp/portage. I haven't re-emerged anything yet (I'm waiting for the backup copy operation to finish using up all my rw cycles).

Thanks again,

EE

----------

## NeddySeagoon

ExecutorElassus,

I'll post some more tomorrow night - about 19 hours from now.  Meanwhile it's late in Scotland.

----------

## ExecutorElassus

I was gonna say. I'm at UTC+2, so it's bedtime for me as well.

Thanks a ton for the help. You're my new imaginary internet boyfriend or whatever.

----------

## NeddySeagoon

ExecutorElassus,

I admit to being male ... I'm also rumored to be the oldest Gentoo dev.

Your log should have read ...

```
 * Package:    x11-base/xorg-server-1.12.0-r1

 * Repository: gentoo

 * Maintainer: x11@gentoo.org

 * USE:        amd64 elibc_glibc ipv6 kernel_linux nptl udev userland_GNU xorg

 * FEATURES:   preserve-libs sandbox userpriv

>>> Unpacking source...

>>> Unpacking xorg-server-1.12.0.tar.bz2 to /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work

>>> Source unpacked in /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work

>>> Preparing source in /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0 ...

 * Applying xorg-server-1.12-disable-acpi.patch ..
```


It should have gone on with the unpacking source ...

then it's supposed to prepare and build before it gets to install.  Let's assume your portage tree is broken as that's the easiest to fix.

```
emerge --sync
```

 will fix that.

I think that's a long shot, as everything is protected by several digests.  If something was wrong with your tree, you would get digest errors before the ebuild was even used.

Do you get the same error with every package?

```
install: invalid option -- 'm' 

Try `install --help' for more information. 

 * ERROR: <package> failed (unpack phase): 
```

If so, it's time to introduce you to the tinderbox.  Either your portage or python is in a mess.

The tinderbox contains individual binary packages that can be used one at a time for fixing installs that cannot be fixed any other way.

Think of them as stage3 tarballs that contain a single package.  That link is to the default/linux/amd64 branch, which from your emerge --info, is right for you.

There are two ways to use these packages. The portage-friendly way is to put them into  /usr/portage/packages/<category>/<package>/tinderbox_file  so that you can do

```
emerge -K =<category>/<package>-<ver>
```

This installs properly without leaving any shrapnel around your box, but it does need a working portage.

If portage is broken this won't work. The emergency fallback is to fetch the file from the tinderbox and put it into /  (your root).  Now untar it there,

```
tar xpf tarball_name
```

The p there is important.

Ignore the warning about extra garbage at end of file ignored.

If tar is broken use 

```
busybox tar ....
```

but you will need the tarball of tar from the tinderbox too.

Whichever method you use, as soon as portage is working, rebuild the packages you fetched from the tinderbox so they are built with your USE flags and your CFLAGS.

sys-apps/portage-2.1.10.51 is a good place to start as that's your installed portage version.  If your exact version is not there, choose the nearest version.
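The emergency fallback amounts to a single tar call. A minimal sketch, assuming GNU tar (which detects the compression on its own when reading); `unpack_binpkg` is a hypothetical name, and the optional second argument exists only so the extraction can be rehearsed in a scratch directory before touching `/`.

```shell
# Unpack a binary package tarball over a root directory, keeping permissions.
unpack_binpkg() {
    pkg="$1"
    root="${2:-/}"
    # -p restores the permission bits recorded in the archive rather than
    # filtering them through the current umask.
    tar -xpf "$pkg" -C "$root"
}
```

A dry run such as `unpack_binpkg tarball_name /tmp/scratch` shows what would land where; once that looks sane, the same call without the second argument unpacks into `/`.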

```
 mount: /dev/shm is not a block device
```

was unexpected.  Maybe you don't have Shared Memory Filesystem in your kernel. Anyway that's a side issue for now.
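A possible explanation that does not involve the kernel configuration: when `mount` is given two arguments it treats the first as the source device, and `/dev/shm` is a directory, not a block device. A tmpfs can be requested by type instead. The sketch below assumes tmpfs support is present in the kernel; `portage_tmpfs` is a made-up name and the 2G default is a placeholder to tune to the installed RAM.

```shell
# Mount a RAM-backed tmpfs on Portage's build directory (needs root).
portage_tmpfs() {
    size="${1:-2G}"
    # -t tmpfs names the filesystem type; the "tmpfs" source is only a label.
    mount -t tmpfs -o "size=${size}" tmpfs /var/tmp/portage
}
```

`umount /var/tmp/portage` puts things back. If this also fails with an unknown filesystem type, the missing-kernel-support theory becomes more likely.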

----------

## ExecutorElassus

```
# emerge --sync

>>> Starting rsync with rsync://212.24.172.37/gentoo-portage...

>>> Checking server timestamp ...

Unknown option: recursive

Unknown option: links

Unknown option: safe-links

Unknown option: perms

Unknown option: times

Unknown option: compress

Unknown option: force

Unknown option: whole-file

Unknown option: delete

Unknown option: stats

Unknown option: human-readable

Unknown option: timeout

Unknown option: exclude

Unknown option: exclude

Unknown option: exclude

Unknown option: verbose

Type shasum -h for help

 * Rsync has reported that there is a syntax error. Please ensure

 * that your SYNC statement is proper.

 * SYNC=rsync://rsync.europe.gentoo.org/gentoo-portage

```

hrm. whoops. Would it be valid to attempt to download and unpack a portage tree snapshot first? Or should I assume python is broken, and go with tinderbox?

I'm going to assume /usr is probably hosed at this point, but I could be wrong. I can use ssh and nfs, thank heavens, so I can copy things as needed from my working laptop.

Thanks again for the help.

EE
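The `Type shasum -h for help` line in the output above suggests that whatever now answers to the name `rsync` is not rsync, and the earlier `install: invalid option -- 'm'` error may have the same kind of cause. One way to test that theory is to compare each tool's version banner with its name. This is only a sketch: `check_tool` is a hypothetical helper, and it assumes the tool accepts `--version`, which the GNU tools and rsync do.

```shell
# Report whether a command's --version banner mentions the expected name.
# A mismatch hints that the file on disk is not the program its name claims.
check_tool() {
    tool="$1"
    expect="${2:-$1}"
    path=$(command -v "$tool") || { echo "$tool: not found"; return 1; }
    banner=$("$path" --version 2>&1 | head -n 1)
    case "$banner" in
        *"$expect"*) echo "$tool: ok ($banner)" ;;
        *)           echo "$tool: SUSPECT ($path says: $banner)"; return 1 ;;
    esac
}
```

For example, `check_tool rsync` and `check_tool install coreutils`. If app-portage/portage-utils is installed and still intact, `qcheck` can compare a package's installed files against the checksums Portage recorded, without needing python.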

----------

## ExecutorElassus

also, 'mdadm --detail /dev/md127' now shows this at the bottom:

```
# mdadm --detail /dev/md127

/dev/md127:

        Version : 1.2

  Creation Time : Wed Apr 11 02:10:50 2012

     Raid Level : raid5

     Array Size : 1931840512 (1842.35 GiB 1978.20 GB)

  Used Dev Size : 965920256 (921.17 GiB 989.10 GB)

   Raid Devices : 3

  Total Devices : 3

    Persistence : Superblock is persistent

    Update Time : Thu Apr 12 22:17:25 2012

          State : clean, degraded 

 Active Devices : 2

Working Devices : 2

 Failed Devices : 1

  Spare Devices : 0

         Layout : left-symmetric

     Chunk Size : 512K

           Name : domo-kun:carrier  (local to host domo-kun)

           UUID : d42e5336:b75b0144:a502f2a0:178afc11

         Events : 19457

    Number   Major   Minor   RaidDevice State

       0       8        4        0      active sync   /dev/sda4

       1       0        0        1      removed

       2       8       36        2      active sync   /dev/sdc4

       3       8       20        -      faulty spare   /dev/sdb4

```

but smartctl reports no errors for the drive. but then this:

```
# mdadm --manage /dev/md127 --remove /dev/sdb4

mdadm: hot removed /dev/sdb4 from /dev/md127

domo-kun ~ # mdadm --manage /dev/md127 --add /dev/sdb4

mdadm: /dev/sdb4 reports being an active member for /dev/md127, but a --re-add fails.

mdadm: not performing --add as that would convert /dev/sdb4 in to a spare.

mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdb4" first.

domo-kun ~ # mdadm --zero-superblock /dev/sdb4

mdadm: Unrecognised md component device - /dev/sdb4

```

Any idea why /dev/sdb4 keeps getting set faulty and dropped out of the array? PS, here's the result of smartctl:

```
# smartctl --all /dev/sdb

smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.3.0-gentoo] (local build)

Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===

Model Family:     Seagate Barracuda 7200.12

Device Model:     ST31000528AS

Serial Number:    5VP6R8T8

LU WWN Device Id: 5 000c50 030a9c7e9

Firmware Version: CC44

User Capacity:    1.000.204.886.016 bytes [1,00 TB]

Sector Size:      512 bytes logical/physical

Device is:        In smartctl database [for details use: -P show]

ATA Version is:   8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Thu Apr 12 22:23:06 2012 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status:  (0x82)   Offline data collection activity

               was completed without error.

               Auto Offline Data Collection: Enabled.

Self-test execution status:      (  40)   The self-test routine was interrupted

               by the host with a hard or soft reset.

Total time to complete Offline 

data collection:       (  600) seconds.

Offline data collection

capabilities:           (0x7b) SMART execute Offline immediate.

               Auto Offline data collection on/off support.

               Suspend Offline collection upon new

               command.

               Offline surface scan supported.

               Self-test supported.

               Conveyance Self-test supported.

               Selective Self-test supported.

SMART capabilities:            (0x0003)   Saves SMART data before entering

               power-saving mode.

               Supports SMART auto save timer.

Error logging capability:        (0x01)   Error logging supported.

               General Purpose Logging supported.

Short self-test routine 

recommended polling time:     (   1) minutes.

Extended self-test routine

recommended polling time:     ( 196) minutes.

Conveyance self-test routine

recommended polling time:     (   2) minutes.

SCT capabilities:           (0x103f)   SCT Status supported.

               SCT Error Recovery Control supported.

               SCT Feature Control supported.

               SCT Data Table supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       155751471

  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0

  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       14

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       91093547

  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8105

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0

 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       14

183 Runtime_Bad_Block       0x0000   001   001   000    Old_age   Offline      -       1683

184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0

188 Command_Timeout         0x0032   001   001   000    Old_age   Always       -       1632

189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0

190 Airflow_Temperature_Cel 0x0022   065   055   045    Old_age   Always       -       35 (Min/Max 33/38)

194 Temperature_Celsius     0x0022   035   045   000    Old_age   Always       -       35 (0 13 0 0 0)

195 Hardware_ECC_Recovered  0x001a   025   023   000    Old_age   Always       -       155751471

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1627

240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       59047210393549

241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2949500869

242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2131520475

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Interrupted (host reset)      80%      8079         -

SMART Selective self-test log data structure revision number 1

 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

```

enjoy?

Thanks,

EE

PPS- it seems sdb3 has also been failed out of its array. Faulty drive, perhaps? Or should I check the cables?

----------

## NeddySeagoon

ExecutorElassus,

I guess you have a bad sector on /dev/sdb4 that causes controller resets and eventually, mdadm gives up on it.

dmesg may show something useful. Anything like

```
[415840.462727] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0

[415840.462734] ata1.00: irq_stat 0x40000008

[415840.462746] ata1.00: cmd 60/d8:08:00:91:0c/03:00:be:00:00/40 tag 1 ncq 503808 in

[415840.462748]          res 41/40:00:f0:92:0c/00:00:be:00:00/40 Emask 0x409 (media error) <F>

[415840.472641] ata1.00: configured for UDMA/133

[415840.472688] ata1: EH complete
```

is a very bad sign.  The drive that was doing that didn't show any smart errors either.

```
[417885.092354] sd 0:0:0:0: [sda] Unhandled sense code

[417885.092358] sd 0:0:0:0: [sda]  Result: hostbyte=0x00 driverbyte=0x08

[417885.092363] sd 0:0:0:0: [sda]  Sense Key : 0x3 [current] [descriptor]

[417885.092369] Descriptor sense data with sense descriptors (in hex):

[417885.092373]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 

[417885.092384]         bf 46 92 30 

[417885.092390] sd 0:0:0:0: [sda]  ASC=0x11 ASCQ=0x4

[417885.092394] sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 bf 46 92 30 00 00 f8 00

[417885.092406] end_request: I/O error, dev sda, sector 3209073200

[417885.092412] md/raid:md2: read error NOT corrected!! (sector 3193072176 on sda3).

[417885.092418] md/raid:md2: Disk failure on sda3, disabling device.
```

That was game over.
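
For picking lines like these out of a long log, a small filter helps. This is only a sketch: `ata_errors` is a made-up helper name, and the patterns are just strings that appear in the log excerpts in this thread, not an exhaustive list of what libata or md can print.

```shell
# Sketch: keep only kernel log lines that look like the errors quoted in this thread.
# Reads log text on stdin, so it works on live dmesg output or on a saved file.
ata_errors() {
    grep -E 'media error|ATA bus error|I/O error|read error NOT corrected|Disk failure'
}

# Typical use (may need root to read the kernel log):
#   dmesg | ata_errors
```

An empty result does not prove the drive is healthy; it only means none of these particular strings were logged since boot.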

emerge --sync does little more than call rsync.  Your emerge --sync output suggests that make.conf, your profile or rsync itself is broken.

Untarring a portage snapshot is worth a try ... but I tend to agree that /usr is damaged. The snapshot would fix the profile.

Does tar work ?

----------

## ExecutorElassus

This is just the last part of dmesg:

```
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: configured for UDMA/33

ata2: EH complete

ata2.00: exception Emask 0x50 SAct 0x1 SErr 0x680900 action 0x6 frozen

ata2.00: irq_stat 0x08000000, interface fatal error

ata2: SError: { UnrecovData HostInt 10B8B BadCRC Handshk }

ata2.00: failed command: READ FPDMA QUEUED

ata2.00: cmd 60/08:00:c8:6c:70/00:00:74:00:00/40 tag 0 ncq 4096 in

         res 40/00:00:c8:6c:70/00:00:74:00:00/40 Emask 0x50 (ATA bus error)

ata2.00: status: { DRDY }

ata2: hard resetting link

ata2: softreset failed (device not ready)

ata2: applying PMP SRST workaround and retrying

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: configured for UDMA/33

ata2: EH complete

ata2.00: exception Emask 0x50 SAct 0x1 SErr 0x680900 action 0x6 frozen

ata2.00: irq_stat 0x08000000, interface fatal error

ata2: SError: { UnrecovData HostInt 10B8B BadCRC Handshk }

ata2.00: failed command: READ FPDMA QUEUED

ata2.00: cmd 60/08:00:c8:6c:70/00:00:74:00:00/40 tag 0 ncq 4096 in

         res 40/00:00:c8:6c:70/00:00:74:00:00/40 Emask 0x50 (ATA bus error)

ata2.00: status: { DRDY }

ata2: hard resetting link

ata2: softreset failed (device not ready)

ata2: applying PMP SRST workaround and retrying

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: configured for UDMA/33

ata2: EH complete

ata2.00: exception Emask 0x50 SAct 0x1 SErr 0x680900 action 0x6 frozen

ata2.00: irq_stat 0x08000000, interface fatal error

ata2: SError: { UnrecovData HostInt 10B8B BadCRC Handshk }

ata2.00: failed command: READ FPDMA QUEUED

ata2.00: cmd 60/08:00:50:d6:4a/00:00:01:00:00/40 tag 0 ncq 4096 in

         res 40/00:00:50:d6:4a/00:00:01:00:00/40 Emask 0x50 (ATA bus error)

ata2.00: status: { DRDY }

ata2: hard resetting link

ata2: softreset failed (device not ready)

ata2: applying PMP SRST workaround and retrying

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)

ata2.00: revalidation failed (errno=-5)

ata2: hard resetting link

ata2: softreset failed (device not ready)

ata2: applying PMP SRST workaround and retrying

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: configured for UDMA/33

ata2: EH complete

ata2.00: exception Emask 0x50 SAct 0x1 SErr 0x680900 action 0x6 frozen

ata2.00: irq_stat 0x08000000, interface fatal error

ata2: SError: { UnrecovData HostInt 10B8B BadCRC Handshk }

ata2.00: failed command: READ FPDMA QUEUED

ata2.00: cmd 60/08:00:50:d6:4a/00:00:01:00:00/40 tag 0 ncq 4096 in

         res 40/00:00:50:d6:4a/00:00:01:00:00/40 Emask 0x50 (ATA bus error)

ata2.00: status: { DRDY }

ata2: hard resetting link

ata2: softreset failed (device not ready)

ata2: applying PMP SRST workaround and retrying

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: configured for UDMA/33

ata2: EH complete

mdadm: sending ioctl 800c0910 to a partition!

mdadm: sending ioctl 800c0910 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

ata2.00: exception Emask 0x50 SAct 0x0 SErr 0x680900 action 0x6 frozen

ata2.00: irq_stat 0x08000000, interface fatal error

ata2: SError: { UnrecovData HostInt 10B8B BadCRC Handshk }

ata2.00: failed command: SMART

ata2.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in

         res 50/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x50 (ATA bus error)

ata2.00: status: { DRDY }

ata2: hard resetting link

ata2: softreset failed (device not ready)

ata2: applying PMP SRST workaround and retrying

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: SB600 AHCI: limiting to 255 sectors per cmd

ata2.00: configured for UDMA/33

ata2: EH complete

scsi_verify_blk_ioctl: 16 callbacks suppressed

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

mdadm: sending ioctl 1261 to a partition!

```

 So, should I try jiggling the cables, or just suck it up and take the drive in for repair?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

The first step is to check the drive warranty status on the vendor's website.

You fill in the part number and serial number, which smartctl provides, and the website will tell you if you are covered by warranty or not.

```
ata2.00: irq_stat 0x08000000, interface fatal error 

ata2: SError: { UnrecovData HostInt 10B8B BadCRC Handshk } 
```

together with the lack of nasty messages from smartctl, suggests it may be an interface error.

That means the cable at either end, some of the electronics on the drive or some of the electronics on the motherboard.

It's worth trying a replacement cable.  Don't move the cables round, that will just cause another drive to drop out if the cable is faulty.

If a replacement cable doesn't help, try another SATA connector on the motherboard with the new cable.

After that, it's the drive vendor's test software. Be careful with that - some tests will destroy your data.

You don't take a drive in for repair. Dead drives can only be repaired by the vendor.  You either get a warranty replacement or you buy a new drive.  

The

```
 ata2.00: configured for UDMA/33 
```

is also a bad sign.  The system is running the interface very slowly (it should be UDMA/133) in an attempt to get valid data across it.

This too may indicate an interface issue.
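
The SMART dump posted earlier points the same way: attribute 199 (UDMA_CRC_Error_Count) has a raw value of 1627, and that counter records CRC errors on transfers between the drive and the controller rather than problems on the platters. A rough sketch for pulling that one number out of `smartctl -A` output follows; `crc_errors` is a made-up helper name and `/dev/sdb` is only an assumption about which drive is the suspect.

```shell
# Sketch: print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count).
# Expects the attribute table printed by `smartctl -A` on stdin; the raw
# value is the tenth whitespace-separated column of the matching row.
crc_errors() {
    awk '$1 == 199 { print $10 }'
}

# Typical use, as root (device name is an assumption):
#   smartctl -A /dev/sdb | crc_errors
```

If the number keeps climbing after the cable has been replaced, the port or the drive's own electronics become the more likely culprits.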

----------

## ExecutorElassus

Hi Neddy,

all right, I'll swap some cables out, and let you know what happens. Adventure!

The drive has another two years of warranty on it, so I'm covered if it's faulty. 

If I plug the drive into a different SATA socket, with a different cable, will the OS give it a different identifier? Or will sda and sdc retain those labels?

Thanks,

EE

PS- holy crap, I haven't bought a drive in a year, and everything looks like it costs double or more. Were the floods in Thailand really that bad?

----------

## NeddySeagoon

ExecutorElassus,

It's warranty - you only pay return postage.

The OS may rearrange all your drives if you move one around. mdadm and the raid set won't mind, the raid superblock on each drive tells what goes where in the raid set, so it will be good.

If you use UUIDs or filesystem labels in /etc/fstab, that much will just work.

Hmm, you mentioned a mdadm.conf.  If that is actually used, references to sdX may change.

Grub may get confused as BIOS discovery order may change. Any partitions you have identified as /dev/sd ... may change drives.

If root is not on raid, your root=/dev/sd... may change drive too.

----------

## ExecutorElassus

Hi Neddy,

well, I shouldn't even have to pay postage: I bought it locally. Now that I know I have the S/N, I can be sure I'm yanking the correct drive (I may not have last time). 

/etc/mdadm.conf is completely commented out, so I'm guessing it's not being used. 

So, step one is to swap the cable and the slot on the offending drive. If it still has issues, then take it in for replacement tomorrow. 

Thanks again for the help. I'll report back tomorrow (unless there's something else I should know about beforehand?)

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Only change one thing at a time.

First the cable, as it's easy and less trouble.  If it's still faulty, swap the SATA port on the motherboard for an unused one.

That may give you boot issues.

If you change two (or more) things at a time, you will never know what fixed it.

----------

## ExecutorElassus

sigh. oops. Too late!

rebooted once. sdb4 and sdc4 (the latter of which still incorrectly identified itself as a member of a fully active array, so that's the wonky one) both got dumped as spares into md126. I deleted that array, added sdb4 back into md127 (as well as deleting md125, which contained an orphaned sdc3 from md3), rebooted, and now only sdc4 was not a member of any array. I added it back to md127, and now it's re-syncing. I have … 500 minutes to go, so just under nine hours (is that a long time to re-sync a raid5 array of about 1.8TB?).

I'll see what's up tomorrow morning, when it's done with the re-sync, and report back. Hopefully it's just the cable, but I wouldn't sweat replacing the drive too much, either.

Stay tuned.

Cheers,

EE

UPDATE: So, recovery finished, and now I'm back at 'install -m' being unable to create directories (not just for xorg-server), and rsync being unable to sync (because it also doesn't understand options). So, uh … tinderbox?

UPDATE 2: It's been sitting stable now for a good six hours, with nothing in dmesg about failed writes or I/O errors, so I'm cautiously optimistic. However, every operation I attempt - emerge --sync, emerge [any package] - spits back errors about unknown options. Any guesses what would cause that? Or, more usefully, what are the best steps to repair that? Is there a tarball I can unpack to cover over my portage toolchain, so I can at least start re-emerging programs? Also, I should mention that, as I have /usr on a separate partition, I cannot use >udev-182 until I can figure out how to premount /usr and /var. Since a lot of programs now seem to depend on newer udev, I can't run 'emerge -uD world' automatically. I'm following the forum thread about it, so hopefully that'll be sorted soon.

Let's talk more about getting emerge to work.

----------

## NeddySeagoon

ExecutorElassus,

Resync speeds are not deterministic.  You are supposed to be able to use the raid while it resyncs; the more you use it, the slower the sync goes.

That the resync completed properly is a good sign.  It read two drives and either verified or rewrote the data on the third drive.

Do you know if the writing happened on the suspect drive or was that drive read?

That's key.  If the suspect drive was written, any write fails will have caused sector remapping.

So  

```
 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0 
```

would have changed.

```
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0 
```

shows the number of sectors the drive is thinking about remapping.

If the suspect drive was read, then you know there were no read errors, in which case I would conclude your hardware problem is fixed but you don't know if it was the cable or the SATA port.

Throw the cable out anyway.  It's not worth experimenting with further.

If you want to do a read test you can dd the entire content of the drive to /dev/null and keep an eye on dmesg for errors.
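
A minimal sketch of that read test, wrapped in a function so the device name is passed in explicitly. `read_test` is a made-up name and `/dev/sdX` is a placeholder; `bs=1M` is just a reasonably large block size, not a tuned value.

```shell
# Sketch: read every block of a device (or file) and throw the data away.
# An unreadable sector should make dd stop with an I/O error, and the
# kernel should log the details where dmesg can show them.
read_test() {
    dd if="$1" of=/dev/null bs=1M
}

# Typical use, as root, with the suspect drive's real name substituted:
#   read_test /dev/sdX
#   dmesg | tail
```

This only reads, so it is safe for the data, but on a drive of this size it will take hours.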

I have the same issue with udev and I've run into another nasty.  openrc-0.9.9.3 and a 1.7.x udev don't play nicely together. My root works as it is kernel autoassembled, but lvm isn't started until udev has tried to mount everything, so it's just me and the root filesystem.  I have another system with root in lvm on top of raid5.  The raid is assembled by an initrd, which also starts the lvm, since root is in there.  Still udev doesn't play nicely with openrc-0.9.9.3.  Nothing gets mounted.

emerge needs lots of things to work. portage, python, rsync to fetch files, assorted decompression utilities to unpack them, the gcc toolchain ...

Try replacing your portage with one from the tinderbox and see what happens.  After you have used the tinderbox for 5 or 6 packages, it's time to give up and reinstall.

----------

## ExecutorElassus

Hi Neddy,

both lines 5 and 197 still show 0 for the raw value, so I'll assume there were no write errors (I added the suspect drive to the array last, so it was written). (Incidentally, resync sped up considerably when I stopped boinc).
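
For anyone repeating this check, here is a rough sketch that prints just those two attributes from `smartctl -A` output. `remap_counts` is a made-up helper name and `/dev/sdb` is an assumption about the suspect drive.

```shell
# Sketch: print the name and raw value of SMART attributes 5 and 197
# (Reallocated_Sector_Ct and Current_Pending_Sector).
# Expects the attribute table printed by `smartctl -A` on stdin.
remap_counts() {
    awk '$1 == 5 || $1 == 197 { print $2, $10 }'
}

# Typical use, as root (device name is an assumption):
#   smartctl -A /dev/sdb | remap_counts
```

Running it before and after a resync makes it easy to see whether either raw value has moved.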

Okay, so I'll try tinderboxing (?) portage, then python, and see what that does. So, just unpack them from root?

If that fails, is "reinstall" something less drastic than "chroot in from a liveCD and start over from scratch"?

Cheers,

EE

UPDATE: for some reason, trying to reply boots me to the main index, so I'll reply here. Uh, progress! On a lark, I guessed that 'install' - as part of coreutils - might be broken. I used the tinderbox version, and now I've emerged portage to its latest version. I'll try to sync and see what happens.

Last edited by ExecutorElassus on Fri Apr 13, 2012 7:23 pm; edited 1 time in total

----------

## NeddySeagoon

ExecutorElassus,

Less haste.  Get the right portage for you and unpack it to the root of your filesystem.

I posted the details earlier.  You must use the p option to tar or it still won't work, as it will be unpacked with -x in the permissions.  That means it won't eXecute, not even for root.
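
A minimal sketch of that unpack step, with the permission-preserving option spelled out. `unpack_pkg` is a made-up name and the package path is a placeholder. The GNU tar manual lists `-p` as the default for the superuser, so passing it explicitly is mostly belt and braces, but it costs nothing.

```shell
# Sketch: unpack a binary package tarball over a target root, keeping the
# permission bits stored in the archive (-p) rather than filtering them
# through the current umask.
unpack_pkg() {
    tar -xpf "$1" -C "$2"
}

# Typical use on the running system, as root (path is a placeholder):
#   unpack_pkg /path/to/package.tbz2 /
```

Afterwards, `ls -l` on one of the unpacked binaries shows whether the execute bits survived.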

Then test - see what has changed if anything.

----------

## ExecutorElassus

okay. I managed to emerge portage successfully, and then tried to re-emerge coreutils. That failed on a broken /usr/include/mntent.h, so now I'm emerging glibc. After that I'll try coreutils again.

It seems the re-syncing (or rather, several iterations of it, along with bad journal/fsck management on my part - "sure, just auto-fix everything!") has left some files corrupted. But if I can run emerge, and then rebuild the toolchain, I can start getting things put back together. 

I'll keep you posted.

Thanks again,

EE

UPDATE: okay, glibc won't install due to a broken /usr/include/mntent.h, which belongs to linux-headers. I can't emerge linux-headers due to a broken file belonging to glibc. Using tinderbox files for both of those results in the following error:

```
# emerge -p glibc

/usr/bin/python2.7: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /usr/lib64/libpython2.7.so.1.0)

domo-kun / # emerge -p linux-hearders

/usr/bin/python2.7: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /usr/lib64/libpython2.7.so.1.0)

```

So, what's my next step? I'm assuming reinstall. Can I do that without wiping everything? I have a system rescue CD I could use to re-install stuff.

----------

## NeddySeagoon

ExecutorElassus

Can you do

```
cd /usr/portage

scripts/bootstrap.sh 
```

That's a stage1 from a stage1 install.  It builds your toolchain.  Do not interrupt it - it must be run at one sitting as it's not resumable.

----------

## ExecutorElassus

Hrm: Apparently not:

```
# scripts/bootstrap.sh 

/bin/bash: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /bin/bash)

/bin/bash: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /lib64/libreadline.so.6)

/bin/bash: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /lib64/libncurses.so.5)

```

unless there's a tinderbox version of glibc-2.14 somewhere?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

What version of glibc did you have? 

and what version do you have now?

glibc must not be downgraded.  I guess you used to have glibc-2.14 and have a lower version now?

----------

## ExecutorElassus

I had glibc-2.14.1-r2, but the tinderbox version was glibc-2.13-r4 (and is thus my current version).

Sigh. So, if glibc can't be downgraded, what's my next step?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

I have sys-libs/glibc-2.14.1-r2 so I can post a tarball.  It will be similar to what you would get from the tinderbox except it's optimised for an AMD Phenom II 1090.

I will have it for other 64 bit AMD arches too, like an E350 and whatever is in the HP Microserver.

If you prefer, I can tell you how to make your own packages. You will need 10G or so of space for this.

----------

## ExecutorElassus

Hi Neddy,

I'm going to go with the tarball option, because 1) I'm not sure I have 10GB free space to build without moving things around, and 2) I'm not certain of my toolchain's integrity. 

would you mind posting your version? I might very well have a similar CPU, but right now, I'm mainly just worried about getting to a point where I can build my own toolchain (which apparently will need working python, linux-headers, coreutils, glibc, gcc, and probably rebuilding the kernel for good measure).

Thanks for the help.

EE

----------

## NeddySeagoon

ExecutorElassus,

Here's my glibc-2.14.1-r2.

You don't use your toolchain to make your own packages.

Long story short ... make a ext2 fs in a file ... about 10G

Loopback mount the file on /mnt/gentoo.  put a stage3 and portage snapshot in there

chroot into the new install in a file.  Set FEATURES to include buildpkg, emerge --sync, emerge whatever you need.

From outside the chroot in a file copy the packages you want out of /mnt/gentoo/usr/portage/packages/...

Install them as anything else you fetch from the tinderbox.

rm the install in a file when you are done.
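
The first couple of steps above can be sketched roughly as follows. The function names and the image path are made up for illustration, and the chroot part is left as comments because it is the normal handbook procedure rather than anything specific to this thread.

```shell
# Sketch of the "install in a file" setup. IMG is a hypothetical location;
# pick somewhere with about 10G free.
IMG=/var/tmp/buildroot.img

# Create a zero-filled file of the given size in MiB.
make_image() {
    dd if=/dev/zero of="$1" bs=1M count="$2"
}

# Put an ext2 filesystem in the file and loop-mount it (needs root).
# -F stops mke2fs from asking about the target not being a block device.
format_and_mount() {
    mkfs.ext2 -F "$1" && mount -o loop "$1" "$2"
}

# Typical use, as root:
#   make_image "$IMG" 10240
#   format_and_mount "$IMG" /mnt/gentoo
#   ... unpack a stage3 and a portage snapshot into /mnt/gentoo, chroot in,
#   ... add buildpkg to FEATURES in make.conf, emerge what is needed
#   umount /mnt/gentoo && rm "$IMG"
```

Because the build happens inside a fresh stage3, it does not depend on the damaged toolchain of the host system.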

----------

## ExecutorElassus

So, now:

```
# tar xpf glibc-2.14.1-r2.tbz2 

tar: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by tar)

```

busybox also fails with the same error. 

So, LiveCD?

* sadpanda *

----------

## NeddySeagoon

ExecutorElassus,

Yes. liveCD.  This is why you build busybox with the static USE flag. So you still have something when glibc gets trashed.

You can no longer chroot into your install as bash won't run.

It's cd /mnt/gentoo tar ...

----------

## ExecutorElassus

Okay, just to be clear: since I have partitions for /usr, /var, etc, I should mount them before I start untarring things, yes? Am I going to have to recreate all the device nodes and VGs first? How close to "from scratch" do I have to get?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

That's correct - you want all the component parts of the tarball to go into the right places, so your filesystem needs to be assembled.

You will be safer with the command

```
tar -xpf /path/to/tarball -C /mnt/gentoo
```

too as it does not depend on using the Present Working Dir.

----------

## ExecutorElassus

Hi Neddy,

okay, so, I do this:

boot the LiveCD (in my case, SystemRescueCD 2.4.1), set up networking so I can ssh over from my laptop, and then … will the VGs already be mountable? Will I need to recreate all the device nodes and LVs? Or can I simply mount things that the CD will auto-detect?

ugh. I hate this. I'm sorry for all the trouble, and am really thankful you're walking me through this. I'll start the boot up with the liveCD, and get back to you.

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

SystemRescueCd will start your raid sets and activate your logical volumes.  You should just need to do the mounts.

----------

## ExecutorElassus

Okay, I got all the RAID sets mounted okay. Now, this:

```
% tar -xpf /mnt/gentoo/glibc-2.14.1-r2.tbz2 -C /mnt/gentoo 

tar: This does not look like a tar archive

bzip2: Compressed file ends unexpectedly;

   perhaps it is corrupted?  *Possible* reason follows.

bzip2: Inappropriate ioctl for device

   Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.

You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover

data from undamaged sections of corrupted files.

tar: Child returned status 2

tar: Error is not recoverable: exiting now

```

Why does your tarball hate my happiness?

Is this something I haven't properly configured?

EDIT: whoops. I kinda copied the file over before it was completely finished downloading. Okay, I untarred glibc, and it didn't seem to puke (except for the "unexpected EOF" you said I can safely ignore).

So, do the same with the rest of the toolchain? I have the following from tinderbox:

```
% ls *.tbz2

coreutils-8.7.tbz2  glibc-2.14.1-r2.tbz2       portage-2.1.10.41.tbz2

glibc-2.13-r4.tbz2  linux-headers-2.6.39.tbz2  python-3.1.4-r3.tbz2
```

Missing are linux-headers-3.3, and a more recent portage. If glibc is functional, can I reboot and go back to emerging things? Or should I tinderbox more of the toolchain first?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Nope.  You can either attempt the chroot, or reboot to test.  Success with either means fixing glibc worked, since almost nothing works without glibc.

Then you test one file at a time and only replace what you need.

At this time of night, I would reboot or chroot, then run the bootstrap.sh script and see what happens.

Can you leave it building while you sleep?

"Unexpected EOF"???   "Extra Garbage at End Ignored" is safe to ignore.

----------

## ExecutorElassus

I can chroot into the system. I'll try running the bootstrap.sh, and report back tomorrow.

Thanks again for all the help. People like you are why gentoo is awesome.

Cheers,

EE

UPDATE: trying to run the bootstrap script from chroot results in 

```
# scripts/bootstrap.sh 

realpath: no command specified

Try `realpath --help' for more information.

 * Error:  '' does not exist.  Exiting.

```

So I tried rebooting. It's stuck sitting on usb and firewire discovery, so I'm not really sure what it's up to. If I can't get it to boot, I'll let you know. The last lines I see at startup are:

```
usb 5-1: new low-speed USB device number 2 using ohci_hcd

input: Logitech USB trackball as /devices/pci0000:00/0000:00:13.4/usb5/5-1/5-1:1.0/input/input4

generic-usb 0003:046D:C408.0003: input: USB HID v1.10 Mouse [Logitech USB Trackball] ib usb-0000:00:13.4-1/input0

firewire_core: giving up on config rom for node id ffc0
```

Any guess what's going on? Or should I just keep going with a reinstall? SystemRescueCD seems to get me into a state where I can run portage fairly quickly. Maybe I'll just do that. Or...?

UPDATE 2: Now I see what it was waiting on.  Now I have a kernel panic: /dev/md3 is not recognized, and it's trying to find a boot sector on fd0, etc. Seems like the liveCD renamed my RAID arrays again. 

From a booted system I get the same error about realpath as previously. So, should I just reinstall from the liveCD?

UPDATE 3: Well, although the bootstrap script won't work, I can emerge things in my toolchain. I'm working on coreutils now, after linux-headers and bash went in successfully. Should I just keep going with glibc, gentoolkit, gcc, etc, and manually emerge stuff until I have a working system?

AFTER-HOURS UPDATE: So, now I'm hanging on emerging xorg-server:

```
Making all in dix

make[1]: Entering directory `/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0_build/dix'

make  all-am

make[2]: Entering directory `/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0_build/dix'

  CC     atom.lo

  CC     colormap.lo

  CC     cursor.lo

  CC     devices.lo

  CC     dispatch.lo

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/atom.c: In function 'MakeAtom':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/atom.c:134:7: warning: cast discards qualifiers from pointer target type

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/atom.c: In function 'FreeAtom':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/atom.c:186:2: warning: cast discards qualifiers from pointer target type

  CC     dixfonts.lo

In file included from /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/xkbsrv.h:55:0,

                 from /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:66:

/usr/include/X11/extensions/XKBproto.h:491:1: error: expected identifier or '(' before '}' token

In file included from /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:73:0:

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/dixevents.h:84:52: warning: redundant redeclaration of 'PostSyntheticMotion'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/input.h:525:13: note: previous declaration of 'PostSyntheticMotion' was here

In file included from /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:83:0:

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/Xi/exglobals.h:62:12: warning: redundant redeclaration of 'DeviceKeyPress'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/xkbsrv.h:309:51: note: previous declaration of 'DeviceKeyPress' was here

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/Xi/exglobals.h:63:12: warning: redundant redeclaration of 'DeviceKeyRelease'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/xkbsrv.h:309:66: note: previous declaration of 'DeviceKeyRelease' was here

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/Xi/exglobals.h:64:12: warning: redundant redeclaration of 'DeviceButtonPress'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/xkbsrv.h:310:51: note: previous declaration of 'DeviceButtonPress' was here

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/Xi/exglobals.h:65:12: warning: redundant redeclaration of 'DeviceButtonRelease'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/xkbsrv.h:310:69: note: previous declaration of 'DeviceButtonRelease' was here

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/Xi/exglobals.h:66:12: warning: redundant redeclaration of 'DeviceMotionNotify'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/xkbsrv.h:309:83: note: previous declaration of 'DeviceMotionNotify' was here

In file included from /var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:87:0:

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/enterleave.h:42:13: warning: redundant redeclaration of 'DoFocusEvents'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/dix.h:451:13: note: previous declaration of 'DoFocusEvents' was here

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/enterleave.h:87:13: warning: redundant redeclaration of 'DeviceFocusEvent'

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/include/exevents.h:184:1: note: previous declaration of 'DeviceFocusEvent' was here

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'SendDevicePresenceEvent':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:324:43: warning: declaration of 'type' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'FreeDeviceClass':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:732:21: warning: declaration of 'type' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'FreeFeedbackClass':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:798:23: warning: declaration of 'type' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'BadDeviceMap':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:1637:30: warning: declaration of 'length' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'GetMaster':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:2610:33: warning: declaration of 'which' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'AllocDevicePair':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:2653:18: warning: declaration of 'pointer' shadows a global declaration

make[2]: *** [devices.lo] Error 1

make[2]: *** Waiting for unfinished jobs....

```

etc etc. I'm assuming a header file or dependency is corrupt. Any guess which?

----------

## ExecutorElassus

Hi Neddy,

okay, I guess we're now past the "how do I get the drives working again?" stage, and on to "how do I remerge everything?" stage. Today's error I can't figure out is from e2fsprogs:

```
make[2]: Entering directory `/var/tmp/portage/sys-fs/e2fsprogs-1.42.1/work/e2fsprogs-1.42.1/debugfs'

        MK_CMDS debug_cmds.c

        CC util.c

        CC debugfs.c

        CC ncheck.c

        CC icheck.c

make[2]: execvp: mk_cmds: Permission denied

make[2]: *** [debug_cmds.c] Error 127

make[2]: *** Waiting for unfinished jobs....

make[2]: Leaving directory `/var/tmp/portage/sys-fs/e2fsprogs-1.42.1/work/e2fsprogs-1.42.1/debugfs'

make[1]: *** [all-progs-recursive] Error 1

make[1]: Leaving directory `/var/tmp/portage/sys-fs/e2fsprogs-1.42.1/work/e2fsprogs-1.42.1'

make: *** [all] Error 2

emake failed

 * ERROR: sys-fs/e2fsprogs-1.42.1 failed (compile phase):

```

so, execvp is trying to write something, and failing on permissions. But what?

I've managed to get perl, gcc, glibc, glib, cairo, python, portage, and a good chunk of @system to emerge okay. Now I'm stuck on this.

Any guess what it might be?

Cheers,

Andrew

UPDATE: okay, never mind that. I figured out the offending package (e2fsprogs-libs), remerged it, and then resumed emerging @system. Assuming that works out, what next? I can't do a full @world emerge, because I have udev-182 blocked (and more things are starting to depend on it). So, kernel, then xorg, nvidia, and so forth? And what's going on with the qt packages? I got a whole pile of blocked packages when I tried to emerge qt. Is there a way to track down which program is forcing an old qt to be installed?

----------

## ExecutorElassus

Okay, maybe I still have drive problems.

I just rebooted, and - yet again - one of the members of md127 was put into its own array, and md127 itself was set inactive, with both of its member drives marked as spares. What's going on with that? I don't find any errors with dmesg, or with 'mdadm -E /dev/sdX4', so I'm not really sure why mdadm keeps dropping the drives out. Can you give any advice?

Thanks,

EE

----------

## NeddySeagoon

ExecutorElassus,

Look at dmesg and the event count on each member of your raid set.

If your raid set assembled in degraded mode (only n-1 drives), it would not run unless you forced it to run.

You would remember doing 

```
mdadm --run /dev/md...
```

If a drive dropped out later, it would be in dmesg.

On a healthy raid the event count is identical on all members.  If you can find n-1 drives with an identical event count, it's probably OK to assemble the raid with only those drives, then run it manually.
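A minimal sketch of that comparison, assuming the RAID5 members are /dev/sda4, /dev/sdb4 and /dev/sdc4 as elsewhere in this thread. The helper names `event_count` and `show_event_counts` are made up for illustration; `mdadm --examine` is the only real command involved.

```shell
# Extract the "Events" value from `mdadm --examine` output read on stdin.
event_count() {
    awk -F: '/^[ \t]*Events[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Print "device event-count" for each member given as an argument (run as root).
show_event_counts() {
    for dev in "$@"; do
        printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" | event_count)"
    done
}

# Example: show_event_counts /dev/sda4 /dev/sdb4 /dev/sdc4
```

Members that print the same number are the ones that agree with each other.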

If you still have hardware issues there is no point in doing any more rebuilding of software. It will just break again.  You can take the suspect drive out of the array and run it in degraded mode for a while.

It's probably worth trying to read the drive to /dev/null and watching dmesg for errors

```
dd if=/dev/sdX of=/dev/null bs=4096000
```

The large bs= (about 4 MB) speeds up the process. It will take several hours.

read

```
 man dd
```

 to see how to get a progress report from dd
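As a rough sketch of the read test and the progress report, assuming GNU dd (which prints its statistics to stderr when sent SIGUSR1). The function name `read_test` is hypothetical, and /dev/sdc is only an example device.

```shell
# Read a whole device (or file) and throw the data away.
# dd exits non-zero if a read fails, so the return status is the verdict.
read_test() {
    dd if="$1" of=/dev/null bs=4M
}

# Example (run as root; takes hours on a large drive): read_test /dev/sdc
# From another terminal, ask a running GNU dd for a progress report:
#   kill -USR1 "$(pidof dd)"
```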

edit: looking back, fixing a broken glibc is one of the hardest gentoo fixes.  That's behind you now.

----------

## ExecutorElassus

Hi Neddy,

the two drives that were in an inactive array - and marked as spares - had the same event count. The one that got dropped out had six fewer.

So, I'll try running dd on the array, once it's built in about six hours. Is it possible that the wonky role numbers for the drives (sda4[0] sdc4[3] sdb4[2], whereas the other two arrays are respectively [0] [1] [2]) are causing mdadm to assume that a drive in between is missing, and that sda4 (which was not in the array at startup) did not belong (as sdc4 had a role number of [3], already beyond the drive count)? Is there any way to fix that on the fly?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Put 

```
>=sys-fs/udev-180

>=sys-fs/udev-init-scripts-10

>=sys-auth/consolekit-0.4.5-r3

>=sys-apps/openrc-0.9.9.3

=sys-apps/net-tools-1.60_p20120127084908

```

 into /etc/portage/package.mask to keep udev at bay meanwhile.  You may find you need to add other things too.

----------

## ExecutorElassus

Hi Neddy,

I'll add in those packages to package.mask once I can boot up with RAID working (right now, I don't even have access to nano, much less the files I need to edit). The only thing I can see at the end of dmesg is "mdadm: sending ioctl 1261 to a partition!" which another forum told me was a kernel error I can ignore. 

If there's nothing in dmesg, can you think of any reason my RAID would be dropping drives out of the array? It's a different drive from the one that was suspect last time; I'd find it really hard to believe that two drives out of three failed. 

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

I cannot think of any reason for a drive to drop out of a raid set and not leave a sign in your log.

If the system shut down correctly and something happened to make the next assemble fail, you might miss it as it may not be possible to log, particularly if the log was on the raid that did not start.

dmesg keeps a ring buffer in RAM, so the dmesg command works for the content of the buffer, even if the log location is not mounted.  Of course, you lose that on power down.
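Since the buffer is lost on power down, it can help to save the md-related lines somewhere that is still mounted. A small sketch, assuming / is writable; the function name `md_lines` and the output path are illustrative only.

```shell
# Keep only the md/mdadm lines from dmesg output read on stdin.
md_lines() {
    grep -E '(^|[^[:alnum:]])(mdadm|md/raid[0-9]*|md[0-9]*):'
}

# Example: dmesg | md_lines > /root/dmesg-md.txt
```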

I've not been around much today - my system has been mostly failing to boot while I did the udev-182 upgrade.

It only boots now if I skip the fsck in the initrd.

----------

## ExecutorElassus

Hi Neddy,

well, then you can tell me how to get udev-182 to play nice with my RAIDed /usr once I have it working.   :Wink: 

For the assembly problems, would it be useful to make use of the mdadm.conf file? Like, actually specify the arrays manually? Right now, it seems to be building them based on their own superblocks (or whatever else mdadm uses when there's no config file), and maybe stating explicitly which partitions go into which array might make things work better.

In any case, once the recovery is done, I'll run dd as per your instructions to check for read errors. If dmesg says nothing, I'll try rebooting later tonight (tomorrow morning) and report back.

Grr... my failures are always the ones that make no sense. 

Thanks again for the help, and good luck with udev. Have you tried using the earlymount script posted here?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

raid is OK, I'm having problems with separate /usr and /var which are in lvm2 on raid5.

I'm going with the wiki.gentoo.org page but with additions for raid and lvm.

The raid bits work fine - the lvm ones don't ... not yet.

It looks like I don't get any /dev nodes for the logical volumes.

Do you use raid autodetect in the kernel?

If so, to use an mdadm.conf file you will need to either move to an initrd or not have root on raid, since root on raid needs the raid assembled before root is mounted.
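For reference, a minimal sketch of what an explicit mdadm.conf could look like. The UUIDs are placeholders, not values from this thread, and the array numbers are only examples; `mdadm --examine --scan` prints ARRAY lines for the arrays it finds, which is the usual starting point.

```
# /etc/mdadm.conf (sketch; the UUIDs below are placeholders)
DEVICE partitions
ARRAY /dev/md1 UUID=00000000:00000000:00000000:00000000
ARRAY /dev/md3 UUID=11111111:11111111:11111111:11111111
ARRAY /dev/md4 metadata=1.2 UUID=22222222:22222222:22222222:22222222
```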

----------

## ExecutorElassus

Hi Neddy,

I … think I have an initrd? But I do for certain have / on a RAID1 (mirrored, yes?) array. So is /boot, for that matter. 

That forum seems to have a script to run pre-mount and checking on RAID/LVM at the sysinit level. I don't know for certain if that would work for my setup (which is the one specified in the gentoo RAID/LVM2 quick-install handbook: /boot on RAID1, x swap partitions, / on RAID1, and then a big RAID5 for everything else [in my case, /usr, /opt, /var, /var/tmp, /usr/portage, /usr/portage/distfiles, and /home all get an lv, along with 1.7TB of storage partitions]). Since the kernel source directories are under /usr/src/linux, which is on the big RAID5, is that going to cause problems with the kernel loading (with <udev-182)?

The earlymount script seemed to work okay, but then my drives shut down, and I rebooted into a broken system. 

Since I have a couple hours to go before I can test the md127 that's rebuilding, lemme plug you with questions. Is it possible to change the role numbers of an active array? I'm still bothered by sdc4 being [3] instead of [1] like all the other sdcX partitions are in their respective arrays, and wonder if that might be causing problems.

So far, all dmesg says is the mdadm message I posted before, and 

```
scsi_verify_blk_ioctl: 16 callbacks suppressed
```

.

Any advice?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

The role numbers don't matter. They do not need to map to the partitions in any particular order.

What happens in /usr/src has no impact on booting.  The kernel you boot is a binary file, normally in /boot

That you have /boot on raid1 tells me that your /boot raid has a version 0.9 superblock, so it could be kernel auto assembled.   Grub is not interested in that since raid assembly, however it's done, happens after grub has done its stuff and exited.

You may have an initrd to assemble the rest and start your logical volumes.  You may not, too. Look in grub.conf.  Do you have an initrd entry under your kernel line?

----------

## ExecutorElassus

Hi Neddy,

nope, no initrd line. So, I guess that means no initrd. 

Both /boot and / are indeed on 0.9 superblocks, as instructed by the install guide. The RAID5 is 1.2. 

What I've heard about initrd or initramfs is that both of them add to boot time, and are thus undesirable (or so I gathered from that thread about the earlymounts script). But I'll worry about that part of setup once I've sorted why my RAID5 keeps barfing. 

90 minutes to go from this message. I imagine you'll be off to bed by then. Should I then proceed with 'dd', and then try rebooting?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

Yes. The dd is harmless.  As you have no initrd, you must use kernel auto assembly, so you can't use an mdadm.conf.

For you, your initrd would only assemble your raid and start your lvm, which has to happen anyway.

It would not be a huge bloated full of kernel modules initrd that you rebuild with every kernel. It would just be a space for some userspace tools. It need not increase the boot time.

----------

## ExecutorElassus

Hrm,

and as I understand it, this is what's causing problems with udev-182, yes? So, an initrd would mount the MDs, and start the RAID/LVM, earlier in the boot process, yes?

Maybe I should just suck it up and use one. Is there documentation on it?

But first things first. I'll see if I can get this RAID5 array to survive a reboot, and then proceed with the rest.

I'll report back as soon as dd is finished (unless there are other things I should do in the meantime?)

Thanks again,

EE

----------

## ExecutorElassus

Okay, how about this:

dd finished with no messages in dmesg. So I reboot. As before, sda4 gets moved into md125 (a bogus array), and that array fails to start. BUT. I reboot, and on the next bootup, everything is in its correct (active array), partitions are mounted, and fsck checks all the partitions.

This has been happening for some time, now that I think of it: the first reboot always results in a disabled RAID5, but rebooting results in a functional one.

Any idea why?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

That sounds like a race condition.  Maybe one drive is taking longer than the others to come ready, so the kernel gives up waiting for it.

If it were spin up time related, at the reboot, the drives would already be spun up ... so it would just work.

Spin up time is a wearout parameter reported in smartctl.

----------

## ExecutorElassus

Hi Neddy,

but I'm not booting from a full shutdown. I'm using 'init 6' both times. This morning, after dd finished, I hit 'init 6'. The first boot, I got one drive dropped out of the array. The next boot, it's back in place without my doing anything.

smartctl shows 0 for "Spin Up Time" for all three drives. sdc (the suspect one) does show:

```
183 Runtime_Bad_Block       0x0000   001   001   000    Old_age   Offline      -       1683

188 Command_Timeout         0x0032   100   001   000    Old_age   Always       -       1632

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1627

```

where the other two show zero. Is that anything?

On the other side, I cannot emerge xorg-server due to these errors:

```
/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'FreeDeviceClass':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:732:21: warning: declaration of 'type' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'FreeFeedbackClass':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:798:23: warning: declaration of 'type' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'BadDeviceMap':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:1637:30: warning: declaration of 'length' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'GetMaster':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:2610:33: warning: declaration of 'which' shadows a global declaration

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c: In function 'AllocDevicePair':

/var/tmp/portage/x11-base/xorg-server-1.12.0-r1/work/xorg-server-1.12.0/dix/devices.c:2653:18: warning: declaration of 'pointer' shadows a global declaration

make[2]: *** [devices.lo] Error 1

make[2]: *** Waiting for unfinished jobs....

```

This is just a snippet. Any guess what that is?

Let me know what you think.

Cheers,

EE

PS- now that I can emerge world again, some problems I had (like qt not emerging due to a block) aren't a problem any more. But xorg-server still won't emerge, due to this error. I'll let you know if any other of the 214 packages I'm emerging conks out.

----------

## NeddySeagoon

ExecutorElassus,

It's difficult for me working through a keyhole ... make friends with wgetpaste and post the whole build log.

The failure message at the end of every failed build tells you where the log is.

Warnings are just that - warnings - and that's all that's in your log snippet.
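To pick the real failures out of a long log before pasting it, something like the following sketch may help. The function name `build_errors` is made up, and the pattern is only a guess at the usual gcc and make error formats, so it can miss errors printed in other styles.

```shell
# Print (with line numbers) the lines of a build log that look like real errors,
# skipping plain compiler warnings.
build_errors() {
    grep -n -E '(^|[^[:alpha:]])error:|\*\*\* \[' "$1"
}

# Example: build_errors /var/tmp/portage/<category>/<package>/temp/build.log
```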

You need to look on your drive vendor's web site to understand what the raw values mean.  They are often several bit fields in a 32 bit value.

The normalised values would be a pass but

```
  183 Runtime_Bad_Block       0x0000   001   001   000    Old_age   Offline
```

is only updated with an offline test, so thats not really telling us anything.
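A small sketch for pulling the normalised and raw numbers of one attribute out of `smartctl -A`, assuming the column layout shown in the snippet above. The function name `smart_attr` is hypothetical.

```shell
# From `smartctl -A` output on stdin, print "VALUE WORST THRESH RAW" for one attribute name.
smart_attr() {
    awk -v name="$1" '$2 == name { print $4, $5, $6, $10 }'
}

# Example: smartctl -A /dev/sdc | smart_attr UDMA_CRC_Error_Count
# An offline test, which refreshes the "Offline" attributes, can be started with:
#   smartctl -t offline /dev/sdc
```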

----------

## ExecutorElassus

pastebin is my new imaginary internet boyfriend (sorry): build.log

I'll do some digging with the drive vendor, and let you know in a sec which ebuild failed this time.

Cheers,

EE

UPDATE: looking at that log, the only lines which return explicit errors are from a file that belongs to kbproto. I've remerged that, and I'll get back to you about xorg-server once the @world update is done.

----------

## ExecutorElassus

I got a little further in building X. The new build.log (which still fails) is here.

I'm also failing my @world emerge on pango. The log for that is here.

Any ideas what's going wrong with those?

Cheers,

EE

UPDATE: pango is fixed. It was libXft, not Xutil, that was choking.

----------

## ExecutorElassus

Okay, I've gotten a bit further. Now I'm on kdepimlibs, which fails on the following:

```
[  0%] Built target kcal_automoc

Generating contactsearchjob.moc

Generating transactionjobs.moc

Scanning dependencies of target kimap_automoc

Scanning dependencies of target kio_sieve_automoc

Generating session.moc

Generating messagethreaderproxymodel.moc

Generating deletejob.moc

Generating preprocessorbase_p.moc

[  0%] Built target kio_sieve_automoc

Generating standardmailactionmanager.moc

Scanning dependencies of target kio_imap4_automoc

/var/tmp/portage/kde-base/kdepimlibs-4.8.2/work/kdepimlibs-4.8.2/akonadi/contact/contactsearchjob.h:81: Error: Template classes not supported by Q_OBJECT

automoc4: process for /var/tmp/portage/kde-base/kdepimlibs-4.8.2/work/kdepimlibs-4.8.2_build/akonadi/contact/contactsearchjob.moc failed: Unknown error

pid to wait for: 0

```

This is preventing akonadi from emerging as well. Also, redlands seems to be broken due to a missing tab space in the makefile (though I managed to build nepomuk headers anyway).

Anyway, do you know what's up with kdepimlibs? Is that something wrong with the ebuild?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

I've never built KDE, so I'm not a lot of help here.  Try searching on bugs.gentoo.org to see if its a known issue.

If so, there may be a fix for it there too.

----------

## ExecutorElassus

In fact, KDE bugs aren't even handled by the gentoo tracker: they have to be filed upstream. I did so, and we'll see what happens.

I have, however, managed to get xorg back up, so I'm back into my wm (hurrah!). I now have just four packages that won't update for various reasons, and then I imagine I'll be finding stray misnamed files for months. 

But at least the system is up, and (sorta) stable, so now let's get back to the original issue: mdadm seems to be randomly dropping one of the drives out of the RAID array on bootup, with no errors in the log from shutdown (and, since logging doesn't work on the non-RAID system, nothing but dmesg to tell me what might have gone wrong at boot [but I can't scroll through that, so it isn't helpful]). 

So, you're suggesting it's just spool-up times? Is there anything else that might cause it?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

```
dmesg | less
```

or pastebin your dmesg

----------

## ExecutorElassus

when the array is inactive, I do not have access to less. Can I pipe it to a regular file? This presumes I can reboot into a working array, since nfs and ssh are also inaccessible without the RAID5. *sadpanda*

ALSO: I just noticed, that 'ld' will, in the middle of some compiles - in this case firefox - chew up close to 40% of my RAM (about 1.6GB). Is that normal? Or is this an unfortunate consequence of my file system being riddled with files identified as directories, and actual files emerged into backups, etc. etc.?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

You should be a little more adventurous.  Try

```
 dmesg > dmesg.txt
```

When your raid drops a drive, you can still start it. 

```
mdadm --run /dev/mdX
```

This will run the raid in degraded mode. It will probably be at the expense of a resync when you re-add the dropped drive.
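A sketch of spotting which array needs this, using the /proc/mdstat layout quoted at the top of the thread. The function name `inactive_arrays` is made up; the device names in the comments are examples, not a prescription.

```shell
# List arrays that /proc/mdstat (read on stdin) reports as inactive.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Example: inactive_arrays < /proc/mdstat
# Then, for an array it printed:
#   mdadm --run /dev/md127             # start it degraded
#   mdadm /dev/md127 --add /dev/sda4   # put the dropped member back; expect a resync
#   cat /proc/mdstat                   # watch the rebuild
```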

Build your busybox with the static USE flag.  That has a pager and it lives in /bin

busybox --help will tell you about it, or man busybox.

You will be building busybox, mdadm and lvm with the static use flag for your initrd, which you need to get past udev-182.

Learn about /etc/portage/package.use for per package USE flags.  Do not set static in make.conf.
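As a sketch of the per-package route, assuming /etc/portage/package.use is a plain file (on some installs it is a directory, in which case pass a file inside it). The function name `add_static_use` is hypothetical.

```shell
# Append "static" USE entries for the initrd tools to a package.use file,
# skipping any package that already has a line there.
add_static_use() {
    use_file=${1:-/etc/portage/package.use}
    for pkg in sys-apps/busybox sys-fs/mdadm sys-fs/lvm2; do
        grep -q "^$pkg[[:space:]]" "$use_file" 2>/dev/null || echo "$pkg static" >> "$use_file"
    done
}

# Example: add_static_use && emerge --oneshot busybox mdadm lvm2
```

A package that already has a line without `static` is left alone, so check the file by hand afterwards.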

----------

## ExecutorElassus

Okay. 

For the time being, I'm only going to do that if 1) the array boots inactive and separated, and 2) won't boot back normal the next (one or more) reboot. But thank you for the tips. Okay. Now that I have a system I can (mostly) work with, let's talk about that initrd. So, first step is to build mdadm, lvm, and busybox with +static? I'm fine with using package.use. What next?

Cheers,

EE

----------

## NeddySeagoon

I used a combination of this wiki post, which covers the (relatively) straightforward case of root on lvm on raid, and this wiki thread covering the new bit for mounting the separate partitions.

To actually build the initramfs, I used the kernel provided script.

The second wiki article is written for a 32 bit install.  That caused me some grief as I'm 64bit no-multilib.

I like to build my initrd by hand in /root/initrd/ with everything I need there.  That way I know it won't change with a

```
 emerge --sync && emerge world -uDN
```

so I'm not a big fan of the kernel provided script that sucks random files off your live filesystem.  It's a real PITA when you make a useless initrd.  I still have the work to do to make my initrd independent of my live filesystem but my box boots cleanly now, so it's non urgent. I really don't care if my initrd is full of security holes - it runs once at boot before networking is up, so it can't be exploited.  I do get paranoid when it doesn't work.

My  /root/initrd/initramfs_list contains 

```
# directory structure

dir /proc       755 0 0

dir /usr        755 0 0

dir /bin        755 0 0

dir /sys        755 0 0

dir /var        755 0 0

#dir /lib        755 0 0

dir /lib64      755 0 0

dir /sbin       755 0 0

dir /mnt        755 0 0

dir /mnt/root   755 0 0

dir /etc        755 0 0

dir /root       700 0 0

dir /dev        755 0 0

# busybox

file /bin/busybox /bin/busybox  755 0 0

# for lvm on raid

file /sbin/mdadm                /sbin/mdadm              755 0 0 

file /sbin/lvm.static           /sbin/lvm.static         755 0 0 

# libraries required by /sbin/fsck.ext4 and /sbin/fsck

slink   /lib                            /lib64                          777 0 0

file    /lib64/ld-linux-x86-64.so.2     /lib64/ld-linux-x86-64.so.2     755 0 0

file    /lib64/libext2fs.so.2           /lib64/libext2fs.so.2           755 0 0

file    /lib64/libcom_err.so.2          /lib64/libcom_err.so.2          755 0 0

file    /lib64/libpthread.so.0          /lib64/libpthread.so.0          755 0 0

file    /lib64/libblkid.so.1            /lib64/libblkid.so.1            755 0 0

file    /lib64/libuuid.so.1             /lib64/libuuid.so.1             755 0 0

file    /lib64/libe2p.so.2              /lib64/libe2p.so.2              755 0 0

file    /lib64/libc.so.6                /lib64/libc.so.6                755 0 0

file    /sbin/fsck              /sbin/fsck                      755 0 0

file    /sbin/fsck.ext4         /sbin/fsck.ext4                 755 0 0

# our init script

file    /init                   /root/initrd/init               755 0 0
```
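Because a list like this pulls files from the live filesystem, a quick sanity check before building can save a wasted reboot. A minimal sketch, with the hypothetical name `check_initramfs_list`, that only verifies the source of each `file` entry exists:

```shell
# Check that every "file" entry in an initramfs list points at a source that exists.
# Entry format: file <name-in-initramfs> <source-on-disk> <mode> <uid> <gid>
check_initramfs_list() {
    missing=0
    while read -r kind name src rest; do
        [ "$kind" = "file" ] || continue
        if [ ! -e "$src" ]; then
            echo "missing: $src"
            missing=1
        fi
    done < "$1"
    return "$missing"
}

# Example: check_initramfs_list /root/initrd/initramfs_list
```

If the list checks out, the kernel tree's `usr/gen_init_cpio` (built along with the kernel) can turn it into a cpio archive on stdout.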

If you don't use ext4, you need to run ldd on your fsck helper and include it and its libraries in place of fsck.ext4.

If your /usr and /var are different filesystems, you need both fsck helpers and their libraries.
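The ldd step can be sketched as a filter that turns its output into `file` lines for the list. The function name `ldd_to_initramfs` is made up, and the fixed `755 0 0` mode and ownership simply mirror the entries above.

```shell
# Turn `ldd` output (stdin) into initramfs_list "file" lines
# for each library that has an absolute path.
ldd_to_initramfs() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print "file", $3, $3, "755 0 0"; next }
        $1 ~ /^\//               { print "file", $1, $1, "755 0 0" }
    '
}

# Example: ldd /sbin/fsck.ext4 | ldd_to_initramfs
```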

My initscript ended up as 

```
#!/bin/busybox sh

rescue_shell() {

    echo "$@"

    echo "Something went wrong. Dropping you to a shell."

    /bin/busybox --install -s

    exec /bin/sh

}

# allow the use of UUIDs or filesystem labels

uuidlabel_root() {

    for cmd in $(cat /proc/cmdline) ; do

        case $cmd in

        root=*)

            type=$(echo $cmd | cut -d= -f2)

            echo "Mounting rootfs"

            if [ "$type" = "LABEL" ] || [ "$type" = "UUID" ] ; then

                uuid=$(echo $cmd | cut -d= -f3)

                mount -o ro $(findfs "$type"="$uuid") /mnt/root

            else

                mount -o ro $(echo $cmd | cut -d= -f2) /mnt/root

            fi

            ;;

        esac

    done

}

check_filesystem() {

    # most of code coming from /etc/init.d/fsck

    local fsck_opts= check_extra= RC_UNAME=$(uname -s)

    # FIXME : get_bootparam forcefsck

    if [ -e /forcefsck ]; then

        fsck_opts="$fsck_opts -f"

        check_extra="(check forced)"

    fi

    echo "Checking local filesystem $check_extra : $1"

    if [ "$RC_UNAME" = Linux ]; then

        fsck_opts="$fsck_opts -C0 -T"

    fi

    trap : INT QUIT

    # using our own fsck, not the builtin one from busybox

    /sbin/fsck -p $fsck_opts $1

    ret_val=$?

    case $ret_val in

        0)      return 0;;

        1)      echo "Filesystem repaired"; return 0;;

        2|3)    if [ "$RC_UNAME" = Linux ]; then

                        echo "Filesystem repaired, but reboot needed"

                        reboot -f

                else

                        rescue_shell "Filesystem still has errors; manual fsck required"

                fi;;

        4)      if [ "$RC_UNAME" = Linux ]; then

                        rescue_shell "Filesystem errors left uncorrected, aborting"

                else

                        echo "Filesystem repaired, but reboot needed"

                        reboot

                fi;;

        8)      echo "Operational error"; return 0;;

        16)     echo "Use or Syntax Error"; return 16;;

        32)     echo "fsck interrupted";;

        127)    echo "Shared Library Error"; sleep 20; return 0;;

        *)      echo $ret_val; echo "Some random fsck error - continuing anyway"; sleep 20; return 0;;

    esac

# rescue_shell can't find tty so its broken

    rescue_shell

}

# start for real here

# temporarily mount proc and sys

mount -t proc none /proc

mount -t sysfs none /sys

mount -t devtmpfs none /dev

# disable kernel messages from popping onto the screen

###echo 0 > /proc/sys/kernel/printk

# clear the screen

###clear

# assemble the raid set(s) - they got renumbered from md1, md5 and md6

# /boot

/sbin/mdadm --assemble /dev/md125 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# don't care if /boot fails to assemble

# /  (root)  I wimped out of root on lvm for this box

/sbin/mdadm --assemble /dev/md126 /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5 || rescue_shell

# if root won't assemble, we are stuck

# LVM for everything else

/sbin/mdadm --assemble /dev/md127 /dev/sda6 /dev/sdb6 /dev/sdc6 /dev/sdd6 || rescue_shell

# and if the LVM space won't assemble there is no /usr or /var so we are really in a mess

# TODO could auto cope with degraded raid operation

# lvm runs as whatever its called as

ln -s /sbin/lvm.static /sbin/vgchange

# start the vg volume group - we only have one volume group

/sbin/vgchange -ay vg || rescue_shell

# if this failed we have no /usr or /var

# get here with raid sets assembled and logical volumes available

# mounting rootfs on /mnt/root

uuidlabel_root || rescue_shell "Error with uuidlabel_root"

# space separated list of mountpoints that ...

mountpoints="/usr /var"

# ... we want to find in /etc/fstab ...

ln -s /mnt/root/etc/fstab /etc/fstab

# ... to check filesystems and mount our devices.

for m in $mountpoints ; do

#echo $m

    check_filesystem $m

    echo "Mounting $m"

    # mount the device and ...

    mount $m || rescue_shell "Error while mounting $m"

    # ... move the tree to its final location

    mount --move $m "/mnt/root"$m || rescue_shell "Error while moving $m"

done

echo "All done. Switching to real root."

# clean up. The init process will remount proc sys and dev later

umount /proc

umount /sys

umount /dev

# switch to the real root and execute init

exec switch_root /mnt/root /sbin/init
```

A few gotchas not listed in those wiki pages.  The filesystems checked and mounted by the initrd need to be set to noauto in /etc/fstab or you will be told that some mounts failed.  That's expected: /usr and maybe /var will already be mounted.

When DEVTMPFS makes your logical volume nodes, /dev/mapper/vg-user and friends are made, as are /dev/dm-0 and friends but the symbolic links in /dev/vg/ are not created.

This means you can use the first two in /etc/fstab but not the latter.   I found that out the hard way.
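For converting fstab entries, a small sketch that maps a /dev/<vg>/<lv> path to its /dev/mapper name. The function name `lv_to_mapper` is hypothetical; it relies on the device-mapper convention of doubling any hyphen inside a volume group or logical volume name.

```shell
# Convert /dev/<vg>/<lv> to the /dev/mapper/<vg>-<lv> form,
# doubling hyphens inside either name as device-mapper does.
lv_to_mapper() {
    vg=$(echo "$1" | cut -d/ -f3 | sed 's/-/--/g')
    lv=$(echo "$1" | cut -d/ -f4 | sed 's/-/--/g')
    echo "/dev/mapper/${vg}-${lv}"
}

# Example: lv_to_mapper /dev/vg/usr
```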

As the initrd contains the code for mounting everything by UUID, this is probably a good time to switch to UUID mounts.  Don't do it all in one go though.

My raid assembly is explicit on the mdadm command line because it's easy to follow. You could put /etc/mdadm.conf in the initrd and call that.

mdadm also understands how to assemble a raid set given its UUID.  That's still a TODO.
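A sketch of what the UUID route might look like, not something tested in this thread. The function name `array_uuid` is made up, and /dev/sda6 and /dev/md127 are just the names from the init script above.

```shell
# Extract the array UUID from `mdadm --examine` output read on stdin.
# v1.x superblocks label it "Array UUID"; v0.90 superblocks label it "UUID".
array_uuid() {
    awk '/^[ \t]*(Array )?UUID[ \t]*:/ { sub(/^[^:]*:[ \t]*/, ""); print $1; exit }'
}

# Example: find the UUID once, then hard-code it in the init script:
#   mdadm --examine /dev/sda6 | array_uuid
#   mdadm --assemble /dev/md127 --uuid=<uuid printed above>
```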

I don't have root on lvm on this system - that's the next one to convert.

----------

## ExecutorElassus

Hi Neddy!

so, after 10 days of basically working okay, I rebooted today. Now, when I reboot, the large RAID5 array - the one holding /usr, /var, /home, etc - is active as "auto-read-only" and none of its partitions are mounted. 

So, I'm back to where I started. Sorta.

As far as I can tell, the array is fine, and all its drives are active; they just … aren't being mounted by mdadm. Is there a way to re-initialize the system, so that mdadm and lvm re-do mounting and checking everything, and re-load all the stuff that lives on that array?

On a second question, can you think of any reason why that array would always start up auto-read-only?

Thanks,

EE

(and once again, I don't have access to a pager, because I haven't rebuilt busybox, mdadm, or lvm static. Do I need to do anything besides rebuild them with USE="static" to have access to them?)

----------

## NeddySeagoon

ExecutorElassus,

If busybox, mdadm and lvm are not built with USE=static, you need to remake them and rebuild your initrd.

Without the static USE, they will have never worked in your initrd, never mind being ok for 5 days.

What versions of openrc and udev do you have ?

There is an alternative to USE=static.  You can add the libraries these applications need to your initrd. 

I prefer static.

----------

## ExecutorElassus

openrc is 0.9.9.2.

So, since the RAID5 array isn't loading right now, is it safe to assume that this is because mdadm and lvm aren't built static? If that's the case, I should be able to fix it by booting a liveCD, mounting everything and chrooting in, then simply emerging them as static, yes? Will I need to copy any binaries over to /sbin, or does emerge take care of that?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

What version is udev?

If you type

```
mount -a
```

does everything mount?

Building busybox, lvm and mdadm with USE=static is not enough.  You have to get the new binaries into your initrd too.
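Before copying the binaries, it is worth confirming they really came out static. A sketch that inspects `ldd` output, assuming glibc's ldd wording; the function name `ldd_says_static` is hypothetical.

```shell
# Decide from `ldd` output (stdin) whether a binary is statically linked.
# glibc's ldd prints "not a dynamic executable" for static binaries;
# "statically linked" is accepted too, in case a different ldd words it that way.
ldd_says_static() {
    grep -q -E 'not a dynamic executable|statically linked'
}

# Example: ldd /sbin/mdadm 2>&1 | ldd_says_static && echo "mdadm is static"
```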

----------

## ExecutorElassus

Hey Neddy,

I'll have to wait until I get back to my home office Monday evening before I can try mounting everything (I have a music festival this weekend), but I'll report my results once I try.

udev is blocked as per your instructions past 181 (or whatever version starts treating mounting more strictly). 

Cheers,

EE

----------

## ExecutorElassus

Hi Neddy,

so, first: 'mount -a' mounts all the drives/partitions without a hitch. No errors in the console, nothing in dmesg. I have udev-171-r5 running. 

Now that I have a pager, I can read the dmesg log. I don't see anything too strange. I see complaints about "invalid raid superblock magic on sdX4" which I just take to mean that it isn't v0.90 (and md then says it is consequently not importing the superblock). md126 (the root RAID1 mirror across the three sdX3 partitions) is listed as having an unknown partition table (which makes sense, since it's a logical partition, right?).

Then there is a long gap of info about USB, and udev then starts, before md127 is loaded. md127 appears to load fine (but its included partitions are not mounted). 

One more thing: on shutdown ('init 0' or 'init 6') I get an error that md "cannot get exclusive access to md126 [the md array holding / ]," and that the array fails to stop. Could that be causing issues?

Anyway, should I now try to rebuild mdadm, lvm, and busybox as static, and make an initrd?

Thanks,

EE

PS, I notice that if I drop to runlevel 1 - thus shutting down the md arrays - and then reboot, sda4 gets punted off into a single-drive array - md125 - and md127 is stopped (meaning that I have to stop md125, restart md127 degraded, then add /dev/sda4 to it and sit back for a six-hour re-sync). Is there a reason for this? Is there a mismatched UUID somewhere that makes mdadm think the drives are in different arrays? If so, how do I fix it?

----------

## NeddySeagoon

ExecutorElassus,

I've been there - it all works when you do it by hand.

I *think*, but it's too difficult to prove, that your openrc is no longer tolerant of udev failures, which have existed for a long time but which are no longer retried.

/usr is not mounted when udev starts, so lots of udev things fail, and the retries that used to pick up the pieces are no longer attempted.

The only way is forward, since you need an initrd anyway.

That's an improved howto over the one I posted earlier in this thread.

----------

## ExecutorElassus

Hi Neddy,

actually, I think most of this started when I tried to use the earlymount script - I linked it earlier in the thread; it attempts to pre-mount RAID arrays without an initrd - and something *went wrong*. 

Anyway, heaven help me, I'll start following your guide once the re-sync is done (in about an hour), and let you know what happens. 

Cheers,

EE

----------

## ExecutorElassus

also, I note that my fstab is listing all my RAID partitions as "/dev/vg/[path]" and not "/dev/mapper/vg-[path-shorthand]"; might that be causing issues?

No matter: I'm replacing them all with UUIDs now, as per your guide. I'll report issues on that thread, and anything that appears to be my system acting schizoid on this one.

Cheers,

EE

----------

## ExecutorElassus

So, as an example of my system acting wonky, here's what I get when I run 'ldd /sbin/fsck.ext2':

```
# ldd /sbin/fsck.ext2

   linux-vdso.so.1 =>  (0x00007fff051ff000)

   libext2fs.so.2 => /lib64/libext2fs.so.2 (0x00007fe85174b000)

   libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007fe851547000)

   libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fe851320000)

   libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fe85111b000)

   libe2p.so.2 => /lib64/libe2p.so.2 (0x00007fe850f13000)

   libc.so.6 => /lib64/libc.so.6 (0x00007fe850b88000)

   libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe85096b000)

   /lib64/ld-linux-x86-64.so.2 (0x00007fe85198e000)

```

Note the first entry: it appears to be linking somewhere, but has no destination. I can't find the file anywhere, and e2fsprogs emerges fine, so I'm not sure what that is. Any ideas?

thanks,

EE

----------

## ExecutorElassus

And more on the side of strangeness in setting up your script:

is there a reason why one of the arrays - specifically, the large one holding everything beyond / and /boot - would have a non-hex UUID? Check this out:

```
# blkid

--SNIP--

/dev/md126: UUID="74d54c6f-6a2d-47a6-acf3-5a902d13899f" TYPE="ext3" 

/dev/md1: UUID="8d1b95b6-6e06-48a7-946a-3b739c8ee637" TYPE="ext2" 

/dev/md127: UUID="P1IbQY-JpO7-uBWA-5Jyr-hnRj-jB9S-LbIdsZ" TYPE="LVM2_member" 
```

Note that md127 uses a different numbering system for the UUID. Should I care? Does that have something to do with it being lvm2?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

```
/dev/md127: UUID="P1IbQY-JpO7-uBWA-5Jyr-hnRj-jB9S-LbIdsZ" TYPE="LVM2_member"
```

It's a piece of an lvm2 physical volume.

You don't put that in /etc/fstab as you can't mount it.

Your md1 and md126 hold ext2 and ext3 filesystems. md127 is an lvm member, which will contain filesystems in the individual logical volumes.
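As a rough illustration of that distinction, the sketch below filters `blkid` output down to the devices that could sensibly appear in /etc/fstab as filesystems. The function name `mountable_devices` is hypothetical, and the list of excluded types is an assumption covering only the cases seen in this thread plus swap.

```shell
#!/bin/sh
# Read `blkid` output on stdin and print only the devices that hold a
# mountable filesystem, skipping LVM physical volumes, raid members and swap.
mountable_devices() {
    while IFS= read -r line; do
        case $line in
            *' TYPE="LVM2_member"'*|*' TYPE="linux_raid_member"'*|*' TYPE="swap"'*) ;;
            *' TYPE="'*) printf '%s\n' "${line%%:*}" ;;   # keep the device name before the first colon
        esac
    done
}

# Usage: blkid | mountable_devices
```

Run against the listing quoted above, it would keep /dev/md126 and /dev/md1 and drop /dev/md127.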

----------

## ExecutorElassus

so, in the example script, where you have:

```
# assemble the raid set(s) - they got renumbered from md1, md5 and md6

# /boot

/sbin/mdadm --assemble /dev/md125 --uuid=d678d02e-28ab-84e0-c44c-77eb7ee19756

# don't care if /boot fails to assemble

# /  (root)  I wimped out of root on lvm for this box

/sbin/mdadm --assemble /dev/md126 --uuid=ad5fe0cb-775d-38b4-7169-e221fc96089f || rescue_shell

# if root won't assemble, we are stuck

# LVM for everything else

/sbin/mdadm --assemble /dev/md127 --uuid=52be4797:edab2349:eb21497e:52035eaa || rescue_shell

# and if the LVM space won't assemble there is no /usr or /var so we are really in a mess

# TODO could auto cope with degraded raid operation

```

I would only modify the first two as appropriate, and comment out the third?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

You need the UUID of the raid set, not of the LVM2 volume it carries, when you assemble the raid.

The safest way to get the UUID of the raid set is to ask mdadm

```
mdadm -E /dev/sda1
```

will show the UUID of the raid set that /dev/sda1 belongs to.

It's these UUIDs you need to feed to mdadm to assemble the raid sets, not the UUIDs of any filesystems or LVM2 physical volumes they may carry.
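A small sketch of pulling that UUID out of the `mdadm -E` output, assuming the two label spellings that appear in this thread ("UUID" for v0.90 superblocks, "Array UUID" for v1.2). The helper name `raid_uuid` is made up for illustration.

```shell
#!/bin/sh
# Read `mdadm -E <member>` output on stdin and print the raid set UUID.
# v0.90 superblocks label it "UUID"; v1.x superblocks label it "Array UUID".
# The per-device "Device UUID" line of v1.x is deliberately not matched.
raid_uuid() {
    awk '/^[ \t]*(Array )?UUID[ \t]*:/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the label up to the first colon
        print
        exit
    }'
}

# Hypothetical usage:
#   uuid=$(mdadm -E /dev/sda1 | raid_uuid)
#   mdadm --assemble /dev/md1 --uuid="$uuid"
```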

----------

## ExecutorElassus

Hi Neddy,

ah, that's what confused me: the fourth partition on each member drive is a member of a large RAID5 array, which is what is made into /dev/md127. However, that array - as it's an LVM array - does not return a UUID to blkid. That might be useful to add to the guide, since I got confused.

Cheers,

EE

----------

## ExecutorElassus

Hi Neddy,

I posted to your guide, but moved it here instead because it seems more to do with my system acting bizarre.

So, I failed to boot, and got dumped to a shell. mdraid is not starting, and consequently the root partition - or, for that matter, /boot - are not getting mounted, and everything stops. I'm getting error messages about being unable to find a boot disk.

I should note: my / is on a RAID1 array; mirrored across the three partitions. Does that make a difference for the UUID from your setup? I changed the kernel line in grub.conf to use the UUID for the md array on which / resides, but it doesn't seem to work (nor if I set it to /dev/md126) .

Any suggestions?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

The mdadm --assemble calls need the UUID of the raid, as determined by mdadm -E /dev/<raid_member>

All members of the same raid set carry the same UUID, which is how mdadm finds them

```
# mdadm -E /dev/sda1

/dev/sda1:

          Magic : a92b4efc

        Version : 0.90.00

           UUID : 9392926d:64086e7a:86638283:4138a597

...

# mdadm -E /dev/sdb1

/dev/sdb1:

          Magic : a92b4efc

        Version : 0.90.00

           UUID : 9392926d:64086e7a:86638283:4138a597

...
```

Those are two elements of my four-element raid1 /boot.

Feed your corresponding UUID(s) to mdadm to get the raid set(s) assembled.  Once the raid is assembled, those UUIDs are of no further interest.

blkid shows

```
/dev/md125: UUID="741183c2-1392-4022-a1d3-d0af8ba4a2a8" TYPE="ext2"
```

/dev/md125 is my /boot.  It contains an ext2 filesystem with UUID 741183c2-1392-4022-a1d3-d0af8ba4a2a8, which is the UUID needed to mount /boot

root is similar, so the UUID you need in root=uuid= is that of the filesystem, on the root block device, not the UUID of the underlying raid.

mdraid does not start in the initrd. mdadm runs and exits each time it's called to assemble a raid set.

----------

## ExecutorElassus

Hi Neddy,

I was feeding the bootloader the UUID for the / md filesystem as returned by blkid (that is, what the UUID for "/dev/md126" was showing), with no luck.  

Also, is there a reason why the first three partitions return a differently-formatted output for "mdadm -E" than the fourth? The three other partitions all return a table at the bottom listing all three devices, major and minor number, raid device, and state; but 'mdadm -E /dev/sdX4' returns the lines for Device Role and Array State as the last two lines. That may not mean anything (or maybe it's because they're in an lvm group? Or it's a v1.2 superblock?). The sdX4 are also not showing up on the blkid list while in the busybox shell.

'mdadm -A' seems to require the UUID of the array, not of the constituent devices. Is there a flag I use to specify them? None of the devices for sdX3 (those which would constitute / on /dev/md126) have an array UUID returned by 'mdadm -E' so the only UUID I know for this array is that of the filesystem. Is there another way to get to it?

thanks,

EE

UPDATE: I assembled it just using the device nodes (ie, 'mdadm -A /dev/md126 /dev/sda3 /dev/sdb3 /dev/sdc3'), and now it's active. However, the UUID it shows is exactly the one I have in the bootloader command line. Any guess why it wouldn't be working?

UPDATE 2: okay, I've rebooted, and I can assemble all my md arrays in the busybox shell. here's what happens when I run 'blkid':

```
#blkid

/dev/md126: UUID="74d54c6f-6a2d-47a6-acf3-5a902d13899f" TYPE="ext3"

/dev/md1: UUID="8d1b95b6-6e06-48a7-946a-3b739c8ee637" TYPE="ext2"

/dev/sdc1: UUID …

```

Even though md127 is assembled, it does not show in blkid. Is that because it's an lvm device?

Last edited by ExecutorElassus on Tue May 01, 2012 10:09 pm; edited 3 times in total

----------

## NeddySeagoon

ExecutorElassus,

Show me the output of blkid and mdadm -E /dev/sda[1234]

Sight of your initscript would also be good.

----------

## ExecutorElassus

Hi Neddy,

blkid output (from within busybox, after assembling the three md arrays) is in "UPDATE 2" above.

'mdadm -E /dev/sda1' (copied manually, please kill me):

```
/dev/sda1

         Magic : a92b4efc

       Version : 0.90.00

         UUID  : UUID=707f4cba:12af970b:cb201669:f728008a

Creation Time : Mon Apr 25 18:48:43 2011

   Raid Level : raid1

Used Dev Size: 97536 (95.27 MiB 99.88 MB)

    Array size: 97536 (95.27 MiB 99.88 MB)

Raid Devices : 3

Total Devices : 3

Preferred Minor : 1

Update Time : Mon Apr 30 23:27:56 2012

          State : clean

Active Devices : 3

Working Devices : 3

Failed Devices : 0

Spare Devices : 0

Checksum : 8bf8aa1f - correct

Events : 67

       Number       Major       Minor           RaidDevice  State

this    0                 8             1                     0            active sync    /dev/sda1

0       0                  8            1                      0           active sync    /dev/sda1

1       1                  8            33                    1           active sync    /dev/sdc1

2       2                  8            17                    2           active sync    /dev/sdb1
```

'mdadm -E /dev/sda3' (sdX2 are all swap partitions, and not in raid arrays; this array is where / resides):

```
/dev/sda3

         Magic : a92b4efc

       Version : 0.90.00

         UUID  : UUID=23a73541:b1ad3343:cb201669:f728008a

Creation Time : Mon Apr 25 18:48:43 2011

   Raid Level : raid1

Used Dev Size: 9765504 (9.31 GiB 10.00 GB)

    Array size: 9765504 (9.31 GiB 10.00 GB)

Raid Devices : 3

Total Devices : 3

Preferred Minor : 126

Update Time : Mon Apr 30 23:28:01 2012

          State : clean

Active Devices : 3

Working Devices : 3

Failed Devices : 0

Spare Devices : 0

Checksum : deb1da64 - correct

Events : 4840

       Number       Major       Minor           RaidDevice  State

this    0                 8             3                     0            active sync    /dev/sda3

0       0                  8            3                      0           active sync    /dev/sda3

1       1                  8            35                    1           active sync    /dev/sdc3

2       2                  8            19                    2           active sync    /dev/sdb3
```

'mdadm -E /dev/sda4' (this is the lvm for everything else):

```
/dev/sda4

         Magic : a92b4efc

       Version : 1.2

Feature Map : 0x0

Array UUID  : d42e5336:b75b0144:a502f2a0:178afc11

         Name : domo-kun:carrier

Creation Time : Mon Apr 25 18:48:43 2011

   Raid Level : raid5

raid Devices : 3

Avail Device Size : 19318413841 ( 921.17GiB 989.10 GB)

    Array size: 3863681024 (1842.35 GiB 1978.20 GB)

Used Dev Size: 1931840512 (921.17 GiB 989.10 GB)

Data Offset : 2048 sectors

Super Offset : 8 sectors

          State: clean

Device UUID . f7f1d49b:a0272bc3:c46251a2:e0502319

Update Time : Mon Apr 30 23:27:56 2012

   Checksum : 383d5a61 - correct

        Events : 27481

       Layout : left-symmetric

Chunk Size :  512K

Device Role : Active Device 0

Array State : AAA ('A' == active, '.' == missing)

```

That's the mdadm -E info for all partitions that are raid members on sda. 

init:

```
#!/bin/busybox sh

rescue_shell() {

    echo "$@"

    echo "Something went wrong. Dropping you to a shell."

    /bin/busybox --install -s

    exec /bin/sh

}

# allow the use of UUIDs or filesystem labels

uuidlabel_root() {

    for cmd in $(cat /proc/cmdline) ; do

         case $cmd in

         root=*)

              type=$(echo $cmd | cut -d= -f2)

              echo "Mounting rootfs"

              if [ "$type" = "LABEL" ] || [ "$type" = "UUID" ] ; then

                 uuid=$(echo $cmd | cut -d= -f3)

                 mount -o ro $(findfs "$type"="$uuid")  /mnt/root

              else

                 mount -o ro $(echo $cmd | cut -d= -f2) /mnt/root

              fi

              ;;

         esac

    done

}

check_filesystem() {

    # most of code coming from /etc/init.d/fsck

    local fsck_opts= check_extra= RC_UNAME=$(uname -s)

    #FIXME : get_bootparam forcefsck

    if [ -e /forcefsck ]; then

         fsck_opts="$fsck_opts -f"

         check_extra="(check forced)"

    fi

   echo "Checking local filesystem $check_extra : $1"

   if [ "$RC_UNAME" = Linux ]; then

          fsck_opts="$fsck_opts -C0 -T"

   fi

   trap : INT QUIT

   # using our fsck, not the builtin one from busybox

   /sbin/fsck -p $fsck_opts $1

   ret_val=$?

   case $ret_val in

           0)          return 0;;

           1)          echo "Filesystem repaired"; return 0;;

           2|3)       if [  "$RC_UNAME" = Linux ]; then

                                echo "Filesystem repaired, but reboot needed"

                                reboot -f

                         else

                                rescue_shell "Filesystem still have errors; manual fsck required"

                         fi;;

          4)            if [ "$RC_UNAME" = Linux ]; then

                                rescue_shell "filesystem errors left uncorrected, aborting"

                         else

                                echo "Filesystem repaired, but reboot needed"

                                reboot

                         fi;;

         8)             echo "Operational error"; return 0;;

         16)           echo "Use or Syntax Error"; return 16;;

         32)           echo "fsck interrupted";;

         127)         echo "Shared Library Error"; sleep 20; return 0;;

         *)             echo $ret_val; echo "Some random fsck error - continuing anyway"; sleep 20; return 0;;

      esac

# rescue_shell can't find tty so its broken

      rescue_shell

}

# start for real here

# temporarily mount proc and sys

mount -t proc none /proc

mount -t sysfs none /sys

mount -t devtmpfs none /dev

# disable kernel messages from popping onto the screen

###echo 0 > /proc/sys/kernel/printk

# clear the screen

###clear

# assemble the raid set(s) - they got renumbered from md1, md5 and md6

#/boot

/sbin/mdadm --assemble /dev/md1 --uuid=8d1b95b6-6e06-48a7-946a-3b739c8ee637

# don't care if /boot fails to assemble

# / (root) I wimped out of root on lvm for this box

/sbin/mdadm --assemble /dev/md126 --uuid=74d54c6f-6a2d-47a6-acf3-5a902d13899f || rescue_shell

# if root won't assemble, we are stuck

# LVM for everything else

/sbin/mdadm --assemble /dev/md127 --uuid=d42e5336:b75b0144:a502f2a0:178afc11 || rescue_shell

# and if the LVM space won't assemble there is no /usr or /var so we are really in a mess

# TODO could auto cope with degraded raid operation

# lvm runs as whatever its called as and we need vgchange

ln -s /sbin/lvm.static /sbin/vgchange

# start the vg volume group - we only have one volume group

/sbin/vgchange -ay vg || rescue_shell

# if this failed we have no /usr or /var

# get here with raid sets assembled and logical volumes available

# mounting rootfs on /mnt/root

uuidlabel_root || rescue_shell "Error with uuidlabel_root"

# space separated list of mountpoints that …

mountpoints="/usr /usr/portage /usr/portage/distfiles /var /var/tmp /home /opt /tmp"

# … we want to find in /etc/fstab …

ln -s /mnt/root/etc/fstab /etc/fstab

# … to check filesystems and mount our devices.

for m in $mountpoints ; do

#echo $m

   check_filesystem $m

   echo "Mounting $m"

   # mount the device and …

   mount $m || rescue_shell "Error while mounting $m"

   # … move the tree to its final location

   mount --move $m "/mnt/root"$m || rescue_shell "Error while moving $m"

done

echo "All done. Switching to real root"

# clean up. The init process will remount proc sys and dev later

umount /proc

umount /sys

umount /dev

# switch to the real root and execute init

exec switch_root /mnt/root /sbin/init
```
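A side note on the `case` in check_filesystem, offered as a sketch rather than a fix: fsck(8) documents its exit status as a sum of flags (1, 2, 4, 8, 16, 32, 128), so combined values such as 3 or 5 are possible, and the shared-library flag is 128 rather than 127. The helper below decodes a status bit by bit; the name `fsck_status` is made up for illustration.

```shell
#!/bin/sh
# Decode an fsck exit status into the conditions documented in fsck(8).
# The status is a sum of flags, so 3 means "errors corrected" plus
# "system should be rebooted".
fsck_status() {
    rc=$1
    [ "$rc" -eq 0 ] && { echo "no errors"; return 0; }
    [ $((rc & 1))   -ne 0 ] && echo "errors corrected"
    [ $((rc & 2))   -ne 0 ] && echo "system should be rebooted"
    [ $((rc & 4))   -ne 0 ] && echo "errors left uncorrected"
    [ $((rc & 8))   -ne 0 ] && echo "operational error"
    [ $((rc & 16))  -ne 0 ] && echo "usage or syntax error"
    [ $((rc & 32))  -ne 0 ] && echo "checking canceled by user request"
    [ $((rc & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}

# Usage after running fsck:  /sbin/fsck -p "$dev"; fsck_status $?
```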

See anything wrong?

Cheers,

EE

PS: Within the rescue_shell, I see that /proc /sys and /dev are mounted, which means that the init script has made it at least to that point. I'm assuming the failure is in assembling the / array (/dev/md126), since it doesn't care if /boot fails. Is there any reason to worry about the difference in formatting between how blkid returns /dev/md126, and how it's listed in the init script? ie, that the latter uses colons?

EDIT: I mis-copied the mdadm --assemble line for md126. It's correct now, and matches what blkid returns and is listed in the bootloader.

----------

## NeddySeagoon

ExecutorElassus,

Distilling what you posted.

```
/dev/sda1

UUID  : UUID=707f4cba:12af970b:cb201669:f728008a 

Preferred Minor : 1
```

shows that /dev/sda1 belongs to /dev/md1 and /dev/md1 has UUID=707f4cba:12af970b:cb201669:f728008a 

Your init script says

```
/sbin/mdadm --assemble /dev/md1 --uuid=8d1b95b6-6e06-48a7-946a-3b739c8ee637 
```

Your update 2 shows

```
 /dev/md1: UUID="8d1b95b6-6e06-48a7-946a-3b739c8ee637" TYPE="ext2"
```

It's clear from the above that you are using the UUID of the filesystem on md1 to attempt to assemble md1, not the UUID of the raid.

You have done the same thing for md126 too, so the initrd will not assemble your raid sets.

```
/sbin/vgchange -ay vg || rescue_shell 
```

Is your lvm volume group called vg?
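For illustration, here is roughly what the assemble step could look like with the raid UUIDs from the `mdadm -E` listings posted above. The `assemble` wrapper and the `MDADM` override are assumptions of this sketch, added only so the commands can be previewed without touching any device; the init script in the guide calls /sbin/mdadm directly.

```shell
#!/bin/sh
# MDADM can be overridden (for example MDADM=echo) to preview the arguments
# without touching any device; that override is an assumption of this sketch.
assemble() {
    # $1 = md device node, $2 = raid set UUID as printed by `mdadm -E`
    "${MDADM:-/sbin/mdadm}" --assemble "$1" --uuid="$2"
}

# Values taken from the `mdadm -E` listings posted above:
#   assemble /dev/md1   707f4cba:12af970b:cb201669:f728008a
#   assemble /dev/md126 23a73541:b1ad3343:cb201669:f728008a || rescue_shell
#   assemble /dev/md127 d42e5336:b75b0144:a502f2a0:178afc11 || rescue_shell
```

The md127 line already carried the raid UUID in the posted init script; only md1 and md126 were using filesystem UUIDs.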

----------

## ExecutorElassus

'vg' is indeed the name of my volume group.

okay, so I have the UUIDs wrong in my init (and thus likely also in the bootloader). I can fix the former, but how do I fix the latter? Is there a means from within busybox to remake the initrd, or will I have to use a liveCD?

Cheers,

EE

----------

## NeddySeagoon

ExecutorElassus,

You need to use the liveCD to fix the initrd.

The root=uuid= needs to be the uuid of the root filesystem, not that of the underlying raid.
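A minimal sketch of how an init script might pick the root filesystem out of the kernel command line, assuming a `root=UUID=<filesystem uuid>` or `root=/dev/...` argument. The function name `root_spec` is hypothetical; the init script posted above does the equivalent with `cut`.

```shell
#!/bin/sh
# Print the value of the last root= argument in a kernel command line.
# "root=UUID=<fs-uuid>" yields "UUID=<fs-uuid>", which findfs accepts;
# "root=/dev/md126" yields the device path unchanged.
# Returns non-zero if no root= argument is present.
root_spec() {
    spec=
    for arg in $1; do
        case $arg in
            root=*) spec=${arg#root=} ;;
        esac
    done
    [ -n "$spec" ] && printf '%s\n' "$spec"
}

# Hypothetical usage inside the initrd:
#   spec=$(root_spec "$(cat /proc/cmdline)") || rescue_shell "no root= given"
#   case $spec in
#       UUID=*|LABEL=*) dev=$(findfs "$spec") ;;
#       *)              dev=$spec ;;
#   esac
#   mount -o ro "$dev" /mnt/root
```

The UUID here is the one blkid reports for the assembled md device, since findfs looks up filesystems, not raid sets.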

----------

## ExecutorElassus

Hi Neddy,

alas! I was afraid you'd say that. It'll have to wait until tomorrow night, when I can spend the time opening up the box and plugging in the old IDE CD drive I keep around for this sole purpose. 

Once I've booted into the live CD, I know how to start up the md arrays and mount everything. Do I just edit the init script I have stored to match the correct UUIDs, and then repeat the '/usr/src/linux/scripts/gen_initramfs_list.sh -o /boot/initrd.cpio.gz /root/initrd/initramfs_list ' command from your guide?

so, to be clear: 

1) the UUID for md126 in the bootloader is for the filesystem, and is returned by 'blkid'

but

2) the UUID for that same array in the init script is for the raid array, and will thus not match, and is the UUID returned by mdadm -E /dev/sdXn

Is that correct?

Cheers,

EE

----------

## ExecutorElassus

Hi Neddy,

so, I updated your script and now it gets past assembling the md arrays, hurrah!

here's my next problem: my mountpoints in the init script are:

```
mountpoints="/usr /usr/portage /usr/portage/distfiles /var /var/tmp /home /opt /tmp"
```

the script checks and mounts /usr fine but on checking /usr/portage, I get an error:

```
Checking local filesystem : /usr

/dev/mapper/vg-usr: clean, 748763/1310720 files, 3776000/5242880 blocks

Mounting /usr

kjournald starting. Commit interval 5 seconds

EXT3-fs (dm-0): using internal journal

EXT3-fs (dm-0): mounted filesystem with writeback data mode

Checking local filesystem : /usr/portage

/dev/mapper/vg-portage: clean, 184986/200704 files, 415453/2097152 blocks

Mounting /usr/portage

mount: mounting /dev/dm-1 on /usr/portage failed: No such file or directory

Error while mounting /usr/portage

Something went wrong. Dropping you to a shell
```

The relevant portion of my /etc/fstab is:

```
/etc/fstab:

UUID=7f880ef6-833c-4d19-96fa-524f78e822f8    /usr           ext3         noatime,noauto    1 0

UUID=422e349f-7f3b-4037-9621-1c786e16e48b /usr/portage ext2       noatime,noauto    1 0

UUID=b4335b32-6bcc-44c3-9f85-bf2c91eb400e   /usr/portage/distfiles ext2 noatime, noauto  1 0

```

which matches the values returned by blkid for those filesystems.

Any suggestions what's going on?

Cheers,

EE

UPDATE: since udev only really cares that /usr and /var are premounted, I dropped the rest from the init script, reverted their lines in /etc/fstab back to allow automounting and checking, and rebooted. hurrah! I have a root prompt, and can emerge stuff!

… and now I'm back to a gui. My beloved has been returned to me! omg omg omg

I'll keep you posted, but it looks like everything is in order. Only wonky thing I saw on boot was some errors about nonexistent /dev/vg nodes for some partitions, but they mounted anyway.

holy crap, this nightmare might finally be over. <3

PS: do I need to re-run the gen_initramfs_list.sh script each time I install a new kernel?

----------

## NeddySeagoon

ExecutorElassus,

I was about to post to say only mount /usr and /var but went for a beer instead.

When I got back, you had already done it.

As the initrd does not contain anything kernel specific, there is no need to remake it for every kernel.

Indeed, it uses random binaries from your system; that's a good reason not to remake it unless you really need to.

If there are security updates for the packages in the initrd, do you care?

They cannot be exploited remotely as networking is not started until the initrd has done its thing and been discarded.

When you do update it, give the new initrd a new name.  You really don't want to overwrite your only working initrd with a broken one, just like your kernel. 

All the initrd really does is to appease >=udev-182 by mounting /usr and /var before the real init script starts udev.
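One way to follow the "new name" advice, sketched under the assumption that images live in /boot and are named initrd-<version>.cpio.gz. Both the naming scheme and the helper `new_initrd_path` are made up for illustration.

```shell
#!/bin/sh
# Print the path for a new, versioned initrd image, refusing to reuse a
# name that already exists so a working image is never overwritten.
new_initrd_path() {
    dir=$1
    version=$2
    path=$dir/initrd-$version.cpio.gz
    if [ -e "$path" ]; then
        echo "refusing to overwrite $path" >&2
        return 1
    fi
    printf '%s\n' "$path"
}

# Hypothetical usage, reusing the generator command quoted earlier in the thread:
#   out=$(new_initrd_path /boot 1.1) &&
#   /usr/src/linux/scripts/gen_initramfs_list.sh -o "$out" /root/initrd/initramfs_list
```

The bootloader entry then needs a matching initrd line, ideally as a second entry alongside the known-good one.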

----------

## ExecutorElassus

All right, then this will be initrd-1.0. Hurrah for a feature-complete (ie, it boots!) release! Out of Beta and releasing on time, etc etc.

Well, if I could buy you a beer, I would. Thanks for all the (very patient) help you've provided. I'm (mostly) sure the system - at least as far as having a working lvm for recent udev releases - is functional. I'll mark this as solved. (now I just have to figure out why boinc, kdepimlibs, kgpg, and gnome-settings-daemon won't emerge, but that's more a portage/programming question). 

Cheers, mate.

EE

----------

