# mpirun goes zombie (kernel bug?)

## pomaranca

When I run my program with mpirun -c 16, the system sometimes gets stuck: the first run is fine, but the next one hangs. htop shows all cores fully loaded and the program still running, although it should have finished calculating the result by then. Furthermore, all of its processes become defunct (zombie-like) and cannot be killed in any way, not even with kill -9. The dmesg output suggests a kernel bug related to ext4, which I am in fact using. I am wondering whether this could really be the problem; I have never encountered such strange behavior on any Linux box before. After a reboot (echo b > /proc/sysrq-trigger) the system works normally until the second mpirun.
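For anyone trying to confirm the same symptom, here is a minimal sketch for listing processes that are zombies (state Z) or in uninterruptible sleep (state D), which are the usual states of processes that ignore kill -9. The function name filter_stuck is made up for illustration, and the sketch assumes a procps-style ps that accepts -eo with the pid, stat and comm columns.

```shell
#!/bin/sh
# Keep only lines whose second column (the process state) starts with
# D (uninterruptible sleep) or Z (zombie/defunct).
# filter_stuck is a hypothetical helper name, not an existing tool.
filter_stuck() {
    awk '$2 ~ /^[DZ]/'
}

# List PID, state and command name of every process, then filter.
# Assumes a procps-style ps; errors are ignored if ps is missing.
ps -eo pid=,stat=,comm= 2>/dev/null | filter_stuck || true
```

If the meep-mpi processes show up here with state D rather than Z, they are blocked inside the kernel, which would be consistent with a kernel-side problem rather than with the program itself.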

mpirun exits with the following message:

```
mpirun noticed that process rank 12 with PID 9574 on node x exited on signal 11 (Segmentation fault).
```

dmesg output:

```
[64188.938475] ------------[ cut here ]------------
[64188.938730] kernel BUG at fs/ext4/inode.c:1852!
[64188.938981] invalid opcode: 0000 [#1] SMP
[64188.939238] last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id
[64188.939733] CPU 3
[64188.939983] Modules linked in: ipt_ipp2p acpi_cpufreq nvidiafb fb_ddc vgastate nvidia(P) cciss
[64188.940510] Pid: 10153, comm: meep-mpi Tainted: P           2.6.32-gentoo-r7 #11 ProLiant ML150 G6
[64188.941005] RIP: 0010:[<ffffffff8117c676>]  [<ffffffff8117c676>] ext4_da_get_block_prep+0xec/0x249
[64188.941512] RSP: 0000:ffff8801aa18bbb8  EFLAGS: 00010297
[64188.941763] RAX: 0000000000000005 RBX: ffff8801bef8cf50 RCX: 0000000000000154
[64188.942018] RDX: 0000000000000004 RSI: 0000000000000153 RDI: 0000000000000004
[64188.942273] RBP: ffff8801aa18bc18 R08: ffff8801bef8cf50 R09: 0000000000000000
[64188.942527] R10: 0000000000000003 R11: ffff88033ed6ee78 R12: ffff88033ed6ed60
[64188.942782] R13: 0000000000000000 R14: ffffea0005cfa300 R15: ffff88033ed6ecc0
[64188.943037] FS:  00007fd597790700(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
[64188.943532] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[64188.943784] CR2: 00007fd58aa6a980 CR3: 00000001be045000 CR4: 00000000000006e0
[64188.944038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[64188.944293] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[64188.944548] Process meep-mpi (pid: 10153, threadinfo ffff8801aa18a000, task ffff8801bf33bd00)
[64188.945041] Stack:
[64188.945287]  ffffea0005cfa300 ffffffffffff0000 ffffea0005cfa300 ffff88033e6ed000
[64188.945553] <0> ffff88033ed6f028 0000000000726000 00000000aa18bc18 ffffea0005cfa300
[64188.946063] <0> ffff8801aa18bc68 0000000000001000 ffffea0005cfa300 0000000000000000
[64188.946816] Call Trace:
[64188.947067]  [<ffffffff8110490c>] __block_prepare_write+0x142/0x29d
[64188.947322]  [<ffffffff8117c58a>] ? ext4_da_get_block_prep+0x0/0x249
[64188.947578]  [<ffffffff81104bd6>] block_write_begin+0x80/0xd0
[64188.947831]  [<ffffffff8117f324>] ext4_da_write_begin+0x183/0x205
[64188.948086]  [<ffffffff8117c58a>] ? ext4_da_get_block_prep+0x0/0x249
[64188.948343]  [<ffffffff810ac107>] ? find_get_page+0x23/0x84
[64188.948595]  [<ffffffff81178d43>] ext4_page_mkwrite+0x10c/0x15d
[64188.948851]  [<ffffffff810c2f44>] __do_fault+0x11a/0x34e
[64188.949104]  [<ffffffff810c4dd9>] handle_mm_fault+0x30d/0x6b1
[64188.949360]  [<ffffffff81524abf>] do_page_fault+0x2a9/0x2c3
[64188.949614]  [<ffffffff81522a7f>] page_fault+0x1f/0x30
[64188.949865] Code: 45 c0 48 8b 7d c0 e8 b4 60 3a 00 41 8b b7 58 03 00 00 4c 89 e7 ff c6 e8 40 eb ff ff 48 63 d0 41 8b 87 5c 03 00 00 48 39 c2 73 04 <0f> 0b eb fe 48 29 c2 48 8b 45 c0 48 89 55 b0 fe 00 4c 8b 75 b0
[64188.950832] RIP  [<ffffffff8117c676>] ext4_da_get_block_prep+0xec/0x249
[64188.951090]  RSP <ffff8801aa18bbb8>
[64188.951792] ---[ end trace f7985c634e9172c9 ]---
```
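The source location on the "kernel BUG at" line is the most useful thing to search a bug tracker for. As a rough sketch, it can be pulled out of the kernel log like this; the function name bug_locations is hypothetical, and the sed pattern simply assumes the message format seen in the trace above.

```shell
#!/bin/sh
# Print the file:line part of every "kernel BUG at <file>:<line>!" message
# read from standard input. bug_locations is a hypothetical helper name.
bug_locations() {
    sed -n 's/.*kernel BUG at \([^!]*\)!.*/\1/p'
}

# Feed the kernel ring buffer through the filter. Reading dmesg may need
# root on some systems, so failures are ignored here.
dmesg 2>/dev/null | bug_locations || true
```

For the trace above this would yield fs/ext4/inode.c:1852, which together with the kernel version (2.6.32-gentoo-r7) makes a reasonable search string.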

emerge --info output:

```
Portage 2.1.8.3 (default/linux/amd64/10.0, gcc-4.3.4, glibc-2.10.1-r1, 2.6.32-gentoo-r7 x86_64)
=================================================================
System uname: Linux-2.6.32-gentoo-r7-x86_64-Intel-R-_Xeon-R-_CPU_E5520_@_2.27GHz-with-gentoo-1.12.13
Timestamp of tree: Fri, 28 May 2010 14:35:01 +0000
app-shells/bash:     4.0_p37
dev-java/java-config: 2.1.10
dev-lang/python:     2.6.5-r2, 3.1.2-r3
dev-python/pycrypto: 2.1.0
dev-util/cmake:      2.6.4-r3
sys-apps/baselayout: 1.12.13
sys-apps/sandbox:    1.6-r2
sys-devel/autoconf:  2.13, 2.65
sys-devel/automake:  1.9.6-r3, 1.10.3, 1.11.1
sys-devel/binutils:  2.18-r3
sys-devel/gcc:       4.3.4, 4.4.3-r2
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.2.6b
virtual/os-headers:  2.6.30-r1
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe -fomit-frame-pointer -march=native -mtune=native"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/X11/xkb /usr/share/config /var/lib/hsqldb"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-O2 -pipe -fomit-frame-pointer -march=native -mtune=native"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests distlocks fixpackages news parallel-fetch protect-owned sandbox sfperms strict unmerge-logs unmerge-orphans userfetch"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="en_US.UTF-8"
LDFLAGS="-Wl,-O1"
LINGUAS="en"
MAKEOPTS="-j33"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X aac acl acpi alsa amd64 apache2 apm avi bash-completion berkdb bluetooth branding bzip2 cblas cli cracklib crypt css cups cxx dbus djvu dri dvb dvd encode examples ffmpeg fftw fortran fuse gd gdbm gif gnome gpm gtk hal iconv ipv6 java jpeg latex lm_sensors logrotate matroska mmx modules mp3 mp4 mpeg mpi mudflap multilib mysql mysqsl ncurses nforce2 nls nptl nptlonly nsplugin nvidia offensive ogm opengl openmp pam pcre pdf perl php pmu png postscript pppd python qt3support qt4 quicktime readline reflection samba session sms snmp sound spl sse sse2 ssl svg sysfs tcpd theora tiff tk truetype unicode usb v4l v4l2 vhosts vorbis wifi x264 xcomposite xinerama xorg xvid zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="keyboard mouse" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="nvidia nv" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LC_ALL, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
```

----------

## massimo

Seems like you hit [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=15231

----------

