# sync hangs

## Pasketti

I'm just looking for some sort of pointer here or maybe something else I can try.

My problem is that the sync command will intermittently hang up and never return.

When it does this, the sync process is unkillable.  

I am also unable to shut down the system via "shutdown -r now" - that just says "System going down now!" and hangs up.

CTRL-ALT-DEL also doesn't work.

I can switch to another console or open another terminal session and everything seems fine, but if I issue another sync, that session will also lock up.

The only way to get rid of it is to manually shut down all server processes and power the box off.

When it comes back up, sync will work fine.  Until it doesn't.  Then it locks up.

I discovered this when a nightly job that does a sync never stopped.  ps reported half a dozen processes that had been hanging for days.  I commented out the sync line in the script, but that's just a bandaid.  I'd like to find out what's going on and fix it.
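A process that ignores kill is usually stuck in uninterruptible sleep, shown as state D by ps. As a rough sketch, assuming procps ps with its `wchan` output column, something like the following lists those processes along with the kernel function each one is waiting in. The `filter_dstate` helper name is made up for illustration.

```shell
#!/bin/bash
# Keep the header line plus any process whose STAT column starts with "D"
# (uninterruptible sleep). Reads ps-style output on stdin.
filter_dstate() {
  awk 'NR == 1 || $2 ~ /^D/'
}

# wchan:32 widens the column so the kernel function name is not truncated.
ps -eo pid,stat,wchan:32,args | filter_dstate
```

If sync shows up here, the WCHAN column gives a hint about where in the kernel it is blocked.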

I checked /sys/block/<device>/stat.  Supposedly the 9th column shows pending requests, and that is 0, so it seems there is nothing for it to do, but still the sync never returns.
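For anyone wanting to repeat that check, here is a minimal sketch that prints the 9th field (requests currently in flight) for every block device. The `in_flight` helper name is hypothetical, and the loop simply skips anything it cannot read.

```shell
#!/bin/bash
# Print the in-flight request count (9th field) from a block device stat file.
in_flight() {
  awk '{ print $9 }' "$1"
}

# Report every block device the kernel exposes; skip anything unreadable.
for stat_file in /sys/block/*/stat; do
  [ -r "$stat_file" ] || continue
  dev=${stat_file#/sys/block/}
  dev=${dev%/stat}
  printf '%s in-flight: %s\n' "$dev" "$(in_flight "$stat_file")"
done
```

A value that stays at 0 while sync is hung suggests the block layer is idle and the wait is happening somewhere above it.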

Has anyone else ever had this happen, and what did you do to fix it?

I'm running vanilla-sources 3.4.9, amd64.  It's a server, and does not have X installed.  I can be more vague on my configuration if necessary.

Thanks!

----------

## NeddySeagoon

Pasketti,

That it's random suggests a hardware error of some sort.

A properly working sync should retry after a timeout and, when that fails too, exit cleanly once the retry count is exhausted.

We can say that it's unlikely to be a network issue, as the above behavior is not observed.

Find a boot CD that has memtest or memtest86+ on it and run a few cycles.  It's important that you boot directly into memtest, as it needs to run on the bare hardware to get useful results.

Errors reported by memtest are not always memory errors.  If it reports problems, post the error reports.

Install lm-sensors and check the CPU temperature.  It could be overheating for any number of reasons, but syncing isn't nearly as CPU intensive as building packages.

It can also be a CPU Vcore regulator issue.  That will require a visual, covers-off inspection to check.

This part of the motherboard gets hot and is worked very hard. Cheap motherboards skimp here and early failures are common. How old is the system?

----------

## Pasketti

Bah.  I was hoping for something like "Oh all you need to do is enable option XYZ and recompile your kernel."

I ran memtest for two passes, no errors found.

I installed lm_sensors.  I'll see what it tells me.

It's older hardware (Core 2 Duo).  The install is maybe three years old.

The only other thing I can think of is the power supply.  Maybe I'll try that next.

Thank you for your assistance, good sir and/or madam!

----------

## NeddySeagoon

Pasketti,

memtest actually gives most of your motherboard components a good workout as it uses the CPU to thrash the RAM.

As it passed, I'm inclined to think it's probably not a thermal issue either.

As you still have control when the hang occurs, look in dmesg for any disc IO errors.
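A minimal sketch of that check, assuming the usual wording of kernel messages for I/O errors, ext4 errors, and hung tasks. The pattern list is a guess at what is worth looking for, not an exhaustive filter, and `io_errors` is a made-up helper name.

```shell
#!/bin/bash
# Filter kernel log lines on stdin down to likely storage trouble.
io_errors() {
  grep -iE 'i/o error|blocked for more than|ext4-fs error|ata[0-9]+.*(error|failed)'
}

# grep exits non-zero when nothing matches, so report that case explicitly.
dmesg | io_errors || echo "no matching kernel messages"
```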

I'm not aware of any kernel issues.  If there were, it would be all over the forums as many users would hit the issue.

As it seems to be 'just yourself', it's likely to be hardware.  It's still possible that it's software due to hardware.  For example, rsync or something that it uses may have been compiled incorrectly because of a transient hardware issue.

It's also possible that, if you do not use an rsync rotation, the single rsync server you do use has a problem.  It's been known.

Run emerge --info.  The SYNC= line should be one of the following.  If not, fix it and try again.

```
#   Default:       "rsync://rsync.gentoo.org/gentoo-portage"
#   North America: "rsync://rsync.namerica.gentoo.org/gentoo-portage"
#   South America: "rsync://rsync.samerica.gentoo.org/gentoo-portage"
#   Europe:        "rsync://rsync.europe.gentoo.org/gentoo-portage"
#   Asia:          "rsync://rsync.asia.gentoo.org/gentoo-portage"
#   Australia:     "rsync://rsync.au.gentoo.org/gentoo-portage"

SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage"
```

These are not individual rsync servers.  They are 'rotations' in that they pass you on to a real random rsync server.  If you get a dud, the next try normally gets you a different server.  Make sure you are using a rotation (one of the above), not a single server.

----------

## eccerr0r

'Sync' is the command to tell the kernel to dump all dirty buffers to disk.  When sync hangs, it means it thinks it has something to write (perhaps metadata) and can't do it.
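One way to see whether the kernel still believes it has data to flush is to look at the Dirty and Writeback counters in /proc/meminfo. This is only a sketch, and `dirty_pages` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the Dirty and Writeback lines from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
dirty_pages() {
  grep -E '^(Dirty|Writeback):' "${1:-/proc/meminfo}"
}

dirty_pages || echo "could not read Dirty/Writeback from /proc/meminfo"
```

Running it a few times while sync is hung shows whether those numbers are draining or stuck.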

If the problem is with a disk, it could mean the disk is failing, but that usually shows up as other errors in dmesg.

Are you using NFS or other network filesystem?  Are you (or your users) using FUSE (which I despise though it's a good concept)?  These two cause a lot of hangs for me...

----------

## NeddySeagoon

eccerr0r,

How did I misread that ...  

Sorry Pasketti

----------

## Pasketti

No worries.  It did get me to consider hardware issues, which I had been avoiding thinking about.

smartctl tells me the drives pass all the diagnostics.

I bought a new (bigger) power supply, but when I put it in, the machine became unstable and would cycle power randomly within a few minutes.  Putting the old one back in made it stop doing that, so I'm returning the PS.

I did blow a huge amount of dust out of the thing before I put it back.

But I did notice something.  I back everything up to a second hard drive using rsync.  When that's done, I delete several directories via rm -rf from the backup drive that don't need to be backed up.  And the rm terminated with a "Killed" message.  Once that happened, sync would start hanging.

So I tried to delete the source folders (they hold backups from my kids' Windows machines, and get recreated every Sunday).  And the rm terminated with "Killed" and my shell process hung.  I opened a new terminal and tried a sync.  It hung.  I couldn't reboot either.

I shut everything down manually, then cycled power, booted from my handy Live CD and ran fsck on all my filesystems.

I'm thinking there may have been some corruption in the file system causing rm to puke, but it left some flag set in the kernel that caused a race condition with sync.  Or not.

As of right now, sync is not hanging.  If it's still working in a couple of days, then I'll start to be more optimistic.

I do appreciate having someone to bounce things off of.  It helps.

Thanks again!

----------

## eccerr0r

When you see "Killed", it's because the kernel decided that the program was doing something really bad (or the kernel was confused itself), and there should be some diagnostics in dmesg.  Check it right after the kill and you should see exactly what the kernel didn't like.  You might want to open another terminal and run dmesg once in a while so that the binary stays cached in RAM; then, the next time something happens, it can still run even if nothing can be read from disk.

Out of curiosity, what filesystem is this on?

----------

## Pasketti

It's ext4.

My backup ran last night, and there were no problems with rm.  And sync isn't hanging either.

I am cautiously optimistic.

----------

## Hu

 *Pasketti wrote:*   

> But I did notice something.  I back everything up to a second hard drive using rsync.  When that's done, I delete several directories via rm -rf from the backup drive that don't need to be backed up.

 You may be able to avoid the deletion step by using rsync exclude directives to skip copying the files in the first place.

----------

## Pasketti

 *Hu wrote:*   

> You may be able to avoid the deletion step by using rsync exclude directives to skip copying the files in the first place.

 

Yeah, I know.  It was a tradeoff.

Here's the backup script:

```
#!/bin/bash

DATE=$(date "+%Y-%m-%d")

date

# No leading slash on dir name
for DIR in root etc var/bind var/www var/lib/portage opt/msm home
do
  echo "============ Backup $DIR"
  mkdir -p "/backup/$DATE/$DIR"
  rsync -avx --delete --link-dest="/backup/current/$DIR" "/$DIR/" "/backup/$DATE/$DIR"
done

# Point "prev" at the old snapshot and "current" at the new one
rm -f /backup/prev
mv /backup/current /backup/prev
ln -s "/backup/$DATE" /backup/current
```

So I'd have to unwind the FOR loop if I wanted to exclude some directories.  But the for loop makes it easy to add more paths.  So I back it all up and then delete a few paths afterward.  It happens at 3 AM, so it's not like I'm waiting for it to finish.
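For what it's worth, the loop may not need unrolling: an rsync exclude pattern without a leading slash matches at any depth of each transfer, so the same `--exclude` options can ride along on every iteration. The sketch below only prints the commands it would run. The excluded directory names are hypothetical, and `backup_cmd` is a made-up helper.

```shell
#!/bin/bash
# Hypothetical directory names to skip; a trailing slash matches directories only.
EXCLUDES=(--exclude='pc-backups/' --exclude='cache/')

# Print (not run) the rsync command for one directory of the backup set.
backup_cmd() {
  local dir=$1 date=$2
  echo rsync -avx --delete "${EXCLUDES[@]}" \
    "--link-dest=/backup/current/$dir" "/$dir/" "/backup/$date/$dir"
}

DATE=$(date "+%Y-%m-%d")
for DIR in root etc home
do
  backup_cmd "$DIR" "$DATE"
done
```

Dropping the `echo` would turn the dry print into the real transfer. The patterns apply to every directory in the loop, so a name that should only be skipped under one path needs a more specific pattern.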

----------

## Pasketti

I think it's fixed.  It's been two days now, and everything's fine.

Before I ran fsck, it would hang up every morning.

So the lesson here is "if sync hangs, and you can't find a hardware problem, run fsck -pf"
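Since fsck should only be run on a filesystem that is not mounted, here is a cautious sketch that checks /proc/mounts first and only prints the fsck command instead of running it. `/dev/sdXN` is a placeholder, `is_mounted` is a hypothetical helper, and the device name in /proc/mounts may differ from the one you expect (for example when mounting by UUID).

```shell
#!/bin/bash
# Succeed if the given device appears in the first column of a mounts file.
# Defaults to /proc/mounts when no file is given.
is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

DEV=${DEV:-/dev/sdXN}   # placeholder: set DEV to the real partition

if is_mounted "$DEV"; then
  echo "$DEV is mounted; boot from live media before checking it"
else
  # -p repairs safe problems automatically, -f forces a check even if marked clean
  echo "would run: fsck -pf $DEV"
fi
```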

Thank you guys again for letting me bounce things off you!

----------

