# rsync always grinds to a halt ..

## dirtbag

When I try to transfer a lot of pictures from my digital camera from my Gentoo server, which is running

 3.2.12-gentoo #1 SMP Sat May 12 17:06:27 EDT 2012 i686 AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ AuthenticAMD GNU/Linux

up to another Linux box via rsync, the transfer always grinds to a halt after about a minute. I've upgraded rsync on both sides to the latest version, but the same thing happens.

The only thing I can find is

https://bugzilla.samba.org/show_bug.cgi?id=5478 which seems to be the same problem I'm having. Do we know of any funky TCP bugs in this kernel?

I never had this problem before. Previously, I was running kernel 3.0.6.

-db

----------

## krinn

Try adding -e ssh

so your rsync will be done through ssh. That could bypass your transport trouble, unless your ssh is also affected  :Very Happy:  (but at least it should give you another clue that it's not really rsync, but your connection or kernel).

If it works and you don't consider that a solution, or if you cannot use ssh, bugs.gentoo.org is where to report such a bug in the software.

----------

## dirtbag

Yeah, it's hanging with -e ssh as well.

I guess I'll start lookin fer kernel bugs..  :Sad: 

-db

----------

## Hu

 *dirtbag wrote:*   

> Yeah, it's hanging with -e ssh as well.
> 
> I guess I'll start lookin fer kernel bugs..

 Have you confirmed that booting the older kernel allows rsync to work correctly?  What is the output of emerge --info net-misc/rsync net-misc/openssh?

----------

## khayyam

 *dirtbag wrote:*   

> I guess I'll start lookin fer kernel bugs.

 

dirtbag ... I don't think it relates to 3.2.12-gentoo, as I've used it and had no problems with rsync. Nonetheless, I would follow Hu's advice, as it may be something other than the TCP stack.

I seem to remember there being some recent issue with ~arch versions of openssh and TcpRcvBufPoll ... if you're running ~arch then you might try adding 'TcpRcvBufPoll no' to /etc/ssh/sshd_config and see if the problem persists (after restarting sshd of course).

best ... khay

----------

## dirtbag

OK, so I tried on my 3.0.6 kernel and got the same issue.

I upgraded to openssh-5.9_p1-r4 and still have the same issue.

I put

TcpRcvBufPoll no

in my /etc/ssh/sshd_config, and that didn't seem to help either.

-db

----------

## khayyam

 *dirtbag wrote:*   

> OK, so I tried on my 3.0.6 kernel and got the same issue.

 

OK, which suggests it's not the kernel (at least not 3.2.12-gentoo specifically).

 *dirtbag wrote:*   

> I upgraded to  openssh-5.9_p1-r4 and still have the same issue.

 

5.9_p1-r4 is arch ... the bug (which you'll find if searching the forums) relates to ~arch (so 6.0_pX).

 *dirtbag wrote:*   

> I put the TcpRcvBufPoll no in my /etc/ssh/sshd_config and that didn't seem to help either.

 

Well, this was what I remember to be the fix for the issue with 6.0. I did say, "if you are running ~arch".

Can this be isolated to rsync? Does pushing data with scp produce a similar result? Do you have any kind of TOS (traffic shaping) in place? Have you run 'mtr', or similar, to see if there is any bottleneck between you and the remote host? Basically, the more you can isolate the problem, the easier it'll be to pinpoint exactly which kernel and/or app is the cause. Right now it could be your DSL modem/gateway, so before we start thinking ssh, rsync, or the kernel is at issue, we should exclude other, perhaps more likely, possibilities.
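One way to run that comparison, as a rough sketch: push the same file once with scp and once with rsync, and see whether both stall. The helper names, the payload path and `user@remote:/tmp/` are placeholders made up for illustration, not anything from this thread.

```shell
# Sketch only: make_payload, push_scp and push_rsync are hypothetical helpers,
# and the destination is a placeholder to replace with your own host.

# Create a file of $2 MiB of random data so both tools move identical bytes.
make_payload() {
    dd if=/dev/urandom of="$1" bs=1024 count=$(( $2 * 1024 )) 2>/dev/null
}

# Push the file with scp; -v prints debug output that may help if it stalls.
push_scp() {
    scp -v "$1" "$2"
}

# Push the same file with rsync over ssh, showing per-file progress.
push_rsync() {
    rsync -av --progress -e ssh "$1" "$2"
}

# Example (not run here):
#   make_payload /tmp/payload.bin 20
#   push_scp   /tmp/payload.bin user@remote:/tmp/
#   push_rsync /tmp/payload.bin user@remote:/tmp/
```

If both tools stall at a similar point, that points away from rsync itself and towards ssh, the kernel, or the path in between.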

best ... khay

----------

## dirtbag

I just tried to scp a whole directory and it was going well, until..

    24.html               100% 1821     1.8KB/s 352.1KB/s   00:00
    IMAG0056.jpg          100%  136KB 135.8KB/s 352.1KB/s   00:00
    IMAG0040.jpg          100%  181KB 180.9KB/s 352.1KB/s   00:00
    igal2.css             100%  741     0.7KB/s 352.1KB/s   00:00
    .indextemplate2.html  100% 1274     1.2KB/s 352.1KB/s   00:00
    .tile.png             100%  237     0.2KB/s 352.1KB/s   00:00
    25.html               100% 1821     1.8KB/s 352.1KB/s   00:00
    .thumb_IMAG0082.jpg   100%   21KB  20.9KB/s 352.1KB/s   00:00
    IMAG0055.jpg          100%  190KB 189.9KB/s 352.1KB/s   00:00
    7.html                100% 1814     1.8KB/s 352.1KB/s   00:00
    .thumb_IMAG0076.jpg   100%   32KB  31.7KB/s 352.1KB/s   00:00
    .slidetemplate2.html  100% 1759     1.7KB/s 352.1KB/s   00:00
    .thumb_IMAG0064.jpg   100%   28KB  28.0KB/s 352.1KB/s   00:00
    IMAG0070.jpg          100%  115KB 115.4KB/s 352.1KB/s   00:00
    .thumb_IMG_0554.jpg   100%   15KB  15.2KB/s 352.1KB/s   00:00
    IMAG0050.jpg          100% 1164KB 581.9KB/s 992.0KB/s   00:02
    IMAG0054.jpg          100% 2060KB 294.2KB/s   1.1MB/s   00:07
    IMAG0067.jpg          100% 1792KB 358.4KB/s   1.1MB/s   00:05
    IMAG0036.jpg          100% 1476KB 492.1KB/s   1.1MB/s   00:03
    IMAG0030.jpg           60% 1472KB   7.3KB/s   0.0KB/s - stalled -

I'm not doing any TOS stuff.

Here's the mtr output:

    jason@beast ~ $ sudo /usr/sbin/mtr --report  xx.xx.xx.xx
    HOST: beast                       Loss%   Snt   Last   Avg  Best  Wrst StDev
      1.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
      2.|-- 99-3-168-2.lightspeed.rlg 60.0%    10   21.6  21.8  21.4  22.8   0.7
      3.|-- 99.134.77.24              40.0%    10   23.7  24.3  23.7  25.9   0.8
      4.|-- 99.134.77.14              90.0%    10   26.5  26.5  26.5  26.5   0.0
      5.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
      6.|-- 72.157.44.114              0.0%    10   23.3  23.2  22.9  23.4   0.1
      7.|-- 12.81.56.26                0.0%    10   28.0  26.0  24.6  28.0   1.2
      8.|-- 12.81.56.13                0.0%    10   61.7  28.6  24.7  61.7  11.6
      9.|-- 74.175.192.58              0.0%    10   23.4  26.1  23.2  37.3   5.1
     10.|-- cr2.rlgnc.ip.att.net       0.0%    10   34.8  39.5  34.7  44.7   3.9
     11.|-- cr1.wswdc.ip.att.net       0.0%    10   36.5  35.6  34.2  37.4   1.3
     12.|-- wswdc03jt.ip.att.net       0.0%    10   32.6  41.2  32.3 118.5  27.1
     13.|-- 192.205.34.246             0.0%    10   35.3  35.3  34.9  35.7   0.2
     14.|-- if-2-2.tcore1.AEQ-Ashburn  0.0%    10   35.0  35.2  34.8  35.6   0.2
     15.|-- 66.198.154.2               0.0%    10   34.2  34.4  34.2  34.8   0.2
     16.|-- 107.14.19.132              0.0%    10   38.1  40.6  38.1  43.8   1.7
     17.|-- 107.14.19.21               0.0%    10   43.8  43.6  43.1  44.3   0.3
     18.|-- ae19.rlghnca-rtr1.nc.rr.c 20.0%    10   43.8  43.6  43.3  44.0   0.2
     19.|-- gig14-1.rlghncj-ar42.nc.r 10.0%    10  405.8 128.1  55.8 405.8 124.3
     20.|-- thehost.somewhere.com  10.0%    10  106.4  86.1  55.1 110.3  23.5
    jason@beast ~ $ 

regards,

db

----------

## khayyam

 *dirtbag wrote:*   

>   2.|-- 99-3-168-2.lightspeed.rlg 60.0%    10   21.6  21.8  21.4  22.8   0.7
> 
>   3.|-- 99.134.77.24              40.0%    10   23.7  24.3  23.7  25.9   0.8
> 
>  [...]
> ...

 

Well, that's 60% packet loss on the second hop, a further 40% on the third hop, and more by the time you reach your destination. Basically, the issue is with your network.
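To pull the lossy hops out of a report like that, here is a small sketch. `mtr_loss` is a made-up helper name, and it assumes the column layout of the `mtr --report` output quoted in this thread (hop marker, host, then Loss%). Loss shown only at an intermediate hop can also come from a router deprioritizing replies to probes, so the figure at the final hop is the one most worth comparing between runs.

```shell
# Sketch only: mtr_loss is a hypothetical helper. It reads `mtr --report`
# output on stdin and assumes fields: "N.|--", host, Loss%, ...
mtr_loss() {
    awk '$1 ~ /^[0-9]+[.][|]--$/ {
        hop = $1
        sub(/[.].*/, "", hop)      # "20.|--" -> "20"
        loss = $3 + 0              # "60.0%" or "100.0" -> number
        if (loss > 0)
            printf "hop %s (%s): %.1f%% loss\n", hop, $2, loss
        last = sprintf("final hop %s (%s): %.1f%% loss", hop, $2, loss)
    }
    END { if (last != "") print last }'
}

# Example (not run here; xx.xx.xx.xx is the redacted host from the thread):
#   sudo mtr --report xx.xx.xx.xx | mtr_loss
```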

best ... khay

----------

## khayyam

 *dirtbag wrote:*   

> 1.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
> 
>   2.|-- 99-3-168-2.lightspeed.rlg 60.0%    10   21.6  21.8  21.4  22.8   0.7
> 
>   3.|-- 99.134.77.24              40.0%    10   23.7  24.3  23.7  25.9   0.8
> ...

 

Actually, this doesn't look right at all ... your 5th hop seems to land you right back at the first (and notice the packet loss) 

best ... khay

----------

## Hu

 *khayyam wrote:*   

>  *dirtbag wrote:*   1.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
> 
>   2.|-- 99-3-168-2.lightspeed.rlg 60.0%    10   21.6  21.8  21.4  22.8   0.7
> 
>   3.|-- 99.134.77.24              40.0%    10   23.7  24.3  23.7  25.9   0.8
> ...

 Not all unresponsive hosts are the same host.  This output states that both the first and fifth hops failed to return enough information to describe them.  It does not state that they are both the same unresponsive host.

----------

## khayyam

 *Hu wrote:*   

> Not all unresponsive hosts are the same host. This output states that both the first and fifth hops failed to return enough information to describe them. It does not state that they are both the same unresponsive host.

 

Hu, you're absolutely correct. Nonetheless, the issue is with the network ...

best .. khay

----------

## dirtbag

So I did some other tests and it seems fine to other sites, so I still think there's some funky TCP windowing problem somewhere.

If I rate limit the rsync with something like

 rsync -av --bwlimit 170  

it seems to work fine. If I watch a transfer (I'm sorry, I don't know the technical vernacular for what's really going on), something like the TCP window gets bigger as the transfer goes on, to better utilize the available bandwidth. But it seems to grind to a halt at some point for some reason. I assume it should automagically scale back, but that doesn't seem to be happening. So, I do have a workaround for the moment. I have tried the same transfer of the same directory to another host on the internet and there was no stoppage of the transfer.
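For anyone wanting to reuse the workaround, here is a rough sketch of it as a wrapper. `sync_limited` and the `DRY_RUN` switch are made-up names, the paths are placeholders, and `--partial` and `--timeout=60` are additions not in the command above: the first keeps partially transferred files, the second makes rsync give up after 60 seconds without I/O instead of hanging.

```shell
# Sketch only: sync_limited and DRY_RUN are hypothetical names.
# Usage: sync_limited SRC DEST [LIMIT]   (LIMIT defaults to 170, as in the post)
sync_limited() {
    src=$1
    dest=$2
    limit=${3:-170}
    set -- rsync -av --partial --timeout=60 --bwlimit="$limit" "$src" "$dest"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # show the command instead of running it
    else
        "$@"
    fi
}

# Example (not run here):
#   sync_limited pics/ user@remote:pics/ 170
```

Running it with `DRY_RUN=1` first shows the exact rsync command line before anything is transferred.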

 :Confused: 

-db

----------

## khayyam

dirtbag ...

The problem with such tests is that you can't be sure to isolate the problem, but the mtr/traceroute clearly shows that packet loss is at issue. You may be seeing this due to CRC checking at the hardware/MAC level (meaning it's localised) or due to network configuration and/or bad cabling ... however, each of these would affect the entire TCP/IP layer, though not necessarily consistently.

As for the effect of "stepping", this is how TCP works: it steps down on errors. However, "scaling back" can only go so far. If you limit bandwidth, this will have the effect of producing fewer packets, and so fewer to potentially lose, and it may also not create the conditions for the problem to arise.
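To picture that stepping, here is a toy sketch of additive increase / multiplicative decrease: the window grows by one segment per round and is halved whenever a loss occurs. `aimd_trace` is a made-up name, and the fixed "loss every N rounds" pattern is an assumption for illustration. It is not how any particular kernel implements congestion control.

```shell
# Toy model only: aimd_trace is a hypothetical helper.
# Usage: aimd_trace ROUNDS LOSS_EVERY
aimd_trace() {
    awk -v rounds="$1" -v loss_every="$2" 'BEGIN {
        cwnd = 1                            # window, in segments
        for (r = 1; r <= rounds; r++) {
            if (r % loss_every == 0) {
                cwnd = int(cwnd / 2)        # loss: halve the window
                if (cwnd < 1) cwnd = 1      # never drop below one segment
            } else {
                cwnd = cwnd + 1             # no loss: grow by one segment
            }
            printf "round %d cwnd %d\n", r, cwnd
        }
    }'
}

# Example:
#   aimd_trace 8 4
```

With a loss every 4th round the window climbs to 4 or 5 segments and is cut back to 2 each time, so throughput never gets a chance to build up. The more frequent the loss, the lower the ceiling.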

So, unless you can reproduce the same with a separate NIC, cable, and router, I don't think you can isolate it to rsync. In fact, the mtr seems to rule that out; rsync may trigger the issue, but I think it's not the cause.

best ... khay

----------

