Wireshark-users: Re: [Wireshark-users] TCP Previous segment lost > connection lost (bank transact
From: Vikki Taxdal <vtaxdal@xxxxxxxxx>
Date: Sun, 12 Apr 2009 13:21:02 -0400
On Sun, Apr 12, 2009 at 3:41 AM, Sake Blok <sake@xxxxxxxxxx> wrote:

>>    It would be helpful to have this data as a binary capture file. Could
>>    you post a binary capture file with packets 8705-8726 of the original
>>    file?

[snip]

>  Having a file that can be loaded in our favorite tool is much easier than having to
> analyse it in a text editor ;-)

yes yes yes!

[snip]

> 8720    S->C    data (response, seq 2666, next 2492)
> 8721    C->S    ACK (2492)
> ~17 sec delay
> 8722    S->C    FIN (seq 2515, previous segment lost)

So, does this part mean maybe not one, but _some_ packets were lost?
One with the segment transporting 23 bytes, and one or more
retransmissions after that, depending on the Server TCP's timeout
value for waiting for ACK?
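
As a rough sketch of where the 23 bytes come from, here is the gap arithmetic using the numbers in the trace. The function name and the simplified logic are my own illustration (not Wireshark's actual code), and sequence-number wraparound is ignored:

```python
def missing_bytes(expected_seq: int, arrived_seq: int) -> int:
    """Return how many bytes are missing before a segment that just arrived.

    expected_seq: next sequence number the receiver is waiting for
                  (the value it last ACKed)
    arrived_seq:  sequence number of the segment that just arrived
    Hypothetical helper for illustration; wraparound is not handled.
    """
    return max(0, arrived_seq - expected_seq)


# Values from the trace: the client ACKed 2492 in frame 8721,
# then the FIN in frame 8722 arrived carrying seq 2515.
gap = missing_bytes(expected_seq=2492, arrived_seq=2515)
print(gap)  # 23
```

The gap only says which bytes the client has not seen yet.  A trace taken at the client cannot show how many copies of those bytes the server actually sent.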

> 8723    C->S    Dup.ACK (2492)
> 8724    S->C    Encrypted Alert (seq 2492, next 2514)
> 8725    C->S    ACK (2516)
> ~16 sec delay
> 8726    S->C    RST
>
> Frame 8717-8721 look like a normal request/response. The 17 sec delay is
> usually caused by either the browser or the server in an HTTP/1.1
> conversation when the preconfigured time-out expires while waiting for
> another request. Then the server wants to close the connection, usually
> this is done by sending a SSL alert, which in this phase of the
> communication would of course be encrypted and then a TCP FIN.

But why would an SSL alert be what was in the missing 23 bytes, if the
server really had sent those bytes right away?  (I don't know what
those SSL alerts mean, anyway - they confound me!)

> It looks like the SSL Alert somehow did not make it to the client
> (assuming the trace is made at the client side).

Yes.  Looks like packet loss on the server's side of the firewall.
But I wonder what the firewall is...  If it is a Cisco FWSM or ASA,
there are TCP bugs, both having to do with SACK but in a different way
for each device: FWSM advertises SACK but doesn't do it, whereas ASA
just turns it off.  FWSM's bug adds to perceived congestion on the
outside, whereas ASA's bug just makes for lowered performance.  The
effect of FWSM's bug shows up when FWSM perceives congestion on the
outside, and ASA's shows up only in the latest couple of versions.

> When the TCP FIN arrives, the client knows it missed some data, so it
> asks for the data with the ACK in frame 8723. The server resends the
> missing data (the Alert) in frame 8724. The client now ACKs the data,
> but since it has already seen the TCP FIN, it adds 1, so it ACKs 2516
> instead of 2515.

I don't understand... why does it ACK 2516 (add the 1)?  Doesn't that
mean it's expecting to get more?  Why would it think that if the
server has said "I'm done, let's close the connection" (set the FIN
flag)?
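
A minimal sketch of the bookkeeping Sake describes, assuming (per RFC 793) that the FIN flag occupies one sequence number just as SYN does.  On that reading, ACK 2516 means "I have everything up to and including your FIN" rather than "I expect more data".  The function and its parameters are hypothetical names for illustration, and the 23-byte payload is the size mentioned earlier in this message:

```python
def next_ack(rcv_nxt, seq, payload_len, fin_seq=None):
    """Return the ACK number a receiver sends after a segment arrives.

    rcv_nxt:     next sequence number the receiver expects
    seq:         sequence number of the arriving segment
    payload_len: number of data bytes in that segment
    fin_seq:     sequence number of a FIN already seen out of order, if any
    Simplified illustration only: no wraparound, no partial overlap.
    """
    if seq != rcv_nxt:
        return rcv_nxt        # out of order: repeat the old ACK (a Dup ACK)
    rcv_nxt += payload_len    # in-order data advances the ACK
    if fin_seq == rcv_nxt:
        rcv_nxt += 1          # the FIN itself consumes one sequence number
    return rcv_nxt


# Frame 8722 (FIN at seq 2515) arrives while the client still expects 2492.
print(next_ack(2492, seq=2515, payload_len=0))                 # 2492
# Frame 8724 fills the gap; the FIN seen earlier is now in order too.
print(next_ack(2492, seq=2492, payload_len=23, fin_seq=2515))  # 2516
```

The first result matches the Dup ACK in frame 8723 and the second matches the ACK in frame 8725.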

> After a 16 second delay, the server sends a TCP RST.
>
> To me there are two issues here, first of all, why does the SSL Alert
> get lost from the server to the client. This could just be random
> packets being dropped (do you see other packets being retransmitted in
> other sessions?). Another possibility is that they are dropped on
> purpose by an intermediate device. But as you say about 10% of the
> transactions fail, I assume it's just random packet loss.

But it happens 10% of the time..  that's way more than random, isn't
it?  Does it happen only from certain clients or some OS's?  Maybe
some clients need to update their SSL?

> Second issue is why the client and server are not capable of restoring
> the communication properly (which is the responsibility of the TCP
> protocol). I would suggest that there is a device in between the client
> and the server (a firewall, IDP, Loadbalancer, etc) which was not
> keeping track of sequence numbers properly and dropped the ACK in frame
> 8725 on its way from the client to the server. The device would have
> expected an ACK of 2515 instead of 2516 if it was not for the already
> transmitted FIN. This would also account for the RST from the server
> after 15 seconds. If the server never saw the ACK, it would start a
> timer for the connection closure and since it never saw anything anymore
> from the client, it will close the connection, sending the client a RST
> to inform it of the (unclean) closure.

Maybe the firewall dropped the client's ACK because it had already
cleared the connection from its table.  (Maybe also, in that case, the
firewall is the one that sent the RST on behalf of the server.)

> So, assuming the packet loss is random, there might be a bug in an
> intermediate device. I would make traces at the client and the server to
> verify these findings and if they are correct, work your way inwards to
> find the device that is causing this behavior. Then you can open up a
> bug-report with the vendor of that device.

Couldn't there also be something amiss at the application layer of the
end hosts in these 10% of cases?  I really really hope the answer is
not going to be, "it's the firewall"... do the 10% of failures
routinely happen during the busiest traffic periods, or do they occur
at random times, day/night/weekend/holiday?

This is a very good discussion for me.  I like that Bart isolated just
the right section of a sample of the problem (except I would like also
to have something I could load into Wireshark, or at least more of the
protocol tree so we could see for example what IP is doing vis a vis
fragmentation and what the TCP options are).  I also like Sake's very
clear (uncluttered) analysis and I appreciate the opportunity to
participate.

Vikki