Wireshark-users: Re: [Wireshark-users] TCP Previous segment lost > connection lost (bank transact
From: Sake Blok <sake@xxxxxxxxxx>
Date: Mon, 13 Apr 2009 19:37:18 +0200
On Sun, Apr 12, 2009 at 01:21:02PM -0400, Vikki Taxdal wrote:
> On Sun, Apr 12, 2009 at 3:41 AM, Sake Blok <sake@xxxxxxxxxx> wrote:
> 
> > 8720    S->C    data (response, seq 2666, next 2492)
> > 8721    C->S    ACK (2492)
> > ~17 sec delay
> > 8722    S->C    FIN (seq 2515, previous segment lost)
> 
> So, does this part mean maybe not one, but _some_ packets were lost?
> One with the segment transporting 23 bytes, and one or more
> retransmissions after that, depending on the Server TCP's timeout
> value for waiting for ACK?

No, only the 23 bytes between seq 2492 and 2515 were lost. They could
have been transmitted in several frames, but that is not likely
considering the content of the missing bytes.
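
To make the arithmetic explicit, here is a minimal Python sketch. The helper name `missing_bytes` is made up for illustration (it is not a Wireshark function), and the numbers are copied by hand from frames 8721 and 8722 above:

```python
def missing_bytes(expected_seq, arrived_seq):
    """Size of the hole between the next sequence number the receiver
    expects and the sequence number of the segment that actually arrived.
    The modulo handles 32-bit sequence number wrap-around."""
    return (arrived_seq - expected_seq) % 2**32

# Frame 8721: the client ACKs 2492, so 2492 is the next byte it expects.
# Frame 8722: the FIN arrives carrying seq 2515.
print(missing_bytes(expected_seq=2492, arrived_seq=2515))  # 23
```

This is the same comparison that leads Wireshark to flag frame 8722 as "previous segment lost".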

> > 8723    C->S    Dup.ACK (2492)
> > 8724    S->C    Encrypted Alert (seq 2492, next 2514)
> > 8725    C->S    ACK (2516)
> > ~16 sec delay
> > 8726    S->C    RST
> >
> > Frames 8717-8721 look like a normal request/response. The 17 sec delay is
> > usually caused by either the browser or the server in an HTTP/1.1
> > conversation when the preconfigured time-out expires while waiting for
> > another request. Then the server wants to close the connection; usually
> > this is done by sending an SSL alert (which in this phase of the
> > communication would of course be encrypted) followed by a TCP FIN.
> 
> But, why would SSL alert be what was in the missing 23 bytes, if the
> server really had sent those bytes right away?  (I don't know what
> those SSL alerts mean, anyway - they confound me!)

You have to consider the layered character of this communication. TCP
does not know about the application stuff going on in the layer above
it. It just knows about the bytes that it needs to send and needs to
receive and pass on. It keeps the bytes it has sent on the wire safe in
its buffer until the receiver has ACKed the reception of the data. Hence,
it cannot have sent other data and then decided to drop that data in
favor of the SSL Alert. Remember, to TCP, the SSL protocol is just
payload to be sent...
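
As a rough illustration of that buffering, here is a toy Python model. It is only a sketch under my own assumptions: the class `SendBuffer` and its methods are invented for this example and do not correspond to any real TCP stack, and the 23 zero bytes merely stand in for the opaque payload.

```python
class SendBuffer:
    """Toy model of a TCP sender's retransmission buffer (not a real stack)."""

    def __init__(self, first_seq):
        self.next_seq = first_seq
        self.unacked = {}  # seq -> payload bytes still awaiting an ACK

    def send(self, payload):
        """Buffer the payload under its sequence number and 'transmit' it."""
        seq = self.next_seq
        self.unacked[seq] = payload
        self.next_seq += len(payload)
        return seq

    def ack(self, ack_no):
        """Free every buffered segment that lies entirely below ack_no."""
        for seq in list(self.unacked):
            if seq + len(self.unacked[seq]) <= ack_no:
                del self.unacked[seq]

    def retransmit(self, seq):
        """TCP can only resend the bytes it buffered for this sequence range."""
        return self.unacked[seq]


buf = SendBuffer(first_seq=2492)
alert = b"\x00" * 23      # placeholder for the 23 opaque bytes
buf.send(alert)           # original transmission, lost on the wire
buf.ack(2492)             # Dup ACK 2492: nothing new is acknowledged
print(buf.retransmit(2492) == alert)  # True
```

Whatever is retransmitted for seq 2492 must be the very same bytes that were queued the first time, which is why the lost segment must already have been the alert.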

For the significance of the SSL alert, please read RFC 2246, section 7.2
(see: http://www.ietf.org/rfc/rfc2246.txt). In this case, I suspect the
alert to have been the "close_notify" alert.
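
As a plausibility check on the size, here is a small Python sketch of how large an encrypted alert record is in TLS 1.0. Note the assumptions: the trace summary does not show which cipher suite was negotiated, so the MAC lengths and block size below are only candidate values, and `encrypted_alert_size` is a helper I made up for this example. Block ciphers are modelled with the minimum padding only.

```python
RECORD_HEADER = 5   # content type (1) + protocol version (2) + length (2)
ALERT_BODY = 2      # alert level (1) + alert description (1)

def encrypted_alert_size(mac_len, block_size=None):
    """On-the-wire size of a TLS 1.0 encrypted alert record.

    block_size=None models a stream cipher (no padding). Otherwise the
    fragment gets the mandatory padding_length byte and is padded up to
    the next multiple of block_size (minimum padding assumed).
    """
    fragment = ALERT_BODY + mac_len
    if block_size is not None:
        fragment += 1                       # padding_length byte
        fragment += -fragment % block_size  # pad up to a full block
    return RECORD_HEADER + fragment

print(encrypted_alert_size(mac_len=16))                 # stream cipher, MD5 MAC: 23
print(encrypted_alert_size(mac_len=20))                 # stream cipher, SHA-1 MAC: 27
print(encrypted_alert_size(mac_len=20, block_size=16))  # 16-byte blocks, SHA-1 MAC: 37
```

So a stream cipher with an MD5-based MAC would give exactly 23 bytes, which fits a single alert record, but this does not prove which suite was actually in use.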

> It looks
> > like the SSL Alert somehow did not make it to the client (assuming the
> > trace is made at the client side).
> 
> Yes.  Looks like packet loss on the server's side of the  firewall.
> But I wonder what the firewall is... if Cisco FWSM or ASA, there are
> TCP bugs (both having to do with SACK but in a different way for each
> device - FWSM advertises SACK but doesn't do it, whereas ASA just
> turns it off...  FWSM's bug adds to perceived congestion on the
> outside, whereas ASA's bug just makes for lowered performance.  Effect
> of FWSM's bug shows up when FWSM perceives congestion on the outside,
> and ASA's shows up only in the latest couple of versions.).

I don't think SACK was the problem here, as it was not used AFAICT.

> When the TCP FIN arrives, the client
> > knows it missed some data, so it asks for the data with the ACK in frame
> > 8723. The server resends the missing data (the Alert) in frame 8724.
> > The client now ACKs the data, but since it has already seen the TCP FIN,
> > it adds 1, so it ACKs 2516 instead of 2515.
> 
> I don't understand... why does it ACK 2516 (add the 1)?  Doesn't that
> mean it's expecting to get more?  Why would it think that if the
> server has said I'm done, let's close the connection (set the FIN
> flag)?

In RFC 793, paragraph 3.3 it says:

"For sequence number purposes, the SYN is considered to occur before 
the first actual data octet of the segment in which it occurs, while 
the FIN is considered to occur after the last actual data octet in a 
segment in which it occurs."

That's why the first data byte is sent with a sequence number 1 higher
than the initial sequence number in the SYN packet (even though the
length of the SYN is 0, which would make you expect the same sequence
number). In a similar way, the FIN occupies one sequence number directly
after the last byte of data, so the ACK that acknowledges it is 1 higher
than what you would have expected after receiving the last bytes of data.
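
The rule can be written down as a short Python sketch. The function names `seq_space` and `ack_after` are my own, and the initial sequence number 1000 in the last line is a made-up value purely for illustration; the other numbers come from frames 8722, 8724 and 8725.

```python
def seq_space(payload_len, syn=False, fin=False):
    """Sequence numbers consumed by one segment: one per data byte,
    plus one each for the SYN and FIN flags (RFC 793, section 3.3)."""
    return payload_len + int(syn) + int(fin)

def ack_after(seq, payload_len, syn=False, fin=False):
    """ACK number that acknowledges the whole segment."""
    return (seq + seq_space(payload_len, syn, fin)) % 2**32

# Frame 8724: retransmitted alert, seq 2492, 23 bytes of data.
print(ack_after(2492, 23))            # 2515
# Frame 8722: FIN at seq 2515 without data; the FIN occupies 2515 itself.
print(ack_after(2515, 0, fin=True))   # 2516, the ACK seen in frame 8725
# A SYN with a hypothetical ISN of 1000: the first data byte uses 1001.
print(ack_after(1000, 0, syn=True))   # 1001
```

So ACK 2516 does not mean the client expects more data; it means "I have everything up to and including your FIN".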

> After a 16 second delay, the
> > server sends a TCP RST.
> >
> > To me there are two issues here. First of all, why does the SSL Alert
> > get lost from the server to the client? This could just be random
> > packets being dropped (do you see other packets being retransmitted in
> > other sessions?). Another possibility is that they are dropped on
> > purpose by an intermediate device. But as you say about 10% of the
> > transactions fail, I assume it's just random packet loss.
> 
> But it happens 10% of the time..  that's way more than random, isn't
> it?  Does it happen only from certain clients or some OS's?  Maybe
> some clients need to update their SSL?

10% is quite a lot, but only analysing more sessions can tell whether it
is a random act or if there is some logic in the drops. Also, you would
need traces at both sides to compare...

> > Second issue is why the client and server are not capable of restoring
> > the communication properly (which is the responsibility of the TCP
> > protocol). I would suggest that there is a device in between the client
> > and the server (a firewall, IDP, load balancer, etc.) which was not
> > keeping track of sequence numbers properly and dropped the ACK in frame
> > 8725 on its way from the client to the server. The device would have
> > expected an ACK of 2515 instead of 2516 if it was not for the already
> > transmitted FIN. This would also account for the RST from the server
> > after roughly 16 seconds. If the server never saw the ACK, it would start a
> > timer for the connection closure and since it never saw anything anymore
> > from the client, it will close the connection, sending the client a RST
> > to inform it of the (unclean) closure.
> 
> Maybe the firewall dropped the client's ACK because it had already
> cleared the connection from its table.  (Maybe also, in that case, the
> firewall is the one that sent the RST on behalf of the server.)

That could be another explanation, one more reason to make traces on
both sides and work your way inwards to find the device that is causing
this behavior. Also, I don't think there is "The" firewall as both sides
will probably have one :-)
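
Once you have both traces, the comparison boils down to something like the following Python sketch. Everything here is an assumption for illustration: `dropped_in_transit` is a made-up helper, the segment lists are hand-written (there is no server-side trace in this thread), and it presumes both captures show matching sequence numbers, for example relative ones with the handshake captured on both sides.

```python
from collections import Counter

def dropped_in_transit(sent, received):
    """Return the (seq, length) segments that appear more often in the
    sender-side trace than in the receiver-side trace. With identical
    copies it cannot tell which copy was lost, only how many were."""
    remaining = Counter(received)
    lost = []
    for seg in sent:
        if remaining[seg] > 0:
            remaining[seg] -= 1   # this copy made it across
        else:
            lost.append(seg)      # no matching arrival left
    return lost

# Hypothetical S->C segments: alert, FIN, retransmitted alert.
server_side = [(2492, 23), (2515, 0), (2492, 23)]
# What the client-side trace in this thread shows arriving.
client_side = [(2515, 0), (2492, 23)]
print(dropped_in_transit(server_side, client_side))  # [(2492, 23)]
```

Running the same comparison at each hop between client and server narrows down which device is dropping the packets.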

> > So, assuming the packet loss is random, there might be a bug in an
> > intermediate device. I would make traces at the client and the server to
> > verify these findings and if they are correct, work your way inwards to
> > find the device that is causing this behavior. Then you can open up a
> > bug-report with the vendor of that device.
> 
> Couldn't there also be something amiss at the application layer of the
> end hosts in these 10% of cases?  I really really hope the answer is
> not going to be, "it's the firewall"... do the 10% of failures
> routinely  happen during the busiest traffic periods, or do they occur
> at random times, day/night/weekend/holiday?

Comparing good sessions with bad sessions at both sides of the
connection, combined with a thorough determination of when the problem
does occur and when not will be the way forward indeed :-)

> This is a very good discussion for me.  I like that Bart isolated just
> the right section of a sample of the problem (except I would like also
> to have something I could load into Wireshark, or at least more of the
> protocol tree so we could see for example what IP is doing vis a vis
> fragmentation and what the TCP options are).  I also like Sake's very
> clear (uncluttered) analysis and I appreciate the opportunity to
> participate.

:-)  Thank you! And indeed it is a nice issue to discuss here!

Cheers,
    Sake