Wireshark-users: Re: [Wireshark-users] TCP question: retransmission or prodding the peer?
From: Bill Meier <wmeier@xxxxxxxxxxx>
Date: Fri, 21 Feb 2014 15:23:59 -0500
On 2/21/2014 3:23 AM, netztier@xxxxxxxxxx wrote:


2. I have several observations:

     a. The basic request/response sequence as follows:

     [ SEQ/ACK ] analysis snipped

So: The fact that the seq & ack in 4 and 5 are the same is
      just as expected.
      packet 4 is just an "ack" with no data
      packet 5 is data (with same seq/ack as the previous)

However: for some reason, B took 2.5 secs to send (the start of)
           a response to packet 3 in packet 5.

           We know that B received packet 3 immediately because
           B sent an ack in packet 4 (after the usual 200 ms delay).

           So: The "B" application failed to respond immediately even
               though we know that "B" received the packet at the network
               level.

A (10.33.53.121) is the card reader and TCP initiator, while B
(147.88.243.121) is the card server and TCP responder.

I beg to disagree here: (if we remove the two ARP packets from the
capture and restart numbering from 1):


I guess we disagree a bit about the details (see below).   :)

In any case, I would expect that getting a capture somewhere near the server may help to clear things up.

Packet 3 of the TCP session (packet 5 in the originally attached
capture) was the A's 3rd packet of the 3-way-handshake.

Correct

Packet 4 is such a suspicious "retransmission" after 7.5 ms, with a
1byte payload (Kind of a request?)

    Uh, no: this is not a retransmission. This is "A" sending the
    first byte of data after the connection has been established.
    I don't consider it suspicious.

    If you look at most any TCP connection startup, you'll see that the
    sequence is exactly the same. (That is: the seq num and the ack
    for the first data packet sent from the connection initiator
    after the 3-way handshake will match that of the previous
    ack sent in the 3rd packet of the 3-way handshake).
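
    To make the point about matching seq/ack numbers concrete, here is a
    minimal sketch of the bookkeeping. It is a simplified model only; the
    initial sequence numbers and the helper name next_seq are made up for
    illustration and are not taken from the capture.

```python
# Simplified model of TCP sequence-number bookkeeping.
# The ISNs are arbitrary illustrative values, not from the capture.

def next_seq(seq, payload_len, syn=False, fin=False):
    """Sequence number the sender will use for its NEXT segment."""
    # SYN and FIN each consume one sequence number; a bare ACK consumes none.
    return seq + payload_len + (1 if syn else 0) + (1 if fin else 0)

a_isn, b_isn = 1000, 5000

# 1. A -> B  SYN
a_seq = next_seq(a_isn, 0, syn=True)
# 2. B -> A  SYN,ACK
b_seq = next_seq(b_isn, 0, syn=True)
# 3. A -> B  ACK with no data (third packet of the handshake)
handshake_ack = (a_seq, b_seq)          # (seq, ack) carried by that packet
a_seq = next_seq(a_seq, 0)              # no payload: seq does not advance
# 4. A -> B  first data byte
first_data = (a_seq, b_seq)             # same (seq, ack) as the packet before
a_seq = next_seq(a_seq, 1)              # one payload byte: seq advances by 1

print(handshake_ack, first_data, a_seq)
```

    Because the third handshake packet carries no payload, the first data
    packet reuses its seq and ack values, which is why an identical
    seq/ack pair on its own does not indicate a retransmission.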

Packet 5 is B's response to packet 4, an empty ACK

    acking the first data byte received from A.

Packet 6 is probably A's request to B, the first packet with some discernible payload, and probably
  is a "true" request of something.

   As I noted above: I'm pretty sure that the complete request consists
   of the first byte sent plus these bytes. The split between the two
   packets has to do with the Nagle algorithm.

Packet 7 is B's ACK to packet 6, with a delay of 0.2 seconds

Packet 8 is B's ACK to ... (what?) with the delay of 2.5 seconds and a 1 byte payload.

   B didn't receive any more data from A since B sent the ACK in 7 so
   the ACK number sent by B is unchanged.

Packet 9 is A's ACK to packet 8

etc... ad FIN


           I've no idea as to why. Does "Only the during the first TCP
           connection" suggest some kind of initial setup
           going on in "B" ?

That is what I assume. User swipes the card - card reader contacts server.
This first TCP session is always this short, just 18 or 20 packets.  But a lot
of them have this strange delay between packets 7 and 8.  The next TCP
sessions follow immediately (retrieval of print job list), and while printing,
there is a longish one, probably for the "live billing" to the card of each single
page printed.

That being said: there's another issue having to do with the
     "send 1 byte, wait for ack, send remaining bytes" pattern.

     Rather than me trying to explain: Do a web search on "Nagle
     algorithm" and TCP_NODELAY for an explanation.

     Basically: the software isn't programmed quite right (IMHO).

If I understand the bit with TCP_NODELAY correctly, setting this socket option
on a socket causes TCP to send every data chunk that gets "pushed down the
stack" from the application immediately.

Given the application's nature, and the requirement that every page printed must be
reported back to the card reader (and written to the card), I think that disabling Nagle
is not quite wrong.
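
For reference, a minimal sketch of the two usual ways to avoid the
"1 byte, wait, rest" pattern: disable Nagle with TCP_NODELAY, or hand the
whole request to TCP in a single write. This is not the vendor's code; the
helper name make_request and the payload bytes are hypothetical.

```python
import socket

def make_request(first_byte: bytes, rest: bytes) -> bytes:
    """Assemble the complete request so it can be written in one call."""
    return first_byte + rest

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Option 1: disable Nagle so small writes are sent immediately instead of
# being held back until previously sent data has been acknowledged.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()

# Option 2: build the whole request first and send it with one call
# (on a connected socket: sock.sendall(request)). The bytes are made up.
request = make_request(b"\x01", b"hypothetical-request-body")

print(bool(nodelay), len(request))
```

With Nagle enabled and the request written in two pieces, the second piece
typically waits for the ACK of the first, and the peer's delayed ACK can
add roughly 200 ms to every request.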

Another thing I find a bit interesting:
The window size advertised by B (card server?) just keeps decreasing as
data is received from A. Normally that would mean that the app isn't
taking the data from the network layer. However, that appears not to be
the case since the request/response sequence seems to complete OK.

The server's (B, 147.88.243.205) advertised window size starts with 8k in the SYN ACK,
then jumps to almost 64240 bytes and decreases down to 64102 bytes.
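
As a rough illustration of why a steadily shrinking window normally points
at an application that is not reading, here is a simplified model in which
the advertised window is just the free space in the receive buffer. Real
stacks are more involved; the function name and the buffer size are
assumptions, with 64240 and 64102 borrowed from the figures above only as
example values.

```python
def advertised_window(buffer_size, bytes_received, bytes_read_by_app):
    """Simplified receive window: free space left in the receive buffer."""
    unread = bytes_received - bytes_read_by_app
    return max(buffer_size - unread, 0)

BUF = 64240  # assumed buffer size, for illustration only

# Application never reads: the window shrinks by exactly what arrived.
stalled = [advertised_window(BUF, rx, 0) for rx in (0, 50, 138)]

# Application keeps up: the window returns to full size each time.
draining = [advertised_window(BUF, rx, rx) for rx in (0, 50, 138)]

print(stalled)
print(draining)
```

In this model a drop from 64240 to 64102 corresponds to 138 unread bytes,
which is what makes the observed behaviour look odd when the
request/response exchange otherwise completes.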

What kind of system is the card server? Some kind of minimal system?

Currently, I do not know. I assume it is a Windows server.

Actually: I see that the continually decrementing window size
advertisement applies to both the card reader and the card server.

Agreed, the card reader starts at 32k, increases to 33580 and decreases to 33551.

Given that we're talking embedded devices, have you discussed this issue
with the vendor ?

The issue in its early days had been tracked with the vendor.

Back then however, the network did actually have a problem:
restrictive port security timers aged-out the card reader's MAC
address from the CAM table prematurely

When it was not used for more than 5min, it had cleared its
own ARP table, and had to start with ARPing for its default gateway.

With port security enabled, MAC-learning on the Cisco 2960S is
done in software on the CPU, not the port ASICs. The first ARP
request was lost and the card reader had to retransmit it,
sometimes even twice.

We worked around this by increasing port security timers and by
reducing the ARP timeout on the upstream L3-Switch to 4 minutes.
Cisco L3 devices with CEF enabled perform active ARP cache maintenance.

1min before expiry, they unicast-request an ARP resolution from all
known entries for a given subnet/interface. This request/reply sequence
every 3 minutes keeps the CAM table entries alive.


Interesting ...

No more lost ARPs since.

Thinking about this a bit more:

It's certainly possible that the issue is lost data from the server to
the reader.

IOW: packet 5 above is actually a retransmission which eventually makes
it through. Depending upon the TCP implementation, it could be that the
retransmission timeout is 2.5 secs.

I would agree that frame 8 with its 2.5s delay is a retransmission that
eventually gets through, but then again...

We've seen frame 7 get to the card reader.

If the card reader's (A's) reaction to frame 7 were lost somewhere  upstream
in the network, the capture should've seen it go from card reader to
switchport.  The ethernet hub I used to capture was between the card reader
and its usual switchport - and that switchport hasn't seen a malformed
frame in months.

So for some reason, there was no reaction from the card reader to frame 7.
This might be perfectly ok - probably there is no reaction required.


There's no need for any reaction from A to receiving frame
7 from B. Frame 7 from B is just an ACK with no data and therefore A does not
need to respond. (A is awaiting a reply with data from B).

And: If it were packet loss due to corruption (bad cabling) or congestion,
I believe it would have to be random. Always missing a packet after
"packet 7"  isn't random.

Of all the samples I have (more than a dozen), the 2.5s delay is always
between packet 7 and 8.
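
A check like this can be scripted once per-packet timestamps have been
exported from each capture. The sketch below finds the largest pause in
each session; the timestamps are made up to resemble the pattern described
(a 0.2 s delayed ACK followed by a 2.5 s pause) and are not real capture
data, and the function name largest_gap is hypothetical.

```python
def largest_gap(timestamps):
    """Return (packet_before, packet_after, gap_seconds), 1-based numbering."""
    gaps = [
        (later - earlier, index + 1)
        for index, (earlier, later) in enumerate(zip(timestamps, timestamps[1:]))
    ]
    gap, packet_before = max(gaps)
    return packet_before, packet_before + 1, round(gap, 3)

# Made-up timestamps in seconds, one list per TCP session.
sessions = {
    "sample-1": [0.000, 0.001, 0.002, 0.009, 0.010, 0.017, 0.217, 2.717, 2.718],
    "sample-2": [0.000, 0.001, 0.002, 0.010, 0.011, 0.019, 0.219, 2.719, 2.720],
}

results = {name: largest_gap(ts) for name, ts in sessions.items()}
for name, (before, after, gap) in results.items():
    print(f"{name}: longest pause {gap}s between packets {before} and {after}")
```

If the longest pause lands between the same two packets in every sample,
that argues against random loss and for something systematic at one end.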

So if there actually is a packet missing from card reader (A) to Server (B),
it never left the card reader.  Which brings us back to the vendor.

See above. I don't think there's anything missing from A at this point.

I'm pretty convinced that the issue is with the data from B which A is awaiting at this point.

I would guess that the first step would be to do a capture adjacent to
the server to rule that possibility out.

That's what I'll attempt next. I hope it is a server that runs on a platform
where I can capture directly on the server.

Capture of a non-delay-affected session will follow.

Best regards & thanks a lot

Marc