Wireshark-users: Re: [Wireshark-users] tcp reassembly
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Thu, 17 Dec 2009 02:14:44 -0800
On Dec 16, 2009, at 10:41 PM, Chun Chan wrote:

> OK. I understand, and thanks for the answers.
> But I have only one question.
> I wrote a simple client/server example using sockets.
> The server side waits to read 10000 bytes in a while loop, and prints a message after each recv, like "n bytes received."
> The client side sends 5000 bytes to the server two times.
> 
> I expected the server side to print only one message, "10000 bytes received", but it printed the message "5000 bytes received" two times.

If you do a 10000-byte read/receive from a TCP socket, there is no guarantee that the read will return 10000 bytes.  For example, if 5000 bytes are sent, followed by 5000 more bytes, and the time between the two sends is large enough (it doesn't have to be very large at all), the 10000-byte read/receive might return only the first 5000 bytes.  Because a 5000-byte send might not fit in a single link-layer packet, the read might even return fewer than 5000 bytes.
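
As an illustration (not code from this thread), here is a minimal Python sketch of the loop a receiver needs if it really wants 10000 bytes before continuing.  The function name recv_exact is made up for this example; the only socket call it relies on is the standard recv().

```python
import socket

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining > 0:
        # recv() may return anywhere from 1 to `remaining` bytes.
        chunk = sock.recv(remaining)
        if not chunk:
            # An empty result means the peer closed the connection.
            raise ConnectionError(
                f"connection closed with {remaining} bytes still missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

With a loop like this, the server would print its message once, after both 5000-byte sends have arrived, no matter how TCP happened to split the data on the wire.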

> Then I analyzed the TCP/IP packets; there are many 1400-byte packets.

An Ethernet packet is, at most, 1518 bytes, including the 14-byte Ethernet header and the 4-byte CRC (the 4-byte CRC might, or might not, be seen in a capture; the hardware or driver might remove it before Wireshark sees it).  That leaves 1500 bytes of data; an IPv4 header is at least 20 bytes long, and a TCP header is at least 20 bytes long, which leaves at most 1460 bytes of TCP segment data.

> How does the socket know when a message is finished?

The socket *doesn't* understand when the message is finished.  TCP is a byte-stream protocol; the data is just a sequence of bytes, with *no* message boundaries.

If you are sending messages over a TCP socket, the protocol needs to be designed in a way that puts message lengths, or message boundaries, into the stream of bytes.  For example, a number of protocols that run over TCP put a "message length" value at the beginning of the message, where the "message length" is a count of the number of bytes in the message (either including, or not including, the message length itself).
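
A minimal Python sketch of the "message length at the beginning" approach follows.  The 4-byte big-endian length that does not count itself is an assumption chosen for the example, not something any particular protocol mandates, and send_msg/recv_msg are hypothetical helper names.

```python
import socket
import struct

# Assumed header format: 4-byte unsigned length in network byte order,
# counting only the payload that follows it.
HEADER = struct.Struct("!I")

def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message: length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_msg(rfile) -> bytes:
    """Read one message from a buffered reader made with sock.makefile("rb").

    On a blocking socket, a buffered read(n) keeps reading until it has
    n bytes or the stream ends, so short recv() results are hidden here.
    """
    header = rfile.read(HEADER.size)
    if len(header) < HEADER.size:
        raise EOFError("stream ended before a full length header arrived")
    (length,) = HEADER.unpack(header)
    payload = rfile.read(length)
    if len(payload) < length:
        raise EOFError("stream ended in the middle of a message")
    return payload
```

The receiver gets back exactly the payloads that were passed to send_msg(), one per call, regardless of how the bytes were grouped into TCP segments.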

That's what Sake Blok meant when he said

> TCP is a streaming protocol. This means it will just take the data it has been given from the upper layer and transmit it to the receiving end. The receiving end in its turn just passes the traffic as a stream towards the upper layer. It is the upper layer that is responsible for reassembly of the data into its PDUs.

and what Martin Visser meant when he said

> Your "protocol" needs to convey this information - there is nothing in TCP that knows when the SDU (Service Data Unit) it is carrying is finished. Basically you have two options. Either your protocol (that defines that those 5000 bytes are a Protocol Data Unit) needs to provide a header (indicating at least the length) OR a trailer that has some sort of delimiter (say a NULL character or CRLF) that indicates your PDU is finished. Together this is basically known as framing, by which you indicate the beginning and end of your data units.
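
To illustrate the trailer/delimiter option in that quote, here is a small Python sketch that splits a byte stream at CRLF.  The class name DelimiterFramer is invented for this example, and it assumes the delimiter never appears inside a message.

```python
class DelimiterFramer:
    """Collects arbitrary chunks of a byte stream and splits them at a delimiter."""

    def __init__(self, delimiter: bytes = b"\r\n"):
        self.delimiter = delimiter
        self.buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add newly received bytes; return every message completed so far."""
        self.buffer += chunk
        messages = []
        while True:
            end = self.buffer.find(self.delimiter)
            if end < 0:
                break  # no complete message yet; keep the partial data
            messages.append(bytes(self.buffer[:end]))
            # Drop the message and its delimiter from the buffer.
            del self.buffer[:end + len(self.delimiter)]
        return messages
```

The chunk boundaries passed to feed() don't matter; a delimiter that is split across two chunks is still found once the second chunk arrives.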

> I want to do the same thing with a sniffer that the socket did.

Again, the socket *doesn't* know when the message is finished, and neither does Wireshark's TCP dissector.
What you have to do in your application is to put something such as a message length at the beginning of each message, or use some other mechanism to indicate when a message ends.  If you use a message length, you could, in the dissector for your protocol, use tcp_dissect_pdus() to detect the end of a message and to reassemble messages that require more than one TCP segment.
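
tcp_dissect_pdus() is a C routine inside Wireshark, so the following is not its API; it is only a Python sketch of the same idea, to show what reassembly driven by a length field amounts to.  It assumes a 4-byte big-endian length that does not count itself, and it assumes segment payloads are handed over in order with nothing lost or retransmitted.  The class name PduReassembler is hypothetical.

```python
import struct

HEADER_LEN = 4  # assumed: 4-byte big-endian length, not counting itself

class PduReassembler:
    """Rebuilds length-prefixed messages from a sequence of TCP segment payloads."""

    def __init__(self):
        self.pending = bytearray()

    def add_segment(self, payload: bytes) -> list:
        """Add one segment's payload; return the messages it completed."""
        self.pending += payload
        pdus = []
        while len(self.pending) >= HEADER_LEN:
            (body_len,) = struct.unpack_from("!I", self.pending, 0)
            total = HEADER_LEN + body_len
            if len(self.pending) < total:
                break  # message continues in a later segment
            pdus.append(bytes(self.pending[HEADER_LEN:total]))
            del self.pending[:total]
        return pdus
```

A 5000-byte message carried in segments of at most 1460 bytes would produce an empty list for the first segments and the complete message when the final segment is added.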

(I.e., not only do you have to write some code to reassemble the messages in Wireshark, you have to write some code *in the application that receives the messages* to reassemble them from the byte stream.)