Ethereal-dev: Re: [Ethereal-dev] Syntax for frame contains

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Wed, 27 Aug 2003 13:48:46 -0700

On Wednesday, August 27, 2003, at 11:55 AM, Gilbert Ramirez wrote:

I don't know Unicode very well, so I don't know all the different types
of Unicode encodings, so I won't even guess as to what the names for
those "functions" would be, but they would follow the above example.

(We don't support non-ASCII characters very well in Ethereal at the moment, so I'll assume only ASCII in search strings for now.)

The encodings we'll probably have to deal with are:

1) little-endian UCS-2 - 2-byte characters, with the lower 8 bits first and the upper 8 bits after that (used in SMB and various DCE RPC protocols from Microsoft);

2) big-endian UCS-2 - 2-byte characters, with the upper 8 bits first and the lower 8 bits after that (I don't know whether there are any protocols that use it - perhaps some DCE RPC-based protocols if the sender is big-endian);

3) UTF-8 - ASCII characters map to 1 byte containing the character, other characters map to multiple bytes (note that UTF-8 can encode 4-byte characters, so it covers ISO 10646 in its entirety, not just the Basic Multilingual Plane subset that's handled by UCS-2).
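
To make the three encodings concrete, here is a rough Python sketch of what an ASCII search string turns into in each of them, and how a "contains" test could try all three. This is only an illustration, not Ethereal code: the names encode_search_string and frame_contains are made up for the example, and it relies on the fact that Python's UTF-16 codecs produce the same bytes as UCS-2 for ASCII text.

```python
def encode_search_string(text):
    """Return the byte pattern for an ASCII search string in each encoding.

    Hypothetical helper for illustration; not part of Ethereal.
    """
    # Only ASCII is assumed for now; this raises UnicodeEncodeError otherwise.
    text.encode("ascii")
    return {
        # Lower 8 bits first, then the upper 8 bits (zero for ASCII).
        "ucs2-le": text.encode("utf-16-le"),
        # Upper 8 bits (zero for ASCII) first, then the lower 8 bits.
        "ucs2-be": text.encode("utf-16-be"),
        # ASCII characters map to a single byte each.
        "utf-8": text.encode("utf-8"),
    }


def frame_contains(frame, text):
    """Return True if the frame bytes contain the text in any of the encodings."""
    patterns = encode_search_string(text)
    return any(pattern in frame for pattern in patterns.values())
```

For example, encode_search_string("SMB") gives b"S\x00M\x00B\x00" for little-endian UCS-2, b"\x00S\x00M\x00B" for big-endian UCS-2, and b"SMB" for UTF-8, so a search for an ASCII string has to look for a different byte pattern depending on the encoding the protocol uses.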

Unicode has a "byte order mark", which is a character that's a "zero width no-break space" (i.e., a space character that takes no space :-)) - the byte-swapped version of it is not a legal Unicode character (and never will be, as far as I know), so a Unicode string can start with a byte order mark, and something scanning it can infer the byte order from that byte order mark. Not all Unicode strings necessarily begin with a byte order mark, however; Microsoft don't use it in SMB or their RPCs, for example. (The byte order is implicitly little-endian for SMB; it's presumably the byte order from the DCE RPC header in the RPCs, although, in practice, little-endian might even be used on big-endian machines, at least for the Microsoft RPCs.)
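
The byte order mark check can be sketched roughly as follows in Python. The byte order mark is U+FEFF, so it appears as the bytes FE FF in big-endian UCS-2 and FF FE in little-endian UCS-2. The function name detect_ucs2_byte_order and its default parameter are made up for this example; the default stands in for whatever byte order the protocol implies when there is no mark (little-endian for SMB, say).

```python
def detect_ucs2_byte_order(data, default="little"):
    """Infer the byte order of a UCS-2 string from a leading byte order mark.

    Returns (byte_order, remaining_bytes). Hypothetical helper for
    illustration; 'default' is used when no byte order mark is present.
    """
    if data[:2] == b"\xff\xfe":
        # U+FEFF with the lower 8 bits first: little-endian.
        return "little", data[2:]
    if data[:2] == b"\xfe\xff":
        # U+FEFF with the upper 8 bits first: big-endian.
        return "big", data[2:]
    # No byte order mark; fall back to what the protocol implies.
    return default, data
```

A string with no mark, such as one taken from an SMB packet, comes back unchanged with the default byte order, which is why the caller still has to know what the protocol uses.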