On Wednesday, August 27, 2003, at 11:55 AM, Gilbert Ramirez wrote:
I don't know Unicode very well, so I don't know all the different types
of Unicode encodings, so I won't even guess as to what the names for
those "functions" would be, but they would follow the above example.
(For now, we don't support non-ASCII characters very well in Ethereal,
so I'll assume only ASCII in search strings for now.)
The encodings we'll probably have to deal with are:
1) little-endian UCS-2 - 2-byte characters, with the lower 8 bits
first and the upper 8 bits after that (used in SMB and various DCE RPC
protocols from Microsoft);
2) big-endian UCS-2 - 2-byte characters, with the upper 8 bits first
and the lower 8 bits after that (I don't know whether there are any
protocols that do that - perhaps some DCE RPC-based protocols if the
sender is big-endian);
3) UTF-8 - ASCII characters map to 1 byte containing the character,
other characters map to multiple bytes (note that UTF-8 can also encode
characters whose code values don't fit in 16 bits, so it gets ISO 10646
in its entirety, not just the Basic Multilingual Plane subset that's
handled by UCS-2).
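
As a rough sketch of what searching for an ASCII string in those three
encodings amounts to (in Python rather than Ethereal's C; the names
search_patterns and find_string are made up for illustration and are not
Ethereal functions, and Python's UTF-16 codecs stand in for UCS-2, which
gives the same bytes for Basic Multilingual Plane characters):

```python
def search_patterns(text):
    """Return the byte pattern the search string takes in each encoding."""
    return {
        # Assumption: utf-16-le/-be are used as stand-ins for UCS-2;
        # the bytes are identical for ASCII and other BMP characters.
        "ucs2-le": text.encode("utf-16-le"),  # low byte first
        "ucs2-be": text.encode("utf-16-be"),  # high byte first
        "utf-8": text.encode("utf-8"),        # ASCII stays 1 byte each
    }


def find_string(payload, text):
    """Return (encoding name, offset) for each encoding that matches."""
    hits = []
    for name, pattern in search_patterns(text).items():
        offset = payload.find(pattern)
        if offset != -1:
            hits.append((name, offset))
    return hits


if __name__ == "__main__":
    # A made-up payload: two arbitrary bytes, then "IPC$" as
    # little-endian UCS-2, the way SMB would carry it.
    payload = b"\x00\x01" + "IPC$".encode("utf-16-le")
    print(find_string(payload, "IPC$"))
```

The search string is converted once per encoding and the packet data is
left alone, so the same byte-level search can be reused for all three.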
Unicode has a "byte order mark", U+FEFF, which is a character that's a
"zero width no-break space" (i.e., a space character that takes no
space :-)) - the byte-swapped version of it, U+FFFE, is not a legal
Unicode character (and never will be, as far as I know), so a Unicode
string can start with a byte order mark, and something scanning it can
infer the byte order from that byte order mark. Not all Unicode strings necessarily begin
with a byte order mark, however; Microsoft don't use it in SMB or their
RPCs, for example. (The byte order is implicitly little-endian for
SMB; it's presumably the byte order from the DCE RPC header in the
RPCs, although, in practice, little-endian might even be used on
big-endian machines, at least for the Microsoft RPCs.)
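
A minimal sketch of that inference, again in Python and again with a
made-up name (decode_ucs2 is hypothetical, not an Ethereal routine): look
for a byte order mark, and if there isn't one, fall back to whatever byte
order the protocol implies, e.g. little-endian for SMB. As above,
Python's UTF-16 codecs are assumed as stand-ins for UCS-2.

```python
BOM_BE = b"\xfe\xff"  # U+FEFF with the upper 8 bits first
BOM_LE = b"\xff\xfe"  # U+FEFF with the lower 8 bits first


def decode_ucs2(data, default_order="little"):
    """Decode 2-byte-per-character data, returning (byte order, text).

    If the data starts with a byte order mark, the mark decides the
    byte order and is stripped.  Otherwise default_order is used; the
    caller is assumed to know it from the protocol (an assumption of
    this sketch, e.g. "little" for SMB).
    """
    if data[:2] == BOM_BE:
        order, body = "big", data[2:]
    elif data[:2] == BOM_LE:
        order, body = "little", data[2:]
    else:
        order, body = default_order, data
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return order, body.decode(codec)


if __name__ == "__main__":
    print(decode_ucs2(b"\xfe\xff\x00A"))  # mark present, big-endian
    print(decode_ucs2(b"A\x00"))          # no mark, protocol default
```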