Wireshark-users: [Wireshark-users] Strings containing characters that don't map to printable ASCII
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Sun, 20 May 2012 14:03:55 -0700
On May 20, 2012, at 11:32 AM, darkjames@xxxxxxxxxxxxx wrote:

> http://anonsvn.wireshark.org/viewvc/viewvc.cgi?view=rev&revision=42727
> 
> User: darkjames
> Date: 2012/05/20 11:32 AM
> 
> Log:
> Revert r35131 fix bug #5738
> 
> g_unichar_isprint() is for *wide characters*.
> For UTF-8 multibyte characters we could 
> use g_utf8_validate() and g_utf8_next_char(),
> but IMHO format_text_* should be ASCII-only.

I'm not sure it should always be ASCII-only.  Somebody might want, for example, to see file names as they would appear in the UI.

However, in other circumstances, somebody might want to see the raw octets of non-ASCII characters if, for example, they're dealing with encoding issues (e.g., SMB servers sending Normalization Form D Unicode strings over the wire to Windows clients that expect Normalization Form C strings - this is not, BTW, a hypothetical case... - or strings sent over the wire that aren't valid {UTF-8,UTF-16,UCS-2,...}).

So at least two ways of displaying strings are probably needed, perhaps settable via a preference.  The first might, for example, display invalid sequences and characters that don't exist in Unicode as the Unicode REPLACEMENT CHARACTER:

	http://unicode.org/charts/nameslist/n_FFF0.html

and display non-printable characters as either REPLACEMENT CHARACTER or, for C0 control characters, the corresponding Unicode SYMBOL FOR XXX character:

	http://unicode.org/charts/nameslist/n_2400.html
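
As a rough sketch (using GLib, with a made-up function name, and assuming the string has already been converted to UTF-8), that first mode might look something like:

	#include <glib.h>

	/*
	 * Hypothetical sketch (the name is made up): render a string that has
	 * already been converted to UTF-8, mapping invalid sequences and
	 * unprintable characters to U+FFFD REPLACEMENT CHARACTER and C0
	 * controls to the corresponding U+24xx SYMBOL FOR XXX character.
	 */
	static gchar *
	format_text_for_display(const gchar *str, gsize len)
	{
	    GString     *out = g_string_sized_new(len);
	    const gchar *p   = str;
	    const gchar *end = str + len;

	    while (p < end) {
	        gunichar c = g_utf8_get_char_validated(p, end - p);

	        if (c == (gunichar)-1 || c == (gunichar)-2) {
	            /* Invalid or truncated sequence: emit U+FFFD, skip one octet. */
	            g_string_append_unichar(out, 0xFFFD);
	            p++;
	        } else {
	            if (c < 0x20)
	                g_string_append_unichar(out, 0x2400 + c); /* SYMBOL FOR ... */
	            else if (!g_unichar_isprint(c))
	                g_string_append_unichar(out, 0xFFFD);
	            else
	                g_string_append_unichar(out, c);
	            p = g_utf8_next_char(p);
	        }
	    }
	    return g_string_free(out, FALSE);
	}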

The second might, for example, display octets that don't correspond to printable ASCII characters as C-style backslash escapes, e.g. CR as \r, LF as \n, etc., and octets that don't have specific C-style backslash escapes as either octal or hex escapes (we're currently using octal, but I suspect most of us don't deal with PDP-11's on a daily basis, so perhaps hex would be better).
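
A corresponding sketch of that second, escape-style mode, again with a made-up name, operating on the raw octets:

	#include <glib.h>

	/*
	 * Hypothetical sketch of the second mode: show the raw octets,
	 * escaping anything that isn't printable ASCII as a C-style
	 * backslash escape or a hex escape.
	 */
	static gchar *
	format_text_escaped(const guint8 *data, gsize len)
	{
	    GString *out = g_string_sized_new(len);
	    gsize    i;

	    for (i = 0; i < len; i++) {
	        guint8 c = data[i];

	        switch (c) {
	        case '\\': g_string_append(out, "\\\\"); break;
	        case '\r': g_string_append(out, "\\r");  break;
	        case '\n': g_string_append(out, "\\n");  break;
	        case '\t': g_string_append(out, "\\t");  break;
	        default:
	            if (c >= 0x20 && c < 0x7F)
	                g_string_append_c(out, c);
	            else
	                g_string_append_printf(out, "\\x%02x", (guint)c);
	            break;
	        }
	    }
	    return g_string_free(out, FALSE);
	}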

All of the GUI toolkits we're likely to care about use Unicode, in some encoding, for strings, so we don't need to worry about translating from Unicode to ISO 8859/x or some flavor of EUC or... in the GUI - we can just hand the GUI Unicode strings.

For writing to files and to the "terminal", we might have to determine the user's code page/character encoding and map to that.  I think most UN*Xes support UTF-8 as the character encoding in the LANG environment variable these days, and sufficiently recent versions of Windows have code page 65001, a/k/a UTF-8:

	http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

(I don't know whether that dates back to W2K, which I think is the oldest version of Windows supported by current versions of Wireshark), so it is, at least in theory, possible for a user to configure their system so that non-UCS-2/non-UTF-16 text files and their terminal emulator/console program can handle Unicode.  In practice, some users might have reasons why they can't or wouldn't want to do that, however.
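
GLib can tell us what the locale's encoding is, so the output side might look roughly like this (g_get_charset() and g_convert() are real GLib routines; the rest is hand-waving):

	#include <stdio.h>
	#include <glib.h>

	/*
	 * Hypothetical sketch: write one of our internal UTF-8 strings to a
	 * file or to the terminal in whatever encoding the locale specifies.
	 */
	static void
	write_string_in_locale(FILE *fp, const gchar *utf8_str)
	{
	    const gchar *charset;

	    if (g_get_charset(&charset)) {
	        /* The locale's encoding is UTF-8 - write the string as-is. */
	        fputs(utf8_str, fp);
	    } else {
	        /* Some other encoding - try to convert from UTF-8. */
	        gchar *local = g_convert(utf8_str, -1, charset, "UTF-8",
	                                 NULL, NULL, NULL);
	        if (local != NULL) {
	            fputs(local, fp);
	            g_free(local);
	        } else {
	            /* Conversion failed; fall back to an escaping scheme. */
	        }
	    }
	}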

> We rather need to store encoding of FT_STRING[Z]
> and in proto_item_fill_label() call appropriate
> function.
> For ENC_ASCII use format_text(),
> for unicode (ENC_UTF*, ENC_UCS*) use format_text_utf(),
> etc..
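
For concreteness, I read that as something roughly like the following in proto_item_fill_label() - format_text_utf() doesn't exist yet, and the helper's name and signature are purely illustrative:

	#include <epan/proto.h>     /* for the ENC_* values */
	#include <epan/strutil.h>   /* for format_text() */

	/*
	 * Hypothetical sketch of the dispatch suggested above; the function,
	 * its arguments and format_text_utf() are illustrative (and the ENC_*
	 * endianness bit is simply masked off for brevity).
	 */
	static gchar *
	label_for_string_field(guint32 encoding, const guchar *value, int length)
	{
	    switch (encoding & ~ENC_LITTLE_ENDIAN) {
	    case ENC_UTF_8:
	    case ENC_UTF_16:
	    case ENC_UCS_2:
	    case ENC_UCS_4:
	        return format_text_utf(value, length);  /* the proposed new routine */
	    case ENC_ASCII:
	    default:
	        return format_text(value, length);
	    }
	}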

This also raises some other questions.

For example, presumably if the user enters a Unicode string in a display filter comparison expression, they'd want it to match the field in question if it has that value, regardless of whether it's encoded as UTF-8 or UTF-16 or ISO 8859/1 or {fill in your flavor of EBCDIC} or....  (They might even want it to match regardless of whether characters are composed or not:

	http://unicode.org/reports/tr15/

I would argue that it should, given that the OS with the largest market share on the desktop prefers composed characters, the UN*X with the largest market share on the desktop prefers decomposed characters, and all the other UN*Xes prefer composed characters, but that's another matter.)  Thus, the comparison that should be done should be a comparison between the string the user specified and the value of the field as converted to Unicode (UTF-8, as that's the way we're internally encoding Unicode).  If the field's raw value *can't* be converted to Unicode, the comparison would fail.
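
GLib's g_utf8_normalize() would be one way to get that; a rough sketch of a normalization-insensitive match, with both strings assumed to already be UTF-8:

	#include <string.h>
	#include <glib.h>

	/*
	 * Hypothetical sketch: compare the string from the display filter with
	 * a field value (both already UTF-8), treating composed and decomposed
	 * forms of the same characters as equal.
	 */
	static gboolean
	utf8_strings_match(const gchar *filter_str, const gchar *field_str)
	{
	    gchar    *a = g_utf8_normalize(filter_str, -1, G_NORMALIZE_NFC);
	    gchar    *b = g_utf8_normalize(field_str, -1, G_NORMALIZE_NFC);
	    gboolean  match;

	    if (a == NULL || b == NULL) {
	        /* At least one of them isn't valid UTF-8. */
	        g_free(a);
	        g_free(b);
	        return FALSE;
	    }
	    match = (strcmp(a, b) == 0);
	    g_free(a);
	    g_free(b);
	    return match;
	}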

However, if the user constructs a filter from a field and its value with Apply As Filter -> Selected, and the field is a string field and has a value that *can't* be represented in Unicode, the filter should probably do a match on the raw value of the field, not on the value of the field as converted to Unicode.

The latter could perhaps be represented as

	example.name == 48:65:6c:6c:6f:20:ff:ff:ff:ff:ff:ff:ff:ff

or something such as that.

This might mean we'd store, for a string field, the raw value and specified encoding.  When doing a comparison against a value specified as an octet string, we'd compare the raw values; when doing a comparison against a value specified as a Unicode string, we'd attempt to convert the raw value to UTF-8 and:

	if that fails, have the comparison fail;

	if that succeeds, compare against the converted value.
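
In other words, roughly the following - the structure and member names are invented for the sketch, and the conversion step could use g_convert() or something equivalent:

	#include <string.h>
	#include <glib.h>

	/*
	 * Hypothetical sketch; the structure and its members are invented
	 * here to illustrate "store the raw value and the encoding".
	 */
	typedef struct {
	    const guint8 *raw;        /* the octets from the packet */
	    gsize         raw_len;
	    const gchar  *encoding;   /* e.g. "UTF-8", "UTF-16LE", "ISO-8859-1" */
	} string_field_value;

	/* Comparison against a value the user gave as an octet string. */
	static gboolean
	field_equals_bytes(const string_field_value *fv,
	                   const guint8 *bytes, gsize len)
	{
	    return fv->raw_len == len && memcmp(fv->raw, bytes, len) == 0;
	}

	/* Comparison against a value the user gave as a Unicode string (UTF-8). */
	static gboolean
	field_equals_utf8(const string_field_value *fv, const gchar *utf8_str)
	{
	    gboolean  match;
	    gchar    *converted = g_convert((const gchar *)fv->raw, fv->raw_len,
	                                    "UTF-8", fv->encoding,
	                                    NULL, NULL, NULL);

	    if (converted == NULL)
	        return FALSE;          /* can't be converted: comparison fails */
	    match = (strcmp(converted, utf8_str) == 0);
	    g_free(converted);
	    return match;
	}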

In addition, when getting the value of a field for some other code to process, what should be done if the field can't be mapped to Unicode?

And what about non-printable characters?  We could use %-encoding for the XML formats (PDML, PSML), but for TShark's "-e" option, or for "export as CSV", or other non-XML formats, what should be done?
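
For the XML case, the %-encoding itself could be as simple as something like this (a sketch, not what we currently do); the non-XML formats are the open question:

	#include <glib.h>

	/*
	 * Hypothetical sketch of the %-encoding idea for PDML/PSML: pass
	 * printable ASCII through and %-encode everything else (including
	 * '%' itself, so the result can be decoded unambiguously).
	 */
	static gchar *
	percent_encode(const guint8 *data, gsize len)
	{
	    GString *out = g_string_sized_new(len);
	    gsize    i;

	    for (i = 0; i < len; i++) {
	        guint8 c = data[i];

	        if (c >= 0x20 && c < 0x7F && c != '%')
	            g_string_append_c(out, c);
	        else
	            g_string_append_printf(out, "%%%02x", (guint)c);
	    }
	    return g_string_free(out, FALSE);
	}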