Wireshark-dev: Re: [Wireshark-dev] How to print out string encoded data that contains nul chara
On Apr 9, 2014, at 2:06 PM, "John Dill" <John.Dill@xxxxxxxxxxxxxxxxx> wrote:
> I have several character data fields that happen to contain sections of non-ascii binary data including nul characters. I'd like to get a string display that shows all of the characters according to the length of the field, i.e.
>
> 20 20 20 20 20 20 01 00 01 00 48 31 20 20 20 20
>
> produces
>
> "      \001\000\001\000H1    "
>
> In proto.c, I see that all of the format_text calls use strlen(bytes) as the length.
>
>     case FT_STRING:
>     case FT_STRINGZ:
>     case FT_UINT_STRING:
>         bytes = (guint8 *)fvalue_get(&fi->value);
>         label_fill(label_str, hfinfo, format_text(bytes, strlen(bytes)));
>
> What is the recommended way to create a text string, sized by the 'length' argument of 'proto_tree_add_item', in which non-ASCII data, including nul characters, is shown with the octal escape '\xxx'?
The right short-term way would be to use proto_tree_add_string_format_value() to add the field, and format the string's value yourself, using format_text() with a byte count rather than strlen().
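As a rough sketch of the "byte count rather than strlen()" idea, the standalone helper below escapes a counted buffer so that embedded NULs do not end the output early. The name escape_counted() is hypothetical and is not a Wireshark function; in a dissector, format_text() called with the field length would play this role, and the result would be passed as the value to proto_tree_add_string_format_value().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of Wireshark: escape exactly `len` octets
 * from `bytes` into `out`. Printable ASCII is copied as-is, a backslash is
 * doubled so the output stays unambiguous, and every other octet, NUL
 * included, becomes a three-digit octal escape such as \001.
 * The output is always NUL-terminated and is truncated at a whole escape
 * if `out` is too small. Returns the number of characters written, not
 * counting the terminating NUL.
 */
size_t
escape_counted(const unsigned char *bytes, size_t len,
               char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL && out_size > 0);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = bytes[i];

        if (c == '\\') {
            if (used + 2 >= out_size)
                break;
            out[used++] = '\\';
            out[used++] = '\\';
        } else if (c >= 0x20 && c < 0x7f) {
            if (used + 1 >= out_size)
                break;
            out[used++] = (char)c;
        } else {
            if (used + 4 >= out_size)
                break;
            /* Writes the backslash, three octal digits and a NUL. */
            snprintf(out + used, 5, "\\%03o", (unsigned)c);
            used += 4;
        }
    }
    out[used] = '\0';
    return used;
}
```

The loop is driven by the caller's length, never by a search for a terminating NUL, which is the whole difference from the strlen(bytes) calls quoted above.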
The right long-term way is to modify Wireshark so that this works. The way we handle strings should probably be changed so that we:
    store the raw string octets as a counted array, along with the string encoding;
    convert the octets from the encoding to UTF-8 *with invalid octets and sequences shown as escapes* when displaying the strings;
    convert the octets from the encoding to UTF-8 with invalid octets and sequences shown as Unicode REPLACEMENT CHARACTERs when making the string available for processing by other software (e.g., "-T fields", etc.) (or somehow say "this isn't a valid string in this encoding");
    somehow arrange that strings with invalid octets or sequences are *always* unequal to any character string in packet-matching expressions (display/read filters, color "filters", etc.), and perhaps allow strings to be compared against octet sequences (e.g. "foobar.name == 20:20:20:20:20:20:01:00:01:00:48:31:20:20:20:20" matches the raw octets of the string), and use that with "Prepare As Filter" etc.
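To make the list above a little more concrete, here is a minimal standalone sketch, not Wireshark code. Every name in it (counted_string, sketch_encoding and the functions) is hypothetical, and it assumes a single encoding, 7-bit ASCII, in which any octet above 0x7F is treated as invalid. A real implementation would have to handle multi-byte encodings and would probably also escape control characters for display.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: only one encoding is modelled in this sketch. */
typedef enum { SKETCH_ENC_ASCII } sketch_encoding;

/* Raw octets kept as a counted array, together with their encoding. */
typedef struct {
    const unsigned char *octets;
    size_t               len;
    sketch_encoding      enc;
} counted_string;

/* Assumption: for ASCII, an octet is invalid if it is above 0x7F. */
bool
counted_string_is_valid(const counted_string *s)
{
    for (size_t i = 0; i < s->len; i++)
        if (s->octets[i] > 0x7f)
            return false;
    return true;
}

/*
 * Convert to UTF-8. For display, invalid octets become octal escapes;
 * otherwise they become U+FFFD REPLACEMENT CHARACTER (EF BF BD).
 * Returns the number of bytes written, not counting the final NUL.
 */
size_t
counted_string_to_utf8(const counted_string *s, bool for_display,
                       char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL && out_size > 0);

    for (size_t i = 0; i < s->len; i++) {
        unsigned char c = s->octets[i];

        if (c <= 0x7f) {
            if (used + 1 >= out_size)
                break;
            out[used++] = (char)c;
        } else if (for_display) {
            if (used + 4 >= out_size)
                break;
            snprintf(out + used, 5, "\\%03o", (unsigned)c);
            used += 4;
        } else {
            if (used + 3 >= out_size)
                break;
            memcpy(out + used, "\xEF\xBF\xBD", 3);
            used += 3;
        }
    }
    out[used] = '\0';
    return used;
}

/* A string with invalid octets never equals any character string. */
bool
counted_string_equals_text(const counted_string *s,
                           const char *text, size_t text_len)
{
    if (!counted_string_is_valid(s))
        return false;
    return s->len == text_len && memcmp(s->octets, text, text_len) == 0;
}

/* Comparison against an octet sequence looks only at the raw bytes. */
bool
counted_string_equals_octets(const counted_string *s,
                             const unsigned char *octets, size_t len)
{
    return s->len == len && memcmp(s->octets, octets, len) == 0;
}
```

Keeping the raw octets and deriving the display and export forms on demand means nothing is lost at dissection time, and the octet comparison still works for values that can never match as text.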
Alternatively, if they're *not* really character strings, display them as a set of subfields, with the text part shown as strings and the binary data shown as whatever it is, e.g.
    Frobozz text 1: {blanks}
    Frobozz count 1: 1
    Frobozz count 2: 1
    Frobozz text 2: H1{and more blanks}
or whatever it is.