[Wireshark-bugs] [Bug 8382] MS-MMS dissector crash
Date: Sun, 03 Mar 2013 02:22:36 +0000

Comment # 30 on bug 8382 from
(In reply to comment #21)
> - tvb_get_string_* functions should preserve embedded nulls, as they should
> be just dumb getters from the tvb

If so, then the string might not be usable by the dissector in, for example,
comparisons, as the string won't be guaranteed to be in a C-string-style
representation (string of octets, all ASCII characters represented as single
octets containing the ASCII value of the character).
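
For example, with an embedded null, two different values can wrongly compare
equal under the C string functions (a minimal illustration):

    #include <string.h>

    /* Two different 5-octet values that a dumb getter might hand back;
     * both contain an embedded null. */
    static const char a[5] = { 'a', 'b', '\0', 'c', 'd' };
    static const char b[5] = { 'a', 'b', '\0', 'z', 'z' };

    /* strcmp(a, b) returns 0 -- it stops at the embedded null, so the
     * two different values compare equal.  memcmp(a, b, 5) sees the
     * difference, but only because the caller knows the length is 5. */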

If the dissector is just adding a string value, without needing to look at the
value, it should use proto_tree_add_item().
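
For example (a sketch; hf_mms_server_name is a hypothetical FT_STRING field):

    /* Add the string straight from the tvb; the dissector never
     * looks at the value, so no getter is needed. */
    proto_tree_add_item(tree, hf_mms_server_name, tvb, offset, length,
                        ENC_ASCII|ENC_NA);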

If it needs to look at the value, it should use a tvb_get_string_* routine that
takes an encoding argument, and either:

    1) the routine should return a C string (null-terminated, meaning we can't
handle embedded nulls)

or

    2) the routine should return something such as a counted string, so that we
can handle embedded nulls, and we should provide routines that handle that type
of string (a possible shape for such a type is sketched below)

and the string's encoding should be UTF-8.
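
A counted-string type for case 2) might look something like this (a sketch;
these names are made up, not an existing API):

    /* Sketch of a counted-string return type; hypothetical names. */
    typedef struct {
        guint8 *data;  /* UTF-8 octets, possibly with embedded nulls */
        guint   len;   /* length in octets, not characters */
    } tvb_counted_string_t;

    /* Getter that preserves embedded nulls by returning a count. */
    tvb_counted_string_t *tvb_get_counted_string(tvbuff_t *tvb,
        gint offset, gint length, guint encoding);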

> - tvb_get_string_* functions should provide the length of the returned
> string, since with multi-byte characters and embedded nulls, the caller has
> no other way of determining this value

"Length" in octets or "length" in characters?

That sounds like case 2), where we use counted strings.

> - the ms-mms dissector should be checking for overflow when doubling a value
> it gets from the packet
> 
> - the ms-mms dissector should be using proto_tree_add_unicode_string for
> strings it gets via tvb_get_ephemeral_unicode_string
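
For the overflow check, something along these lines would do (a sketch; the
variable names are illustrative):

    /* A UCS-2/UTF-16 string length arrives as a character count and
     * is doubled to get an octet count; check before doubling. */
    guint32 char_count = tvb_get_letohl(tvb, offset);
    if (char_count > G_MAXUINT32 / 2)
        THROW(ReportedBoundsError);  /* doubling would overflow */
    guint32 octet_count = char_count * 2;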

> - it should be possible to add multibyte unicode strings (UCS-2, UTF-16,
> UCS-4, etc) to the tree with proto_tree_add_item and the appropriate
> encoding arguments

Yes.
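
So a call such as the following ought to work (a sketch of the proposed usage;
hf_mms_file_path is hypothetical):

    /* Add a little-endian UTF-16 string directly from the tvb,
     * letting proto_tree_add_item() do the decoding. */
    proto_tree_add_item(tree, hf_mms_file_path, tvb, offset, length,
                        ENC_UTF_16|ENC_LITTLE_ENDIAN);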

This raises the question of how the value of a string should be stored in the
protocol tree.

Two options I see are:

    1) store it as some Unicode encoding (UTF-8, UTF-16, or UCS-4);

    2) store it as a raw collection of octets, along with the encoding used,
and convert it to some Unicode encoding when needed (a sketch of this
representation follows the list below).

"When needed" would include:

    when presenting it in human-readable form;

    when presenting it in PDML or some other form intended for machine reading;

    when providing it to code for use in, for example, a tap;

    when comparing with it in a display filter.
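
Option 2) might be represented along these lines (a sketch; none of these
names exist in the current code):

    /* Sketch of option 2): what the protocol tree would store for a
     * string field.  All names here are hypothetical. */
    typedef struct {
        guint    encoding;   /* ENC_* value the octets arrived in */
        guint8  *octets;     /* raw octets from the wire */
        guint    len;        /* length of "octets" */
        char    *utf8;       /* lazily-filled UTF-8 form, or NULL */
        gboolean lossy;      /* TRUE if the conversion wasn't exact */
    } fvalue_string_t;

    /* Convert on first use, for the "when needed" cases above. */
    const char *fvalue_string_get_utf8(fvalue_string_t *fv);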

If there's anything in the string that *can't* be converted (e.g., an invalid
UTF-8 sequence, or a character in some non-Unicode set that has no Unicode
equivalent, if any such characters exist), then:

    when presenting it in human-readable form, we'd have to convert it to
something to show to the user (Unicode REPLACEMENT CHARACTER?);

    when presenting in PDML, we'd have to do something;

    when providing it for code to use, we'd have to do something (REPLACEMENT
CHARACTER plus a "sorry, I couldn't convert this completely" indication?);

    when comparing it in a display filter, we'd have to cause all comparisons
other than "not equal" to fail (unless we have a way to specify comparison
against arbitrary octet strings).
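
The "do something" could be a scan-and-substitute pass, roughly like this (a
sketch using GLib's validator; real code would presumably be more careful
about the length of the invalid sequence):

    #include <glib.h>

    /* Sketch: copy a string, replacing invalid UTF-8 with U+FFFD
     * REPLACEMENT CHARACTER, and note whether anything was lost. */
    static gchar *
    to_valid_utf8(const gchar *s, gssize len, gboolean *lossy)
    {
        GString *out = g_string_new(NULL);
        const gchar *bad;

        *lossy = FALSE;
        while (!g_utf8_validate(s, len, &bad)) {
            g_string_append_len(out, s, bad - s);   /* valid prefix */
            g_string_append_unichar(out, 0xFFFD);   /* the marker */
            *lossy = TRUE;
            len -= (bad - s) + 1;                   /* skip one octet */
            s = bad + 1;
        }
        g_string_append_len(out, s, len);
        return g_string_free(out, FALSE);
    }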

> - calling format_str is necessary when passing a packet-derived string into
> proto_tree_add_string, col_append_fstr and friends. These functions could
> potentially call format_str themselves on every string they get, but the
> performance hit would be big. Faster to continue calling it only where
> needed.

When presenting strings in human-readable form, we need to figure out what to
do about non-printable characters.  The format_str mechanism of showing them as
C string escapes might work, although would we then have to escape all
backslashes in the string?  That'd look Really Weird in the SMB dissector, for
example, with pathnames filled up with \\.  Perhaps string fields should have a
display option indicating whether to use C string escapes (complete with
escaping of backslashes) or to just show non-printable characters either as
REPLACEMENT CHARACTER or, at least for C0 control characters, as the SYMBOL FOR
xxx characters (0x2400+{value} for 0x00 through 0x1F and 0x2421 for 0x7F).
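
That mapping is simple arithmetic (a sketch):

    /* Map a C0 control character or DEL to the corresponding Unicode
     * SYMBOL FOR xxx code point; leave anything else alone. */
    static gunichar
    control_picture(guint8 c)
    {
        if (c < 0x20)
            return 0x2400 + c;   /* U+2400..U+241F */
        if (c == 0x7F)
            return 0x2421;       /* U+2421 SYMBOL FOR DELETE */
        return c;
    }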

It might also be useful to have some way in the UI to look at the raw bytes
corresponding to individual glyphs, in case, for example, you're trying to
debug a problem with a client that's sending bad UTF-8 sequences over the wire,
or a problem with decomposed vs. composed accented characters.

My personal inclination is to store string values as encoding+raw bytes, doing
the conversion to UTF-8 lazily, and:

Having

    protocol.string {comparison} "hello"

be done either by converting protocol.string to UTF-8 and comparing the
UTF-8 strings or converting "hello" to the encoding, caching the result, and
failing if the conversion fails (>, >=, <, and <= might be done by converting
to UTF-8; they're probably less common, and they raise questions about what
order to impose on strings in any case).  The latter would probably involve
fewer conversions.
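
In rough code, the "convert the literal, cache it, compare octets" approach
might look like this (a sketch; literal_get_cached_bytes and the types are
hypothetical, building on the fvalue_string_t sketch above):

    /* Sketch of == against a quoted string in a display filter. */
    static gboolean
    string_fvalue_eq(const fvalue_string_t *fv, filter_literal_t *lit)
    {
        /* Convert "hello" from UTF-8 to the field's encoding once
         * and cache it; NULL means the conversion failed. */
        GBytes *needle = literal_get_cached_bytes(lit, fv->encoding);

        if (needle == NULL)
            return FALSE;   /* == fails; != is just the negation */

        return fv->len == g_bytes_get_size(needle) &&
               memcmp(fv->octets, g_bytes_get_data(needle, NULL),
                      fv->len) == 0;
    }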

Having

    protocol.string {==,!=} xx:xx:xx:xx:xx:xx:xx:xx...

be done by comparing the sequence of octets with the raw bytes.

Having the "apply as filter" and "prepare as filter" stuff, when you try to
construct a filter from a string field, use the sequence-of-octets form if the
value can't be converted to Unicode and the string form otherwise (although
that would get tricky if the Unicode string contains non-printable characters).
