Wireshark-dev: Re: [Wireshark-dev] [Wireshark-commits] rev 53819: /trunk/epan/ /trunk/epan/diss
On Dec 7, 2013, at 2:10 AM, darkjames@xxxxxxxxxxxxx wrote:
> http://anonsvn.wireshark.org/viewvc/viewvc.cgi?view=rev&revision=53819
>
> User: darkjames
> Date: 2013/12/07 10:10 AM
>
> Log:
> Add new string proto encoding for windows-1250 (ENC_WINDOWS_1250)
>
> - Move windows-1250 to unicode encoding table to charset.c
> - Add tvb_get_string_unichar2, tvb_get_stringz_unichar2 functions which recode tvb-string to UTF-8.
Note that
    https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#gunichar2
says of a gunichar2 that it is
    A type which can hold any UTF-16 code point[4].
with the footnote
    https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#ftn.utf16_surrogate_pairs
saying
    [4] surrogate pairs
This means that a gunichar2 can hold either
1) a character from the Basic Multilingual Plane (BMP) of Unicode:
https://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane
or
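2) one half of a surrogate pair (a gunichar2 is a single 16-bit value, so it cannot hold the whole pair):
https://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF
Since a character outside the BMP takes two gunichar2 values, those routines can handle only encodings that don't include characters outside the BMP.
This is probably true of most non-Unicode encodings, such as the ISO 8859-n encodings, so the routines are fine for those, but check that it holds before using them for any other encoding.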
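To make the BMP limit concrete, here is a minimal sketch of UTF-16 encoding in plain C. It is not code from the commit: unichar2_t is a hypothetical stand-in for gunichar2 (a 16-bit unsigned type in GLib), and utf16_encode is a name made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GLib's gunichar2 (a 16-bit unsigned type). */
typedef uint16_t unichar2_t;

/*
 * Encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is a surrogate value or is above U+10FFFF.
 */
static size_t
utf16_encode(uint32_t cp, unichar2_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;   /* surrogate values are not characters */
    if (cp > 0x10FFFF)
        return 0;   /* beyond the Unicode code space */

    if (cp < 0x10000) {
        /* BMP character: fits in a single 16-bit unit. */
        out[0] = (unichar2_t)cp;
        return 1;
    }

    /* Outside the BMP: split the 20 remaining bits across two units. */
    cp -= 0x10000;
    out[0] = (unichar2_t)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    out[1] = (unichar2_t)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

U+0141 (LATIN CAPITAL LETTER L WITH STROKE, which windows-1250 includes) comes out as the single unit 0x0141, while U+1F600 comes out as the two units 0xD83D 0xDE00. An encoding containing a character like the latter could not be described with one gunichar2 per character.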