Wireshark-dev: Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Tue, 28 Jun 2011 10:01:14 -0700
On Jun 28, 2011, at 3:22 AM, Jakub Zawadzki wrote:

> Btw. I know that nowadays I'm the only one who uses non-utf locales on console,
> but when we print on console (stdout/stderr) I think we should use strerror() from libc,
> i.e. strerror() which don't recode message to utf-8.

It's more complicated than that.

There are many sources of strings in the non-GUI output of the programs in the Wireshark suite:

	the message text itself - that's generally ASCII;

	file names - internally to those programs, those are in UTF-8;

	error strings for errno values and signal-name strings from signals - those might be in the current locale's character encoding for strerror()/strsignal() and would be in UTF-8 with g_strerror()/g_strsignal();

	etc.

In addition, the non-GUI output of the program can be sent either to the terminal or to files.

Output to the terminal should be in whatever character set the terminal expects.  I'm not sure what would indicate the character set the terminal expects.  On my machine, the "terminal" is Terminal.app, and can handle UTF-8 output; on other UN*Xes, in the GUI, it's probably similar.  For consoles (which I'm using here to mean "no GUI, just the console of a workstation/personal computer") it might be less capable.  For real terminals, it's almost certainly less capable; I'm not sure whether there has ever been a real serial-port terminal that handles UTF-8.  I don't know what the various terminal emulators for Windows, e.g. cmd.exe, do.
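
One possible answer on POSIX systems, offered only as a sketch: assume the terminal expects the character set of the user's locale, and ask libc for its name with nl_langinfo(CODESET) after calling setlocale().  That is a convention rather than a guarantee, since nothing forces the terminal and the locale to agree.  It does not cover Windows.  The helper names locale_codeset() and output_is_terminal() are made up for this example and are not existing Wireshark functions.

```c
#include <langinfo.h>
#include <locale.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the name of the character set of the
 * user's locale, e.g. "UTF-8" or "ISO-8859-2".
 *
 * Assumption: the terminal expects the same character set as the
 * locale.  That is the usual convention, not something libc checks.
 */
const char *
locale_codeset(void)
{
    /* Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG). */
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/*
 * Hypothetical helper: non-zero if the file descriptor refers to a
 * terminal, zero if it is a file, a pipe, or not open at all.
 */
int
output_is_terminal(int fd)
{
    return isatty(fd) != 0;
}
```

A caller could check output_is_terminal(STDERR_FILENO) to tell terminal output from redirected output, and use locale_codeset() as the target encoding for the terminal case.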

Output to files, whether it's the result of redirecting the standard output or error of a command-line program to a file, or of one of the "export to a text file" operations in Wireshark, or..., is another matter.  It might be that the character encoding should be the same as would be used on a terminal.

In any case, that means that using strerror() is probably not going to be sufficient to fix the problem.  What we might want to do is use UTF-8 everywhere we can, and, for non-GUI output, convert to the appropriate character encoding - whatever that might be - at the last minute.
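
As a rough illustration of "convert at the last minute", the following sketch keeps strings in UTF-8 and converts a copy with POSIX iconv() just before it would be written.  The function name utf8_to_codeset() is hypothetical, and the sketch assumes the target encoding is one in which a single NUL byte terminates a string, which holds for the usual terminal encodings but not for UTF-16 or UTF-32.  What happens to characters that the target encoding cannot represent is implementation-defined; some iconv() implementations fail, in which case this function returns NULL and the caller has to pick a fallback.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert a NUL-terminated UTF-8 string to the
 * encoding named by to_codeset.
 *
 * Returns a malloc()ed, NUL-terminated string that the caller must
 * free(), or NULL if the encoding is unknown, memory runs out, or
 * iconv() reports a conversion error.
 */
char *
utf8_to_codeset(const char *utf8, const char *to_codeset)
{
    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;            /* unknown or unsupported encoding */

    size_t in_left = strlen(utf8);
    /* Generous bound: no common encoding needs more than this. */
    size_t out_size = in_left * 4 + 16;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() takes a non-const input pointer but does not modify the data. */
    char *in_p = (char *)utf8;
    char *out_p = out;
    size_t out_left = out_size - 1;     /* keep room for the NUL */

    size_t rc = iconv(cd, &in_p, &in_left, &out_p, &out_left);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &out_p, &out_left);
    }
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_p = '\0';
    return out;
}
```

The target name could come from nl_langinfo(CODESET) for terminal output.  Which encoding to use for output redirected to a file remains the open question described above.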