Introduction
In the beginning, there was char. It could hold an ASCII character and then some, and that was good enough for everybody. But later, the need for internationalization (i18n) and localization (l10n) came up, and char wasn't enough anymore to store all those fancy characters. Thus, multi-byte character encodings were conceived, where two or more chars represent a single character. On top of that, a vast number of character sets and encodings had been established, most of them incompatible with each other. A solution had to be found to unify this mess, and the solution was wchar_t, a data type big enough to hold any character (or so it was thought).
Multi-byte and wide-character strings
To connect the harsh reality of weird multi-byte character encodings with the ideal world of abstract character representations, a number of interfaces to convert between the two were developed. The simplest ones are mbtowc() (to convert a multi-byte character to a wchar_t) and wctomb() (to convert a wchar_t to a multi-byte character). The multi-byte character encoding is assumed to be that of the current locale.
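For illustration, a minimal sketch of how these two calls are typically used might look like this (assuming both the source file and the runtime locale use UTF-8):

    #include <limits.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        wchar_t wc;
        char buf[MB_LEN_MAX];
        int len, n;

        setlocale(LC_CTYPE, "");   /* use the locale's multi-byte encoding */

        /* mbtowc() reads at most MB_CUR_MAX bytes and stores one wide
         * character; it returns the number of bytes consumed, or -1 on
         * an invalid sequence. */
        len = mbtowc(&wc, "ä", MB_CUR_MAX);
        if (len > 0)
            printf("consumed %d byte(s), code point U+%04lX\n", len, (unsigned long)wc);

        /* wctomb() converts the wide character back into a multi-byte
         * sequence and returns the number of bytes written. */
        n = wctomb(buf, wc);
        if (n > 0)
            printf("wrote %d byte(s)\n", n);

        return 0;
    }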
But even those two functions bear a huge problem: they are not thread-safe. The Single Unix Specification version 3 mentions this for wctomb(), but not for mbtowc(), while the glibc documentation mentions it for both. The solution? Use the equivalent thread-safe functions mbrtowc() and wcrtomb(). Both of them keep their state in an mbstate_t variable provided by the caller. In practice, most functions for converting between multi-byte and wide-character strings come in two versions: one that is simpler (one function argument less) but neither thread-safe nor reentrant, and one that requires a bit more work from the programmer (declare an mbstate_t variable, initialize it, and call the functions that take it) but is thread-safe.
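A rough sketch of the thread-safe variant, with the caller-provided mbstate_t, could look like this (again assuming a UTF-8 locale and source file):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        const char *p = "Grüße";   /* multi-byte input, assumed UTF-8 here */
        mbstate_t state;
        wchar_t wc;
        size_t left, n;

        setlocale(LC_CTYPE, "");
        memset(&state, 0, sizeof state);   /* a zeroed mbstate_t is the initial state */

        left = strlen(p);
        while (left > 0) {
            /* mbrtowc() keeps its shift state in our own mbstate_t instead of
             * a hidden static one, which is what makes it thread-safe. */
            n = mbrtowc(&wc, p, left, &state);
            if (n == (size_t)-1 || n == (size_t)-2)
                break;   /* invalid or incomplete sequence */
            printf("U+%04lX\n", (unsigned long)wc);
            p += n;
            left -= n;
        }
        return 0;
    }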
Coping with different character sets
To convert between different character sets/encodings, Unix provides another API, named iconv(). It gives the user the ability to convert text from any character set/encoding to any other character set/encoding. But this approach has a terrible disadvantage: the only standard way Unix offers to turn text of an arbitrary encoding into a wide-character string is to use iconv() to convert the text to the current locale's character set and then convert that to a wide-character string. Assume we have a string encoded in Shift_JIS, a common character encoding for the Japanese language, and ISO-8859-1 (Latin 1) as the current locale's character set: we'd first have to convert the Shift_JIS text to ISO-8859-1, a step that is most likely lossy (unless only the ASCII-compatible part of Shift_JIS is used), and only then can we use the mb*towc* functions to convert it to a wide-character string. So, as we can see, there is no proper standard solution to this problem.
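To make the two-step dance concrete, here is a rough sketch (the Shift_JIS sample bytes are purely illustrative, and iconv_open() may expect a slightly different spelling of the encoding names on some platforms):

    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "\x83n\x83\x8d\x81[";   /* some Shift_JIS encoded text */
        char out[64], *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out - 1, n;
        wchar_t wbuf[64];
        iconv_t cd;

        setlocale(LC_ALL, "");

        /* Step 1: iconv() from the source encoding to the current locale's
         * character set (queried via nl_langinfo(CODESET)). This step may
         * fail or lose information if the locale's charset cannot
         * represent the input. */
        cd = iconv_open(nl_langinfo(CODESET), "SHIFT_JIS");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        *outp = '\0';
        iconv_close(cd);

        /* Step 2: convert the locale-encoded text to a wide-character string. */
        n = mbstowcs(wbuf, out, sizeof wbuf / sizeof wbuf[0]);
        if (n != (size_t)-1)
            printf("converted %zu wide character(s)\n", n);

        return 0;
    }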
How is this solved in practice? In glibc (and GNU libiconv), the iconv() implementation allows the use of a pseudo character encoding named "WCHAR_T" that represents wide-character strings of the local platform. But this solution is messy, as the programmer who uses iconv() has to manually cast char * to wchar_t * and vice versa. A further problem is that support for the WCHAR_T encoding is not guaranteed by any standard and is totally implementation-specific: while it is available on Linux/glibc, FreeBSD and Mac OS X, it is not available on NetBSD, and is thus not an option for truly portable programming.
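On a platform where this pseudo encoding is supported, the resulting code looks roughly like this (note the cast, and note that none of it is guaranteed to work elsewhere):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        char in[] = "\x83n\x83\x8d\x81[";   /* illustrative Shift_JIS input */
        wchar_t wbuf[16];
        char *inp = in, *outp = (char *)wbuf;   /* the manual cast */
        size_t inleft = strlen(in), outleft = sizeof wbuf;
        iconv_t cd;

        /* Non-standard: "WCHAR_T" as a target encoding works on glibc and
         * GNU libiconv, but is not guaranteed anywhere else. */
        cd = iconv_open("WCHAR_T", "SHIFT_JIS");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            perror("iconv");
        else
            printf("got %zu wide characters\n", (sizeof wbuf - outleft) / sizeof(wchar_t));

        iconv_close(cd);
        return 0;
    }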
Mac OS X (besides providing iconv()) follows a different approach: in addition to the functions that by default always use the current locale's character encoding, it provides a set of functions that work with any other locale, all to be found under the umbrella of the xlocale.h header. The problem with this solution is that it isn't portable either, being practically only available on Mac OS X.
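A sketch of what using these functions might look like, assuming a Mac OS X system that provides mbrtowc_l() and ships a Shift_JIS locale under the name used below:

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <xlocale.h>   /* Mac OS X only */

    int main(void)
    {
        /* The locale name "ja_JP.SJIS" is an assumption and may differ. */
        locale_t sjis = newlocale(LC_ALL_MASK, "ja_JP.SJIS", NULL);
        const char mb[] = "\x83n\x83\x8d\x81[";   /* illustrative Shift_JIS bytes */
        mbstate_t state;
        wchar_t wc;
        size_t n;

        if (sjis == NULL) { perror("newlocale"); return 1; }
        memset(&state, 0, sizeof state);

        /* mbrtowc_l() works like mbrtowc(), but uses the given locale
         * instead of the current one. */
        n = mbrtowc_l(&wc, mb, strlen(mb), &state, sjis);
        if (n != (size_t)-1 && n != (size_t)-2)
            printf("first character: U+%04lX (%zu bytes)\n", (unsigned long)wc, n);

        freelocale(sjis);
        return 0;
    }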
Alternatives
Plan 9 was the first operating system that adopted UTF-8 as its only character encoding. In fact, UTF-8 was conceived by two principal Plan 9 developers, Rob Pike and Ken Thompson, in an effort to make Plan 9 Unicode-compatible while retaining full compatibility with ASCII. Plan 9's API for dealing with UTF-8 and Unicode is now available as libutf. While it doesn't solve the character encoding conversion issue described above, and assumes everything to be UTF-8, it provides a clean and minimalistic interface for handling Unicode text. Unfortunately, due to the decision to represent a Rune (i.e. a Unicode character) as unsigned short (16 bits on most platforms), libutf is restricted to characters from the Unicode Basic Multilingual Plane (BMP).
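A small example of what working with libutf looks like (assuming the standalone libutf with its utf.h header; on Plan 9 itself the same functions live in libc):

    #include <stdio.h>
    #include <utf.h>

    int main(void)
    {
        char *s = "héllo";   /* assumed to be UTF-8 encoded */
        Rune r;

        /* utflen() counts the runes in a UTF-8 string; chartorune()
         * decodes one rune and returns the number of bytes consumed. */
        printf("%d runes\n", utflen(s));
        while (*s) {
            s += chartorune(&r, s);
            printf("U+%04X\n", (unsigned int)r);
        }
        return 0;
    }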
Another alternative is ICU by IBM, a seemingly grand unified solution for all topics related to i18n, l10n and m17n. AFAIK, it solves all of the issues mentioned above, but on the other hand it is perceived as a big, bloated mess. And while its developers aim for great portability, it is non-standard and has to be integrated manually on the respective target platform(s).
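Just to give an impression, a direct conversion with ICU's C converter API might look roughly like this (encoding names are illustrative):

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        const char in[] = "\x83n\x83\x8d\x81[";   /* illustrative Shift_JIS bytes */
        char out[64];

        /* ucnv_convert() goes directly from one named encoding to another,
         * pivoting through UTF-16 internally, with no detour via the locale. */
        int32_t len = ucnv_convert("UTF-8", "Shift_JIS",
                                   out, sizeof out,
                                   in, sizeof in - 1, &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
            return 1;
        }
        printf("converted to %d UTF-8 bytes\n", (int)len);
        return 0;
    }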