Library: Localization
Inheritance: codecvt_byname derives from codecvt, which derives from codecvt_base and locale::facet.
A facet that performs conversions between named and unnamed encodings and character sets.
#include <locale>
namespace std {
  template <class internT, class externT, class stateT>
  class codecvt_byname;
}
template<> class codecvt_byname<wchar_t, char, mbstate_t>;
The codecvt_byname template includes the same functionality as the codecvt template, but specific to a particular named locale. For a description of the member functions of codecvt_byname, see the entry for codecvt.
namespace std {
  template <class internT, class externT, class stateT>
  class codecvt_byname : public codecvt<internT, externT, stateT> {
  public:
    typedef internT intern_type;
    typedef externT extern_type;
    typedef stateT state_type;

    explicit codecvt_byname(const char*, size_t refs = 0);

  protected:
    virtual ~codecvt_byname();

    virtual codecvt_base::result
    do_out(state_type&, const intern_type*, const intern_type*,
           const intern_type*&, extern_type*, extern_type*,
           extern_type*&) const;

    virtual codecvt_base::result
    do_in(state_type&, const extern_type*, const extern_type*,
          const extern_type*&, intern_type*, intern_type*,
          intern_type*&) const;

    virtual codecvt_base::result
    do_unshift(state_type&, extern_type*, extern_type*,
               extern_type*&) const;

    virtual bool do_always_noconv() const throw();
    virtual int do_max_length() const throw();
    virtual int do_encoding() const throw();
  };
}
explicit codecvt_byname(const char* name, size_t refs = 0);
Constructs a codecvt_byname object. Calls codecvt::codecvt(refs).
The refs argument is set to the initial value of the object's reference count. A codecvt_byname object f constructed with (refs == 0) that is installed in one or more locale objects will be destroyed and the storage it occupies will be deallocated when the last locale object containing the facet is destroyed, as if by calling delete static_cast<locale::facet*>(&f). A codecvt_byname object constructed with (refs != 0) will not be destroyed by any locale objects in which it may have been installed.
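The two ownership modes selected by refs can be sketched as follows. This is an illustration only: OwnedCvt, is_installed, locale_owned_facet, and caller_owned_facet are hypothetical names that do not appear in the library, and the name "C" is used solely because every implementation is expected to accept it.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

typedef std::codecvt<wchar_t, char, std::mbstate_t>        BaseCvt;
typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> Cvt;

// Hypothetical helper type: exposes the protected destructor so that a
// facet constructed with refs != 0 can be owned and destroyed by the caller.
struct OwnedCvt : Cvt {
    explicit OwnedCvt(const char* name) : Cvt(name, 1) {}
    ~OwnedCvt() {}
};

// Returns true if `facet` is the codecvt facet installed in `loc`.
bool is_installed(const std::locale& loc, const BaseCvt* facet) {
    return &std::use_facet<BaseCvt>(loc) == facet;
}

// refs == 0: the locale owns the facet and deletes it when the last
// locale object containing it is destroyed.
bool locale_owned_facet() {
    Cvt* facet = new Cvt("C");
    const std::locale loc(std::locale::classic(), facet);
    return is_installed(loc, facet);
}   // loc is destroyed here and takes the facet with it

// refs != 0: the caller owns the facet, which must outlive every locale
// object it has been installed in.
bool caller_owned_facet() {
    OwnedCvt facet("C");
    bool installed = false;
    {
        const std::locale loc(std::locale::classic(), &facet);
        installed = is_installed(loc, &facet);
    }   // loc is destroyed here; facet is left alone
    return installed && facet.max_length() >= 1;   // facet is still usable
}
```

The derived type is needed in the second case only because the codecvt_byname destructor is protected, which prevents declaring the facet directly as a local object.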
The primary template behaves identically to the codecvt facet. Details particular to codecvt_byname<wchar_t, char, mbstate_t>:
The format of the name argument recognized by the codecvt_byname facet is a superset of the formats recognized by the locale constructors that accept a locale name as an argument. The extended format is as follows:
language[_territory][.codeset[@modifiers]] | codeset[@modifiers] | special_name
Where:
language is a 2-letter code specified by ISO 639-1.
territory is a 2 or 3 letter code specified by ISO 3166.
codeset is the name of the character set assigned by IANA.
modifiers is a set of optional codes.
special_name is a name denoting a locale in a readable format, for example, "german", "french", "dutch", etc.
A name in the format language_territory.codeset, for example, de_DE.ISO-8859-1, fr_FR.ISO-8859-1, etc., with the language, territory and codeset components being optional, designates a facet object that performs conversions between external encoding, as specified by the name of the codeset (for example, ISO-8859-15) and the internal representation of wchar_t. The internal representation of each external character matches the encoding of the character used by the mbtowc() C library function on that platform for the given locale, if such a locale exists, otherwise UCS-4 or UCS-2.
A name in the format language_territory.codeset@euro, for example, de_DE.ISO-8859-15@euro, with the territory and codeset components being optional (for example, de_DE@euro or just de@euro), designates a facet object that performs conversions between external encoding as specified by the name of the codeset (for example, ISO-8859-15) and the internal representation of wchar_t. The internal representation matches the encoding used by the mbstowcs() C library function on that platform for the given locale, if such a locale exists, otherwise UCS-4 or UCS-2. The common @euro modifier specifies that the locale database uses the Euro as its currency and in general adheres to the standards of the European Community.
A name in the format language_territory.codeset@UCS, for example, ja_JP.EUC-JP@UCS, with the language, territory, and codeset components being optional (for example, ja@UCS, ja_JP@UCS, or just EUC-JP@UCS), designates a facet object that performs conversions between external encoding as specified by the name of the codeset (for example, EUC-JP) and the Universal Character Set (UCS) representation of wchar_t, which may be UCS-4 or UCS-2, depending on the size of wchar_t. The @UCS-4 modifier, recognized in environments where (sizeof (wchar_t) == 4) is true, instructs the facet to use UCS-4 as the internal encoding. The @UCS-2 modifier is recognized with analogous meaning for UCS-2. Portable code should use the @UCS modifier rather than making assumptions about the size of wchar_t.
A special name, such as german or french, may be used as an alias for a canonical locale name. Such names may be provided as a convenience on some platforms and usually refer to the most common canonical locale provided by the platform. For example, german may be the equivalent of de_DE.ISO-8859-1, and french may be the equivalent of fr_FR.ISO-8859-15. Portable code should not rely on these names.
If the codeset component is provided, the language and territory components are optional and allowed only for compatibility with the format of locale names accepted by the class locale constructors, as the facet only makes use of the codeset and encoding information.
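To make the name format concrete, the following sketch assembles a name from its components. The function make_facet_name is a hypothetical helper written for this illustration; it is not part of the library, and it does not check that the components are valid ISO or IANA names.

```cpp
#include <string>

// Hypothetical helper: assembles a facet name of the form
//   language[_territory][.codeset[@modifiers]] | codeset[@modifiers]
// Empty components are omitted. Special names such as "german" are passed
// to the facet constructor unchanged and need no assembly.
std::string make_facet_name(const std::string& language,
                            const std::string& territory,
                            const std::string& codeset,
                            const std::string& modifiers) {
    std::string name = language;
    if (!language.empty() && !territory.empty())
        name += "_" + territory;      // a territory only follows a language
    if (!codeset.empty()) {
        if (!name.empty())
            name += ".";              // separator only after a language
        name += codeset;
    }
    if (!modifiers.empty())
        name += "@" + modifiers;
    return name;
}
```

With this helper, make_facet_name("ja", "JP", "EUC-JP", "UCS") yields ja_JP.EUC-JP@UCS, while passing empty language and territory strings yields the codeset-only form EUC-JP@UCS.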
Table 15 lists the names of some of the supported common codesets. The names match those assigned by IANA (see http://www.iana.org/assignments/character-sets). For a complete list of supported codesets, see the contents of the nls/charmaps/ directory in the distribution of this implementation of the C++ Standard Library.
Name | Description |
ANSI_X3.4-1968 | A 7-bit coded character set (ASCII). |
BIG5 | A character set used to represent Chinese text in Taiwan. |
EBCDIC | Extended Binary Coded Decimal Interchange Code. An 8-bit coded character set for information interchange between IBM computers. |
EUC-JP | A multibyte encoding of the ANSI_X3.4-1968, JIS_X0201, JIS_X0208-1983, JIS_X0212-1990 character sets used to encode Japanese text. |
EUC-KR | A multibyte encoding of ANSI_X3.4-1968, KSC5601-1987 used to encode Korean text. |
EUC-TW | A multibyte encoding of CNS 11643-1992, planes 1 through 16, used to encode Taiwanese text. |
GB2312 | A character set used to represent Chinese text in China, encoded with the EUC encoding. |
ISO-646 | A 7-bit coded character set for information interchange (identical to ASCII). |
ISO-2022 | Character code structure and extension techniques used for switching between code sets in 7-bit and 8-bit environments. |
ISO-2022-JP | A stateful multibyte shift encoding encompassing the ANSI_X3.4-1968 and EUC-JP encodings. |
ISO-2022-JP-2 | A stateful multibyte shift encoding encompassing the ANSI_X3.4-1968, EUC-JP, EUC-KR, GB2312, ISO-8859-1 and ISO-8859-7 encodings. |
ISO-8859-1 | A fixed-width, single-byte coded character set, also known as Latin 1, used by most West European languages such as French (fr), Spanish (es), Catalan (ca), Basque (eu), Portuguese (pt), Italian (it), Albanian (sq), Rhaeto-Romanic (rm), Dutch (nl), German (de), Danish (da), Swedish (sv), Norwegian (no), Finnish (fi), Faroese (fo), Icelandic (is), Irish (ga), Scottish (gd), and English (en), as well as Afrikaans (af) and Swahili (sw). |
ISO-8859-2 | A fixed-width, single-byte coded character set, also known as Latin 2, used by Central and Eastern European languages such as Czech (cs), Hungarian (hu), Polish (pl), Romanian (ro), Croatian (hr), Slovak (sk), Slovenian (sl), and Serbian (sr). |
ISO-8859-4 | A fixed-width, single-byte coded character set, also known as Latin 4, used by Estonian (et), the Baltic languages Latvian (lv, Lettish) and Lithuanian (lt), Greenlandic (kl), and Lappish. |
ISO-8859-5 | A fixed-width, single-byte coded character set, used to encode Cyrillic alphabets used by Bulgarian (bg), Byelorussian (be), Macedonian (mk), Russian (ru), Serbian (sr) and Ukrainian (uk). |
ISO-8859-6 | A fixed-width, single-byte coded character set, used to encode Arabic alphabets. |
ISO-8859-7 | A fixed-width, single-byte coded character set, used to encode Greek alphabets. |
ISO-8859-8 | A fixed-width, single-byte coded character set, used to encode Hebrew alphabets used by Hebrew (iw) and Yiddish (ji). |
ISO-8859-15 | An update to Latin 1 that includes the Euro currency symbol. |
Shift_JIS | A stateless multibyte shift encoding used to encode Japanese character sets. |
UTF-8 | A multibyte encoding used to encode the Universal Character Set (also referred to as UNICODE). |
NOTE -- The behavior of the facet member functions relies on the availability of locale database files produced by the localedef utility provided with this implementation from the character set description files shipped with this implementation of the C++ Standard Library (or their equivalents). In particular, the functionality for ISO-2022-JP is dependent on the ANSI_X3.4-1968 and EUC-JP encodings, while ISO-2022-JP-2 is dependent on the ANSI_X3.4-1968, EUC-JP, EUC-KR, GB2312, ISO-8859-1 and ISO-8859-7 encodings. The appropriate databases are produced automatically whenever a locale that uses such an encoding is built. For example, the codecvt encoding database for EUC-JP is built whenever the Japanese locale ja_JP.EUC-JP is built. It is also possible to create the databases by providing a dummy (empty) locale definition file and processing it with the character set description file corresponding to the desired codecvt database. The facet's member functions indicate an error by returning codecvt_base::error if the database required to perform a given conversion is not found.
virtual bool do_always_noconv() const throw();
Returns true if no conversion is required and false otherwise. The primary template codecvt_byname delegates to the base class, codecvt. The codecvt_byname<wchar_t, char, mbstate_t> specialization returns false.
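A program can inspect these properties through the public members always_noconv(), max_length(), and encoding() of the installed facet. The sketch below assumes a hypothetical CvtProperties struct and query function, neither of which is part of the library. Constructing the facet may fail for a name the implementation does not recognize; many implementations report this by throwing std::runtime_error.

```cpp
#include <cwchar>
#include <locale>

typedef std::codecvt<wchar_t, char, std::mbstate_t>        BaseCvt;
typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> Cvt;

// Hypothetical aggregate holding the properties reported by a facet.
struct CvtProperties {
    bool always_noconv;  // true if in() and out() never need to convert
    int  max_length;     // most external chars needed for one wchar_t
    int  encoding;       // -1: state-dependent, 0: variable width, N: fixed
};

// Installs a facet for `name` in a temporary locale and queries it through
// the public, non-virtual interface of the base class.
CvtProperties query(const char* name) {
    const std::locale loc(std::locale::classic(), new Cvt(name));
    const BaseCvt& cvt = std::use_facet<BaseCvt>(loc);
    CvtProperties p = { cvt.always_noconv(), cvt.max_length(), cvt.encoding() };
    return p;
}
```

A caller that sees always_noconv set to false must route all input and output through in() and out() rather than copying characters directly.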
virtual codecvt_base::result
do_in(state_type& state, const extern_type *from,
      const extern_type *from_end, const extern_type*& from_next,
      intern_type *to, intern_type *to_limit,
      intern_type*& to_next) const;

virtual codecvt_base::result
do_out(state_type& state, const intern_type *from,
       const intern_type *from_end, const intern_type*& from_next,
       extern_type *to, extern_type *to_limit,
       extern_type*& to_next) const;
For preconditions and return values, see codecvt.
codecvt_byname<wchar_t, char, mbstate_t> specialization:
The state object stores the state of the conversion in between successive calls to the functions for stateful conversions (for example, ISO-2022, ISO-2022-JP, etc.). The behavior of the functions is undefined if the state object has not been properly initialized. It is the responsibility of the user of the codecvt_byname object to initialize the state object (by setting all its bits to 0) whenever starting a new conversion. When called with a properly initialized state argument, the function may allocate resources such as file descriptors and/or storage. In order to release any resources allocated by the facet, the calling program must call do_unshift() or otherwise allow the facet to return the conversion state to its initial shift state (all bits set to 0) prior to disposing of the state object. The size of the mbstate_t type on a particular platform determines the maximum possible number of simultaneous conversions.
Conversions between different external codesets are possible using two separate facet objects, each of which uses UCS as the internal wchar_t representation. Note that such conversions will fail with an error if the functions encounter a UCS character that cannot be encoded in the destination encoding.
After an error, the condition of the state object is unspecified; the object should be reset to its initial shift state (by setting all its bits to 0) before reuse.
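The state handling described above might look like the following sketch, which calls the public in() member with a caller-supplied state. The function to_wide is a hypothetical helper, not a library function. It assumes that each external character produces at most one internal character and, for simplicity, treats every result other than codecvt_base::ok as a failure.

```cpp
#include <cstring>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> BaseCvt;

// Hypothetical helper: converts `src` to wide characters with `cvt`.
// `state` must be in the initial shift state (all bits 0) on entry.
bool to_wide(const BaseCvt& cvt, std::mbstate_t& state,
             const std::string& src, std::wstring& dst) {
    // Assumption: one external character yields at most one wchar_t.
    std::vector<wchar_t> buf(src.size() + 1);

    const char* from      = src.data();
    const char* from_next = from;
    wchar_t*    to_next   = &buf[0];

    const std::codecvt_base::result res =
        cvt.in(state, from, from + src.size(), from_next,
               &buf[0], &buf[0] + src.size(), to_next);

    if (res != std::codecvt_base::ok) {
        // The state is unspecified after an error: reset it before reuse.
        std::memset(&state, 0, sizeof state);
        return false;
    }
    dst.assign(&buf[0], to_next);
    return true;
}
```

The caller creates the state object, clears it with std::memset before the first call, and passes the same object to successive calls that continue one conversion.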
virtual result do_unshift(state_type& state, extern_type *to, extern_type *to_limit, extern_type* &to_next) const;
In order to release any resources allocated by the facet, the calling program must call do_unshift() or otherwise return the conversion state to its initial shift state (all its bits set to 0) prior to disposing of the state object.
codecvt_byname<wchar_t,char,mbstate_t> specialization:
See do_in() and do_out() for state parameter details.
For preconditions and return values, see codecvt.
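As an illustration of where unshift() fits, the sketch below converts a wide string with out() and then calls unshift() to append whatever sequence returns the external encoding to its initial shift state. The function to_narrow is a hypothetical helper; the buffer size assumes that the unshift sequence is no longer than max_length() characters, which this page does not guarantee.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

typedef std::codecvt<wchar_t, char, std::mbstate_t> BaseCvt;

// Hypothetical helper: converts `src` to the external encoding of `cvt`
// and terminates the output in the initial shift state.
bool to_narrow(const BaseCvt& cvt, const std::wstring& src, std::string& dst) {
    std::mbstate_t state;
    std::memset(&state, 0, sizeof state);   // initial shift state

    // Assumption: room for max_length() chars per wchar_t plus the unshift.
    const std::size_t max_len = static_cast<std::size_t>(cvt.max_length());
    std::vector<char> buf((src.size() + 1) * max_len + 1);
    char* const buf_end = &buf[0] + buf.size();

    const wchar_t* from_next = src.data();
    char*          to_next   = &buf[0];

    std::codecvt_base::result res =
        cvt.out(state, src.data(), src.data() + src.size(), from_next,
                &buf[0], buf_end, to_next);
    if (res != std::codecvt_base::ok)
        return false;

    // Append the sequence, if any, that restores the initial shift state.
    char* unshift_next = to_next;
    res = cvt.unshift(state, to_next, buf_end, unshift_next);
    if (res != std::codecvt_base::ok && res != std::codecvt_base::noconv)
        return false;

    dst.assign(&buf[0], unshift_next);
    return true;
}
```

For a stateless encoding, unshift() typically writes nothing and returns codecvt_base::noconv, so the call is harmless when it is not strictly required.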
Suppose a program needs to convert a text file named input.euc-jp, encoded in the EUC-JP encoding, to another text file, named output.utf-8, encoded in UTF-8. The following steps will enable the program to do so with the codecvt_byname<wchar_t, char, mbstate_t> specialization of the facet.
Using the localedef utility, create a binary codecvt database file from the EUC-JP character set description file and an empty locale definition:
$ echo | localedef -c -f nls/charmaps/EUC-JP dummy-euc
Optionally, using the localedef utility, create a binary codecvt database file from the UTF-8 character set description file and an empty locale definition:
$ echo | localedef -c -f nls/charmaps/UTF-8 dummy-utf
This step is optional since UCS and UTF conversions can be performed entirely algorithmically, albeit without validating the source characters.
The commands above will create the conversion databases named EUC-JP and UTF-8 and two empty directories named dummy-euc and dummy-utf in the current working directory. The empty directories can be removed.
Create two codecvt_byname<wchar_t, char, mbstate_t> objects, one for each encoding, and install each in an arbitrary valid locale object:
typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> Cvt;
const std::locale euc (std::locale ("C"), new Cvt ("EUC-JP@UCS"));
const std::locale utf (std::locale ("C"), new Cvt ("UTF-8"));
Imbue each locale object in a wide character file stream or file buffer object:
std::wcin.imbue (euc);
std::wcout.imbue (utf);
It is not important whether the locales are imbued in the stream objects as shown above or the stream buffer objects associated with the streams as shown below, as long as the stream buffers are of the std::basic_filebuf<wchar_t> type, or derivatives thereof:
std::wcin.rdbuf ()->imbue (euc);
std::wcout.rdbuf ()->imbue (utf);
Copy the input stream into the output stream:
std::wcout << std::wcin.rdbuf ();
Compile the program into the executable, say euc2utf, and run it, redirecting its stdin from the source text file input.euc-jp, and its stdout to the destination text file output.utf-8:
$ RWSTD_LOCALE_ROOT=. ./euc2utf < input.euc-jp > output.utf-8
It is important to set the ${RWSTD_LOCALE_ROOT} environment variable in order for the facet to find the locale databases created above.
A complete example program with the functionality described above might look like this:
#include <iostream>
#include <locale>

int main ()
{
    typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> Cvt;

    // install a facet for each encoding in the standard wide streams
    std::wcin.imbue (std::locale (std::locale ("C"), new Cvt ("EUC-JP@UCS")));
    std::wcout.imbue (std::locale (std::locale ("C"), new Cvt ("UTF-8")));

    // copy the input stream into the output stream
    std::wcout << std::wcin.rdbuf ();

    // return 0 on success, 1 on failure
    return !(std::wcin.good () && std::wcout.good ());
}
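The same conversion can also be written against named files instead of redirected standard streams. The following is a minimal sketch under stated assumptions: convert_file is a hypothetical function, and catching std::runtime_error reflects how many implementations report an unrecognized facet name rather than anything guaranteed by this page.

```cpp
#include <cwchar>
#include <fstream>
#include <locale>
#include <stdexcept>

typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> Cvt;

// Hypothetical helper: copies the file `in_path`, encoded as `in_name`,
// to `out_path`, encoded as `out_name`. Returns false if a facet cannot
// be constructed, a file cannot be opened, or the copy fails.
bool convert_file(const char* in_path,  const char* in_name,
                  const char* out_path, const char* out_name) {
    try {
        std::wifstream in;
        std::wofstream out;

        // Imbue before opening so the file buffers use only these facets.
        in.imbue(std::locale(std::locale::classic(), new Cvt(in_name)));
        out.imbue(std::locale(std::locale::classic(), new Cvt(out_name)));

        in.open(in_path, std::ios::in | std::ios::binary);
        out.open(out_path, std::ios::out | std::ios::binary);
        if (!in.is_open() || !out.is_open())
            return false;

        out << in.rdbuf();   // sets failbit if no characters were copied
        out.flush();
        return out.good();
    }
    catch (const std::runtime_error&) {
        return false;        // for example, an unrecognized facet name
    }
}
```

With the databases described above in place, a call such as convert_file("input.euc-jp", "EUC-JP@UCS", "output.utf-8", "UTF-8") would correspond to the stream-based program; on other implementations those facet names may not be recognized.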
locale, Facets, codecvt, localedef utility
ISO/IEC 14882:1998 -- International Standard for Information Systems -- Programming Language C++, Section 22.2.1.6