Apache C++ Standard Library Reference Guide

codecvt_byname

Library:  Localization


Inheritance: codecvt_byname derives from codecvt, which in turn derives from codecvt_base and locale::facet.

Summary

A facet that performs conversions between named and unnamed encodings and character sets.

Synopsis

#include <locale>

namespace std {
  template <class internT, class externT, class stateT> 
  class codecvt_byname;
}

Specializations

Description

The codecvt_byname template provides the same functionality as the codecvt template, but for a particular named locale. For a description of the member functions of codecvt_byname, see the entry for codecvt.

Interface

Constructors

explicit codecvt_byname(const char* name, size_t refs = 0); 

Constructs a codecvt_byname object. Calls codecvt::codecvt(refs).

The initial value of the object's reference count is set to refs. A codecvt_byname object f constructed with (refs == 0) and installed in one or more locale objects is destroyed, and the storage it occupies is deallocated, when the last locale object containing the facet is destroyed, as if by calling delete static_cast<locale::facet*>(&f). A codecvt_byname object constructed with (refs != 0) is not destroyed by any locale object in which it may have been installed.

The primary template behaves identically to the codecvt facet. The following details are particular to codecvt_byname<wchar_t, char, mbstate_t>:

The format of the name argument recognized by the codecvt_byname facet is a superset of the formats recognized by the locale constructors that accept a locale name as an argument. The extended format is as follows:

language[_territory][.codeset[@modifiers]] | codeset[@modifiers] | special_name

Where:

If the codeset component is provided, the language and territory components are optional and allowed only for compatibility with the format of locale names accepted by the class locale constructors, as the facet only makes use of the codeset and encoding information.

Table 15 lists the names of some of the supported common codesets. The names match those assigned by IANA (see http://www.iana.org/assignments/character-sets). For a complete list of supported codesets, see the contents of the nls/charmaps/ directory in the distribution of this implementation of the C++ Standard Library.

Table 15: Supported common codesets

Name Description

ANSI_X3.4-1968

A 7-bit coded character set (ASCII).

BIG5

A character set used to represent Chinese text in Taiwan.

EBCDIC

Extended Binary Coded Decimal Interchange Code. An 8-bit coded character set for information interchange between IBM computers.

EUC-JP

A multibyte encoding of the ANSI_X3.4-1968, JIS_X0201, JIS_X0208-1983, JIS_X0212-1990 character sets used to encode Japanese text.

EUC-KR

A multibyte encoding of ANSI_X3.4-1968, KSC5601-1987 used to encode Korean text.

EUC-TW

A multibyte encoding of CNS 11643-1992, planes 1 through 16, used to encode Taiwanese text.

GB2312

A character set used to represent Chinese text in China, encoded with the EUC encoding.

ISO-646

A 7-bit coded character set for information interchange (identical to ASCII).

ISO-2022

Character code structure and extension techniques used for switching between code sets in 7-bit and 8-bit environments.

ISO-2022-JP

A stateful multibyte shift encoding encompassing the ANSI_X3.4-1968 and EUC-JP encodings.

ISO-2022-JP-2

A stateful multibyte shift encoding encompassing the ANSI_X3.4-1968, EUC-JP, EUC-KR, GB2312, ISO-8859-1 and ISO-8859-7 encodings.

ISO-8859-1

A fixed-width, single-byte coded character set, also known as Latin 1, used by most West European languages such as French (fr), Spanish (es), Catalan (ca), Basque (eu), Portuguese (pt), Italian (it), Albanian (sq), Rhaeto-Romanic (rm), Dutch (nl), German (de), Danish (da), Swedish (sv), Norwegian (no), Finnish (fi), Faroese (fo), Icelandic (is), Irish (ga), Scottish (gd), and English (en), as well as Afrikaans (af) and Swahili (sw).

ISO-8859-2

A fixed-width, single-byte coded character set, also known as Latin 2, used by Central and Eastern European languages such as Czech (cs), Hungarian (hu), Polish (pl), Romanian (ro), Croatian (hr), Slovak (sk), Slovenian (sl), and Serbian (sr).

ISO-8859-4

A fixed-width, single-byte coded character set, also known as Latin 4, used by Estonian (et), the Baltic languages Latvian (lv, Lettish) and Lithuanian (lt), Greenlandic (kl), and Lappish.

ISO-8859-5

A fixed-width, single-byte coded character set, used to encode Cyrillic alphabets used by Bulgarian (bg), Byelorussian (be), Macedonian (mk), Russian (ru), Serbian (sr) and Ukrainian (uk).

ISO-8859-6

A fixed-width, single-byte coded character set, used to encode Arabic alphabets.

ISO-8859-7

A fixed-width, single-byte coded character set, used to encode Greek alphabets.

ISO-8859-8

A fixed-width, single-byte coded character set, used to encode the Hebrew alphabet used by Hebrew (iw) and Yiddish (ji).

ISO-8859-15

An update to Latin 1 that includes the Euro currency symbol.

Shift_JIS

A stateless multibyte shift encoding used to encode Japanese character sets.

UTF-8

A multibyte encoding used to encode the Universal Character Set (also referred to as Unicode).


NOTE -- The behavior of the facet member functions relies on the availability of locale database files. These are produced by the localedef utility provided with this implementation from the character set description files shipped with this implementation of the C++ Standard Library (or their equivalents). In particular, the functionality for ISO-2022-JP depends on the ANSI_X3.4-1968 and EUC-JP encodings, while ISO-2022-JP-2 depends on the ANSI_X3.4-1968, EUC-JP, EUC-KR, GB2312, ISO-8859-1 and ISO-8859-7 encodings. The appropriate databases are produced automatically whenever a locale that uses such an encoding is built. For example, the codecvt encoding database for EUC-JP is built whenever the Japanese locale ja_JP.EUC-JP is built. It is also possible to create a database by providing a dummy (empty) locale definition file and processing it together with the character set description file corresponding to the desired codecvt database. If the database required to perform a given conversion is not found, the facet's member functions indicate an error by returning codecvt_base::error.

Protected Members

virtual bool
do_always_noconv() const throw();

virtual codecvt_base::result
do_in(state_type& state,
      const extern_type *from,
      const extern_type *from_end,
      const extern_type*& from_next,
      intern_type *to, intern_type *to_limit,
      intern_type*& to_next) const;

virtual codecvt_base::result
do_out(state_type& state,
       const intern_type *from,
       const intern_type *from_end,
       const intern_type*& from_next,
       extern_type *to, extern_type *to_limit,
       extern_type*& to_next) const;

virtual codecvt_base::result
do_unshift(state_type& state,
           extern_type *to, extern_type *to_limit,
           extern_type*& to_next) const;

Example

Suppose a program needs to convert a text file named input.euc-jp, encoded in the EUC-JP encoding, to another text file, named output.utf-8, encoded in UTF-8. The following steps will enable the program to do so with the codecvt_byname<wchar_t, char, mbstate_t> specialization of the facet.

  1. Using the localedef utility, create a binary codecvt database file from the EUC-JP character set description file and an empty locale definition:
    $ echo | localedef -c -f nls/charmaps/EUC-JP dummy-euc

  2. Optionally, using the localedef utility, create a binary codecvt database file from the UTF-8 character set description file and an empty locale definition:
    $ echo | localedef -c -f nls/charmaps/UTF-8 dummy-utf
    This step is optional since UCS and UTF conversions can be performed entirely algorithmically, albeit without validating the source characters.

  3. The commands above will create the conversion databases named EUC-JP and UTF-8 and two empty directories named dummy-euc and dummy-utf in the current working directory. The empty directories can be removed.

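  4. Create two codecvt_byname<wchar_t, char, mbstate_t> objects, one named "EUC-JP" and one named "UTF-8", and install each in an arbitrary valid locale object. The two resulting locale objects are referred to as euc and utf in the next step.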

  5. Imbue each locale object in a wide character file stream or file buffer object:
    std::wcin.imbue (euc);
    std::wcout.imbue (utf);
    It does not matter whether the locales are imbued in the stream objects, as shown above, or in the stream buffer objects associated with the streams (for example, with std::wcin.rdbuf ()->pubimbue (euc)), as long as the stream buffers are of the std::basic_filebuf<wchar_t> type or a type derived from it.

  6. Copy the input stream into the output stream, for example with std::wcout << std::wcin.rdbuf ().

  7. Compile the program into an executable, say euc2utf, and run it, redirecting its stdin from the source text file input.euc-jp and its stdout to the destination text file output.utf-8:
    $ RWSTD_LOCALE_ROOT=. ./euc2utf < input.euc-jp > output.utf-8
    The ${RWSTD_LOCALE_ROOT} environment variable must be set in the environment of euc2utf itself so that the facet can find the locale databases created above.

A complete example program with the functionality described above might look like this:

See Also

locale, Facets, codecvt, localedef utility

Standards Conformance

ISO/IEC 14882:1998 -- International Standard for Information Systems -- Programming Language C++, Section 22.2.1.6


