MBINTOWCR(3)          DragonFly Library Functions Manual          MBINTOWCR(3)

NAME
     mbintowcr, mbintowcr_l, utf8towcr, wcrtombin, wcrtombin_l, wcrtoutf8 -
     8-bit-clean wchar conversion w/escaping or validation

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <wchar.h>

     size_t
     mbintowcr(wchar_t * restrict dst, const char * restrict src, size_t dlen,
         size_t *slen, int flags);

     size_t
     utf8towcr(wchar_t * restrict dst, const char * restrict src, size_t dlen,
         size_t *slen, int flags);

     size_t
     wcrtombin(char * restrict dst, const wchar_t * restrict src, size_t dlen,
         size_t *slen, int flags);

     size_t
     wcrtoutf8(char * restrict dst, const wchar_t * restrict src, size_t dlen,
         size_t *slen, int flags);

     #include <xlocale.h>

     size_t
     mbintowcr_l(wchar_t * restrict dst, const char * restrict src,
         size_t dlen, size_t *slen, locale_t locale, int flags);

     size_t
     wcrtombin_l(char * restrict dst, const wchar_t * restrict src,
         size_t dlen, size_t *slen, locale_t locale, int flags);

DESCRIPTION
     The mbintowcr() and wcrtombin() functions translate byte data into wide-
     char format and back again.  Under normal conditions (but not with all
     flags) these functions guarantee that the round-trip will be 8-bit-clean.
     The WCSBIN_EOF flag must be specified on the final buffer so that a
     trailing incomplete sequence at stream EOF is handled correctly.

     For the "C" locale these functions are 1:1 (do not convert UTF-8).  For
     UTF-8 locales these functions convert to/from UTF-8.  Most of the
     discussion below pertains to UTF-8 translations.

     The utf8towcr() and wcrtoutf8() functions do exactly the same thing as
     the above functions but are locked to the UTF-8 locale.  That is, these
     functions work regardless of which locale has been selected and also do
     not require any initial setlocale() call to initialize.  Applications
     working explicitly in UTF-8 should use these versions.

     Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF).
     Illegal sequences include surrogate-space encodings, non-canonical
     encodings, codings beyond 0x10FFFF, 5-byte and 6-byte codings (which are
     no longer legal), and malformed codings.  Flags may be used to modify
     this behavior.
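
     As an illustration only, the categories of illegal sequence listed above
     can be told apart with a small standalone classifier.  This is a sketch,
     not the libc implementation: classify_utf8() and the seq_class names are
     hypothetical, the sketch assumes that values above 0x10FFFF are the
     out-of-range case, and it reports a truncated sequence as malformed (as
     it would be treated at stream EOF).

```c
#include <stddef.h>

/* Hypothetical category names, used only by this sketch. */
enum seq_class {
	SEQ_OK,			/* well-formed, canonical, in range */
	SEQ_MALFORMED,		/* bad lead byte or bad/missing continuation */
	SEQ_NONCANONICAL,	/* overlong encoding of a smaller value */
	SEQ_SURROGATE,		/* decodes to U+D800 - U+DFFF */
	SEQ_OUT_OF_RANGE,	/* decodes to a value above 0x10FFFF */
	SEQ_LONGCODE		/* legacy 5-byte or 6-byte form */
};

/* Classify the single sequence that starts at s[0]; n bytes are available. */
enum seq_class
classify_utf8(const unsigned char *s, size_t n)
{
	/* Smallest value that legitimately needs a sequence of each length. */
	static const unsigned long min_value[7] = {
		0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
	};
	unsigned long v;
	size_t len, i;

	if (n == 0)
		return SEQ_MALFORMED;
	if (s[0] < 0x80)
		return SEQ_OK;
	if ((s[0] & 0xE0) == 0xC0) {
		len = 2; v = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; v = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; v = s[0] & 0x07;
	} else if ((s[0] & 0xFC) == 0xF8) {
		len = 5; v = s[0] & 0x03;
	} else if ((s[0] & 0xFE) == 0xFC) {
		len = 6; v = s[0] & 0x01;
	} else {
		return SEQ_MALFORMED;	/* stray continuation, 0xFE or 0xFF */
	}
	if (n < len)
		return SEQ_MALFORMED;	/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return SEQ_MALFORMED;
		v = (v << 6) | (s[i] & 0x3F);
	}
	if (v < min_value[len])
		return SEQ_NONCANONICAL;
	if (len > 4)
		return SEQ_LONGCODE;
	if (v >= 0xD800 && v <= 0xDFFF)
		return SEQ_SURROGATE;
	if (v > 0x10FFFF)
		return SEQ_OUT_OF_RANGE;
	return SEQ_OK;
}
```

     For example, the two bytes C0 80 classify as non-canonical, and the three
     bytes ED A0 80 (an encoding of U+D800) classify as a surrogate.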

     The mbintowcr() function takes generic 8-bit byte data as its input which
     the caller expects to be loosely coded in UTF-8 and converts it to an
     array of wchar_t, and returns the number of wchar_t that were converted.
     The caller must set *slen to the number of bytes in the input buffer and
     the function will set *slen on return to the number of bytes in the input
     buffer that were processed.

     Fewer bytes than specified might be processed due to the output buffer
     reaching its limit or due to an incomplete sequence at the end of the
     input buffer when the WCSBIN_EOF flag has not been specified.

     If processing a stream, the caller typically copies any unprocessed data
     at the end of the buffer back to the beginning and then continues loading
     the buffer from there.  Be sure to check for an incomplete translation at
     stream EOF and do a final translation of the remainder with the
     WCSBIN_EOF flag set.
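
     The copy-down step described above might look roughly like the following
     sketch.  To keep it compilable outside DragonFly it takes the converter
     as a function pointer with the same shape as utf8towcr(); the names
     b2w_fn, convert_and_copy_down(), demo_b2w() and DEMO_EOF are hypothetical
     stand-ins, and demo_b2w() is a toy converter (ASCII and 2-byte sequences
     only), not the libc routine.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Same calling shape as utf8towcr() in the SYNOPSIS. */
typedef size_t (*b2w_fn)(wchar_t *, const char *, size_t, size_t *, int);

#define DEMO_EOF 0x1	/* hypothetical stand-in for WCSBIN_EOF */

/*
 * Convert the bytes held in buf[0 .. *fill) and copy any unprocessed
 * tail down to the start of buf.  On return *fill is the number of
 * leftover bytes; the caller appends new input after them.
 */
size_t
convert_and_copy_down(b2w_fn conv, wchar_t *out, size_t outlen,
    char *buf, size_t *fill, int flags)
{
	size_t slen = *fill;
	size_t n = conv(out, buf, outlen, &slen, flags);

	if (n == (size_t)-1)
		return n;	/* leave the buffer untouched on error */
	memmove(buf, buf + slen, *fill - slen);
	*fill -= slen;
	return n;
}

/* Toy converter for demonstration: ASCII plus 2-byte UTF-8 only. */
size_t
demo_b2w(wchar_t *dst, const char *src, size_t dlen, size_t *slen, int flags)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t i = 0, n = 0;

	while (i < *slen && n < dlen) {
		if (s[i] < 0x80) {
			dst[n++] = s[i++];
		} else if (s[i] >= 0xC2 && s[i] <= 0xDF && i + 1 < *slen &&
		    (s[i + 1] & 0xC0) == 0x80) {
			dst[n++] = ((s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F);
			i += 2;
		} else if (s[i] >= 0xC2 && s[i] <= 0xDF && i + 1 == *slen &&
		    !(flags & DEMO_EOF)) {
			break;	/* incomplete tail: wait for more input */
		} else {
			dst[n++] = 0xDC00 + s[i++];	/* escape the byte */
		}
	}
	*slen = i;
	return n;
}
```

     On DragonFly the conv argument would be utf8towcr() or mbintowcr(), with
     WCSBIN_EOF added to the flags only for the final chunk of the stream.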

     This function will always generate escapes for illegal UTF-8 code
     sequences and by default can produce a clean BYTE-WCHAR-BYTE conversion.
     See the flags description later on.

     This function cannot return an error unless the WCSBIN_STRICT flag is
     set.  In case of error, any valid conversions are returned first and the
     caller is expected to iterate.  The error is returned when it becomes the
     first element of the buffer.

     A NULL destination buffer may be specified, in which case this function
     operates identically except that no output is stored.  This feature is
     typically used for validation with WCSBIN_STRICT and sometimes also used
     in combination with WCSBIN_SURRO (set if you want to allow surrogates).

     The wcrtombin() function takes an array of wchar_t as its input which is
     usually expected to be well-formed and converts it to an array of generic
     8-bit byte data.  The caller must set *slen to the number of elements in
     the input buffer and the function will set *slen on return to the number
     of elements in the input buffer that were processed.

     Be sure to properly set the WCSBIN_EOF flag for the last buffer at stream
     EOF.

     This function can return an error regardless of the flags if a supplied
     wchar code is out of range.  Some flags change the range of allowed wchar
     codes.  In case of error, any valid conversions are returned first and
     the caller is expected to iterate.  The error is returned when it becomes
     the first element of the buffer.

     A NULL destination buffer may be specified, in which case this function
     operates identically except that no output is stored.  This feature is
     typically used for validation with or without WCSBIN_STRICT and sometimes
     also used in combination with WCSBIN_SURRO.

     One final note on the use of WCSBIN_SURRO for wchars-to-bytes.  If this
     flag is not set, surrogates in the escape range will be de-escaped
     (giving us our 8-bit-clean round-trip), and other surrogates will be
     passed through as UTF-8 encodings.  In WCSBIN_STRICT mode this flag works
     slightly differently.  If it is not specified, no surrogates are allowed
     at all (escaped or otherwise); if it is specified, all surrogates are
     allowed and will never be de-escaped.
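
     The four cases in the preceding paragraph can be summarized as a small
     decision function.  This is only a restatement of the text as code:
     surrogate_action() and the w2b_action names are hypothetical, and the
     strict and surro arguments are plain booleans standing in for whether
     WCSBIN_STRICT and WCSBIN_SURRO are set.

```c
/* Hypothetical outcome names, used only by this sketch. */
enum w2b_action {
	W2B_NOT_SURROGATE,	/* not in U+D800 - U+DFFF; rule does not apply */
	W2B_DEESCAPE,		/* emit the original byte */
	W2B_ENCODE_UTF8,	/* pass through, encoded as UTF-8 */
	W2B_ERROR		/* rejected */
};

/* What wchars-to-bytes does with wc, per the DESCRIPTION text. */
enum w2b_action
surrogate_action(unsigned long wc, int strict, int surro)
{
	if (wc < 0xD800 || wc > 0xDFFF)
		return W2B_NOT_SURROGATE;
	if (strict)		/* strict: all or nothing, never de-escaped */
		return surro ? W2B_ENCODE_UTF8 : W2B_ERROR;
	if (!surro && wc >= 0xDC80 && wc <= 0xDCFF)
		return W2B_DEESCAPE;	/* default 8-bit-clean round-trip */
	return W2B_ENCODE_UTF8;
}
```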

     The _l-suffixed versions of mbintowcr() and wcrtombin() take an explicit
     locale argument, whereas the non-suffixed versions use the current global
     or per-thread locale.

UTF-8B ESCAPE SEQUENCES
     Escaping is handled by converting one or more bytes in the byte sequence
     to the UTF-8B escape wchar (U+DC80 - U+DCFF).  Most illegal sequences
     escape the first byte and then reprocess the remaining bytes.  An illegal
     byte sequence length (5 or 6 bytes), non-canonical encoding, or illegal
     wchar value (beyond 0x10FFFF if not modified by flags) will escape all
     bytes in the sequence as long as they were not malformed.

     When converting back to a byte-sequence, if not modified by flags, UTF-8B
     escape wchars are converted back to their original bytes.  Other
     surrogate codes (U+D800 - U+DFFF which are normally illegal) will be
     passed through and encoded as UTF-8.
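
     As a rough sketch of the escape mapping, the usual UTF-8B convention
     places byte 0xNN at U+DCNN, which yields exactly the U+DC80 - U+DCFF
     range given above for bytes 0x80 through 0xFF.  The helper names below
     are hypothetical and the code assumes that convention; it is not taken
     from the libc source.

```c
#include <wchar.h>

/* Nonzero if wc lies in the UTF-8B escape range U+DC80 - U+DCFF. */
int
utf8b_is_escape(wchar_t wc)
{
	return wc >= 0xDC80 && wc <= 0xDCFF;
}

/*
 * Escape one byte.  Only bytes 0x80 - 0xFF are meaningful here, since
 * an ASCII byte is always a valid sequence on its own.
 */
wchar_t
utf8b_escape(unsigned char b)
{
	return (wchar_t)(0xDC00 + b);
}

/* Recover the original byte from an escape wchar. */
unsigned char
utf8b_unescape(wchar_t wc)
{
	return (unsigned char)(wc - 0xDC00);
}
```

     Under this mapping a stray 0xC3 byte becomes U+DCC3 on input and is
     turned back into 0xC3 on output, which is what makes the round-trip
     8-bit-clean.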

FLAGS
     WCSBIN_EOF            Indicate that the input buffer represents the last
                           of the input stream.  This causes any partial
                           sequences at the end of the input buffer to be
                           processed.

     WCSBIN_SURRO          This flag passes through any surrogate codes that
                           are already UTF-8-encoded.  This is normally
                           illegal, but if you are processing a stream which
                           has already been UTF-8B escaped this flag will
                           prevent the U+DC80 - U+DCFF codes from being re-
                           escaped in the bytes-to-wchars direction and will
                           prevent decoding back to the original bytes in the
                           wchars-to-bytes direction.  This flag is sometimes
                           used on input if the caller expects the input
                           stream to already be escaped, and is not usually
                           used on output unless the caller explicitly wants
                           to encode to an intermediate illegal UTF-8 encoding
                           that retains the escapes as escapes.

                           This flag does not prevent additional escapes from
                           being translated on bytes-to-wchars (WCSBIN_STRICT
                           prevents escaping on bytes-to-wchars), but will
                           prevent de-escaping on wchars-to-bytes.

                           This flag breaks round-trip 8-bit-clean operation
                           since escape codes use the surrogate space and will
                           mix with surrogates that are passed through on
                           input by this flag in a way that cannot be
                           distinguished.

     WCSBIN_LONGCODES      Specifying this flag in the bytes-to-wchars
                           direction allows for decoding of legacy 5-byte and
                           6-byte sequences as well as 4-byte sequences which
                           would normally be illegal.  These sequences are
                           illegal and this flag should not normally be used
                           unless the caller explicitly wants to handle the
                           legacy case.

                           Specifying this flag in the wchars-to-bytes
                           direction allows normally illegal wchars to be
                           encoded.  Again, not recommended.

                           This flag does not allow decoding non-canonical
                           sequences.  Such sequences will still be escaped.

     WCSBIN_STRICT         This flag forces strict parsing in the bytes-to-
                           wchars direction and will cause mbintowcr() to
                           process short or return with an error once
                           processing reaches the illegal coding rather than
                           escaping the illegal sequence.  This flag is
                           usually specified only when the caller desires to
                           validate a UTF-8 buffer.  Remember that an error
                           may also be present with return values greater than
                           0.  A partial sequence at the end of the buffer is
                           not considered to be an error unless WCSBIN_EOF is
                           also specified.

                           The caller is reminded that when using this feature
                           for validation, a short return can happen rather
                           than an error if the error is not at the start of
                           the source buffer or if WCSBIN_EOF is not
                           specified.  If the caller is not chaining buffers,
                           WCSBIN_EOF should be specified, and a simple check
                           of whether *slen equals the original input buffer
                           length on return is sufficient to determine whether
                           an error occurred.  If the caller is chaining
                           buffers, WCSBIN_EOF is not specified and the caller
                           must proceed with the copy-down / continued buffer
                           loading loop to distinguish between an incomplete
                           buffer and an error.

RETURN VALUES
     The mbintowcr(), mbintowcr_l(), utf8towcr(), wcrtombin(), wcrtombin_l()
     and wcrtoutf8() functions return the number of output elements generated
     and set *slen to the number of input elements converted.  If an error
     occurs but the output buffer has already been populated, a short return
     will occur and the next iteration where the error is the first element
     will return the error.  The caller is responsible for processing any
     error conditions before continuing.
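
     The "short return first, error on the next call" behavior suggests an
     iteration pattern along the following lines.  This is a sketch under
     stated assumptions: first_error_offset() and b2w_fn are hypothetical
     names, and demo_strict_b2w() is a toy strict converter that accepts only
     ASCII, included so that the example compiles without the DragonFly libc.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Same calling shape as utf8towcr() in the SYNOPSIS. */
typedef size_t (*b2w_fn)(wchar_t *, const char *, size_t, size_t *, int);

/*
 * Call conv repeatedly until it reports an error or stops making
 * progress.  Returns the byte offset at which conversion stopped,
 * which equals len when the whole buffer converted cleanly.
 */
size_t
first_error_offset(b2w_fn conv, const char *src, size_t len, int flags)
{
	wchar_t out[64];
	size_t off = 0;

	while (off < len) {
		size_t slen = len - off;
		size_t n = conv(out, src + off, 64, &slen, flags);

		if (n == (size_t)-1)
			return off;	/* error is now the first element */
		if (slen == 0)
			break;		/* no progress, e.g. incomplete tail */
		off += slen;
	}
	return off;
}

/* Toy strict converter for demonstration: accepts ASCII only. */
size_t
demo_strict_b2w(wchar_t *dst, const char *src, size_t dlen, size_t *slen,
    int flags)
{
	size_t i = 0;

	(void)flags;
	while (i < *slen && i < dlen && (unsigned char)src[i] < 0x80) {
		if (dst != NULL)
			dst[i] = (unsigned char)src[i];
		i++;
	}
	if (i == 0 && *slen > 0 && dlen > 0) {
		*slen = 0;
		errno = EILSEQ;		/* bad byte is the first element */
		return (size_t)-1;
	}
	*slen = i;			/* short return before any error */
	return i;
}
```

     On DragonFly the conv argument would be utf8towcr() or mbintowcr() with
     WCSBIN_STRICT | WCSBIN_EOF in flags, so that a trailing partial sequence
     is also reported as an error.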

     The mbintowcr(), mbintowcr_l() and utf8towcr() functions can return a
     (size_t)-1 error if WCSBIN_STRICT is specified, and otherwise cannot.

     The wcrtombin(), wcrtombin_l() and wcrtoutf8() functions can return a
     (size_t)-1 error if given an illegal wchar code, as modified by flags.
     Any wchar code >= 0x80000000U always causes an error to be returned.

ERRORS
     If an error is returned, errno will be set to EILSEQ.

SEE ALSO
     mbtowc(3), multibyte(3), setlocale(3), wcrtomb(3), xlocale(3)

STANDARDS
     The mbintowcr(), mbintowcr_l(), utf8towcr(), wcrtombin(), wcrtombin_l()
     and wcrtoutf8() functions are non-standard extensions to libc.

DragonFly 6.3-DEVELOPMENT       August 24, 2015      DragonFly 6.3-DEVELOPMENT