MBINTOWCR(3) DragonFly Library Functions Manual MBINTOWCR(3)
NAME
mbintowcr, mbintowcr_l, utf8towcr, wcrtombin, wcrtombin_l, wcrtoutf8 -
8-bit-clean wchar conversion w/escaping or validation
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <wchar.h>
size_t
mbintowcr(wchar_t * restrict dst, const char * restrict src, size_t dlen,
size_t *slen, int flags);
size_t
utf8towcr(wchar_t * restrict dst, const char * restrict src, size_t dlen,
size_t *slen, int flags);
size_t
wcrtombin(char * restrict dst, const wchar_t * restrict src, size_t dlen,
size_t *slen, int flags);
size_t
wcrtoutf8(char * restrict dst, const wchar_t * restrict src, size_t dlen,
size_t *slen, int flags);
#include <xlocale.h>
size_t
mbintowcr_l(wchar_t * restrict dst, const char * restrict src,
size_t dlen, size_t *slen, locale_t locale, int flags);
size_t
wcrtombin_l(char * restrict dst, const wchar_t * restrict src,
size_t dlen, size_t *slen, locale_t locale, int flags);
DESCRIPTION
The mbintowcr() and wcrtombin() functions translate byte data into wide-
char format and back again. Under normal conditions (but not with all
flags) these functions guarantee that the round-trip will be 8-bit-clean.
Take care to specify the WCSBIN_EOF flag so that trailing incomplete
sequences are handled correctly at stream EOF.
For the "C" locale these functions are 1:1 (do not convert UTF-8). For
UTF-8 locales these functions convert to/from UTF-8. Most of the
discussion below pertains to UTF-8 translations.
The utf8towcr() and wcrtoutf8() functions do exactly the same thing as
the above functions but are locked to the UTF-8 locale. That is, these
functions work regardless of which locale has been selected and also do
not require any initial setlocale() call to initialize. Applications
working explicitly in UTF-8 should use these versions.
Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF).
Illegal sequences include surrogate-space encodings, non-canonical
encodings, codings beyond 0x10FFFF, 5-byte and 6-byte codings (which are not
legal anymore), and malformed codings. Flags may be used to modify this
behavior.
The mbintowcr() function takes generic 8-bit byte data as its input which
the caller expects to be loosely coded in UTF-8 and converts it to an
array of wchar_t, and returns the number of wchar_t that were converted.
The caller must set *slen to the number of bytes in the input buffer and
the function will set *slen on return to the number of bytes in the input
buffer that were processed.
Fewer bytes than specified might be processed due to the output buffer
reaching its limit or due to an incomplete sequence at the end of the
input buffer when the WCSBIN_EOF flag has not been specified.
If processing a stream, the caller typically copies any unprocessed data
at the end of the buffer back to the beginning and then continues loading
the buffer from there. Be sure to check for an incomplete translation at
stream EOF and do a final translation of the remainder with the
WCSBIN_EOF flag set.
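The copy-down loop described above can be sketched as follows. This is a minimal illustration, not the library's own code. The converter is passed as a function pointer with the same shape as mbintowcr() and utf8towcr(); on DragonFly one of those would be passed directly. The names bin2wc_fn, standin_1to1() and convert_stream() are hypothetical, the stand-in converter exists only so that the sketch builds on systems without these functions, and the fallback WCSBIN_EOF value is a placeholder rather than the real constant.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

#ifndef WCSBIN_EOF
#define WCSBIN_EOF 0x1	/* placeholder so the sketch builds off DragonFly */
#endif

/* Same shape as mbintowcr() and utf8towcr() in the SYNOPSIS. */
typedef size_t (*bin2wc_fn)(wchar_t * restrict, const char * restrict,
    size_t, size_t *, int);

/* Hypothetical stand-in: 1:1 byte-to-wchar copy, like the "C" locale case. */
static size_t
standin_1to1(wchar_t * restrict dst, const char * restrict src, size_t dlen,
    size_t *slen, int flags)
{
	size_t i, n = (*slen < dlen) ? *slen : dlen;

	(void)flags;
	for (i = 0; i < n; i++)
		dst[i] = (unsigned char)src[i];
	*slen = n;
	return (n);
}

/*
 * Convert everything readable from fp into out[0..outmax).
 * Returns the number of wchar_t stored, or (size_t)-1 on a converter error.
 */
static size_t
convert_stream(FILE *fp, bin2wc_fn conv, wchar_t *out, size_t outmax)
{
	char buf[16];		/* deliberately small for illustration */
	size_t have = 0, total = 0;
	int eof = 0;

	while (total < outmax && (!eof || have > 0)) {
		if (!eof && have < sizeof(buf)) {
			size_t n = fread(buf + have, 1, sizeof(buf) - have, fp);

			have += n;
			if (n == 0)
				eof = 1;	/* EOF or read error */
		}
		if (have == 0)
			continue;	/* nothing buffered; loop ends at EOF */

		size_t used = have;	/* in: bytes available, out: bytes consumed */
		size_t got = conv(out + total, buf, outmax - total, &used,
		    eof ? WCSBIN_EOF : 0);

		if (got == (size_t)-1)
			return (got);
		assert(used <= have);
		total += got;

		/* Copy-down: move the unprocessed tail to the front. */
		memmove(buf, buf + used, have - used);
		have -= used;
		if (used == 0 && got == 0 && (eof || have == sizeof(buf)))
			break;		/* no progress is possible */
	}
	return (total);
}
```

The final pass is made with WCSBIN_EOF once fread() reports no more data, so a trailing incomplete sequence is processed rather than left in the buffer.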
This function will always generate escapes for illegal UTF-8 code
sequences and by default can produce a clean BYTE-WCHAR-BYTE conversion.
See the flags description later on.
This function cannot return an error unless the WCSBIN_STRICT flag is
set. In case of error, any valid conversions are returned first and the
caller is expected to iterate. The error is returned when it becomes the
first element of the buffer.
A NULL destination buffer may be specified, in which case this function
operates identically except that no output is stored. This
feature is typically used for validation with WCSBIN_STRICT and sometimes
also used in combination with WCSBIN_SURRO (set if you want to allow
surrogates).
The wcrtombin() function takes an array of wchar_t as its input which is
usually expected to be well-formed and converts it to an array of generic
8-bit byte data. The caller must set *slen to the number of elements in
the input buffer and the function will set *slen on return to the number
of elements in the input buffer that were processed.
Be sure to properly set the WCSBIN_EOF flag for the last buffer at stream
EOF.
This function can return an error regardless of the flags if a supplied
wchar code is out of range. Some flags change the range of allowed wchar
codes. In case of error, any valid conversions are returned first and
the caller is expected to iterate. The error is returned when it becomes
the first element of the buffer.
A NULL destination buffer may be specified, in which case this function
operates identically except that no output is stored. This
feature is typically used for validation with or without WCSBIN_STRICT
and sometimes also used in combination with WCSBIN_SURRO.
One final note on the use of WCSBIN_SURRO for wchars-to-bytes: if this
flag is not set, surrogates in the escape range will be de-escaped (giving
us our 8-bit-clean round-trip), and other surrogates will be passed
through as UTF-8 encodings. In WCSBIN_STRICT mode this flag works
slightly differently. If it is not specified, no surrogates are allowed at
all (escaped or otherwise); if it is specified, all surrogates are allowed
and will never be de-escaped.
The _l-suffixed versions of mbintowcr() and wcrtombin() take an explicit
locale argument, whereas the non-suffixed versions use the current global
or per-thread locale.
UTF-8B ESCAPE SEQUENCES
Escaping is handled by converting one or more bytes in the byte sequence
to the UTF-8B escape wchar (U+DC80 - U+DCFF). Most illegal sequences
escape the first byte and then reprocess the remaining bytes. An illegal
byte sequence length (5 or 6 bytes), non-canonical encoding, or illegal
wchar value (beyond 0x10FFFF if not modified by flags) will escape all
bytes in the sequence as long as they were not malformed.
When converting back to a byte-sequence, if not modified by flags, UTF-8B
escape wchars are converted back to their original bytes. Other
surrogate codes (U+D800 - U+DFFF which are normally illegal) will be
passed through and encoded as UTF-8.
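The escape range can be illustrated with a few helpers. This sketch assumes the conventional UTF-8B mapping in which an offending byte 0xNN (0x80 - 0xFF) becomes the wchar U+DCNN; this page states only the range, so treat the exact mapping as an assumption. The helper names are hypothetical and are not part of libc.

```c
#include <stdbool.h>
#include <wchar.h>

/* True if wc lies in the UTF-8B escape range U+DC80 - U+DCFF. */
static inline bool
is_utf8b_escape(wchar_t wc)
{
	return (wc >= 0xDC80 && wc <= 0xDCFF);
}

/* Escape one offending byte; meaningful only for b >= 0x80. */
static inline wchar_t
utf8b_escape(unsigned char b)
{
	return ((wchar_t)(0xDC00 + b));
}

/* Recover the original byte from an escape wchar. */
static inline unsigned char
utf8b_unescape(wchar_t wc)
{
	return ((unsigned char)(wc - 0xDC00));
}
```

Under this mapping a stray 0xFF byte becomes U+DCFF on input and is turned back into 0xFF on output, which is what makes the round-trip 8-bit-clean.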
FLAGS
WCSBIN_EOF Indicate that the input buffer represents the last
of the input stream. This causes any partial
sequences at the end of the input buffer to be
processed.
WCSBIN_SURRO This flag passes-through any surrogate codes that
are already UTF-8-encoded. This is normally
illegal but if you are processing a stream which
has already been UTF-8B escaped this flag will
prevent the U+DC80 - U+DCFF codes from being
re-escaped in the bytes-to-wchars direction and will
prevent decoding back to the original bytes in the
wchars-to-bytes direction. This
flag is sometimes used on input if the caller
expects the input stream to already be escaped, and
not usually used on output unless the caller
explicitly wants to encode to an intermediate
illegal UTF-8 encoding that retains the escapes as
escapes.
This flag does not prevent additional escapes from
being translated on bytes-to-wchars (WCSBIN_STRICT
prevents escaping on bytes-to-wchars), but will
prevent de-escaping on wchars-to-bytes.
This flag breaks round-trip 8-bit-clean operation
since escape codes use the surrogate space and will
mix with surrogates that are passed through on
input by this flag in a way that cannot be
distinguished.
WCSBIN_LONGCODES Specifying this flag in the bytes-to-wchars
direction allows for decoding of legacy 5-byte and
6-byte sequences as well as 4-byte sequences which
would normally be illegal. These sequences are
illegal and this flag should not normally be used
unless the caller explicitly wants to handle the
legacy case.
Specifying this flag in the wchars-to-bytes
direction allows normally illegal wchars to be
encoded. Again, not recommended.
This flag does not allow decoding non-canonical
sequences. Such sequences will still be escaped.
WCSBIN_STRICT This flag forces strict parsing in the bytes-to-
wchars direction and will cause mbintowcr() to
process short or return with an error once
processing reaches the illegal coding rather than
escaping the illegal sequence. This flag is
usually specified only when the caller desires to
validate a UTF-8 buffer. Remember that an error
may also be present with return values greater than
0. A partial sequence at the end of the buffer is
not considered to be an error unless WCSBIN_EOF is
also specified.
The caller is reminded that when using this feature for
validation, a short-return can happen rather than
an error if the error is not at the base of the
source or if WCSBIN_EOF is not specified. If the
caller is not chaining buffers then WCSBIN_EOF
should be specified and a simple check of whether
*slen equals the original input buffer length on
return is sufficient to determine if an error
occurred or not. If the caller is chaining buffers,
WCSBIN_EOF is not specified and the caller must
proceed with the copy-down / continued buffer
loading loop to distinguish between an incomplete
buffer and an error.
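For the non-chained case, the check described under WCSBIN_STRICT might look like the sketch below. It is illustrative only. The converter is taken as a function pointer so the sketch builds anywhere; on DragonFly, utf8towcr() or mbintowcr() would be passed. The names bin2wc_fn, standin_ascii_strict() and buffer_is_valid() are hypothetical, the stand-in accepts ASCII only and merely imitates the documented short-return and error behavior, and the fallback flag values are placeholders rather than the real constants.

```c
#include <errno.h>
#include <stdbool.h>
#include <wchar.h>

#ifndef WCSBIN_EOF
#define WCSBIN_EOF	0x1	/* placeholder so the sketch builds off DragonFly */
#endif
#ifndef WCSBIN_STRICT
#define WCSBIN_STRICT	0x8	/* placeholder so the sketch builds off DragonFly */
#endif

/* Same shape as mbintowcr() and utf8towcr() in the SYNOPSIS. */
typedef size_t (*bin2wc_fn)(wchar_t * restrict, const char * restrict,
    size_t, size_t *, int);

/* Hypothetical stand-in: strict converter that accepts ASCII only. */
static size_t
standin_ascii_strict(wchar_t * restrict dst, const char * restrict src,
    size_t dlen, size_t *slen, int flags)
{
	size_t i = 0;

	(void)flags;
	while (i < *slen && i < dlen && (unsigned char)src[i] < 0x80) {
		if (dst != NULL)
			dst[i] = (unsigned char)src[i];
		i++;
	}
	if (i == 0 && *slen > 0 && dlen > 0) {
		/* Illegal byte at the base of the buffer: report the error. */
		*slen = 0;
		errno = EILSEQ;
		return ((size_t)-1);
	}
	*slen = i;	/* short return if an illegal byte lies further in */
	return (i);
}

/* Validate a complete, non-chained buffer. */
static bool
buffer_is_valid(bin2wc_fn conv, const char *buf, size_t len)
{
	size_t used = len;
	/*
	 * NULL dst: validate only.  dlen = len is assumed to be enough,
	 * since every output wchar_t consumes at least one input byte.
	 */
	size_t ret = conv(NULL, buf, len, &used, WCSBIN_STRICT | WCSBIN_EOF);

	return (ret != (size_t)-1 && used == len);
}
```

Because WCSBIN_EOF is set, a truncated sequence at the end of the buffer counts as an error instead of being left unprocessed.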
RETURN VALUES
The mbintowcr(), mbintowcr_l(), utf8towcr(), wcrtombin(), wcrtombin_l()
and wcrtoutf8() functions return the number of output elements generated
and set *slen to the number of input elements converted. If an error
occurs but the output buffer has already been populated, a short return
will occur and the next iteration where the error is the first element
will return the error. The caller is responsible for processing any
error conditions before continuing.
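The short-return-then-error behavior means a caller that wants the position of an illegal sequence has to iterate. The sketch below shows one way to do that in strict mode. It is illustrative only. On DragonFly, utf8towcr() or mbintowcr() would be passed as the converter. The names bin2wc_fn, standin_ascii_strict() and first_bad_offset() are hypothetical, the ASCII-only stand-in merely imitates the documented behavior so the sketch builds elsewhere, and the fallback flag values are placeholders rather than the real constants.

```c
#include <errno.h>
#include <wchar.h>

#ifndef WCSBIN_EOF
#define WCSBIN_EOF	0x1	/* placeholder so the sketch builds off DragonFly */
#endif
#ifndef WCSBIN_STRICT
#define WCSBIN_STRICT	0x8	/* placeholder so the sketch builds off DragonFly */
#endif

/* Same shape as mbintowcr() and utf8towcr() in the SYNOPSIS. */
typedef size_t (*bin2wc_fn)(wchar_t * restrict, const char * restrict,
    size_t, size_t *, int);

/* Hypothetical stand-in: strict converter that accepts ASCII only. */
static size_t
standin_ascii_strict(wchar_t * restrict dst, const char * restrict src,
    size_t dlen, size_t *slen, int flags)
{
	size_t i = 0;

	(void)flags;
	while (i < *slen && i < dlen && (unsigned char)src[i] < 0x80) {
		if (dst != NULL)
			dst[i] = (unsigned char)src[i];
		i++;
	}
	if (i == 0 && *slen > 0 && dlen > 0) {
		*slen = 0;
		errno = EILSEQ;	/* error is the first element */
		return ((size_t)-1);
	}
	*slen = i;		/* valid conversions are returned first */
	return (i);
}

/*
 * Return the byte offset of the first illegal sequence in buf,
 * or len if the whole buffer is valid.
 */
static size_t
first_bad_offset(bin2wc_fn conv, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		size_t used = len - off;
		size_t ret = conv(NULL, buf + off, len - off, &used,
		    WCSBIN_STRICT | WCSBIN_EOF);

		if (ret == (size_t)-1)
			return (off);	/* error sits at the base of this pass */
		if (used == 0)
			return (off);	/* defensive: no progress was made */
		off += used;		/* short return: iterate from here */
	}
	return (len);
}
```

The first pass consumes the valid prefix and returns short. The second pass starts at the illegal sequence, so the error is returned immediately and off holds its position.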
The mbintowcr(), mbintowcr_l() and utf8towcr() functions can return a
(size_t)-1 error if WCSBIN_STRICT is specified, and otherwise cannot.
The wcrtombin(), wcrtombin_l() and wcrtoutf8() functions can return a
(size_t)-1 error if given an illegal wchar code, as modified by flags.
Any wchar code >= 0x80000000U always causes an error to be returned.
ERRORS
If an error is returned, errno will be set to EILSEQ.
SEE ALSO
mbtowc(3), multibyte(3), setlocale(3), wcrtomb(3), xlocale(3)
STANDARDS
The mbintowcr(), mbintowcr_l(), utf8towcr(), wcrtombin(), wcrtombin_l()
and wcrtoutf8() functions are non-standard extensions to libc.
DragonFly 6.3-DEVELOPMENT August 24, 2015 DragonFly 6.3-DEVELOPMENT