UTF(3) DragonFly Library Functions Manual UTF(3)
NAME
runetochar, chartorune, runelen, fullrune, utflen, utfrune, utfrrune,
utfutf - Unicode Text Format functionality
SYNOPSIS
#include <utf.h>
int runetochar(char *cp, Rune *rp);
int chartorune(Rune *rp, char *cp);
int runelen(long r);
int fullrune(char *cp, int n);
int utflen(char *s);
int utfbytes(char *s);
char *utfrune(char *cp, long r);
char *utfrrune(char *cp, long r);
char *utfutf(char *big, char *little);
int utf_snprintf(char *buf, size_t size, char *format, ...);
int utfcmp(char *s1, char *s2);
int utfncmp(char *s1, char *s2, int rc);
char *utfcpy(char *dst, char *src);
char *utfncpy(char *dst, char *src, int nbytes);
char *utfcat(char *src, char *append);
char *utfncat(char *src, char *append, int nbytes);
DESCRIPTION
The UTF routines pack Unicode text into a standard character stream.
To do that effectively, the 128 ASCII characters (values 0 to 127) form
the lowest characters of UTF-8 and are each encoded as a single byte.
These characters are interchangeable between the two character sets. A
Rune is a Unicode character; its type is defined in the header file
utf.h.
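The interchangeability of ASCII can be illustrated with a small check.
This is only a sketch: sketch_is_ascii is a hypothetical helper, not a
function declared in utf.h. It relies on the fact that every byte of a
multi-byte UTF-8 sequence has its top bit set, so a string whose bytes
are all below 0x80 reads the same as ASCII and as UTF.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of utf.h): returns 1 if every byte of
 * the NUL-terminated string s is below 0x80, that is, if s is plain
 * ASCII and therefore identical in both encodings; returns 0 otherwise.
 */
int sketch_is_ascii(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    for (; *p != '\0'; p++) {
        if (*p >= 0x80) {
            return 0;   /* byte belongs to a multi-byte UTF sequence */
        }
    }
    return 1;
}
```

For a string that passes this check, the number of bytes and the number
of Runes are the same.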
runetochar translates a single Rune to a UTF sequence and returns the
number of bytes produced. chartorune is the inverse of this function,
returning the number of bytes consumed. runelen returns the number of
bytes in the encoding of a Rune. fullrune checks that the first n
bytes of the UTF string cp contain a complete UTF encoding.
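The following sketch shows roughly what these four operations involve.
It is not the library's source. All names (SketchRune, SKETCH_BAD,
sketch_runelen, sketch_encode, sketch_decode, sketch_fullrune) are
hypothetical, and it assumes standard UTF-8 of one to four bytes. The
real Rune type and the library's handling of malformed input may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long SketchRune;   /* hypothetical stand-in for Rune */
#define SKETCH_BAD 0xFFFDUL         /* hypothetical "bad sequence" value */

/* Bytes needed to encode r (compare runelen); 0 if r is out of range. */
int sketch_runelen(SketchRune r)
{
    if (r < 0x80UL) return 1;
    if (r < 0x800UL) return 2;
    if (r < 0x10000UL) return 3;
    if (r < 0x110000UL) return 4;
    return 0;
}

/* Encode r into dst (compare runetochar); returns the bytes written. */
int sketch_encode(char *dst, SketchRune r)
{
    unsigned char *p = (unsigned char *)dst;
    int n = sketch_runelen(r);
    int i;

    assert(dst != NULL);
    if (n <= 1) {
        if (n == 1) p[0] = (unsigned char)r;
        return n;
    }
    /* Fill continuation bytes from the end, six payload bits each. */
    for (i = n - 1; i > 0; i--) {
        p[i] = (unsigned char)(0x80 | (r & 0x3F));
        r >>= 6;
    }
    /* Lead byte: n high bits set, then the remaining payload bits. */
    p[0] = (unsigned char)(((0xF00 >> n) & 0xFF) | r);
    return n;
}

/* Sequence length implied by a lead byte; 0 if c cannot start one. */
static int sketch_seqlen(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Decode one sequence from src (compare chartorune); returns the bytes
 * consumed. Malformed input yields SKETCH_BAD and consumes one byte.
 * Overlong forms are not rejected in this sketch.
 */
int sketch_decode(SketchRune *rp, const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    int n = sketch_seqlen(p[0]);
    SketchRune r;
    int i;

    assert(rp != NULL);
    if (n == 0) { *rp = SKETCH_BAD; return 1; }
    if (n == 1) { *rp = p[0]; return 1; }
    r = (SketchRune)(p[0] & (0xFF >> (n + 1)));   /* lead byte payload */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) { *rp = SKETCH_BAD; return 1; }
        r = (r << 6) | (SketchRune)(p[i] & 0x3F);
    }
    *rp = r;
    return n;
}

/* 1 if the first n bytes of src hold a whole sequence (cf. fullrune). */
int sketch_fullrune(const char *src, int n)
{
    int need;

    if (n <= 0) return 0;
    need = sketch_seqlen((unsigned char)src[0]);
    if (need == 0) need = 1;   /* bad lead byte: decoder consumes one */
    return n >= need;
}
```

A caller reading from a stream would typically test with the fullrune
equivalent before decoding, and read more input when it reports an
incomplete sequence.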
utflen returns the number of runes in a UTF string. utfbytes returns
the number of bytes in a UTF string. utfrune returns a pointer to the
first occurrence of a rune in a UTF string. utfrrune returns a pointer
to the last. utfutf searches for the first occurrence of a UTF string
in another UTF string.
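As an illustration of the difference between counting Runes and
counting bytes, here is a minimal sketch. The sketch_ names are
hypothetical and this is not the library's code. It assumes well-formed
UTF-8, where every continuation byte has the bit pattern 10xxxxxx, so
the Rune count is the number of bytes that are not continuation bytes.
Under the same assumption a plain byte search finds a UTF substring
only at Rune boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of Runes in s (compare utflen): count non-continuation bytes. */
int sketch_utflen(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int runes = 0;

    assert(s != NULL);
    for (; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            runes++;
        }
    }
    return runes;
}

/* Number of bytes in s, excluding the terminator (compare utfbytes). */
int sketch_utfbytes(const char *s)
{
    assert(s != NULL);
    return (int)strlen(s);
}

/* First occurrence of little in big, or NULL (compare utfutf). */
const char *sketch_utfutf(const char *big, const char *little)
{
    assert(big != NULL && little != NULL);
    return strstr(big, little);
}
```

For the six-byte string "na\xC3\xAFve" the first function reports five
Runes and the second reports six bytes.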
utf_snprintf is a particularly dumb implementation of snprintf for UTF
strings: it only interprets %%, %s and %d sequences in the format
string, and does no field width calculation on those.
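A formatter restricted to %%, %s and %d can be sketched as follows.
This is an illustration only. sketch_snprintf and sketch_put are
hypothetical names, and the return value follows the C99 snprintf
convention (the length the full output would have had), which is an
assumption; this page does not say what utf_snprintf returns. Bytes are
copied unchanged, so UTF sequences in the format or in %s arguments
pass straight through, although truncation may cut a sequence short.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Append src to buf if room remains; always advance the full length. */
static void sketch_put(char *buf, size_t size, size_t *len, const char *src)
{
    for (; *src != '\0'; src++) {
        if (*len + 1 < size) {
            buf[*len] = *src;
        }
        (*len)++;
    }
}

/* Hypothetical minimal formatter: handles %%, %s and %d only. */
int sketch_snprintf(char *buf, size_t size, const char *format, ...)
{
    va_list ap;
    size_t len = 0;
    char tmp[32];
    const char *f;

    assert(format != NULL);
    va_start(ap, format);
    for (f = format; *f != '\0'; f++) {
        if (*f != '%') {
            tmp[0] = *f;
            tmp[1] = '\0';
            sketch_put(buf, size, &len, tmp);
            continue;
        }
        f++;                        /* look at the conversion character */
        switch (*f) {
        case '%':
            sketch_put(buf, size, &len, "%");
            break;
        case 's':
            sketch_put(buf, size, &len, va_arg(ap, char *));
            break;
        case 'd':
            snprintf(tmp, sizeof(tmp), "%d", va_arg(ap, int));
            sketch_put(buf, size, &len, tmp);
            break;
        case '\0':
            f--;                    /* lone trailing '%': stop at the end */
            break;
        default:
            break;                  /* unknown conversions are ignored */
        }
    }
    va_end(ap);
    if (size > 0) {
        buf[len < size ? len : size - 1] = '\0';
    }
    return (int)len;
}
```

A return value greater than or equal to size tells the caller that the
output was truncated.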
utfcmp compares two strings lexicographically, Rune by Rune, and
returns a value greater than zero, equal to zero, or less than zero
depending on whether the first UTF string is greater than, the same as,
or less than the second string. utfncmp does the same comparison as
utfcmp, but compares at most rc Runes.
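A Rune-limited comparison can be sketched without decoding, assuming
both inputs are well-formed UTF-8: in that case comparing bytes as
unsigned values gives the same order as comparing Runes, and a new Rune
starts at every byte that is not a continuation byte. sketch_utfncmp is
a hypothetical name, and the library itself may well decode each Rune
instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compare at most rc Runes of s1 and s2 (compare utfncmp).
 * Returns -1, 0 or 1. Assumes well-formed UTF-8 in both strings.
 */
int sketch_utfncmp(const char *s1, const char *s2, int rc)
{
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;
    int runes = 0;

    assert(s1 != NULL && s2 != NULL);
    for (;; a++, b++) {
        /* A byte that is not 10xxxxxx begins a new Rune (or ends s1). */
        if ((*a & 0xC0) != 0x80) {
            if (runes == rc) {
                return 0;           /* rc Runes matched: stop here */
            }
            runes++;
        }
        if (*a != *b) {
            return *a < *b ? -1 : 1;
        }
        if (*a == '\0') {
            return 0;               /* both strings ended together */
        }
    }
}
```

Because the two strings are identical up to the byte being examined,
their Rune boundaries coincide, so counting boundaries in the first
string is enough.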
utfcpy copies from source to destination, Rune by Rune. No bounds
checking is done on the number of Runes copied, or their individual
sizes. The dst argument is returned.
utfncpy copies at most nbytes bytes from source to destination,
stopping when a null Rune is found in the source. If the number of
bytes copied is less than nbytes, then the rest of the destination
string is padded with null (0) bytes. If the source holds nbytes or
more bytes, no zero byte is added, so the destination is not
terminated. The dst argument is returned. utfcat appends the UTF string
append onto the UTF string src. utfncat appends the UTF string append
onto the UTF string src, bearing in mind that the buffer src is only
nbytes long.
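The byte-limited copy can be sketched as below. sketch_utfncpy is a
hypothetical name and this is not the library's implementation. In
particular, the sketch stops before a Rune that would not fit entirely,
so that a multi-byte sequence is never cut in half; whether utfncpy
itself does so is not stated on this page. The zero padding follows the
description above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy whole Runes from src into dst, using at most nbytes bytes, then
 * pad any unused space with zero bytes. If the copied Runes fill dst
 * exactly, no terminating zero byte is written. Returns dst.
 */
char *sketch_utfncpy(char *dst, const char *src, int nbytes)
{
    const unsigned char *s = (const unsigned char *)src;
    int used = 0;
    int len;
    int i;

    assert(dst != NULL && src != NULL);
    while (*s != '\0') {
        /* Measure one Rune: a lead byte plus its continuation bytes. */
        len = 1;
        while ((s[len] & 0xC0) == 0x80) {
            len++;
        }
        if (used + len > nbytes) {
            break;                  /* this Rune would not fit */
        }
        for (i = 0; i < len; i++) {
            dst[used++] = (char)s[i];
        }
        s += len;
    }
    while (used < nbytes) {
        dst[used++] = '\0';         /* pad the remainder with zeros */
    }
    return dst;
}
```

As with strncpy, a caller that needs a terminated string should reserve
one extra byte and set it to zero itself.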
IMPLEMENTATION
This implementation of UTF, nominally UTF-8, can encode the null
Unicode character using either a one-byte or a two-byte encoding.
Typically, Plan 9 uses the one-byte encoding, whilst Java uses the
two-byte encoding. The Plan 9 style of encoding makes backwards
compatibility much easier, and loses nothing: all the Java
functionality is there; there are no embedded null bytes in a UTF
string, because the second and third bytes of a multi-byte sequence
are never zero; and ordinary C strings are recognised as well, which is
not the case in Java. By default, the one-byte encoding of the null
character is used.
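The two encodings of the null character can be shown concretely. The
one-byte form is the single byte 0x00, and the two-byte form used by
Java is the pair 0xC0 0x80. The sketch below uses hypothetical names
(sketch_encode_null, sketch_null_len) and does not show how this
library selects between the two forms.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write the null Rune into dst and return the number of bytes used.
 * two_byte != 0 selects the Java-style form 0xC0 0x80; otherwise the
 * Plan 9 style single byte 0x00 is written.
 */
int sketch_encode_null(char *dst, int two_byte)
{
    assert(dst != NULL);
    if (two_byte) {
        dst[0] = (char)0xC0;
        dst[1] = (char)0x80;
        return 2;
    }
    dst[0] = '\0';
    return 1;
}

/*
 * If src begins with either encoding of the null Rune, return the
 * number of bytes it occupies (1 or 2); otherwise return 0.
 */
int sketch_null_len(const char *src)
{
    const unsigned char *p = (const unsigned char *)src;

    assert(src != NULL);
    if (p[0] == 0x00) {
        return 1;
    }
    if (p[0] == 0xC0 && p[1] == 0x80) {
        return 2;
    }
    return 0;
}
```

With the two-byte form a buffer can carry a null character without
containing a zero byte, at the cost of no longer matching the way
ordinary C strings are terminated.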
UTF-8 is defined in X/Open Company Ltd., "File System Safe UCS
Transformation Format (FSS_UTF)", X/Open Preliminary Specification,
Document Number: P316, which also appears in ISO/IEC 10646, Annex P.
BUGS
Undoubtedly, these are many, and legion.
AUTHOR
Written by Alistair Crooks (agc@amdahl.com, or
agc@westley.demon.co.uk), from a draft document written by Rob Pike and
Ken Thompson, detailing the implementation of UTF in the Plan 9
operating system.
UTF(3)