str(3)                          String Library                          str(3)

NAME
       OSSP str - String Handling

VERSION
       OSSP str 0.9.12 (12-Oct-2005)

SYNOPSIS
       str_len, str_copy, str_dup, str_concat, str_splice, str_compare,
       str_span, str_locate, str_token, str_parse, str_format, str_hash,
       str_base64.

DESCRIPTION
       OSSP str is a generic string library written in ISO-C that provides
       functions for handling, matching, parsing, searching and formatting
       ISO-C strings. It can be considered a superset of POSIX string(3),
       but its main intention is to provide a more convenient and compact
       API plus more generalized functionality.

FUNCTIONS
       The following functions are provided by the OSSP str API:

       str_size_t str_len(const char *s);
           This function determines the length of string s, i.e., the number
           of characters starting at s that precede the terminating "NUL"
           character. It returns 0 if s is "NULL".

       char *str_copy(char *s, const char *t, size_t n);
           This copies the characters in string t into the string s, but never
           more than n characters (if n is greater than 0). The two involved
           strings can overlap and the characters in s are always
           "NUL"-terminated. The string s has to be large enough to hold all
           characters to be copied. The function returns "NULL" if s or t is
           "NULL". Else it returns the pointer to the written
           "NUL"-terminating character in s.

       char *str_dup(const char *s, str_size_t n);
           This returns a copy of the characters in string s, but never more
           than n characters if n is greater than 0. It returns "NULL" if s is
           "NULL". The returned string has to be deallocated later with
           free(3).
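
           The length rule above can be mimicked in portable C. The sketch
           below is not the library's code: sketch_dup is a hypothetical
           stand-in that assumes only the documented contract (n of 0 means
           no limit, a "NULL" input yields "NULL").

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in mirroring the documented str_dup() contract. */
static char *sketch_dup(const char *s, size_t n)
{
    size_t len;
    char *copy;

    if (s == NULL)
        return NULL;
    len = strlen(s);
    if (n > 0 && len > n)
        len = n;                 /* copy at most n characters */
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    copy[len] = '\0';            /* result is always NUL-terminated */
    return copy;                 /* caller releases it with free(3) */
}
```

           With this sketch, sketch_dup("foobar", 3) yields "foo" and
           sketch_dup("foobar", 0) yields the whole string.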

       char *str_concat(char *s, ...);
           This function concatenates the characters of all string arguments
           into a newly allocated string and returns this new string.  If s is
           "NULL" the function returns "NULL". The returned string later has
           to be deallocated with free(3).
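
           A variadic concatenation of this kind is usually built in two
           passes over the argument list. The following is a minimal sketch,
           not the library's implementation: sketch_concat is a hypothetical
           name, and ending the argument list with a "NULL" pointer is an
           assumption that the text above does not state.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in; the argument list must end with (char *)NULL. */
static char *sketch_concat(const char *s, ...)
{
    va_list ap;
    const char *cp;
    char *rv, *p;
    size_t total;

    if (s == NULL)
        return NULL;

    /* first pass: add up the lengths of all arguments */
    total = strlen(s);
    va_start(ap, s);
    while ((cp = va_arg(ap, const char *)) != NULL)
        total += strlen(cp);
    va_end(ap);

    rv = malloc(total + 1);
    if (rv == NULL)
        return NULL;

    /* second pass: copy the pieces one after another */
    p = rv;
    memcpy(p, s, strlen(s));
    p += strlen(s);
    va_start(ap, s);
    while ((cp = va_arg(ap, const char *)) != NULL) {
        size_t len = strlen(cp);
        memcpy(p, cp, len);
        p += len;
    }
    va_end(ap);
    *p = '\0';
    return rv;                   /* start of the new string */
}
```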

       char *str_splice(char *s, str_size_t off, str_size_t n, char *t,
       str_size_t m);
           This splices the string t into string s, i.e., the n characters at
           offset off in s are removed and at their location the string t is
           inserted (or just the first m characters of t if m is greater than
           0). It returns "NULL" if s or t is "NULL".  Else the string s is
           returned. The function also supports the situation where t is a
           sub-string of s as long as the area s+off...s+off+n and t...t+m do
           not overlap. The caller always has to make sure that enough room
           exists in s.
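
           The remove-then-insert step can be pictured with memmove(3). This
           is a minimal sketch under simplifying assumptions, not the
           library's code: sketch_splice is a hypothetical name, t must not
           alias s, and out-of-range off and n values are clamped (the text
           above does not say how the library treats them).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: replace n chars at s+off by t (or its first m chars). */
static char *sketch_splice(char *s, size_t off, size_t n,
                           const char *t, size_t m)
{
    size_t slen, tlen;

    if (s == NULL || t == NULL)
        return NULL;
    slen = strlen(s);
    tlen = strlen(t);
    if (m > 0 && tlen > m)
        tlen = m;                /* insert only the first m characters */
    if (off > slen)
        off = slen;              /* clamp (assumption of this sketch) */
    if (n > slen - off)
        n = slen - off;

    /* shift the tail, including its NUL, to its final position */
    memmove(s + off + tlen, s + off + n, slen - off - n + 1);
    /* drop the replacement into the gap */
    memcpy(s + off, t, tlen);
    return s;
}
```

           As with the real function, the caller has to provide a buffer
           that is large enough for the result.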

       int str_compare(const char *s, const char *t, str_size_t n, int mode);
           This performs a lexicographical comparison of the two strings s and
           t (but never compares more than n characters of them) and returns
           one of three return values: a value lower than 0 if s is
           lexicographically lower than t, a value of exactly 0 if s and t are
           equal and a value greater than 0 if s is lexicographically higher
           than t. Per default (mode is 0) the comparison is case-sensitive,
           but if "STR_NOCASE" is used for mode the comparison is done in a
           case-insensitive way.
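
           The three-way result and the effect of the mode flag can be
           sketched in portable C. The names sketch_compare and SKETCH_NOCASE
           are hypothetical, and treating n equal to 0 as "no limit" is an
           assumption borrowed from the other functions, since the text above
           does not state it.

```c
#include <ctype.h>
#include <stddef.h>

#define SKETCH_NOCASE 1   /* hypothetical counterpart of STR_NOCASE */

/* Hypothetical stand-in: <0, 0 or >0 like the documented str_compare(). */
static int sketch_compare(const char *s, const char *t, size_t n, int mode)
{
    size_t i;

    for (i = 0; n == 0 || i < n; i++) {
        unsigned char a = (unsigned char)s[i];
        unsigned char b = (unsigned char)t[i];

        if (mode & SKETCH_NOCASE) {
            a = (unsigned char)tolower(a);
            b = (unsigned char)tolower(b);
        }
        if (a != b)
            return (a < b) ? -1 : 1;
        if (a == '\0')
            break;               /* both strings ended together */
    }
    return 0;
}
```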

       char *str_span(const char *s, size_t n, const char *charset, int mode);
           This function spans a string s according to the characters
           specified in charset. If mode is 0, this means that s is spanned
           from left to right starting at s (and ending either when reaching
           the terminating "NUL" character or already after n spanned
           characters) as long as the characters of s are contained in
           charset.

           Alternatively one can use a mode of "STR_COMPLEMENT" to indicate
           that s is spanned as long as the characters of s are not contained
           in charset, i.e., charset then specifies the complement of the
           spanning characters.

           In both cases one can additionally "or" (with the C operator
           ``"|"'') "STR_RIGHT" into mode to indicate that the spanning is
           done right to left starting at the terminating "NUL" character of s
           (and ending either when reaching s or already after n spanned
           characters).

       char *str_locate(const char *s, str_size_t n, const char *t);
           This function searches for the (smaller) string t inside (larger)
           string s. If n is not 0, the search is performed only inside the
           first n characters of s.

       char *str_token(char **s, const char *delim, const char *quote, const
       char *comment, int mode);
           This function considers the string s to consist of a sequence of
           zero or more text tokens separated by spans of one or more
           characters from the separator string delim. However, text between
           matched pairs of quotemarks (characters in quote) is treated as
           plain text, never as delimiter (separator) text. Each call of this
           function returns a pointer to the first character of the first
           token of s. The token is "NUL"-terminated, i.e., the string s is
           processed in a destructive way. If there are quotation marks or
           escape sequences, the input string is rewritten with quoted
           sections and escape sequences properly interpreted.

           This function keeps track of its parsing position in the string
           between separate calls by simply adjusting the caller's s pointer,
           so that subsequent calls with the same pointer variable s will
           start processing from the position immediately after the last
           returned token.  In this way subsequent calls will work through the
           string s until no tokens remain. When no token remains in s, "NULL"
           is returned. The string of token separators (delim) and the string
           of quote characters (quote) may be changed from call to call.

           If a character in the string s is not quoted or escaped, and is in
           the comment set, then it is overwritten with a "NUL" character and
           the rest of the string is ignored. The characters to be used as
           quote characters are specified in the quote set, and must be used
           in balanced pairs. If there is more than one flavor of quote
           character, one kind of quote character may be used to quote another
           kind. If an unbalanced quote is found, the function silently acts as
           if one had been placed at the end of the input string.  The delim
           and quote strings must be disjoint, i.e., they have to share no
           characters.

           The mode argument can be used to modify the processing of the
           string (default for mode is 0): "STR_STRIPQUOTES" forces quote
           characters to be stripped from quoted tokens; "STR_BACKSLASHESC"
           enables the interpretation (and expansion) of backslash escape
           sequences (`\x') through ANSI-C rules; "STR_SKIPDELIMS" forces that
           after the terminating "NUL" is written and the token returned,
           further delimiters are skipped (this allows one to make sure that
           the delimiters for one word don't become part of the next word if
           one changes delimiters between calls); and "STR_TRIGRAPHS" enables
           the recognition and expansion of ANSI C Trigraph sequences (as a
           side effect this enables "STR_BACKSLASHESC", too).
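
           The way the parsing position travels through the caller's pointer
           can be shown with a much reduced tokenizer. This sketch handles
           delimiters only; quotes, comments and escape sequences are left
           out, and sketch_token is a hypothetical name rather than part of
           the library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: destructive tokenizer that advances *s itself. */
static char *sketch_token(char **s, const char *delim)
{
    char *tok, *end;

    if (s == NULL || *s == NULL)
        return NULL;
    tok = *s + strspn(*s, delim);    /* skip leading delimiters */
    if (*tok == '\0') {
        *s = tok;                    /* nothing left */
        return NULL;
    }
    end = tok + strcspn(tok, delim); /* find the end of the token */
    if (*end != '\0')
        *end++ = '\0';               /* terminate token in place */
    *s = end;                        /* next call resumes here */
    return tok;
}
```

           Calling it repeatedly with the address of the same pointer
           variable walks through the buffer until "NULL" is returned.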

       int str_parse(const char *s, const char *pop, ...);
           This parses the string s according to the parsing operation
           specified by pop. If the parsing operation succeeds, 1 is returned.
           If the parsing operation failed because the pattern pop did not
           match, 0 is returned. If the parsing operation failed because the
           underlying regular expression library failed, "-1" is returned.

           The pop string usually has one of the following two syntax
           variants: `m delim regex delim flags*' (for matching operations)
           and `s delim regex delim subst delim flags*' (for substitution
           operations). For more details about the syntax variants and
           semantics of the pop argument see section GORY DETAILS, Parsing
           Specification below. The syntax of the regex part in pop is mostly
           equivalent to Perl 5's regular expression syntax. For the complete
           and gory details see perlre(1). A brief summary can be found under
           section GORY DETAILS, Perl Regular Expressions below.

       int str_format(char *s, str_size_t n, const char *fmt, ...);
           This formats a new string according to fmt and optionally following
           arguments and writes it into the string s, but never more than n
           characters at all. It returns the number of written characters.  If
           s is "NULL" it just calculates the number of characters which would
           be written.

           The function generates the output string under the control of the
           fmt format string that specifies how subsequent arguments (or
           arguments accessed via the variable-length argument facilities of
           stdarg(3)) are converted for output.

           The format string fmt is composed of zero or more directives:
           ordinary characters (not %), which are copied unchanged to the
           output stream; and conversion specifications, each of which results
           in fetching zero or more subsequent arguments. Each conversion
           specification is introduced by the character %. The arguments must
           correspond properly (after type promotion) with the conversion
           specifier. The supported conversion specifications are described in
           detail under CONVERSION SPECIFICATION below.
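
           The "measure first, then write" use of a "NULL" destination has a
           close analogue in C99's vsnprintf(3). The sketch below uses that
           standard function as a stand-in and is not OSSP str code;
           sketch_format_alloc is a hypothetical helper name.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated, exactly sized buffer. */
static char *sketch_format_alloc(const char *fmt, ...)
{
    va_list ap, ap2;
    int n;
    char *buf;

    va_start(ap, fmt);
    va_copy(ap2, ap);                     /* the list is consumed twice */
    n = vsnprintf(NULL, 0, fmt, ap);      /* pass 1: length only */
    va_end(ap);
    if (n < 0) {
        va_end(ap2);
        return NULL;
    }
    buf = malloc((size_t)n + 1);          /* room for the NUL as well */
    if (buf != NULL)
        vsnprintf(buf, (size_t)n + 1, fmt, ap2);   /* pass 2: write */
    va_end(ap2);
    return buf;                           /* caller frees with free(3) */
}
```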

       unsigned long str_hash(const char *s, str_size_t n, int mode);
           This function calculates a hash value of string s (or of its first
           n characters if n is not 0). The following hashing functions are
           supported and can be selected with mode: STR_HASH_DJBX33 (Daniel J.
           Bernstein, Times 33 Hash with Addition), STR_HASH_BJDDJ
           (Bob Jenkins, Dr. Dobbs Journal), and STR_HASH_MACRC32 (Mark Adler,
           Cyclic Redundancy Check with 32-Bit). This function is intended for
           fast use in hashing algorithms and not for use as cryptographically
           strong message digests.
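
           For orientation, the "times 33 with addition" scheme multiplies
           the running hash by 33 and adds the next character. The sketch
           below shows that general technique only; the function name is
           hypothetical, and the start value 5381 is the one customarily
           used with this hash, which is an assumption here and may differ
           from what the library uses.

```c
#include <stddef.h>

/* Hypothetical stand-in for a times-33-with-addition string hash. */
static unsigned long sketch_djbx33a(const char *s, size_t n)
{
    unsigned long h = 5381UL;    /* customary seed (assumption) */
    size_t i;

    /* n == 0 means "hash the whole string" */
    for (i = 0; s[i] != '\0' && (n == 0 || i < n); i++)
        h = h * 33UL + (unsigned char)s[i];
    return h;
}
```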

       int str_base64(char *s, str_size_t n, unsigned char *ucp, str_size_t
       ucn, int mode);
           This function Base64 encodes ucn bytes starting at ucp and writes
           the resulting string into s (but never more than n characters are
           written). The mode for this operation has to be
           "STR_BASE64_ENCODE".  Additionally one can OR the value
           "STR_BASE64_STRICT" to enable strict encoding where after every
           72nd output character a newline character is inserted. The function
           returns the number of output characters written.  If s is "NULL"
           the function just calculates the number of required output
           characters.

           Alternatively, if mode is "STR_BASE64_DECODE" the string s (or the
           first n characters only if n is not 0) is decoded and the output
           bytes written at ucp. Again, if ucp is "NULL" only the number of
           required output bytes is calculated.
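
           The encoding direction, including the "length only" call with a
           "NULL" output buffer, can be sketched as follows. This is generic
           Base64 and not the library's code: sketch_b64_encode is a
           hypothetical name, strict mode line breaks are not modelled, and
           returning 0 for a too small buffer is an assumption of the sketch.

```c
#include <stddef.h>

static const char sketch_b64tab[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical stand-in: returns the number of Base64 characters produced,
   or the number required when out is NULL. */
static size_t sketch_b64_encode(char *out, size_t outmax,
                                const unsigned char *in, size_t inlen)
{
    size_t need = ((inlen + 2) / 3) * 4;  /* 4 chars per 3 input bytes */
    size_t i, o = 0;

    if (out == NULL)
        return need;                      /* length calculation only */
    if (outmax < need + 1)
        return 0;                         /* no room (sketch convention) */

    for (i = 0; i < inlen; i += 3) {
        unsigned long v = (unsigned long)in[i] << 16;

        if (i + 1 < inlen)
            v |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < inlen)
            v |= (unsigned long)in[i + 2];

        out[o++] = sketch_b64tab[(v >> 18) & 0x3F];
        out[o++] = sketch_b64tab[(v >> 12) & 0x3F];
        /* pad with '=' where the last group is incomplete */
        out[o++] = (i + 1 < inlen) ? sketch_b64tab[(v >> 6) & 0x3F] : '=';
        out[o++] = (i + 2 < inlen) ? sketch_b64tab[v & 0x3F] : '=';
    }
    out[o] = '\0';
    return o;
}
```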

GORY DETAILS
       In this part of the documentation more complex topics are documented in
       detail.

       Perl Regular Expressions

       The regular expressions used in OSSP str are more or less Perl
       compatible (they are provided by a stripped down and built-in version
       of the PCRE library). So the syntax description in perlre(1) applies
       and does not have to be repeated here. For a deeper understanding and
       details you should have a look at the book `Mastering Regular
       Expressions' by Jeffrey Friedl (see also the perlbook(1) manpage). For
       convenience, only a brief summary of Perl compatible regular
       expressions is given here:

       The following metacharacters have their standard egrep(1) meanings:

         \      Quote the next metacharacter
         ^      Match the beginning of the line
         .      Match any character (except newline)
         $      Match the end of the line (or before newline at the end)
         |      Alternation
         ()     Grouping
         []     Character class

       The following standard quantifiers are recognized:

         *      Match 0 or more times (greedy)
         *?     Match 0 or more times (non greedy)
         +      Match 1 or more times (greedy)
         +?     Match 1 or more times (non greedy)
         ?      Match 1 or 0 times (greedy)
         ??     Match 1 or 0 times (non greedy)
         {n}    Match exactly n times (greedy)
         {n}?   Match exactly n times (non greedy)
         {n,}   Match at least n times (greedy)
         {n,}?  Match at least n times (non greedy)
         {n,m}  Match at least n but not more than m times (greedy)
         {n,m}? Match at least n but not more than m times (non greedy)

       The following backslash sequences are recognized:

         \t     Tab                   (HT, TAB)
         \n     Newline               (LF, NL)
         \r     Return                (CR)
         \f     Form feed             (FF)
         \a     Alarm (bell)          (BEL)
         \e     Escape (think troff)  (ESC)
         \033   Octal char
         \x1B   Hex char
         \c[    Control char
         \l     Lowercase next char
         \u     Uppercase next char
         \L     Lowercase till \E
         \U     Uppercase till \E
         \E     End case modification
         \Q     Quote (disable) pattern metacharacters till \E

       The following non zero-width assertions are recognized:

         \w     Match a "word" character (alphanumeric plus "_")
         \W     Match a non-word character
         \s     Match a whitespace character
         \S     Match a non-whitespace character
         \d     Match a digit character
         \D     Match a non-digit character

       The following zero-width assertions are recognized:

         \b     Match a word boundary
         \B     Match a non-(word boundary)
         \A     Match only at beginning of string
         \Z     Match only at end of string, or before newline at the end
         \z     Match only at end of string
         \G     Match only where previous m//g left off (works only with /g)

       The following regular expression extensions are recognized:

         (?#text)              An embedded comment
         (?:pattern)           This is for clustering, not capturing (simple)
         (?imsx-imsx:pattern)  This is for clustering, not capturing (full)
         (?=pattern)           A zero-width positive lookahead assertion
         (?!pattern)           A zero-width negative lookahead assertion
         (?<=pattern)          A zero-width positive lookbehind assertion
         (?<!pattern)          A zero-width negative lookbehind assertion
         (?>pattern)           An "independent" subexpression
         (?(cond)yes-re)       Conditional expression (simple)
         (?(cond)yes-re|no-re) Conditional expression (full)
         (?imsx-imsx)          One or more embedded pattern-match modifiers

       Parsing Specification

       The str_parse(const char *s, const char *pop, ...) function is a very
       flexible but complex one. The argument s is the string on which the
       parsing operation specified by argument pop is applied.  The parsing
       semantics are highly influenced by Perl's `=~' matching operator,
       because one of the main goals of str_parse(3) is to allow one to
       rewrite typical Perl matching constructs into C.

       Now to the gory details. In general, the pop argument of str_parse(3)
       has one of the following two syntax variants:

       Matching: `m delim regex delim flags*':
           This matches s against the Perl-style regular expression regex
           under the control of zero or more flags which control the parsing
           semantics. The stripped down pop syntax `regex' is equivalent to
           `m/regex/'.

           For each grouping pair of parentheses in regex, the text in s which
           was grouped by the parentheses is extracted into new strings.
           These per default are allocated as separate strings and returned to
           the caller through following `char **' arguments. The caller is
           required to free(3) them later.

       Substitution: `s delim regex delim subst delim flags*':
           This matches s against the Perl-style regular expression regex
           under the control of zero or more flags which control the parsing
           semantics. As a result of the operation, a new string is formed
           which consists of s but with the part which matched regex replaced
           by subst. The result string is returned to the caller through a
           `char **' argument. The caller is required to free(3) this later.

           For each grouping pair of parentheses in regex, the text in s which
           was grouped by the parentheses is extracted into new strings and
           can be referenced for expansion via `$n' (n=1,..) in subst.
           Additionally any str_format(3) style `%' constructs in subst are
           expanded through additional caller supplied arguments.
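
           The extraction of grouped text into separately allocated strings
           can be pictured with POSIX regcomp(3) and regexec(3). This is a
           sketch of the idea only, not of str_parse(3) itself: it uses POSIX
           extended syntax instead of Perl syntax (so `[^:]+' stands in for
           the non-greedy `.+?'), it is fixed to two groups, and
           sketch_match2 is a hypothetical name.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: 1 = matched, 0 = no match, -1 = error.
   On success *v1 and *v2 hold malloc'ed copies of groups 1 and 2. */
static int sketch_match2(const char *s, const char *ere,
                         char **v1, char **v2)
{
    regex_t re;
    regmatch_t m[3];             /* whole match plus two groups */
    char **out[2];
    int i, rc;

    out[0] = v1;
    out[1] = v2;
    *v1 = *v2 = NULL;

    if (regcomp(&re, ere, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, s, 3, m, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return 0;
    if (rc != 0)
        return -1;

    for (i = 0; i < 2; i++) {
        size_t len;
        char *copy = NULL;

        if (m[i + 1].rm_so >= 0) {
            len = (size_t)(m[i + 1].rm_eo - m[i + 1].rm_so);
            copy = malloc(len + 1);
            if (copy != NULL) {
                memcpy(copy, s + m[i + 1].rm_so, len);
                copy[len] = '\0';
            }
        }
        if (copy == NULL) {      /* group missing or out of memory */
            free(*v1);
            *v1 = *v2 = NULL;
            return -1;
        }
        *out[i] = copy;
    }
    return 1;
}
```

           The caller frees each returned string individually, which
           corresponds to the default (non-bundled) behaviour described
           above.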

       The following flags are supported:

       b   If the bundle flag `b' is specified, the extracted strings are
           bundled together into a single chunk of memory and its address is
           returned to the caller with an additional `char **' argument which
           has to precede the regular string arguments. The caller then has to
           free(3) only this chunk of memory in order to free all extracted
           strings at once.

       i   If the case-insensitive flag `i' is specified, regex is matched in
           a case-insensitive way.

       o   If the once flag `o' is specified, this indicates to the OSSP str
           library that the whole pop string is constant and that its internal
           pre-processing (it is compiled into a deterministic finite
           automaton (DFA) internally) has to be done only once (the OSSP str
           library then caches the DFA which corresponds to the pop argument).

       x   If the extended flag `x' is specified, the regex's legibility is
           extended by permitting embedded whitespace and comments to allow
           one to write down complex regular expressions more clearly and even
           in a documented way.

       m   If the multiple lines flag `m' is specified, the string s is
           treated as multiple lines. That is, this changes the regular
           expression meta characters `^' and `$' from matching at only the
           very start or end of the string s to the start or end of any line
           anywhere within the string s.

       s   If the single line flag `s' is specified, the string s is treated
           as single line. That is, this changes the regular expression meta
           character `.' to match any character whatsoever, even a newline,
           which it normally would not match.

CONVERSION SPECIFICATION
       In the format string of str_format(3) each conversion specification is
       introduced by the character %. After the %, the following appear in
       sequence:

       o   An optional field, consisting of a decimal digit string followed by
           a $, specifying the next argument to access.  If this field is not
           provided, the argument following the last argument accessed will be
           used.  Arguments are numbered starting at 1. If unaccessed
           arguments in the format string are interspersed with ones that are
           accessed the results will be indeterminate.

       o   Zero or more of the following flags:

           A # character specifying that the value should be converted to an
           ``alternate form''.  For c, d, i, n, p, s, and u, conversions, this
           option has no effect.  For o conversions, the precision of the
           number is increased to force the first character of the output
           string to a zero (except if a zero value is printed with an
           explicit precision of zero).  For x and X conversions, a non-zero
           result has the string 0x (or 0X for X conversions) prepended to it.
           For e, E, f, g, and G, conversions, the result will always contain
           a decimal point, even if no digits follow it (normally, a decimal
           point appears in the results of those conversions only if a digit
           follows).  For g and G conversions, trailing zeros are not removed
           from the result as they would otherwise be.

           A zero `0' character specifying zero padding.  For all conversions
           except n, the converted value is padded on the left with zeros
           rather than blanks.  If a precision is given with a numeric
           conversion (d, i, o, u, x, and X), the `0' flag is ignored.

           A negative field width flag `-' indicates the converted value is to
           be left adjusted on the field boundary.  Except for n conversions,
           the converted value is padded on the right with blanks, rather than
           on the left with blanks or zeros.  A `-' overrides a `0' if both
           are given.

           A space, specifying that a blank should be left before a positive
           number produced by a signed conversion (d, e, E, f, g, G, or i).

           A `+' character specifying that a sign always be placed before a
           number produced by a signed conversion.  A `+' overrides a space if
           both are used.

       o   An optional decimal digit string specifying a minimum field width.
           If the converted value has fewer characters than the field width,
           it will be padded with spaces on the left (or right, if the left-
           adjustment flag has been given) to fill out the field width.

       o   An optional precision, in the form of a period `.' followed by an
           optional digit string. If the digit string is omitted, the
           precision is taken as zero. This gives the minimum number of digits
           to appear for d, i, o, u, x, and X conversions, the number of
           digits to appear after the decimal-point for e, E, and f
           conversions, the maximum number of significant digits for g and G
           conversions, or the maximum number of characters to be printed from
           a string for s conversions.

       o   The optional character h, specifying that a following d, i, o, u,
           x, or X conversion corresponds to a `"short int"' or `"unsigned
           short int"' argument, or that a following n conversion corresponds
           to a pointer to a `"short int"' argument.

       o   The optional character l (ell) specifying that a following d, i, o,
           u, x, or X conversion corresponds to a `"long int"' or
           `"unsigned long int"' argument, or that a following n conversion
           corresponds to a pointer to a `"long int"' argument.

       o   The optional character q, specifying that a following d, i, o, u,
           x, or X conversion corresponds to a `"quad int"' or `"unsigned quad
           int"' argument, or that a following n conversion corresponds to a
           pointer to a `"quad int"' argument.

       o   The character L specifying that a following e, E, f, g, or G
           conversion corresponds to a `"long double"' argument.

       o   A character that specifies the type of conversion to be applied.

       A field width or precision, or both, may be indicated by an asterisk
       `*' or an asterisk followed by one or more decimal digits and a `$'
       instead of a digit string.  In this case, an `"int"' argument supplies
       the field width or precision.  A negative field width is treated as a
       left adjustment flag followed by a positive field width; a negative
       precision is treated as though it were missing.  If a single format
       directive mixes positional (`nn$') and non-positional arguments, the
       results are undefined.

       The conversion specifiers and their meanings are:

       diouxX
           The `"int"' (or appropriate variant) argument is converted to
           signed decimal (d and i), unsigned octal (o), unsigned decimal (u),
           or unsigned hexadecimal (x and X) notation.  The letters abcdef are
           used for x conversions; the letters ABCDEF are used for X
           conversions.  The precision, if any, gives the minimum number of
           digits that must appear; if the converted value requires fewer
           digits, it is padded on the left with zeros.

       DOU The `"long int"' argument is converted to signed decimal, unsigned
           octal, or unsigned decimal, as if the format had been ld, lo, or lu
           respectively.  These conversion characters are deprecated, and will
           eventually disappear.

       eE  The `"double"' argument is rounded and converted in the style
           `[-]d.ddde+-dd' where there is one digit before the decimal-point
           character and the number of digits after it is equal to the
           precision; if the precision is missing, it is taken as 6; if the
           precision is zero, no decimal-point character appears.  An E
           conversion uses the letter E (rather than e) to introduce the
           exponent.  The exponent always contains at least two digits; if the
           value is zero, the exponent is 00.

       f   The `"double"' argument is rounded and converted to decimal
           notation in the style `[-]ddd.ddd' where the number of digits
           after the decimal-point character is equal to the precision
           specification.  If the precision is missing, it is taken as 6; if
           the precision is explicitly zero, no decimal-point character
           appears.  If a decimal point appears, at least one digit appears
           before it.

       gG  The `"double"' argument is converted in style f or e (or E for G
           conversions).  The precision specifies the number of significant
           digits.  If the precision is missing, 6 digits are given; if the
           precision is zero, it is treated as 1.  Style e is used if the
           exponent from its conversion is less than -4 or greater than or
           equal to the precision.  Trailing zeros are removed from the
           fractional part of the result; a decimal point appears only if it
           is followed by at least one digit.

       c   The `"int"' argument is converted to an `"unsigned char"', and the
           resulting character is written.

       s   The `"char *"' argument is expected to be a pointer to an array of
           character type (pointer to a string).  Characters from the array
           are written up to (but not including) a terminating "NUL"
           character; if a precision is specified, no more than the number
           specified are written.  If a precision is given, no null character
           need be present; if the precision is not specified, or is greater
           than the size of the array, the array must contain a terminating
           "NUL" character.

       p   The `"void *"' pointer argument is printed in hexadecimal (as if by
           `%#x' or `%#lx').

       n   The number of characters written so far is stored into the integer
           indicated by the `"int *"' (or variant) pointer argument.  No
           argument is converted.

       %   A `%' is written. No argument is converted. The complete conversion
           specification is `%%'.

       In no case does a non-existent or small field width cause truncation of
       a field; if the result of a conversion is wider than the field width,
       the field is expanded to contain the conversion result.
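
       Several of the flags, widths and precisions described above behave the
       same way in the C library's printf(3) family. The sketch below uses
       snprintf(3) as a stand-in to show their effect; sketch_flags is a
       hypothetical helper, and whether str_format(3) produces identical
       output in every case is an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: fill out[] with one formatted sample per flag. */
static void sketch_flags(char out[6][16])
{
    snprintf(out[0], 16, "%#x", 255u);     /* alternate form  -> "0xff"   */
    snprintf(out[1], 16, "%05d", 42);      /* zero padding    -> "00042"  */
    snprintf(out[2], 16, "%-5d|", 42);     /* left adjustment -> "42   |" */
    snprintf(out[3], 16, "%+d", 42);       /* forced sign     -> "+42"    */
    snprintf(out[4], 16, "%.2s", "foo");   /* string precision -> "fo"    */
    snprintf(out[5], 16, "%*d", 4, 42);    /* width from arg  -> "  42"   */
}
```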

EXAMPLES
       In the following a few snippets of selected use cases of OSSP str are
       presented:

       Splice a String into Another
            char v1[16] = "foo bar quux"; /* writable buffer, not a literal */
            char *v2 = "baz";
            str_splice(v1, 3, 5, v2, 0);
            /* now we have v1 = "foobazquux" */
            ...

       Tokenize a String
            /* writable buffer: str_token() rewrites the string in place */
            char var[] = " foo \t \" bar 'baz'\" q'uu'x #comment";
            char *tok, *p;
            p = var;
            while ((tok = str_token(&p, " \t", "\"'", "#",
                                    STR_STRIPQUOTES)) != NULL) {
                /* here we enter three times:
                   1. tok = "foo"
                   2. tok = " bar 'baz'"
                   3. tok = "quux" */
                ...
            }

       Match a String
            char *var = "foo:bar";
            if (str_parse(var, "m/^.+?:.+$/") > 0) {
                /* var matched */
                ...
            }

       Match a String and Go Ahead with Details
            char *var = "foo:bar";
            char *cp, *v1, *v2;
            if (str_parse(var, "m/^(.+?):(.+)$/b", &cp, &v1, &v2) > 0) {
                ...
                /* now we have:
                   cp = "foo\0bar\0" and v1 and v2 pointing
                   into it, i.e., v1 = "foo", v2 = "bar" */
                ...
                free(cp);
            }

       Substitute Text in a String
            char *var = "foo:bar";
            char *subst = "quux";
            char *new;
            str_parse(var, "s/^(.+?):(.+)$/$1-%s-$2/", &new, subst);
            ...
            /* now we have: var = "foo:bar", new = "foo-quux-bar" */
            ...
            free(new);

       Format a String
            char *v0 = "abc..."; /* length not guessable */
            char *v1 = "foo";
            unsigned int v2 = 0xDEAD;
            int v3 = 42;
            char *cp;
            int n;

            /* first pass: s is NULL, so only the length is calculated */
            n = str_format(NULL, 0, "%s|%5s-%X-%04d", v0, v1, v2, v3);
            cp = malloc(n + 1); /* one extra byte for the terminating NUL */
            str_format(cp, n + 1, "%s|%5s-%X-%04d", v0, v1, v2, v3);
            /* now we have cp = "abc...|  foo-DEAD-0042" */
            ...
            free(cp);

SEE ALSO
       string(3), printf(3), perlre(1).

HISTORY
       OSSP str was written in November and December 1999 by Ralf S.
       Engelschall for the OSSP project. As building blocks various existing
       code was used and recycled: for the str_token(3) implementation an
       ancient strtok(3) flavor from William Deich 1991 was cleaned up and
       adjusted. As the background parsing engine for str_parse(3) a heavily
       stripped down version of Philip Hazel's Perl Compatible Regular
       Expression (PCRE) library (initially version 2.08 and now 3.9) was
       used. The str_format(3) implementation was based on Panos Tsirigotis'
       sprintf(3) code as adjusted by the Apache Software Foundation (ASF)
       1998. The formatting engine was stripped down and enhanced to support
       internal extensions which were required by str_format(3) and
       str_parse(3).

AUTHOR
        Ralf S. Engelschall
        rse@engelschall.com
        www.engelschall.com

12-Oct-2005                       Str 0.9.12                            str(3)