@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
-@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006
+@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011
@c Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.
-@page
@node Simple Data Types
@section Simple Generic Data Types
* Characters:: Single characters.
* Character Sets:: Sets of characters.
* Strings:: Sequences of characters.
-* Regular Expressions:: Pattern matching and substitution.
+* Bytevectors:: Sequences of bytes.
* Symbols:: Symbols.
* Keywords:: Self-quoting, customizable display keywords.
* Other Types:: "Functionality-centric" data types.
* Complex:: Complex number operations.
* Arithmetic:: Arithmetic functions.
* Scientific:: Scientific functions.
-* Primitive Numerics:: Primitive numeric functions.
* Bitwise Operations:: Logical AND, OR, NOT, and so on.
* Random:: Random number generation.
@end menu
form. Conversion between these two representations is automatic and
completely invisible to the Scheme level programmer.
-The infinities @samp{+inf.0} and @samp{-inf.0} are considered to be
-inexact integers. They are explained in detail in the next section,
-together with reals and rationals.
-
C has a host of different integer types, and Guile offers a host of
functions to convert between them and the @code{SCM} representation.
For example, a C @code{int} can be handled with @code{scm_to_int} and
The motivation for this behavior is that the inexactness of a number
should not be lost silently. If you want to allow inexact integers,
-you can explicitely insert a call to @code{inexact->exact} or to its C
+you can explicitly insert a call to @code{inexact->exact} or to its C
equivalent @code{scm_inexact_to_exact}. (Only inexact integers will
be converted by this call into exact integers; inexact non-integers
will become exact fractions.)
@m{\pi,pi}.
Guile can represent both exact and inexact rational numbers, but it
-can not represent irrational numbers. Exact rationals are represented
-by storing the numerator and denominator as two exact integers.
-Inexact rationals are stored as floating point numbers using the C
-type @code{double}.
+cannot represent precise finite irrational numbers. Exact rationals are
+represented by storing the numerator and denominator as two exact
+integers. Inexact rationals are stored as floating point numbers using
+the C type @code{double}.
Exact rationals are written as a fraction of integers. There must be
no whitespace around the slash:
4.0
@end lisp
-The limited precision of Guile's encoding means that any ``real'' number
-in Guile can be written in a rational form, by multiplying and then dividing
-by sufficient powers of 10 (or in fact, 2). For example,
-@samp{-0.00000142857931198} is the same as @minus{}142857931198 divided by
-100000000000000000. In Guile's current incarnation, therefore, the
-@code{rational?} and @code{real?} predicates are equivalent.
-
-
-Dividing by an exact zero leads to a error message, as one might
-expect. However, dividing by an inexact zero does not produce an
-error. Instead, the result of the division is either plus or minus
-infinity, depending on the sign of the divided number.
+The limited precision of Guile's encoding means that any finite ``real''
+number in Guile can be written in a rational form, by multiplying and
+then dividing by sufficient powers of 10 (or in fact, 2). For example,
+@samp{-0.00000142857931198} is the same as @minus{}142857931198 divided
+by 100000000000000000. In Guile's current incarnation, therefore, the
+@code{rational?} and @code{real?} predicates are equivalent for finite
+numbers.
-The infinities are written @samp{+inf.0} and @samp{-inf.0},
-respectivly. This syntax is also recognized by @code{read} as an
-extension to the usual Scheme syntax.
-Dividing zero by zero yields something that is not a number at all:
-@samp{+nan.0}. This is the special `not a number' value.
+Dividing by an exact zero leads to an error message, as one might expect.
+However, dividing by an inexact zero does not produce an error.
+Instead, the result of the division is either plus or minus infinity,
+depending on the sign of the divided number and the sign of the zero
+divisor (some platforms support signed zeroes @samp{-0.0} and
+@samp{+0.0}; @samp{0.0} is the same as @samp{+0.0}).
+
+Dividing zero by an inexact zero yields a @acronym{NaN} (`not a number')
+value, although such values are actually considered numbers by Scheme.
+Attempts to compare a @acronym{NaN} value with any number (including
+itself) using @code{=}, @code{<}, @code{>}, @code{<=} or @code{>=}
+always return @code{#f}. Although a @acronym{NaN} value is not
+@code{=} to itself, it is both @code{eqv?} and @code{equal?} to itself
+and other @acronym{NaN} values. However, the preferred way to test for
+them is by using @code{nan?}.
+
+The real @acronym{NaN} values and infinities are written @samp{+nan.0},
+@samp{+inf.0} and @samp{-inf.0}. This syntax is also recognized by
+@code{read} as an extension to the usual Scheme syntax. These special
+values are considered by Scheme to be inexact real numbers but not
+rational. Note that non-real complex numbers may also contain
+infinities or @acronym{NaN} values in their real or imaginary parts. To
+test a real number to see if it is infinite, a @acronym{NaN} value, or
+neither, use @code{inf?}, @code{nan?}, or @code{finite?}, respectively.
+Every real number in Scheme belongs to precisely one of those three
+classes.
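+
+As an illustrative sketch, assuming a platform with @acronym{IEEE} 754
+floating point, the following expressions show how these special values
+behave:
+
+@lisp
+(/ 1 0.0)             @result{} +inf.0
+(/ -1 0.0)            @result{} -inf.0
+(/ 0.0 0.0)           @result{} +nan.0
+(= +nan.0 +nan.0)     @result{} #f
+(eqv? +nan.0 +nan.0)  @result{} #t
+(nan? +nan.0)         @result{} #t
+(inf? -inf.0)         @result{} #t
+(finite? 1.5)         @result{} #t
+(finite? +inf.0)      @result{} #f
+(rational? +inf.0)    @result{} #f
+@end lisp
+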
On platforms that follow @acronym{IEEE} 754 for their floating point
arithmetic, the @samp{+inf.0}, @samp{-inf.0}, and @samp{+nan.0} values
They behave in arithmetic operations like @acronym{IEEE} 754 describes
it, i.e., @code{(= +nan.0 +nan.0)} @result{} @code{#f}.
-The infinities are inexact integers and are considered to be both even
-and odd. While @samp{+nan.0} is not @code{=} to itself, it is
-@code{eqv?} to itself.
-
-To test for the special values, use the functions @code{inf?} and
-@code{nan?}.
-
@deffn {Scheme Procedure} real? obj
@deffnx {C Function} scm_real_p (obj)
Return @code{#t} if @var{obj} is a real number, else @code{#f}. Note
Note that the set of integer values forms a subset of the set of
rational numbers, i. e. the predicate will also be fulfilled if
@var{x} is an integer number.
-
-Since Guile can not represent irrational numbers, every number
-satisfying @code{real?} also satisfies @code{rational?} in Guile.
@end deffn
@deffn {Scheme Procedure} rationalize x eps
@deffn {Scheme Procedure} inf? x
@deffnx {C Function} scm_inf_p (x)
-Return @code{#t} if @var{x} is either @samp{+inf.0} or @samp{-inf.0},
-@code{#f} otherwise.
+Return @code{#t} if the real number @var{x} is @samp{+inf.0} or
+@samp{-inf.0}. Otherwise return @code{#f}.
@end deffn
@deffn {Scheme Procedure} nan? x
@deffnx {C Function} scm_nan_p (x)
-Return @code{#t} if @var{x} is @samp{+nan.0}, @code{#f} otherwise.
+Return @code{#t} if the real number @var{x} is @samp{+nan.0}, or
+@code{#f} otherwise.
+@end deffn
+
+@deffn {Scheme Procedure} finite? x
+@deffnx {C Function} scm_finite_p (x)
+Return @code{#t} if the real number @var{x} is neither infinite nor a
+@acronym{NaN}, @code{#f} otherwise.
@end deffn
@deffn {Scheme Procedure} nan
@deffnx {C Function} scm_nan ()
-Return NaN.
+Return @samp{+nan.0}, a @acronym{NaN} value.
@end deffn
@deffn {Scheme Procedure} inf
@deffnx {C Function} scm_inf ()
-Return Inf.
+Return @samp{+inf.0}, positive infinity.
@end deffn
@deffn {Scheme Procedure} numerator x
@end deftypefn
@deftypefn {C Function} SCM scm_from_double (double val)
-Return the @code{SCM} value that representats @var{val}. The returned
+Return the @code{SCM} value that represents @var{val}. The returned
value is inexact according to the predicate @code{inexact?}, but it
will be exactly equal to @var{val}.
@end deftypefn
(remainder 13 4) @result{} 1
(remainder -13 4) @result{} -1
@end lisp
+
+See also @code{euclidean-quotient}, @code{euclidean-remainder} and
+related operations in @ref{Arithmetic}.
@end deffn
@c begin (texi-doc-string "guile" "modulo")
(modulo 13 -4) @result{} -3
(modulo -13 -4) @result{} -1
@end lisp
+
+See also @code{euclidean-quotient}, @code{euclidean-remainder} and
+related operations in @ref{Arithmetic}.
@end deffn
@c begin (texi-doc-string "guile" "gcd")
The following procedures read and write numbers according to their
external representation as defined by R5RS (@pxref{Lexical structure,
R5RS Lexical Structure,, r5rs, The Revised^5 Report on the Algorithmic
-Language Scheme}). @xref{The ice-9 i18n Module, the @code{(ice-9
+Language Scheme}). @xref{Number Input and Output, the @code{(ice-9
i18n)} module}, for locale-dependent number parsing.
@deffn {Scheme Procedure} number->string n [radix]
@rnindex magnitude
@rnindex angle
-@deffn {Scheme Procedure} make-rectangular real imaginary
-@deffnx {C Function} scm_make_rectangular (real, imaginary)
-Return a complex number constructed of the given @var{real} and
-@var{imaginary} parts.
+@deffn {Scheme Procedure} make-rectangular real_part imaginary_part
+@deffnx {C Function} scm_make_rectangular (real_part, imaginary_part)
+Return a complex number constructed of the given @var{real_part} and
+@var{imaginary_part} parts.
@end deffn
@deffn {Scheme Procedure} make-polar x y
@rnindex *
@rnindex -
@rnindex /
+@findex 1+
+@findex 1-
@rnindex abs
@rnindex floor
@rnindex ceiling
@rnindex truncate
@rnindex round
+@rnindex euclidean/
+@rnindex euclidean-quotient
+@rnindex euclidean-remainder
+@rnindex centered/
+@rnindex centered-quotient
+@rnindex centered-remainder
The C arithmetic functions below always takes two arguments, while the
Scheme functions can take an arbitrary number. When you need to
called with one argument @var{z1}, 1/@var{z1} is returned.
@end deffn
+@deffn {Scheme Procedure} 1+ z
+@deffnx {C Function} scm_oneplus (z)
+Return @math{@var{z} + 1}.
+@end deffn
+
+@deffn {Scheme Procedure} 1- z
+@deffnx {C Function} scm_oneminus (z)
+Return @math{@var{z} - 1}.
+@end deffn
+
@c begin (texi-doc-string "guile" "abs")
@deffn {Scheme Procedure} abs x
@deffnx {C Function} scm_abs (x)
values.
@end deftypefn
+@deffn {Scheme Procedure} euclidean/ x y
+@deffnx {Scheme Procedure} euclidean-quotient x y
+@deffnx {Scheme Procedure} euclidean-remainder x y
+@deffnx {C Function} scm_euclidean_quo_and_rem (x, y)
+@deffnx {C Function} scm_euclidean_quotient (x, y)
+@deffnx {C Function} scm_euclidean_remainder (x, y)
+These procedures accept two real numbers @var{x} and @var{y}, where the
+divisor @var{y} must be non-zero. @code{euclidean-quotient} returns the
+integer @var{q} and @code{euclidean-remainder} returns the real number
+@var{r} such that @math{@var{x} = @var{q}*@var{y} + @var{r}} and
+@math{0 <= @var{r} < abs(@var{y})}. @code{euclidean/} returns both @var{q} and
+@var{r}, and is more efficient than computing each separately. Note
+that when @math{@var{y} > 0}, @code{euclidean-quotient} returns
+@math{floor(@var{x}/@var{y})}, otherwise it returns
+@math{ceiling(@var{x}/@var{y})}.
+
+Note that these operators are equivalent to the R6RS operators
+@code{div}, @code{mod}, and @code{div-and-mod}.
+
+@lisp
+(euclidean-quotient 123 10) @result{} 12
+(euclidean-remainder 123 10) @result{} 3
+(euclidean/ 123 10) @result{} 12 and 3
+(euclidean/ 123 -10) @result{} -12 and 3
+(euclidean/ -123 10) @result{} -13 and 7
+(euclidean/ -123 -10) @result{} 13 and 7
+(euclidean/ -123.2 -63.5) @result{} 2.0 and 3.8
+(euclidean/ 16/3 -10/7) @result{} -3 and 22/21
+@end lisp
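+
+Since @code{euclidean/} returns two values, one way to capture both is
+@code{call-with-values}. The following is only a sketch; collecting
+the results with @code{list} is an arbitrary choice made for
+illustration:
+
+@lisp
+(call-with-values
+    (lambda () (euclidean/ -123 10))
+  list)
+@result{} (-13 7)
+@end lisp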
+@end deffn
+
+@deffn {Scheme Procedure} centered/ x y
+@deffnx {Scheme Procedure} centered-quotient x y
+@deffnx {Scheme Procedure} centered-remainder x y
+@deffnx {C Function} scm_centered_quo_and_rem (x, y)
+@deffnx {C Function} scm_centered_quotient (x, y)
+@deffnx {C Function} scm_centered_remainder (x, y)
+These procedures accept two real numbers @var{x} and @var{y}, where the
+divisor @var{y} must be non-zero. @code{centered-quotient} returns the
+integer @var{q} and @code{centered-remainder} returns the real number
+@var{r} such that @math{@var{x} = @var{q}*@var{y} + @var{r}} and
+@math{-abs(@var{y}/2) <= @var{r} < abs(@var{y}/2)}. @code{centered/}
+returns both @var{q} and @var{r}, and is more efficient than computing
+each separately.
+
+Note that @code{centered-quotient} returns @math{@var{x}/@var{y}}
+rounded to the nearest integer. When @math{@var{x}/@var{y}} lies
+exactly half-way between two integers, the tie is broken according to
+the sign of @var{y}. If @math{@var{y} > 0}, ties are rounded toward
+positive infinity, otherwise they are rounded toward negative infinity.
+This is a consequence of the requirement that @math{-abs(@var{y}/2) <= @var{r} < abs(@var{y}/2)}.
+
+Note that these operators are equivalent to the R6RS operators
+@code{div0}, @code{mod0}, and @code{div0-and-mod0}.
+
+@lisp
+(centered-quotient 123 10) @result{} 12
+(centered-remainder 123 10) @result{} 3
+(centered/ 123 10) @result{} 12 and 3
+(centered/ 123 -10) @result{} -12 and 3
+(centered/ -123 10) @result{} -12 and -3
+(centered/ -123 -10) @result{} 12 and -3
+(centered/ -123.2 -63.5) @result{} 2.0 and 3.8
+(centered/ 16/3 -10/7) @result{} -4 and -8/21
+@end lisp
+@end deffn
+
@node Scientific
@subsubsection Scientific Functions
@end deffn
-@node Primitive Numerics
-@subsubsection Primitive Numeric Functions
-
-Many of Guile's numeric procedures which accept any kind of numbers as
-arguments, including complex numbers, are implemented as Scheme
-procedures that use the following real number-based primitives. These
-primitives signal an error if they are called with complex arguments.
-
-@c begin (texi-doc-string "guile" "$abs")
-@deffn {Scheme Procedure} $abs x
-Return the absolute value of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$sqrt")
-@deffn {Scheme Procedure} $sqrt x
-Return the square root of @var{x}.
-@end deffn
-
-@deffn {Scheme Procedure} $expt x y
-@deffnx {C Function} scm_sys_expt (x, y)
-Return @var{x} raised to the power of @var{y}. This
-procedure does not accept complex arguments.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$sin")
-@deffn {Scheme Procedure} $sin x
-Return the sine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$cos")
-@deffn {Scheme Procedure} $cos x
-Return the cosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$tan")
-@deffn {Scheme Procedure} $tan x
-Return the tangent of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$asin")
-@deffn {Scheme Procedure} $asin x
-Return the arcsine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$acos")
-@deffn {Scheme Procedure} $acos x
-Return the arccosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$atan")
-@deffn {Scheme Procedure} $atan x
-Return the arctangent of @var{x} in the range @minus{}@math{PI/2} to
-@math{PI/2}.
-@end deffn
-
-@deffn {Scheme Procedure} $atan2 x y
-@deffnx {C Function} scm_sys_atan2 (x, y)
-Return the arc tangent of the two arguments @var{x} and
-@var{y}. This is similar to calculating the arc tangent of
-@var{x} / @var{y}, except that the signs of both arguments
-are used to determine the quadrant of the result. This
-procedure does not accept complex arguments.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$exp")
-@deffn {Scheme Procedure} $exp x
-Return e to the power of @var{x}, where e is the base of natural
-logarithms (2.71828@dots{}).
-@end deffn
-
-@c begin (texi-doc-string "guile" "$log")
-@deffn {Scheme Procedure} $log x
-Return the natural logarithm of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$sinh")
-@deffn {Scheme Procedure} $sinh x
-Return the hyperbolic sine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$cosh")
-@deffn {Scheme Procedure} $cosh x
-Return the hyperbolic cosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$tanh")
-@deffn {Scheme Procedure} $tanh x
-Return the hyperbolic tangent of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$asinh")
-@deffn {Scheme Procedure} $asinh x
-Return the hyperbolic arcsine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$acosh")
-@deffn {Scheme Procedure} $acosh x
-Return the hyperbolic arccosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$atanh")
-@deffn {Scheme Procedure} $atanh x
-Return the hyperbolic arctangent of @var{x}.
-@end deffn
-
-C functions for the above are provided by the standard mathematics
-library. Naturally these expect and return @code{double} arguments
-(@pxref{Mathematics,,, libc, GNU C Library Reference Manual}).
-
-@multitable {xx} {Scheme Procedure} {C Function}
-@item @tab Scheme Procedure @tab C Function
-
-@item @tab @code{$abs} @tab @code{fabs}
-@item @tab @code{$sqrt} @tab @code{sqrt}
-@item @tab @code{$sin} @tab @code{sin}
-@item @tab @code{$cos} @tab @code{cos}
-@item @tab @code{$tan} @tab @code{tan}
-@item @tab @code{$asin} @tab @code{asin}
-@item @tab @code{$acos} @tab @code{acos}
-@item @tab @code{$atan} @tab @code{atan}
-@item @tab @code{$atan2} @tab @code{atan2}
-@item @tab @code{$exp} @tab @code{exp}
-@item @tab @code{$expt} @tab @code{pow}
-@item @tab @code{$log} @tab @code{log}
-@item @tab @code{$sinh} @tab @code{sinh}
-@item @tab @code{$cosh} @tab @code{cosh}
-@item @tab @code{$tanh} @tab @code{tanh}
-@item @tab @code{$asinh} @tab @code{asinh}
-@item @tab @code{$acosh} @tab @code{acosh}
-@item @tab @code{$atanh} @tab @code{atanh}
-@end multitable
-
-@code{asinh}, @code{acosh} and @code{atanh} are C99 standard but might
-not be available on older systems. Guile provides the following
-equivalents (on all systems).
-
-@deftypefn {C Function} double scm_asinh (double x)
-@deftypefnx {C Function} double scm_acosh (double x)
-@deftypefnx {C Function} double scm_atanh (double x)
-Return the hyperbolic arcsine, arccosine or arctangent of @var{x}
-respectively.
-@end deftypefn
-
-
@node Bitwise Operations
@subsubsection Bitwise Operations
@subsubsection Random Number Generation
Pseudo-random numbers are generated from a random state object, which
-can be created with @code{seed->random-state}. The @var{state}
-parameter to the various functions below is optional, it defaults to
-the state object in the @code{*random-state*} variable.
+can be created with @code{seed->random-state} or
+@code{datum->random-state}. An external representation (i.e. one
+which can be written with @code{write} and read with @code{read}) of a
+random state object can be obtained via
+@code{random-state->datum}. The @var{state} parameter to the
+various functions below is optional; it defaults to the state object
+in the @code{*random-state*} variable.
@deffn {Scheme Procedure} copy-random-state [state]
@deffnx {C Function} scm_copy_random_state (state)
Return a new random state using @var{seed}.
@end deffn
+@deffn {Scheme Procedure} datum->random-state datum
+@deffnx {C Function} scm_datum_to_random_state (datum)
+Return a new random state from @var{datum}, which should have been
+obtained by @code{random-state->datum}.
+@end deffn
+
+@deffn {Scheme Procedure} random-state->datum state
+@deffnx {C Function} scm_random_state_to_datum (state)
+Return a datum representation of @var{state} that may be written out and
+read back with the Scheme reader.
+@end deffn
+
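+As a hedged sketch of how these two procedures fit together, the
+example below snapshots a state, draws a number, and then draws again
+from a state rebuilt from the snapshot. The seed @code{42}, the bound
+@code{1000} and the variable names are arbitrary choices made for
+illustration:
+
+@lisp
+(define state (seed->random-state 42))
+(define saved (random-state->datum state))
+(define a (random 1000 state))
+(define b (random 1000 (datum->random-state saved)))
+(= a b) @result{} #t
+@end lisp
+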
@defvar *random-state*
The global random state used by the above functions when the
@var{state} parameter is not given.
@end defvar
+Note that the initial value of @code{*random-state*} is the same every
+time Guile starts up. Therefore, if you don't pass a @var{state}
+parameter to the above procedures, and you don't set
+@code{*random-state*} to @code{(seed->random-state your-seed)}, where
+@code{your-seed} is something that @emph{isn't} the same every time,
+you'll get the same sequence of ``random'' numbers on every run.
+
+For example, unless the relevant source code has changed, @code{(map
+random (cdr (iota 19)))}, if the first use of random numbers since
+Guile started up, will always give:
+
+@lisp
+(map random (cdr (iota 19)))
+@result{}
+(0 1 1 2 2 2 1 2 6 7 10 0 5 3 12 5 5 12)
+@end lisp
+
+To use the time of day as the random seed, you can use code like this:
+
+@lisp
+(let ((time (gettimeofday)))
+ (set! *random-state*
+ (seed->random-state (+ (car time)
+ (cdr time)))))
+@end lisp
+
+@noindent
+And then (depending on the time of day, of course):
+
+@lisp
+(map random (cdr (iota 19)))
+@result{}
+(0 0 1 0 2 4 5 4 5 5 9 3 10 1 8 3 14 17)
+@end lisp
+
+For security applications, such as password generation, you should use
+more bits of seed. Otherwise an open source password generator could
+be attacked by guessing the seed@dots{} but that's a subject for
+another manual.
+
@node Characters
@subsection Characters
@tpindex Characters
+In Scheme, there is a data type to describe a single character.
+
+Defining what exactly a character @emph{is} can be more complicated
+than it seems. Guile follows the advice of R6RS and uses The Unicode
+Standard to help define what a character is. So, for Guile, a
+character is anything in the Unicode Character Database.
+
+@cindex code point
+@cindex Unicode code point
+
+The Unicode Character Database is basically a table of characters
+indexed using integers called 'code points'. Valid code points are in
+the ranges 0 to @code{#xD7FF} inclusive or @code{#xE000} to
+@code{#x10FFFF} inclusive, which is about 1.1 million code points.
+
+@cindex designated code point
+@cindex code point, designated
+
+Any code point that has been assigned to a character or that has
+otherwise been given a meaning by Unicode is called a 'designated code
+point'. Most of the designated code points, about 200,000 of them,
+indicate characters, accents or other combining marks that modify
+other characters, symbols, whitespace, and control characters. Some
+are not characters but indicators that suggest how to format or
+display neighboring characters.
+
+@cindex reserved code point
+@cindex code point, reserved
+
+If a code point is not a designated code point -- if it has not been
+assigned to a character by The Unicode Standard -- it is a 'reserved
+code point', meaning that it is reserved for future use. Most of
+the code points, about 800,000, are 'reserved code points'.
+
+By convention, a Unicode code point is written as
+``U+XXXX'' where ``XXXX'' is a hexadecimal number. Please note that
+this convenient notation is not valid code. Guile does not interpret
+``U+XXXX'' as a character.
+
In Scheme, a character literal is written as @code{#\@var{name}} where
@var{name} is the name of the character that you want. Printable
characters have their usual single character name; for example,
-@code{#\a} is a lower case @code{a}.
+@code{#\a} is a lower case @code{a}.
+
+Some of the code points are 'combining characters' that are not meant
+to be printed by themselves but are instead meant to modify the
+appearance of the previous character. For combining characters, an
+alternate form of the character literal is @code{#\} followed by
+U+25CC (a small, dotted circle), followed by the combining character.
+This allows the combining character to be drawn on the circle, not on
+the backslash of @code{#\}.
+
+Many of the non-printing characters, such as whitespace characters and
+control characters, also have names.
+
+The most commonly used non-printing characters have long character
+names, described in the table below.
+
+@multitable {@code{#\backspace}} {Preferred}
+@item Character Name @tab Codepoint
+@item @code{#\nul} @tab U+0000
+@item @code{#\alarm} @tab U+0007
+@item @code{#\backspace} @tab U+0008
+@item @code{#\tab} @tab U+0009
+@item @code{#\linefeed} @tab U+000A
+@item @code{#\newline} @tab U+000A
+@item @code{#\vtab} @tab U+000B
+@item @code{#\page} @tab U+000C
+@item @code{#\return} @tab U+000D
+@item @code{#\esc} @tab U+001B
+@item @code{#\space} @tab U+0020
+@item @code{#\delete} @tab U+007F
+@end multitable
-Most of the ``control characters'' (those below codepoint 32) in the
-@acronym{ASCII} character set, as well as the space, may be referred
-to by longer names: for example, @code{#\tab}, @code{#\esc},
-@code{#\stx}, and so on. The following table describes the
-@acronym{ASCII} names for each character.
+There are also short names for all of the ``C0 control characters''
+(those with code points below 32). The following table lists the short
+name for each character.
@multitable @columnfractions .25 .25 .25 .25
@item 0 = @code{#\nul}
@tab 7 = @code{#\bel}
@item 8 = @code{#\bs}
@tab 9 = @code{#\ht}
- @tab 10 = @code{#\nl}
+ @tab 10 = @code{#\lf}
@tab 11 = @code{#\vt}
-@item 12 = @code{#\np}
+@item 12 = @code{#\ff}
@tab 13 = @code{#\cr}
@tab 14 = @code{#\so}
@tab 15 = @code{#\si}
@item 32 = @code{#\sp}
@end multitable
-The ``delete'' character (octal 177) may be referred to with the name
+The short name for the ``delete'' character (code point U+007F) is
@code{#\del}.
-Several characters have more than one name:
+There are also a few alternative names left over for compatibility with
+previous versions of Guile.
-@multitable {@code{#\backspace}} {Original}
-@item Alias @tab Original
-@item @code{#\space} @tab @code{#\sp}
-@item @code{#\newline} @tab @code{#\nl}
-@item @code{#\tab} @tab @code{#\ht}
-@item @code{#\backspace} @tab @code{#\bs}
-@item @code{#\return} @tab @code{#\cr}
-@item @code{#\page} @tab @code{#\np}
+@multitable {@code{#\backspace}} {Preferred}
+@item Alternate @tab Standard
+@item @code{#\nl} @tab @code{#\newline}
+@item @code{#\np} @tab @code{#\page}
@item @code{#\null} @tab @code{#\nul}
@end multitable
+Characters may also be written using their code point values. They can
+be written as an octal number, such as @code{#\10} for
+@code{#\bs} or @code{#\177} for @code{#\del}.
+
+If one prefers hex to octal, there is an additional syntax for character
+escapes: @code{#\xHHHH} -- the letter 'x' followed by a hexadecimal
+number of one to eight digits.
+
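+For illustration, the following comparisons should all hold, since
+each pair names the same code point:
+
+@lisp
+(char=? #\x41 #\A)     @result{} #t   ; hexadecimal 41 is 65
+(char=? #\101 #\A)     @result{} #t   ; octal 101 is 65
+(char=? #\10 #\bs)     @result{} #t
+(char=? #\x7f #\del)   @result{} #t
+@end lisp
+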
@rnindex char?
@deffn {Scheme Procedure} char? x
@deffnx {C Function} scm_char_p (x)
Return @code{#t} iff @var{x} is a character, else @code{#f}.
@end deffn
+Fundamentally, the character comparison operations below are
+numeric comparisons of the character's code points.
+
@rnindex char=?
@deffn {Scheme Procedure} char=? x y
-Return @code{#t} iff @var{x} is the same character as @var{y}, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is equal to the code point
+of @var{y}, else @code{#f}.
@end deffn
@rnindex char<?
@deffn {Scheme Procedure} char<? x y
-Return @code{#t} iff @var{x} is less than @var{y} in the @acronym{ASCII} sequence,
-else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is less than the code
+point of @var{y}, else @code{#f}.
@end deffn
@rnindex char<=?
@deffn {Scheme Procedure} char<=? x y
-Return @code{#t} iff @var{x} is less than or equal to @var{y} in the
-@acronym{ASCII} sequence, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is less than or equal
+to the code point of @var{y}, else @code{#f}.
@end deffn
@rnindex char>?
@deffn {Scheme Procedure} char>? x y
-Return @code{#t} iff @var{x} is greater than @var{y} in the @acronym{ASCII}
-sequence, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is greater than the
+code point of @var{y}, else @code{#f}.
@end deffn
@rnindex char>=?
@deffn {Scheme Procedure} char>=? x y
-Return @code{#t} iff @var{x} is greater than or equal to @var{y} in the
-@acronym{ASCII} sequence, else @code{#f}.
+Return @code{#t} iff the code point of @var{x} is greater than or
+equal to the code point of @var{y}, else @code{#f}.
@end deffn
+@cindex case folding
+
+Case-insensitive character comparisons use @emph{Unicode case
+folding}. In case folding comparisons, if a character is lowercase
+and has an uppercase form that can be expressed as a single character,
+it is converted to uppercase before comparison. All other characters
+undergo no conversion before the comparison occurs. This includes the
+German sharp S (Eszett), which is not uppercased before comparison
+because its uppercase form has two characters. Unicode case folding
+is language independent: it uses rules that are generally true, but
+it cannot cover all cases for all languages.
+
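+As a small sketch of the difference between plain and case-insensitive
+comparison, using @acronym{ASCII} letters only:
+
+@lisp
+(char=? #\a #\A)      @result{} #f
+(char-ci=? #\a #\A)   @result{} #t
+(char<? #\a #\B)      @result{} #f   ; 97 is not less than 66
+(char-ci<? #\a #\B)   @result{} #t
+@end lisp
+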
@rnindex char-ci=?
@deffn {Scheme Procedure} char-ci=? x y
-Return @code{#t} iff @var{x} is the same character as @var{y} ignoring
-case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is the same
+as the case-folded code point of @var{y}, else @code{#f}.
@end deffn
@rnindex char-ci<?
@deffn {Scheme Procedure} char-ci<? x y
-Return @code{#t} iff @var{x} is less than @var{y} in the @acronym{ASCII} sequence
-ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is less
+than the case-folded code point of @var{y}, else @code{#f}.
@end deffn
@rnindex char-ci<=?
@deffn {Scheme Procedure} char-ci<=? x y
-Return @code{#t} iff @var{x} is less than or equal to @var{y} in the
-@acronym{ASCII} sequence ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is less
+than or equal to the case-folded code point of @var{y}, else
+@code{#f}.
@end deffn
@rnindex char-ci>?
@deffn {Scheme Procedure} char-ci>? x y
-Return @code{#t} iff @var{x} is greater than @var{y} in the @acronym{ASCII}
-sequence ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is greater
+than the case-folded code point of @var{y}, else @code{#f}.
@end deffn
@rnindex char-ci>=?
@deffn {Scheme Procedure} char-ci>=? x y
-Return @code{#t} iff @var{x} is greater than or equal to @var{y} in the
-@acronym{ASCII} sequence ignoring case, else @code{#f}.
+Return @code{#t} iff the case-folded code point of @var{x} is greater
+than or equal to the case-folded code point of @var{y}, else
+@code{#f}.
@end deffn
@rnindex char-alphabetic?
@code{#f}.
@end deffn
+@deffn {Scheme Procedure} char-general-category chr
+@deffnx {C Function} scm_char_general_category (chr)
+Return a symbol giving the two-letter name of the Unicode general
+category assigned to @var{chr} or @code{#f} if no named category is
+assigned. The following table provides a list of category names along
+with their meanings.
+
+@multitable @columnfractions .1 .4 .1 .4
+@item Lu
+ @tab Uppercase letter
+ @tab Pf
+ @tab Final quote punctuation
+@item Ll
+ @tab Lowercase letter
+ @tab Po
+ @tab Other punctuation
+@item Lt
+ @tab Titlecase letter
+ @tab Sm
+ @tab Math symbol
+@item Lm
+ @tab Modifier letter
+ @tab Sc
+ @tab Currency symbol
+@item Lo
+ @tab Other letter
+ @tab Sk
+ @tab Modifier symbol
+@item Mn
+ @tab Non-spacing mark
+ @tab So
+ @tab Other symbol
+@item Mc
+ @tab Combining spacing mark
+ @tab Zs
+ @tab Space separator
+@item Me
+ @tab Enclosing mark
+ @tab Zl
+ @tab Line separator
+@item Nd
+ @tab Decimal digit number
+ @tab Zp
+ @tab Paragraph separator
+@item Nl
+ @tab Letter number
+ @tab Cc
+ @tab Control
+@item No
+ @tab Other number
+ @tab Cf
+ @tab Format
+@item Pc
+ @tab Connector punctuation
+ @tab Cs
+ @tab Surrogate
+@item Pd
+ @tab Dash punctuation
+ @tab Co
+ @tab Private use
+@item Ps
+ @tab Open punctuation
+ @tab Cn
+ @tab Unassigned
+@item Pe
+ @tab Close punctuation
+ @tab
+ @tab
+@item Pi
+ @tab Initial quote punctuation
+ @tab
+ @tab
+@end multitable
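+
+A few illustrative calls, with the characters chosen arbitrarily:
+
+@lisp
+(char-general-category #\a)       @result{} Ll
+(char-general-category #\A)       @result{} Lu
+(char-general-category #\5)       @result{} Nd
+(char-general-category #\space)   @result{} Zs
+(char-general-category #\newline) @result{} Cc
+@end lisp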
+@end deffn
+
@rnindex char->integer
@deffn {Scheme Procedure} char->integer chr
@deffnx {C Function} scm_char_to_integer (chr)
-Return the number corresponding to ordinal position of @var{chr} in the
-@acronym{ASCII} sequence.
+Return the code point of @var{chr}.
@end deffn
@rnindex integer->char
@deffn {Scheme Procedure} integer->char n
@deffnx {C Function} scm_integer_to_char (n)
-Return the character at position @var{n} in the @acronym{ASCII} sequence.
+Return the character that has code point @var{n}. The integer @var{n}
+must be a valid code point. Valid code points are in the ranges 0 to
+@code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF} inclusive.
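+
+As a brief sketch, @code{integer->char} and @code{char->integer} can be
+used to round-trip between characters and code points; the values
+below are arbitrary examples:
+
+@lisp
+(integer->char 65)                     @result{} #\A
+(char->integer #\A)                    @result{} 65
+(char->integer (integer->char #x3BB))  @result{} 955
+@end lisp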
@end deffn
@rnindex char-upcase
Return the lowercase character version of @var{chr}.
@end deffn
+@rnindex char-titlecase
+@deffn {Scheme Procedure} char-titlecase chr
+@deffnx {C Function} scm_char_titlecase (chr)
+Return the titlecase character version of @var{chr} if one exists;
+otherwise return the uppercase version.
+
+For most characters these will be the same, but the Unicode Standard
+includes certain digraph compatibility characters, such as @code{U+01F3}
+``dz'', for which the uppercase and titlecase characters are different
+(@code{U+01F1} ``DZ'' and @code{U+01F2} ``Dz'' in this case,
+respectively).
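+
+The digraph case can be checked with a short sketch that works on code
+points, so that no non-ASCII characters need to appear in the source.
+The name @code{dz} is only illustrative.
+
+@lisp
+(define dz (integer->char #x01F3))
+(char->integer (char-upcase dz))    @result{} 497   ; U+01F1
+(char->integer (char-titlecase dz)) @result{} 498   ; U+01F2
+(char-titlecase #\a)                @result{} #\A
+@end lisp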
+@end deffn
+
+@tindex scm_t_wchar
+@deftypefn {C Function} scm_t_wchar scm_c_upcase (scm_t_wchar @var{c})
+@deftypefnx {C Function} scm_t_wchar scm_c_downcase (scm_t_wchar @var{c})
+@deftypefnx {C Function} scm_t_wchar scm_c_titlecase (scm_t_wchar @var{c})
+
+These C functions take an integer representation of a Unicode
+code point and return the code point corresponding to its uppercase,
+lowercase, and titlecase forms respectively. The type
+@code{scm_t_wchar} is a signed, 32-bit integer.
+@end deftypefn
+
@node Character Sets
@subsection Character Sets
Character sets can be created, extended, tested for the membership of a
characters and be compared to other character sets.
-The Guile implementation of character sets currently deals only with
-8-bit characters. In the future, when Guile gets support for
-international character sets, this will change, but the functions
-provided here will always then be able to efficiently cope with very
-large character sets.
-
@menu
* Character Set Predicates/Comparison::
* Iterating Over Character Sets:: Enumerate charset elements.
If @var{error} is a true value, an error is signalled if the
specified range contains characters which are not contained in
the implemented character range. If @var{error} is @code{#f},
-these characters are silently left out of the resultung
+these characters are silently left out of the resulting
character set.
The characters in @var{base_cs} are added to the result, if
If @var{error} is a true value, an error is signalled if the
specified range contains characters which are not contained in
the implemented character range. If @var{error} is @code{#f},
-these characters are silently left out of the resultung
+these characters are silently left out of the resulting
character set.
The characters are added to @var{base_cs} and @var{base_cs} is
@deffn {Scheme Procedure} ->char-set x
@deffnx {C Function} scm_to_char_set (x)
-Coerces x into a char-set. @var{x} may be a string, character or char-set. A string is converted to the set of its constituent characters; a character is converted to a singleton set; a char-set is returned as-is.
+Coerces @var{x} into a char-set. @var{x} may be a string, character or
+char-set. A string is converted to the set of its constituent
+characters; a character is converted to a singleton set; a char-set is
+returned as-is.
@end deffn
@c ===================================================================
Access the elements and other information of a character set with these
procedures.
+@deffn {Scheme Procedure} %char-set-dump cs
+Returns an association list containing debugging information
+for @var{cs}. The association list has the following entries.
+@table @code
+@item char-set
+The char-set itself
+@item len
+The number of groups of contiguous code points the char-set
+contains
+@item ranges
+A list of lists where each sublist is a range of code points
+and their associated characters
+@end table
+The return value of this function cannot be relied upon to be
+consistent between versions of Guile and should not be used in code.
+@end deffn
+
@deffn {Scheme Procedure} char-set-size cs
@deffnx {C Function} scm_char_set_size (cs)
Return the number of elements in character set @var{cs}.
Return the complement of the character set @var{cs}.
@end deffn
+Note that the complement of a character set is likely to contain many
+reserved code points (code points that are not associated with
+characters). It may be helpful to modify the output of
+@code{char-set-complement} by computing its intersection with the set
+of designated code points, @code{char-set:designated}.
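+
+A minimal sketch of this idiom follows; the name @code{non-letters} is
+only illustrative.
+
+@lisp
+(define non-letters
+  (char-set-intersection (char-set-complement char-set:letter)
+                         char-set:designated))
+(char-set-contains? non-letters #\1) @result{} #t
+(char-set-contains? non-letters #\a) @result{} #f
+@end lisp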
+
@deffn {Scheme Procedure} char-set-union . rest
@deffnx {C Function} scm_char_set_union (rest)
Return the union of all argument character sets.
@cindex charset
@cindex locale
-Currently, the contents of these character sets are recomputed upon a
-successful @code{setlocale} call (@pxref{Locales}) in order to reflect
-the characters available in the current locale's codeset. For
-instance, @code{char-set:letter} contains 52 characters under an ASCII
-locale (e.g., the default @code{C} locale) and 117 characters under an
-ISO-8859-1 (``Latin-1'') locale.
+These character sets are locale independent and are not recomputed
+upon a @code{setlocale} call. They contain characters from the whole
+range of Unicode code points. For instance, @code{char-set:letter}
+contains about 94,000 characters.
@defvr {Scheme Variable} char-set:lower-case
@defvrx {C Variable} scm_char_set_lower_case
@defvr {Scheme Variable} char-set:title-case
@defvrx {C Variable} scm_char_set_title_case
-This is empty, because ASCII has no titlecase characters.
+All single characters that function as if they were an upper-case
+letter followed by a lower-case letter.
@end defvr
@defvr {Scheme Variable} char-set:letter
@defvrx {C Variable} scm_char_set_letter
-All letters, e.g. the union of @code{char-set:lower-case} and
-@code{char-set:upper-case}.
+All letters. This includes @code{char-set:lower-case},
+@code{char-set:upper-case}, @code{char-set:title-case}, and many
+letters that have no case at all. For example, Chinese and Japanese
+characters typically have no concept of case.
@end defvr
@defvr {Scheme Variable} char-set:digit
@defvr {Scheme Variable} char-set:blank
@defvrx {C Variable} scm_char_set_blank
-All horizontal whitespace characters, that is @code{#\space} and
-@code{#\tab}.
+All horizontal whitespace characters, which notably includes
+@code{#\space} and @code{#\tab}.
@end defvr
@defvr {Scheme Variable} char-set:iso-control
@defvrx {C Variable} scm_char_set_iso_control
-The ISO control characters with the codes 0--31 and 127.
+The ISO control characters are the C0 control characters (U+0000 to
+U+001F), delete (U+007F), and the C1 control characters (U+0080 to
+U+009F).
@end defvr
@defvr {Scheme Variable} char-set:punctuation
@defvrx {C Variable} scm_char_set_punctuation
-The characters @code{!"#%&'()*,-./:;?@@[\\]_@{@}}
+All punctuation characters, such as the characters
+@code{!"#%&'()*,-./:;?@@[\\]_@{@}}.
@end defvr
@defvr {Scheme Variable} char-set:symbol
@defvrx {C Variable} scm_char_set_symbol
-The characters @code{$+<=>^`|~}.
+All symbol characters, such as the characters @code{$+<=>^`|~}.
@end defvr
@defvr {Scheme Variable} char-set:hex-digit
The empty character set.
@end defvr
+@defvr {Scheme Variable} char-set:designated
+@defvrx {C Variable} scm_char_set_designated
+This character set contains all designated code points. This includes
+all the code points to which Unicode has assigned a character or other
+meaning.
+@end defvr
+
@defvr {Scheme Variable} char-set:full
@defvrx {C Variable} scm_char_set_full
-This character set contains all possible characters.
+This character set contains all possible code points. This includes
+both designated and reserved code points.
@end defvr
@node Strings
When one of these two strings is modified, as with @code{string-set!},
their common memory does get copied so that each string has its own
-memory and modifying one does not accidently modify the other as well.
+memory and modifying one does not accidentally modify the other as well.
Thus, Guile's strings are `copy on write'; the actual copying of their
memory is delayed until one string is written to.
* Reversing and Appending Strings:: Appending strings to form a new string.
* Mapping Folding and Unfolding:: Iterating over strings.
* Miscellaneous String Operations:: Replicating, insertion, parsing, ...
-* Conversion to/from C::
+* Conversion to/from C::
+* String Internals:: The storage strategy for strings.
@end menu
@node String Syntax
The read syntax for strings is an arbitrarily long sequence of
characters enclosed in double quotes (@nicode{"}).
-Backslash is an escape character and can be used to insert the
-following special characters. @nicode{\"} and @nicode{\\} are R5RS
-standard, the rest are Guile extensions, notice they follow C string
-syntax.
+Backslash is an escape character and can be used to insert the following
+special characters. @nicode{\"} and @nicode{\\} are R5RS standard, the
+next seven are R6RS standard --- notice they follow C syntax --- and the
+remaining four are Guile extensions.
@table @asis
@item @nicode{\\}
Double quote character (an unescaped @nicode{"} is otherwise the end
of the string).
-@item @nicode{\0}
-NUL character (ASCII 0).
-
@item @nicode{\a}
Bell character (ASCII 7).
@item @nicode{\v}
Vertical tab character (ASCII 11).
+@item @nicode{\b}
+Backspace character (ASCII 8).
+
+@item @nicode{\0}
+NUL character (ASCII 0).
+
+@item @nicode{\} followed by newline (ASCII 10)
+Nothing. This way if @nicode{\} is the last character in a line, the
+string will continue with the first character from the next line,
+without a line break.
+
+If the @code{hungry-eol-escapes} reader option is enabled, which is not
+the case by default, leading whitespace on the next line is discarded.
+
+@lisp
+"foo\
+ bar"
+@result{} "foo bar"
+(read-enable 'hungry-eol-escapes)
+"foo\
+ bar"
+@result{} "foobar"
+@end lisp
@item @nicode{\xHH}
Character code given by two hexadecimal digits. For example
@nicode{\x7f} for an ASCII DEL (127).
+
+@item @nicode{\uHHHH}
+Character code given by four hexadecimal digits. For example
+@nicode{\u0100} for a capital A with macron (U+0100).
+
+@item @nicode{\UHHHHHH}
+Character code given by six hexadecimal digits. For example
+@nicode{\U010402}.
@end table
@noindent
"\"Hi\", he said."
@end lisp
+The three escape sequences @code{\xHH}, @code{\uHHHH} and @code{\UHHHHHH} were
+chosen to not break compatibility with code written for previous versions of
+Guile. The R6RS specification suggests a different, incompatible syntax for hex
+escapes: @code{\xHHHH;}, that is, @code{\x} followed by one to eight hexadecimal
+digits and terminated with a semicolon. If this escape format is desired instead,
+it can be enabled with the reader option @code{r6rs-hex-escapes}.
+
+@lisp
+(read-enable 'r6rs-hex-escapes)
+@end lisp
+
+For more on reader options, see @ref{Scheme Read}.
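+
+As a small illustration of the default escapes described above (that
+is, with @code{r6rs-hex-escapes} left disabled), each escape in this
+sketch denotes exactly one character:
+
+@lisp
+(string-length "\x41\u0100\U010402")        @result{} 3
+(char->integer (string-ref "\u0100" 0))     @result{} 256
+(char->integer (string-ref "\U010402" 0))   @result{} 66562
+@end lisp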
@node String Predicates
@subsubsection String Predicates
@deffnx {C Function} scm_string_trim (s, char_pred, start, end)
@deffnx {C Function} scm_string_trim_right (s, char_pred, start, end)
@deffnx {C Function} scm_string_trim_both (s, char_pred, start, end)
-Trim occurrances of @var{char_pred} from the ends of @var{s}.
+Trim occurrences of @var{char_pred} from the ends of @var{s}.
@code{string-trim} trims @var{char_pred} characters from the left
(start) of the string, @code{string-trim-right} trims them from the
predicates (@pxref{Characters}), but are defined on character sequences.
The first set is specified in R5RS and has names that end in @code{?}.
-The second set is specified in SRFI-13 and the names have no ending
-@code{?}. The predicates ending in @code{-ci} ignore the character case
-when comparing strings. @xref{The ice-9 i18n Module, the @code{(ice-9
+The second set is specified in SRFI-13 and the names have no ending
+@code{?}.
+
+The predicates ending in @code{-ci} ignore the character case
+when comparing strings. For now, case-insensitive comparison is done
+using the R5RS rules, where every lower-case character that has a
+single character upper-case form is converted to uppercase before
+comparison. @xref{Text Collation, the @code{(ice-9
i18n)} module}, for locale-dependent string comparison.
@rnindex string=?
-@deffn {Scheme Procedure} string=? s1 s2
+@deffn {Scheme Procedure} string=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_equal_p (s1, s2, rest)
Lexicographic equality predicate; return @code{#t} if the two
strings are the same length and contain the same characters in
the same positions, otherwise return @code{#f}.
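+
+Since this predicate and the ordering predicates below accept more than
+two arguments, a chain of strings can be tested in one call. A brief
+sketch:
+
+@lisp
+(string=? "abc" "abc" "abc")           @result{} #t
+(string<? "apple" "banana" "cherry")   @result{} #t
+(string<? "apple" "cherry" "banana")   @result{} #f
+@end lisp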
@end deffn
@rnindex string<?
-@deffn {Scheme Procedure} string<? s1 s2
+@deffn {Scheme Procedure} string<? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_less_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically less than @var{s2}.
@end deffn
@rnindex string<=?
-@deffn {Scheme Procedure} string<=? s1 s2
+@deffn {Scheme Procedure} string<=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_leq_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically less than or equal to @var{s2}.
@end deffn
@rnindex string>?
-@deffn {Scheme Procedure} string>? s1 s2
+@deffn {Scheme Procedure} string>? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_gr_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically greater than @var{s2}.
@end deffn
@rnindex string>=?
-@deffn {Scheme Procedure} string>=? s1 s2
+@deffn {Scheme Procedure} string>=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_geq_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically greater than or equal to @var{s2}.
@end deffn
@rnindex string-ci=?
-@deffn {Scheme Procedure} string-ci=? s1 s2
+@deffn {Scheme Procedure} string-ci=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_equal_p (s1, s2, rest)
Case-insensitive string equality predicate; return @code{#t} if
the two strings are the same length and their component
characters match (ignoring case) at each position; otherwise
@end deffn
@rnindex string-ci<?
-@deffn {Scheme Procedure} string-ci<? s1 s2
+@deffn {Scheme Procedure} string-ci<? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_less_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically less than @var{s2}
regardless of case.
@end deffn
@rnindex string<=?
-@deffn {Scheme Procedure} string-ci<=? s1 s2
+@deffn {Scheme Procedure} string-ci<=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_leq_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically less than or equal
to @var{s2} regardless of case.
@end deffn
@rnindex string-ci>?
-@deffn {Scheme Procedure} string-ci>? s1 s2
+@deffn {Scheme Procedure} string-ci>? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_gr_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically greater than
@var{s2} regardless of case.
@end deffn
@rnindex string-ci>=?
-@deffn {Scheme Procedure} string-ci>=? s1 s2
+@deffn {Scheme Procedure} string-ci>=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_geq_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically greater than or
equal to @var{s2} regardless of case.
equal to, or greater than @var{s2}. The mismatch index is the
largest index @var{i} such that for every 0 <= @var{j} <
@var{i}, @var{s1}[@var{j}] = @var{s2}[@var{j}] -- that is,
-@var{i} is the first position that does not match. The
-character comparison is done case-insensitively.
+@var{i} is the first position where the lowercased letters
+do not match.
+
@end deffn
@deffn {Scheme Procedure} string= s1 s2 [start1 [end1 [start2 [end2]]]]
Compute a hash value for @var{S}. the optional argument @var{bound} is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,bound).
@end deffn
+Because the same visual appearance of an abstract Unicode character can
+be obtained via multiple sequences of Unicode characters, even the
+case-insensitive string comparison functions described above may return
+@code{#f} when presented with strings containing different
+representations of the same character. For example, the Unicode
+character ``LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE'' can be
+represented with a single character (U+1E69) or by the character ``LATIN
+SMALL LETTER S'' (U+0073) followed by the combining marks ``COMBINING
+DOT BELOW'' (U+0323) and ``COMBINING DOT ABOVE'' (U+0307).
+
+For this reason, it is often desirable to ensure that the strings
+to be compared are using a mutually consistent representation for every
+character. The Unicode standard defines two methods of normalizing the
+contents of strings: Decomposition, which breaks composite characters
+into a set of constituent characters with an ordering defined by the
+Unicode Standard; and composition, which performs the converse.
+
+There are two decomposition operations. ``Canonical decomposition''
+produces character sequences that share the same visual appearance as
+the original characters, while ``compatibility decomposition'' produces
+ones whose visual appearances may differ from the originals but which
+represent the same abstract character.
+
+These operations are encapsulated in the following set of normalization
+forms:
+
+@table @dfn
+@item NFD
+Characters are decomposed to their canonical forms.
+
+@item NFKD
+Characters are decomposed to their compatibility forms.
+
+@item NFC
+Characters are decomposed to their canonical forms, then composed.
+
+@item NFKC
+Characters are decomposed to their compatibility forms, then composed.
+
+@end table
+
+The functions below put their arguments into one of the forms described
+above.
+
+@deffn {Scheme Procedure} string-normalize-nfd s
+@deffnx {C Function} scm_string_normalize_nfd (s)
+Return the @code{NFD} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfkd s
+@deffnx {C Function} scm_string_normalize_nfkd (s)
+Return the @code{NFKD} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfc s
+@deffnx {C Function} scm_string_normalize_nfc (s)
+Return the @code{NFC} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfkc s
+@deffnx {C Function} scm_string_normalize_nfkc (s)
+Return the @code{NFKC} normalized form of @var{s}.
+@end deffn
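+
+The U+1E69 example given earlier can be sketched as follows. The two
+spellings of the same abstract character compare unequal until both are
+put into the same normalization form. The names @code{composed} and
+@code{decomposed} are only illustrative.
+
+@lisp
+(define composed "\u1e69")
+(define decomposed "s\u0323\u0307")
+(string=? composed decomposed)                     @result{} #f
+(string=? (string-normalize-nfd composed)
+          (string-normalize-nfd decomposed))       @result{} #t
+(string-length (string-normalize-nfc decomposed))  @result{} 1
+@end lisp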
+
@node String Searching
@subsubsection String Searching
@deffn {Scheme Procedure} string-index s char_pred [start [end]]
@deffnx {C Function} scm_string_index (s, char_pred, start, end)
Search through the string @var{s} from left to right, returning
-the index of the first occurence of a character which
+the index of the first occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure,
+satisfies the predicate @var{char_pred}, if it is a procedure,
@item
is in the set @var{char_pred}, if it is a character set.
@end itemize
+
+Return @code{#f} if no match is found.
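+
+A short sketch of both outcomes:
+
+@lisp
+(string-index "hello world" #\o)            @result{} 4
+(string-index "hello world" char-numeric?)  @result{} #f
+@end lisp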
@end deffn
@deffn {Scheme Procedure} string-rindex s char_pred [start [end]]
@deffnx {C Function} scm_string_rindex (s, char_pred, start, end)
Search through the string @var{s} from right to left, returning
-the index of the last occurence of a character which
+the index of the last occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure,
+satisfies the predicate @var{char_pred}, if it is a procedure,
@item
is in the set if @var{char_pred} is a character set.
@end itemize
+
+Return @code{#f} if no match is found.
@end deffn
@deffn {Scheme Procedure} string-prefix-length s1 s2 [start1 [end1 [start2 [end2]]]]
@deffn {Scheme Procedure} string-index-right s char_pred [start [end]]
@deffnx {C Function} scm_string_index_right (s, char_pred, start, end)
Search through the string @var{s} from right to left, returning
-the index of the last occurence of a character which
+the index of the last occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure,
+satisfies the predicate @var{char_pred}, if it is a procedure,
@item
is in the set if @var{char_pred} is a character set.
@end itemize
+
+Return @code{#f} if no match is found.
@end deffn
@deffn {Scheme Procedure} string-skip s char_pred [start [end]]
@deffnx {C Function} scm_string_skip (s, char_pred, start, end)
Search through the string @var{s} from left to right, returning
-the index of the first occurence of a character which
+the index of the first occurrence of a character which
@itemize @bullet
@item
does not equal @var{char_pred}, if it is character,
@item
-does not satisify the predicate @var{char_pred}, if it is a
+does not satisfy the predicate @var{char_pred}, if it is a
procedure,
@item
@deffn {Scheme Procedure} string-skip-right s char_pred [start [end]]
@deffnx {C Function} scm_string_skip_right (s, char_pred, start, end)
Search through the string @var{s} from right to left, returning
-the index of the last occurence of a character which
+the index of the last occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure.
+satisfies the predicate @var{char_pred}, if it is a procedure.
@item
is in the set @var{char_pred}, if it is a character set.
These are procedures for mapping strings to their upper- or lower-case
equivalents, respectively, or for capitalizing strings.
+They use the basic case mapping rules for Unicode characters. No
+special language or context rules are considered. The resulting strings
+are guaranteed to be the same length as the input strings.
+
+@xref{Character Case Mapping, the @code{(ice-9
+i18n)} module}, for locale-dependent case conversions.
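+
+As a brief sketch of the length guarantee, consider a string containing
+U+00DF (written here as @code{\xdf}), whose uppercase form under full
+Unicode case mapping would be two characters long:
+
+@lisp
+(string-upcase "guile")                      @result{} "GUILE"
+(string-length "stra\xdfe")                  @result{} 6
+(string-length (string-upcase "stra\xdfe"))  @result{} 6
+@end lisp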
+
@deffn {Scheme Procedure} string-upcase str [start [end]]
@deffnx {C Function} scm_substring_upcase (str, start, end)
@deffnx {C Function} scm_string_upcase (str)
@end example
@end deffn
-@deffn {Scheme Procedure} string-append/shared . ls
-@deffnx {C Function} scm_string_append_shared (ls)
+@deffn {Scheme Procedure} string-append/shared . rest
+@deffnx {C Function} scm_string_append_shared (rest)
Like @code{string-append}, but the result may share memory
with the argument strings.
@end deffn
@deffnx {C Function} scm_string_concatenate_reverse (ls, final_string, end)
Without optional arguments, this procedure is equivalent to
-@smalllisp
+@lisp
(string-concatenate (reverse ls))
-@end smalllisp
+@end lisp
If the optional argument @var{final_string} is specified, it is
consed onto the beginning to @var{ls} before performing the
@deffn {Scheme Procedure} string-concatenate-reverse/shared ls [final_string [end]]
@deffnx {C Function} scm_string_concatenate_reverse_shared (ls, final_string, end)
Like @code{string-concatenate-reverse}, but the result may
-share memory with the the strings in the @var{ls} arguments.
+share memory with the strings in the @var{ls} arguments.
@end deffn
@node Mapping Folding and Unfolding
@example
(define str (string-copy "studly"))
-(string-for-each-index (lambda (i)
- (string-set! str i
- ((if (even? i) char-upcase char-downcase)
- (string-ref str i))))
- str)
+(string-for-each-index
+ (lambda (i)
+ (string-set! str i
+ ((if (even? i) char-upcase char-downcase)
+ (string-ref str i))))
+ str)
str @result{} "StUdLy"
@end example
@end deffn
@item @var{make_final} is applied to the terminal seed
value (on which @var{p} returns true) to produce
the final/rightmost portion of the constructed string.
-It defaults to @code{(lambda (x) )}.
+The default is nothing extra.
@end itemize
@end deffn
of @var{s}.
@end deffn
-@deffn {Scheme Procedure} string-filter s char_pred [start [end]]
-@deffnx {C Function} scm_string_filter (s, char_pred, start, end)
+@deffn {Scheme Procedure} string-filter char_pred s [start [end]]
+@deffnx {C Function} scm_string_filter (char_pred, s, start, end)
Filter the string @var{s}, retaining only those characters which
satisfy @var{char_pred}.
is a character set, it is tested for membership.
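+
+With the argument order shown above, which @code{string-delete} shares,
+a brief sketch:
+
+@lisp
+(string-filter char-alphabetic? "a1b2c3") @result{} "abc"
+(string-delete char-numeric? "a1b2c3")    @result{} "abc"
+@end lisp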
@end deffn
-@deffn {Scheme Procedure} string-delete s char_pred [start [end]]
-@deffnx {C Function} scm_string_delete (s, char_pred, start, end)
+@deffn {Scheme Procedure} string-delete char_pred s [start [end]]
+@deffnx {C Function} scm_string_delete (char_pred, s, start, end)
Delete characters satisfying @var{char_pred} from @var{s}.
If @var{char_pred} is a procedure, it is applied to each character as
not an issue (most of the time), since in Scheme you never get to see
the bytes, only the characters.
-Well, ideally, anyway. Right now, Guile simply equates Scheme
-characters and bytes, ignoring the possibility of multi-byte encodings
-completely. This will change in the future, where Guile will use
-Unicode codepoints as its characters and UTF-8 or some other encoding
-as its internal encoding. When you exclusively use the functions
-listed in this section, you are `future-proof'.
+Converting to C and converting from C each have their own challenges.
+
+When converting from C to Scheme, it is important that the sequence of
+bytes in the C string be valid with respect to its encoding. ASCII
+strings, for example, can't have any bytes greater than 127. An ASCII
+byte greater than 127 is considered @emph{ill-formed} and cannot be
+converted into a Scheme character.
+
+Problems can occur in the reverse operation as well. Not all character
+encodings can hold all possible Scheme characters. Some encodings, like
+ASCII for example, can only describe a small subset of all possible
+characters. So, when converting to C, one must first decide what to do
+with Scheme characters that can't be represented in the C string.
Converting a Scheme string to a C string will often allocate fresh
memory to hold the result. You must take care that this memory is
@deftypefn {C Function} SCM scm_from_locale_string (const char *str)
@deftypefnx {C Function} SCM scm_from_locale_stringn (const char *str, size_t len)
-Creates a new Scheme string that has the same contents as @var{str}
-when interpreted in the current locale character encoding.
+Creates a new Scheme string that has the same contents as @var{str} when
+interpreted in the locale character encoding of the
+@code{current-input-port}.
For @code{scm_from_locale_string}, @var{str} must be null-terminated.
@var{str} in bytes, and @var{str} does not need to be null-terminated.
If @var{len} is @code{(size_t)-1}, then @var{str} does need to be
null-terminated and the real length will be found with @code{strlen}.
+
+If the C string is ill-formed, an error will be raised.
@end deftypefn
@deftypefn {C Function} SCM scm_take_locale_string (char *str)
@deftypefn {C Function} {char *} scm_to_locale_string (SCM str)
@deftypefnx {C Function} {char *} scm_to_locale_stringn (SCM str, size_t *lenp)
-Returns a C string in the current locale encoding with the same
-contents as @var{str}. The C string must be freed with @code{free}
-eventually, maybe by using @code{scm_dynwind_free}, @xref{Dynamic
-Wind}.
+Returns a C string with the same contents as @var{str} in the locale
+encoding of the @code{current-output-port}. The C string must be freed
+with @code{free} eventually, maybe by using @code{scm_dynwind_free}
+(@pxref{Dynamic Wind}).
For @code{scm_to_locale_string}, the returned string is
null-terminated and an error is signalled when @var{str} contains
returned string will not be null-terminated in this case. If
@var{lenp} is @code{NULL}, @code{scm_to_locale_stringn} behaves like
@code{scm_to_locale_string}.
+
+If a character in @var{str} cannot be represented in the locale encoding
+of the current output port, the port conversion strategy of the current
+output port will determine the result (@pxref{Ports}). If the output port's
+conversion strategy is @code{error}, an error will be raised. If it is
+@code{substitute}, a replacement character, such as a question mark, will
+be inserted in its place. If it is @code{escape}, a hex escape will be
+inserted in its place.
@end deftypefn
@deftypefn {C Function} size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)
stored and you probably need to try again with a larger buffer.
@end deftypefn
-@node Regular Expressions
-@subsection Regular Expressions
-@tpindex Regular expressions
-
-@cindex regular expressions
-@cindex regex
-@cindex emacs regexp
-
-A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
-describes a whole class of strings. A full description of regular
-expressions and their syntax is beyond the scope of this manual;
-an introduction can be found in the Emacs manual (@pxref{Regexps,
-, Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
-in many general Unix reference books.
-
-If your system does not include a POSIX regular expression library,
-and you have not linked Guile with a third-party regexp library such
-as Rx, these functions will not be available. You can tell whether
-your Guile installation includes regular expression support by
-checking whether @code{(provided? 'regex)} returns true.
-
-The following regexp and string matching features are provided by the
-@code{(ice-9 regex)} module. Before using the described functions,
-you should load this module by executing @code{(use-modules (ice-9
-regex))}.
-
-@menu
-* Regexp Functions:: Functions that create and match regexps.
-* Match Structures:: Finding what was matched by a regexp.
-* Backslash Escapes:: Removing the special meaning of regexp
- meta-characters.
-@end menu
-
-
-@node Regexp Functions
-@subsubsection Regexp Functions
+For most situations, string conversion should occur using the current
+locale, such as with the functions above. But there may be cases where
+one wants to convert strings from a character encoding other than the
+locale's character encoding. For these cases, the lower-level functions
+@code{scm_to_stringn} and @code{scm_from_stringn} are provided. These
+functions should seldom be necessary if one is properly using locales.
+
+@deftp {C Type} scm_t_string_failed_conversion_handler
+This is an enumerated type that can take one of three values:
+@code{SCM_FAILED_CONVERSION_ERROR},
+@code{SCM_FAILED_CONVERSION_QUESTION_MARK}, and
+@code{SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE}. They are used to indicate
+a strategy for handling characters that cannot be converted to or from a
+given character encoding. @code{SCM_FAILED_CONVERSION_ERROR} indicates
+that a conversion should throw an error if some characters cannot be
+converted. @code{SCM_FAILED_CONVERSION_QUESTION_MARK} indicates that a
+conversion should replace unconvertible characters with the question
+mark character. Finally, @code{SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE}
+requests that a conversion should replace an unconvertible character
+with an escape sequence.
+
+While all three strategies apply when converting Scheme strings to C,
+only @code{SCM_FAILED_CONVERSION_ERROR} and
+@code{SCM_FAILED_CONVERSION_QUESTION_MARK} can be used when converting C
+strings to Scheme.
+@end deftp
+
+@deftypefn {C Function} char *scm_to_stringn (SCM str, size_t *lenp, const char *encoding, scm_t_string_failed_conversion_handler handler)
+This function returns a newly allocated C string from the Guile string
+@var{str}. The length of the string will be returned in @var{lenp}.
+The character encoding of the C string is passed as the ASCII,
+null-terminated C string @var{encoding}. The @var{handler} parameter
+gives a strategy for dealing with characters that cannot be converted
+into @var{encoding}.
+
+If @var{lenp} is @code{NULL}, this function will return a null-terminated C
+string. It will throw an error if the string contains a null
+character.
+@end deftypefn
-By default, Guile supports POSIX extended regular expressions.
-That means that the characters @samp{(}, @samp{)}, @samp{+} and
-@samp{?} are special, and must be escaped if you wish to match the
-literal characters.
+@deftypefn {C Function} SCM scm_from_stringn (const char *str, size_t len, const char *encoding, scm_t_string_failed_conversion_handler handler)
+This function returns a Scheme string from the C string @var{str}. The
+length of the C string is input as @var{len}. The encoding of the C
+string is passed as the ASCII, null-terminated C string @var{encoding}.
+The @var{handler} parameter suggests a strategy for dealing with
+unconvertible characters.
+@end deftypefn
-This regular expression interface was modeled after that
-implemented by SCSH, the Scheme Shell. It is intended to be
-upwardly compatible with SCSH regular expressions.
+ISO-8859-1 is the most common 8-bit character encoding. This encoding
+is also referred to as the Latin-1 encoding. The following conversion
+functions are provided to convert between Guile strings and C strings
+encoded in Latin-1, UTF-8, or UTF-32.
+
+@deftypefn {C Function} SCM scm_from_latin1_stringn (const char *str, size_t len)
+@deftypefnx {C Function} SCM scm_from_utf8_stringn (const char *str, size_t len)
+@deftypefnx {C Function} SCM scm_from_utf32_stringn (const scm_t_wchar *str, size_t len)
+Return a Scheme string from C string @var{str}, which is ISO-8859-1-,
+UTF-8-, or UTF-32-encoded, of length @var{len}. @var{len} is the number
+of bytes pointed to by @var{str} for @code{scm_from_latin1_stringn} and
+@code{scm_from_utf8_stringn}; it is the number of elements (code points)
+in @var{str} in the case of @code{scm_from_utf32_stringn}.
+@end deftypefn
-Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
-strings, since the underlying C functions treat that as the end of
-string. If there's a zero byte an error is thrown.
+@deftypefn {C Function} char *scm_to_latin1_stringn (SCM str, size_t *lenp)
+@deftypefnx {C Function} char *scm_to_utf8_stringn (SCM str, size_t *lenp)
+@deftypefnx {C Function} scm_t_wchar *scm_to_utf32_stringn (SCM str, size_t *lenp)
+Return a newly allocated, ISO-8859-1-, UTF-8-, or UTF-32-encoded C string
+from Scheme string @var{str}. An error is thrown when @var{str}
+cannot be converted to the specified encoding. If @var{lenp} is
+@code{NULL}, the returned C string will be null terminated, and an error
+will be thrown if the C string would otherwise contain null
+characters. If @var{lenp} is not @code{NULL}, the length of the string is
+returned in @var{lenp}, and the string is not null terminated.
+@end deftypefn
-Patterns and input strings are treated as being in the locale
-character set if @code{setlocale} has been called (@pxref{Locales}),
-and in a multibyte locale this includes treating multi-byte sequences
-as a single character. (Guile strings are currently merely bytes,
-though this may change in the future, @xref{Conversion to/from C}.)
+@node String Internals
+@subsubsection String Internals
+
+Guile stores each string in memory as a contiguous array of Unicode code
+points along with an associated set of attributes. If all of the code
+points of a string have an integer value between 0 and 255 inclusive,
+the code point array is stored as one byte per code point: it is stored
+as an ISO-8859-1 (aka Latin-1) string. If any of the code points of the
+string has an integer value greater than 255, the code point array is
+stored as four bytes per code point: it is stored as a UTF-32 string.
+
+Conversion between the one-byte-per-code-point and
+four-bytes-per-code-point representations happens automatically as
+necessary.
+
+No API is provided to set the internal representation of strings;
+however, there is a pair of procedures available to query it. These are
+debugging procedures. Using them in production code is discouraged,
+since the details of Guile's internal representation of strings may
+change from release to release.
+
+@deffn {Scheme Procedure} string-bytes-per-char str
+@deffnx {C Function} scm_string_bytes_per_char (str)
+Return the number of bytes used to encode a Unicode code point in string
+@var{str}. The result is one or four.
+@end deffn
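+
+As a rough illustration (an assumption about current behavior, since
+the internal representation may change from release to release), a
+string holding only code points below 256 is expected to report one
+byte per character, while a string with any larger code point is
+expected to report four:
+
+@lisp
+;; Only Latin-1 code points: narrow storage.
+(string-bytes-per-char "abc")
+@result{} 1
+
+;; U+03BB (Greek small letter lambda) exceeds 255: wide storage.
+(string-bytes-per-char (string #\a (integer->char #x3bb)))
+@result{} 4
+@end lisp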
+
+@deffn {Scheme Procedure} %string-dump str
+@deffnx {C Function} scm_sys_string_dump (str)
+Returns an association list containing debugging information for
+@var{str}. The association list has the following entries.
+@table @code
-@deffn {Scheme Procedure} string-match pattern str [start]
-Compile the string @var{pattern} into a regular expression and compare
-it with @var{str}. The optional numeric argument @var{start} specifies
-the position of @var{str} at which to begin matching.
+@item string
+The string itself.
-@code{string-match} returns a @dfn{match structure} which
-describes what, if anything, was matched by the regular
-expression. @xref{Match Structures}. If @var{str} does not match
-@var{pattern} at all, @code{string-match} returns @code{#f}.
-@end deffn
+@item start
+The start index of the string into its stringbuf
-Two examples of a match follow. In the first example, the pattern
-matches the four digits in the match string. In the second, the pattern
-matches nothing.
+@item length
+The length of the string
-@example
-(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
-@result{} #("blah2002" (4 . 8))
+@item shared
+If this string is a substring, its parent string. Otherwise,
+@code{#f}.
-(string-match "[A-Za-z]" "123456")
-@result{} #f
-@end example
+@item read-only
+@code{#t} if the string is read-only
-Each time @code{string-match} is called, it must compile its
-@var{pattern} argument into a regular expression structure. This
-operation is expensive, which makes @code{string-match} inefficient if
-the same regular expression is used several times (for example, in a
-loop). For better performance, you can compile a regular expression in
-advance and then match strings against the compiled regexp.
-
-@deffn {Scheme Procedure} make-regexp pat flag@dots{}
-@deffnx {C Function} scm_make_regexp (pat, flaglst)
-Compile the regular expression described by @var{pat}, and
-return the compiled regexp structure. If @var{pat} does not
-describe a legal regular expression, @code{make-regexp} throws
-a @code{regular-expression-syntax} error.
-
-The @var{flag} arguments change the behavior of the compiled
-regular expression. The following values may be supplied:
-
-@defvar regexp/icase
-Consider uppercase and lowercase letters to be the same when
-matching.
-@end defvar
+@item stringbuf-chars
+A new string containing this string's stringbuf's characters
-@defvar regexp/newline
-If a newline appears in the target string, then permit the
-@samp{^} and @samp{$} operators to match immediately after or
-immediately before the newline, respectively. Also, the
-@samp{.} and @samp{[^...]} operators will never match a newline
-character. The intent of this flag is to treat the target
-string as a buffer containing many lines of text, and the
-regular expression as a pattern that may match a single one of
-those lines.
-@end defvar
+@item stringbuf-length
+The number of characters in this stringbuf
-@defvar regexp/basic
-Compile a basic (``obsolete'') regexp instead of the extended
-(``modern'') regexps that are the default. Basic regexps do
-not consider @samp{|}, @samp{+} or @samp{?} to be special
-characters, and require the @samp{@{...@}} and @samp{(...)}
-metacharacters to be backslash-escaped (@pxref{Backslash
-Escapes}). There are several other differences between basic
-and extended regular expressions, but these are the most
-significant.
-@end defvar
+@item stringbuf-shared
+@code{#t} if this stringbuf is shared
-@defvar regexp/extended
-Compile an extended regular expression rather than a basic
-regexp. This is the default behavior; this flag will not
-usually be needed. If a call to @code{make-regexp} includes
-both @code{regexp/basic} and @code{regexp/extended} flags, the
-one which comes last will override the earlier one.
-@end defvar
+@item stringbuf-wide
+@code{#t} if this stringbuf's characters are stored in a 32-bit buffer,
+or @code{#f} if they are stored in an 8-bit buffer
+@end table
@end deffn
-@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
-@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
-Match the compiled regular expression @var{rx} against
-@code{str}. If the optional integer @var{start} argument is
-provided, begin matching from that position in the string.
-Return a match structure describing the results of the match,
-or @code{#f} if no match could be found.
-
-The @var{flags} argument changes the matching behavior. The following
-flag values may be supplied, use @code{logior} (@pxref{Bitwise
-Operations}) to combine them,
-@defvar regexp/notbol
-Consider that the @var{start} offset into @var{str} is not the
-beginning of a line and should not match operator @samp{^}.
+@node Bytevectors
+@subsection Bytevectors
-If @var{rx} was created with the @code{regexp/newline} option above,
-@samp{^} will still match after a newline in @var{str}.
-@end defvar
+@cindex bytevector
+@cindex R6RS
-@defvar regexp/noteol
-Consider that the end of @var{str} is not the end of a line and should
-not match operator @samp{$}.
+A @dfn{bytevector} is a raw bit string. The @code{(rnrs bytevectors)}
+module provides the programming interface specified by the
+@uref{http://www.r6rs.org/, Revised^6 Report on the Algorithmic Language
+Scheme (R6RS)}. It contains procedures to manipulate bytevectors and
+interpret their contents in a number of ways: bytevector contents can be
+accessed as signed or unsigned integers of various sizes and endianness,
+as IEEE-754 floating point numbers, or as strings. It is a useful tool
+to encode and decode binary data.
-If @var{rx} was created with the @code{regexp/newline} option above,
-@samp{$} will still match before a newline in @var{str}.
-@end defvar
-@end deffn
+The R6RS (Section 4.3.4) specifies an external representation for
+bytevectors, whereby the octets (integers in the range 0--255) contained
+in the bytevector are represented as a list prefixed by @code{#vu8}:
@lisp
-;; Regexp to match uppercase letters
-(define r (make-regexp "[A-Z]*"))
-
-;; Regexp to match letters, ignoring case
-(define ri (make-regexp "[A-Z]*" regexp/icase))
+#vu8(1 53 204)
+@end lisp
-;; Search for bob using regexp r
-(match:substring (regexp-exec r "bob"))
-@result{} "" ; no match
+denotes a 3-byte bytevector containing the octets 1, 53, and 204. Like
+string literals, booleans, etc., bytevectors are ``self-quoting'', i.e.,
+they do not need to be quoted:
-;; Search for bob using regexp ri
-(match:substring (regexp-exec ri "Bob"))
-@result{} "Bob" ; matched case insensitive
+@lisp
+#vu8(1 53 204)
+@result{} #vu8(1 53 204)
@end lisp
-@deffn {Scheme Procedure} regexp? obj
-@deffnx {C Function} scm_regexp_p (obj)
-Return @code{#t} if @var{obj} is a compiled regular expression,
-or @code{#f} otherwise.
-@end deffn
+Bytevectors can be used with the binary input/output primitives of the
+R6RS (@pxref{R6RS I/O Ports}).
-@sp 1
-@deffn {Scheme Procedure} list-matches regexp str [flags]
-Return a list of match structures which are the non-overlapping
-matches of @var{regexp} in @var{str}. @var{regexp} can be either a
-pattern string or a compiled regexp. The @var{flags} argument is as
-per @code{regexp-exec} above.
+@menu
+* Bytevector Endianness:: Dealing with byte order.
+* Bytevector Manipulation:: Creating, copying, manipulating bytevectors.
+* Bytevectors as Integers:: Interpreting bytes as integers.
+* Bytevectors and Integer Lists:: Converting to/from an integer list.
+* Bytevectors as Floats:: Interpreting bytes as real numbers.
+* Bytevectors as Strings:: Interpreting bytes as Unicode strings.
+* Bytevectors as Generalized Vectors:: Guile extension to the bytevector API.
+* Bytevectors as Uniform Vectors:: Bytevectors and SRFI-4.
+@end menu
-@example
-(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
-@result{} ("abc" "def")
-@end example
-@end deffn
+@node Bytevector Endianness
+@subsubsection Endianness
-@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
-Apply @var{proc} to the non-overlapping matches of @var{regexp} in
-@var{str}, to build a result. @var{regexp} can be either a pattern
-string or a compiled regexp. The @var{flags} argument is as per
-@code{regexp-exec} above.
+@cindex endianness
+@cindex byte order
+@cindex word order
-@var{proc} is called as @code{(@var{proc} match prev)} where
-@var{match} is a match structure and @var{prev} is the previous return
-from @var{proc}. For the first call @var{prev} is the given
-@var{init} parameter. @code{fold-matches} returns the final value
-from @var{proc}.
+Some of the following procedures take an @var{endianness} parameter.
+The @dfn{endianness} is defined as the order of bytes in multi-byte
+numbers: numbers encoded in @dfn{big endian} have their most
+significant bytes written first, whereas numbers encoded in
+@dfn{little endian} have their least significant bytes
+first@footnote{Big-endian and little-endian are the most common
+``endiannesses'', but others do exist. For instance, the GNU MP
+library allows @dfn{word order} to be specified independently of
+@dfn{byte order} (@pxref{Integer Import and Export,,, gmp, The GNU
+Multiple Precision Arithmetic Library Manual}).}.
-For example to count matches,
+Little-endian is the native endianness of the IA32 architecture and
+its derivatives, while big-endian is native to SPARC and PowerPC,
+among others. The @code{native-endianness} procedure returns the
+native endianness of the machine it runs on.
-@example
-(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
- (lambda (match count)
- (1+ count)))
-@result{} 2
-@end example
+@deffn {Scheme Procedure} native-endianness
+@deffnx {C Function} scm_native_endianness ()
+Return a value denoting the native endianness of the host machine.
@end deffn
-@sp 1
-Regular expressions are commonly used to find patterns in one string
-and replace them with the contents of another string. The following
-functions are convenient ways to do this.
+@deffn {Scheme Macro} endianness symbol
+Return an object denoting the endianness specified by @var{symbol}. If
+@var{symbol} is neither @code{big} nor @code{little} then an error is
+raised at expand-time.
+@end deffn
-@c begin (scm-doc-string "regex.scm" "regexp-substitute")
-@deffn {Scheme Procedure} regexp-substitute port match [item@dots{}]
-Write to @var{port} selected parts of the match structure @var{match}.
-Or if @var{port} is @code{#f} then form a string from those parts and
-return that.
+@defvr {C Variable} scm_endianness_big
+@defvrx {C Variable} scm_endianness_little
+The objects denoting big- and little-endianness, respectively.
+@end defvr
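+
+For instance, a program might check the host byte order as sketched
+below. The helper name @code{host-little-endian?} is hypothetical, and
+comparing endianness objects with @code{eq?} assumes they are symbols,
+as the R6RS specifies:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+;; Hypothetical helper: true when the host stores the least
+;; significant byte first.
+(define (host-little-endian?)
+  (eq? (native-endianness) (endianness little)))
+
+(host-little-endian?)   ; #t on IA32 and its derivatives
+@end lisp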
-Each @var{item} specifies a part to be written, and may be one of the
-following,
-@itemize @bullet
-@item
-A string. String arguments are written out verbatim.
+@node Bytevector Manipulation
+@subsubsection Manipulating Bytevectors
-@item
-An integer. The submatch with that number is written
-(@code{match:substring}). Zero is the entire match.
+Bytevectors can be created, copied, and analyzed with the following
+procedures and C functions.
-@item
-The symbol @samp{pre}. The portion of the matched string preceding
-the regexp match is written (@code{match:prefix}).
+@deffn {Scheme Procedure} make-bytevector len [fill]
+@deffnx {C Function} scm_make_bytevector (len, fill)
+@deffnx {C Function} scm_c_make_bytevector (size_t len)
+Return a new bytevector of @var{len} bytes. Optionally, if @var{fill}
+is given, fill it with @var{fill}; @var{fill} must be in the range
+[-128,255].
+@end deffn
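+
+A short sketch of both forms. When @var{fill} is omitted, the R6RS
+leaves the initial contents unspecified, so the second call below only
+inspects the length:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(make-bytevector 3 7)
+@result{} #vu8(7 7 7)
+
+;; No fill given: only the length is known.
+(bytevector-length (make-bytevector 16))
+@result{} 16
+@end lisp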
-@item
-The symbol @samp{post}. The portion of the matched string following
-the regexp match is written (@code{match:suffix}).
-@end itemize
+@deffn {Scheme Procedure} bytevector? obj
+@deffnx {C Function} scm_bytevector_p (obj)
+Return true if @var{obj} is a bytevector.
+@end deffn
-For example, changing a match and retaining the text before and after,
+@deftypefn {C Function} int scm_is_bytevector (SCM obj)
+Equivalent to @code{scm_is_true (scm_bytevector_p (obj))}.
+@end deftypefn
-@example
-(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
- 'pre "37" 'post)
-@result{} "number 37 is good"
-@end example
+@deffn {Scheme Procedure} bytevector-length bv
+@deffnx {C Function} scm_bytevector_length (bv)
+Return the length in bytes of bytevector @var{bv}.
+@end deffn
-Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
-re-ordering and hyphenating the fields.
+@deftypefn {C Function} size_t scm_c_bytevector_length (SCM bv)
+Likewise, return the length in bytes of bytevector @var{bv}.
+@end deftypefn
-@lisp
-(define date-regex "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
-(define s "Date 20020429 12am.")
-(regexp-substitute #f (string-match date-regex s)
- 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
-@result{} "Date 04-29-2002 12am. (20020429)"
-@end lisp
+@deffn {Scheme Procedure} bytevector=? bv1 bv2
+@deffnx {C Function} scm_bytevector_eq_p (bv1, bv2)
+Return true if @var{bv1} equals @var{bv2}, i.e., if they have the same
+length and contents.
@end deffn
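+
+As a small illustration, two distinct bytevector objects compare equal
+when their lengths and contents match, and unequal otherwise:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(bytevector=? #vu8(1 2 3) (u8-list->bytevector '(1 2 3)))
+@result{} #t
+
+;; Different lengths, hence not equal.
+(bytevector=? #vu8(1 2 3) #vu8(1 2))
+@result{} #f
+@end lisp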
+@deffn {Scheme Procedure} bytevector-fill! bv fill
+@deffnx {C Function} scm_bytevector_fill_x (bv, fill)
+Fill bytevector @var{bv} with @var{fill}, a byte.
+@end deffn
-@c begin (scm-doc-string "regex.scm" "regexp-substitute")
-@deffn {Scheme Procedure} regexp-substitute/global port regexp target [item@dots{}]
-@cindex search and replace
-Write to @var{port} selected parts of matches of @var{regexp} in
-@var{target}. If @var{port} is @code{#f} then form a string from
-those parts and return that. @var{regexp} can be a string or a
-compiled regex.
+@deffn {Scheme Procedure} bytevector-copy! source source-start target target-start len
+@deffnx {C Function} scm_bytevector_copy_x (source, source_start, target, target_start, len)
+Copy @var{len} bytes from @var{source} into @var{target}, reading
+from @var{source-start} (a non-negative index within @var{source})
+and writing from @var{target-start} onward.
+@end deffn
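+
+The argument order is easy to get wrong, so here is a minimal sketch.
+It copies three bytes, starting at index 1 of the source, to index 2
+of a zero-filled target:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(let ((src #vu8(1 2 3 4 5))
+      (dst (make-bytevector 5 0)))
+  ;; source, source-start, target, target-start, len
+  (bytevector-copy! src 1 dst 2 3)
+  dst)
+@result{} #vu8(0 0 2 3 4)
+@end lisp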
-This is similar to @code{regexp-substitute}, but allows global
-substitutions on @var{target}. Each @var{item} behaves as per
-@code{regexp-substitute}, with the following differences,
+@deffn {Scheme Procedure} bytevector-copy bv
+@deffnx {C Function} scm_bytevector_copy (bv)
+Return a newly allocated copy of @var{bv}.
+@end deffn
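+
+The following sketch suggests that the copy does not share storage
+with the original: filling the original afterwards leaves the copy
+untouched.
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(let* ((a (make-bytevector 2 1))
+       (b (bytevector-copy a)))
+  (bytevector-fill! a 9)   ; mutate the original only
+  (list a b))
+@result{} (#vu8(9 9) #vu8(1 1))
+@end lisp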
-@itemize @bullet
-@item
-A function. Called as @code{(@var{item} match)} with the match
-structure for the @var{regexp} match, it should return a string to be
-written to @var{port}.
+@deftypefn {C Function} scm_t_uint8 scm_c_bytevector_ref (SCM bv, size_t index)
+Return the byte at @var{index} in bytevector @var{bv}.
+@end deftypefn
-@item
-The symbol @samp{post}. This doesn't output anything, but instead
-causes @code{regexp-substitute/global} to recurse on the unmatched
-portion of @var{target}.
+@deftypefn {C Function} void scm_c_bytevector_set_x (SCM bv, size_t index, scm_t_uint8 value)
+Set the byte at @var{index} in @var{bv} to @var{value}.
+@end deftypefn
-This @emph{must} be supplied to perform a global search and replace on
-@var{target}; without it @code{regexp-substitute/global} returns after
-a single match and output.
-@end itemize
+Low-level C macros are available. They do not perform any
+type-checking; as such they should be used with care.
-For example, to collapse runs of tabs and spaces to a single hyphen
-each,
+@deftypefn {C Macro} size_t SCM_BYTEVECTOR_LENGTH (bv)
+Return the length in bytes of bytevector @var{bv}.
+@end deftypefn
-@example
-(regexp-substitute/global #f "[ \t]+" "this is the text"
- 'pre "-" 'post)
-@result{} "this-is-the-text"
-@end example
+@deftypefn {C Macro} {signed char *} SCM_BYTEVECTOR_CONTENTS (bv)
+Return a pointer to the contents of bytevector @var{bv}.
+@end deftypefn
-Or using a function to reverse the letters in each word,
-@example
-(regexp-substitute/global #f "[a-z]+" "to do and not-do"
- 'pre (lambda (m) (string-reverse (match:substring m))) 'post)
-@result{} "ot od dna ton-od"
-@end example
+@node Bytevectors as Integers
+@subsubsection Interpreting Bytevector Contents as Integers
-Without the @code{post} symbol, just one regexp match is made. For
-example the following is the date example from
-@code{regexp-substitute} above, without the need for the separate
-@code{string-match} call.
+The contents of a bytevector can be interpreted as a sequence of
+integers of any given size, sign, and endianness.
@lisp
-(define date-regex "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
-(define s "Date 20020429 12am.")
-(regexp-substitute/global #f date-regex s
- 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
-
-@result{} "Date 04-29-2002 12am. (20020429)"
+(let ((bv (make-bytevector 4)))
+ (bytevector-u8-set! bv 0 #x12)
+ (bytevector-u8-set! bv 1 #x34)
+ (bytevector-u8-set! bv 2 #x56)
+ (bytevector-u8-set! bv 3 #x78)
+
+ (map (lambda (number)
+ (number->string number 16))
+ (list (bytevector-u8-ref bv 0)
+ (bytevector-u16-ref bv 0 (endianness big))
+ (bytevector-u32-ref bv 0 (endianness little)))))
+
+@result{} ("12" "1234" "78563412")
@end lisp
-@end deffn
-
-
-@node Match Structures
-@subsubsection Match Structures
-
-@cindex match structures
-
-A @dfn{match structure} is the object returned by @code{string-match} and
-@code{regexp-exec}. It describes which portion of a string, if any,
-matched the given regular expression. Match structures include: a
-reference to the string that was checked for matches; the starting and
-ending positions of the regexp match; and, if the regexp included any
-parenthesized subexpressions, the starting and ending positions of each
-submatch.
-In each of the regexp match functions described below, the @code{match}
-argument must be a match structure returned by a previous call to
-@code{string-match} or @code{regexp-exec}. Most of these functions
-return some information about the original target string that was
-matched against a regular expression; we will call that string
-@var{target} for easy reference.
-
-@c begin (scm-doc-string "regex.scm" "regexp-match?")
-@deffn {Scheme Procedure} regexp-match? obj
-Return @code{#t} if @var{obj} is a match structure returned by a
-previous call to @code{regexp-exec}, or @code{#f} otherwise.
-@end deffn
-
-@c begin (scm-doc-string "regex.scm" "match:substring")
-@deffn {Scheme Procedure} match:substring match [n]
-Return the portion of @var{target} matched by subexpression number
-@var{n}. Submatch 0 (the default) represents the entire regexp match.
-If the regular expression as a whole matched, but the subexpression
-number @var{n} did not match, return @code{#f}.
-@end deffn
+The most generic procedures to interpret bytevector contents as integers
+are described below.
+
+@deffn {Scheme Procedure} bytevector-uint-ref bv index endianness size
+@deffnx {Scheme Procedure} bytevector-sint-ref bv index endianness size
+@deffnx {C Function} scm_bytevector_uint_ref (bv, index, endianness, size)
+@deffnx {C Function} scm_bytevector_sint_ref (bv, index, endianness, size)
+Return the @var{size}-byte long unsigned (resp. signed) integer at
+index @var{index} in @var{bv}, decoded according to @var{endianness}.
+@end deffn
+
+@deffn {Scheme Procedure} bytevector-uint-set! bv index value endianness size
+@deffnx {Scheme Procedure} bytevector-sint-set! bv index value endianness size
+@deffnx {C Function} scm_bytevector_uint_set_x (bv, index, value, endianness, size)
+@deffnx {C Function} scm_bytevector_sint_set_x (bv, index, value, endianness, size)
+Set the @var{size}-byte long unsigned (resp. signed) integer at
+@var{index} to @var{value}, encoded according to @var{endianness}.
+@end deffn
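+
+A minimal sketch of the setter, writing a 2-byte big-endian integer
+into the middle of a zero-filled bytevector:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(let ((bv (make-bytevector 4 0)))
+  ;; #xabcd occupies indices 1 and 2, most significant byte first.
+  (bytevector-uint-set! bv 1 #xabcd (endianness big) 2)
+  bv)
+@result{} #vu8(0 171 205 0)
+@end lisp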
+
+The following procedures are similar to the ones above, but specialized
+to a given integer size:
+
+@deffn {Scheme Procedure} bytevector-u8-ref bv index
+@deffnx {Scheme Procedure} bytevector-s8-ref bv index
+@deffnx {Scheme Procedure} bytevector-u16-ref bv index endianness
+@deffnx {Scheme Procedure} bytevector-s16-ref bv index endianness
+@deffnx {Scheme Procedure} bytevector-u32-ref bv index endianness
+@deffnx {Scheme Procedure} bytevector-s32-ref bv index endianness
+@deffnx {Scheme Procedure} bytevector-u64-ref bv index endianness
+@deffnx {Scheme Procedure} bytevector-s64-ref bv index endianness
+@deffnx {C Function} scm_bytevector_u8_ref (bv, index)
+@deffnx {C Function} scm_bytevector_s8_ref (bv, index)
+@deffnx {C Function} scm_bytevector_u16_ref (bv, index, endianness)
+@deffnx {C Function} scm_bytevector_s16_ref (bv, index, endianness)
+@deffnx {C Function} scm_bytevector_u32_ref (bv, index, endianness)
+@deffnx {C Function} scm_bytevector_s32_ref (bv, index, endianness)
+@deffnx {C Function} scm_bytevector_u64_ref (bv, index, endianness)
+@deffnx {C Function} scm_bytevector_s64_ref (bv, index, endianness)
+Return the unsigned (resp. signed) @var{n}-bit integer (where @var{n} is 8,
+16, 32 or 64) from @var{bv} at @var{index}, decoded according to
+@var{endianness}.
+@end deffn
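+
+The difference between the signed and unsigned accessors, and the
+effect of @var{endianness}, can be sketched as follows:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(bytevector-u8-ref #vu8(255) 0)
+@result{} 255
+
+(bytevector-s8-ref #vu8(255) 0)
+@result{} -1
+
+;; Bytes #xff #xfe read as #xfffe, i.e., -2 in two's complement.
+(bytevector-s16-ref #vu8(255 254) 0 (endianness big))
+@result{} -2
+
+;; The same bytes read as #xfeff in little-endian order.
+(bytevector-u16-ref #vu8(255 254) 0 (endianness little))
+@result{} 65279
+@end lisp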
+
+@deffn {Scheme Procedure} bytevector-u8-set! bv index value
+@deffnx {Scheme Procedure} bytevector-s8-set! bv index value
+@deffnx {Scheme Procedure} bytevector-u16-set! bv index value endianness
+@deffnx {Scheme Procedure} bytevector-s16-set! bv index value endianness
+@deffnx {Scheme Procedure} bytevector-u32-set! bv index value endianness
+@deffnx {Scheme Procedure} bytevector-s32-set! bv index value endianness
+@deffnx {Scheme Procedure} bytevector-u64-set! bv index value endianness
+@deffnx {Scheme Procedure} bytevector-s64-set! bv index value endianness
+@deffnx {C Function} scm_bytevector_u8_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_s8_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_u16_set_x (bv, index, value, endianness)
+@deffnx {C Function} scm_bytevector_s16_set_x (bv, index, value, endianness)
+@deffnx {C Function} scm_bytevector_u32_set_x (bv, index, value, endianness)
+@deffnx {C Function} scm_bytevector_s32_set_x (bv, index, value, endianness)
+@deffnx {C Function} scm_bytevector_u64_set_x (bv, index, value, endianness)
+@deffnx {C Function} scm_bytevector_s64_set_x (bv, index, value, endianness)
+Store @var{value} as an unsigned (resp. signed) @var{n}-bit integer (where
+@var{n} is 8, 16, 32 or 64) in @var{bv} at @var{index}, encoded according to
+@var{endianness}.
+@end deffn
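+
+As an illustration, storing a negative 16-bit integer in little-endian
+order yields its two's-complement bytes, least significant first:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(let ((bv (make-bytevector 2 0)))
+  ;; -2 is #xfffe as a 16-bit two's-complement value.
+  (bytevector-s16-set! bv 0 -2 (endianness little))
+  (bytevector->u8-list bv))
+@result{} (254 255)
+@end lisp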
+
+Finally, a variant specialized for the host's endianness is available
+for each of these functions (with the exception of the @code{u8}
+accessors, for obvious reasons):
+
+@deffn {Scheme Procedure} bytevector-u16-native-ref bv index
+@deffnx {Scheme Procedure} bytevector-s16-native-ref bv index
+@deffnx {Scheme Procedure} bytevector-u32-native-ref bv index
+@deffnx {Scheme Procedure} bytevector-s32-native-ref bv index
+@deffnx {Scheme Procedure} bytevector-u64-native-ref bv index
+@deffnx {Scheme Procedure} bytevector-s64-native-ref bv index
+@deffnx {C Function} scm_bytevector_u16_native_ref (bv, index)
+@deffnx {C Function} scm_bytevector_s16_native_ref (bv, index)
+@deffnx {C Function} scm_bytevector_u32_native_ref (bv, index)
+@deffnx {C Function} scm_bytevector_s32_native_ref (bv, index)
+@deffnx {C Function} scm_bytevector_u64_native_ref (bv, index)
+@deffnx {C Function} scm_bytevector_s64_native_ref (bv, index)
+Return the unsigned (resp. signed) @var{n}-bit integer (where @var{n} is
+16, 32 or 64) from @var{bv} at @var{index}, decoded according to the
+host's native endianness.
+@end deffn
+
+@deffn {Scheme Procedure} bytevector-u16-native-set! bv index value
+@deffnx {Scheme Procedure} bytevector-s16-native-set! bv index value
+@deffnx {Scheme Procedure} bytevector-u32-native-set! bv index value
+@deffnx {Scheme Procedure} bytevector-s32-native-set! bv index value
+@deffnx {Scheme Procedure} bytevector-u64-native-set! bv index value
+@deffnx {Scheme Procedure} bytevector-s64-native-set! bv index value
+@deffnx {C Function} scm_bytevector_u16_native_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_s16_native_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_u32_native_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_s32_native_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_u64_native_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_s64_native_set_x (bv, index, value)
+Store @var{value} as an unsigned (resp. signed) @var{n}-bit integer (where
+@var{n} is 16, 32 or 64) in @var{bv} at @var{index}, encoded according to
+the host's native endianness.
+@end deffn
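+
+Because the resulting byte layout depends on the host, the sketch
+below avoids showing raw bytes. It only checks that the native
+accessors agree with the generic ones when these are given the value
+of @code{(native-endianness)}:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(let ((bv (make-bytevector 4 0)))
+  (bytevector-u32-native-set! bv 0 #xdeadbeef)
+  (= (bytevector-u32-native-ref bv 0)
+     (bytevector-u32-ref bv 0 (native-endianness))))
+@result{} #t
+@end lisp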
+
+
+@node Bytevectors and Integer Lists
+@subsubsection Converting Bytevectors to/from Integer Lists
+
+Bytevector contents can readily be converted to/from lists of signed or
+unsigned integers:
@lisp
-(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
-(match:substring s)
-@result{} "2002"
-
-;; match starting at offset 6 in the string
-(match:substring
- (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
-@result{} "7654"
+(bytevector->sint-list (u8-list->bytevector (make-list 4 255))
+ (endianness little) 2)
+@result{} (-1 -1)
@end lisp
-@c begin (scm-doc-string "regex.scm" "match:start")
-@deffn {Scheme Procedure} match:start match [n]
-Return the starting position of submatch number @var{n}.
+@deffn {Scheme Procedure} bytevector->u8-list bv
+@deffnx {C Function} scm_bytevector_to_u8_list (bv)
+Return a newly allocated list of unsigned 8-bit integers from the
+contents of @var{bv}.
@end deffn
-In the following example, the result is 4, since the match starts at
-character index 4:
+@deffn {Scheme Procedure} u8-list->bytevector lst
+@deffnx {C Function} scm_u8_list_to_bytevector (lst)
+Return a newly allocated bytevector consisting of the unsigned 8-bit
+integers listed in @var{lst}.
+@end deffn
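+
+The two procedures are inverses of each other, as this short sketch
+suggests:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(bytevector->u8-list #vu8(1 53 204))
+@result{} (1 53 204)
+
+(u8-list->bytevector '(1 53 204))
+@result{} #vu8(1 53 204)
+@end lisp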
-@lisp
-(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
-(match:start s)
-@result{} 4
-@end lisp
+@deffn {Scheme Procedure} bytevector->uint-list bv endianness size
+@deffnx {Scheme Procedure} bytevector->sint-list bv endianness size
+@deffnx {C Function} scm_bytevector_to_uint_list (bv, endianness, size)
+@deffnx {C Function} scm_bytevector_to_sint_list (bv, endianness, size)
+Return a list of unsigned (resp. signed) integers of @var{size} bytes
+representing the contents of @var{bv}, decoded according to
+@var{endianness}.
+@end deffn
-@c begin (scm-doc-string "regex.scm" "match:end")
-@deffn {Scheme Procedure} match:end match [n]
-Return the ending position of submatch number @var{n}.
+@deffn {Scheme Procedure} uint-list->bytevector lst endianness size
+@deffnx {Scheme Procedure} sint-list->bytevector lst endianness size
+@deffnx {C Function} scm_uint_list_to_bytevector (lst, endianness, size)
+@deffnx {C Function} scm_sint_list_to_bytevector (lst, endianness, size)
+Return a new bytevector containing the unsigned (resp. signed) integers
+listed in @var{lst} and encoded on @var{size} bytes according to
+@var{endianness}.
@end deffn
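+
+For example, encoding a list on 2 bytes per element in big-endian
+order and decoding the result in little-endian order swaps the bytes
+of each element. This is a sketch with arbitrary values:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+(uint-list->bytevector '(1 2) (endianness big) 2)
+@result{} #vu8(0 1 0 2)
+
+;; Each pair of bytes is now read least significant byte first.
+(bytevector->uint-list #vu8(0 1 0 2) (endianness little) 2)
+@result{} (256 512)
+@end lisp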
-In the following example, the result is 8, since the match runs between
-characters 4 and 8 (i.e. the ``2002'').
+@node Bytevectors as Floats
+@subsubsection Interpreting Bytevector Contents as Floating Point Numbers
-@lisp
-(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
-(match:end s)
-@result{} 8
-@end lisp
+@cindex IEEE-754 floating point numbers
-@c begin (scm-doc-string "regex.scm" "match:prefix")
-@deffn {Scheme Procedure} match:prefix match
-Return the unmatched portion of @var{target} preceding the regexp match.
+Bytevector contents can also be accessed as IEEE-754 single- or
+double-precision floating point numbers (32 and 64 bits long,
+respectively) using the procedures described here.
-@lisp
-(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
-(match:prefix s)
-@result{} "blah"
-@end lisp
+@deffn {Scheme Procedure} bytevector-ieee-single-ref bv index endianness
+@deffnx {Scheme Procedure} bytevector-ieee-double-ref bv index endianness
+@deffnx {C Function} scm_bytevector_ieee_single_ref (bv, index, endianness)
+@deffnx {C Function} scm_bytevector_ieee_double_ref (bv, index, endianness)
+Return the IEEE-754 single- (resp. double-) precision floating point
+number from @var{bv} at @var{index} according to @var{endianness}.
@end deffn
-@c begin (scm-doc-string "regex.scm" "match:suffix")
-@deffn {Scheme Procedure} match:suffix match
-Return the unmatched portion of @var{target} following the regexp match.
+@deffn {Scheme Procedure} bytevector-ieee-single-set! bv index value endianness
+@deffnx {Scheme Procedure} bytevector-ieee-double-set! bv index value endianness
+@deffnx {C Function} scm_bytevector_ieee_single_set_x (bv, index, value, endianness)
+@deffnx {C Function} scm_bytevector_ieee_double_set_x (bv, index, value, endianness)
+Store real number @var{value} in @var{bv} at @var{index} according to
+@var{endianness}.
@end deffn
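+
+For instance, the sketch below (assuming the @code{(rnrs bytevectors)}
+module is loaded) stores @code{1.0} as a big-endian double and reads it
+back. The bytes shown follow from the IEEE-754 double-precision
+encoding of @code{1.0}, @code{#x3FF0000000000000}:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+;; An 8-byte bytevector, large enough for one double.
+(define bv (make-bytevector 8 0))
+
+(bytevector-ieee-double-set! bv 0 1.0 (endianness big))
+bv
+@result{} #vu8(63 240 0 0 0 0 0 0)
+
+(bytevector-ieee-double-ref bv 0 (endianness big))
+@result{} 1.0
+@end lisp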
-@lisp
-(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
-(match:suffix s)
-@result{} "foo"
-@end lisp
+Specialized procedures are also available:
-@c begin (scm-doc-string "regex.scm" "match:count")
-@deffn {Scheme Procedure} match:count match
-Return the number of parenthesized subexpressions from @var{match}.
-Note that the entire regular expression match itself counts as a
-subexpression, and failed submatches are included in the count.
+@deffn {Scheme Procedure} bytevector-ieee-single-native-ref bv index
+@deffnx {Scheme Procedure} bytevector-ieee-double-native-ref bv index
+@deffnx {C Function} scm_bytevector_ieee_single_native_ref (bv, index)
+@deffnx {C Function} scm_bytevector_ieee_double_native_ref (bv, index)
+Return the IEEE-754 single- (resp. double-) precision floating point
+number from @var{bv} at @var{index} according to the host's native
+endianness.
@end deffn
-@c begin (scm-doc-string "regex.scm" "match:string")
-@deffn {Scheme Procedure} match:string match
-Return the original @var{target} string.
+@deffn {Scheme Procedure} bytevector-ieee-single-native-set! bv index value
+@deffnx {Scheme Procedure} bytevector-ieee-double-native-set! bv index value
+@deffnx {C Function} scm_bytevector_ieee_single_native_set_x (bv, index, value)
+@deffnx {C Function} scm_bytevector_ieee_double_native_set_x (bv, index, value)
+Store real number @var{value} in @var{bv} at @var{index} according to
+the host's native endianness.
@end deffn
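+
+A round trip through the native-endianness variants might look like the
+following sketch (assuming the @code{(rnrs bytevectors)} module is
+loaded). The byte layout depends on the host, so only the value read
+back is shown:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+;; A 4-byte bytevector, large enough for one single-precision float.
+(define bv (make-bytevector 4 0))
+
+;; 0.5 is exactly representable in single precision.
+(bytevector-ieee-single-native-set! bv 0 0.5)
+(bytevector-ieee-single-native-ref bv 0)
+@result{} 0.5
+@end lisp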
-@lisp
-(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
-(match:string s)
-@result{} "blah2002foo"
-@end lisp
+@node Bytevectors as Strings
+@subsubsection Interpreting Bytevector Contents as Unicode Strings
+
+@cindex Unicode string encoding
-@node Backslash Escapes
-@subsubsection Backslash Escapes
-
-Sometimes you will want a regexp to match characters like @samp{*} or
-@samp{$} exactly. For example, to check whether a particular string
-represents a menu entry from an Info node, it would be useful to match
-it against a regexp like @samp{^* [^:]*::}. However, this won't work;
-because the asterisk is a metacharacter, it won't match the @samp{*} at
-the beginning of the string. In this case, we want to make the first
-asterisk un-magic.
-
-You can do this by preceding the metacharacter with a backslash
-character @samp{\}. (This is also called @dfn{quoting} the
-metacharacter, and is known as a @dfn{backslash escape}.) When Guile
-sees a backslash in a regular expression, it considers the following
-glyph to be an ordinary character, no matter what special meaning it
-would ordinarily have. Therefore, we can make the above example work by
-changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
-the regular expression engine to match only a single asterisk in the
-target string.
-
-Since the backslash is itself a metacharacter, you may force a regexp to
-match a backslash in the target string by preceding the backslash with
-itself. For example, to find variable references in a @TeX{} program,
-you might want to find occurrences of the string @samp{\let\} followed
-by any number of alphabetic characters. The regular expression
-@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
-regexp each match a single backslash in the target string.
-
-@c begin (scm-doc-string "regex.scm" "regexp-quote")
-@deffn {Scheme Procedure} regexp-quote str
-Quote each special character found in @var{str} with a backslash, and
-return the resulting string.
-@end deffn
-
-@strong{Very important:} Using backslash escapes in Guile source code
-(as in Emacs Lisp or C) can be tricky, because the backslash character
-has special meaning for the Guile reader. For example, if Guile
-encounters the character sequence @samp{\n} in the middle of a string
-while processing Scheme code, it replaces those characters with a
-newline character. Similarly, the character sequence @samp{\t} is
-replaced by a horizontal tab. Several of these @dfn{escape sequences}
-are processed by the Guile reader before your code is executed.
-Unrecognized escape sequences are ignored: if the characters @samp{\*}
-appear in a string, they will be translated to the single character
-@samp{*}.
-
-This translation is obviously undesirable for regular expressions, since
-we want to be able to include backslashes in a string in order to
-escape regexp metacharacters. Therefore, to make sure that a backslash
-is preserved in a string in your Guile program, you must use @emph{two}
-consecutive backslashes:
+Bytevector contents can also be interpreted as Unicode strings encoded
+in one of the most commonly available encoding formats.
@lisp
-(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
+(utf8->string (u8-list->bytevector '(99 97 102 101)))
+@result{} "cafe"
+
+(string->utf8 "caf@'e") ;; LATIN SMALL LETTER E WITH ACUTE
+@result{} #vu8(99 97 102 195 169)
@end lisp
-The string in this example is preprocessed by the Guile reader before
-any code is executed. The resulting argument to @code{make-regexp} is
-the string @samp{^\* [^:]*}, which is what we really want.
+@deffn {Scheme Procedure} string->utf8 str
+@deffnx {Scheme Procedure} string->utf16 str [endianness]
+@deffnx {Scheme Procedure} string->utf32 str [endianness]
+@deffnx {C Function} scm_string_to_utf8 (str)
+@deffnx {C Function} scm_string_to_utf16 (str, endianness)
+@deffnx {C Function} scm_string_to_utf32 (str, endianness)
+Return a newly allocated bytevector that contains the UTF-8, UTF-16, or
+UTF-32 (also known as UCS-4) encoding of @var{str}. For UTF-16 and UTF-32,
+@var{endianness} should be the symbol @code{big} or @code{little}; when omitted,
+it defaults to big endian.
+@end deffn
+
+@deffn {Scheme Procedure} utf8->string utf
+@deffnx {Scheme Procedure} utf16->string utf [endianness]
+@deffnx {Scheme Procedure} utf32->string utf [endianness]
+@deffnx {C Function} scm_utf8_to_string (utf)
+@deffnx {C Function} scm_utf16_to_string (utf, endianness)
+@deffnx {C Function} scm_utf32_to_string (utf, endianness)
+Return a newly allocated string that contains the UTF-8-, UTF-16-,
+or UTF-32-decoded contents of bytevector @var{utf}. For UTF-16 and UTF-32,
+@var{endianness} should be the symbol @code{big} or @code{little}; when omitted,
+it defaults to big endian.
+@end deffn
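+
+The effect of @var{endianness} can be seen in the following sketch,
+which assumes the @code{(rnrs bytevectors)} module is loaded:
+
+@lisp
+(use-modules (rnrs bytevectors))
+
+;; With no endianness argument, big endian is used.
+(string->utf16 "hi")
+@result{} #vu8(0 104 0 105)
+
+(string->utf16 "hi" (endianness little))
+@result{} #vu8(104 0 105 0)
+
+;; Decoding must use the same endianness as encoding.
+(utf16->string #vu8(104 0 105 0) (endianness little))
+@result{} "hi"
+@end lisp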
+
+@node Bytevectors as Generalized Vectors
+@subsubsection Accessing Bytevectors with the Generalized Vector API
+
+As an extension to the R6RS, Guile allows bytevectors to be manipulated
+with the @dfn{generalized vector} procedures (@pxref{Generalized
+Vectors}). This also allows bytevectors to be accessed using the
+generic @dfn{array} procedures (@pxref{Array Procedures}). When using
+these APIs, bytes are accessed one at a time as 8-bit unsigned integers:
+
+@example
+(define bv #vu8(0 1 2 3))
-This also means that in order to write a regular expression that matches
-a single backslash character, the regular expression string in the
-source code must include @emph{four} backslashes. Each consecutive pair
-of backslashes gets translated by the Guile reader to a single
-backslash, and the resulting double-backslash is interpreted by the
-regexp engine as matching a single backslash character. Hence:
+(generalized-vector? bv)
+@result{} #t
-@lisp
-(define tex-variable-pattern (make-regexp "\\\\let\\\\=[A-Za-z]*"))
-@end lisp
+(generalized-vector-ref bv 2)
+@result{} 2
+
+(generalized-vector-set! bv 2 77)
+(array-ref bv 2)
+@result{} 77
+
+(array-type bv)
+@result{} vu8
+@end example
+
+
+@node Bytevectors as Uniform Vectors
+@subsubsection Accessing Bytevectors with the SRFI-4 API
-The reason for the unwieldiness of this syntax is historical. Both
-regular expression pattern matchers and Unix string processing systems
-have traditionally used backslashes with the special meanings
-described above. The POSIX regular expression specification and ANSI C
-standard both require these semantics. Attempting to abandon either
-convention would cause other kinds of compatibility problems, possibly
-more severe ones. Therefore, without extending the Scheme reader to
-support strings with different quoting conventions (an ungainly and
-confusing extension when implemented in other languages), we must adhere
-to this cumbersome escape syntax.
+Bytevectors may also be accessed with the SRFI-4 API. @xref{SRFI-4 and
+Bytevectors}, for more information.
@node Symbols
Most symbols are created by writing them literally in code. However it
is also possible to create symbols programmatically using the following
-@code{string->symbol} and @code{string-ci->symbol} procedures:
+procedures:
+
+@deffn {Scheme Procedure} symbol char@dots{}
+@rnindex symbol
+Return a newly allocated symbol made from the given character arguments.
+
+@example
+(symbol #\x #\y #\z) @result{} xyz
+@end example
+@end deffn
+
+@deffn {Scheme Procedure} list->symbol lst
+@rnindex list->symbol
+Return a newly allocated symbol made from a list of characters.
+
+@example
+(list->symbol '(#\a #\b #\c)) @result{} abc
+@end example
+@end deffn
+
+@rnindex symbol-append
+@deffn {Scheme Procedure} symbol-append . args
+Return a newly allocated symbol whose characters form the
+concatenation of the given symbols, @var{args}.
+
+@example
+(let ((h 'hello))
+ (symbol-append h 'world))
+@result{} helloworld
+@end example
+@end deffn
@rnindex string->symbol
@deffn {Scheme Procedure} string->symbol string
can then use @var{str} directly as its internal representation.
@end deftypefn
+The size of a symbol can also be obtained from C:
+
+@deftypefn {C Function} size_t scm_c_symbol_length (SCM sym)
+Return the number of characters in @var{sym}.
+@end deftypefn
Finally, some applications, especially those that generate new Scheme
code dynamically, need to generate symbols for use in the generated
Guile's keyword support conforms to R5RS, and adds a (switchable) read
syntax extension to permit keywords to begin with @code{:} as well as
-@code{#:}.
+@code{#:}, or to end with @code{:}.
@menu
* Why Use Keywords?:: Motivation for keyword usage.
recognizes the alternative read syntax @code{:NAME}. Otherwise, tokens
of the form @code{:NAME} are read as symbols, as required by R5RS.
+@cindex SRFI-88 keyword syntax
+
+If the @code{keywords} read option is set to @code{'postfix}, Guile
+recognizes the SRFI-88 read syntax @code{NAME:} (@pxref{SRFI-88}).
+Otherwise, tokens of this form are read as symbols.
+
To enable and disable the alternative non-R5RS keyword syntax, you use
-the @code{read-set!} procedure documented in @ref{User level options
-interfaces} and @ref{Reader options}.
+the @code{read-set!} procedure documented in @ref{Scheme Read}. Note that
+the @code{prefix} and @code{postfix} syntaxes are mutually exclusive.
-@smalllisp
+@lisp
(read-set! keywords 'prefix)
#:type
@result{}
#:type
+(read-set! keywords 'postfix)
+
+type:
+@result{}
+#:type
+
+:type
+@result{}
+:type
+
(read-set! keywords #f)
#:type
ERROR: In expression :type:
ERROR: Unbound variable: :type
ABORT: (unbound-variable)
-@end smalllisp
+@end lisp
@node Keyword Procedures
@subsubsection Keyword Procedures
@node Other Types
@subsection ``Functionality-Centric'' Data Types
-Procedures and macros are documented in their own chapter: see
-@ref{Procedures and Macros}.
+Procedures and macros are documented in their own sections: see
+@ref{Procedures} and @ref{Macros}.
Variable objects are documented as part of the description of Guile's
module system: see @ref{Variables}.
-Asyncs, dynamic roots and fluids are described in the chapter on
+Asyncs, dynamic roots and fluids are described in the section on
scheduling: see @ref{Scheduling}.
-Hooks are documented in the chapter on general utility functions: see
+Hooks are documented in the section on general utility functions: see
@ref{Hooks}.
-Ports are described in the chapter on I/O: see @ref{Input and Output}.
+Ports are described in the section on I/O: see @ref{Input and Output}.
+Regular expressions are described in their own section: see @ref{Regular
+Expressions}.
@c Local Variables:
@c TeX-master: "guile.texi"