@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
-@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009
+@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010
@c Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.
-@page
@node Simple Data Types
@section Simple Generic Data Types
* Complex:: Complex number operations.
* Arithmetic:: Arithmetic functions.
* Scientific:: Scientific functions.
-* Primitive Numerics:: Primitive numeric functions.
* Bitwise Operations:: Logical AND, OR, NOT, and so on.
* Random:: Random number generation.
@end menu
infinity, depending on the sign of the divided number.
The infinities are written @samp{+inf.0} and @samp{-inf.0},
-respectivly. This syntax is also recognized by @code{read} as an
+respectively. This syntax is also recognized by @code{read} as an
extension to the usual Scheme syntax.
Dividing zero by zero yields something that is not a number at all:
@end deftypefn
@deftypefn {C Function} SCM scm_from_double (double val)
-Return the @code{SCM} value that representats @var{val}. The returned
+Return the @code{SCM} value that represents @var{val}. The returned
value is inexact according to the predicate @code{inexact?}, but it
will be exactly equal to @var{val}.
@end deftypefn
@rnindex magnitude
@rnindex angle
-@deffn {Scheme Procedure} make-rectangular real imaginary
-@deffnx {C Function} scm_make_rectangular (real, imaginary)
-Return a complex number constructed of the given @var{real} and
-@var{imaginary} parts.
+@deffn {Scheme Procedure} make-rectangular real_part imaginary_part
+@deffnx {C Function} scm_make_rectangular (real_part, imaginary_part)
+Return a complex number constructed of the given @var{real-part} and @var{imaginary-part} parts.
@end deffn
@deffn {Scheme Procedure} make-polar x y
@end deffn
-@node Primitive Numerics
-@subsubsection Primitive Numeric Functions
-
-Many of Guile's numeric procedures which accept any kind of numbers as
-arguments, including complex numbers, are implemented as Scheme
-procedures that use the following real number-based primitives. These
-primitives signal an error if they are called with complex arguments.
-
-@c begin (texi-doc-string "guile" "$abs")
-@deffn {Scheme Procedure} $abs x
-Return the absolute value of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$sqrt")
-@deffn {Scheme Procedure} $sqrt x
-Return the square root of @var{x}.
-@end deffn
-
-@deffn {Scheme Procedure} $expt x y
-@deffnx {C Function} scm_sys_expt (x, y)
-Return @var{x} raised to the power of @var{y}. This
-procedure does not accept complex arguments.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$sin")
-@deffn {Scheme Procedure} $sin x
-Return the sine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$cos")
-@deffn {Scheme Procedure} $cos x
-Return the cosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$tan")
-@deffn {Scheme Procedure} $tan x
-Return the tangent of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$asin")
-@deffn {Scheme Procedure} $asin x
-Return the arcsine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$acos")
-@deffn {Scheme Procedure} $acos x
-Return the arccosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$atan")
-@deffn {Scheme Procedure} $atan x
-Return the arctangent of @var{x} in the range @minus{}@math{PI/2} to
-@math{PI/2}.
-@end deffn
-
-@deffn {Scheme Procedure} $atan2 x y
-@deffnx {C Function} scm_sys_atan2 (x, y)
-Return the arc tangent of the two arguments @var{x} and
-@var{y}. This is similar to calculating the arc tangent of
-@var{x} / @var{y}, except that the signs of both arguments
-are used to determine the quadrant of the result. This
-procedure does not accept complex arguments.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$exp")
-@deffn {Scheme Procedure} $exp x
-Return e to the power of @var{x}, where e is the base of natural
-logarithms (2.71828@dots{}).
-@end deffn
-
-@c begin (texi-doc-string "guile" "$log")
-@deffn {Scheme Procedure} $log x
-Return the natural logarithm of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$sinh")
-@deffn {Scheme Procedure} $sinh x
-Return the hyperbolic sine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$cosh")
-@deffn {Scheme Procedure} $cosh x
-Return the hyperbolic cosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$tanh")
-@deffn {Scheme Procedure} $tanh x
-Return the hyperbolic tangent of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$asinh")
-@deffn {Scheme Procedure} $asinh x
-Return the hyperbolic arcsine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$acosh")
-@deffn {Scheme Procedure} $acosh x
-Return the hyperbolic arccosine of @var{x}.
-@end deffn
-
-@c begin (texi-doc-string "guile" "$atanh")
-@deffn {Scheme Procedure} $atanh x
-Return the hyperbolic arctangent of @var{x}.
-@end deffn
-
-C functions for the above are provided by the standard mathematics
-library. Naturally these expect and return @code{double} arguments
-(@pxref{Mathematics,,, libc, GNU C Library Reference Manual}).
-
-@multitable {xx} {Scheme Procedure} {C Function}
-@item @tab Scheme Procedure @tab C Function
-
-@item @tab @code{$abs} @tab @code{fabs}
-@item @tab @code{$sqrt} @tab @code{sqrt}
-@item @tab @code{$sin} @tab @code{sin}
-@item @tab @code{$cos} @tab @code{cos}
-@item @tab @code{$tan} @tab @code{tan}
-@item @tab @code{$asin} @tab @code{asin}
-@item @tab @code{$acos} @tab @code{acos}
-@item @tab @code{$atan} @tab @code{atan}
-@item @tab @code{$atan2} @tab @code{atan2}
-@item @tab @code{$exp} @tab @code{exp}
-@item @tab @code{$expt} @tab @code{pow}
-@item @tab @code{$log} @tab @code{log}
-@item @tab @code{$sinh} @tab @code{sinh}
-@item @tab @code{$cosh} @tab @code{cosh}
-@item @tab @code{$tanh} @tab @code{tanh}
-@item @tab @code{$asinh} @tab @code{asinh}
-@item @tab @code{$acosh} @tab @code{acosh}
-@item @tab @code{$atanh} @tab @code{atanh}
-@end multitable
-
-@code{asinh}, @code{acosh} and @code{atanh} are C99 standard but might
-not be available on older systems. Guile provides the following
-equivalents (on all systems).
-
-@deftypefn {C Function} double scm_asinh (double x)
-@deftypefnx {C Function} double scm_acosh (double x)
-@deftypefnx {C Function} double scm_atanh (double x)
-Return the hyperbolic arcsine, arccosine or arctangent of @var{x}
-respectively.
-@end deftypefn
-
-
@node Bitwise Operations
@subsubsection Bitwise Operations
@subsubsection Random Number Generation
Pseudo-random numbers are generated from a random state object, which
-can be created with @code{seed->random-state}. The @var{state}
-parameter to the various functions below is optional, it defaults to
-the state object in the @code{*random-state*} variable.
+can be created with @code{seed->random-state} or
+@code{datum->random-state}. An external representation (i.e. one
+which can written with @code{write} and read with @code{read}) of a
+random state object can be obtained via
+@code{random-state->datum}. The @var{state} parameter to the
+various functions below is optional, it defaults to the state object
+in the @code{*random-state*} variable.
@deffn {Scheme Procedure} copy-random-state [state]
@deffnx {C Function} scm_copy_random_state (state)
Return a new random state using @var{seed}.
@end deffn
+@deffn {Scheme Procedure} datum->random-state datum
+@deffnx {C Function} scm_datum_to_random_state (datum)
+Return a new random state from @var{datum}, which should have been
+obtained by @code{random-state->datum}.
+@end deffn
+
+@deffn {Scheme Procedure} random-state->datum state
+@deffnx {C Function} scm_random_state_to_datum (state)
+Return a datum representation of @var{state} that may be written out and
+read back with the Scheme reader.
+@end deffn
+
@defvar *random-state*
The global random state used by the above functions when the
@var{state} parameter is not given.
In Scheme, there is a data type to describe a single character.
Defining what exactly a character @emph{is} can be more complicated
-than it seems. Guile follows the advice of R6RS and just uses The
-Unicode Standard to help define what a character is. So, for Guile,
-a character is anything in the Unicode Character Database.
-
-Unicode assigns each character an unique integer representation: a
-@emph{code point}. Guile uses Unicode code points as the integer
-representation of characters. Valid code points are in the ranges 0
-to @code{#xD7FF} inclusive or @code{#xE000} to @code{#x10FFFF}
-inclusive.
+than it seems. Guile follows the advice of R6RS and uses The Unicode
+Standard to help define what a character is. So, for Guile, a
+character is anything in the Unicode Character Database.
+
+@cindex code point
+@cindex Unicode code point
+
+The Unicode Character Database is basically a table of characters
+indexed using integers called 'code points'. Valid code points are in
+the ranges 0 to @code{#xD7FF} inclusive or @code{#xE000} to
+@code{#x10FFFF} inclusive, which is about 1.1 million code points.
+
+@cindex designated code point
+@cindex code point, designated
+
+Any code point that has been assigned to a character or that has
+otherwise been given a meaning by Unicode is called a 'designated code
+point'. Most of the designated code points, about 200,000 of them,
+indicate characters, accents or other combining marks that modify
+other characters, symbols, whitespace, and control characters. Some
+are not characters but indicators that suggest how to format or
+display neighboring characters.
+
+@cindex reserved code point
+@cindex code point, reserved
+
+If a code point is not a designated code point -- if it has not been
+assigned to a character by The Unicode Standard -- it is a 'reserved
+code point', meaning that they are reserved for future use. Most of
+the code points, about 800,000, are 'reserved code points'.
+
+By convention, a Unicode code point is written as
+``U+XXXX'' where ``XXXX'' is a hexadecimal number. Please note that
+this convenient notation is not valid code. Guile does not interpret
+``U+XXXX'' as a character.
In Scheme, a character literal is written as @code{#\@var{name}} where
@var{name} is the name of the character that you want. Printable
characters have their usual single character name; for example,
-@code{#\a} is a lower case @code{a}. Many of the non-printing
-characters, such as whitespace characters and control characters, also
-have names.
+@code{#\a} is a lower case @code{a}.
+
+Some of the code points are 'combining characters' that are not meant
+to be printed by themselves but are instead meant to modify the
+appearance of the previous character. For combining characters, an
+alternate form of the character literal is @code{#\} followed by
+U+25CC (a small, dotted circle), followed by the combining character.
+This allows the combining character to be drawn on the circle, not on
+the backslash of @code{#\}.
+
+Many of the non-printing characters, such as whitespace characters and
+control characters, also have names.
+
+The most commonly used non-printing characters have long character
+names, described in the table below.
+
+@multitable {@code{#\backspace}} {Preferred}
+@item Character Name @tab Codepoint
+@item @code{#\nul} @tab U+0000
+@item @code{#\alarm} @tab u+0007
+@item @code{#\backspace} @tab U+0008
+@item @code{#\tab} @tab U+0009
+@item @code{#\linefeed} @tab U+000A
+@item @code{#\newline} @tab U+000A
+@item @code{#\vtab} @tab U+000B
+@item @code{#\page} @tab U+000C
+@item @code{#\return} @tab U+000D
+@item @code{#\esc} @tab U+001B
+@item @code{#\space} @tab U+0020
+@item @code{#\delete} @tab U+007F
+@end multitable
-The most commonly used non-printing chararacters are space and
-newline. Their character names are @code{#\space} and
-@code{#\newline}. There are also names for all of the ``C0 control
-characters'' (those with code points below 32). The following table
-describes the names for each character.
+There are also short names for all of the ``C0 control characters''
+(those with code points below 32). The following table lists the short
+name for each character.
@multitable @columnfractions .25 .25 .25 .25
@item 0 = @code{#\nul}
@tab 7 = @code{#\bel}
@item 8 = @code{#\bs}
@tab 9 = @code{#\ht}
- @tab 10 = @code{#\lf}
+ @tab 10 = @code{#\lf}
@tab 11 = @code{#\vt}
@item 12 = @code{#\ff}
@tab 13 = @code{#\cr}
@item 32 = @code{#\sp}
@end multitable
-The ``delete'' character (code point 127) may be referred to with the
-name @code{#\del}.
+The short name for the ``delete'' character (code point U+007F) is
+@code{#\del}.
-One might note that the space character has two names --
-@code{#\space} and @code{#\sp} -- as does the newline character.
-Several other non-printing characters have more than one name, for the
-sake of compatibility with previous versions.
+There are also a few alternative names left over for compatibility with
+previous versions of Guile.
@multitable {@code{#\backspace}} {Preferred}
@item Alternate @tab Standard
-@item @code{#\sp} @tab @code{#\space}
@item @code{#\nl} @tab @code{#\newline}
-@item @code{#\lf} @tab @code{#\newline}
-@item @code{#\tab} @tab @code{#\ht}
-@item @code{#\backspace} @tab @code{#\bs}
-@item @code{#\return} @tab @code{#\cr}
-@item @code{#\page} @tab @code{#\ff}
-@item @code{#\np} @tab @code{#\ff}
+@item @code{#\np} @tab @code{#\page}
@item @code{#\null} @tab @code{#\nul}
@end multitable
-Characters may also be referred to with an octal value, such as
-@code{#\10} for @code{#\bs} or @code{#\177} for @code{#\del}.
+Characters may also be written using their code point values. They can
+be written with as an octal number, such as @code{#\10} for
+@code{#\bs} or @code{#\177} for @code{#\del}.
+
+If one prefers hex to octal, there is an additional syntax for character
+escapes: @code{#\xHHHH} -- the letter 'x' followed by a hexadecimal
+number of one to eight digits.
@rnindex char?
@deffn {Scheme Procedure} char? x
Return @code{#t} iff @var{x} is a character, else @code{#f}.
@end deffn
-Fundamentally, the character comparisons operations below are
+Fundamentally, the character comparison operations below are
numeric comparisons of the character's code points.
@rnindex char=?
equal to the code point of @var{y}, else @code{#f}.
@end deffn
-Case-insensitive character comparisons of characters use @emph{Unicode
-case folding}. In case folding comparisons, if a character is
-lowercase and has an uppercase form that can be expressed as a single
-character, it is converted to uppercase before comparison. Unicode
-case folding is language independent: it uses rules that are generally
-true, but, it cannot cover all cases for all languages.
+@cindex case folding
+
+Case-insensitive character comparisons use @emph{Unicode case
+folding}. In case folding comparisons, if a character is lowercase
+and has an uppercase form that can be expressed as a single character,
+it is converted to uppercase before comparison. All other characters
+undergo no conversion before the comparison occurs. This includes the
+German sharp S (Eszett) which is not uppercased before conversion
+because its uppercase form has two characters. Unicode case folding
+is language independent: it uses rules that are generally true, but,
+it cannot cover all cases for all languages.
@rnindex char-ci=?
@deffn {Scheme Procedure} char-ci=? x y
@code{#f}.
@end deffn
+@deffn {Scheme Procedure} char-general-category chr
+@deffnx {C Function} scm_char_general_category (chr)
+Return a symbol giving the two-letter name of the Unicode general
+category assigned to @var{chr} or @code{#f} if no named category is
+assigned. The following table provides a list of category names along
+with their meanings.
+
+@multitable @columnfractions .1 .4 .1 .4
+@item Lu
+ @tab Uppercase letter
+ @tab Pf
+ @tab Final quote punctuation
+@item Ll
+ @tab Lowercase letter
+ @tab Po
+ @tab Other punctuation
+@item Lt
+ @tab Titlecase letter
+ @tab Sm
+ @tab Math symbol
+@item Lm
+ @tab Modifier letter
+ @tab Sc
+ @tab Currency symbol
+@item Lo
+ @tab Other letter
+ @tab Sk
+ @tab Modifier symbol
+@item Mn
+ @tab Non-spacing mark
+ @tab So
+ @tab Other symbol
+@item Mc
+ @tab Combining spacing mark
+ @tab Zs
+ @tab Space separator
+@item Me
+ @tab Enclosing mark
+ @tab Zl
+ @tab Line separator
+@item Nd
+ @tab Decimal digit number
+ @tab Zp
+ @tab Paragraph separator
+@item Nl
+ @tab Letter number
+ @tab Cc
+ @tab Control
+@item No
+ @tab Other number
+ @tab Cf
+ @tab Format
+@item Pc
+ @tab Connector punctuation
+ @tab Cs
+ @tab Surrogate
+@item Pd
+ @tab Dash punctuation
+ @tab Co
+ @tab Private use
+@item Ps
+ @tab Open punctuation
+ @tab Cn
+ @tab Unassigned
+@item Pe
+ @tab Close punctuation
+ @tab
+ @tab
+@item Pi
+ @tab Initial quote punctuation
+ @tab
+ @tab
+@end multitable
+@end deffn
+
@rnindex char->integer
@deffn {Scheme Procedure} char->integer chr
@deffnx {C Function} scm_char_to_integer (chr)
Return the lowercase character version of @var{chr}.
@end deffn
+@rnindex char-titlecase
+@deffn {Scheme Procedure} char-titlecase chr
+@deffnx {C Function} scm_char_titlecase (chr)
+Return the titlecase character version of @var{chr} if one exists;
+otherwise return the uppercase version.
+
+For most characters these will be the same, but the Unicode Standard
+includes certain digraph compatibility characters, such as @code{U+01F3}
+``dz'', for which the uppercase and titlecase characters are different
+(@code{U+01F1} ``DZ'' and @code{U+01F2} ``Dz'' in this case,
+respectively).
+@end deffn
+
+@tindex scm_t_wchar
+@deftypefn {C Function} scm_t_wchar scm_c_upcase (scm_t_wchar @var{c})
+@deftypefnx {C Function} scm_t_wchar scm_c_downcase (scm_t_wchar @var{c})
+@deftypefnx {C Function} scm_t_wchar scm_c_titlecase (scm_t_wchar @var{c})
+
+These C functions take an integer representation of a Unicode
+codepoint and return the codepoint corresponding to its uppercase,
+lowercase, and titlecase forms respectively. The type
+@code{scm_t_wchar} is a signed, 32-bit integer.
+@end deftypefn
+
@node Character Sets
@subsection Character Sets
Character sets can be created, extended, tested for the membership of a
characters and be compared to other character sets.
-The Guile implementation of character sets currently deals only with
-8-bit characters. In the future, when Guile gets support for
-international character sets, this will change, but the functions
-provided here will always then be able to efficiently cope with very
-large character sets.
-
@menu
* Character Set Predicates/Comparison::
* Iterating Over Character Sets:: Enumerate charset elements.
If @var{error} is a true value, an error is signalled if the
specified range contains characters which are not contained in
the implemented character range. If @var{error} is @code{#f},
-these characters are silently left out of the resultung
+these characters are silently left out of the resulting
character set.
The characters in @var{base_cs} are added to the result, if
If @var{error} is a true value, an error is signalled if the
specified range contains characters which are not contained in
the implemented character range. If @var{error} is @code{#f},
-these characters are silently left out of the resultung
+these characters are silently left out of the resulting
character set.
The characters are added to @var{base_cs} and @var{base_cs} is
@deffn {Scheme Procedure} ->char-set x
@deffnx {C Function} scm_to_char_set (x)
-Coerces x into a char-set. @var{x} may be a string, character or char-set. A string is converted to the set of its constituent characters; a character is converted to a singleton set; a char-set is returned as-is.
+Coerces x into a char-set. @var{x} may be a string, character or
+char-set. A string is converted to the set of its constituent
+characters; a character is converted to a singleton set; a char-set is
+returned as-is.
@end deffn
@c ===================================================================
Access the elements and other information of a character set with these
procedures.
+@deffn {Scheme Procedure} %char-set-dump cs
+Returns an association list containing debugging information
+for @var{cs}. The association list has the following entries.
+@table @code
+@item char-set
+The char-set itself
+@item len
+The number of groups of contiguous code points the char-set
+contains
+@item ranges
+A list of lists where each sublist is a range of code points
+and their associated characters
+@end table
+The return value of this function cannot be relied upon to be
+consistent between versions of Guile and should not be used in code.
+@end deffn
+
@deffn {Scheme Procedure} char-set-size cs
@deffnx {C Function} scm_char_set_size (cs)
Return the number of elements in character set @var{cs}.
Return the complement of the character set @var{cs}.
@end deffn
+Note that the complement of a character set is likely to contain many
+reserved code points (code points that are not associated with
+characters). It may be helpful to modify the output of
+@code{char-set-complement} by computing its intersection with the set
+of designated code points, @code{char-set:designated}.
+
@deffn {Scheme Procedure} char-set-union . rest
@deffnx {C Function} scm_char_set_union (rest)
Return the union of all argument character sets.
@cindex charset
@cindex locale
-Currently, the contents of these character sets are recomputed upon a
-successful @code{setlocale} call (@pxref{Locales}) in order to reflect
-the characters available in the current locale's codeset. For
-instance, @code{char-set:letter} contains 52 characters under an ASCII
-locale (e.g., the default @code{C} locale) and 117 characters under an
-ISO-8859-1 (``Latin-1'') locale.
+These character sets are locale independent and are not recomputed
+upon a @code{setlocale} call. They contain characters from the whole
+range of Unicode code points. For instance, @code{char-set:letter}
+contains about 94,000 characters.
@defvr {Scheme Variable} char-set:lower-case
@defvrx {C Variable} scm_char_set_lower_case
@defvr {Scheme Variable} char-set:title-case
@defvrx {C Variable} scm_char_set_title_case
-This is empty, because ASCII has no titlecase characters.
+All single characters that function as if they were an upper-case
+letter followed by a lower-case letter.
@end defvr
@defvr {Scheme Variable} char-set:letter
@defvrx {C Variable} scm_char_set_letter
-All letters, e.g. the union of @code{char-set:lower-case} and
-@code{char-set:upper-case}.
+All letters. This includes @code{char-set:lower-case},
+@code{char-set:upper-case}, @code{char-set:title-case}, and many
+letters that have no case at all. For example, Chinese and Japanese
+characters typically have no concept of case.
@end defvr
@defvr {Scheme Variable} char-set:digit
@defvr {Scheme Variable} char-set:blank
@defvrx {C Variable} scm_char_set_blank
-All horizontal whitespace characters, that is @code{#\space} and
-@code{#\tab}.
+All horizontal whitespace characters, which notably includes
+@code{#\space} and @code{#\tab}.
@end defvr
@defvr {Scheme Variable} char-set:iso-control
@defvrx {C Variable} scm_char_set_iso_control
-The ISO control characters with the codes 0--31 and 127.
+The ISO control characters are the C0 control characters (U+0000 to
+U+001F), delete (U+007F), and the C1 control characters (U+0080 to
+U+009F).
@end defvr
@defvr {Scheme Variable} char-set:punctuation
@defvrx {C Variable} scm_char_set_punctuation
-The characters @code{!"#%&'()*,-./:;?@@[\\]_@{@}}
+All punctuation characters, such as the characters
+@code{!"#%&'()*,-./:;?@@[\\]_@{@}}
@end defvr
@defvr {Scheme Variable} char-set:symbol
@defvrx {C Variable} scm_char_set_symbol
-The characters @code{$+<=>^`|~}.
+All symbol characters, such as the characters @code{$+<=>^`|~}.
@end defvr
@defvr {Scheme Variable} char-set:hex-digit
The empty character set.
@end defvr
+@defvr {Scheme Variable} char-set:designated
+@defvrx {C Variable} scm_char_set_designated
+This character set contains all designated code points. This includes
+all the code points to which Unicode has assigned a character or other
+meaning.
+@end defvr
+
@defvr {Scheme Variable} char-set:full
@defvrx {C Variable} scm_char_set_full
-This character set contains all possible characters.
+This character set contains all possible code points. This includes
+both designated and reserved code points.
@end defvr
@node Strings
When one of these two strings is modified, as with @code{string-set!},
their common memory does get copied so that each string has its own
-memory and modifying one does not accidently modify the other as well.
+memory and modifying one does not accidentally modify the other as well.
Thus, Guile's strings are `copy on write'; the actual copying of their
memory is delayed until one string is written to.
* Reversing and Appending Strings:: Appending strings to form a new string.
* Mapping Folding and Unfolding:: Iterating over strings.
* Miscellaneous String Operations:: Replicating, insertion, parsing, ...
-* Conversion to/from C::
+* Conversion to/from C::
+* String Internals:: The storage strategy for strings.
@end menu
@node String Syntax
The read syntax for strings is an arbitrarily long sequence of
characters enclosed in double quotes (@nicode{"}).
-Backslash is an escape character and can be used to insert the
-following special characters. @nicode{\"} and @nicode{\\} are R5RS
-standard, the rest are Guile extensions, notice they follow C string
-syntax.
+Backslash is an escape character and can be used to insert the following
+special characters. @nicode{\"} and @nicode{\\} are R5RS standard, the
+next seven are R6RS standard --- notice they follow C syntax --- and the
+remaining four are Guile extensions.
@table @asis
@item @nicode{\\}
Double quote character (an unescaped @nicode{"} is otherwise the end
of the string).
-@item @nicode{\0}
-NUL character (ASCII 0).
-
@item @nicode{\a}
Bell character (ASCII 7).
@item @nicode{\v}
Vertical tab character (ASCII 11).
+@item @nicode{\b}
+Backspace character (ASCII 8).
+
+@item @nicode{\0}
+NUL character (ASCII 0).
+
@item @nicode{\xHH}
Character code given by two hexadecimal digits. For example
@nicode{\x7f} for an ASCII DEL (127).
+
+@item @nicode{\uHHHH}
+Character code given by four hexadecimal digits. For example
+@nicode{\u0100} for a capital A with macron (U+0100).
+
+@item @nicode{\UHHHHHH}
+Character code given by six hexadecimal digits. For example
+@nicode{\U010402}.
@end table
@noindent
"\"Hi\", he said."
@end lisp
+The three escape sequences @code{\xHH}, @code{\uHHHH} and @code{\UHHHHHH} were
+chosen to not break compatibility with code written for previous versions of
+Guile. The R6RS specification suggests a different, incompatible syntax for hex
+escapes: @code{\xHHHH;} -- a character code followed by one to eight hexadecimal
+digits terminated with a semicolon. If this escape format is desired instead,
+it can be enabled with the reader option @code{r6rs-hex-escapes}.
+
+@lisp
+(read-enable 'r6rs-hex-escapes)
+@end lisp
+
+More on reader options in general can be found at (@pxref{Reader
+options}).
@node String Predicates
@subsubsection String Predicates
@deffnx {C Function} scm_string_trim (s, char_pred, start, end)
@deffnx {C Function} scm_string_trim_right (s, char_pred, start, end)
@deffnx {C Function} scm_string_trim_both (s, char_pred, start, end)
-Trim occurrances of @var{char_pred} from the ends of @var{s}.
+Trim occurrences of @var{char_pred} from the ends of @var{s}.
@code{string-trim} trims @var{char_pred} characters from the left
(start) of the string, @code{string-trim-right} trims them from the
predicates (@pxref{Characters}), but are defined on character sequences.
The first set is specified in R5RS and has names that end in @code{?}.
-The second set is specified in SRFI-13 and the names have no ending
-@code{?}. The predicates ending in @code{-ci} ignore the character case
-when comparing strings. @xref{Text Collation, the @code{(ice-9
+The second set is specified in SRFI-13 and the names have not ending
+@code{?}.
+
+The predicates ending in @code{-ci} ignore the character case
+when comparing strings. For now, case-insensitive comparison is done
+using the R5RS rules, where every lower-case character that has a
+single character upper-case form is converted to uppercase before
+comparison. See @xref{Text Collation, the @code{(ice-9
i18n)} module}, for locale-dependent string comparison.
@rnindex string=?
-@deffn {Scheme Procedure} string=? s1 s2
+@deffn {Scheme Procedure} string=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_equal_p (s1, s2, rest)
Lexicographic equality predicate; return @code{#t} if the two
strings are the same length and contain the same characters in
the same positions, otherwise return @code{#f}.
@end deffn
@rnindex string<?
-@deffn {Scheme Procedure} string<? s1 s2
+@deffn {Scheme Procedure} string<? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_less_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically less than @var{s2}.
@end deffn
@rnindex string<=?
-@deffn {Scheme Procedure} string<=? s1 s2
+@deffn {Scheme Procedure} string<=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_leq_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically less than or equal to @var{s2}.
@end deffn
@rnindex string>?
-@deffn {Scheme Procedure} string>? s1 s2
+@deffn {Scheme Procedure} string>? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_gr_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically greater than @var{s2}.
@end deffn
@rnindex string>=?
-@deffn {Scheme Procedure} string>=? s1 s2
+@deffn {Scheme Procedure} string>=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_geq_p (s1, s2, rest)
Lexicographic ordering predicate; return @code{#t} if @var{s1}
is lexicographically greater than or equal to @var{s2}.
@end deffn
@rnindex string-ci=?
-@deffn {Scheme Procedure} string-ci=? s1 s2
+@deffn {Scheme Procedure} string-ci=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_equal_p (s1, s2, rest)
Case-insensitive string equality predicate; return @code{#t} if
the two strings are the same length and their component
characters match (ignoring case) at each position; otherwise
@end deffn
@rnindex string-ci<?
-@deffn {Scheme Procedure} string-ci<? s1 s2
+@deffn {Scheme Procedure} string-ci<? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_less_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically less than @var{s2}
regardless of case.
@end deffn
@rnindex string<=?
-@deffn {Scheme Procedure} string-ci<=? s1 s2
+@deffn {Scheme Procedure} string-ci<=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_leq_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically less than or equal
to @var{s2} regardless of case.
@end deffn
@rnindex string-ci>?
-@deffn {Scheme Procedure} string-ci>? s1 s2
+@deffn {Scheme Procedure} string-ci>? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_gr_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically greater than
@var{s2} regardless of case.
@end deffn
@rnindex string-ci>=?
-@deffn {Scheme Procedure} string-ci>=? s1 s2
+@deffn {Scheme Procedure} string-ci>=? [s1 [s2 . rest]]
+@deffnx {C Function} scm_i_string_ci_geq_p (s1, s2, rest)
Case insensitive lexicographic ordering predicate; return
@code{#t} if @var{s1} is lexicographically greater than or
equal to @var{s2} regardless of case.
equal to, or greater than @var{s2}. The mismatch index is the
largest index @var{i} such that for every 0 <= @var{j} <
@var{i}, @var{s1}[@var{j}] = @var{s2}[@var{j}] -- that is,
-@var{i} is the first position that does not match. The
-character comparison is done case-insensitively.
+@var{i} is the first position where the lowercased letters
+do not match.
+
@end deffn
@deffn {Scheme Procedure} string= s1 s2 [start1 [end1 [start2 [end2]]]]
Compute a hash value for @var{S}. the optional argument @var{bound} is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,bound).
@end deffn
+Because the same visual appearance of an abstract Unicode character can
+be obtained via multiple sequences of Unicode characters, even the
+case-insensitive string comparison functions described above may return
+@code{#f} when presented with strings containing different
+representations of the same character. For example, the Unicode
+character ``LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE'' can be
+represented with a single character (U+1E69) or by the character ``LATIN
+SMALL LETTER S'' (U+0073) followed by the combining marks ``COMBINING
+DOT BELOW'' (U+0323) and ``COMBINING DOT ABOVE'' (U+0307).
+
+For this reason, it is often desirable to ensure that the strings
+to be compared are using a mutually consistent representation for every
+character. The Unicode standard defines two methods of normalizing the
+contents of strings: Decomposition, which breaks composite characters
+into a set of constituent characters with an ordering defined by the
+Unicode Standard; and composition, which performs the converse.
+
+There are two decomposition operations. ``Canonical decomposition''
+produces character sequences that share the same visual appearance as
+the original characters, while ``compatiblity decomposition'' produces
+ones whose visual appearances may differ from the originals but which
+represent the same abstract character.
+
+These operations are encapsulated in the following set of normalization
+forms:
+
+@table @dfn
+@item NFD
+Characters are decomposed to their canonical forms.
+
+@item NFKD
+Characters are decomposed to their compatibility forms.
+
+@item NFC
+Characters are decomposed to their canonical forms, then composed.
+
+@item NFKC
+Characters are decomposed to their compatibility forms, then composed.
+
+@end table
+
+The functions below put their arguments into one of the forms described
+above.
+
+@deffn {Scheme Procedure} string-normalize-nfd s
+@deffnx {C Function} scm_string_normalize_nfd (s)
+Return the @code{NFD} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfkd s
+@deffnx {C Function} scm_string_normalize_nfkd (s)
+Return the @code{NFKD} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfc s
+@deffnx {C Function} scm_string_normalize_nfc (s)
+Return the @code{NFC} normalized form of @var{s}.
+@end deffn
+
+@deffn {Scheme Procedure} string-normalize-nfkc s
+@deffnx {C Function} scm_string_normalize_nfkc (s)
+Return the @code{NFKC} normalized form of @var{s}.
+@end deffn
+
@node String Searching
@subsubsection String Searching
@deffn {Scheme Procedure} string-index s char_pred [start [end]]
@deffnx {C Function} scm_string_index (s, char_pred, start, end)
Search through the string @var{s} from left to right, returning
-the index of the first occurence of a character which
+the index of the first occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure,
+satisfies the predicate @var{char_pred}, if it is a procedure,
@item
is in the set @var{char_pred}, if it is a character set.
@deffn {Scheme Procedure} string-rindex s char_pred [start [end]]
@deffnx {C Function} scm_string_rindex (s, char_pred, start, end)
Search through the string @var{s} from right to left, returning
-the index of the last occurence of a character which
+the index of the last occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure,
+satisfies the predicate @var{char_pred}, if it is a procedure,
@item
is in the set if @var{char_pred} is a character set.
@deffn {Scheme Procedure} string-index-right s char_pred [start [end]]
@deffnx {C Function} scm_string_index_right (s, char_pred, start, end)
Search through the string @var{s} from right to left, returning
-the index of the last occurence of a character which
+the index of the last occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure,
+satisfies the predicate @var{char_pred}, if it is a procedure,
@item
is in the set if @var{char_pred} is a character set.
@deffn {Scheme Procedure} string-skip s char_pred [start [end]]
@deffnx {C Function} scm_string_skip (s, char_pred, start, end)
Search through the string @var{s} from left to right, returning
-the index of the first occurence of a character which
+the index of the first occurrence of a character which
@itemize @bullet
@item
does not equal @var{char_pred}, if it is character,
@item
-does not satisify the predicate @var{char_pred}, if it is a
+does not satisfy the predicate @var{char_pred}, if it is a
procedure,
@item
@deffn {Scheme Procedure} string-skip-right s char_pred [start [end]]
@deffnx {C Function} scm_string_skip_right (s, char_pred, start, end)
Search through the string @var{s} from right to left, returning
-the index of the last occurence of a character which
+the index of the last occurrence of a character which
@itemize @bullet
@item
equals @var{char_pred}, if it is character,
@item
-satisifies the predicate @var{char_pred}, if it is a procedure.
+satisfies the predicate @var{char_pred}, if it is a procedure.
@item
is in the set @var{char_pred}, if it is a character set.
These are procedures for mapping strings to their upper- or lower-case
equivalents, respectively, or for capitalizing strings.
+They use the basic case mapping rules for Unicode characters. No
+special language or context rules are considered. The resulting strings
+are guaranteed to be the same length as the input strings.
+
+@xref{Character Case Mapping, the @code{(ice-9
+i18n)} module}, for locale-dependent case conversions.
+
@deffn {Scheme Procedure} string-upcase str [start [end]]
@deffnx {C Function} scm_substring_upcase (str, start, end)
@deffnx {C Function} scm_string_upcase (str)
@end example
@end deffn
-@deffn {Scheme Procedure} string-append/shared . ls
-@deffnx {C Function} scm_string_append_shared (ls)
+@deffn {Scheme Procedure} string-append/shared . rest
+@deffnx {C Function} scm_string_append_shared (rest)
Like @code{string-append}, but the result may share memory
with the argument strings.
@end deffn
@deffn {Scheme Procedure} string-concatenate-reverse/shared ls [final_string [end]]
@deffnx {C Function} scm_string_concatenate_reverse_shared (ls, final_string, end)
Like @code{string-concatenate-reverse}, but the result may
-share memory with the the strings in the @var{ls} arguments.
+share memory with the strings in the @var{ls} arguments.
@end deffn
@node Mapping Folding and Unfolding
not an issue (most of the time), since in Scheme you never get to see
the bytes, only the characters.
-Well, ideally, anyway. Right now, Guile simply equates Scheme
-characters and bytes, ignoring the possibility of multi-byte encodings
-completely. This will change in the future, where Guile will use
-Unicode codepoints as its characters and UTF-8 or some other encoding
-as its internal encoding. When you exclusively use the functions
-listed in this section, you are `future-proof'.
+Converting to C and converting from C each have their own challenges.
+
+When converting from C to Scheme, it is important that the sequence of
+bytes in the C string be valid with respect to its encoding. ASCII
+strings, for example, can't have any bytes greater than 127. An ASCII
+byte greater than 127 is considered @emph{ill-formed} and cannot be
+converted into a Scheme character.
+
+Problems can occur in the reverse operation as well. Not all character
+encodings can hold all possible Scheme characters. Some encodings, like
+ASCII for example, can only describe a small subset of all possible
+characters. So, when converting to C, one must first decide what to do
+with Scheme characters that can't be represented in the C string.
Converting a Scheme string to a C string will often allocate fresh
memory to hold the result. You must take care that this memory is
@deftypefn {C Function} SCM scm_from_locale_string (const char *str)
@deftypefnx {C Function} SCM scm_from_locale_stringn (const char *str, size_t len)
-Creates a new Scheme string that has the same contents as @var{str}
-when interpreted in the current locale character encoding.
+Creates a new Scheme string that has the same contents as @var{str} when
+interpreted in the locale character encoding of the
+@code{current-input-port}.
For @code{scm_from_locale_string}, @var{str} must be null-terminated.
@var{str} in bytes, and @var{str} does not need to be null-terminated.
If @var{len} is @code{(size_t)-1}, then @var{str} does need to be
null-terminated and the real length will be found with @code{strlen}.
+
+If the C string is ill-formed, an error will be raised.
@end deftypefn
@deftypefn {C Function} SCM scm_take_locale_string (char *str)
@deftypefn {C Function} {char *} scm_to_locale_string (SCM str)
@deftypefnx {C Function} {char *} scm_to_locale_stringn (SCM str, size_t *lenp)
-Returns a C string in the current locale encoding with the same
-contents as @var{str}. The C string must be freed with @code{free}
-eventually, maybe by using @code{scm_dynwind_free}, @xref{Dynamic
-Wind}.
+Returns a C string with the same contents as @var{str} in the locale
+encoding of the @code{current-output-port}. The C string must be freed
+with @code{free} eventually, maybe by using @code{scm_dynwind_free},
+@xref{Dynamic Wind}.
For @code{scm_to_locale_string}, the returned string is
null-terminated and an error is signalled when @var{str} contains
returned string will not be null-terminated in this case. If
@var{lenp} is @code{NULL}, @code{scm_to_locale_stringn} behaves like
@code{scm_to_locale_string}.
+
+If a character in @var{str} cannot be represented in the locale encoding
+of the current output port, the port conversion strategy of the current
+output port will determine the result, @xref{Ports}. If output port's
+conversion strategy is @code{error}, an error will be raised. If it is
+@code{subsitute}, a replacement character, such as a question mark, will
+be inserted in its place. If it is @code{escape}, a hex escape will be
+inserted in its place.
@end deftypefn
@deftypefn {C Function} size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)
stored and you probably need to try again with a larger buffer.
@end deftypefn
+@node String Internals
+@subsubsection String Internals
+
+Guile stores each string in memory as a contiguous array of Unicode code
+points along with an associated set of attributes. If all of the code
+points of a string have an integer range between 0 and 255 inclusive,
+the code point array is stored as one byte per code point: it is stored
+as an ISO-8859-1 (aka Latin-1) string. If any of the code points of the
+string has an integer value greater that 255, the code point array is
+stored as four bytes per code point: it is stored as a UTF-32 string.
+
+Conversion between the one-byte-per-code-point and
+four-bytes-per-code-point representations happens automatically as
+necessary.
+
+No API is provided to set the internal representation of strings;
+however, there are pair of procedures available to query it. These are
+debugging procedures. Using them in production code is discouraged,
+since the details of Guile's internal representation of strings may
+change from release to release.
+
+@deffn {Scheme Procedure} string-bytes-per-char str
+@deffnx {C Function} scm_string_bytes_per_char (str)
+Return the number of bytes used to encode a Unicode code point in string
+@var{str}. The result is one or four.
+@end deffn
+
+@deffn {Scheme Procedure} %string-dump str
+@deffnx {C Function} scm_sys_string_dump (str)
+Returns an association list containing debugging information for
+@var{str}. The association list has the following entries.
+@table @code
+
+@item string
+The string itself.
+
+@item start
+The start index of the string into its stringbuf
+
+@item length
+The length of the string
+
+@item shared
+If this string is a substring, it returns its
+parent string. Otherwise, it returns @code{#f}
+
+@item read-only
+@code{#t} if the string is read-only
+
+@item stringbuf-chars
+A new string containing this string's stringbuf's characters
+
+@item stringbuf-length
+The number of characters in this stringbuf
+
+@item stringbuf-shared
+@code{#t} if this stringbuf is shared
+
+@item stringbuf-wide
+@code{#t} if this stringbuf's characters are stored in a 32-bit buffer,
+or @code{#f} if they are stored in an 8-bit buffer
+@end table
+@end deffn
+
+
@node Bytevectors
@subsection Bytevectors
@cindex bytevector
@cindex R6RS
-A @dfn{bytevector} is a raw bit string. The @code{(rnrs bytevector)}
+A @dfn{bytevector} is a raw bit string. The @code{(rnrs bytevectors)}
module provides the programming interface specified by the
@uref{http://www.r6rs.org/, Revised^6 Report on the Algorithmic Language
Scheme (R6RS)}. It contains procedures to manipulate bytevectors and
* Bytevectors as Floats:: Interpreting bytes as real numbers.
* Bytevectors as Strings:: Interpreting bytes as Unicode strings.
* Bytevectors as Generalized Vectors:: Guile extension to the bytevector API.
+* Bytevectors as Uniform Vectors:: Bytevectors and SRFI-4.
@end menu
@node Bytevector Endianness
@cindex Unicode string encoding
Bytevector contents can also be interpreted as Unicode strings encoded
-in one of the most commonly available encoding formats@footnote{Guile
-1.8 does @emph{not} support Unicode strings. Therefore, the procedures
-described here assume that Guile strings are internally encoded
-according to the current locale. For instance, if @code{$LC_CTYPE} is
-@code{fr_FR.ISO-8859-1}, then @code{string->utf-8} @i{et al.} will
-assume that Guile strings are Latin-1-encoded.}.
+in one of the most commonly available encoding formats.
@lisp
(utf8->string (u8-list->bytevector '(99 97 102 101)))
@end lisp
@deffn {Scheme Procedure} string->utf8 str
-@deffnx {Scheme Procedure} string->utf16 str
-@deffnx {Scheme Procedure} string->utf32 str
+@deffnx {Scheme Procedure} string->utf16 str [endianness]
+@deffnx {Scheme Procedure} string->utf32 str [endianness]
@deffnx {C Function} scm_string_to_utf8 (str)
-@deffnx {C Function} scm_string_to_utf16 (str)
-@deffnx {C Function} scm_string_to_utf32 (str)
+@deffnx {C Function} scm_string_to_utf16 (str, endianness)
+@deffnx {C Function} scm_string_to_utf32 (str, endianness)
Return a newly allocated bytevector that contains the UTF-8, UTF-16, or
-UTF-32 (aka. UCS-4) encoding of @var{str}.
+UTF-32 (aka. UCS-4) encoding of @var{str}. For UTF-16 and UTF-32,
+@var{endianness} should be the symbol @code{big} or @code{little}; when omitted,
+it defaults to big endian.
@end deffn
@deffn {Scheme Procedure} utf8->string utf
-@deffnx {Scheme Procedure} utf16->string utf
-@deffnx {Scheme Procedure} utf32->string utf
+@deffnx {Scheme Procedure} utf16->string utf [endianness]
+@deffnx {Scheme Procedure} utf32->string utf [endianness]
@deffnx {C Function} scm_utf8_to_string (utf)
-@deffnx {C Function} scm_utf16_to_string (utf)
-@deffnx {C Function} scm_utf32_to_string (utf)
+@deffnx {C Function} scm_utf16_to_string (utf, endianness)
+@deffnx {C Function} scm_utf32_to_string (utf, endianness)
Return a newly allocated string that contains from the UTF-8-, UTF-16-,
-or UTF-32-decoded contents of bytevector @var{utf}.
+or UTF-32-decoded contents of bytevector @var{utf}. For UTF-16 and UTF-32,
+@var{endianness} should be the symbol @code{big} or @code{little}; when omitted,
+it defaults to big endian.
@end deffn
@node Bytevectors as Generalized Vectors
@end example
+@node Bytevectors as Uniform Vectors
+@subsubsection Accessing Bytevectors with the SRFI-4 API
+
+Bytevectors may also be accessed with the SRFI-4 API. @xref{SRFI-4 and
+Bytevectors}, for more information.
+
+
@node Regular Expressions
@subsection Regular Expressions
@tpindex Regular expressions
@subsection ``Functionality-Centric'' Data Types
Procedures and macros are documented in their own chapter: see
-@ref{Procedures and Macros}.
+@ref{Procedures} and @ref{Macros}.
Variable objects are documented as part of the description of Guile's
module system: see @ref{Variables}.