X-Git-Url: https://git.hcoop.net/bpt/emacs.git/blobdiff_plain/394d33a8ff9bb8b0d64c3f97d4aa074ea38e4a9d..255d07dc33fc9b78acd47348372b1016ca27d210:/lispref/strings.texi diff --git a/lispref/strings.texi b/lispref/strings.texi index 970497871e..a29e84f8ed 100644 --- a/lispref/strings.texi +++ b/lispref/strings.texi @@ -1,7 +1,7 @@ @c -*-texinfo-*- @c This is part of the GNU Emacs Lisp Reference Manual. -@c Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995, 1998, 1999 -@c Free Software Foundation, Inc. +@c Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995, 1998, 1999, 2003 +@c Free Software Foundation, Inc. @c See the file elisp.texi for copying conditions. @setfilename ../info/strings @node Strings and Characters, Lists, Numbers, Top @@ -44,7 +44,7 @@ used. Thus, strings really contain integers. The length of a string (like any array) is fixed, and cannot be altered once the string exists. Strings in Lisp are @emph{not} terminated by a distinguished character code. (By contrast, strings in -C are terminated by a character with @sc{ascii} code 0.) +C are terminated by a character with @acronym{ASCII} code 0.) Since strings are arrays, and therefore sequences as well, you can operate on them with the general array and sequence functions. @@ -52,10 +52,10 @@ operate on them with the general array and sequence functions. change individual characters in a string using the functions @code{aref} and @code{aset} (@pxref{Array Functions}). - There are two text representations for non-@sc{ascii} characters in + There are two text representations for non-@acronym{ASCII} characters in Emacs strings (and in buffers): unibyte and multibyte (@pxref{Text -Representations}). An @sc{ascii} character always occupies one byte in a -string; in fact, when a string is all @sc{ascii}, there is no real +Representations}). 
An @acronym{ASCII} character always occupies one byte in a +string; in fact, when a string is all @acronym{ASCII}, there is no real difference between the unibyte and multibyte representations. For most Lisp programming, you don't need to be concerned with these two representations. @@ -66,8 +66,8 @@ characters (which are large integers) rather than character codes in the range 128 to 255. Strings cannot hold characters that have the hyper, super or alt -modifiers; they can hold @sc{ascii} control characters, but no other -control characters. They do not distinguish case in @sc{ascii} control +modifiers; they can hold @acronym{ASCII} control characters, but no other +control characters. They do not distinguish case in @acronym{ASCII} control characters. If you want to store such characters in a sequence, such as a key sequence, you must use a vector instead of a string. @xref{Character Type}, for more information about the representation of meta @@ -158,7 +158,7 @@ position up to which the substring is copied. The character whose index is 3 is actually the fourth character in the string. A negative number counts from the end of the string, so that @minus{}1 -signifies the index of the last character of the string. For example: +signifies the index of the last character of the string. For example: @example @group @@ -172,7 +172,7 @@ In this example, the index for @samp{e} is @minus{}3, the index for @samp{f} is @minus{}2, and the index for @samp{g} is @minus{}1. Therefore, @samp{e} and @samp{f} are included, and @samp{g} is excluded. -When @code{nil} is used as an index, it stands for the length of the +When @code{nil} is used for @var{end}, it stands for the length of the string. Thus, @example @@ -208,10 +208,11 @@ For example: @result{} [b (c)] @end example -A @code{wrong-type-argument} error is signaled if either @var{start} or -@var{end} is not an integer or @code{nil}. 
An @code{args-out-of-range} -error is signaled if @var{start} indicates a character following -@var{end}, or if either integer is out of range for @var{string}. +A @code{wrong-type-argument} error is signaled if @var{start} is not +an integer or if @var{end} is neither an integer nor @code{nil}. An +@code{args-out-of-range} error is signaled if @var{start} indicates a +character following @var{end}, or if either integer is out of range +for @var{string}. Contrast this function with @code{buffer-substring} (@pxref{Buffer Contents}), which returns a string containing a portion of the text in @@ -219,6 +220,14 @@ the current buffer. The beginning of a string is at index 0, but the beginning of a buffer is at index 1. @end defun +@defun substring-no-properties string &optional start end +This works like @code{substring} but discards all text properties from +the value. Also, @var{start} may be omitted or @code{nil}, which is +equivalent to 0. Thus, @w{@code{(substring-no-properties +@var{string})}} returns a copy of @var{string}, with all text +properties removed. +@end defun + @defun concat &rest sequences @cindex copying strings @cindex concatenating strings @@ -255,46 +264,98 @@ printed form is with @code{format} (@pxref{Formatting Strings}) or For information about other concatenation functions, see the description of @code{mapconcat} in @ref{Mapping Functions}, -@code{vconcat} in @ref{Vectors}, and @code{append} in @ref{Building +@code{vconcat} in @ref{Vector Functions}, and @code{append} in @ref{Building Lists}. @end defun -@defun split-string string separators -This function splits @var{string} into substrings at matches for the regular -expression @var{separators}. Each match for @var{separators} defines a -splitting point; the substrings between the splitting points are made -into a list, which is the value returned by @code{split-string}. 
+@defun split-string string &optional separators omit-nulls +This function splits @var{string} into substrings at matches for the +regular expression @var{separators}. Each match for @var{separators} +defines a splitting point; the substrings between the splitting points +are made into a list, which is the value returned by +@code{split-string}. + +If @var{omit-nulls} is @code{nil}, the result contains null strings +whenever there are two consecutive matches for @var{separators}, or a +match is adjacent to the beginning or end of @var{string}. If +@var{omit-nulls} is @code{t}, these null strings are omitted from the +result list. + If @var{separators} is @code{nil} (or omitted), -the default is @code{"[ \f\t\n\r\v]+"}. +the default is the value of @code{split-string-default-separators}. + +As a special case, when @var{separators} is @code{nil} (or omitted), +null strings are always omitted from the result. Thus: -For example, +@example +(split-string " two words ") + @result{} ("two" "words") +@end example + +The result is not @samp{("" "two" "words" "")}, which would rarely be +useful. 
If you need such a result, use an explicit value for +@var{separators}: + +@example +(split-string " two words " split-string-default-separators) + @result{} ("" "two" "words" "") +@end example + +More examples: @example (split-string "Soup is good food" "o") -@result{} ("S" "up is g" "" "d f" "" "d") + @result{} ("S" "up is g" "" "d f" "" "d") +(split-string "Soup is good food" "o" t) + @result{} ("S" "up is g" "d f" "d") (split-string "Soup is good food" "o+") -@result{} ("S" "up is g" "d f" "d") + @result{} ("S" "up is g" "d f" "d") @end example -When there is a match adjacent to the beginning or end of the string, -this does not cause a null string to appear at the beginning or end -of the list: +Empty matches do count, except that @code{split-string} will not look +for a final empty match when it already reached the end of the string +using a non-empty match or when @var{string} is empty: @example -(split-string "out to moo" "o+") -@result{} ("ut t" " m") +(split-string "aooob" "o*") + @result{} ("" "a" "" "b" "") +(split-string "ooaboo" "o*") + @result{} ("" "" "a" "b" "") +(split-string "" "") + @result{} ("") @end example -Empty matches do count, when not adjacent to another match: +However, when @var{separators} can match the empty string, +@var{omit-nulls} is usually @code{t}, so that the subtleties in the +three previous examples are rarely relevant: @example -(split-string "Soup is good food" "o*") -@result{}("S" "u" "p" " " "i" "s" " " "g" "d" " " "f" "d") -(split-string "Nice doggy!" "") -@result{}("N" "i" "c" "e" " " "d" "o" "g" "g" "y" "!") +(split-string "Soup is good food" "o*" t) + @result{} ("S" "u" "p" " " "i" "s" " " "g" "d" " " "f" "d") +(split-string "Nice doggy!" "" t) + @result{} ("N" "i" "c" "e" " " "d" "o" "g" "g" "y" "!") +(split-string "" "" t) + @result{} nil +@end example + +Somewhat odd, but predictable, behavior can occur for certain +``non-greedy'' values of @var{separators} that can prefer empty +matches over non-empty matches. 
Again, such values rarely occur in +practice: + +@example +(split-string "ooo" "o*" t) + @result{} nil +(split-string "ooo" "\\|o+" t) + @result{} ("o" "o" "o") @end example @end defun +@defvar split-string-default-separators +The default value of @var{separators} for @code{split-string}, initially +@w{@samp{"[ \f\t\n\r\v]+"}}. +@end defvar + @node Modifying Strings @section Modifying Strings @@ -316,6 +377,14 @@ Since it is impossible to change the length of an existing string, it is an error if @var{obj} doesn't fit within @var{string}'s actual length, or if any new character requires a different number of bytes from the character currently present at that point in @var{string}. +@end defun + + To clear out a string that contained a password, use +@code{clear-string}: + +@defun clear-string string +This clears the contents of @var{string} to zeros +and may change its length. @end defun @need 2000 @@ -339,7 +408,8 @@ in case if @code{case-fold-search} is non-@code{nil}. @defun string= string1 string2 This function returns @code{t} if the characters of the two strings -match exactly. +match exactly. Symbols are also allowed as arguments, in which case +their print names are used. Case is always significant, regardless of @code{case-fold-search}. @example @@ -355,8 +425,20 @@ The function @code{string=} ignores the text properties of the two strings. When @code{equal} (@pxref{Equality Predicates}) compares two strings, it uses @code{string=}. -If the strings contain non-@sc{ascii} characters, and one is unibyte -while the other is multibyte, then they cannot be equal. @xref{Text +For technical reasons, a unibyte and a multibyte string are +@code{equal} if and only if they contain the same sequence of +character codes and all these codes are either in the range 0 through +127 (@acronym{ASCII}) or 160 through 255 (@code{eight-bit-graphic}). 
+However, when a unibyte string gets converted to a multibyte string, +all characters with codes in the range 160 through 255 get converted +to characters with higher codes, whereas @acronym{ASCII} characters +remain unchanged. Thus, a unibyte string and its conversion to +multibyte are only @code{equal} if the string is all @acronym{ASCII}. +Character codes 160 through 255 are not entirely proper in multibyte +text, even though they can occur. As a consequence, the situation +where a unibyte and a multibyte string are @code{equal} without both +being all @acronym{ASCII} is a technical oddity that very few Emacs +Lisp programmers ever get confronted with. @xref{Text Representations}. @end defun @@ -377,11 +459,11 @@ function returns @code{t}. If the lesser character is the one from Pairs of characters are compared according to their character codes. Keep in mind that lower case letters have higher numeric values in the -@sc{ascii} character set than their upper case counterparts; digits and +@acronym{ASCII} character set than their upper case counterparts; digits and many punctuation characters have a lower numeric value than upper case -letters. An @sc{ascii} character is less than any non-@sc{ascii} -character; a unibyte non-@sc{ascii} character is always less than any -multibyte non-@sc{ascii} character (@pxref{Text Representations}). +letters. An @acronym{ASCII} character is less than any non-@acronym{ASCII} +character; a unibyte non-@acronym{ASCII} character is always less than any +multibyte non-@acronym{ASCII} character (@pxref{Text Representations}). @example @group @@ -410,9 +492,12 @@ no characters is less than any other string. (string< "abc" "ab") @result{} nil (string< "" "") - @result{} nil + @result{} nil @end group @end example + +Symbols are also allowed as arguments, in which case their print names +are used. 
@end defun @defun string-lessp string1 string2 @@ -428,9 +513,10 @@ index @var{start2} up to index @var{end2} (@code{nil} means the end of the string). The strings are both converted to multibyte for the comparison -(@pxref{Text Representations}) so that a unibyte string can be equal to -a multibyte string. If @var{ignore-case} is non-@code{nil}, then case -is ignored, so that upper case letters can be equal to lower case letters. +(@pxref{Text Representations}) so that a unibyte string and its +conversion to multibyte are always regarded as equal. If +@var{ignore-case} is non-@code{nil}, then case is ignored, so that +upper case letters can be equal to lower case letters. If the specified portions of the two strings match, the value is @code{t}. Otherwise, the value is an integer which indicates how many @@ -440,16 +526,14 @@ two strings. The sign is negative if @var{string1} (or its specified portion) is less. @end defun -@defun assoc-ignore-case key alist +@defun assoc-string key alist &optional case-fold This function works like @code{assoc}, except that @var{key} must be a -string, and comparison is done using @code{compare-strings}. -Case differences are ignored in this comparison. -@end defun - -@defun assoc-ignore-representation key alist -This function works like @code{assoc}, except that @var{key} must be a -string, and comparison is done using @code{compare-strings}. -Case differences are significant. +string, and comparison is done using @code{compare-strings}. If +@var{case-fold} is non-@code{nil}, it ignores case differences. +Unlike @code{assoc}, this function can also match elements of the alist +that are strings rather than conses. In particular, @var{alist} can +be a list of strings rather than an actual alist. +@xref{Association Lists}. 
@end defun See also @code{compare-buffer-substrings} in @ref{Comparing Text}, for @@ -486,7 +570,7 @@ This function returns a new string containing one character, @cindex string to character This function returns the first character in @var{string}. If the string is empty, the function returns 0. The value is also 0 when the -first character of @var{string} is the null character, @sc{ascii} code +first character of @var{string} is the null character, @acronym{ASCII} code 0. @example @@ -517,8 +601,10 @@ negative. @example (number-to-string 256) @result{} "256" +@group (number-to-string -23) @result{} "-23" +@end group (number-to-string -23.5) @result{} "-23.5" @end example @@ -532,18 +618,22 @@ See also the function @code{format} in @ref{Formatting Strings}. @defun string-to-number string &optional base @cindex string to number This function returns the numeric value of the characters in -@var{string}. If @var{base} is non-@code{nil}, integers are converted -in that base. If @var{base} is @code{nil}, then base ten is used. -Floating point conversion always uses base ten; we have not implemented -other radices for floating point numbers, because that would be much -more work and does not seem useful. - -The parsing skips spaces and tabs at the beginning of @var{string}, then -reads as much of @var{string} as it can interpret as a number. (On some -systems it ignores other whitespace at the beginning, not just spaces -and tabs.) If the first character after the ignored whitespace is -neither a digit, nor a plus or minus sign, nor the leading dot of a -floating point number, this function returns 0. +@var{string}. If @var{base} is non-@code{nil}, it must be an integer +between 2 and 16 (inclusive), and integers are converted in that base. +If @var{base} is @code{nil}, then base ten is used. 
Floating point +conversion only works in base ten; we have not implemented other +radices for floating point numbers, because that would be much more +work and does not seem useful. If @var{string} looks like an integer +but its value is too large to fit into a Lisp integer, +@code{string-to-number} returns a floating point result. + +The parsing skips spaces and tabs at the beginning of @var{string}, +then reads as much of @var{string} as it can interpret as a number in +the given base. (On some systems it ignores other whitespace at the +beginning, not just spaces and tabs.) If the first character after +the ignored whitespace is neither a digit in the given base, nor a +plus or minus sign, nor the leading dot of a floating point number, +this function returns 0. @example (string-to-number "256") @@ -554,6 +644,8 @@ floating point number, this function returns 0. @result{} 0 (string-to-number "-4.5") @result{} -4.5 +(string-to-number "1e5") + @result{} 100000.0 @end example @findex string-to-int @@ -593,7 +685,7 @@ in how they use the result of formatting. @defun format string &rest objects This function returns a new string that is made by copying -@var{string} and then replacing any format specification +@var{string} and then replacing any format specification in the copy with encodings of the corresponding @var{objects}. The arguments @var{objects} are the computed values to be formatted. @@ -643,16 +735,12 @@ Starting in Emacs 21, if the object is a string, its text properties are copied into the output. The text properties of the @samp{%s} itself are also copied, but those of the object take priority. -If there is no corresponding object, the empty string is used. - @item %S Replace the specification with the printed representation of the object, made with quoting (that is, using @code{prin1}---@pxref{Output Functions}). Thus, strings are enclosed in @samp{"} characters, and @samp{\} characters appear where necessary before special characters. 
-If there is no corresponding object, the empty string is used. - @item %o @cindex integer to octal Replace the specification with the base-eight representation of an @@ -703,24 +791,27 @@ operation} error. (format "The buffer object prints as %s." (current-buffer)) @result{} "The buffer object prints as strings.texi." -(format "The octal value of %d is %o, +(format "The octal value of %d is %o, and the hex value is %x." 18 18 18) - @result{} "The octal value of 18 is 22, + @result{} "The octal value of 18 is 22, and the hex value is 12." @end group @end example -@cindex numeric prefix @cindex field width @cindex padding - All the specification characters allow an optional numeric prefix -between the @samp{%} and the character. The optional numeric prefix -defines the minimum width for the object. If the printed representation -of the object contains fewer characters than this, then it is padded. -The padding is on the left if the prefix is positive (or starts with -zero) and on the right if the prefix is negative. The padding character -is normally a space, but if the numeric prefix starts with a zero, zeros -are used for padding. Here are some examples of padding: + All the specification characters allow an optional ``width'', which +is a digit-string between the @samp{%} and the character. If the +printed representation of the object contains fewer characters than +this width, then it is padded. The padding is on the left if the +width is positive (or starts with zero) and on the right if the +width is negative. The padding character is normally a space, but if +the width starts with a zero, zeros are used for padding. Some of +these conventions are ignored for specification characters for which +they do not make sense. That is, @samp{%s}, @samp{%S} and @samp{%c} +accept a width starting with 0, but still pad with @emph{spaces} on +the left. Also, @samp{%%} accepts a width, but ignores it. 
Here are +some examples of padding: @example (format "%06d is padded on the left with zeros" 123) @@ -730,10 +821,9 @@ are used for padding. Here are some examples of padding: @result{} "123 is padded on the right" @end example - @code{format} never truncates an object's printed representation, no -matter what width you specify. Thus, you can use a numeric prefix to -specify a minimum spacing between columns with no risk of losing -information. +If the width is too small, @code{format} does not truncate the +object's printed representation. Thus, you can use a width to specify +a minimum spacing between columns with no risk of losing information. In the following three examples, @samp{%7s} specifies a minimum width of 7. In the first case, the string inserted in place of @samp{%7s} has @@ -741,38 +831,64 @@ only 3 letters, so 4 blank spaces are inserted for padding. In the second case, the string @code{"specification"} is 13 letters wide but is not truncated. In the third case, the padding is on the right. -@smallexample +@smallexample @group (format "The word `%7s' actually has %d letters in it." "foo" (length "foo")) - @result{} "The word ` foo' actually has 3 letters in it." + @result{} "The word ` foo' actually has 3 letters in it." @end group @group (format "The word `%7s' actually has %d letters in it." - "specification" (length "specification")) - @result{} "The word `specification' actually has 13 letters in it." + "specification" (length "specification")) + @result{} "The word `specification' actually has 13 letters in it." @end group @group (format "The word `%-7s' actually has %d letters in it." "foo" (length "foo")) - @result{} "The word `foo ' actually has 3 letters in it." + @result{} "The word `foo ' actually has 3 letters in it." @end group @end smallexample +@cindex precision in format specifications + All the specification characters allow an optional ``precision'' +before the character (after the width, if present). 
The precision is +a decimal-point @samp{.} followed by a digit-string. For the +floating-point specifications (@samp{%e}, @samp{%f}, @samp{%g}), the +precision specifies how many decimal places to show; if zero, the +decimal-point itself is also omitted. For @samp{%s} and @samp{%S}, +the precision truncates the string to the given width, so +@samp{%.3s} shows only the first three characters of the +representation for @var{object}. Precision is ignored for other +specification characters. + +@cindex flags in format specifications +Immediately after the @samp{%} and before the optional width and +precision, you can put certain ``flag'' characters. + +A space character inserts a space for positive numbers (otherwise +nothing is inserted for positive numbers). This flag is ignored +except for @samp{%d}, @samp{%e}, @samp{%f}, @samp{%g}. + +The flag @samp{#} indicates ``alternate form''. For @samp{%o} it +ensures that the result begins with a 0. For @samp{%x} and @samp{%X} +the result is prefixed with @samp{0x} or @samp{0X}. For @samp{%e}, +@samp{%f}, and @samp{%g} a decimal point is always shown even if the +precision is zero. + @node Case Conversion -@comment node-name, next, previous, up +@comment node-name, next, previous, up @section Case Conversion in Lisp -@cindex upper case -@cindex lower case -@cindex character case +@cindex upper case +@cindex lower case +@cindex character case @cindex case conversion in Lisp The character case functions change the case of single characters or of the contents of strings. The functions normally convert only alphabetic characters (the letters @samp{A} through @samp{Z} and -@samp{a} through @samp{z}, as well as non-@sc{ascii} letters); other +@samp{a} through @samp{z}, as well as non-@acronym{ASCII} letters); other characters are not altered. You can specify a different case conversion mapping by specifying a case table (@pxref{Case Tables}). @@ -780,7 +896,7 @@ conversion mapping by specifying a case table (@pxref{Case Tables}). 
arguments. The examples below use the characters @samp{X} and @samp{x} which have -@sc{ascii} codes 88 and 120 respectively. +@acronym{ASCII} codes 88 and 120 respectively. @defun downcase string-or-char This function converts a character or a string to lower case. @@ -840,11 +956,15 @@ When the argument to @code{capitalize} is a character, @code{capitalize} has the same result as @code{upcase}. @example +@group (capitalize "The cat in the hat") @result{} "The Cat In The Hat" +@end group +@group (capitalize "THE 77TH-HATTED CAT") @result{} "The 77th-Hatted Cat" +@end group @group (capitalize ?x) @@ -853,16 +973,20 @@ has the same result as @code{upcase}. @end example @end defun -@defun upcase-initials string -This function capitalizes the initials of the words in @var{string}, -without altering any letters other than the initials. It returns a new -string whose contents are a copy of @var{string}, in which each word has +@defun upcase-initials string-or-char +If @var{string-or-char} is a string, this function capitalizes the +initials of the words in @var{string-or-char}, without altering any +letters other than the initials. It returns a new string whose +contents are a copy of @var{string-or-char}, in which each word has had its initial letter converted to upper case. The definition of a word is any sequence of consecutive characters that are assigned to the word constituent syntax class in the current syntax table (@pxref{Syntax Class Table}). +When the argument to @code{upcase-initials} is a character, +@code{upcase-initials} has the same result as @code{upcase}. + @example @group (upcase-initials "The CAT in the hAt") @@ -917,9 +1041,9 @@ and @samp{A} are related by case-conversion, they should have the same canonical equivalent character (which should be either @samp{a} for both of them, or @samp{A} for both of them). 
- The extra table @var{equivalences} is a map that cyclicly permutes + The extra table @var{equivalences} is a map that cyclically permutes each equivalence class (of characters with the same canonical -equivalent). (For ordinary @sc{ascii}, this would map @samp{a} into +equivalent). (For ordinary @acronym{ASCII}, this would map @samp{a} into @samp{A} and @samp{A} into @samp{a}, and likewise for each set of equivalent characters.) @@ -956,7 +1080,7 @@ This sets the current buffer's case table to @var{table}. @end defun The following three functions are convenient subroutines for packages -that define non-@sc{ascii} character sets. They modify the specified +that define non-@acronym{ASCII} character sets. They modify the specified case table @var{case-table}; they also modify the standard syntax table. @xref{Syntax Tables}. Normally you would use these functions to change the standard case table. @@ -980,3 +1104,7 @@ This function makes @var{char} case-invariant, with syntax This command displays a description of the contents of the current buffer's case table. @end deffn + +@ignore + arch-tag: 700b8e95-7aa5-4b52-9eb3-8f2e1ea152b4 +@end ignore
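
Note on the behaviors documented by this patch: the forms below are a minimal sketch, not part of the patch itself, exercising the revised descriptions of split-string, substring-no-properties, assoc-string, string-to-number, and format. The expected values are either copied from the examples in the patch or follow directly from its wording. `propertize` is not mentioned in the diff; it is used here only as an assumed convenient way to build a string that carries a text property.

```elisp
;; Illustrative sketch only; evaluate each form with C-x C-e or in M-x ielm.

;; split-string: with SEPARATORS omitted, null strings are always dropped.
(split-string "  two words ")
;; => ("two" "words")

;; An explicit SEPARATORS keeps null strings unless OMIT-NULLS is non-nil.
(split-string "  two words " split-string-default-separators)
;; => ("" "two" "words" "")
(split-string "Soup is good food" "o" t)
;; => ("S" "up is g" "d f" "d")

;; substring-no-properties: START may be omitted, giving a full copy.
;; (`propertize' is an assumption of this sketch, not part of the patch.)
(substring-no-properties (propertize "abc" 'face 'bold))
;; => "abc"   (a copy with no text properties)

;; assoc-string: optional CASE-FOLD, and bare strings as list elements.
(assoc-string "FOO" '(("foo" . 1) "bar") t)
;; => ("foo" . 1)
(assoc-string "bar" '(("foo" . 1) "bar"))
;; => "bar"

;; string-to-number: BASE (2 through 16) applies to integers only.
(string-to-number "ff" 16)
;; => 255

;; format: zero-padded width, negative width, precision, and the `#' flag.
(format "%06d|%-6d|%.3s|%#x" 123 123 "specification" 18)
;; => "000123|123   |spe|0x12"
```

Each form is independent, so any one of them can be evaluated alone to check a single paragraph of the revised text.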