HCoop Git - bpt/guile.git/blame_incremental

	1	@c --texinfo--
	2	@c This is part of the GNU Guile Reference Manual.
	3	@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010
	4	@c Free Software Foundation, Inc.
	5	@c See the file guile.texi for copying conditions.
	6
	7	@node Regular Expressions
	8	@section Regular Expressions
	9	@tpindex Regular expressions
	10
	11	@cindex regular expressions
	12	@cindex regex
	13	@cindex emacs regexp
	14
	15	A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
	16	describes a whole class of strings. A full description of regular
	17	expressions and their syntax is beyond the scope of this manual;
	18	an introduction can be found in the Emacs manual (@pxref{Regexps,
	19	, Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
	20	in many general Unix reference books.
	21
	22	If your system does not include a POSIX regular expression library,
	23	and you have not linked Guile with a third-party regexp library such
	24	as Rx, these functions will not be available. You can tell whether
	25	your Guile installation includes regular expression support by
	26	checking whether @code{(provided? 'regex)} returns true.
	27
	28	The following regexp and string matching features are provided by the
	29	@code{(ice-9 regex)} module. Before using the described functions,
	30	you should load this module by executing @code{(use-modules (ice-9
	31	regex))}.
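
As a minimal sketch of the two steps above, the following checks for
regexp support before loading the module. The wording of the error
message is only illustrative.

@lisp
;; Signal an error early if this Guile was built without regexp support.
(if (not (provided? 'regex))
    (error "this Guile has no regular expression support"))

(use-modules (ice-9 regex))

(match:substring (string-match "b+" "abbbc"))
@result{} "bbb"
@end lisp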
	32
	33	@menu
	34	* Regexp Functions:: Functions that create and match regexps.
	35	* Match Structures:: Finding what was matched by a regexp.
	36	* Backslash Escapes:: Removing the special meaning of regexp
	37	meta-characters.
	38	@end menu
	39
	40
	41	@node Regexp Functions
	42	@subsection Regexp Functions
	43
	44	By default, Guile supports POSIX extended regular expressions.
	45	That means that the characters @samp{(}, @samp{)}, @samp{+} and
	46	@samp{?} are special, and must be escaped if you wish to match the
	47	literal characters.
	48
	49	This regular expression interface was modeled after that
	50	implemented by SCSH, the Scheme Shell. It is intended to be
	51	upwardly compatible with SCSH regular expressions.
	52
	53	Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
	54	strings, since the underlying C functions treat that as the end of
	55	string. If either contains a zero byte, an error is thrown.
	56
	57	Patterns and input strings are treated as being in the locale
	58	character set if @code{setlocale} has been called (@pxref{Locales}),
	59	and in a multibyte locale this includes treating multi-byte sequences
	60	as a single character. (Guile strings are currently merely bytes,
	61	though this may change in the future; @pxref{Conversion to/from C}.)
	62
	63	@deffn {Scheme Procedure} string-match pattern str [start]
	64	Compile the string @var{pattern} into a regular expression and compare
	65	it with @var{str}. The optional numeric argument @var{start} specifies
	66	the position of @var{str} at which to begin matching.
	67
	68	@code{string-match} returns a @dfn{match structure} which
	69	describes what, if anything, was matched by the regular
	70	expression. @xref{Match Structures}. If @var{str} does not match
	71	@var{pattern} at all, @code{string-match} returns @code{#f}.
	72	@end deffn
	73
	74	Two examples follow. In the first, the pattern matches the four digits
	75	in the target string. In the second, the pattern matches nothing, so
	76	@code{string-match} returns @code{#f}.
	77
	78	@example
	79	(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
	80	@result{} #("blah2002" (4 . 8))
	81
	82	(string-match "[A-Za-z]" "123456")
	83	@result{} #f
	84	@end example
	85
	86	Each time @code{string-match} is called, it must compile its
	87	@var{pattern} argument into a regular expression structure. This
	88	operation is expensive, which makes @code{string-match} inefficient if
	89	the same regular expression is used several times (for example, in a
	90	loop). For better performance, you can compile a regular expression in
	91	advance and then match strings against the compiled regexp.
	92
	93	@deffn {Scheme Procedure} make-regexp pat flag@dots{}
	94	@deffnx {C Function} scm_make_regexp (pat, flaglst)
	95	Compile the regular expression described by @var{pat}, and
	96	return the compiled regexp structure. If @var{pat} does not
	97	describe a valid regular expression, @code{make-regexp} throws
	98	a @code{regular-expression-syntax} error.
	99
	100	The @var{flag} arguments change the behavior of the compiled
	101	regular expression. The following values may be supplied:
	102
	103	@defvar regexp/icase
	104	Consider uppercase and lowercase letters to be the same when
	105	matching.
	106	@end defvar
	107
	108	@defvar regexp/newline
	109	If a newline appears in the target string, then permit the
	110	@samp{^} and @samp{$} operators to match immediately after or
	111	immediately before the newline, respectively. Also, the
	112	@samp{.} and @samp{[^...]} operators will never match a newline
	113	character. The intent of this flag is to treat the target
	114	string as a buffer containing many lines of text, and the
	115	regular expression as a pattern that may match a single one of
	116	those lines.
	117	@end defvar
	118
	119	@defvar regexp/basic
	120	Compile a basic (``obsolete'') regexp instead of the extended
	121	(``modern'') regexps that are the default. Basic regexps do
	122	not consider @samp{|}, @samp{+} or @samp{?} to be special
	123	characters, and require the @samp{@{...@}} and @samp{(...)}
	124	metacharacters to be backslash-escaped (@pxref{Backslash
	125	Escapes}). There are several other differences between basic
	126	and extended regular expressions, but these are the most
	127	significant.
	128	@end defvar
	129
	130	@defvar regexp/extended
	131	Compile an extended regular expression rather than a basic
	132	regexp. This is the default behavior; this flag will not
	133	usually be needed. If a call to @code{make-regexp} includes
	134	both @code{regexp/basic} and @code{regexp/extended} flags, the
	135	one which comes last will override the earlier one.
	136	@end defvar
	137	@end deffn
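
The effect of these flags can be seen in a short sketch. It relies on
@code{regexp-exec} and @code{match:start}, which are described later in
this section; the variable @code{text} is only illustrative.

@lisp
(use-modules (ice-9 regex))

(define text "first line\nsecond line")

;; Without regexp/newline, ^ matches only at the start of the string.
(regexp-exec (make-regexp "^second") text)
@result{} #f

;; With regexp/newline, ^ also matches just after the embedded newline.
(match:start (regexp-exec (make-regexp "^second" regexp/newline) text))
@result{} 11

;; Several flags may be given as separate arguments.
(match:substring
 (regexp-exec (make-regexp "^SECOND" regexp/newline regexp/icase) text))
@result{} "second"
@end lisp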
	138
	139	@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
	140	@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
	141	Match the compiled regular expression @var{rx} against
	142	@var{str}. If the optional integer @var{start} argument is
	143	provided, begin matching from that position in the string.
	144	Return a match structure describing the results of the match,
	145	or @code{#f} if no match could be found.
	146
	147	The @var{flags} argument changes the matching behavior. The following
	148	flag values may be supplied; use @code{logior} (@pxref{Bitwise
	149	Operations}) to combine them:
	150
	151	@defvar regexp/notbol
	152	Consider that the @var{start} offset into @var{str} is not the
	153	beginning of a line and should not match operator @samp{^}.
	154
	155	If @var{rx} was created with the @code{regexp/newline} option above,
	156	@samp{^} will still match after a newline in @var{str}.
	157	@end defvar
	158
	159	@defvar regexp/noteol
	160	Consider that the end of @var{str} is not the end of a line and should
	161	not match operator @samp{$}.
	162
	163	If @var{rx} was created with the @code{regexp/newline} option above,
	164	@samp{$} will still match before a newline in @var{str}.
	165	@end defvar
	166	@end deffn
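
A brief sketch of these matching flags follows, assuming the behavior
described above. The pattern and the target strings are arbitrary.

@lisp
(use-modules (ice-9 regex))

(define rx (make-regexp "^foo"))

;; By default the start offset counts as the beginning of a line.
(match:start (regexp-exec rx "foofoo" 3))
@result{} 3

;; With regexp/notbol, ^ no longer matches at the start offset.
(regexp-exec rx "foofoo" 3 regexp/notbol)
@result{} #f

;; Flags are combined with logior.
(regexp-exec (make-regexp "^foo$") "foo" 0
             (logior regexp/notbol regexp/noteol))
@result{} #f
@end lisp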
	167
	168	@lisp
	169	;; Regexp to match uppercase letters
	170	(define r (make-regexp "[A-Z]*"))
	171
	172	;; Regexp to match letters, ignoring case
	173	(define ri (make-regexp "[A-Z]*" regexp/icase))
	174
	175	;; Search "bob" using the case-sensitive regexp r
	176	(match:substring (regexp-exec r "bob"))
	177	@result{} ""                  ; only the empty string matches
	178
	179	;; Search "Bob" using the case-insensitive regexp ri
	180	(match:substring (regexp-exec ri "Bob"))
	181	@result{} "Bob"               ; matched case insensitive
	182	@end lisp
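
The advice to compile a regexp in advance can be sketched as follows.
The names @code{digits-rx} and @code{all-digit-strings} are hypothetical
and are not part of Guile.

@lisp
(use-modules (ice-9 regex))

;; Compile once, outside the loop.
(define digits-rx (make-regexp "^[0-9]+$"))

;; Keep only the strings that consist entirely of digits.
(define (all-digit-strings strings)
  (let loop ((rest strings) (acc '()))
    (cond ((null? rest) (reverse acc))
          ((regexp-exec digits-rx (car rest))
           (loop (cdr rest) (cons (car rest) acc)))
          (else (loop (cdr rest) acc)))))

(all-digit-strings '("123" "abc" "42" "4x2"))
@result{} ("123" "42")
@end lisp

Each call to @code{regexp-exec} reuses the compiled structure, so the
pattern is compiled once regardless of the length of the list.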
	183
	184	@deffn {Scheme Procedure} regexp? obj
	185	@deffnx {C Function} scm_regexp_p (obj)
	186	Return @code{#t} if @var{obj} is a compiled regular expression,
	187	or @code{#f} otherwise.
	188	@end deffn
	189
	190	@sp 1
	191	@deffn {Scheme Procedure} list-matches regexp str [flags]
	192	Return a list of match structures which are the non-overlapping
	193	matches of @var{regexp} in @var{str}. @var{regexp} can be either a
	194	pattern string or a compiled regexp. The @var{flags} argument is as
	195	per @code{regexp-exec} above.
	196
	197	@example
	198	(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
	199	@result{} ("abc" "def")
	200	@end example
	201	@end deffn
	202
	203	@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
	204	Apply @var{proc} to the non-overlapping matches of @var{regexp} in
	205	@var{str}, to build a result. @var{regexp} can be either a pattern
	206	string or a compiled regexp. The @var{flags} argument is as per
	207	@code{regexp-exec} above.
	208
	209	@var{proc} is called as @code{(@var{proc} match prev)} where
	210	@var{match} is a match structure and @var{prev} is the previous return
	211	from @var{proc}. For the first call @var{prev} is the given
	212	@var{init} parameter. @code{fold-matches} returns the final value
	213	from @var{proc}.
	214
	215	For example to count matches,
	216
	217	@example
	218	(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
	219	              (lambda (match count)
	220	                (1+ count)))
	221	@result{} 2
	222	@end example
	223	@end deffn
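
As a further sketch, @var{proc} can inspect each match structure rather
than merely count it. The target string here is made up for
illustration.

@lisp
(use-modules (ice-9 regex))

;; Sum every run of digits found in the string.
(fold-matches "[0-9]+" "3 apples, 14 pears, 2 figs" 0
              (lambda (m total)
                (+ total (string->number (match:substring m)))))
@result{} 19

;; Consing onto the accumulator yields the matches in reverse order.
(fold-matches "[0-9]+" "3 apples, 14 pears, 2 figs" '()
              (lambda (m acc)
                (cons (match:substring m) acc)))
@result{} ("2" "14" "3")
@end lisp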
	224
	225	@sp 1
	226	Regular expressions are commonly used to find patterns in one string
	227	and replace them with the contents of another string. The following
	228	functions are convenient ways to do this.
	229
	230	@c begin (scm-doc-string "regex.scm" "regexp-substitute")
	231	@deffn {Scheme Procedure} regexp-substitute port match [item@dots{}]
	232	Write to @var{port} selected parts of the match structure @var{match}.
	233	Or if @var{port} is @code{#f} then form a string from those parts and
	234	return that.
	235
	236	Each @var{item} specifies a part to be written, and may be one of the
	237	following,
	238
	239	@itemize @bullet
	240	@item
	241	A string. String arguments are written out verbatim.
	242
	243	@item
	244	An integer. The submatch with that number is written
	245	(@code{match:substring}). Zero is the entire match.
	246
	247	@item
	248	The symbol @samp{pre}. The portion of the matched string preceding
	249	the regexp match is written (@code{match:prefix}).
	250
	251	@item
	252	The symbol @samp{post}. The portion of the matched string following
	253	the regexp match is written (@code{match:suffix}).
	254	@end itemize
	255
	256	For example, changing a match and retaining the text before and after,
	257
	258	@example
	259	(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
	260	                   'pre "37" 'post)
	261	@result{} "number 37 is good"
	262	@end example
	263
	264	Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
	265	re-ordering and hyphenating the fields.
	266
	267	@lisp
	268	(define date-regex
	269	  "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
	270	(define s "Date 20020429 12am.")
	271	(regexp-substitute #f (string-match date-regex s)
	272	                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
	273	@result{} "Date 04-29-2002 12am. (20020429)"
	274	@end lisp
	275	@end deffn
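
The examples above pass @code{#f} as @var{port}. A minimal sketch of
writing to a real port instead, here a string port obtained from
@code{call-with-output-string}, might look like this. The target string
is invented for illustration.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([a-z]+)=([0-9]+)" "set width=80 now"))

;; Write the two submatches, in swapped order, to a string port.
(call-with-output-string
  (lambda (port)
    (regexp-substitute port m 2 " is the " 1)))
@result{} "80 is the width"
@end lisp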
	276
	277
	278	@c begin (scm-doc-string "regex.scm" "regexp-substitute/global")
	279	@deffn {Scheme Procedure} regexp-substitute/global port regexp target [item@dots{}]
	280	@cindex search and replace
	281	Write to @var{port} selected parts of matches of @var{regexp} in
	282	@var{target}. If @var{port} is @code{#f} then form a string from
	283	those parts and return that. @var{regexp} can be a string or a
	284	compiled regex.
	285
	286	This is similar to @code{regexp-substitute}, but allows global
	287	substitutions on @var{target}. Each @var{item} behaves as per
	288	@code{regexp-substitute}, with the following differences,
	289
	290	@itemize @bullet
	291	@item
	292	A function. Called as @code{(@var{item} match)} with the match
	293	structure for the @var{regexp} match, it should return a string to be
	294	written to @var{port}.
	295
	296	@item
	297	The symbol @samp{post}. This doesn't output anything, but instead
	298	causes @code{regexp-substitute/global} to recurse on the unmatched
	299	portion of @var{target}.
	300
	301	This @emph{must} be supplied to perform a global search and replace on
	302	@var{target}; without it @code{regexp-substitute/global} returns after
	303	a single match and output.
	304	@end itemize
	305
	306	For example, to collapse runs of tabs and spaces to a single hyphen
	307	each,
	308
	309	@example
	310	(regexp-substitute/global #f "[ \t]+" "this is the text"
	311	                          'pre "-" 'post)
	312	@result{} "this-is-the-text"
	313	@end example
	314
	315	Or using a function to reverse the letters in each word,
	316
	317	@example
	318	(regexp-substitute/global #f "[a-z]+" "to do and not-do"
	319	  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
	320	@result{} "ot od dna ton-od"
	321	@end example
	322
	323	When @var{target} contains only one match, @code{regexp-substitute/global}
	324	can also stand in for a separate @code{string-match} call. For example,
	325	the following is the date example from @code{regexp-substitute} above,
	326	written without the separate @code{string-match} call.
	327
	328	@lisp
	329	(define date-regex
	330	  "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
	331	(define s "Date 20020429 12am.")
	332	(regexp-substitute/global #f date-regex s
	333	                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
	334
	335	@result{} "Date 04-29-2002 12am. (20020429)"
	336	@end lisp
	337	@end deffn
	338
	339
	340	@node Match Structures
	341	@subsection Match Structures
	342
	343	@cindex match structures
	344
	345	A @dfn{match structure} is the object returned by @code{string-match} and
	346	@code{regexp-exec}. It describes which portion of a string, if any,
	347	matched the given regular expression. Match structures include: a
	348	reference to the string that was checked for matches; the starting and
	349	ending positions of the regexp match; and, if the regexp included any
	350	parenthesized subexpressions, the starting and ending positions of each
	351	submatch.
	352
	353	In each of the regexp match functions described below, the @code{match}
	354	argument must be a match structure returned by a previous call to
	355	@code{string-match} or @code{regexp-exec}. Most of these functions
	356	return some information about the original target string that was
	357	matched against a regular expression; we will call that string
	358	@var{target} for easy reference.
	359
	360	@c begin (scm-doc-string "regex.scm" "regexp-match?")
	361	@deffn {Scheme Procedure} regexp-match? obj
	362	Return @code{#t} if @var{obj} is a match structure returned by a
	363	previous call to @code{regexp-exec}, or @code{#f} otherwise.
	364	@end deffn
	365
	366	@c begin (scm-doc-string "regex.scm" "match:substring")
	367	@deffn {Scheme Procedure} match:substring match [n]
	368	Return the portion of @var{target} matched by subexpression number
	369	@var{n}. Submatch 0 (the default) represents the entire regexp match.
	370	If the regular expression as a whole matched, but the subexpression
	371	number @var{n} did not match, return @code{#f}.
	372	@end deffn
	373
	374	@lisp
	375	(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
	376	(match:substring s)
	377	@result{} "2002"
	378
	379	;; match starting at offset 6 in the string
	380	(match:substring
	381	  (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
	382	@result{} "7654"
	383	@end lisp
	384
	385	@c begin (scm-doc-string "regex.scm" "match:start")
	386	@deffn {Scheme Procedure} match:start match [n]
	387	Return the starting position of submatch number @var{n}.
	388	@end deffn
	389
	390	In the following example, the result is 4, since the match starts at
	391	character index 4:
	392
	393	@lisp
	394	(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
	395	(match:start s)
	396	@result{} 4
	397	@end lisp
	398
	399	@c begin (scm-doc-string "regex.scm" "match:end")
	400	@deffn {Scheme Procedure} match:end match [n]
	401	Return the ending position of submatch number @var{n}.
	402	@end deffn
	403
	404	In the following example, the result is 8, since the match runs between
	405	characters 4 and 8 (i.e.@: the ``2002'').
	406
	407	@lisp
	408	(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
	409	(match:end s)
	410	@result{} 8
	411	@end lisp
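
The optional @var{n} argument of @code{match:start} and @code{match:end}
selects a parenthesized submatch. The following sketch, with an
arbitrary target string, shows the positions of two adjacent submatches.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([a-z]+)([0-9]+)" "--blah2002--"))

(list (match:start m 1) (match:end m 1))
@result{} (2 6)

(list (match:start m 2) (match:end m 2))
@result{} (6 10)

;; The positions index into the original target string.
(substring (match:string m) (match:start m 2) (match:end m 2))
@result{} "2002"
@end lisp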
	412
	413	@c begin (scm-doc-string "regex.scm" "match:prefix")
	414	@deffn {Scheme Procedure} match:prefix match
	415	Return the unmatched portion of @var{target} preceding the regexp match.
	416
	417	@lisp
	418	(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
	419	(match:prefix s)
	420	@result{} "blah"
	421	@end lisp
	422	@end deffn
	423
	424	@c begin (scm-doc-string "regex.scm" "match:suffix")
	425	@deffn {Scheme Procedure} match:suffix match
	426	Return the unmatched portion of @var{target} following the regexp match.
	427	@end deffn
	428
	429	@lisp
	430	(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
	431	(match:suffix s)
	432	@result{} "foo"
	433	@end lisp
	434
	435	@c begin (scm-doc-string "regex.scm" "match:count")
	436	@deffn {Scheme Procedure} match:count match
	437	Return the number of parenthesized subexpressions from @var{match}.
	438	Note that the entire regular expression match itself counts as a
	439	subexpression, and failed submatches are included in the count.
	440	@end deffn
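
For instance, in the following sketch the pattern has two parenthesized
subexpressions, only one of which takes part in the match.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "(a)|(b)" "b"))

;; Two parenthesized groups plus the whole match.
(match:count m)
@result{} 3

;; The first group did not take part in the match.
(match:substring m 1)
@result{} #f

(match:substring m 2)
@result{} "b"
@end lisp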
	441
	442	@c begin (scm-doc-string "regex.scm" "match:string")
	443	@deffn {Scheme Procedure} match:string match
	444	Return the original @var{target} string.
	445	@end deffn
	446
	447	@lisp
	448	(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
	449	(match:string s)
	450	@result{} "blah2002foo"
	451	@end lisp
	452
	453
	454	@node Backslash Escapes
	455	@subsection Backslash Escapes
	456
	457	Sometimes you will want a regexp to match characters like @samp{*} or
	458	@samp{$} exactly. For example, to check whether a particular string
	459	represents a menu entry from an Info node, it would be useful to match
	460	it against a regexp like @samp{^* [^:]*::}. However, this won't work;
	461	because the asterisk is a metacharacter, it won't match the @samp{*} at
	462	the beginning of the string. In this case, we want to make the first
	463	asterisk un-magic.
	464
	465	You can do this by preceding the metacharacter with a backslash
	466	character @samp{\}. (This is also called @dfn{quoting} the
	467	metacharacter, and is known as a @dfn{backslash escape}.) When Guile
	468	sees a backslash in a regular expression, it considers the following
	469	glyph to be an ordinary character, no matter what special meaning it
	470	would ordinarily have. Therefore, we can make the above example work by
	471	changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
	472	the regular expression engine to match only a single asterisk in the
	473	target string.
	474
	475	Since the backslash is itself a metacharacter, you may force a regexp to
	476	match a backslash in the target string by preceding the backslash with
	477	itself. For example, to find variable references in a @TeX{} program,
	478	you might want to find occurrences of the string @samp{\let\} followed
	479	by any number of alphabetic characters. The regular expression
	480	@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
	481	regexp each match a single backslash in the target string.
	482
	483	@c begin (scm-doc-string "regex.scm" "regexp-quote")
	484	@deffn {Scheme Procedure} regexp-quote str
	485	Quote each special character found in @var{str} with a backslash, and
	486	return the resulting string.
	487	@end deffn
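
A typical use of @code{regexp-quote} is to search for arbitrary text
literally. The sketch below does not show the quoted string itself,
since its exact form may differ between Guile versions; the variable
@code{literal} is only illustrative.

@lisp
(use-modules (ice-9 regex))

(define literal "1+1=2 (really?)")

;; Used directly as a pattern, the string does not match itself,
;; because +, ?, ( and ) are metacharacters.
(string-match literal "so 1+1=2 (really?) then")
@result{} #f

;; Quoted, it matches only the literal text.
(match:substring
 (string-match (regexp-quote literal) "so 1+1=2 (really?) then"))
@result{} "1+1=2 (really?)"
@end lisp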
	488
	489	@strong{Very important:} Using backslash escapes in Guile source code
	490	(as in Emacs Lisp or C) can be tricky, because the backslash character
	491	has special meaning for the Guile reader. For example, if Guile
	492	encounters the character sequence @samp{\n} in the middle of a string
	493	while processing Scheme code, it replaces those characters with a
	494	newline character. Similarly, the character sequence @samp{\t} is
	495	replaced by a horizontal tab. Several of these @dfn{escape sequences}
	496	are processed by the Guile reader before your code is executed.
	497	Unrecognized escape sequences are ignored: if the characters @samp{\*}
	498	appear in a string, they will be translated to the single character
	499	@samp{*}.
	500
	501	This translation is obviously undesirable for regular expressions, since
	502	we want to be able to include backslashes in a string in order to
	503	escape regexp metacharacters. Therefore, to make sure that a backslash
	504	is preserved in a string in your Guile program, you must use @emph{two}
	505	consecutive backslashes:
	506
	507	@lisp
	508	(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
	509	@end lisp
	510
	511	The string in this example is preprocessed by the Guile reader before
	512	any code is executed. The resulting argument to @code{make-regexp} is
	513	the string @samp{^\* [^:]*}, which is what we really want.
	514
	515	This also means that in order to write a regular expression that matches
	516	a single backslash character, the regular expression string in the
	517	source code must include @emph{four} backslashes. Each consecutive pair
	518	of backslashes gets translated by the Guile reader to a single
	519	backslash, and the resulting double-backslash is interpreted by the
	520	regexp engine as matching a single backslash character. Hence:
	521
	522	@lisp
	523	(define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
	524	@end lisp
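
The two layers of escaping can be checked directly, as in the following
sketch. Results that are strings are shown as the reader would print
them, so a single backslash character appears doubled.

@lisp
(use-modules (ice-9 regex))

;; Four backslashes in the source are two characters in the string...
(string-length "\\\\")
@result{} 2

;; ...which the regexp engine reads as one literal backslash.
(define m (string-match "\\\\[a-z]+" "see \\alpha here"))
(match:substring m)
@result{} "\\alpha"

(string-length (match:substring m))
@result{} 6
@end lisp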
	525
	526	The reason for the unwieldiness of this syntax is historical. Both
	527	regular expression pattern matchers and Unix string processing systems
	528	have traditionally used backslashes with the special meanings
	529	described above. The POSIX regular expression specification and ANSI C
	530	standard both require these semantics. Attempting to abandon either
	531	convention would cause other kinds of compatibility problems, possibly
	532	more severe ones. Therefore, without extending the Scheme reader to
	533	support strings with different quoting conventions (an ungainly and
	534	confusing extension when implemented in other languages), we must adhere
	535	to this cumbersome escape syntax.