add syntax-local-binding
[bpt/guile.git] / doc / ref / api-regex.texi
CommitLineData
96ca59d8
NJ
1@c -*-texinfo-*-
2@c This is part of the GNU Guile Reference Manual.
3@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010
4@c Free Software Foundation, Inc.
5@c See the file guile.texi for copying conditions.
6
7@node Regular Expressions
8@section Regular Expressions
9@tpindex Regular expressions
10
11@cindex regular expressions
12@cindex regex
13@cindex emacs regexp
14
15A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
16describes a whole class of strings. A full description of regular
17expressions and their syntax is beyond the scope of this manual;
18an introduction can be found in the Emacs manual (@pxref{Regexps,
19, Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
20in many general Unix reference books.
21
22If your system does not include a POSIX regular expression library,
23and you have not linked Guile with a third-party regexp library such
24as Rx, these functions will not be available. You can tell whether
25your Guile installation includes regular expression support by
26checking whether @code{(provided? 'regex)} returns true.
27
28The following regexp and string matching features are provided by the
29@code{(ice-9 regex)} module. Before using the described functions,
30you should load this module by executing @code{(use-modules (ice-9
31regex))}.
32
33@menu
34* Regexp Functions:: Functions that create and match regexps.
35* Match Structures:: Finding what was matched by a regexp.
36* Backslash Escapes:: Removing the special meaning of regexp
37 meta-characters.
38@end menu
39
40
41@node Regexp Functions
42@subsection Regexp Functions
43
44By default, Guile supports POSIX extended regular expressions.
45That means that the characters @samp{(}, @samp{)}, @samp{+} and
46@samp{?} are special, and must be escaped if you wish to match the
47literal characters.
48
49This regular expression interface was modeled after that
50implemented by SCSH, the Scheme Shell. It is intended to be
51upwardly compatible with SCSH regular expressions.
52
53Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
54strings, since the underlying C functions treat that as the end of
55string. If there's a zero byte an error is thrown.
56
57Patterns and input strings are treated as being in the locale
58character set if @code{setlocale} has been called (@pxref{Locales}),
59and in a multibyte locale this includes treating multi-byte sequences
60as a single character. (Guile strings are currently merely bytes,
61though this may change in the future, @xref{Conversion to/from C}.)
62
63@deffn {Scheme Procedure} string-match pattern str [start]
64Compile the string @var{pattern} into a regular expression and compare
65it with @var{str}. The optional numeric argument @var{start} specifies
66the position of @var{str} at which to begin matching.
67
68@code{string-match} returns a @dfn{match structure} which
69describes what, if anything, was matched by the regular
70expression. @xref{Match Structures}. If @var{str} does not match
71@var{pattern} at all, @code{string-match} returns @code{#f}.
72@end deffn
73
74Two examples of a match follow. In the first example, the pattern
75matches the four digits in the match string. In the second, the pattern
76matches nothing.
77
78@example
79(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
80@result{} #("blah2002" (4 . 8))
81
82(string-match "[A-Za-z]" "123456")
83@result{} #f
84@end example
85
86Each time @code{string-match} is called, it must compile its
87@var{pattern} argument into a regular expression structure. This
88operation is expensive, which makes @code{string-match} inefficient if
89the same regular expression is used several times (for example, in a
90loop). For better performance, you can compile a regular expression in
91advance and then match strings against the compiled regexp.
92
93@deffn {Scheme Procedure} make-regexp pat flag@dots{}
94@deffnx {C Function} scm_make_regexp (pat, flaglst)
95Compile the regular expression described by @var{pat}, and
96return the compiled regexp structure. If @var{pat} does not
97describe a legal regular expression, @code{make-regexp} throws
98a @code{regular-expression-syntax} error.
99
100The @var{flag} arguments change the behavior of the compiled
101regular expression. The following values may be supplied:
102
103@defvar regexp/icase
104Consider uppercase and lowercase letters to be the same when
105matching.
106@end defvar
107
108@defvar regexp/newline
109If a newline appears in the target string, then permit the
110@samp{^} and @samp{$} operators to match immediately after or
111immediately before the newline, respectively. Also, the
112@samp{.} and @samp{[^...]} operators will never match a newline
113character. The intent of this flag is to treat the target
114string as a buffer containing many lines of text, and the
115regular expression as a pattern that may match a single one of
116those lines.
117@end defvar
118
119@defvar regexp/basic
120Compile a basic (``obsolete'') regexp instead of the extended
121(``modern'') regexps that are the default. Basic regexps do
122not consider @samp{|}, @samp{+} or @samp{?} to be special
123characters, and require the @samp{@{...@}} and @samp{(...)}
124metacharacters to be backslash-escaped (@pxref{Backslash
125Escapes}). There are several other differences between basic
126and extended regular expressions, but these are the most
127significant.
128@end defvar
129
130@defvar regexp/extended
131Compile an extended regular expression rather than a basic
132regexp. This is the default behavior; this flag will not
133usually be needed. If a call to @code{make-regexp} includes
134both @code{regexp/basic} and @code{regexp/extended} flags, the
135one which comes last will override the earlier one.
136@end defvar
137@end deffn
138
139@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
140@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
141Match the compiled regular expression @var{rx} against
142@code{str}. If the optional integer @var{start} argument is
143provided, begin matching from that position in the string.
144Return a match structure describing the results of the match,
145or @code{#f} if no match could be found.
146
147The @var{flags} argument changes the matching behavior. The following
148flag values may be supplied, use @code{logior} (@pxref{Bitwise
149Operations}) to combine them,
150
151@defvar regexp/notbol
152Consider that the @var{start} offset into @var{str} is not the
153beginning of a line and should not match operator @samp{^}.
154
155If @var{rx} was created with the @code{regexp/newline} option above,
156@samp{^} will still match after a newline in @var{str}.
157@end defvar
158
159@defvar regexp/noteol
160Consider that the end of @var{str} is not the end of a line and should
161not match operator @samp{$}.
162
163If @var{rx} was created with the @code{regexp/newline} option above,
164@samp{$} will still match before a newline in @var{str}.
165@end defvar
166@end deffn
167
168@lisp
169;; Regexp to match uppercase letters
170(define r (make-regexp "[A-Z]*"))
171
172;; Regexp to match letters, ignoring case
173(define ri (make-regexp "[A-Z]*" regexp/icase))
174
175;; Search for bob using regexp r
176(match:substring (regexp-exec r "bob"))
177@result{} "" ; no match
178
179;; Search for bob using regexp ri
180(match:substring (regexp-exec ri "Bob"))
181@result{} "Bob" ; matched case insensitive
182@end lisp
183
184@deffn {Scheme Procedure} regexp? obj
185@deffnx {C Function} scm_regexp_p (obj)
186Return @code{#t} if @var{obj} is a compiled regular expression,
187or @code{#f} otherwise.
188@end deffn
189
190@sp 1
191@deffn {Scheme Procedure} list-matches regexp str [flags]
192Return a list of match structures which are the non-overlapping
193matches of @var{regexp} in @var{str}. @var{regexp} can be either a
194pattern string or a compiled regexp. The @var{flags} argument is as
195per @code{regexp-exec} above.
196
197@example
198(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
199@result{} ("abc" "def")
200@end example
201@end deffn
202
203@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
204Apply @var{proc} to the non-overlapping matches of @var{regexp} in
205@var{str}, to build a result. @var{regexp} can be either a pattern
206string or a compiled regexp. The @var{flags} argument is as per
207@code{regexp-exec} above.
208
209@var{proc} is called as @code{(@var{proc} match prev)} where
210@var{match} is a match structure and @var{prev} is the previous return
211from @var{proc}. For the first call @var{prev} is the given
212@var{init} parameter. @code{fold-matches} returns the final value
213from @var{proc}.
214
215For example to count matches,
216
217@example
218(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
219 (lambda (match count)
220 (1+ count)))
221@result{} 2
222@end example
223@end deffn
224
225@sp 1
226Regular expressions are commonly used to find patterns in one string
227and replace them with the contents of another string. The following
228functions are convenient ways to do this.
229
230@c begin (scm-doc-string "regex.scm" "regexp-substitute")
231@deffn {Scheme Procedure} regexp-substitute port match [item@dots{}]
232Write to @var{port} selected parts of the match structure @var{match}.
233Or if @var{port} is @code{#f} then form a string from those parts and
234return that.
235
236Each @var{item} specifies a part to be written, and may be one of the
237following,
238
239@itemize @bullet
240@item
241A string. String arguments are written out verbatim.
242
243@item
244An integer. The submatch with that number is written
245(@code{match:substring}). Zero is the entire match.
246
247@item
248The symbol @samp{pre}. The portion of the matched string preceding
249the regexp match is written (@code{match:prefix}).
250
251@item
252The symbol @samp{post}. The portion of the matched string following
253the regexp match is written (@code{match:suffix}).
254@end itemize
255
256For example, changing a match and retaining the text before and after,
257
258@example
259(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
260 'pre "37" 'post)
261@result{} "number 37 is good"
262@end example
263
264Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
265re-ordering and hyphenating the fields.
266
267@lisp
268(define date-regex
269 "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
270(define s "Date 20020429 12am.")
271(regexp-substitute #f (string-match date-regex s)
272 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
273@result{} "Date 04-29-2002 12am. (20020429)"
274@end lisp
275@end deffn
276
277
278@c begin (scm-doc-string "regex.scm" "regexp-substitute")
279@deffn {Scheme Procedure} regexp-substitute/global port regexp target [item@dots{}]
280@cindex search and replace
281Write to @var{port} selected parts of matches of @var{regexp} in
282@var{target}. If @var{port} is @code{#f} then form a string from
283those parts and return that. @var{regexp} can be a string or a
284compiled regex.
285
286This is similar to @code{regexp-substitute}, but allows global
287substitutions on @var{target}. Each @var{item} behaves as per
288@code{regexp-substitute}, with the following differences,
289
290@itemize @bullet
291@item
292A function. Called as @code{(@var{item} match)} with the match
293structure for the @var{regexp} match, it should return a string to be
294written to @var{port}.
295
296@item
297The symbol @samp{post}. This doesn't output anything, but instead
298causes @code{regexp-substitute/global} to recurse on the unmatched
299portion of @var{target}.
300
301This @emph{must} be supplied to perform a global search and replace on
302@var{target}; without it @code{regexp-substitute/global} returns after
303a single match and output.
304@end itemize
305
306For example, to collapse runs of tabs and spaces to a single hyphen
307each,
308
309@example
310(regexp-substitute/global #f "[ \t]+" "this is the text"
311 'pre "-" 'post)
312@result{} "this-is-the-text"
313@end example
314
315Or using a function to reverse the letters in each word,
316
317@example
318(regexp-substitute/global #f "[a-z]+" "to do and not-do"
319 'pre (lambda (m) (string-reverse (match:substring m))) 'post)
320@result{} "ot od dna ton-od"
321@end example
322
323Without the @code{post} symbol, just one regexp match is made. For
324example the following is the date example from
325@code{regexp-substitute} above, without the need for the separate
326@code{string-match} call.
327
328@lisp
329(define date-regex
330 "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
331(define s "Date 20020429 12am.")
332(regexp-substitute/global #f date-regex s
333 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
334
335@result{} "Date 04-29-2002 12am. (20020429)"
336@end lisp
337@end deffn
338
339
340@node Match Structures
341@subsection Match Structures
342
343@cindex match structures
344
345A @dfn{match structure} is the object returned by @code{string-match} and
346@code{regexp-exec}. It describes which portion of a string, if any,
347matched the given regular expression. Match structures include: a
348reference to the string that was checked for matches; the starting and
349ending positions of the regexp match; and, if the regexp included any
350parenthesized subexpressions, the starting and ending positions of each
351submatch.
352
353In each of the regexp match functions described below, the @code{match}
354argument must be a match structure returned by a previous call to
355@code{string-match} or @code{regexp-exec}. Most of these functions
356return some information about the original target string that was
357matched against a regular expression; we will call that string
358@var{target} for easy reference.
359
360@c begin (scm-doc-string "regex.scm" "regexp-match?")
361@deffn {Scheme Procedure} regexp-match? obj
362Return @code{#t} if @var{obj} is a match structure returned by a
363previous call to @code{regexp-exec}, or @code{#f} otherwise.
364@end deffn
365
366@c begin (scm-doc-string "regex.scm" "match:substring")
367@deffn {Scheme Procedure} match:substring match [n]
368Return the portion of @var{target} matched by subexpression number
369@var{n}. Submatch 0 (the default) represents the entire regexp match.
370If the regular expression as a whole matched, but the subexpression
371number @var{n} did not match, return @code{#f}.
372@end deffn
373
374@lisp
375(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
376(match:substring s)
377@result{} "2002"
378
379;; match starting at offset 6 in the string
380(match:substring
381 (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
382@result{} "7654"
383@end lisp
384
385@c begin (scm-doc-string "regex.scm" "match:start")
386@deffn {Scheme Procedure} match:start match [n]
387Return the starting position of submatch number @var{n}.
388@end deffn
389
390In the following example, the result is 4, since the match starts at
391character index 4:
392
393@lisp
394(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
395(match:start s)
396@result{} 4
397@end lisp
398
399@c begin (scm-doc-string "regex.scm" "match:end")
400@deffn {Scheme Procedure} match:end match [n]
401Return the ending position of submatch number @var{n}.
402@end deffn
403
404In the following example, the result is 8, since the match runs between
679cceed 405characters 4 and 8 (i.e.@: the ``2002'').
96ca59d8
NJ
406
407@lisp
408(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
409(match:end s)
410@result{} 8
411@end lisp
412
413@c begin (scm-doc-string "regex.scm" "match:prefix")
414@deffn {Scheme Procedure} match:prefix match
415Return the unmatched portion of @var{target} preceding the regexp match.
416
417@lisp
418(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
419(match:prefix s)
420@result{} "blah"
421@end lisp
422@end deffn
423
424@c begin (scm-doc-string "regex.scm" "match:suffix")
425@deffn {Scheme Procedure} match:suffix match
426Return the unmatched portion of @var{target} following the regexp match.
427@end deffn
428
429@lisp
430(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
431(match:suffix s)
432@result{} "foo"
433@end lisp
434
435@c begin (scm-doc-string "regex.scm" "match:count")
436@deffn {Scheme Procedure} match:count match
437Return the number of parenthesized subexpressions from @var{match}.
438Note that the entire regular expression match itself counts as a
439subexpression, and failed submatches are included in the count.
440@end deffn
441
442@c begin (scm-doc-string "regex.scm" "match:string")
443@deffn {Scheme Procedure} match:string match
444Return the original @var{target} string.
445@end deffn
446
447@lisp
448(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
449(match:string s)
450@result{} "blah2002foo"
451@end lisp
452
453
454@node Backslash Escapes
455@subsection Backslash Escapes
456
457Sometimes you will want a regexp to match characters like @samp{*} or
458@samp{$} exactly. For example, to check whether a particular string
459represents a menu entry from an Info node, it would be useful to match
460it against a regexp like @samp{^* [^:]*::}. However, this won't work;
461because the asterisk is a metacharacter, it won't match the @samp{*} at
462the beginning of the string. In this case, we want to make the first
463asterisk un-magic.
464
465You can do this by preceding the metacharacter with a backslash
466character @samp{\}. (This is also called @dfn{quoting} the
467metacharacter, and is known as a @dfn{backslash escape}.) When Guile
468sees a backslash in a regular expression, it considers the following
469glyph to be an ordinary character, no matter what special meaning it
470would ordinarily have. Therefore, we can make the above example work by
471changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
472the regular expression engine to match only a single asterisk in the
473target string.
474
475Since the backslash is itself a metacharacter, you may force a regexp to
476match a backslash in the target string by preceding the backslash with
477itself. For example, to find variable references in a @TeX{} program,
478you might want to find occurrences of the string @samp{\let\} followed
479by any number of alphabetic characters. The regular expression
480@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
481regexp each match a single backslash in the target string.
482
483@c begin (scm-doc-string "regex.scm" "regexp-quote")
484@deffn {Scheme Procedure} regexp-quote str
485Quote each special character found in @var{str} with a backslash, and
486return the resulting string.
487@end deffn
488
489@strong{Very important:} Using backslash escapes in Guile source code
490(as in Emacs Lisp or C) can be tricky, because the backslash character
491has special meaning for the Guile reader. For example, if Guile
492encounters the character sequence @samp{\n} in the middle of a string
493while processing Scheme code, it replaces those characters with a
494newline character. Similarly, the character sequence @samp{\t} is
495replaced by a horizontal tab. Several of these @dfn{escape sequences}
496are processed by the Guile reader before your code is executed.
497Unrecognized escape sequences are ignored: if the characters @samp{\*}
498appear in a string, they will be translated to the single character
499@samp{*}.
500
501This translation is obviously undesirable for regular expressions, since
502we want to be able to include backslashes in a string in order to
503escape regexp metacharacters. Therefore, to make sure that a backslash
504is preserved in a string in your Guile program, you must use @emph{two}
505consecutive backslashes:
506
507@lisp
508(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
509@end lisp
510
511The string in this example is preprocessed by the Guile reader before
512any code is executed. The resulting argument to @code{make-regexp} is
513the string @samp{^\* [^:]*}, which is what we really want.
514
515This also means that in order to write a regular expression that matches
516a single backslash character, the regular expression string in the
517source code must include @emph{four} backslashes. Each consecutive pair
518of backslashes gets translated by the Guile reader to a single
519backslash, and the resulting double-backslash is interpreted by the
520regexp engine as matching a single backslash character. Hence:
521
522@lisp
523(define tex-variable-pattern (make-regexp "\\\\let\\\\=[A-Za-z]*"))
524@end lisp
525
526The reason for the unwieldiness of this syntax is historical. Both
527regular expression pattern matchers and Unix string processing systems
528have traditionally used backslashes with the special meanings
529described above. The POSIX regular expression specification and ANSI C
530standard both require these semantics. Attempting to abandon either
531convention would cause other kinds of compatibility problems, possibly
532more severe ones. Therefore, without extending the Scheme reader to
533support strings with different quoting conventions (an ungainly and
534confusing extension when implemented in other languages), we must adhere
535to this cumbersome escape syntax.