@c This is part of the GNU Guile Reference Manual.
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
@c Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.

@node Regular Expressions
@section Regular Expressions
@tpindex Regular expressions

@cindex regular expressions

A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
describes a whole class of strings. A full description of regular
expressions and their syntax is beyond the scope of this manual;
an introduction can be found in the Emacs manual (@pxref{Regexps,
, Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
in many general Unix reference books.

If your system does not include a POSIX regular expression library,
and you have not linked Guile with a third-party regexp library such
as Rx, these functions will not be available. You can tell whether
your Guile installation includes regular expression support by
checking whether @code{(provided? 'regex)} returns true.
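
As a minimal sketch of that check, the following helper refuses to run
when regexp support is absent. The name @code{contains-digit?} is
hypothetical and chosen only for illustration; it is not part of Guile.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: #t when STR contains a digit, #f otherwise.
;; Signals an error up front if regexp support was not compiled in.
(define (contains-digit? str)
  (unless (provided? 'regex)
    (error "this Guile was built without regexp support"))
  (and (string-match "[0-9]" str) #t))

(contains-digit? "abc1")   ; => #t
(contains-digit? "abc")    ; => #f
```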

The following regexp and string matching features are provided by the
@code{(ice-9 regex)} module. Before using the described functions,
you should load this module by executing @code{(use-modules (ice-9
regex))}.

@menu
* Regexp Functions::            Functions that create and match regexps.
* Match Structures::            Finding what was matched by a regexp.
* Backslash Escapes::           Removing the special meaning of regexp
                                metacharacters.
@end menu

@node Regexp Functions
@subsection Regexp Functions

By default, Guile supports POSIX extended regular expressions.
That means that the characters @samp{(}, @samp{)}, @samp{+} and
@samp{?} are special, and must be escaped if you wish to match the
literal characters.

This regular expression interface was modeled after that
implemented by SCSH, the Scheme Shell. It is intended to be
upwardly compatible with SCSH regular expressions.

Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
strings, since the underlying C functions treat a zero byte as the end
of the string. If either contains a zero byte, an error is thrown.

Internally, patterns and input strings are converted to the current
locale's encoding, and then passed to the C library's regular expression
routines (@pxref{Regular Expressions,,, libc, The GNU C Library
Reference Manual}). The returned match structures always point to
characters in the strings, not to individual bytes, even in the case of
multi-byte encodings.

@deffn {Scheme Procedure} string-match pattern str [start]
Compile the string @var{pattern} into a regular expression and compare
it with @var{str}. The optional numeric argument @var{start} specifies
the position of @var{str} at which to begin matching.

@code{string-match} returns a @dfn{match structure} which
describes what, if anything, was matched by the regular
expression. @xref{Match Structures}. If @var{str} does not match
@var{pattern} at all, @code{string-match} returns @code{#f}.
@end deffn

Two examples of a match follow. In the first example, the pattern
matches the four digits in the match string. In the second, the pattern
matches nothing.

@lisp
(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
@result{} #("blah2002" (4 . 8))

(string-match "[A-Za-z]" "123456")
@result{} #f
@end lisp

Each time @code{string-match} is called, it must compile its
@var{pattern} argument into a regular expression structure. This
operation is expensive, which makes @code{string-match} inefficient if
the same regular expression is used several times (for example, in a
loop). For better performance, you can compile a regular expression in
advance and then match strings against the compiled regexp.
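
The following sketch illustrates the idea with @code{make-regexp} and
@code{regexp-exec}, both described below. The names @code{year-rx} and
@code{extract-years} are hypothetical, introduced only for this
illustration.

```scheme
(use-modules (ice-9 regex))

;; Compile the pattern once, outside the loop.
(define year-rx (make-regexp "[0-9][0-9][0-9][0-9]"))

;; Hypothetical helper: return the first four-digit run in each string,
;; or #f for strings that contain none.
(define (extract-years strings)
  (map (lambda (s)
         (let ((m (regexp-exec year-rx s)))
           (and m (match:substring m))))
       strings))

(extract-years '("blah2002" "no digits" "in 1999 and 2004"))
;; => ("2002" #f "1999")
```

Calling @code{string-match} inside the @code{lambda} instead would give
the same result, but would recompile the pattern for every element.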
@deffn {Scheme Procedure} make-regexp pat flag@dots{}
@deffnx {C Function} scm_make_regexp (pat, flaglst)
Compile the regular expression described by @var{pat}, and
return the compiled regexp structure. If @var{pat} does not
describe a valid regular expression, @code{make-regexp} throws
a @code{regular-expression-syntax} error.

The @var{flag} arguments change the behavior of the compiled
regular expression. The following values may be supplied:

@defvar regexp/icase
Consider uppercase and lowercase letters to be the same when
matching.
@end defvar

@defvar regexp/newline
If a newline appears in the target string, then permit the
@samp{^} and @samp{$} operators to match immediately after or
immediately before the newline, respectively. Also, the
@samp{.} and @samp{[^...]} operators will never match a newline
character. The intent of this flag is to treat the target
string as a buffer containing many lines of text, and the
regular expression as a pattern that may match a single one of
those lines.
@end defvar

@defvar regexp/basic
Compile a basic (``obsolete'') regexp instead of the extended
(``modern'') regexps that are the default. Basic regexps do
not consider @samp{|}, @samp{+} or @samp{?} to be special
characters, and require the @samp{@{...@}} and @samp{(...)}
metacharacters to be backslash-escaped (@pxref{Backslash
Escapes}). There are several other differences between basic
and extended regular expressions, but these are the most
significant.
@end defvar

@defvar regexp/extended
Compile an extended regular expression rather than a basic
regexp. This is the default behavior; this flag will not
usually be needed. If a call to @code{make-regexp} includes
both @code{regexp/basic} and @code{regexp/extended} flags, the
one which comes last will override the earlier one.
@end defvar
@end deffn
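
A short sketch of how two of these flags change matching follows. The
variable names are hypothetical, and the results assume a POSIX
regexp library that behaves as described above.

```scheme
(use-modules (ice-9 regex))

(define text "first line\nsecond line")

;; Without regexp/newline, ^ matches only at the start of the string.
(define plain (regexp-exec (make-regexp "^second") text))
;; plain => #f

;; With regexp/newline, ^ also matches just after the newline.
(define per-line
  (regexp-exec (make-regexp "^second" regexp/newline) text))
(match:start per-line)
;; => 11

;; In a basic regexp, + is an ordinary character, so the pattern
;; matches the letter a followed by a literal plus sign.
(define basic (regexp-exec (make-regexp "a+" regexp/basic) "aaa+"))
(match:substring basic)
;; => "a+"
```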
@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
Match the compiled regular expression @var{rx} against
@var{str}. If the optional integer @var{start} argument is
provided, begin matching from that position in the string.
Return a match structure describing the results of the match,
or @code{#f} if no match could be found.

The @var{flags} argument changes the matching behavior. The following
flag values may be supplied; use @code{logior} (@pxref{Bitwise
Operations}) to combine them.

@defvar regexp/notbol
Consider that the @var{start} offset into @var{str} is not the
beginning of a line and should not match operator @samp{^}.

If @var{rx} was created with the @code{regexp/newline} option above,
@samp{^} will still match after a newline in @var{str}.
@end defvar

@defvar regexp/noteol
Consider that the end of @var{str} is not the end of a line and should
not match operator @samp{$}.

If @var{rx} was created with the @code{regexp/newline} option above,
@samp{$} will still match before a newline in @var{str}.
@end defvar
@end deffn

@lisp
;; Regexp to match uppercase letters
(define r (make-regexp "[A-Z]*"))

;; Regexp to match letters, ignoring case
(define ri (make-regexp "[A-Z]*" regexp/icase))

;; Search for bob using regexp r
(match:substring (regexp-exec r "bob"))
@result{} ""                  ; only the empty string matches

;; Search for Bob using regexp ri
(match:substring (regexp-exec ri "Bob"))
@result{} "Bob"               ; matched case insensitive
@end lisp
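
The @var{flags} argument can be sketched in the same way. The variable
names below are hypothetical; the example only uses a @var{start} of 0
so that the effect of each flag is easy to see.

```scheme
(use-modules (ice-9 regex))

(define bol-rx (make-regexp "^abc"))
(define eol-rx (make-regexp "abc$"))
(define both-rx (make-regexp "^abc$"))

;; With no flags, the anchors match at the ends of the string.
(define anchored (regexp-exec bol-rx "abc"))
;; anchored => a match structure

;; regexp/notbol stops ^ from matching at the start offset.
(define no-bol (regexp-exec bol-rx "abc" 0 regexp/notbol))
;; no-bol => #f

;; regexp/noteol stops $ from matching at the end of the string.
(define no-eol (regexp-exec eol-rx "abc" 0 regexp/noteol))
;; no-eol => #f

;; Flags are combined with logior.
(define neither
  (regexp-exec both-rx "abc" 0 (logior regexp/notbol regexp/noteol)))
;; neither => #f
```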

@deffn {Scheme Procedure} regexp? obj
@deffnx {C Function} scm_regexp_p (obj)
Return @code{#t} if @var{obj} is a compiled regular expression,
or @code{#f} otherwise.
@end deffn

@deffn {Scheme Procedure} list-matches regexp str [flags]
Return a list of match structures which are the non-overlapping
matches of @var{regexp} in @var{str}. @var{regexp} can be either a
pattern string or a compiled regexp. The @var{flags} argument is as
per @code{regexp-exec} above.

@lisp
(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
@result{} ("abc" "def")
@end lisp
@end deffn
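
As a further sketch, @code{list-matches} also accepts a compiled
regexp, and each element of its result can be passed to the match
accessors (@pxref{Match Structures}). The names @code{word-rx} and
@code{spans} are hypothetical.

```scheme
(use-modules (ice-9 regex))

(define word-rx (make-regexp "[a-z]+"))

;; Collect the (start . end) offsets of every word.
(define spans
  (map (lambda (m) (cons (match:start m) (match:end m)))
       (list-matches word-rx "abc 42 def 78")))
;; spans => ((0 . 3) (7 . 10))
```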

@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
Apply @var{proc} to the non-overlapping matches of @var{regexp} in
@var{str}, to build a result. @var{regexp} can be either a pattern
string or a compiled regexp. The @var{flags} argument is as per
@code{regexp-exec} above.

@var{proc} is called as @code{(@var{proc} match prev)} where
@var{match} is a match structure and @var{prev} is the previous return
from @var{proc}. For the first call @var{prev} is the given
@var{init} parameter. @code{fold-matches} returns the final value
from @var{proc}.

For example, to count matches:

@lisp
(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
              (lambda (match count)
                (1+ count)))
@result{} 2
@end lisp
@end deffn
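
A second sketch accumulates the matched substrings instead of counting
them. The name @code{words-reversed} is hypothetical. Because each
match is consed onto the front of the accumulator, the result comes out
in reverse order of appearance.

```scheme
(use-modules (ice-9 regex))

(define words-reversed
  (fold-matches "[a-z]+" "abc 42 def 78" '()
                (lambda (m acc)
                  (cons (match:substring m) acc))))
;; words-reversed => ("def" "abc")
```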

Regular expressions are commonly used to find patterns in one string
and replace them with the contents of another string. The following
functions are convenient ways to do this.

@c begin (scm-doc-string "regex.scm" "regexp-substitute")
@deffn {Scheme Procedure} regexp-substitute port match item @dots{}
Write to @var{port} selected parts of the match structure @var{match}.
Or if @var{port} is @code{#f} then form a string from those parts and
return that.

Each @var{item} specifies a part to be written, and may be one of the
following:

@itemize @bullet
@item
A string. String arguments are written out verbatim.

@item
An integer. The submatch with that number is written
(@code{match:substring}). Zero is the entire match.

@item
The symbol @samp{pre}. The portion of the matched string preceding
the regexp match is written (@code{match:prefix}).

@item
The symbol @samp{post}. The portion of the matched string following
the regexp match is written (@code{match:suffix}).
@end itemize

For example, changing a match and retaining the text before and after:

@lisp
(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
                   'pre "37" 'post)
@result{} "number 37 is good"
@end lisp

Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
re-ordering and hyphenating the fields.

@lisp
(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute #f (string-match date-regex s)
                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
@end deffn


@c begin (scm-doc-string "regex.scm" "regexp-substitute")
@deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
@cindex search and replace
Write to @var{port} selected parts of matches of @var{regexp} in
@var{target}. If @var{port} is @code{#f} then form a string from
those parts and return that. @var{regexp} can be a string or a
compiled regexp.

This is similar to @code{regexp-substitute}, but allows global
substitutions on @var{target}. Each @var{item} behaves as per
@code{regexp-substitute}, with the following differences:

@itemize @bullet
@item
A function. Called as @code{(@var{item} match)} with the match
structure for the @var{regexp} match, it should return a string to be
written to @var{port}.

@item
The symbol @samp{post}. This doesn't output anything, but instead
causes @code{regexp-substitute/global} to recurse on the unmatched
portion of @var{target}.

This @emph{must} be supplied to perform a global search and replace on
@var{target}; without it @code{regexp-substitute/global} returns after
a single match and output.
@end itemize

For example, to collapse runs of tabs and spaces to a single hyphen
each:

@lisp
(regexp-substitute/global #f "[ \t]+" "this is the text"
                          'pre "-" 'post)
@result{} "this-is-the-text"
@end lisp

Or using a function to reverse the letters in each word:

@lisp
(regexp-substitute/global #f "[a-z]+" "to do and not-do"
  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
@result{} "ot od dna ton-od"
@end lisp

Without the @code{post} symbol, just one regexp match is made. For
example the following is the date example from
@code{regexp-substitute} above, without the need for the separate
@code{string-match} call.

@lisp
(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute/global #f date-regex s
                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")

@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
@end deffn
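
The effect of the @code{post} symbol can be seen by running the same
substitution with and without it. This is a sketch; the names
@code{target}, @code{all} and @code{first-only} are hypothetical.

```scheme
(use-modules (ice-9 regex))

(define target "a1 b22 c333")

;; With 'post, every run of digits is replaced.
(define all
  (regexp-substitute/global #f "[0-9]+" target 'pre "N" 'post))
;; all => "aN bN cN"

;; Without 'post, output stops after the first match, so the rest
;; of TARGET is not written.
(define first-only
  (regexp-substitute/global #f "[0-9]+" target 'pre "N"))
;; first-only => "aN"
```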


@node Match Structures
@subsection Match Structures

@cindex match structures

A @dfn{match structure} is the object returned by @code{string-match} and
@code{regexp-exec}. It describes which portion of a string, if any,
matched the given regular expression. Match structures include: a
reference to the string that was checked for matches; the starting and
ending positions of the regexp match; and, if the regexp included any
parenthesized subexpressions, the starting and ending positions of each
submatch.

In each of the regexp match functions described below, the @code{match}
argument must be a match structure returned by a previous call to
@code{string-match} or @code{regexp-exec}. Most of these functions
return some information about the original target string that was
matched against a regular expression; we will call that string
@var{target} for easy reference.

@c begin (scm-doc-string "regex.scm" "regexp-match?")
@deffn {Scheme Procedure} regexp-match? obj
Return @code{#t} if @var{obj} is a match structure returned by a
previous call to @code{regexp-exec}, or @code{#f} otherwise.
@end deffn

@c begin (scm-doc-string "regex.scm" "match:substring")
@deffn {Scheme Procedure} match:substring match [n]
Return the portion of @var{target} matched by subexpression number
@var{n}. Submatch 0 (the default) represents the entire regexp match.
If the regular expression as a whole matched, but the subexpression
number @var{n} did not match, return @code{#f}.

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:substring s)
@result{} "2002"

;; match starting at offset 6 in the string
(match:substring
  (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
@result{} "7654"
@end lisp
@end deffn

@c begin (scm-doc-string "regex.scm" "match:start")
@deffn {Scheme Procedure} match:start match [n]
Return the starting position of submatch number @var{n}.
@end deffn

In the following example, the result is 4, since the match starts at
character index 4:

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:start s)
@result{} 4
@end lisp

@c begin (scm-doc-string "regex.scm" "match:end")
@deffn {Scheme Procedure} match:end match [n]
Return the ending position of submatch number @var{n}.
@end deffn

In the following example, the result is 8, since the match runs between
characters 4 and 8 (i.e.@: the ``2002'').

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:end s)
@result{} 8
@end lisp

@c begin (scm-doc-string "regex.scm" "match:prefix")
@deffn {Scheme Procedure} match:prefix match
Return the unmatched portion of @var{target} preceding the regexp match.

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:prefix s)
@result{} "blah"
@end lisp
@end deffn

@c begin (scm-doc-string "regex.scm" "match:suffix")
@deffn {Scheme Procedure} match:suffix match
Return the unmatched portion of @var{target} following the regexp match.

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:suffix s)
@result{} "foo"
@end lisp
@end deffn

@c begin (scm-doc-string "regex.scm" "match:count")
@deffn {Scheme Procedure} match:count match
Return the number of parenthesized subexpressions from @var{match}.
Note that the entire regular expression match itself counts as a
subexpression, and failed submatches are included in the count.
@end deffn
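
The counting rule can be sketched with a pattern whose first
parenthesized alternative fails to match. The name @code{m} is
hypothetical; the accessors are the ones described in this section.

```scheme
(use-modules (ice-9 regex))

;; Two parenthesized subexpressions, only the second of which
;; can match the target "b".
(define m (string-match "(a)|(b)" "b"))

(match:count m)         ; => 3: the whole match plus two submatches
(match:substring m 0)   ; => "b"
(match:substring m 1)   ; => #f, the failed submatch is still counted
(match:substring m 2)   ; => "b"
```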

@c begin (scm-doc-string "regex.scm" "match:string")
@deffn {Scheme Procedure} match:string match
Return the original @var{target} string.

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:string s)
@result{} "blah2002foo"
@end lisp
@end deffn


@node Backslash Escapes
@subsection Backslash Escapes

Sometimes you will want a regexp to match characters like @samp{*} or
@samp{$} exactly. For example, to check whether a particular string
represents a menu entry from an Info node, it would be useful to match
it against a regexp like @samp{^* [^:]*::}. However, this won't work;
because the asterisk is a metacharacter, it won't match the @samp{*} at
the beginning of the string. In this case, we want to make the first
asterisk an ordinary character.

You can do this by preceding the metacharacter with a backslash
character @samp{\}. (This is also called @dfn{quoting} the
metacharacter, and is known as a @dfn{backslash escape}.) When Guile
sees a backslash in a regular expression, it considers the following
glyph to be an ordinary character, no matter what special meaning it
would ordinarily have. Therefore, we can make the above example work by
changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
the regular expression engine to match only a single asterisk in the
target string.

Since the backslash is itself a metacharacter, you may force a regexp to
match a backslash in the target string by preceding the backslash with
itself. For example, to find variable references in a @TeX{} program,
you might want to find occurrences of the string @samp{\let\} followed
by any number of alphabetic characters. The regular expression
@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
regexp each match a single backslash in the target string.

@c begin (scm-doc-string "regex.scm" "regexp-quote")
@deffn {Scheme Procedure} regexp-quote str
Quote each special character found in @var{str} with a backslash, and
return the resulting string.
@end deffn
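
A minimal sketch of why quoting matters follows. The names
@code{literal}, @code{unquoted} and @code{quoted} are hypothetical. The
example checks only what the quoted pattern matches, not the exact
text that @code{regexp-quote} returns.

```scheme
(use-modules (ice-9 regex))

(define literal "1+1=2")

;; Used directly as a pattern, + means "one or more", so the pattern
;; needs the text "11=2" and fails to match the literal string.
(define unquoted (string-match literal "1+1=2"))
;; unquoted => #f

;; Quoting first makes every character match itself.
(define quoted
  (string-match (regexp-quote literal) "so 1+1=2 holds"))
(match:substring quoted)
;; => "1+1=2"
```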

@strong{Very important:} Using backslash escapes in Guile source code
(as in Emacs Lisp or C) can be tricky, because the backslash character
has special meaning for the Guile reader. For example, if Guile
encounters the character sequence @samp{\n} in the middle of a string
while processing Scheme code, it replaces those characters with a
newline character. Similarly, the character sequence @samp{\t} is
replaced by a horizontal tab. Several of these @dfn{escape sequences}
are processed by the Guile reader before your code is executed.
Unrecognized escape sequences are ignored: if the characters @samp{\*}
appear in a string, they will be translated to the single character
@samp{*}.

This translation is obviously undesirable for regular expressions, since
we want to be able to include backslashes in a string in order to
escape regexp metacharacters. Therefore, to make sure that a backslash
is preserved in a string in your Guile program, you must use @emph{two}
consecutive backslashes:

@lisp
(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
@end lisp

The string in this example is preprocessed by the Guile reader before
any code is executed. The resulting argument to @code{make-regexp} is
the string @samp{^\* [^:]*}, which is what we really want.

This also means that in order to write a regular expression that matches
a single backslash character, the regular expression string in the
source code must include @emph{four} backslashes. Each consecutive pair
of backslashes gets translated by the Guile reader to a single
backslash, and the resulting double-backslash is interpreted by the
regexp engine as matching a single backslash character. Hence:

@lisp
(define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
@end lisp
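
The two layers of escaping can be checked at the REPL. This is a
sketch; the variable names are hypothetical.

```scheme
(use-modules (ice-9 regex))

;; The reader turns each pair of backslashes into one character.
(define one-backslash "\\")          ; a one-character string
(define escaped-star "\\*")          ; backslash, asterisk
(define backslash-pattern "\\\\")    ; backslash, backslash

(string-length escaped-star)         ; => 2
(string-length backslash-pattern)    ; => 2

;; As a regexp, the two-backslash pattern matches the single
;; backslash in the three-character target string a, \, b.
(define m (string-match backslash-pattern "a\\b"))
(match:start m)                               ; => 1
(string=? (match:substring m) one-backslash)  ; => #t
```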

The reason for the unwieldiness of this syntax is historical. Both
regular expression pattern matchers and Unix string processing systems
have traditionally used backslashes with the special meanings
described above. The POSIX regular expression specification and ANSI C
standard both require these semantics. Attempting to abandon either
convention would cause other kinds of compatibility problems, possibly
more severe ones. Therefore, without extending the Scheme reader to
support strings with different quoting conventions (an ungainly and
confusing extension when implemented in other languages), we must adhere
to this cumbersome escape syntax.