Misc textual editing
[bpt/guile.git] / doc / ref / api-regex.texi
1 @c -*-texinfo-*-
2 @c This is part of the GNU Guile Reference Manual.
3 @c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010
4 @c Free Software Foundation, Inc.
5 @c See the file guile.texi for copying conditions.
6
7 @node Regular Expressions
8 @section Regular Expressions
9 @tpindex Regular expressions
10
11 @cindex regular expressions
12 @cindex regex
13 @cindex emacs regexp
14
15 A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
16 describes a whole class of strings. A full description of regular
17 expressions and their syntax is beyond the scope of this manual;
18 an introduction can be found in the Emacs manual (@pxref{Regexps,
19 , Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
20 in many general Unix reference books.
21
22 If your system does not include a POSIX regular expression library,
23 and you have not linked Guile with a third-party regexp library such
24 as Rx, these functions will not be available. You can tell whether
25 your Guile installation includes regular expression support by
26 checking whether @code{(provided? 'regex)} returns true.
27
28 The following regexp and string matching features are provided by the
29 @code{(ice-9 regex)} module. Before using the described functions,
30 you should load this module by executing @code{(use-modules (ice-9
31 regex))}.
32
33 @menu
34 * Regexp Functions:: Functions that create and match regexps.
35 * Match Structures:: Finding what was matched by a regexp.
36 * Backslash Escapes:: Removing the special meaning of regexp
37 meta-characters.
38 @end menu
39
40
41 @node Regexp Functions
42 @subsection Regexp Functions
43
44 By default, Guile supports POSIX extended regular expressions.
45 That means that the characters @samp{(}, @samp{)}, @samp{+} and
46 @samp{?} are special, and must be escaped if you wish to match the
47 literal characters.
48
49 This regular expression interface was modeled after that
50 implemented by SCSH, the Scheme Shell. It is intended to be
51 upwardly compatible with SCSH regular expressions.
52
53 Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
54 strings, since the underlying C functions treat that as the end of
55 string. If there's a zero byte an error is thrown.
56
57 Patterns and input strings are treated as being in the locale
58 character set if @code{setlocale} has been called (@pxref{Locales}),
59 and in a multibyte locale this includes treating multi-byte sequences
60 as a single character. (Guile strings are currently merely bytes,
61 though this may change in the future, @xref{Conversion to/from C}.)
62
63 @deffn {Scheme Procedure} string-match pattern str [start]
64 Compile the string @var{pattern} into a regular expression and compare
65 it with @var{str}. The optional numeric argument @var{start} specifies
66 the position of @var{str} at which to begin matching.
67
68 @code{string-match} returns a @dfn{match structure} which
69 describes what, if anything, was matched by the regular
70 expression. @xref{Match Structures}. If @var{str} does not match
71 @var{pattern} at all, @code{string-match} returns @code{#f}.
72 @end deffn
73
74 Two examples of a match follow. In the first example, the pattern
75 matches the four digits in the match string. In the second, the pattern
76 matches nothing.
77
78 @example
79 (string-match "[0-9][0-9][0-9][0-9]" "blah2002")
80 @result{} #("blah2002" (4 . 8))
81
82 (string-match "[A-Za-z]" "123456")
83 @result{} #f
84 @end example
85
86 Each time @code{string-match} is called, it must compile its
87 @var{pattern} argument into a regular expression structure. This
88 operation is expensive, which makes @code{string-match} inefficient if
89 the same regular expression is used several times (for example, in a
90 loop). For better performance, you can compile a regular expression in
91 advance and then match strings against the compiled regexp.
92
93 @deffn {Scheme Procedure} make-regexp pat flag@dots{}
94 @deffnx {C Function} scm_make_regexp (pat, flaglst)
95 Compile the regular expression described by @var{pat}, and
96 return the compiled regexp structure. If @var{pat} does not
97 describe a legal regular expression, @code{make-regexp} throws
98 a @code{regular-expression-syntax} error.
99
100 The @var{flag} arguments change the behavior of the compiled
101 regular expression. The following values may be supplied:
102
103 @defvar regexp/icase
104 Consider uppercase and lowercase letters to be the same when
105 matching.
106 @end defvar
107
108 @defvar regexp/newline
109 If a newline appears in the target string, then permit the
110 @samp{^} and @samp{$} operators to match immediately after or
111 immediately before the newline, respectively. Also, the
112 @samp{.} and @samp{[^...]} operators will never match a newline
113 character. The intent of this flag is to treat the target
114 string as a buffer containing many lines of text, and the
115 regular expression as a pattern that may match a single one of
116 those lines.
117 @end defvar
118
119 @defvar regexp/basic
120 Compile a basic (``obsolete'') regexp instead of the extended
121 (``modern'') regexps that are the default. Basic regexps do
122 not consider @samp{|}, @samp{+} or @samp{?} to be special
123 characters, and require the @samp{@{...@}} and @samp{(...)}
124 metacharacters to be backslash-escaped (@pxref{Backslash
125 Escapes}). There are several other differences between basic
126 and extended regular expressions, but these are the most
127 significant.
128 @end defvar
129
130 @defvar regexp/extended
131 Compile an extended regular expression rather than a basic
132 regexp. This is the default behavior; this flag will not
133 usually be needed. If a call to @code{make-regexp} includes
134 both @code{regexp/basic} and @code{regexp/extended} flags, the
135 one which comes last will override the earlier one.
136 @end defvar
137 @end deffn
138
139 @deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
140 @deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
141 Match the compiled regular expression @var{rx} against
142 @code{str}. If the optional integer @var{start} argument is
143 provided, begin matching from that position in the string.
144 Return a match structure describing the results of the match,
145 or @code{#f} if no match could be found.
146
147 The @var{flags} argument changes the matching behavior. The following
148 flag values may be supplied, use @code{logior} (@pxref{Bitwise
149 Operations}) to combine them,
150
151 @defvar regexp/notbol
152 Consider that the @var{start} offset into @var{str} is not the
153 beginning of a line and should not match operator @samp{^}.
154
155 If @var{rx} was created with the @code{regexp/newline} option above,
156 @samp{^} will still match after a newline in @var{str}.
157 @end defvar
158
159 @defvar regexp/noteol
160 Consider that the end of @var{str} is not the end of a line and should
161 not match operator @samp{$}.
162
163 If @var{rx} was created with the @code{regexp/newline} option above,
164 @samp{$} will still match before a newline in @var{str}.
165 @end defvar
166 @end deffn
167
168 @lisp
169 ;; Regexp to match uppercase letters
170 (define r (make-regexp "[A-Z]*"))
171
172 ;; Regexp to match letters, ignoring case
173 (define ri (make-regexp "[A-Z]*" regexp/icase))
174
175 ;; Search for bob using regexp r
176 (match:substring (regexp-exec r "bob"))
177 @result{} "" ; no match
178
179 ;; Search for bob using regexp ri
180 (match:substring (regexp-exec ri "Bob"))
181 @result{} "Bob" ; matched case insensitive
182 @end lisp
183
184 @deffn {Scheme Procedure} regexp? obj
185 @deffnx {C Function} scm_regexp_p (obj)
186 Return @code{#t} if @var{obj} is a compiled regular expression,
187 or @code{#f} otherwise.
188 @end deffn
189
190 @sp 1
191 @deffn {Scheme Procedure} list-matches regexp str [flags]
192 Return a list of match structures which are the non-overlapping
193 matches of @var{regexp} in @var{str}. @var{regexp} can be either a
194 pattern string or a compiled regexp. The @var{flags} argument is as
195 per @code{regexp-exec} above.
196
197 @example
198 (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
199 @result{} ("abc" "def")
200 @end example
201 @end deffn
202
203 @deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
204 Apply @var{proc} to the non-overlapping matches of @var{regexp} in
205 @var{str}, to build a result. @var{regexp} can be either a pattern
206 string or a compiled regexp. The @var{flags} argument is as per
207 @code{regexp-exec} above.
208
209 @var{proc} is called as @code{(@var{proc} match prev)} where
210 @var{match} is a match structure and @var{prev} is the previous return
211 from @var{proc}. For the first call @var{prev} is the given
212 @var{init} parameter. @code{fold-matches} returns the final value
213 from @var{proc}.
214
215 For example to count matches,
216
217 @example
218 (fold-matches "[a-z][0-9]" "abc x1 def y2" 0
219 (lambda (match count)
220 (1+ count)))
221 @result{} 2
222 @end example
223 @end deffn
224
225 @sp 1
226 Regular expressions are commonly used to find patterns in one string
227 and replace them with the contents of another string. The following
228 functions are convenient ways to do this.
229
230 @c begin (scm-doc-string "regex.scm" "regexp-substitute")
231 @deffn {Scheme Procedure} regexp-substitute port match [item@dots{}]
232 Write to @var{port} selected parts of the match structure @var{match}.
233 Or if @var{port} is @code{#f} then form a string from those parts and
234 return that.
235
236 Each @var{item} specifies a part to be written, and may be one of the
237 following,
238
239 @itemize @bullet
240 @item
241 A string. String arguments are written out verbatim.
242
243 @item
244 An integer. The submatch with that number is written
245 (@code{match:substring}). Zero is the entire match.
246
247 @item
248 The symbol @samp{pre}. The portion of the matched string preceding
249 the regexp match is written (@code{match:prefix}).
250
251 @item
252 The symbol @samp{post}. The portion of the matched string following
253 the regexp match is written (@code{match:suffix}).
254 @end itemize
255
256 For example, changing a match and retaining the text before and after,
257
258 @example
259 (regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
260 'pre "37" 'post)
261 @result{} "number 37 is good"
262 @end example
263
264 Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
265 re-ordering and hyphenating the fields.
266
267 @lisp
268 (define date-regex
269 "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
270 (define s "Date 20020429 12am.")
271 (regexp-substitute #f (string-match date-regex s)
272 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
273 @result{} "Date 04-29-2002 12am. (20020429)"
274 @end lisp
275 @end deffn
276
277
278 @c begin (scm-doc-string "regex.scm" "regexp-substitute")
279 @deffn {Scheme Procedure} regexp-substitute/global port regexp target [item@dots{}]
280 @cindex search and replace
281 Write to @var{port} selected parts of matches of @var{regexp} in
282 @var{target}. If @var{port} is @code{#f} then form a string from
283 those parts and return that. @var{regexp} can be a string or a
284 compiled regex.
285
286 This is similar to @code{regexp-substitute}, but allows global
287 substitutions on @var{target}. Each @var{item} behaves as per
288 @code{regexp-substitute}, with the following differences,
289
290 @itemize @bullet
291 @item
292 A function. Called as @code{(@var{item} match)} with the match
293 structure for the @var{regexp} match, it should return a string to be
294 written to @var{port}.
295
296 @item
297 The symbol @samp{post}. This doesn't output anything, but instead
298 causes @code{regexp-substitute/global} to recurse on the unmatched
299 portion of @var{target}.
300
301 This @emph{must} be supplied to perform a global search and replace on
302 @var{target}; without it @code{regexp-substitute/global} returns after
303 a single match and output.
304 @end itemize
305
306 For example, to collapse runs of tabs and spaces to a single hyphen
307 each,
308
309 @example
310 (regexp-substitute/global #f "[ \t]+" "this is the text"
311 'pre "-" 'post)
312 @result{} "this-is-the-text"
313 @end example
314
315 Or using a function to reverse the letters in each word,
316
317 @example
318 (regexp-substitute/global #f "[a-z]+" "to do and not-do"
319 'pre (lambda (m) (string-reverse (match:substring m))) 'post)
320 @result{} "ot od dna ton-od"
321 @end example
322
323 Without the @code{post} symbol, just one regexp match is made. For
324 example the following is the date example from
325 @code{regexp-substitute} above, without the need for the separate
326 @code{string-match} call.
327
328 @lisp
329 (define date-regex
330 "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
331 (define s "Date 20020429 12am.")
332 (regexp-substitute/global #f date-regex s
333 'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
334
335 @result{} "Date 04-29-2002 12am. (20020429)"
336 @end lisp
337 @end deffn
338
339
340 @node Match Structures
341 @subsection Match Structures
342
343 @cindex match structures
344
345 A @dfn{match structure} is the object returned by @code{string-match} and
346 @code{regexp-exec}. It describes which portion of a string, if any,
347 matched the given regular expression. Match structures include: a
348 reference to the string that was checked for matches; the starting and
349 ending positions of the regexp match; and, if the regexp included any
350 parenthesized subexpressions, the starting and ending positions of each
351 submatch.
352
353 In each of the regexp match functions described below, the @code{match}
354 argument must be a match structure returned by a previous call to
355 @code{string-match} or @code{regexp-exec}. Most of these functions
356 return some information about the original target string that was
357 matched against a regular expression; we will call that string
358 @var{target} for easy reference.
359
360 @c begin (scm-doc-string "regex.scm" "regexp-match?")
361 @deffn {Scheme Procedure} regexp-match? obj
362 Return @code{#t} if @var{obj} is a match structure returned by a
363 previous call to @code{regexp-exec}, or @code{#f} otherwise.
364 @end deffn
365
366 @c begin (scm-doc-string "regex.scm" "match:substring")
367 @deffn {Scheme Procedure} match:substring match [n]
368 Return the portion of @var{target} matched by subexpression number
369 @var{n}. Submatch 0 (the default) represents the entire regexp match.
370 If the regular expression as a whole matched, but the subexpression
371 number @var{n} did not match, return @code{#f}.
372 @end deffn
373
374 @lisp
375 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
376 (match:substring s)
377 @result{} "2002"
378
379 ;; match starting at offset 6 in the string
380 (match:substring
381 (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
382 @result{} "7654"
383 @end lisp
384
385 @c begin (scm-doc-string "regex.scm" "match:start")
386 @deffn {Scheme Procedure} match:start match [n]
387 Return the starting position of submatch number @var{n}.
388 @end deffn
389
390 In the following example, the result is 4, since the match starts at
391 character index 4:
392
393 @lisp
394 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
395 (match:start s)
396 @result{} 4
397 @end lisp
398
399 @c begin (scm-doc-string "regex.scm" "match:end")
400 @deffn {Scheme Procedure} match:end match [n]
401 Return the ending position of submatch number @var{n}.
402 @end deffn
403
404 In the following example, the result is 8, since the match runs between
405 characters 4 and 8 (i.e.@: the ``2002'').
406
407 @lisp
408 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
409 (match:end s)
410 @result{} 8
411 @end lisp
412
413 @c begin (scm-doc-string "regex.scm" "match:prefix")
414 @deffn {Scheme Procedure} match:prefix match
415 Return the unmatched portion of @var{target} preceding the regexp match.
416
417 @lisp
418 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
419 (match:prefix s)
420 @result{} "blah"
421 @end lisp
422 @end deffn
423
424 @c begin (scm-doc-string "regex.scm" "match:suffix")
425 @deffn {Scheme Procedure} match:suffix match
426 Return the unmatched portion of @var{target} following the regexp match.
427 @end deffn
428
429 @lisp
430 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
431 (match:suffix s)
432 @result{} "foo"
433 @end lisp
434
435 @c begin (scm-doc-string "regex.scm" "match:count")
436 @deffn {Scheme Procedure} match:count match
437 Return the number of parenthesized subexpressions from @var{match}.
438 Note that the entire regular expression match itself counts as a
439 subexpression, and failed submatches are included in the count.
440 @end deffn
441
442 @c begin (scm-doc-string "regex.scm" "match:string")
443 @deffn {Scheme Procedure} match:string match
444 Return the original @var{target} string.
445 @end deffn
446
447 @lisp
448 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
449 (match:string s)
450 @result{} "blah2002foo"
451 @end lisp
452
453
454 @node Backslash Escapes
455 @subsection Backslash Escapes
456
457 Sometimes you will want a regexp to match characters like @samp{*} or
458 @samp{$} exactly. For example, to check whether a particular string
459 represents a menu entry from an Info node, it would be useful to match
460 it against a regexp like @samp{^* [^:]*::}. However, this won't work;
461 because the asterisk is a metacharacter, it won't match the @samp{*} at
462 the beginning of the string. In this case, we want to make the first
463 asterisk un-magic.
464
465 You can do this by preceding the metacharacter with a backslash
466 character @samp{\}. (This is also called @dfn{quoting} the
467 metacharacter, and is known as a @dfn{backslash escape}.) When Guile
468 sees a backslash in a regular expression, it considers the following
469 glyph to be an ordinary character, no matter what special meaning it
470 would ordinarily have. Therefore, we can make the above example work by
471 changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
472 the regular expression engine to match only a single asterisk in the
473 target string.
474
475 Since the backslash is itself a metacharacter, you may force a regexp to
476 match a backslash in the target string by preceding the backslash with
477 itself. For example, to find variable references in a @TeX{} program,
478 you might want to find occurrences of the string @samp{\let\} followed
479 by any number of alphabetic characters. The regular expression
480 @samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
481 regexp each match a single backslash in the target string.
482
483 @c begin (scm-doc-string "regex.scm" "regexp-quote")
484 @deffn {Scheme Procedure} regexp-quote str
485 Quote each special character found in @var{str} with a backslash, and
486 return the resulting string.
487 @end deffn
488
489 @strong{Very important:} Using backslash escapes in Guile source code
490 (as in Emacs Lisp or C) can be tricky, because the backslash character
491 has special meaning for the Guile reader. For example, if Guile
492 encounters the character sequence @samp{\n} in the middle of a string
493 while processing Scheme code, it replaces those characters with a
494 newline character. Similarly, the character sequence @samp{\t} is
495 replaced by a horizontal tab. Several of these @dfn{escape sequences}
496 are processed by the Guile reader before your code is executed.
497 Unrecognized escape sequences are ignored: if the characters @samp{\*}
498 appear in a string, they will be translated to the single character
499 @samp{*}.
500
501 This translation is obviously undesirable for regular expressions, since
502 we want to be able to include backslashes in a string in order to
503 escape regexp metacharacters. Therefore, to make sure that a backslash
504 is preserved in a string in your Guile program, you must use @emph{two}
505 consecutive backslashes:
506
507 @lisp
508 (define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
509 @end lisp
510
511 The string in this example is preprocessed by the Guile reader before
512 any code is executed. The resulting argument to @code{make-regexp} is
513 the string @samp{^\* [^:]*}, which is what we really want.
514
515 This also means that in order to write a regular expression that matches
516 a single backslash character, the regular expression string in the
517 source code must include @emph{four} backslashes. Each consecutive pair
518 of backslashes gets translated by the Guile reader to a single
519 backslash, and the resulting double-backslash is interpreted by the
520 regexp engine as matching a single backslash character. Hence:
521
522 @lisp
523 (define tex-variable-pattern (make-regexp "\\\\let\\\\=[A-Za-z]*"))
524 @end lisp
525
526 The reason for the unwieldiness of this syntax is historical. Both
527 regular expression pattern matchers and Unix string processing systems
528 have traditionally used backslashes with the special meanings
529 described above. The POSIX regular expression specification and ANSI C
530 standard both require these semantics. Attempting to abandon either
531 convention would cause other kinds of compatibility problems, possibly
532 more severe ones. Therefore, without extending the Scheme reader to
533 support strings with different quoting conventions (an ungainly and
534 confusing extension when implemented in other languages), we must adhere
535 to this cumbersome escape syntax.