@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
@c   Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.

@node Regular Expressions
@section Regular Expressions
@tpindex Regular expressions

@cindex regular expressions
@cindex regex
@cindex emacs regexp

A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
describes a whole class of strings. A full description of regular
expressions and their syntax is beyond the scope of this manual;
an introduction can be found in the Emacs manual (@pxref{Regexps,
, Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
in many general Unix reference books.

If your system does not include a POSIX regular expression library,
and you have not linked Guile with a third-party regexp library such
as Rx, these functions will not be available. You can tell whether
your Guile installation includes regular expression support by
checking whether @code{(provided? 'regex)} returns true.

The following regexp and string matching features are provided by the
@code{(ice-9 regex)} module. Before using the described functions,
you should load this module by executing @code{(use-modules (ice-9
regex))}.
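
As a minimal sketch of the check and the module load described above
(the text of the error message is only illustrative):

@lisp
;; Stop early if this Guile was built without regexp support.
(if (not (provided? 'regex))
    (error "this Guile has no regular expression support"))

;; Make string-match, match:substring and friends available.
(use-modules (ice-9 regex))
@end lisp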

@menu
* Regexp Functions::            Functions that create and match regexps.
* Match Structures::            Finding what was matched by a regexp.
* Backslash Escapes::           Removing the special meaning of regexp
                                meta-characters.
@end menu


@node Regexp Functions
@subsection Regexp Functions

By default, Guile supports POSIX extended regular expressions.
That means that the characters @samp{(}, @samp{)}, @samp{+} and
@samp{?} are special, and must be escaped if you wish to match the
literal characters.

This regular expression interface was modeled after that
implemented by SCSH, the Scheme Shell. It is intended to be
upwardly compatible with SCSH regular expressions.

Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
strings, since the underlying C functions treat that as the end of
string. If there's a zero byte an error is thrown.

Internally, patterns and input strings are converted to the current
locale's encoding, and then passed to the C library's regular expression
routines (@pxref{Regular Expressions,,, libc, The GNU C Library
Reference Manual}). The returned match structures always point to
characters in the strings, not to individual bytes, even in the case of
multi-byte encodings.

@deffn {Scheme Procedure} string-match pattern str [start]
Compile the string @var{pattern} into a regular expression and compare
it with @var{str}. The optional numeric argument @var{start} specifies
the position of @var{str} at which to begin matching.

@code{string-match} returns a @dfn{match structure} which
describes what, if anything, was matched by the regular
expression. @xref{Match Structures}. If @var{str} does not match
@var{pattern} at all, @code{string-match} returns @code{#f}.
@end deffn

Two examples of a match follow. In the first example, the pattern
matches the four digits in the match string. In the second, the pattern
matches nothing.

@example
(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
@result{} #("blah2002" (4 . 8))

(string-match "[A-Za-z]" "123456")
@result{} #f
@end example

Each time @code{string-match} is called, it must compile its
@var{pattern} argument into a regular expression structure. This
operation is expensive, which makes @code{string-match} inefficient if
the same regular expression is used several times (for example, in a
loop). For better performance, you can compile a regular expression in
advance and then match strings against the compiled regexp.
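
The following sketch illustrates the difference; the names
@code{four-digits} and @code{year-in} are invented for this example.
The pattern is compiled once with @code{make-regexp} (described below)
and then reused for every string in the list.

@lisp
(use-modules (ice-9 regex))

;; Compile the pattern once, outside the loop.
(define four-digits (make-regexp "[0-9][0-9][0-9][0-9]"))

;; Return the first run of four digits in STR, or #f if there is none.
(define (year-in str)
  (let ((m (regexp-exec four-digits str)))
    (and m (match:substring m))))

(map year-in '("blah2002" "no digits" "in 1999 and after"))
@result{} ("2002" #f "1999")
@end lisp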

@deffn {Scheme Procedure} make-regexp pat flag@dots{}
@deffnx {C Function} scm_make_regexp (pat, flaglst)
Compile the regular expression described by @var{pat}, and
return the compiled regexp structure. If @var{pat} does not
describe a legal regular expression, @code{make-regexp} throws
a @code{regular-expression-syntax} error.

The @var{flag} arguments change the behavior of the compiled
regular expression. The following values may be supplied:

@defvar regexp/icase
Consider uppercase and lowercase letters to be the same when
matching.
@end defvar

@defvar regexp/newline
If a newline appears in the target string, then permit the
@samp{^} and @samp{$} operators to match immediately after or
immediately before the newline, respectively. Also, the
@samp{.} and @samp{[^...]} operators will never match a newline
character. The intent of this flag is to treat the target
string as a buffer containing many lines of text, and the
regular expression as a pattern that may match a single one of
those lines.
@end defvar

@defvar regexp/basic
Compile a basic (``obsolete'') regexp instead of the extended
(``modern'') regexps that are the default. Basic regexps do
not consider @samp{|}, @samp{+} or @samp{?} to be special
characters, and require the @samp{@{...@}} and @samp{(...)}
metacharacters to be backslash-escaped (@pxref{Backslash
Escapes}). There are several other differences between basic
and extended regular expressions, but these are the most
significant.
@end defvar

@defvar regexp/extended
Compile an extended regular expression rather than a basic
regexp. This is the default behavior; this flag will not
usually be needed. If a call to @code{make-regexp} includes
both @code{regexp/basic} and @code{regexp/extended} flags, the
one which comes last will override the earlier one.
@end defvar
@end deffn
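
As a small illustration of these flags (the variable names are invented
for the example, and the last result assumes a C library that follows
POSIX in treating @samp{+} as an ordinary character in basic regexps):

@lisp
(use-modules (ice-9 regex))

;; Without regexp/newline, ^ matches only at the start of the string.
(regexp-exec (make-regexp "^b[a-z]*") "foo\nbar")
@result{} #f

;; With regexp/newline, ^ also matches just after the newline.
(define line-start (make-regexp "^b[a-z]*" regexp/newline))
(match:substring (regexp-exec line-start "foo\nbar"))
@result{} "bar"

;; Several flags may be given as separate arguments.
(define line-start-ci
  (make-regexp "^b[a-z]*" regexp/newline regexp/icase))
(match:substring (regexp-exec line-start-ci "foo\nBAR"))
@result{} "BAR"

;; In a basic regexp, + is an ordinary character.
(match:substring (regexp-exec (make-regexp "a+" regexp/basic) "aaa+"))
@result{} "a+"
@end lisp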

@deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
@deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
Match the compiled regular expression @var{rx} against
@var{str}. If the optional integer @var{start} argument is
provided, begin matching from that position in the string.
Return a match structure describing the results of the match,
or @code{#f} if no match could be found.

The @var{flags} argument changes the matching behavior. The following
flag values may be supplied; use @code{logior} (@pxref{Bitwise
Operations}) to combine them.

@defvar regexp/notbol
Consider that the @var{start} offset into @var{str} is not the
beginning of a line and should not match operator @samp{^}.

If @var{rx} was created with the @code{regexp/newline} option above,
@samp{^} will still match after a newline in @var{str}.
@end defvar

@defvar regexp/noteol
Consider that the end of @var{str} is not the end of a line and should
not match operator @samp{$}.

If @var{rx} was created with the @code{regexp/newline} option above,
@samp{$} will still match before a newline in @var{str}.
@end defvar
@end deffn

@lisp
;; Regexp to match uppercase letters
(define r (make-regexp "[A-Z]*"))

;; Regexp to match letters, ignoring case
(define ri (make-regexp "[A-Z]*" regexp/icase))

;; Search for bob using regexp r
(match:substring (regexp-exec r "bob"))
@result{} ""                  ; only the empty string matches

;; Search for Bob using regexp ri
(match:substring (regexp-exec ri "Bob"))
@result{} "Bob"               ; matched case insensitive
@end lisp
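
The @var{start} and @var{flags} arguments of @code{regexp-exec} can be
sketched as follows. The name @code{word-at-bol} is invented for the
example; note that a match begun at @var{start} treats that offset as
the beginning of a line unless @code{regexp/notbol} is given.

@lisp
(use-modules (ice-9 regex))

(define word-at-bol (make-regexp "^[a-z]+"))

;; Matching starts at offset 4, which counts as a line start.
(match:substring (regexp-exec word-at-bol "abc def" 4))
@result{} "def"

;; With regexp/notbol, ^ cannot match at the start offset.
(regexp-exec word-at-bol "abc def" 4 regexp/notbol)
@result{} #f

;; Flags are combined with logior.
(regexp-exec (make-regexp "^[a-z]+$") "abc"
             0 (logior regexp/notbol regexp/noteol))
@result{} #f
@end lisp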

@deffn {Scheme Procedure} regexp? obj
@deffnx {C Function} scm_regexp_p (obj)
Return @code{#t} if @var{obj} is a compiled regular expression,
or @code{#f} otherwise.
@end deffn

@sp 1
@deffn {Scheme Procedure} list-matches regexp str [flags]
Return a list of match structures which are the non-overlapping
matches of @var{regexp} in @var{str}. @var{regexp} can be either a
pattern string or a compiled regexp. The @var{flags} argument is as
per @code{regexp-exec} above.

@example
(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
@result{} ("abc" "def")
@end example
@end deffn

@deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
Apply @var{proc} to the non-overlapping matches of @var{regexp} in
@var{str}, to build a result. @var{regexp} can be either a pattern
string or a compiled regexp. The @var{flags} argument is as per
@code{regexp-exec} above.

@var{proc} is called as @code{(@var{proc} match prev)} where
@var{match} is a match structure and @var{prev} is the previous return
from @var{proc}. For the first call @var{prev} is the given
@var{init} parameter. @code{fold-matches} returns the final value
from @var{proc}.

For example to count matches,

@example
(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
              (lambda (match count)
                (1+ count)))
@result{} 2
@end example
@end deffn
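
As a further sketch, the accumulator passed to @code{fold-matches}
need not be a count. Here it first sums the numbers found in a string,
and then collects them into a list (reversed at the end, because
@code{cons} builds the list back to front):

@lisp
(use-modules (ice-9 regex))

(fold-matches "[0-9]+" "abc 42 def 78" 0
              (lambda (match sum)
                (+ sum (string->number (match:substring match)))))
@result{} 120

(reverse
 (fold-matches "[0-9]+" "abc 42 def 78" '()
               (lambda (match acc)
                 (cons (string->number (match:substring match)) acc))))
@result{} (42 78)
@end lisp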

@sp 1
Regular expressions are commonly used to find patterns in one string
and replace them with the contents of another string. The following
functions are convenient ways to do this.

@c begin (scm-doc-string "regex.scm" "regexp-substitute")
@deffn {Scheme Procedure} regexp-substitute port match item @dots{}
Write to @var{port} selected parts of the match structure @var{match}.
Or if @var{port} is @code{#f} then form a string from those parts and
return that.

Each @var{item} specifies a part to be written, and may be one of the
following,

@itemize @bullet
@item
A string. String arguments are written out verbatim.

@item
An integer. The submatch with that number is written
(@code{match:substring}). Zero is the entire match.

@item
The symbol @samp{pre}. The portion of the matched string preceding
the regexp match is written (@code{match:prefix}).

@item
The symbol @samp{post}. The portion of the matched string following
the regexp match is written (@code{match:suffix}).
@end itemize

For example, changing a match and retaining the text before and after,

@example
(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
                   'pre "37" 'post)
@result{} "number 37 is good"
@end example

Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
re-ordering and hyphenating the fields.

@lisp
(define date-regex
  "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute #f (string-match date-regex s)
                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
@end deffn
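
Since @code{string-match} returns @code{#f} when nothing matches, it is
usually wise to test the match before passing it to
@code{regexp-substitute}. A minimal sketch, using the invented helper
name @code{swap-pair}:

@lisp
(use-modules (ice-9 regex))

;; Rewrite the first name=number pair in STR as number=name, or
;; return STR unchanged if there is no such pair.
(define (swap-pair str)
  (let ((m (string-match "([a-z]+)=([0-9]+)" str)))
    (if m
        (regexp-substitute #f m 'pre 2 "=" 1 'post)
        str)))

(swap-pair "set width=80 now")
@result{} "set 80=width now"

(swap-pair "nothing to swap")
@result{} "nothing to swap"
@end lisp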


@c begin (scm-doc-string "regex.scm" "regexp-substitute/global")
@deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
@cindex search and replace
Write to @var{port} selected parts of matches of @var{regexp} in
@var{target}. If @var{port} is @code{#f} then form a string from
those parts and return that. @var{regexp} can be a string or a
compiled regex.

This is similar to @code{regexp-substitute}, but allows global
substitutions on @var{target}. Each @var{item} behaves as per
@code{regexp-substitute}, with the following differences,

@itemize @bullet
@item
A function. Called as @code{(@var{item} match)} with the match
structure for the @var{regexp} match, it should return a string to be
written to @var{port}.

@item
The symbol @samp{post}. This doesn't output anything, but instead
causes @code{regexp-substitute/global} to recurse on the unmatched
portion of @var{target}.

This @emph{must} be supplied to perform a global search and replace on
@var{target}; without it @code{regexp-substitute/global} returns after
a single match and output.
@end itemize

For example, to collapse runs of tabs and spaces to a single hyphen
each,

@example
(regexp-substitute/global #f "[ \t]+" "this   is   the text"
                          'pre "-" 'post)
@result{} "this-is-the-text"
@end example

Or using a function to reverse the letters in each word,

@example
(regexp-substitute/global #f "[a-z]+" "to do and not-do"
  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
@result{} "ot od dna ton-od"
@end example

Without the @code{post} symbol, just one regexp match is made. For
example the following is the date example from
@code{regexp-substitute} above, without the need for the separate
@code{string-match} call.

@lisp
(define date-regex
  "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute/global #f date-regex s
                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")

@result{} "Date 04-29-2002 12am. (20020429)"
@end lisp
@end deffn
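
The effect of @code{post} can be seen by running the same substitution
with and without it. This is only a sketch on an arbitrary input
string:

@lisp
(use-modules (ice-9 regex))

;; With post, every run of digits is replaced.
(regexp-substitute/global #f "[0-9]+" "a1b22c333" 'pre "#" 'post)
@result{} "a#b#c#"

;; Without post, output stops after the first match.
(regexp-substitute/global #f "[0-9]+" "a1b22c333" 'pre "#")
@result{} "a#"
@end lisp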


@node Match Structures
@subsection Match Structures

@cindex match structures

A @dfn{match structure} is the object returned by @code{string-match} and
@code{regexp-exec}. It describes which portion of a string, if any,
matched the given regular expression. Match structures include: a
reference to the string that was checked for matches; the starting and
ending positions of the regexp match; and, if the regexp included any
parenthesized subexpressions, the starting and ending positions of each
submatch.

In each of the regexp match functions described below, the @code{match}
argument must be a match structure returned by a previous call to
@code{string-match} or @code{regexp-exec}. Most of these functions
return some information about the original target string that was
matched against a regular expression; we will call that string
@var{target} for easy reference.

@c begin (scm-doc-string "regex.scm" "regexp-match?")
@deffn {Scheme Procedure} regexp-match? obj
Return @code{#t} if @var{obj} is a match structure returned by a
previous call to @code{regexp-exec}, or @code{#f} otherwise.
@end deffn

@c begin (scm-doc-string "regex.scm" "match:substring")
@deffn {Scheme Procedure} match:substring match [n]
Return the portion of @var{target} matched by subexpression number
@var{n}. Submatch 0 (the default) represents the entire regexp match.
If the regular expression as a whole matched, but the subexpression
number @var{n} did not match, return @code{#f}.
@end deffn

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:substring s)
@result{} "2002"

;; match starting at offset 6 in the string
(match:substring
  (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
@result{} "7654"
@end lisp
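
The case of a subexpression that did not take part in the match can be
sketched with an alternation, where only one branch can match:

@lisp
(use-modules (ice-9 regex))

(define m (string-match "(a)|(b)" "b"))

(match:substring m)     ; the entire match
@result{} "b"
(match:substring m 1)   ; the (a) branch did not match
@result{} #f
(match:substring m 2)   ; the (b) branch did
@result{} "b"
@end lisp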

@c begin (scm-doc-string "regex.scm" "match:start")
@deffn {Scheme Procedure} match:start match [n]
Return the starting position of submatch number @var{n}.
@end deffn

In the following example, the result is 4, since the match starts at
character index 4:

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:start s)
@result{} 4
@end lisp

@c begin (scm-doc-string "regex.scm" "match:end")
@deffn {Scheme Procedure} match:end match [n]
Return the ending position of submatch number @var{n}.
@end deffn

In the following example, the result is 8, since the match runs between
characters 4 and 8 (i.e.@: the ``2002'').

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:end s)
@result{} 8
@end lisp

@c begin (scm-doc-string "regex.scm" "match:prefix")
@deffn {Scheme Procedure} match:prefix match
Return the unmatched portion of @var{target} preceding the regexp match.

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:prefix s)
@result{} "blah"
@end lisp
@end deffn

@c begin (scm-doc-string "regex.scm" "match:suffix")
@deffn {Scheme Procedure} match:suffix match
Return the unmatched portion of @var{target} following the regexp match.
@end deffn

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:suffix s)
@result{} "foo"
@end lisp

@c begin (scm-doc-string "regex.scm" "match:count")
@deffn {Scheme Procedure} match:count match
Return the number of parenthesized subexpressions from @var{match}.
Note that the entire regular expression match itself counts as a
subexpression, and failed submatches are included in the count.
@end deffn
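
For instance, a regexp with two parenthesized groups gives a count of
three. The sketch below also uses the count to collect every submatch;
@code{iota} returns the list of indices starting from zero:

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([0-9]+)-([0-9]+)" "range 10-20"))

(match:count m)
@result{} 3

(map (lambda (n) (match:substring m n))
     (iota (match:count m)))
@result{} ("10-20" "10" "20")
@end lisp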

@c begin (scm-doc-string "regex.scm" "match:string")
@deffn {Scheme Procedure} match:string match
Return the original @var{target} string.
@end deffn

@lisp
(define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
(match:string s)
@result{} "blah2002foo"
@end lisp


@node Backslash Escapes
@subsection Backslash Escapes

Sometimes you will want a regexp to match characters like @samp{*} or
@samp{$} exactly. For example, to check whether a particular string
represents a menu entry from an Info node, it would be useful to match
it against a regexp like @samp{^* [^:]*::}. However, this won't work;
because the asterisk is a metacharacter, it won't match the @samp{*} at
the beginning of the string. In this case, we want to make the first
asterisk un-magic.

You can do this by preceding the metacharacter with a backslash
character @samp{\}. (This is also called @dfn{quoting} the
metacharacter, and is known as a @dfn{backslash escape}.) When Guile
sees a backslash in a regular expression, it considers the following
glyph to be an ordinary character, no matter what special meaning it
would ordinarily have. Therefore, we can make the above example work by
changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells
the regular expression engine to match only a single asterisk in the
target string.

Since the backslash is itself a metacharacter, you may force a regexp to
match a backslash in the target string by preceding the backslash with
itself. For example, to find variable references in a @TeX{} program,
you might want to find occurrences of the string @samp{\let\} followed
by any number of alphabetic characters. The regular expression
@samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
regexp each match a single backslash in the target string.

@c begin (scm-doc-string "regex.scm" "regexp-quote")
@deffn {Scheme Procedure} regexp-quote str
Quote each special character found in @var{str} with a backslash, and
return the resulting string.
@end deffn
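
As a sketch of why @code{regexp-quote} is useful, compare matching a
piece of literal text with and without quoting. Unquoted, the
@samp{.} and @samp{*} in @samp{1.5*} act as metacharacters:

@lisp
(use-modules (ice-9 regex))

;; Unquoted: "1", then any character, then zero or more "5"s.
(match:substring (string-match "1.5*" "about 125 units"))
@result{} "125"

;; Quoted: only the literal text 1.5* matches.
(string-match (regexp-quote "1.5*") "about 125 units")
@result{} #f
(match:substring
 (string-match (regexp-quote "1.5*") "rated 1.5* overall"))
@result{} "1.5*"
@end lisp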

@strong{Very important:} Using backslash escapes in Guile source code
(as in Emacs Lisp or C) can be tricky, because the backslash character
has special meaning for the Guile reader. For example, if Guile
encounters the character sequence @samp{\n} in the middle of a string
while processing Scheme code, it replaces those characters with a
newline character. Similarly, the character sequence @samp{\t} is
replaced by a horizontal tab. Several of these @dfn{escape sequences}
are processed by the Guile reader before your code is executed.
Unrecognized escape sequences are ignored: if the characters @samp{\*}
appear in a string, they will be translated to the single character
@samp{*}.

This translation is obviously undesirable for regular expressions, since
we want to be able to include backslashes in a string in order to
escape regexp metacharacters. Therefore, to make sure that a backslash
is preserved in a string in your Guile program, you must use @emph{two}
consecutive backslashes:

@lisp
(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
@end lisp

The string in this example is preprocessed by the Guile reader before
any code is executed. The resulting argument to @code{make-regexp} is
the string @samp{^\* [^:]*}, which is what we really want.

This also means that in order to write a regular expression that matches
a single backslash character, the regular expression string in the
source code must include @emph{four} backslashes. Each consecutive pair
of backslashes gets translated by the Guile reader to a single
backslash, and the resulting double-backslash is interpreted by the
regexp engine as matching a single backslash character. Hence:

@lisp
(define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
@end lisp
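
The two layers of processing can be checked at the REPL. In this
sketch the sample target strings are made up; note that results are
shown in @code{write} notation, so each backslash in a result string
again appears doubled:

@lisp
(use-modules (ice-9 regex))

;; The reader turns the two backslashes into one, leaving the
;; nine-character regexp ^\* [^:]*
(string-length "^\\* [^:]*")
@result{} 9

(match:substring (string-match "^\\* [^:]*" "* Regexp Functions::"))
@result{} "* Regexp Functions"

;; Four backslashes in the source match one in the target \let\foo=1
(match:substring
 (string-match "\\\\let\\\\[A-Za-z]*" "\\let\\foo=1"))
@result{} "\\let\\foo"
@end lisp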

The reason for the unwieldiness of this syntax is historical. Both
regular expression pattern matchers and Unix string processing systems
have traditionally used backslashes with the special meanings
described above. The POSIX regular expression specification and ANSI C
standard both require these semantics. Attempting to abandon either
convention would cause other kinds of compatibility problems, possibly
more severe ones. Therefore, without extending the Scheme reader to
support strings with different quoting conventions (an ungainly and
confusing extension when implemented in other languages), we must adhere
to this cumbersome escape syntax.