doc/ref/api-regex.texi

   1 @c -*-texinfo-*-
   2 @c This is part of the GNU Guile Reference Manual.
   3 @c Copyright (C)  1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010
   4 @c   Free Software Foundation, Inc.
   5 @c See the file guile.texi for copying conditions.
   6
   7 @node Regular Expressions
   8 @section Regular Expressions
   9 @tpindex Regular expressions
  10
  11 @cindex regular expressions
  12 @cindex regex
  13 @cindex emacs regexp
  14
  15 A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
  16 describes a whole class of strings.  A full description of regular
  17 expressions and their syntax is beyond the scope of this manual;
  18 an introduction can be found in the Emacs manual (@pxref{Regexps,
  19 , Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
  20 in many general Unix reference books.
  21
  22 If your system does not include a POSIX regular expression library,
  23 and you have not linked Guile with a third-party regexp library such
  24 as Rx, these functions will not be available.  You can tell whether
  25 your Guile installation includes regular expression support by
  26 checking whether @code{(provided? 'regex)} returns true.
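
As a minimal sketch of that check, a program might refuse to continue
when regexp support is missing.  The wording of the error message is
illustrative only.

@lisp
;; Signal an error early if this Guile lacks regexp support.
(if (not (provided? 'regex))
    (error "this program needs Guile regexp support"))
@end lisp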
  27
  28 The following regexp and string matching features are provided by the
  29 @code{(ice-9 regex)} module.  Before using the described functions,
  30 you should load this module by executing @code{(use-modules (ice-9
  31 regex))}.
  32
  33 @menu
  34 * Regexp Functions::            Functions that create and match regexps.
  35 * Match Structures::            Finding what was matched by a regexp.
  36 * Backslash Escapes::           Removing the special meaning of regexp
  37                                 meta-characters.
  38 @end menu
  39
  40
  41 @node Regexp Functions
  42 @subsection Regexp Functions
  43
  44 By default, Guile supports POSIX extended regular expressions.
  45 That means that the characters @samp{(}, @samp{)}, @samp{+} and
  46 @samp{?} are special, and must be escaped if you wish to match the
  47 literal characters.
  48
  49 This regular expression interface was modeled after that
  50 implemented by SCSH, the Scheme Shell.  It is intended to be
  51 upwardly compatible with SCSH regular expressions.
  52
  53 Zero bytes (@code{#\nul}) cannot be used in regex patterns or input
  54 strings, since the underlying C functions treat a zero byte as the end
  55 of the string.  If either contains a zero byte, an error is thrown.
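
The following sketch shows one way to guard against such input.  It
assumes nothing about which error key is thrown, hence the catch-all
key @code{#t}; the symbol @code{regex-error} is only an illustrative
return value.

@lisp
(use-modules (ice-9 regex))

;; A three-character string with a zero byte in the middle.
(define s (string #\a #\nul #\b))

;; Turn whatever error the regexp functions throw into a symbol.
(define result
  (catch #t
    (lambda () (string-match "a" s))
    (lambda (key . args) 'regex-error)))
@end lisp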
  56
  57 Patterns and input strings are treated as being in the locale
  58 character set if @code{setlocale} has been called (@pxref{Locales}),
  59 and in a multibyte locale this includes treating multi-byte sequences
  60 as a single character.  (Guile strings are currently merely bytes,
  61 though this may change in the future; @pxref{Conversion to/from C}.)
  62
  63 @deffn {Scheme Procedure} string-match pattern str [start]
  64 Compile the string @var{pattern} into a regular expression and compare
  65 it with @var{str}.  The optional numeric argument @var{start} specifies
  66 the position of @var{str} at which to begin matching.
  67
  68 @code{string-match} returns a @dfn{match structure} which
  69 describes what, if anything, was matched by the regular
  70 expression.  @xref{Match Structures}.  If @var{str} does not match
  71 @var{pattern} at all, @code{string-match} returns @code{#f}.
  72 @end deffn
  73
  74 Two examples follow.  In the first, the pattern matches the four
  75 digits at the end of the target string.  In the second, the pattern
  76 matches nothing, so @code{string-match} returns @code{#f}.
  77
  78 @example
  79 (string-match "[0-9][0-9][0-9][0-9]" "blah2002")
  80 @result{} #("blah2002" (4 . 8))
  81
  82 (string-match "[A-Za-z]" "123456")
  83 @result{} #f
  84 @end example
  85
  86 Each time @code{string-match} is called, it must compile its
  87 @var{pattern} argument into a regular expression structure.  This
  88 operation is expensive, which makes @code{string-match} inefficient if
  89 the same regular expression is used several times (for example, in a
  90 loop).  For better performance, you can compile a regular expression in
  91 advance and then match strings against the compiled regexp.
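
A minimal sketch of that approach, using @code{make-regexp} and
@code{regexp-exec} (both described below).  The names @code{digits-rx}
and @code{strings-with-digits} are hypothetical, chosen for this
example.

@lisp
(use-modules (ice-9 regex))

;; Compile the pattern once, outside the loop.
(define digits-rx (make-regexp "[0-9]+"))

;; Keep only the strings that contain a run of digits.
(define (strings-with-digits lst)
  (let loop ((lst lst) (acc '()))
    (cond ((null? lst) (reverse acc))
          ((regexp-exec digits-rx (car lst))
           (loop (cdr lst) (cons (car lst) acc)))
          (else (loop (cdr lst) acc)))))

(strings-with-digits '("blah2002" "foo" "x1"))
@result{} ("blah2002" "x1")
@end lisp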
  92
  93 @deffn {Scheme Procedure} make-regexp pat flag@dots{}
  94 @deffnx {C Function} scm_make_regexp (pat, flaglst)
  95 Compile the regular expression described by @var{pat}, and
  96 return the compiled regexp structure.  If @var{pat} does not
  97 describe a valid regular expression, @code{make-regexp} throws
  98 a @code{regular-expression-syntax} error.
  99
 100 The @var{flag} arguments change the behavior of the compiled
 101 regular expression.  The following values may be supplied:
 102
 103 @defvar regexp/icase
 104 Consider uppercase and lowercase letters to be the same when
 105 matching.
 106 @end defvar
 107
 108 @defvar regexp/newline
 109 If a newline appears in the target string, then permit the
 110 @samp{^} and @samp{$} operators to match immediately after or
 111 immediately before the newline, respectively.  Also, the
 112 @samp{.} and @samp{[^...]} operators will never match a newline
 113 character.  The intent of this flag is to treat the target
 114 string as a buffer containing many lines of text, and the
 115 regular expression as a pattern that may match a single one of
 116 those lines.
 117 @end defvar
 118
 119 @defvar regexp/basic
 120 Compile a basic (``obsolete'') regexp instead of the extended
 121 (``modern'') regexps that are the default.  Basic regexps do
 122 not consider @samp{|}, @samp{+} or @samp{?} to be special
 123 characters, and require the @samp{@{...@}} and @samp{(...)}
 124 metacharacters to be backslash-escaped (@pxref{Backslash
 125 Escapes}).  There are several other differences between basic
 126 and extended regular expressions, but these are the most
 127 significant.
 128 @end defvar
 129
 130 @defvar regexp/extended
 131 Compile an extended regular expression rather than a basic
 132 regexp.  This is the default behavior; this flag will not
 133 usually be needed.  If a call to @code{make-regexp} includes
 134 both @code{regexp/basic} and @code{regexp/extended} flags, the
 135 one which comes last will override the earlier one.
 136 @end defvar
 137 @end deffn
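
The effect of @code{regexp/newline} and @code{regexp/basic} can be
sketched as follows.  The target strings are made up for illustration,
and @code{regexp-exec} and @code{match:substring} are described later
in this section.

@lisp
(use-modules (ice-9 regex))

(define text "first line\nsecond line")

;; Without regexp/newline, ^ matches only at the start of the string.
(regexp-exec (make-regexp "^second") text)
@result{} #f

;; With regexp/newline, ^ also matches just after the newline.
(match:substring
  (regexp-exec (make-regexp "^second" regexp/newline) text))
@result{} "second"

;; In a basic regexp, + is an ordinary character.
(match:substring
  (regexp-exec (make-regexp "a+" regexp/basic) "aaa a+"))
@result{} "a+"

;; In an extended regexp (the default), + means "one or more".
(match:substring (regexp-exec (make-regexp "a+") "aaa a+"))
@result{} "aaa"
@end lisp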
 138
 139 @deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
 140 @deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
 141 Match the compiled regular expression @var{rx} against
 142 @var{str}.  If the optional integer @var{start} argument is
 143 provided, begin matching from that position in the string.
 144 Return a match structure describing the results of the match,
 145 or @code{#f} if no match could be found.
 146
 147 The @var{flags} argument changes the matching behavior.  The following
 148 flag values may be supplied; use @code{logior} (@pxref{Bitwise
 149 Operations}) to combine them.
 150
 151 @defvar regexp/notbol
 152 Do not treat the @var{start} offset into @var{str} as the beginning
 153 of a line, so that the @samp{^} operator does not match there.
 154
 155 If @var{rx} was created with the @code{regexp/newline} option above,
 156 @samp{^} will still match after a newline in @var{str}.
 157 @end defvar
 158
 159 @defvar regexp/noteol
 160 Do not treat the end of @var{str} as the end of a line, so that the
 161 @samp{$} operator does not match there.
 162
 163 If @var{rx} was created with the @code{regexp/newline} option above,
 164 @samp{$} will still match before a newline in @var{str}.
 165 @end defvar
 166 @end deffn
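
A short sketch of the @var{start} and @var{flags} arguments, assuming
an anchored pattern chosen only for illustration:

@lisp
(use-modules (ice-9 regex))

(define rx (make-regexp "^abc"))

;; By default the start offset counts as the beginning of a line.
(match:substring (regexp-exec rx "xabc" 1))
@result{} "abc"

;; With regexp/notbol it does not, so ^ cannot match there.
(regexp-exec rx "xabc" 1 regexp/notbol)
@result{} #f

;; Flags are combined with logior.
(regexp-exec (make-regexp "^abc$") "abc" 0
             (logior regexp/notbol regexp/noteol))
@result{} #f
@end lisp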
 167
 168 @lisp
 169 ;; Regexp to match a (possibly empty) run of uppercase letters
 170 (define r (make-regexp "[A-Z]*"))
 171
 172 ;; The same regexp, ignoring case
 173 (define ri (make-regexp "[A-Z]*" regexp/icase))
 174
 175 ;; Match "bob" against r
 176 (match:substring (regexp-exec r "bob"))
 177 @result{} ""                  ; empty match at the start, not #f
 178
 179 ;; Match "Bob" against ri
 180 (match:substring (regexp-exec ri "Bob"))
 181 @result{} "Bob"               ; matched case-insensitively
 182 @end lisp
 183
 184 @deffn {Scheme Procedure} regexp? obj
 185 @deffnx {C Function} scm_regexp_p (obj)
 186 Return @code{#t} if @var{obj} is a compiled regular expression,
 187 or @code{#f} otherwise.
 188 @end deffn
 189
 190 @sp 1
 191 @deffn {Scheme Procedure} list-matches regexp str [flags]
 192 Return a list of match structures which are the non-overlapping
 193 matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
 194 pattern string or a compiled regexp.  The @var{flags} argument is as
 195 per @code{regexp-exec} above.
 196
 197 @example
 198 (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
 199 @result{} ("abc" "def")
 200 @end example
 201 @end deffn
 202
 203 @deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
 204 Apply @var{proc} to the non-overlapping matches of @var{regexp} in
 205 @var{str}, to build a result.  @var{regexp} can be either a pattern
 206 string or a compiled regexp.  The @var{flags} argument is as per
 207 @code{regexp-exec} above.
 208
 209 @var{proc} is called as @code{(@var{proc} match prev)} where
 210 @var{match} is a match structure and @var{prev} is the previous return
 211 from @var{proc}.  For the first call @var{prev} is the given
 212 @var{init} parameter.  @code{fold-matches} returns the final value
 213 from @var{proc}.
 214
 215 For example to count matches,
 216
 217 @example
 218 (fold-matches "[a-z][0-9]" "abc x1 def y2" 0
 219               (lambda (match count)
 220                 (1+ count)))
 221 @result{} 2
 222 @end example
 223 @end deffn
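
Two further sketches of @code{fold-matches}, using made-up input: one
accumulates a number and the other builds a list.

@lisp
(use-modules (ice-9 regex))

;; Sum the numbers found in the string.
(fold-matches "[0-9]+" "abc 42 def 78" 0
              (lambda (m sum)
                (+ sum (string->number (match:substring m)))))
@result{} 120

;; Collect the matched substrings.  Matches are visited left to
;; right, so consing builds the list in reverse order.
(reverse
  (fold-matches "[a-z]+" "abc 42 def 78" '()
                (lambda (m acc)
                  (cons (match:substring m) acc))))
@result{} ("abc" "def")
@end lisp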
 224
 225 @sp 1
 226 Regular expressions are commonly used to find patterns in one string
 227 and replace them with the contents of another string.  The following
 228 functions are convenient ways to do this.
 229
 230 @c begin (scm-doc-string "regex.scm" "regexp-substitute")
 231 @deffn {Scheme Procedure} regexp-substitute port match [item@dots{}]
 232 Write to @var{port} selected parts of the match structure @var{match}.
 233 Or if @var{port} is @code{#f} then form a string from those parts and
 234 return that.
 235
 236 Each @var{item} specifies a part to be written, and may be one of the
 237 following,
 238
 239 @itemize @bullet
 240 @item
 241 A string.  String arguments are written out verbatim.
 242
 243 @item
 244 An integer.  The submatch with that number is written
 245 (@code{match:substring}).  Zero is the entire match.
 246
 247 @item
 248 The symbol @samp{pre}.  The portion of the matched string preceding
 249 the regexp match is written (@code{match:prefix}).
 250
 251 @item
 252 The symbol @samp{post}.  The portion of the matched string following
 253 the regexp match is written (@code{match:suffix}).
 254 @end itemize
 255
 256 For example, changing a match and retaining the text before and after,
 257
 258 @example
 259 (regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
 260                    'pre "37" 'post)
 261 @result{} "number 37 is good"
 262 @end example
 263
 264 Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
 265 re-ordering and hyphenating the fields.
 266
 267 @lisp
 268 (define date-regex
 269    "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
 270 (define s "Date 20020429 12am.")
 271 (regexp-substitute #f (string-match date-regex s)
 272                    'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
 273 @result{} "Date 04-29-2002 12am. (20020429)"
 274 @end lisp
 275 @end deffn
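
When @var{port} is an output port rather than @code{#f}, the parts are
written to it.  The sketch below uses a string port, obtained with
@code{call-with-output-string}, so that the output can be shown as a
value; the @samp{Result: } prefix is illustrative.

@lisp
(use-modules (ice-9 regex))

(call-with-output-string
  (lambda (port)
    (display "Result: " port)
    (regexp-substitute port
                       (string-match "[0-9]+" "number 25 is good")
                       'pre "37" 'post)))
@result{} "Result: number 37 is good"
@end lisp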
 276
 277
 278 @c begin (scm-doc-string "regex.scm" "regexp-substitute/global")
 279 @deffn {Scheme Procedure} regexp-substitute/global port regexp target [item@dots{}]
 280 @cindex search and replace
 281 Write to @var{port} selected parts of matches of @var{regexp} in
 282 @var{target}.  If @var{port} is @code{#f} then form a string from
 283 those parts and return that.  @var{regexp} can be a string or a
 284 compiled regex.
 285
 286 This is similar to @code{regexp-substitute}, but allows global
 287 substitutions on @var{target}.  Each @var{item} behaves as per
 288 @code{regexp-substitute}, with the following differences,
 289
 290 @itemize @bullet
 291 @item
 292 A function.  Called as @code{(@var{item} match)} with the match
 293 structure for the @var{regexp} match, it should return a string to be
 294 written to @var{port}.
 295
 296 @item
 297 The symbol @samp{post}.  This doesn't output anything, but instead
 298 causes @code{regexp-substitute/global} to recurse on the unmatched
 299 portion of @var{target}.
 300
 301 This @emph{must} be supplied to perform a global search and replace on
 302 @var{target}; without it @code{regexp-substitute/global} returns after
 303 a single match and output.
 304 @end itemize
 305
 306 For example, to collapse runs of tabs and spaces to a single hyphen
 307 each,
 308
 309 @example
 310 (regexp-substitute/global #f "[ \t]+"  "this   is   the text"
 311                           'pre "-" 'post)
 312 @result{} "this-is-the-text"
 313 @end example
 314
 315 Or using a function to reverse the letters in each word,
 316
 317 @example
 318 (regexp-substitute/global #f "[a-z]+"  "to do and not-do"
 319   'pre (lambda (m) (string-reverse (match:substring m))) 'post)
 320 @result{} "ot od dna ton-od"
 321 @end example
 322
 323 Without the @code{post} symbol, just one regexp match is made.  The
 324 following repeats the date example from @code{regexp-substitute} above
 325 without the separate @code{string-match} call.  The target contains
 326 only one date, so @code{post} here just writes the rest of the string.
 327
 328 @lisp
 329 (define date-regex
 330    "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
 331 (define s "Date 20020429 12am.")
 332 (regexp-substitute/global #f date-regex s
 333                           'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
 334
 335 @result{} "Date 04-29-2002 12am. (20020429)"
 336 @end lisp
 337 @end deffn
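
The role of @code{post} can be seen by running the same substitution
with and without it.  This is a sketch on made-up input.

@lisp
(use-modules (ice-9 regex))

;; With 'post, every run of digits is replaced.
(regexp-substitute/global #f "[0-9]+" "a1b22c333" 'pre "#" 'post)
@result{} "a#b#c#"

;; Without 'post, only the first match is processed, and the text
;; following it is not written.
(regexp-substitute/global #f "[0-9]+" "a1b22c333" 'pre "#")
@result{} "a#"
@end lisp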
 338
 339
 340 @node Match Structures
 341 @subsection Match Structures
 342
 343 @cindex match structures
 344
 345 A @dfn{match structure} is the object returned by @code{string-match} and
 346 @code{regexp-exec}.  It describes which portion of a string, if any,
 347 matched the given regular expression.  Match structures include: a
 348 reference to the string that was checked for matches; the starting and
 349 ending positions of the regexp match; and, if the regexp included any
 350 parenthesized subexpressions, the starting and ending positions of each
 351 submatch.
 352
 353 In each of the regexp match functions described below, the @var{match}
 354 argument must be a match structure returned by a previous call to
 355 @code{string-match} or @code{regexp-exec}.  Most of these functions
 356 return some information about the original target string that was
 357 matched against a regular expression; we will call that string
 358 @var{target} for easy reference.
 359
 360 @c begin (scm-doc-string "regex.scm" "regexp-match?")
 361 @deffn {Scheme Procedure} regexp-match? obj
 362 Return @code{#t} if @var{obj} is a match structure returned by a
 363 previous call to @code{regexp-exec}, or @code{#f} otherwise.
 364 @end deffn
 365
 366 @c begin (scm-doc-string "regex.scm" "match:substring")
 367 @deffn {Scheme Procedure} match:substring match [n]
 368 Return the portion of @var{target} matched by subexpression number
 369 @var{n}.  Submatch 0 (the default) represents the entire regexp match.
 370 If the regular expression as a whole matched, but the subexpression
 371 number @var{n} did not match, return @code{#f}.
 372 @end deffn
 373
 374 @lisp
 375 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 376 (match:substring s)
 377 @result{} "2002"
 378
 379 ;; match starting at offset 6 in the string
 380 (match:substring
 381   (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
 382 @result{} "7654"
 383 @end lisp
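
The optional @var{n} argument selects a parenthesized subexpression.
In this sketch, on made-up input, only the second alternative takes
part in the match, so the first submatch is @code{#f}.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "(foo)|(bar)" "a bar"))

(match:substring m)
@result{} "bar"

(match:substring m 1)
@result{} #f

(match:substring m 2)
@result{} "bar"
@end lisp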
 384
 385 @c begin (scm-doc-string "regex.scm" "match:start")
 386 @deffn {Scheme Procedure} match:start match [n]
 387 Return the starting position of submatch number @var{n}.
 388 @end deffn
 389
 390 In the following example, the result is 4, since the match starts at
 391 character index 4:
 392
 393 @lisp
 394 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 395 (match:start s)
 396 @result{} 4
 397 @end lisp
 398
 399 @c begin (scm-doc-string "regex.scm" "match:end")
 400 @deffn {Scheme Procedure} match:end match [n]
 401 Return the ending position of submatch number @var{n}.
 402 @end deffn
 403
 404 In the following example, the result is 8, since the match covers
 405 indices 4 up to, but not including, 8 (i.e.@: the ``2002'').
 406
 407 @lisp
 408 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 409 (match:end s)
 410 @result{} 8
 411 @end lisp
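
With the optional @var{n} argument, @code{match:start} and
@code{match:end} report the positions of a submatch.  A sketch, using
a pattern with two groups chosen for illustration:

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([a-z]+)([0-9]+)" "blah2002foo"))

(match:start m 1)
@result{} 0
(match:end m 1)
@result{} 4

(match:start m 2)
@result{} 4
(match:end m 2)
@result{} 8

;; The positions are suitable arguments for substring.
(substring (match:string m) (match:start m 2) (match:end m 2))
@result{} "2002"
@end lisp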
 412
 413 @c begin (scm-doc-string "regex.scm" "match:prefix")
 414 @deffn {Scheme Procedure} match:prefix match
 415 Return the unmatched portion of @var{target} preceding the regexp match.
 416
 417 @lisp
 418 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 419 (match:prefix s)
 420 @result{} "blah"
 421 @end lisp
 422 @end deffn
 423
 424 @c begin (scm-doc-string "regex.scm" "match:suffix")
 425 @deffn {Scheme Procedure} match:suffix match
 426 Return the unmatched portion of @var{target} following the regexp match.
 427 @end deffn
 428
 429 @lisp
 430 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 431 (match:suffix s)
 432 @result{} "foo"
 433 @end lisp
 434
 435 @c begin (scm-doc-string "regex.scm" "match:count")
 436 @deffn {Scheme Procedure} match:count match
 437 Return the number of parenthesized subexpressions from @var{match}.
 438 Note that the entire regular expression match itself counts as a
 439 subexpression, and failed submatches are included in the count.
 440 @end deffn
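
For instance, in the following sketch the pattern has two
parenthesized groups, the second of which fails to match, yet the
count is three.

@lisp
(use-modules (ice-9 regex))

(define m (string-match "([0-9]+)-([a-z]+)?" "42-"))

(match:count m)
@result{} 3

;; The second group is counted even though it did not match.
(match:substring m 2)
@result{} #f

;; A pattern without groups still counts the whole match.
(match:count (string-match "[0-9]+" "blah2002"))
@result{} 1
@end lisp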
 441
 442 @c begin (scm-doc-string "regex.scm" "match:string")
 443 @deffn {Scheme Procedure} match:string match
 444 Return the original @var{target} string.
 445 @end deffn
 446
 447 @lisp
 448 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 449 (match:string s)
 450 @result{} "blah2002foo"
 451 @end lisp
 452
 453
 454 @node Backslash Escapes
 455 @subsection Backslash Escapes
 456
 457 Sometimes you will want a regexp to match characters like @samp{*} or
 458 @samp{$} exactly.  For example, to check whether a particular string
 459 represents a menu entry from an Info node, it would be useful to match
 460 it against a regexp like @samp{^* [^:]*::}.  However, this won't work;
 461 because the asterisk is a metacharacter, it won't match the @samp{*} at
 462 the beginning of the string.  In this case, we want to make the first
 463 asterisk un-magic.
 464
 465 You can do this by preceding the metacharacter with a backslash
 466 character @samp{\}.  (This is also called @dfn{quoting} the
 467 metacharacter, and is known as a @dfn{backslash escape}.)  When Guile
 468 sees a backslash in a regular expression, it considers the following
 469 glyph to be an ordinary character, no matter what special meaning it
 470 would ordinarily have.  Therefore, we can make the above example work by
 471 changing the regexp to @samp{^\* [^:]*::}.  The @samp{\*} sequence tells
 472 the regular expression engine to match only a single asterisk in the
 473 target string.
 474
 475 Since the backslash is itself a metacharacter, you may force a regexp to
 476 match a backslash in the target string by preceding the backslash with
 477 itself.  For example, to find variable references in a @TeX{} program,
 478 you might want to find occurrences of the string @samp{\let\} followed
 479 by any number of alphabetic characters.  The regular expression
 480 @samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
 481 regexp each match a single backslash in the target string.
 482
 483 @c begin (scm-doc-string "regex.scm" "regexp-quote")
 484 @deffn {Scheme Procedure} regexp-quote str
 485 Quote each special character found in @var{str} with a backslash, and
 486 return the resulting string.
 487 @end deffn
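
A sketch of why quoting matters, using a made-up string that contains
the metacharacters @samp{+} and @samp{?}.  The exact form of the
quoted string is not shown, only its effect.

@lisp
(use-modules (ice-9 regex))

(define literal "1+1=2?")

;; Used directly, + and ? act as operators and nothing matches.
(string-match literal "is 1+1=2? yes")
@result{} #f

;; Quoted, the same text matches itself literally.
(match:substring
  (string-match (regexp-quote literal) "is 1+1=2? yes"))
@result{} "1+1=2?"
@end lisp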
 488
 489 @strong{Very important:} Using backslash escapes in Guile source code
 490 (as in Emacs Lisp or C) can be tricky, because the backslash character
 491 has special meaning for the Guile reader.  For example, if Guile
 492 encounters the character sequence @samp{\n} in the middle of a string
 493 while processing Scheme code, it replaces those characters with a
 494 newline character.  Similarly, the character sequence @samp{\t} is
 495 replaced by a horizontal tab.  Several of these @dfn{escape sequences}
 496 are processed by the Guile reader before your code is executed.
 497 Unrecognized escape sequences are ignored: if the characters @samp{\*}
 498 appear in a string, they will be translated to the single character
 499 @samp{*}.
 500
 501 This translation is obviously undesirable for regular expressions, since
 502 we want to be able to include backslashes in a string in order to
 503 escape regexp metacharacters.  Therefore, to make sure that a backslash
 504 is preserved in a string in your Guile program, you must use @emph{two}
 505 consecutive backslashes:
 506
 507 @lisp
 508 (define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
 509 @end lisp
 510
 511 The string in this example is preprocessed by the Guile reader before
 512 any code is executed.  The resulting argument to @code{make-regexp} is
 513 the string @samp{^\* [^:]*}, which is what we really want.
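
The reader's processing can be checked directly, as in this sketch.
The menu line used as a target is illustrative.

@lisp
(use-modules (ice-9 regex))

;; Ten characters in the source, nine in the resulting string.
(string-length "^\\* [^:]*")
@result{} 9

(define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))

(match:substring
  (regexp-exec Info-menu-entry-pattern
               "* Regexp Functions::  Functions that match."))
@result{} "* Regexp Functions"
@end lisp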
 514
 515 This also means that in order to write a regular expression that matches
 516 a single backslash character, the regular expression string in the
 517 source code must include @emph{four} backslashes.  Each consecutive pair
 518 of backslashes gets translated by the Guile reader to a single
 519 backslash, and the resulting double-backslash is interpreted by the
 520 regexp engine as matching a single backslash character.  Hence:
 521
 522 @lisp
 523 (define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
 524 @end lisp
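
The same doubling applies to target strings written in source code.
The sketch below matches a single @TeX{} control sequence; the name
@code{control-sequence-rx} and the input are hypothetical.

@lisp
(use-modules (ice-9 regex))

;; Two backslashes in the source are one character in the string.
(string-length "\\")
@result{} 1

;; "\\\\" reads as the two-character regexp \\, which matches one
;; backslash in the target.
(define control-sequence-rx (make-regexp "\\\\[A-Za-z]+"))

(match:substring (regexp-exec control-sequence-rx "x \\alpha y"))
@result{} "\\alpha"     ; the six characters \alpha
@end lisp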
 525
 526 The reason for the unwieldiness of this syntax is historical.  Both
 527 regular expression pattern matchers and Unix string processing systems
 528 have traditionally used backslashes with the special meanings
 529 described above.  The POSIX regular expression specification and ANSI C
 530 standard both require these semantics.  Attempting to abandon either
 531 convention would cause other kinds of compatibility problems, possibly
 532 more severe ones.  Therefore, without extending the Scheme reader to
 533 support strings with different quoting conventions (an ungainly and
 534 confusing extension when implemented in other languages), we must adhere
 535 to this cumbersome escape syntax.