doc/ref/api-regex.texi

   1 @c -*-texinfo-*-
   2 @c This is part of the GNU Guile Reference Manual.
   3 @c Copyright (C)  1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010, 2012
   4 @c   Free Software Foundation, Inc.
   5 @c See the file guile.texi for copying conditions.
   6
   7 @node Regular Expressions
   8 @section Regular Expressions
   9 @tpindex Regular expressions
  10
  11 @cindex regular expressions
  12 @cindex regex
  13 @cindex emacs regexp
  14
  15 A @dfn{regular expression} (or @dfn{regexp}) is a pattern that
  16 describes a whole class of strings.  A full description of regular
  17 expressions and their syntax is beyond the scope of this manual;
  18 an introduction can be found in the Emacs manual (@pxref{Regexps,
  19 , Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or
  20 in many general Unix reference books.
  21
  22 If your system does not include a POSIX regular expression library,
  23 and you have not linked Guile with a third-party regexp library such
  24 as Rx, these functions will not be available.  You can tell whether
  25 your Guile installation includes regular expression support by
  26 checking whether @code{(provided? 'regex)} returns true.
  27
  28 The following regexp and string matching features are provided by the
  29 @code{(ice-9 regex)} module.  Before using the described functions,
  30 you should load this module by executing @code{(use-modules (ice-9
  31 regex))}.
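
     As a minimal sketch (the printed messages are illustrative only), a
     program can test for regexp support and then load the module:

     @lisp
     ;; Report whether this Guile was built with regexp support.
     (if (provided? 'regex)
         (display "regexp support is available\n")
         (display "regexp support is missing\n"))

     ;; Make string-match, match:substring and friends visible.
     (use-modules (ice-9 regex))
     @end lisp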
  32
  33 @menu
  34 * Regexp Functions::            Functions that create and match regexps.
  35 * Match Structures::            Finding what was matched by a regexp.
  36 * Backslash Escapes::           Removing the special meaning of regexp
  37                                 meta-characters.
  38 @end menu
  39
  40
  41 @node Regexp Functions
  42 @subsection Regexp Functions
  43
  44 By default, Guile supports POSIX extended regular expressions.
  45 That means that the characters @samp{(}, @samp{)}, @samp{+} and
  46 @samp{?} are special, and must be escaped if you wish to match the
  47 literal characters.
  48
  49 This regular expression interface was modeled after that
  50 implemented by SCSH, the Scheme Shell.  It is intended to be
  51 upwardly compatible with SCSH regular expressions.
  52
  53 Zero bytes (@code{#\nul}) cannot be used in regexp patterns or input
  54 strings, since the underlying C functions treat a zero byte as the end
  55 of the string.  If either contains a zero byte, an error is thrown.
  56
  57 Internally, patterns and input strings are converted to the current
  58 locale's encoding, and then passed to the C library's regular expression
  59 routines (@pxref{Regular Expressions,,, libc, The GNU C Library
  60 Reference Manual}).  The returned match structures always point to
  61 characters in the strings, not to individual bytes, even in the case of
  62 multi-byte encodings.
  63
  64 @deffn {Scheme Procedure} string-match pattern str [start]
  65 Compile the string @var{pattern} into a regular expression and compare
  66 it with @var{str}.  The optional numeric argument @var{start} specifies
  67 the position of @var{str} at which to begin matching.
  68
  69 @code{string-match} returns a @dfn{match structure} which
  70 describes what, if anything, was matched by the regular
  71 expression.  @xref{Match Structures}.  If @var{str} does not match
  72 @var{pattern} at all, @code{string-match} returns @code{#f}.
  73 @end deffn
  74
  75 Two examples follow.  In the first, the pattern matches the four digits
  76 at the end of the target string.  In the second, the pattern matches
  77 nothing, so @code{string-match} returns @code{#f}.
  78
  79 @example
  80 (string-match "[0-9][0-9][0-9][0-9]" "blah2002")
  81 @result{} #("blah2002" (4 . 8))
  82
  83 (string-match "[A-Za-z]" "123456")
  84 @result{} #f
  85 @end example
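
     Because a failed match yields @code{#f} rather than a match structure,
     callers usually test the result before inspecting it.  A minimal
     sketch, where @code{first-number} is a hypothetical helper and not
     part of Guile:

     @lisp
     (use-modules (ice-9 regex))

     ;; Return the first run of digits in STR, or #f if there is none.
     (define (first-number str)
       (let ((m (string-match "[0-9]+" str)))
         (and m (match:substring m))))

     (first-number "blah2002")
     @result{} "2002"

     (first-number "blah")
     @result{} #f
     @end lisp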
  86
  87 Each time @code{string-match} is called, it must compile its
  88 @var{pattern} argument into a regular expression structure.  This
  89 operation is expensive, which makes @code{string-match} inefficient if
  90 the same regular expression is used several times (for example, in a
  91 loop).  For better performance, you can compile a regular expression in
  92 advance and then match strings against the compiled regexp.
  93
  94 @deffn {Scheme Procedure} make-regexp pat flag@dots{}
  95 @deffnx {C Function} scm_make_regexp (pat, flaglst)
  96 Compile the regular expression described by @var{pat}, and
  97 return the compiled regexp structure.  If @var{pat} does not
  98 describe a valid regular expression, @code{make-regexp} throws
  99 a @code{regular-expression-syntax} error.
 100
 101 The @var{flag} arguments change the behavior of the compiled
 102 regular expression.  The following values may be supplied:
 103
 104 @defvar regexp/icase
 105 Consider uppercase and lowercase letters to be the same when
 106 matching.
 107 @end defvar
 108
 109 @defvar regexp/newline
 110 If a newline appears in the target string, then permit the
 111 @samp{^} and @samp{$} operators to match immediately after or
 112 immediately before the newline, respectively.  Also, the
 113 @samp{.} and @samp{[^...]} operators will never match a newline
 114 character.  The intent of this flag is to treat the target
 115 string as a buffer containing many lines of text, and the
 116 regular expression as a pattern that may match a single one of
 117 those lines.
 118 @end defvar
 119
 120 @defvar regexp/basic
 121 Compile a basic (``obsolete'') regexp instead of the extended
 122 (``modern'') regexps that are the default.  Basic regexps do
 123 not consider @samp{|}, @samp{+} or @samp{?} to be special
 124 characters, and require the @samp{@{...@}} and @samp{(...)}
 125 metacharacters to be backslash-escaped (@pxref{Backslash
 126 Escapes}).  There are several other differences between basic
 127 and extended regular expressions, but these are the most
 128 significant.
 129 @end defvar
 130
 131 @defvar regexp/extended
 132 Compile an extended regular expression rather than a basic
 133 regexp.  This is the default behavior; this flag will not
 134 usually be needed.  If a call to @code{make-regexp} includes
 135 both @code{regexp/basic} and @code{regexp/extended} flags, the
 136 one which comes last will override the earlier one.
 137 @end defvar
 138 @end deffn
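
     As an illustration of @code{regexp/newline}, the following sketch
     (the string and variable name are arbitrary) anchors a pattern at the
     start of the second line of a two-line string:

     @lisp
     (use-modules (ice-9 regex))

     (define text "first line\nsecond line")

     ;; Without the flag, ^ matches only at the start of the string.
     (regexp-exec (make-regexp "^second") text)
     @result{} #f

     ;; With the flag, ^ also matches just after the newline.
     (match:start (regexp-exec (make-regexp "^second" regexp/newline) text))
     @result{} 11
     @end lisp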
 139
 140 @deffn {Scheme Procedure} regexp-exec rx str [start [flags]]
 141 @deffnx {C Function} scm_regexp_exec (rx, str, start, flags)
 142 Match the compiled regular expression @var{rx} against
 143 @var{str}.  If the optional integer @var{start} argument is
 144 provided, begin matching from that position in the string.
 145 Return a match structure describing the results of the match,
 146 or @code{#f} if no match could be found.
 147
 148 The @var{flags} argument changes the matching behavior.  The following
 149 flag values may be supplied; use @code{logior} (@pxref{Bitwise
 150 Operations}) to combine them:
 151
 152 @defvar regexp/notbol
 153 Consider that the @var{start} offset into @var{str} is not the
 154 beginning of a line and should not match operator @samp{^}.
 155
 156 If @var{rx} was created with the @code{regexp/newline} option above,
 157 @samp{^} will still match after a newline in @var{str}.
 158 @end defvar
 159
 160 @defvar regexp/noteol
 161 Consider that the end of @var{str} is not the end of a line and should
 162 not match operator @samp{$}.
 163
 164 If @var{rx} was created with the @code{regexp/newline} option above,
 165 @samp{$} will still match before a newline in @var{str}.
 166 @end defvar
 167 @end deffn
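
     The matching flags can be sketched as follows.  The regexp and the
     target string are chosen purely for illustration:

     @lisp
     (use-modules (ice-9 regex))

     (define rx (make-regexp "^foo$"))

     ;; By default, both ends of the string count as line boundaries.
     (match:substring (regexp-exec rx "foo"))
     @result{} "foo"

     ;; With both flags set, neither ^ nor $ may match at the string ends.
     (regexp-exec rx "foo" 0 (logior regexp/notbol regexp/noteol))
     @result{} #f
     @end lisp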
 168
 169 @lisp
 170 ;; Regexp matching a (possibly empty) run of uppercase letters
 171 (define r (make-regexp "[A-Z]*"))
 172
 173 ;; The same regexp, ignoring case
 174 (define ri (make-regexp "[A-Z]*" regexp/icase))
 175
 176 ;; Match "bob" against r
 177 (match:substring (regexp-exec r "bob"))
 178 @result{} ""                  ; only the empty string matches
 179
 180 ;; Match "Bob" against ri
 181 (match:substring (regexp-exec ri "Bob"))
 182 @result{} "Bob"               ; matched case-insensitively
 183 @end lisp
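
     The earlier advice about compiling a regexp in advance can be sketched
     as follows.  The names @code{digits-rx} and @code{digit-strings} are
     hypothetical; the point is only that @code{make-regexp} runs once,
     while @code{regexp-exec} runs once per string:

     @lisp
     (use-modules (ice-9 regex) (srfi srfi-1))

     ;; Compile the pattern once, outside the loop.
     (define digits-rx (make-regexp "^[0-9]+$"))

     ;; Keep only the strings made up entirely of digits.
     (define (digit-strings strings)
       (filter (lambda (s) (regexp-exec digits-rx s)) strings))

     (digit-strings '("123" "abc" "42x" "7"))
     @result{} ("123" "7")
     @end lisp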
 184
 185 @deffn {Scheme Procedure} regexp? obj
 186 @deffnx {C Function} scm_regexp_p (obj)
 187 Return @code{#t} if @var{obj} is a compiled regular expression,
 188 or @code{#f} otherwise.
 189 @end deffn
 190
 191 @sp 1
 192 @deffn {Scheme Procedure} list-matches regexp str [flags]
 193 Return a list of match structures which are the non-overlapping
 194 matches of @var{regexp} in @var{str}.  @var{regexp} can be either a
 195 pattern string or a compiled regexp.  The @var{flags} argument is as
 196 per @code{regexp-exec} above.
 197
 198 @example
 199 (map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
 200 @result{} ("abc" "def")
 201 @end example
 202 @end deffn
 203
 204 @deffn {Scheme Procedure} fold-matches regexp str init proc [flags]
 205 Apply @var{proc} to the non-overlapping matches of @var{regexp} in
 206 @var{str}, to build a result.  @var{regexp} can be either a pattern
 207 string or a compiled regexp.  The @var{flags} argument is as per
 208 @code{regexp-exec} above.
 209
 210 @var{proc} is called as @code{(@var{proc} match prev)} where
 211 @var{match} is a match structure and @var{prev} is the previous return
 212 from @var{proc}.  For the first call @var{prev} is the given
 213 @var{init} parameter.  @code{fold-matches} returns the final value
 214 from @var{proc}.
 215
 216 For example to count matches,
 217
 218 @example
 219 (fold-matches "[a-z][0-9]" "abc x1 def y2" 0
 220               (lambda (match count)
 221                 (1+ count)))
 222 @result{} 2
 223 @end example
 224 @end deffn
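
     As a further sketch, @code{fold-matches} can accumulate a value
     computed from each match, here the sum of the numbers found in an
     arbitrary sample string:

     @lisp
     (use-modules (ice-9 regex))

     ;; Add up every run of digits in the string.
     (fold-matches "[0-9]+" "3 apples, 4 pears, 10 plums" 0
                   (lambda (m total)
                     (+ total (string->number (match:substring m)))))
     @result{} 17
     @end lisp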
 225
 226 @sp 1
 227 Regular expressions are commonly used to find patterns in one string
 228 and replace them with the contents of another string.  The following
 229 functions are convenient ways to do this.
 230
 231 @c begin (scm-doc-string "regex.scm" "regexp-substitute")
 232 @deffn {Scheme Procedure} regexp-substitute port match item @dots{}
 233 Write to @var{port} selected parts of the match structure @var{match}.
 234 Or if @var{port} is @code{#f} then form a string from those parts and
 235 return that.
 236
 237 Each @var{item} specifies a part to be written, and may be one of the
 238 following,
 239
 240 @itemize @bullet
 241 @item
 242 A string.  String arguments are written out verbatim.
 243
 244 @item
 245 An integer.  The submatch with that number is written
 246 (@code{match:substring}).  Zero is the entire match.
 247
 248 @item
 249 The symbol @samp{pre}.  The portion of the matched string preceding
 250 the regexp match is written (@code{match:prefix}).
 251
 252 @item
 253 The symbol @samp{post}.  The portion of the matched string following
 254 the regexp match is written (@code{match:suffix}).
 255 @end itemize
 256
 257 For example, changing a match and retaining the text before and after,
 258
 259 @example
 260 (regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
 261                    'pre "37" 'post)
 262 @result{} "number 37 is good"
 263 @end example
 264
 265 Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and
 266 re-ordering and hyphenating the fields.
 267
 268 @lisp
 269 (define date-regex
 270    "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
 271 (define s "Date 20020429 12am.")
 272 (regexp-substitute #f (string-match date-regex s)
 273                    'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
 274 @result{} "Date 04-29-2002 12am. (20020429)"
 275 @end lisp
 276 @end deffn
 277
 278
 279 @c begin (scm-doc-string "regex.scm" "regexp-substitute/global")
 280 @deffn {Scheme Procedure} regexp-substitute/global port regexp target item@dots{}
 281 @cindex search and replace
 282 Write to @var{port} selected parts of matches of @var{regexp} in
 283 @var{target}.  If @var{port} is @code{#f} then form a string from
 284 those parts and return that.  @var{regexp} can be a string or a
 285 compiled regex.
 286
 287 This is similar to @code{regexp-substitute}, but allows global
 288 substitutions on @var{target}.  Each @var{item} behaves as per
 289 @code{regexp-substitute}, with the following differences,
 290
 291 @itemize @bullet
 292 @item
 293 A function.  Called as @code{(@var{item} match)} with the match
 294 structure for the @var{regexp} match, it should return a string to be
 295 written to @var{port}.
 296
 297 @item
 298 The symbol @samp{post}.  This doesn't output anything, but instead
 299 causes @code{regexp-substitute/global} to recurse on the unmatched
 300 portion of @var{target}.
 301
 302 This @emph{must} be supplied to perform a global search and replace on
 303 @var{target}; without it @code{regexp-substitute/global} returns after
 304 a single match and output.
 305 @end itemize
 306
 307 For example, to collapse runs of tabs and spaces to a single hyphen
 308 each,
 309
 310 @example
 311 (regexp-substitute/global #f "[ \t]+"  "this   is   the text"
 312                           'pre "-" 'post)
 313 @result{} "this-is-the-text"
 314 @end example
 315
 316 Or using a function to reverse the letters in each word,
 317
 318 @example
 319 (regexp-substitute/global #f "[a-z]+"  "to do and not-do"
 320   'pre (lambda (m) (string-reverse (match:substring m))) 'post)
 321 @result{} "ot od dna ton-od"
 322 @end example
 323
 324 Without the @code{post} symbol, just one regexp match is made.  The
 325 following is the date example from @code{regexp-substitute} above,
 326 rewritten without the separate @code{string-match} call; its target
 327 contains a single date, so only one substitution takes place.
 328
 329 @lisp
 330 (define date-regex
 331    "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
 332 (define s "Date 20020429 12am.")
 333 (regexp-substitute/global #f date-regex s
 334                           'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
 335
 336 @result{} "Date 04-29-2002 12am. (20020429)"
 337 @end lisp
 338 @end deffn
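
     The effect of the @code{post} symbol can be seen by running the same
     substitution with and without it.  This is a sketch on an arbitrary
     sample string; the second result reflects the behavior described
     above, where processing ends after a single match and output:

     @lisp
     (use-modules (ice-9 regex))

     ;; With 'post, every match is replaced.
     (regexp-substitute/global #f "[0-9]+" "a1b22c333" 'pre "#" 'post)
     @result{} "a#b#c#"

     ;; Without 'post, processing stops after the first match, and the
     ;; rest of the target is not written.
     (regexp-substitute/global #f "[0-9]+" "a1b22c333" 'pre "#")
     @result{} "a#"
     @end lisp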
 339
 340
 341 @node Match Structures
 342 @subsection Match Structures
 343
 344 @cindex match structures
 345
 346 A @dfn{match structure} is the object returned by @code{string-match} and
 347 @code{regexp-exec}.  It describes which portion of a string, if any,
 348 matched the given regular expression.  Match structures include: a
 349 reference to the string that was checked for matches; the starting and
 350 ending positions of the regexp match; and, if the regexp included any
 351 parenthesized subexpressions, the starting and ending positions of each
 352 submatch.
 353
 354 In each of the regexp match functions described below, the @code{match}
 355 argument must be a match structure returned by a previous call to
 356 @code{string-match} or @code{regexp-exec}.  Most of these functions
 357 return some information about the original target string that was
 358 matched against a regular expression; we will call that string
 359 @var{target} for easy reference.
 360
 361 @c begin (scm-doc-string "regex.scm" "regexp-match?")
 362 @deffn {Scheme Procedure} regexp-match? obj
 363 Return @code{#t} if @var{obj} is a match structure returned by a
 364 previous call to @code{regexp-exec}, or @code{#f} otherwise.
 365 @end deffn
 366
 367 @c begin (scm-doc-string "regex.scm" "match:substring")
 368 @deffn {Scheme Procedure} match:substring match [n]
 369 Return the portion of @var{target} matched by subexpression number
 370 @var{n}.  Submatch 0 (the default) represents the entire regexp match.
 371 If the regular expression as a whole matched, but the subexpression
 372 number @var{n} did not match, return @code{#f}.
 373 @end deffn
 374
 375 @lisp
 376 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 377 (match:substring s)
 378 @result{} "2002"
 379
 380 ;; match starting at offset 6 in the string
 381 (match:substring
 382   (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6))
 383 @result{} "7654"
 384 @end lisp
 385
 386 @c begin (scm-doc-string "regex.scm" "match:start")
 387 @deffn {Scheme Procedure} match:start match [n]
 388 Return the starting position of submatch number @var{n}.
 389 @end deffn
 390
 391 In the following example, the result is 4, since the match starts at
 392 character index 4:
 393
 394 @lisp
 395 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 396 (match:start s)
 397 @result{} 4
 398 @end lisp
 399
 400 @c begin (scm-doc-string "regex.scm" "match:end")
 401 @deffn {Scheme Procedure} match:end match [n]
 402 Return the ending position of submatch number @var{n}.
 403 @end deffn
 404
 405 In the following example, the result is 8, since the match spans
 406 indices 4 (inclusive) to 8 (exclusive), i.e.@: the ``2002''.
 407
 408 @lisp
 409 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 410 (match:end s)
 411 @result{} 8
 412 @end lisp
 413
 414 @c begin (scm-doc-string "regex.scm" "match:prefix")
 415 @deffn {Scheme Procedure} match:prefix match
 416 Return the unmatched portion of @var{target} preceding the regexp match.
 417
 418 @lisp
 419 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 420 (match:prefix s)
 421 @result{} "blah"
 422 @end lisp
 423 @end deffn
 424
 425 @c begin (scm-doc-string "regex.scm" "match:suffix")
 426 @deffn {Scheme Procedure} match:suffix match
 427 Return the unmatched portion of @var{target} following the regexp match.
 428 @end deffn
 429
 430 @lisp
 431 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 432 (match:suffix s)
 433 @result{} "foo"
 434 @end lisp
 435
 436 @c begin (scm-doc-string "regex.scm" "match:count")
 437 @deffn {Scheme Procedure} match:count match
 438 Return the number of parenthesized subexpressions from @var{match}.
 439 Note that the entire regular expression match itself counts as a
 440 subexpression, and failed submatches are included in the count.
 441 @end deffn
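
     The submatch accessors can be sketched with a pattern that contains
     an optional group.  The pattern and target below are illustrative
     only; groups 3 and 4 do not take part in the match:

     @lisp
     (use-modules (ice-9 regex))

     (define m
       (string-match "([0-9]+)-([0-9]+)(-([0-9]+))?" "tel 555-1234"))

     (match:substring m)       ; the entire match
     @result{} "555-1234"

     (match:substring m 1)
     @result{} "555"

     (match:start m 2)
     @result{} 8

     (match:substring m 3)     ; a failed submatch
     @result{} #f

     (match:count m)           ; the whole match plus four groups
     @result{} 5
     @end lisp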
 442
 443 @c begin (scm-doc-string "regex.scm" "match:string")
 444 @deffn {Scheme Procedure} match:string match
 445 Return the original @var{target} string.
 446 @end deffn
 447
 448 @lisp
 449 (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo"))
 450 (match:string s)
 451 @result{} "blah2002foo"
 452 @end lisp
 453
 454
 455 @node Backslash Escapes
 456 @subsection Backslash Escapes
 457
 458 Sometimes you will want a regexp to match characters like @samp{*} or
 459 @samp{$} exactly.  For example, to check whether a particular string
 460 represents a menu entry from an Info node, it would be useful to match
 461 it against a regexp like @samp{^* [^:]*::}.  However, this won't work;
 462 because the asterisk is a metacharacter, it won't match the @samp{*} at
 463 the beginning of the string.  In this case, we want to make the first
 464 asterisk un-magic.
 465
 466 You can do this by preceding the metacharacter with a backslash
 467 character @samp{\}.  (This is also called @dfn{quoting} the
 468 metacharacter, and is known as a @dfn{backslash escape}.)  When Guile
 469 sees a backslash in a regular expression, it treats the following
 470 character as an ordinary one, no matter what special meaning it
 471 would ordinarily have.  Therefore, we can make the above example work by
 472 changing the regexp to @samp{^\* [^:]*::}.  The @samp{\*} sequence tells
 473 the regular expression engine to match only a single asterisk in the
 474 target string.
 475
 476 Since the backslash is itself a metacharacter, you may force a regexp to
 477 match a backslash in the target string by preceding the backslash with
 478 itself.  For example, to find variable references in a @TeX{} program,
 479 you might want to find occurrences of the string @samp{\let\} followed
 480 by any number of alphabetic characters.  The regular expression
 481 @samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the
 482 regexp each match a single backslash in the target string.
 483
 484 @c begin (scm-doc-string "regex.scm" "regexp-quote")
 485 @deffn {Scheme Procedure} regexp-quote str
 486 Quote each special character found in @var{str} with a backslash, and
 487 return the resulting string.
 488 @end deffn
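
     A typical use of @code{regexp-quote} is to search for arbitrary text
     literally.  In this sketch the sample text and variable names are
     arbitrary, and only the match result is shown, since the exact
     quoting that @code{regexp-quote} produces is not specified here:

     @lisp
     (use-modules (ice-9 regex))

     ;; The literal text contains several regexp metacharacters.
     (define literal "1+1=2 (approx.)")
     (define rx (make-regexp (regexp-quote literal)))

     (match:substring (regexp-exec rx "claim: 1+1=2 (approx.)!"))
     @result{} "1+1=2 (approx.)"

     ;; Unquoted, the same text is a different pattern and finds nothing.
     (string-match literal "claim: 1+1=2 (approx.)!")
     @result{} #f
     @end lisp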
 489
 490 @strong{Very important:} Using backslash escapes in Guile source code
 491 (as in Emacs Lisp or C) can be tricky, because the backslash character
 492 has special meaning for the Guile reader.  For example, if Guile
 493 encounters the character sequence @samp{\n} in the middle of a string
 494 while processing Scheme code, it replaces those characters with a
 495 newline character.  Similarly, the character sequence @samp{\t} is
 496 replaced by a horizontal tab.  Several of these @dfn{escape sequences}
 497 are processed by the Guile reader before your code is executed.
 498 Unrecognized escape sequences are not preserved: depending on the Guile
 499 version, the characters @samp{\*} in a string are either rejected by the
 500 reader or translated to the single character @samp{*}.
 501
 502 This translation is obviously undesirable for regular expressions, since
 503 we want to be able to include backslashes in a string in order to
 504 escape regexp metacharacters.  Therefore, to make sure that a backslash
 505 is preserved in a string in your Guile program, you must use @emph{two}
 506 consecutive backslashes:
 507
 508 @lisp
 509 (define Info-menu-entry-pattern (make-regexp "^\\* [^:]*"))
 510 @end lisp
 511
 512 The string in this example is preprocessed by the Guile reader before
 513 any code is executed.  The resulting argument to @code{make-regexp} is
 514 the string @samp{^\* [^:]*}, which is what we really want.
 515
 516 This also means that in order to write a regular expression that matches
 517 a single backslash character, the regular expression string in the
 518 source code must include @emph{four} backslashes.  Each consecutive pair
 519 of backslashes gets translated by the Guile reader to a single
 520 backslash, and the resulting double-backslash is interpreted by the
 521 regexp engine as matching a single backslash character.  Hence:
 522
 523 @lisp
 524 (define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
 525 @end lisp
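
     Both levels of escaping can be checked as in the following sketch.
     The target strings are made up for illustration, and the patterns are
     the ones given in the prose of this section:

     @lisp
     (use-modules (ice-9 regex))

     ;; Two backslashes in the source yield one backslash in the string.
     (string-length "\\*")
     @result{} 2

     (define Info-menu-entry-pattern (make-regexp "^\\* [^:]*::"))
     (match:substring
       (regexp-exec Info-menu-entry-pattern "* Regexp Functions::  Details"))
     @result{} "* Regexp Functions::"

     ;; Four backslashes in the source match one backslash in the target.
     (define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*"))
     (match:substring (regexp-exec tex-variable-pattern "\\let\\foo = 1"))
     @result{} "\\let\\foo"
     @end lisp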
 526
 527 The reason for the unwieldiness of this syntax is historical.  Both
 528 regular expression pattern matchers and Unix string processing systems
 529 have traditionally used backslashes with the special meanings
 530 described above.  The POSIX regular expression specification and ANSI C
 531 standard both require these semantics.  Attempting to abandon either
 532 convention would cause other kinds of compatibility problems, possibly
 533 more severe ones.  Therefore, without extending the Scheme reader to
 534 support strings with different quoting conventions (an ungainly and
 535 confusing extension when implemented in other languages), we must adhere
 536 to this cumbersome escape syntax.