| 1 | @c -*-texinfo-*- |
| 2 | @c This is part of the GNU Guile Reference Manual. |
| 3 | @c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2007, 2009, 2010 |
| 4 | @c Free Software Foundation, Inc. |
| 5 | @c See the file guile.texi for copying conditions. |
| 6 | |
| 7 | @node Regular Expressions |
| 8 | @section Regular Expressions |
| 9 | @tpindex Regular expressions |
| 10 | |
| 11 | @cindex regular expressions |
| 12 | @cindex regex |
| 13 | @cindex emacs regexp |
| 14 | |
| 15 | A @dfn{regular expression} (or @dfn{regexp}) is a pattern that |
| 16 | describes a whole class of strings. A full description of regular |
| 17 | expressions and their syntax is beyond the scope of this manual; |
| 18 | an introduction can be found in the Emacs manual (@pxref{Regexps, |
| 19 | , Syntax of Regular Expressions, emacs, The GNU Emacs Manual}), or |
| 20 | in many general Unix reference books. |
| 21 | |
| 22 | If your system does not include a POSIX regular expression library, |
| 23 | and you have not linked Guile with a third-party regexp library such |
| 24 | as Rx, these functions will not be available. You can tell whether |
| 25 | your Guile installation includes regular expression support by |
| 26 | checking whether @code{(provided? 'regex)} returns true. |
| 27 | |
| 28 | The following regexp and string matching features are provided by the |
| 29 | @code{(ice-9 regex)} module. Before using the described functions, |
| 30 | you should load this module by executing @code{(use-modules (ice-9 |
| 31 | regex))}. |
| 32 | |
| 33 | @menu |
| 34 | * Regexp Functions:: Functions that create and match regexps. |
| 35 | * Match Structures:: Finding what was matched by a regexp. |
| 36 | * Backslash Escapes:: Removing the special meaning of regexp |
| 37 | meta-characters. |
| 38 | @end menu |
| 39 | |
| 40 | |
| 41 | @node Regexp Functions |
| 42 | @subsection Regexp Functions |
| 43 | |
| 44 | By default, Guile supports POSIX extended regular expressions. |
| 45 | That means that the characters @samp{(}, @samp{)}, @samp{+} and |
| 46 | @samp{?} are special, and must be escaped if you wish to match the |
| 47 | literal characters. |
| 48 | |
| 49 | This regular expression interface was modeled after that |
| 50 | implemented by SCSH, the Scheme Shell. It is intended to be |
| 51 | upwardly compatible with SCSH regular expressions. |
| 52 | |
| 53 | Zero bytes (@code{#\nul}) cannot be used in regex patterns or input |
| 54 | strings, since the underlying C functions treat a zero byte as the end |
| 55 | of the string. If a zero byte is present, an error is thrown. |
| 56 | |
| 57 | Patterns and input strings are treated as being in the locale |
| 58 | character set if @code{setlocale} has been called (@pxref{Locales}), |
| 59 | and in a multibyte locale this includes treating multi-byte sequences |
| 60 | as a single character. (Guile strings are currently merely bytes, |
| 61 | though this may change in the future. @xref{Conversion to/from C}.) |
| 62 | |
| 63 | @deffn {Scheme Procedure} string-match pattern str [start] |
| 64 | Compile the string @var{pattern} into a regular expression and compare |
| 65 | it with @var{str}. The optional numeric argument @var{start} specifies |
| 66 | the position of @var{str} at which to begin matching. |
| 67 | |
| 68 | @code{string-match} returns a @dfn{match structure} which |
| 69 | describes what, if anything, was matched by the regular |
| 70 | expression. @xref{Match Structures}. If @var{str} does not match |
| 71 | @var{pattern} at all, @code{string-match} returns @code{#f}. |
| 72 | @end deffn |
| 73 | |
| 74 | Two examples of a match follow. In the first example, the pattern |
| 75 | matches the four digits in the match string. In the second, the pattern |
| 76 | matches nothing. |
| 77 | |
| 78 | @example |
| 79 | (string-match "[0-9][0-9][0-9][0-9]" "blah2002") |
| 80 | @result{} #("blah2002" (4 . 8)) |
| 81 | |
| 82 | (string-match "[A-Za-z]" "123456") |
| 83 | @result{} #f |
| 84 | @end example |
| 85 | |
| 86 | Each time @code{string-match} is called, it must compile its |
| 87 | @var{pattern} argument into a regular expression structure. This |
| 88 | operation is expensive, which makes @code{string-match} inefficient if |
| 89 | the same regular expression is used several times (for example, in a |
| 90 | loop). For better performance, you can compile a regular expression in |
| 91 | advance and then match strings against the compiled regexp. |
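
The following is a minimal sketch of that approach, using
@code{make-regexp} and @code{regexp-exec} (both described below). The
names @code{digits-rx} and @code{first-number} are hypothetical, chosen
only for illustration.

@lisp
(use-modules (ice-9 regex))

;; Compile the pattern once, outside the loop.
(define digits-rx (make-regexp "[0-9]+"))

;; Hypothetical helper: return the first run of digits in STR as a
;; number, or #f if STR contains no digits.
(define (first-number str)
  (let ((m (regexp-exec digits-rx str)))
    (and m (string->number (match:substring m)))))

;; The compiled regexp is reused for every element of the list.
(map first-number '("a1" "b22" "none"))
; @result{} (1 22 #f)
@end lisp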
| 92 | |
| 93 | @deffn {Scheme Procedure} make-regexp pat flag@dots{} |
| 94 | @deffnx {C Function} scm_make_regexp (pat, flaglst) |
| 95 | Compile the regular expression described by @var{pat}, and |
| 96 | return the compiled regexp structure. If @var{pat} does not |
| 97 | describe a legal regular expression, @code{make-regexp} throws |
| 98 | a @code{regular-expression-syntax} error. |
| 99 | |
| 100 | The @var{flag} arguments change the behavior of the compiled |
| 101 | regular expression. The following values may be supplied: |
| 102 | |
| 103 | @defvar regexp/icase |
| 104 | Consider uppercase and lowercase letters to be the same when |
| 105 | matching. |
| 106 | @end defvar |
| 107 | |
| 108 | @defvar regexp/newline |
| 109 | If a newline appears in the target string, then permit the |
| 110 | @samp{^} and @samp{$} operators to match immediately after or |
| 111 | immediately before the newline, respectively. Also, the |
| 112 | @samp{.} and @samp{[^...]} operators will never match a newline |
| 113 | character. The intent of this flag is to treat the target |
| 114 | string as a buffer containing many lines of text, and the |
| 115 | regular expression as a pattern that may match a single one of |
| 116 | those lines. |
| 117 | @end defvar |
| 118 | |
| 119 | @defvar regexp/basic |
| 120 | Compile a basic (``obsolete'') regexp instead of the extended |
| 121 | (``modern'') regexps that are the default. Basic regexps do |
| 122 | not consider @samp{|}, @samp{+} or @samp{?} to be special |
| 123 | characters, and require the @samp{@{...@}} and @samp{(...)} |
| 124 | metacharacters to be backslash-escaped (@pxref{Backslash |
| 125 | Escapes}). There are several other differences between basic |
| 126 | and extended regular expressions, but these are the most |
| 127 | significant. |
| 128 | @end defvar |
| 129 | |
| 130 | @defvar regexp/extended |
| 131 | Compile an extended regular expression rather than a basic |
| 132 | regexp. This is the default behavior; this flag will not |
| 133 | usually be needed. If a call to @code{make-regexp} includes |
| 134 | both @code{regexp/basic} and @code{regexp/extended} flags, the |
| 135 | one which comes last will override the earlier one. |
| 136 | @end defvar |
| 137 | @end deffn |
| 138 | |
| 139 | @deffn {Scheme Procedure} regexp-exec rx str [start [flags]] |
| 140 | @deffnx {C Function} scm_regexp_exec (rx, str, start, flags) |
| 141 | Match the compiled regular expression @var{rx} against |
| 142 | @var{str}. If the optional integer @var{start} argument is |
| 143 | provided, begin matching from that position in the string. |
| 144 | Return a match structure describing the results of the match, |
| 145 | or @code{#f} if no match could be found. |
| 146 | |
| 147 | The @var{flags} argument changes the matching behavior. The following |
| 148 | flag values may be supplied; use @code{logior} (@pxref{Bitwise |
| 149 | Operations}) to combine them. |
| 150 | |
| 151 | @defvar regexp/notbol |
| 152 | Consider that the @var{start} offset into @var{str} is not the |
| 153 | beginning of a line and should not match operator @samp{^}. |
| 154 | |
| 155 | If @var{rx} was created with the @code{regexp/newline} option above, |
| 156 | @samp{^} will still match after a newline in @var{str}. |
| 157 | @end defvar |
| 158 | |
| 159 | @defvar regexp/noteol |
| 160 | Consider that the end of @var{str} is not the end of a line and should |
| 161 | not match operator @samp{$}. |
| 162 | |
| 163 | If @var{rx} was created with the @code{regexp/newline} option above, |
| 164 | @samp{$} will still match before a newline in @var{str}. |
| 165 | @end defvar |
| 166 | @end deffn |
| 167 | |
| 168 | @lisp |
| 169 | ;; Regexp to match uppercase letters |
| 170 | (define r (make-regexp "[A-Z]*")) |
| 171 | |
| 172 | ;; Regexp to match letters, ignoring case |
| 173 | (define ri (make-regexp "[A-Z]*" regexp/icase)) |
| 174 | |
| 175 | ;; Search "bob" using regexp r |
| 176 | (match:substring (regexp-exec r "bob")) |
| 177 | @result{} "" ; empty match at position 0: no uppercase letters |
| 178 | |
| 179 | ;; Search "Bob" using regexp ri |
| 180 | (match:substring (regexp-exec ri "Bob")) |
| 181 | @result{} "Bob" ; matched case-insensitively |
| 182 | @end lisp |
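
The compile-time and match-time flags can be illustrated in the same
way. This is only a sketch: the variable names @code{text},
@code{plain} and @code{by-line} are hypothetical, and
@code{list-matches} is described below.

@lisp
(use-modules (ice-9 regex))

(define text "alpha\nbeta\ngamma")

;; Without regexp/newline, ^ matches only at the start of the string.
(define plain (make-regexp "^[a-z]+"))
(map match:substring (list-matches plain text))
; @result{} ("alpha")

;; With regexp/newline, ^ also matches just after each newline.
(define by-line (make-regexp "^[a-z]+" regexp/newline))
(map match:substring (list-matches by-line text))
; @result{} ("alpha" "beta" "gamma")

;; regexp/notbol: the start of the string no longer counts as the
;; beginning of a line, so ^ cannot match there.
(regexp-exec plain "alpha" 0 regexp/notbol)
; @result{} #f
@end lisp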
| 183 | |
| 184 | @deffn {Scheme Procedure} regexp? obj |
| 185 | @deffnx {C Function} scm_regexp_p (obj) |
| 186 | Return @code{#t} if @var{obj} is a compiled regular expression, |
| 187 | or @code{#f} otherwise. |
| 188 | @end deffn |
| 189 | |
| 190 | @sp 1 |
| 191 | @deffn {Scheme Procedure} list-matches regexp str [flags] |
| 192 | Return a list of match structures which are the non-overlapping |
| 193 | matches of @var{regexp} in @var{str}. @var{regexp} can be either a |
| 194 | pattern string or a compiled regexp. The @var{flags} argument is as |
| 195 | per @code{regexp-exec} above. |
| 196 | |
| 197 | @example |
| 198 | (map match:substring (list-matches "[a-z]+" "abc 42 def 78")) |
| 199 | @result{} ("abc" "def") |
| 200 | @end example |
| 201 | @end deffn |
| 202 | |
| 203 | @deffn {Scheme Procedure} fold-matches regexp str init proc [flags] |
| 204 | Apply @var{proc} to the non-overlapping matches of @var{regexp} in |
| 205 | @var{str}, to build a result. @var{regexp} can be either a pattern |
| 206 | string or a compiled regexp. The @var{flags} argument is as per |
| 207 | @code{regexp-exec} above. |
| 208 | |
| 209 | @var{proc} is called as @code{(@var{proc} match prev)} where |
| 210 | @var{match} is a match structure and @var{prev} is the previous return |
| 211 | from @var{proc}. For the first call @var{prev} is the given |
| 212 | @var{init} parameter. @code{fold-matches} returns the final value |
| 213 | from @var{proc}. |
| 214 | |
| 215 | For example to count matches, |
| 216 | |
| 217 | @example |
| 218 | (fold-matches "[a-z][0-9]" "abc x1 def y2" 0 |
| 219 | (lambda (match count) |
| 220 | (1+ count))) |
| 221 | @result{} 2 |
| 222 | @end example |
| 223 | @end deffn |
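
As a further sketch, @var{proc} can inspect each match structure
rather than just counting. The example below, which is illustrative
only, adds up every run of digits in a string.

@lisp
(use-modules (ice-9 regex))

;; Accumulate the numeric value of each match into a running total.
(fold-matches "[0-9]+" "abc 42 def 78" 0
              (lambda (match total)
                (+ total (string->number (match:substring match)))))
; @result{} 120
@end lisp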
| 224 | |
| 225 | @sp 1 |
| 226 | Regular expressions are commonly used to find patterns in one string |
| 227 | and replace them with the contents of another string. The following |
| 228 | functions are convenient ways to do this. |
| 229 | |
| 230 | @c begin (scm-doc-string "regex.scm" "regexp-substitute") |
| 231 | @deffn {Scheme Procedure} regexp-substitute port match [item@dots{}] |
| 232 | Write to @var{port} selected parts of the match structure @var{match}. |
| 233 | Or if @var{port} is @code{#f} then form a string from those parts and |
| 234 | return that. |
| 235 | |
| 236 | Each @var{item} specifies a part to be written, and may be one of the |
| 237 | following, |
| 238 | |
| 239 | @itemize @bullet |
| 240 | @item |
| 241 | A string. String arguments are written out verbatim. |
| 242 | |
| 243 | @item |
| 244 | An integer. The submatch with that number is written |
| 245 | (@code{match:substring}). Zero is the entire match. |
| 246 | |
| 247 | @item |
| 248 | The symbol @samp{pre}. The portion of the matched string preceding |
| 249 | the regexp match is written (@code{match:prefix}). |
| 250 | |
| 251 | @item |
| 252 | The symbol @samp{post}. The portion of the matched string following |
| 253 | the regexp match is written (@code{match:suffix}). |
| 254 | @end itemize |
| 255 | |
| 256 | For example, changing a match and retaining the text before and after, |
| 257 | |
| 258 | @example |
| 259 | (regexp-substitute #f (string-match "[0-9]+" "number 25 is good") |
| 260 | 'pre "37" 'post) |
| 261 | @result{} "number 37 is good" |
| 262 | @end example |
| 263 | |
| 264 | Or matching a @sc{yyyymmdd} format date such as @samp{20020828} and |
| 265 | re-ordering and hyphenating the fields. |
| 266 | |
| 267 | @lisp |
| 268 | (define date-regex |
| 269 | "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])") |
| 270 | (define s "Date 20020429 12am.") |
| 271 | (regexp-substitute #f (string-match date-regex s) |
| 272 | 'pre 2 "-" 3 "-" 1 'post " (" 0 ")") |
| 273 | @result{} "Date 04-29-2002 12am. (20020429)" |
| 274 | @end lisp |
| 275 | @end deffn |
| 276 | |
| 277 | |
| 278 | @c begin (scm-doc-string "regex.scm" "regexp-substitute") |
| 279 | @deffn {Scheme Procedure} regexp-substitute/global port regexp target [item@dots{}] |
| 280 | @cindex search and replace |
| 281 | Write to @var{port} selected parts of matches of @var{regexp} in |
| 282 | @var{target}. If @var{port} is @code{#f} then form a string from |
| 283 | those parts and return that. @var{regexp} can be a string or a |
| 284 | compiled regex. |
| 285 | |
| 286 | This is similar to @code{regexp-substitute}, but allows global |
| 287 | substitutions on @var{target}. Each @var{item} behaves as per |
| 288 | @code{regexp-substitute}, with the following differences, |
| 289 | |
| 290 | @itemize @bullet |
| 291 | @item |
| 292 | A function. Called as @code{(@var{item} match)} with the match |
| 293 | structure for the @var{regexp} match, it should return a string to be |
| 294 | written to @var{port}. |
| 295 | |
| 296 | @item |
| 297 | The symbol @samp{post}. This doesn't output anything, but instead |
| 298 | causes @code{regexp-substitute/global} to recurse on the unmatched |
| 299 | portion of @var{target}. |
| 300 | |
| 301 | This @emph{must} be supplied to perform a global search and replace on |
| 302 | @var{target}; without it @code{regexp-substitute/global} returns after |
| 303 | a single match and output. |
| 304 | @end itemize |
| 305 | |
| 306 | For example, to collapse runs of tabs and spaces to a single hyphen |
| 307 | each, |
| 308 | |
| 309 | @example |
| 310 | (regexp-substitute/global #f "[ \t]+" "this   is   the text" |
| 311 | 'pre "-" 'post) |
| 312 | @result{} "this-is-the-text" |
| 313 | @end example |
| 314 | |
| 315 | Or using a function to reverse the letters in each word, |
| 316 | |
| 317 | @example |
| 318 | (regexp-substitute/global #f "[a-z]+" "to do and not-do" |
| 319 | 'pre (lambda (m) (string-reverse (match:substring m))) 'post) |
| 320 | @result{} "ot od dna ton-od" |
| 321 | @end example |
| 322 | |
| 323 | Without the @code{post} symbol, just one regexp match is made. Because |
| 324 | @var{regexp} may be a pattern string, the date example from |
| 325 | @code{regexp-substitute} above can be written without the separate |
| 326 | @code{string-match} call. The target holds a single date, so recursing via @code{post} merely copies the remaining text. |
| 327 | |
| 328 | @lisp |
| 329 | (define date-regex |
| 330 | "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])") |
| 331 | (define s "Date 20020429 12am.") |
| 332 | (regexp-substitute/global #f date-regex s |
| 333 | 'pre 2 "-" 3 "-" 1 'post " (" 0 ")") |
| 334 | |
| 335 | @result{} "Date 04-29-2002 12am. (20020429)" |
| 336 | @end lisp |
| 337 | @end deffn |
| 338 | |
| 339 | |
| 340 | @node Match Structures |
| 341 | @subsection Match Structures |
| 342 | |
| 343 | @cindex match structures |
| 344 | |
| 345 | A @dfn{match structure} is the object returned by @code{string-match} and |
| 346 | @code{regexp-exec}. It describes which portion of a string, if any, |
| 347 | matched the given regular expression. Match structures include: a |
| 348 | reference to the string that was checked for matches; the starting and |
| 349 | ending positions of the regexp match; and, if the regexp included any |
| 350 | parenthesized subexpressions, the starting and ending positions of each |
| 351 | submatch. |
| 352 | |
| 353 | In each of the regexp match functions described below, the @code{match} |
| 354 | argument must be a match structure returned by a previous call to |
| 355 | @code{string-match} or @code{regexp-exec}. Most of these functions |
| 356 | return some information about the original target string that was |
| 357 | matched against a regular expression; we will call that string |
| 358 | @var{target} for easy reference. |
| 359 | |
| 360 | @c begin (scm-doc-string "regex.scm" "regexp-match?") |
| 361 | @deffn {Scheme Procedure} regexp-match? obj |
| 362 | Return @code{#t} if @var{obj} is a match structure returned by a |
| 363 | previous call to @code{string-match} or @code{regexp-exec}, or @code{#f} otherwise. |
| 364 | @end deffn |
| 365 | |
| 366 | @c begin (scm-doc-string "regex.scm" "match:substring") |
| 367 | @deffn {Scheme Procedure} match:substring match [n] |
| 368 | Return the portion of @var{target} matched by subexpression number |
| 369 | @var{n}. Submatch 0 (the default) represents the entire regexp match. |
| 370 | If the regular expression as a whole matched, but the subexpression |
| 371 | number @var{n} did not match, return @code{#f}. |
| 372 | @end deffn |
| 373 | |
| 374 | @lisp |
| 375 | (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo")) |
| 376 | (match:substring s) |
| 377 | @result{} "2002" |
| 378 | |
| 379 | ;; match starting at offset 6 in the string |
| 380 | (match:substring |
| 381 | (string-match "[0-9][0-9][0-9][0-9]" "blah987654" 6)) |
| 382 | @result{} "7654" |
| 383 | @end lisp |
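
The optional @var{n} argument selects a parenthesized subexpression.
The sketch below uses a hypothetical key/value pattern, whose third
group is optional and does not take part in the match;
@code{match:start} and @code{match:end} (described below) accept the
same index.

@lisp
(use-modules (ice-9 regex))

;; Hypothetical pattern: a name, a number, and an optional suffix.
(define m (string-match "([a-z]+)=([0-9]+)(;comment)?" "width=80"))

(match:substring m)     ; @result{} "width=80"
(match:substring m 1)   ; @result{} "width"
(match:substring m 2)   ; @result{} "80"
(match:substring m 3)   ; @result{} #f, since group 3 did not match

(match:start m 2)       ; @result{} 6
(match:end m 2)         ; @result{} 8
@end lisp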
| 384 | |
| 385 | @c begin (scm-doc-string "regex.scm" "match:start") |
| 386 | @deffn {Scheme Procedure} match:start match [n] |
| 387 | Return the starting position of submatch number @var{n}. |
| 388 | @end deffn |
| 389 | |
| 390 | In the following example, the result is 4, since the match starts at |
| 391 | character index 4: |
| 392 | |
| 393 | @lisp |
| 394 | (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo")) |
| 395 | (match:start s) |
| 396 | @result{} 4 |
| 397 | @end lisp |
| 398 | |
| 399 | @c begin (scm-doc-string "regex.scm" "match:end") |
| 400 | @deffn {Scheme Procedure} match:end match [n] |
| 401 | Return the ending position of submatch number @var{n}. |
| 402 | @end deffn |
| 403 | |
| 404 | In the following example, the result is 8: the match covers indices 4 |
| 405 | through 7 (i.e.@: the ``2002''), and the end position is exclusive. |
| 406 | |
| 407 | @lisp |
| 408 | (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo")) |
| 409 | (match:end s) |
| 410 | @result{} 8 |
| 411 | @end lisp |
| 412 | |
| 413 | @c begin (scm-doc-string "regex.scm" "match:prefix") |
| 414 | @deffn {Scheme Procedure} match:prefix match |
| 415 | Return the unmatched portion of @var{target} preceding the regexp match. |
| 416 | |
| 417 | @lisp |
| 418 | (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo")) |
| 419 | (match:prefix s) |
| 420 | @result{} "blah" |
| 421 | @end lisp |
| 422 | @end deffn |
| 423 | |
| 424 | @c begin (scm-doc-string "regex.scm" "match:suffix") |
| 425 | @deffn {Scheme Procedure} match:suffix match |
| 426 | Return the unmatched portion of @var{target} following the regexp match. |
| 427 | @end deffn |
| 428 | |
| 429 | @lisp |
| 430 | (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo")) |
| 431 | (match:suffix s) |
| 432 | @result{} "foo" |
| 433 | @end lisp |
| 434 | |
| 435 | @c begin (scm-doc-string "regex.scm" "match:count") |
| 436 | @deffn {Scheme Procedure} match:count match |
| 437 | Return the number of parenthesized subexpressions from @var{match}. |
| 438 | Note that the entire regular expression match itself counts as a |
| 439 | subexpression, and failed submatches are included in the count. |
| 440 | @end deffn |
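
For instance, with a hypothetical pattern containing three
parenthesized groups, one of which fails to match, the count is still
four: the whole match plus the three groups.

@lisp
(use-modules (ice-9 regex))

;; Group 3 is optional and does not match "width=80".
(define m (string-match "([a-z]+)=([0-9]+)(;comment)?" "width=80"))

(match:count m)
; @result{} 4
@end lisp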
| 441 | |
| 442 | @c begin (scm-doc-string "regex.scm" "match:string") |
| 443 | @deffn {Scheme Procedure} match:string match |
| 444 | Return the original @var{target} string. |
| 445 | @end deffn |
| 446 | |
| 447 | @lisp |
| 448 | (define s (string-match "[0-9][0-9][0-9][0-9]" "blah2002foo")) |
| 449 | (match:string s) |
| 450 | @result{} "blah2002foo" |
| 451 | @end lisp |
| 452 | |
| 453 | |
| 454 | @node Backslash Escapes |
| 455 | @subsection Backslash Escapes |
| 456 | |
| 457 | Sometimes you will want a regexp to match characters like @samp{*} or |
| 458 | @samp{$} exactly. For example, to check whether a particular string |
| 459 | represents a menu entry from an Info node, it would be useful to match |
| 460 | it against a regexp like @samp{^* [^:]*::}. However, this won't work; |
| 461 | because the asterisk is a metacharacter, it won't match the @samp{*} at |
| 462 | the beginning of the string. In this case, we want to make the first |
| 463 | asterisk un-magic. |
| 464 | |
| 465 | You can do this by preceding the metacharacter with a backslash |
| 466 | character @samp{\}. (This is also called @dfn{quoting} the |
| 467 | metacharacter, and is known as a @dfn{backslash escape}.) When Guile |
| 468 | sees a backslash in a regular expression, it considers the following |
| 469 | glyph to be an ordinary character, no matter what special meaning it |
| 470 | would ordinarily have. Therefore, we can make the above example work by |
| 471 | changing the regexp to @samp{^\* [^:]*::}. The @samp{\*} sequence tells |
| 472 | the regular expression engine to match only a single asterisk in the |
| 473 | target string. |
| 474 | |
| 475 | Since the backslash is itself a metacharacter, you may force a regexp to |
| 476 | match a backslash in the target string by preceding the backslash with |
| 477 | itself. For example, to find variable references in a @TeX{} program, |
| 478 | you might want to find occurrences of the string @samp{\let\} followed |
| 479 | by any number of alphabetic characters. The regular expression |
| 480 | @samp{\\let\\[A-Za-z]*} would do this: the double backslashes in the |
| 481 | regexp each match a single backslash in the target string. |
| 482 | |
| 483 | @c begin (scm-doc-string "regex.scm" "regexp-quote") |
| 484 | @deffn {Scheme Procedure} regexp-quote str |
| 485 | Quote each special character found in @var{str} with a backslash, and |
| 486 | return the resulting string. |
| 487 | @end deffn |
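
A short sketch of why quoting matters follows. The exact form of the
quoted string is not shown, since only its matching behavior is relied
upon here; the sample strings are made up for illustration.

@lisp
(use-modules (ice-9 regex))

;; Unquoted, "1+" means "one or more 1s", so the literal text fails.
(string-match "1+1=2" "1+1=2")
; @result{} #f

;; Quoted, every character is matched literally.
(match:substring
 (string-match (regexp-quote "1+1=2") "so 1+1=2 holds"))
; @result{} "1+1=2"

;; A quoted "." no longer matches an arbitrary character.
(string-match (regexp-quote "a.b") "axb")
; @result{} #f
@end lisp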
| 488 | |
| 489 | @strong{Very important:} Using backslash escapes in Guile source code |
| 490 | (as in Emacs Lisp or C) can be tricky, because the backslash character |
| 491 | has special meaning for the Guile reader. For example, if Guile |
| 492 | encounters the character sequence @samp{\n} in the middle of a string |
| 493 | while processing Scheme code, it replaces those characters with a |
| 494 | newline character. Similarly, the character sequence @samp{\t} is |
| 495 | replaced by a horizontal tab. Several of these @dfn{escape sequences} |
| 496 | are processed by the Guile reader before your code is executed. |
| 497 | Unrecognized escape sequences are ignored: if the characters @samp{\*} |
| 498 | appear in a string, they will be translated to the single character |
| 499 | @samp{*}. |
| 500 | |
| 501 | This translation is obviously undesirable for regular expressions, since |
| 502 | we want to be able to include backslashes in a string in order to |
| 503 | escape regexp metacharacters. Therefore, to make sure that a backslash |
| 504 | is preserved in a string in your Guile program, you must use @emph{two} |
| 505 | consecutive backslashes: |
| 506 | |
| 507 | @lisp |
| 508 | (define Info-menu-entry-pattern (make-regexp "^\\* [^:]*::")) |
| 509 | @end lisp |
| 510 | |
| 511 | The string in this example is preprocessed by the Guile reader before |
| 512 | any code is executed. The resulting argument to @code{make-regexp} is |
| 513 | the string @samp{^\* [^:]*::}, which is what we really want. |
| 514 | |
| 515 | This also means that in order to write a regular expression that matches |
| 516 | a single backslash character, the regular expression string in the |
| 517 | source code must include @emph{four} backslashes. Each consecutive pair |
| 518 | of backslashes gets translated by the Guile reader to a single |
| 519 | backslash, and the resulting double-backslash is interpreted by the |
| 520 | regexp engine as matching a single backslash character. Hence: |
| 521 | |
| 522 | @lisp |
| 523 | (define tex-variable-pattern (make-regexp "\\\\let\\\\[A-Za-z]*")) |
| 524 | @end lisp |
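
Both levels of escaping can be checked at the REPL. The following is
a sketch with made-up target strings; note that the targets are
string literals too, so their backslashes are also doubled in the
source.

@lisp
(use-modules (ice-9 regex))

;; Two backslashes in source: the regexp engine sees  ^\* [^:]*::
(match:substring
 (string-match "^\\* [^:]*::" "* Match Structures:: Finding matches."))
; @result{} "* Match Structures::"

;; Four backslashes in source: the regexp engine sees  \\let\\[A-Za-z]*
;; The target's actual contents are:  \let\foo=1
(define m (string-match "\\\\let\\\\[A-Za-z]*" "\\let\\foo=1"))

;; The matched text is \let\foo, which is eight characters long.
(string-length (match:substring m))
; @result{} 8
@end lisp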
| 525 | |
| 526 | The reason for the unwieldiness of this syntax is historical. Both |
| 527 | regular expression pattern matchers and Unix string processing systems |
| 528 | have traditionally used backslashes with the special meanings |
| 529 | described above. The POSIX regular expression specification and ANSI C |
| 530 | standard both require these semantics. Attempting to abandon either |
| 531 | convention would cause other kinds of compatibility problems, possibly |
| 532 | more severe ones. Therefore, without extending the Scheme reader to |
| 533 | support strings with different quoting conventions (an ungainly and |
| 534 | confusing extension when implemented in other languages), we must adhere |
| 535 | to this cumbersome escape syntax. |