doc/ref/api-peg.texi

   1 @c -*-texinfo-*-
   2 @c This is part of the GNU Guile Reference Manual.
   3 @c Copyright (C) 2006, 2010, 2011
   4 @c   Free Software Foundation, Inc.
   5 @c See the file guile.texi for copying conditions.
   6
   7 @node PEG Parsing
   8 @section PEG Parsing
   9
  10 Parsing Expression Grammars (PEGs) are a way of specifying formal
  11 languages for text processing.  They can be used either for matching
  12 (like regular expressions) or for building recursive descent parsers
  13 (like lex/yacc).  Guile uses a superset of PEG syntax that allows more
  14 control over what information is preserved during parsing.
  15
  16 Wikipedia has a clear and concise introduction to PEGs if you want to
  17 familiarize yourself with the syntax:
  18 @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
  19
  20 The module works by compiling PEGs down to lambda expressions.  These
  21 can either be stored in variables at compile-time by the define macros
  22 (@code{define-peg-pattern} and @code{define-peg-string-patterns}) or calculated
  23 explicitly at runtime with the compile functions
  24 (@code{compile-peg-pattern} and @code{peg-string-compile}).
  25
  26 They can then be used for either parsing (@code{match-pattern}) or searching
  27 (@code{search-for-pattern}).  For convenience, @code{search-for-pattern}
  28 also takes pattern literals in case you want to inline a simple search
  29 (people often use regular expressions this way).
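
As a minimal sketch of that workflow, the following loads the module
(Guile ships it as @code{(ice-9 peg)}) and then parses and searches with
a single pattern.  The pattern name @code{digits} and the sample strings
are made up for illustration.

@lisp
(use-modules (ice-9 peg))

;; One or more decimal digits; `body' keeps just the matched text.
(define-peg-pattern digits body (+ (range #\0 #\9)))

;; match-pattern only tries to parse from index 0.
(peg:substring (match-pattern digits "2011-06"))
;; @result{} "2011"

;; search-for-pattern scans forward until the pattern matches.
(peg:substring (search-for-pattern digits "year 2011"))
;; @result{} "2011"
@end lisp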
  30
The rest of this documentation consists of a syntax reference, an API
reference, a tutorial, and a description of the internals.
  33
  34 @menu
  35 * PEG Syntax Reference::
  36 * PEG API Reference::
  37 * PEG Tutorial::
  38 * PEG Internals::
  39 @end menu
  40
  41 @node PEG Syntax Reference
  42 @subsection PEG Syntax Reference
  43
@subsubheading Normal PEG Syntax
  45
  46 @deftp {PEG Pattern} sequence a b
  47 Parses @var{a}.  If this succeeds, continues to parse @var{b} from the
  48 end of the text parsed as @var{a}.  Succeeds if both @var{a} and
  49 @var{b} succeed.
  50
  51 @code{"a b"}
  52
  53 @code{(and a b)}
  54 @end deftp
  55
  56 @deftp {PEG Pattern} {ordered choice} a b
  57 Parses @var{a}.  If this fails, backtracks and parses @var{b}.
  58 Succeeds if either @var{a} or @var{b} succeeds.
  59
  60 @code{"a/b"}
  61
  62 @code{(or a b)}
  63 @end deftp
  64
  65 @deftp {PEG Pattern} {zero or more} a
  66 Parses @var{a} as many times in a row as it can, starting each @var{a}
  67 at the end of the text parsed by the previous @var{a}.  Always
  68 succeeds.
  69
  70 @code{"a*"}
  71
  72 @code{(* a)}
  73 @end deftp
  74
  75 @deftp {PEG Pattern} {one or more} a
  76 Parses @var{a} as many times in a row as it can, starting each @var{a}
  77 at the end of the text parsed by the previous @var{a}.  Succeeds if at
  78 least one @var{a} was parsed.
  79
  80 @code{"a+"}
  81
  82 @code{(+ a)}
  83 @end deftp
  84
  85 @deftp {PEG Pattern} optional a
Tries to parse @var{a}.  Always succeeds, whether or not @var{a}
matches; if @var{a} fails, no text is consumed.
  87
  88 @code{"a?"}
  89
  90 @code{(? a)}
  91 @end deftp
  92
  93 @deftp {PEG Pattern} {followed by} a
  94 Makes sure it is possible to parse @var{a}, but does not actually parse
  95 it.  Succeeds if @var{a} would succeed.
  96
  97 @code{"&a"}
  98
  99 @code{(followed-by a)}
 100 @end deftp
 101
 102 @deftp {PEG Pattern} {not followed by} a
 103 Makes sure it is impossible to parse @var{a}, but does not actually
 104 parse it.  Succeeds if @var{a} would fail.
 105
 106 @code{"!a"}
 107
 108 @code{(not-followed-by a)}
 109 @end deftp
 110
 111 @deftp {PEG Pattern} {string literal} ``abc''
Parses the string @code{"abc"}.  Succeeds if that parsing succeeds.
 113
 114 @code{"'abc'"}
 115
 116 @code{"abc"}
 117 @end deftp
 118
 119 @deftp {PEG Pattern} {any character}
 120 Parses any single character.  Succeeds unless there is no more text to
 121 be parsed.
 122
 123 @code{"."}
 124
 125 @code{peg-any}
 126 @end deftp
 127
 128 @deftp {PEG Pattern} {character class} a b
 129 Alternative syntax for ``Ordered Choice @var{a} @var{b}'' if @var{a} and
 130 @var{b} are characters.
 131
 132 @code{"[ab]"}
 133
 134 @code{(or "a" "b")}
 135 @end deftp
 136
 137 @deftp {PEG Pattern} {range of characters} a z
 138 Parses any character falling between @var{a} and @var{z}.
 139
 140 @code{"[a-z]"}
 141
 142 @code{(range #\a #\z)}
 143 @end deftp
 144
 145 Example:
 146
 147 @example
 148 "(a !b / c &d*) 'e'+"
 149 @end example
 150
 151 Would be:
 152
 153 @lisp
 154 (and
 155  (or
 156   (and a (not-followed-by b))
 157   (and c (followed-by (* d))))
 158  (+ "e"))
 159 @end lisp
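
The lookahead forms and @code{?} are easy to misread, so here is a small
sketch of how they should behave, assuming the @code{(ice-9 peg)} module
is loaded.  The pattern names @code{a-not-b} and @code{maybe-a} are
hypothetical.

@lisp
(use-modules (ice-9 peg))

;; "a" not followed by "b"; the lookahead itself consumes no input.
(define-peg-pattern a-not-b body (and "a" (not-followed-by "b")))
(match-pattern a-not-b "ab")            ; @result{} #f
(peg:end (match-pattern a-not-b "ac"))  ; @result{} 1

;; An optional "a" succeeds even when there is no "a" to consume.
(define-peg-pattern maybe-a body (? "a"))
(peg:end (match-pattern maybe-a "b"))   ; @result{} 0
(peg:end (match-pattern maybe-a "ab"))  ; @result{} 1
@end lisp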
 160
 161 @subsubheading Extended Syntax
 162
 163 There is some extra syntax for S-expressions.
 164
 165 @deftp {PEG Pattern} ignore a
Ignore the text matching @var{a}.
 167 @end deftp
 168
 169 @deftp {PEG Pattern} capture a
 170 Capture the text matching @var{a}.
 171 @end deftp
 172
 173 @deftp {PEG Pattern} peg a
 174 Embed the PEG pattern @var{a} using string syntax.
 175 @end deftp
 176
 177 Example:
 178
 179 @example
 180 "!a / 'b'"
 181 @end example
 182
 183 Is equivalent to
 184
 185 @lisp
 186 (or (peg "!a") "b")
 187 @end lisp
 188
 189 and
 190
 191 @lisp
 192 (or (not-followed-by a) "b")
 193 @end lisp
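
As a hedged illustration of @code{ignore}, the sketch below requires a
separator to be present but keeps it out of the parse tree.  The
nonterminal names @code{key}, @code{val} and @code{key-val} are invented
for this example, and the result shown is what the tagging rules
described in this manual should produce.

@lisp
(use-modules (ice-9 peg))

(define-peg-pattern key all (+ (range #\a #\z)))
(define-peg-pattern val all (+ (range #\0 #\9)))
;; The "=" must be present in the input, but is dropped from the tree.
(define-peg-pattern key-val all (and key (ignore "=") val))

(peg:tree (match-pattern key-val "x=42"))
;; @result{} (key-val (key "x") (val "42"))
@end lisp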
 194
 195 @node PEG API Reference
 196 @subsection PEG API Reference
 197
 198 @subsubheading Define Macros
 199
 200 The most straightforward way to define a PEG is by using one of the
 201 define macros (both of these macroexpand into @code{define}
 202 expressions).  These macros bind parsing functions to variables.  These
 203 parsing functions may be invoked by @code{match-pattern} or
 204 @code{search-for-pattern}, which return a PEG match record.  Raw data can be
 205 retrieved from this record with the PEG match deconstructor functions.
 206 More complicated (and perhaps enlightening) examples can be found in the
 207 tutorial.
 208
 209 @deffn {Scheme Macro} define-peg-string-patterns peg-string
 210 Defines all the nonterminals in the PEG @var{peg-string}.  More
 211 precisely, @code{define-peg-string-patterns} takes a superset of PEGs.  A normal PEG
 212 has a @code{<-} between the nonterminal and the pattern.
 213 @code{define-peg-string-patterns} uses this symbol to determine what information it
 214 should propagate up the parse tree.  The normal @code{<-} propagates the
 215 matched text up the parse tree, @code{<--} propagates the matched text
 216 up the parse tree tagged with the name of the nonterminal, and @code{<}
 217 discards that matched text and propagates nothing up the parse tree.
 218 Also, nonterminals may consist of any alphanumeric character or a ``-''
 219 character (in normal PEGs nonterminals can only be alphabetic).
 220
For example, if we define:
 222 @lisp
 223 (define-peg-string-patterns
 224   "as <- 'a'+
 225 bs <- 'b'+
 226 as-or-bs <- as/bs")
 227 (define-peg-string-patterns
 228   "as-tag <-- 'a'+
 229 bs-tag <-- 'b'+
 230 as-or-bs-tag <-- as-tag/bs-tag")
 231 @end lisp
 232 Then:
 233 @lisp
 234 (match-pattern as-or-bs "aabbcc") @result{}
 235 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 236 (match-pattern as-or-bs-tag "aabbcc") @result{}
 237 #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
 238 @end lisp
 239
 240 Note that in doing this, we have bound 6 variables at the toplevel
 241 (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
 242 @var{as-or-bs-tag}).
 243 @end deffn
 244
 245 @deffn {Scheme Macro} define-peg-pattern name capture-type peg-sexp
 246 Defines a single nonterminal @var{name}.  @var{capture-type} determines
 247 how much information is passed up the parse tree.  @var{peg-sexp} is a
 248 PEG in S-expression form.
 249
 250 Possible values for capture-type:
 251
 252 @table @code
 253 @item all
 254 passes the matched text up the parse tree tagged with the name of the
 255 nonterminal.
 256 @item body
 257 passes the matched text up the parse tree.
 258 @item none
 259 passes nothing up the parse tree.
 260 @end table
 261
For example, if we define:
 263 @lisp
 264 (define-peg-pattern as body (+ "a"))
 265 (define-peg-pattern bs body (+ "b"))
 266 (define-peg-pattern as-or-bs body (or as bs))
 267 (define-peg-pattern as-tag all (+ "a"))
 268 (define-peg-pattern bs-tag all (+ "b"))
 269 (define-peg-pattern as-or-bs-tag all (or as-tag bs-tag))
 270 @end lisp
 271 Then:
 272 @lisp
 273 (match-pattern as-or-bs "aabbcc") @result{}
 274 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 275 (match-pattern as-or-bs-tag "aabbcc") @result{}
 276 #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
 277 @end lisp
 278
 279 Note that in doing this, we have bound 6 variables at the toplevel
 280 (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
 281 @var{as-or-bs-tag}).
 282 @end deffn
 283
 284 @subsubheading Compile Functions
 285 It is sometimes useful to be able to compile anonymous PEG patterns at
 286 runtime.  These functions let you do that using either syntax.
 287
 288 @deffn {Scheme Procedure} peg-string-compile peg-string capture-type
 289 Compiles the PEG pattern in @var{peg-string} propagating according to
 290 @var{capture-type} (capture-type can be any of the values from
 291 @code{define-peg-pattern}).
 292 @end deffn
 293
 294
 295 @deffn {Scheme Procedure} compile-peg-pattern peg-sexp capture-type
 296 Compiles the PEG pattern in @var{peg-sexp} propagating according to
 297 @var{capture-type} (capture-type can be any of the values from
 298 @code{define-peg-pattern}).
 299 @end deffn
 300
 301 The functions return syntax objects, which can be useful if you want to
 302 use them in macros. If all you want is to define a new nonterminal, you
 303 can do the following:
 304
 305 @lisp
 306 (define exp '(+ "a"))
 307 (define as (compile (compile-peg-pattern exp 'body)))
 308 @end lisp
 309
 310 You can use this nonterminal with all of the regular PEG functions:
 311
 312 @lisp
 313 (match-pattern as "aaaaa") @result{}
#<peg start: 0 end: 5 string: aaaaa tree: aaaaa>
 315 @end lisp
 316
 317 @subsubheading Parsing & Matching Functions
 318
For our purposes, ``parsing'' means parsing a string into a tree
starting from the first character, while ``matching'' means searching
through the string for a substring.  Despite its name,
@code{match-pattern} is the parsing function: it gives up if it can't
find a valid parse starting at index 0, whereas
@code{search-for-pattern} tries each successive starting index until
one succeeds.  Apart from that difference, both are equally capable of
``parsing'' and ``matching''.
 326
 327 @deffn {Scheme Procedure} match-pattern nonterm string
 328 Parses @var{string} using the PEG stored in @var{nonterm}.  If no match
 329 was found, @code{match-pattern} returns false.  If a match was found, a PEG
 330 match record is returned.
 331
 332 The @code{capture-type} argument to @code{define-peg-pattern} allows you to
 333 choose what information to hold on to while parsing.  The options are:
 334
 335 @table @code
 336 @item all
 337 tag the matched text with the nonterminal
 338 @item body
 339 just the matched text
 340 @item none
 341 nothing
 342 @end table
 343
 344 @lisp
 345 (define-peg-pattern as all (+ "a"))
 346 (match-pattern as "aabbcc") @result{}
 347 #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
 348
 349 (define-peg-pattern as body (+ "a"))
 350 (match-pattern as "aabbcc") @result{}
 351 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 352
 353 (define-peg-pattern as none (+ "a"))
 354 (match-pattern as "aabbcc") @result{}
 355 #<peg start: 0 end: 2 string: aabbcc tree: ()>
 356
 357 (define-peg-pattern bs body (+ "b"))
 358 (match-pattern bs "aabbcc") @result{}
 359 #f
 360 @end lisp
 361 @end deffn
 362
 363 @deffn {Scheme Macro} search-for-pattern nonterm-or-peg string
 364 Searches through @var{string} looking for a matching subexpression.
 365 @var{nonterm-or-peg} can either be a nonterminal or a literal PEG
 366 pattern.  When a literal PEG pattern is provided, @code{search-for-pattern} works
 367 very similarly to the regular expression searches many hackers are used
 368 to.  If no match was found, @code{search-for-pattern} returns false.  If a match
 369 was found, a PEG match record is returned.
 370
 371 @lisp
 372 (define-peg-pattern as body (+ "a"))
 373 (search-for-pattern as "aabbcc") @result{}
 374 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 375 (search-for-pattern (+ "a") "aabbcc") @result{}
 376 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 377 (search-for-pattern "'a'+" "aabbcc") @result{}
 378 #<peg start: 0 end: 2 string: aabbcc tree: aa>
 379
 380 (define-peg-pattern as all (+ "a"))
 381 (search-for-pattern as "aabbcc") @result{}
 382 #<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
 383
 384 (define-peg-pattern bs body (+ "b"))
 385 (search-for-pattern bs "aabbcc") @result{}
 386 #<peg start: 2 end: 4 string: aabbcc tree: bb>
 387 (search-for-pattern (+ "b") "aabbcc") @result{}
 388 #<peg start: 2 end: 4 string: aabbcc tree: bb>
 389 (search-for-pattern "'b'+" "aabbcc") @result{}
 390 #<peg start: 2 end: 4 string: aabbcc tree: bb>
 391
 392 (define-peg-pattern zs body (+ "z"))
 393 (search-for-pattern zs "aabbcc") @result{}
 394 #f
 395 (search-for-pattern (+ "z") "aabbcc") @result{}
 396 #f
 397 (search-for-pattern "'z'+" "aabbcc") @result{}
 398 #f
 399 @end lisp
 400 @end deffn
 401
 402 @subsubheading PEG Match Records
 403 The @code{match-pattern} and @code{search-for-pattern} functions both return PEG
 404 match records.  Actual information can be extracted from these with the
 405 following functions.
 406
 407 @deffn {Scheme Procedure} peg:string match-record
Returns the original string that was parsed in the creation of
@var{match-record}.
 410 @end deffn
 411
 412 @deffn {Scheme Procedure} peg:start match-record
 413 Returns the index of the first parsed character in the original string
 414 (from @code{peg:string}).  If this is the same as @code{peg:end},
 415 nothing was parsed.
 416 @end deffn
 417
 418 @deffn {Scheme Procedure} peg:end match-record
 419 Returns one more than the index of the last parsed character in the
 420 original string (from @code{peg:string}).  If this is the same as
 421 @code{peg:start}, nothing was parsed.
 422 @end deffn
 423
 424 @deffn {Scheme Procedure} peg:substring match-record
Returns the substring parsed by @var{match-record}.  This is equivalent to
 426 @code{(substring (peg:string match-record) (peg:start match-record) (peg:end
 427 match-record))}.
 428 @end deffn
 429
 430 @deffn {Scheme Procedure} peg:tree match-record
Returns the tree parsed by @var{match-record}.
 432 @end deffn
 433
 434 @deffn {Scheme Procedure} peg-record? match-record
Returns true if @var{match-record} is a PEG match record, or false
otherwise.
 437 @end deffn
 438
 439 Example:
 440 @lisp
 441 (define-peg-pattern bs all (peg "'b'+"))
 442
 443 (search-for-pattern bs "aabbcc") @result{}
 444 #<peg start: 2 end: 4 string: aabbcc tree: (bs bb)>
 445
 446 (let ((pm (search-for-pattern bs "aabbcc")))
 447    `((string ,(peg:string pm))
 448      (start ,(peg:start pm))
 449      (end ,(peg:end pm))
 450      (substring ,(peg:substring pm))
 451      (tree ,(peg:tree pm))
 452      (record? ,(peg-record? pm)))) @result{}
 453 ((string "aabbcc")
 454  (start 2)
 455  (end 4)
 456  (substring "bb")
 457  (tree (bs "bb"))
 458  (record? #t))
 459 @end lisp
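
The accessors also make it straightforward to collect every match in a
string rather than only the first.  The helper below is a sketch, not
part of the PEG module: @code{all-digit-runs} and @code{digits} are
hypothetical names, and the loop assumes the pattern never matches the
empty string (otherwise it would not advance).

@lisp
(use-modules (ice-9 peg))

(define-peg-pattern digits body (+ (range #\0 #\9)))

;; Collect every non-overlapping run of digits in STR, left to right.
(define (all-digit-runs str)
  (let loop ((rest str) (found '()))
    (let ((m (search-for-pattern digits rest)))
      (if m
          ;; Resume searching just past the end of this match.
          (loop (substring rest (peg:end m))
                (cons (peg:substring m) found))
          (reverse found)))))

(all-digit-runs "a1b22c333")
;; @result{} ("1" "22" "333")
@end lisp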
 460
 461 @subsubheading Miscellaneous
 462
 463 @deffn {Scheme Procedure} context-flatten tst lst
 464 Takes a predicate @var{tst} and a list @var{lst}.  Flattens @var{lst}
 465 until all elements are either atoms or satisfy @var{tst}.  If @var{lst}
 466 itself satisfies @var{tst}, @code{(list lst)} is returned (this is a
 467 flat list whose only element satisfies @var{tst}).
 468
 469 @lisp
(context-flatten
 (lambda (x) (and (number? (car x)) (= (car x) 1)))
 '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
(2 2 (1 1 (2 2)) 2 2 (1 1))
(context-flatten
 (lambda (x) (and (number? (car x)) (= (car x) 1)))
 '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
((1 1 (1 1 (2 2)) (2 2 (1 1))))
 474 @end lisp
 475
 476 If you're wondering why this is here, take a look at the tutorial.
 477 @end deffn
 478
 479 @deffn {Scheme Procedure} keyword-flatten terms lst
A less general form of @code{context-flatten}.  Takes a list of terminal
atoms @var{terms} and flattens @var{lst} until all elements are either
atoms, or lists which have an atom from @var{terms} as their first
element.
 484 @lisp
 485 (keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{}
 486 (c a b (a c) (b c) c (b a) c a)
 487 @end lisp
 488
 489 If you're wondering why this is here, take a look at the tutorial.
 490 @end deffn
 491
 492 @node PEG Tutorial
 493 @subsection PEG Tutorial
 494
 495 @subsubheading Parsing /etc/passwd
 496 This example will show how to parse /etc/passwd using PEGs.
 497
 498 First we define an example /etc/passwd file:
 499
 500 @lisp
 501 (define *etc-passwd*
 502   "root:x:0:0:root:/root:/bin/bash
 503 daemon:x:1:1:daemon:/usr/sbin:/bin/sh
 504 bin:x:2:2:bin:/bin:/bin/sh
 505 sys:x:3:3:sys:/dev:/bin/sh
 506 nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
 507 messagebus:x:103:107::/var/run/dbus:/bin/false
 508 ")
 509 @end lisp
 510
 511 As a first pass at this, we might want to have all the entries in
 512 /etc/passwd in a list.
 513
 514 Doing this with string-based PEG syntax would look like this:
 515 @lisp
 516 (define-peg-string-patterns
 517   "passwd <- entry* !.
 518 entry <-- (! NL .)* NL*
 519 NL < '\n'")
 520 @end lisp
 521
 522 A @code{passwd} file is 0 or more entries (@code{entry*}) until the end
 523 of the file (@code{!.} (@code{.} is any character, so @code{!.} means
 524 ``not anything'')).  We want to capture the data in the nonterminal
 525 @code{passwd}, but not tag it with the name, so we use @code{<-}.
 526
 527 An entry is a series of 0 or more characters that aren't newlines
 528 (@code{(! NL .)*}) followed by 0 or more newlines (@code{NL*}).  We want
 529 to tag all the entries with @code{entry}, so we use @code{<--}.
 530
 531 A newline is just a literal newline (@code{'\n'}).  We don't want a
 532 bunch of newlines cluttering up the output, so we use @code{<} to throw
 533 away the captured data.
 534
 535 Here is the same PEG defined using S-expressions:
 536 @lisp
 537 (define-peg-pattern passwd body (and (* entry) (not-followed-by peg-any)))
 538 (define-peg-pattern entry all (and (* (and (not-followed-by NL) peg-any))
 539                                (* NL)))
 540 (define-peg-pattern NL none "\n")
 541 @end lisp
 542
 543 Obviously this is much more verbose.  On the other hand, it's more
 544 explicit, and thus easier to build automatically.  However, there are
 545 some tricks that make S-expressions easier to use in some cases.  One is
 546 the @code{ignore} keyword; the string syntax has no way to say ``throw
 547 away this text'' except breaking it out into a separate nonterminal.
 548 For instance, to throw away the newlines we had to define @code{NL}.  In
 549 the S-expression syntax, we could have simply written @code{(ignore
 550 "\n")}.  Also, for the cases where string syntax is really much cleaner,
 551 the @code{peg} keyword can be used to embed string syntax in
 552 S-expression syntax.  For instance, we could have written:
 553
 554 @lisp
 555 (define-peg-pattern passwd body (peg "entry* !."))
 556 @end lisp
 557
 558 However we define it, parsing @code{*etc-passwd*} with the @code{passwd}
 559 nonterminal yields the same results:
 560
 561 @lisp
 562 (peg:tree (match-pattern passwd *etc-passwd*)) @result{}
 563 ((entry "root:x:0:0:root:/root:/bin/bash")
 564  (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
 565  (entry "bin:x:2:2:bin:/bin:/bin/sh")
 566  (entry "sys:x:3:3:sys:/dev:/bin/sh")
 567  (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
 568  (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
 569 @end lisp
 570
 571 However, here is something to be wary of:
 572
 573 @lisp
 574 (peg:tree (match-pattern passwd "one entry")) @result{}
 575 (entry "one entry")
 576 @end lisp
 577
 578 By default, the parse trees generated by PEGs are compressed as much as
 579 possible without losing information.  It may not look like this is what
 580 you want at first, but uncompressed parse trees are an enormous headache
 581 (there's no easy way to predict how deep particular lists will nest,
 582 there are empty lists littered everywhere, etc. etc.).  One side-effect
 583 of this, however, is that sometimes the compressor is too aggressive.
 584 No information is discarded when @code{((entry "one entry"))} is
 585 compressed to @code{(entry "one entry")}, but in this particular case it
 586 probably isn't what we want.
 587
 588 There are two functions for easily dealing with this:
 589 @code{keyword-flatten} and @code{context-flatten}.  The
 590 @code{keyword-flatten} function takes a list of keywords and a list to
 591 flatten, then tries to coerce the list such that the first element of
 592 all sublists is one of the keywords.  The @code{context-flatten}
 593 function is similar, but instead of a list of keywords it takes a
 594 predicate that should indicate whether a given sublist is good enough
 595 (refer to the API reference for more details).
 596
 597 What we want here is @code{keyword-flatten}.
 598 @lisp
 599 (keyword-flatten '(entry) (peg:tree (match-pattern passwd *etc-passwd*))) @result{}
 600 ((entry "root:x:0:0:root:/root:/bin/bash")
 601  (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
 602  (entry "bin:x:2:2:bin:/bin:/bin/sh")
 603  (entry "sys:x:3:3:sys:/dev:/bin/sh")
 604  (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
 605  (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
 606 (keyword-flatten '(entry) (peg:tree (match-pattern passwd "one entry"))) @result{}
 607 ((entry "one entry"))
 608 @end lisp
 609
 610 Of course, this is a somewhat contrived example.  In practice we would
 611 probably just tag the @code{passwd} nonterminal to remove the ambiguity
 612 (using either the @code{all} keyword for S-expressions or the @code{<--}
symbol for strings).
 614
 615 @lisp
 616 (define-peg-pattern tag-passwd all (peg "entry* !."))
 617 (peg:tree (match-pattern tag-passwd *etc-passwd*)) @result{}
 618 (tag-passwd
 619   (entry "root:x:0:0:root:/root:/bin/bash")
 620   (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
 621   (entry "bin:x:2:2:bin:/bin:/bin/sh")
 622   (entry "sys:x:3:3:sys:/dev:/bin/sh")
 623   (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
 624   (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
(peg:tree (match-pattern tag-passwd "one entry")) @result{}
 626 (tag-passwd
 627   (entry "one entry"))
 628 @end lisp
 629
 630 If you're ever uncertain about the potential results of parsing
 631 something, remember the two absolute rules:
 632 @enumerate
 633 @item
 634 No parsing information will ever be discarded.
 635 @item
 636 There will never be any lists with fewer than 2 elements.
 637 @end enumerate
 638
For the purposes of (1), ``parsing information'' means things tagged with
the @code{all} keyword or the @code{<--} symbol.  Plain strings will be
concatenated.
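
The difference between plain and tagged text can be seen in a small
sketch.  The pattern names below are hypothetical, and the results are
what the two rules above should give.

@lisp
(use-modules (ice-9 peg))

;; Untagged strings are concatenated into a single string.
(define-peg-pattern ab body (and "a" "b"))
(peg:tree (match-pattern ab "ab"))
;; @result{} "ab"

;; A tagged nonterminal keeps its own node, so nothing is merged.
(define-peg-pattern a-tag all "a")
(define-peg-pattern ab-tag all (and a-tag "b"))
(peg:tree (match-pattern ab-tag "ab"))
;; @result{} (ab-tag (a-tag "a") "b")
@end lisp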
 642
 643 Let's extend this example a bit more and actually pull some useful
 644 information out of the passwd file:
 645
 646 @lisp
 647 (define-peg-string-patterns
 648   "passwd <-- entry* !.
 649 entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL*
 650 login <-- text
 651 pass <-- text
 652 uid <-- [0-9]*
 653 gid <-- [0-9]*
 654 nameORcomment <-- text
 655 homedir <-- path
 656 shell <-- path
 657 path <-- (SLASH pathELEMENT)*
 658 pathELEMENT <-- (!NL !C  !'/' .)*
 659 text <- (!NL !C  .)*
 660 C < ':'
 661 NL < '\n'
 662 SLASH < '/'")
 663 @end lisp
 664
 665 This produces rather pretty parse trees:
 666 @lisp
 667 (passwd
 668   (entry (login "root")
 669          (pass "x")
 670          (uid "0")
 671          (gid "0")
 672          (nameORcomment "root")
 673          (homedir (path (pathELEMENT "root")))
 674          (shell (path (pathELEMENT "bin") (pathELEMENT "bash"))))
 675   (entry (login "daemon")
 676          (pass "x")
 677          (uid "1")
 678          (gid "1")
 679          (nameORcomment "daemon")
 680          (homedir
 681            (path (pathELEMENT "usr") (pathELEMENT "sbin")))
 682          (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
 683   (entry (login "bin")
 684          (pass "x")
 685          (uid "2")
 686          (gid "2")
 687          (nameORcomment "bin")
 688          (homedir (path (pathELEMENT "bin")))
 689          (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
 690   (entry (login "sys")
 691          (pass "x")
 692          (uid "3")
 693          (gid "3")
 694          (nameORcomment "sys")
 695          (homedir (path (pathELEMENT "dev")))
 696          (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
 697   (entry (login "nobody")
 698          (pass "x")
 699          (uid "65534")
 700          (gid "65534")
 701          (nameORcomment "nobody")
 702          (homedir (path (pathELEMENT "nonexistent")))
 703          (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
 704   (entry (login "messagebus")
 705          (pass "x")
 706          (uid "103")
 707          (gid "107")
 708          nameORcomment
 709          (homedir
 710            (path (pathELEMENT "var")
 711                  (pathELEMENT "run")
 712                  (pathELEMENT "dbus")))
 713          (shell (path (pathELEMENT "bin") (pathELEMENT "false")))))
 714 @end lisp
 715
 716 Notice that when there's no entry in a field (e.g. @code{nameORcomment}
 717 for messagebus) the symbol is inserted.  This is the ``don't throw away
any information'' rule---we successfully matched a @code{nameORcomment}
 719 of 0 characters (since we used @code{*} when defining it).  This is
 720 usually what you want, because it allows you to e.g. use @code{list-ref}
 721 to pull out elements (since they all have known offsets).
 722
 723 If you'd prefer not to have symbols for empty matches, you can replace
 724 the @code{*} with a @code{+} and add a @code{?} after the
 725 @code{nameORcomment} in @code{entry}.  Then it will try to parse 1 or
 726 more characters, fail (inserting nothing into the parse tree), but
 727 continue because it didn't have to match the nameORcomment to continue.
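
Because every field sits at a known offset, pulling values out of an
entry needs only ordinary list operations.  The following sketch works
on a literal copy of one entry from the tree above, so it runs without
the PEG module.  The helper @code{field-text} is hypothetical, and the
last call uses a deliberately shortened entry to show the empty-field
case.

@lisp
;; One entry copied from the parse tree above.
(define root-entry
  '(entry (login "root") (pass "x") (uid "0") (gid "0")
          (nameORcomment "root")
          (homedir (path (pathELEMENT "root")))
          (shell (path (pathELEMENT "bin") (pathELEMENT "bash")))))

;; Return the text of element N of ENTRY (element 0 is the tag), or ""
;; when the field matched zero characters and is a bare symbol.
(define (field-text entry n)
  (let ((field (list-ref entry n)))
    (if (symbol? field) "" (cadr field))))

(field-text root-entry 1)                   ; @result{} "root"
(string->number (field-text root-entry 3))  ; @result{} 0
(field-text '(entry (login "messagebus") nameORcomment) 2)
;; @result{} ""
@end lisp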
 728
 729
 730 @subsubheading Embedding Arithmetic Expressions
 731
 732 We can parse simple mathematical expressions with the following PEG:
 733
 734 @lisp
 735 (define-peg-string-patterns
 736   "expr <- sum
 737 sum <-- (product ('+' / '-') sum) / product
 738 product <-- (value ('*' / '/') product) / value
 739 value <-- number / '(' expr ')'
 740 number <-- [0-9]+")
 741 @end lisp
 742
 743 Then:
 744 @lisp
 745 (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2")) @result{}
 746 (sum (product (value (number "1")))
 747      "+"
 748      (sum (product
 749             (value (number "1"))
 750             "/"
 751             (product
 752               (value (number "2"))
 753               "*"
 754               (product (value (number "3")))))
 755           "+"
 756           (sum (product
 757                  (value "("
 758                         (sum (product (value (number "1")))
 759                              "+"
 760                              (sum (product (value (number "1")))))
 761                         ")")
 762                  "/"
 763                  (product (value (number "2")))))))
 764 @end lisp
 765
 766 There is very little wasted effort in this PEG.  The @code{number}
 767 nonterminal has to be tagged because otherwise the numbers might run
 768 together with the arithmetic expressions during the string concatenation
 769 stage of parse-tree compression (the parser will see ``1'' followed by
 770 ``/'' and decide to call it ``1/'').  When in doubt, tag.
 771
 772 It is very easy to turn these parse trees into lisp expressions:
 773
 774 @lisp
 775 (define (parse-sum sum left . rest)
 776   (if (null? rest)
 777       (apply parse-product left)
 778       (list (string->symbol (car rest))
 779             (apply parse-product left)
 780             (apply parse-sum (cadr rest)))))
 781
 782 (define (parse-product product left . rest)
 783   (if (null? rest)
 784       (apply parse-value left)
 785       (list (string->symbol (car rest))
 786             (apply parse-value left)
 787             (apply parse-product (cadr rest)))))
 788
 789 (define (parse-value value first . rest)
 790   (if (null? rest)
 791       (string->number (cadr first))
 792       (apply parse-sum (car rest))))
 793
 794 (define parse-expr parse-sum)
 795 @end lisp
 796
 797 (Notice all these functions look very similar; for a more complicated
 798 PEG, it would be worth abstracting.)
 799
 800 Then:
 801 @lisp
 802 (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
 803 (+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2)))
 804 @end lisp
 805
 806 But wait!  The associativity is wrong!  Where it says @code{(/ 1 (* 2
 807 3))}, it should say @code{(* (/ 1 2) 3)}.
 808
 809 It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-')
 810 sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) /
 811 product"}, but this is a Bad Idea.  PEGs don't support left recursion.
 812 To see why, imagine what the parser will do here.  When it tries to
 813 parse @code{sum}, it first has to try and parse @code{sum}.  But to do
 814 that, it first has to try and parse @code{sum}.  This will continue
 815 until the stack gets blown off.
 816
 817 So how does one parse left-associative binary operators with PEGs?
 818 Honestly, this is one of their major shortcomings.  There's no
 819 general-purpose way of doing this, but here the repetition operators are
 820 a good choice:
 821
 822 @lisp
 823 (use-modules (srfi srfi-1))
 824
 825 (define-peg-string-patterns
 826   "expr <- sum
 827 sum <-- (product ('+' / '-'))* product
 828 product <-- (value ('*' / '/'))* value
 829 value <-- number / '(' expr ')'
 830 number <-- [0-9]+")
 831
 832 ;; take a deep breath...
 833 (define (make-left-parser next-func)
 834   (lambda (sum first . rest) ;; general form, comments below assume
 835     ;; that we're dealing with a sum expression
 836     (if (null? rest) ;; form (sum (product ...))
 837       (apply next-func first)
 838       (if (string? (cadr first));; form (sum ((product ...) "+") (product ...))
 839           (list (string->symbol (cadr first))
 840                 (apply next-func (car first))
 841                 (apply next-func (car rest)))
 842           ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
 843           (car
 844            (reduce ;; walk through the list and build a left-associative tree
 845             (lambda (l r)
 846               (list (list (cadr r) (car r) (apply next-func (car l)))
 847                     (string->symbol (cadr l))))
 848             'ignore
 849             (append ;; make a list of all the products
 850              ;; the first one should be pre-parsed
 851              (list (list (apply next-func (caar first))
 852                          (string->symbol (cadar first))))
 853              (cdr first)
 854              ;; the last one has to be added in
 855              (list (append rest '("done"))))))))))
 856
 857 (define (parse-value value first . rest)
 858   (if (null? rest)
 859       (string->number (cadr first))
 860       (apply parse-sum (car rest))))
 861 (define parse-product (make-left-parser parse-value))
 862 (define parse-sum (make-left-parser parse-product))
 863 (define parse-expr parse-sum)
 864 @end lisp
 865
 866 Then:
 867 @lisp
 868 (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
 869 (+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))
 870 @end lisp
 871
 872 As you can see, this is much uglier (it could be made prettier by using
 873 @code{context-flatten}, but the way it's written above makes it clear
 874 how we deal with the three ways the zero-or-more @code{*} expression can
 875 parse).  Fortunately, most of the time we can get away with only using
 876 right-associativity.
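
To see why the grouping matters, here is a minimal sketch in plain
Scheme that does not depend on the PEG module.  The helper
@code{left-assoc} is a hypothetical name introduced only for this
illustration; it folds a flat list of operator/operand pairs into the
same kind of left-leaning tree that @code{make-left-parser} builds.

@lisp
(use-modules (srfi srfi-1))

;; Hypothetical helper, not part of (ice-9 peg): FIRST is the leftmost
;; operand and PAIRS is a list of (operator operand) lists.
(define (left-assoc first pairs)
  (fold (lambda (pair tree)
          (list (car pair) tree (cadr pair)))
        first
        pairs))

(left-assoc 1 '((- 2) (- 3)))    ; @result{} (- (- 1 2) 3)

;; Left and right grouping give different values for subtraction:
(primitive-eval '(- (- 1 2) 3))  ; @result{} -4
(primitive-eval '(- 1 (- 2 3)))  ; @result{} 2
@end lisp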
 877
 878 @subsubheading Simplified Functions
 879
 880 For a more tantalizing example, consider the following grammar that
 881 parses (highly) simplified C functions:
 882
 883 @lisp
 884 (define-peg-string-patterns
 885   "cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB
 886 ctype <-- cidentifier
 887 cname <-- cidentifier
 888 cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP
 889 carg <-- cSP ctype cSP cname
 890 cbody <-- cstatement *
 891 cidentifier <- [a-zA-Z][a-zA-Z0-9_]*
 892 cstatement <-- (!';'.)*cSC cSP
 893 cSC < ';'
 894 cCOMMA < ','
 895 cLP < '('
 896 cRP < ')'
 897 cLB < '@{'
 898 cRB < '@}'
 899 cSP < [ \t\n]*")
 900 @end lisp
 901
 902 Then:
 903 @lisp
 904 (match-pattern cfunc "int square(int a) @{ return a*a;@}") @result{}
 905 (32
 906  (cfunc (ctype "int")
 907         (cname "square")
 908         (cargs (carg (ctype "int") (cname "a")))
 909         (cbody (cstatement "return a*a"))))
 910 @end lisp
 911
 912 And:
 913 @lisp
 914 (match-pattern cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{}
 915 (51
 916  (cfunc (ctype "int")
 917         (cname "mod")
 918         (cargs (carg (ctype "int") (cname "a"))
 919                (carg (ctype "int") (cname "b")))
 920         (cbody (cstatement "int c = a/b")
 921                (cstatement "return a-b*c"))))
 922 @end lisp
 923
 924 By wrapping all the @code{carg} nonterminals in a @code{cargs}
 925 nonterminal, we were able to remove any ambiguity in the parsing
 926 structure and avoid having to call @code{context-flatten} on the output
 927 of @code{match-pattern}.  We used the same trick with the @code{cstatement}
 928 nonterminals, wrapping them in a @code{cbody} nonterminal.
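
Because the wrapper is always present, code that consumes the tree can
rely on a single shape.  As a small sketch, the hypothetical helper
@code{arg-names} below (not part of the PEG module) pulls the argument
names out of a tree of the shape shown above.  The tree is written as a
quoted literal, with the @code{cbody} part left out for brevity.

@lisp
;; The tree for the "mod" example, minus its cbody, as a literal.
(define mod-tree
  '(cfunc (ctype "int")
          (cname "mod")
          (cargs (carg (ctype "int") (cname "a"))
                 (carg (ctype "int") (cname "b")))))

;; Hypothetical helper: every child of cargs is a carg, so no special
;; case is needed for functions with a single argument.
(define (arg-names cfunc-tree)
  (map (lambda (carg) (cadr (assq 'cname (cdr carg))))
       (cdr (assq 'cargs (cdr cfunc-tree)))))

(arg-names mod-tree)  ; @result{} ("a" "b")
@end lisp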
 929
 930 The whitespace nonterminal @code{cSP} used here is a (very) useful
 931 instantiation of a common pattern for matching syntactically irrelevant
 932 information.  Since it's tagged with @code{<} and ends with @code{*} it
 933 won't clutter up the parse trees (all the empty lists will be discarded
 934 during the compression step) and it will never cause parsing to fail.
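
The same pattern can be tried in isolation.  The grammar below is a
made-up example (the nonterminals @code{wlist}, @code{word}, and
@code{wSP} do not appear elsewhere in this manual) that parses a list of
lowercase words separated by optional whitespace.  It is only a sketch,
but the two inputs would be expected to produce the same tree, since the
whitespace is matched and then discarded.

@lisp
(use-modules (ice-9 peg))

;; Hypothetical grammar: wSP is tagged with < so it captures nothing,
;; and ends with * so it also matches the empty string.
(define-peg-string-patterns
  "wlist <-- wSP (word wSP)*
word <-- [a-z]+
wSP < [ \t\n]*")

;; Same words, different amounts of whitespace.
(peg:tree (match-pattern wlist "foo bar"))
(peg:tree (match-pattern wlist "  foo \n bar  "))
@end lisp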
 935
 936 @node PEG Internals
 937 @subsection PEG Internals
 938
 939 A PEG parser takes a string as input and attempts to parse it as a given
 940 nonterminal. The key idea of the PEG implementation is that every
 941 nonterminal is just a function that takes a string as an argument and
 942 attempts to parse that string as its nonterminal. The functions always
 943 start from the beginning, but a parse is considered successful even if
 944 there is material left over at the end.
 945
 946 This makes it easy to model different PEG parsing operations. For
 947 instance, consider the PEG grammar @code{"ab"}, which could also be
 948 written @code{(and "a" "b")}. It matches the string ``ab''. Here's how
 949 that might be implemented in the PEG style:
 950
 951 @lisp
 952 (define (match-and-a-b str)
 953   (match-a str)
 954   (match-b str))
 955 @end lisp
 956
 957 As you can see, the use of functions provides an easy way to model
 958 sequencing. In a similar way, one could model @code{(or a b)} with
 959 something like the following:
 960
 961 @lisp
 962 (define (match-or-a-b str)
 963   (or (match-a str) (match-b str)))
 964 @end lisp
 965
 966 Here the semantics of a PEG @code{or} expression map naturally onto
 967 Scheme's @code{or} operator. This function will attempt to run
 968 @code{(match-a str)}, and return its result if it succeeds. Otherwise it
 969 will run @code{(match-b str)}.
 970
 971 Of course, the code above wouldn't quite work. We need some way for the
 972 parsing functions to communicate. The actual interface used is below.
 973
 974 @subsubheading Parsing Function Interface
 975
 976 A parsing function takes three arguments: a string, the length of that
 977 string, and the position in that string it should start parsing at. In
 978 effect, the parsing functions pass around substrings in pieces: the
 979 first argument is a buffer of characters, and the other two give a
 980 range within that buffer that the parsing function should look at.
 981
 982 Parsing functions return either @code{#f}, if they failed to match their
 983 nonterminal, or a list whose first element must be an integer
 984 representing the final position in the string they matched and whose cdr
 985 can be any other data the function wishes to return, or @code{'()} if it
 986 doesn't have any more data.
 987
 988 The one caveat is that if the extra data it returns is a list, any
 989 adjacent strings in that list will be appended by @code{match-pattern}. For
 990 instance, if a parsing function returns @code{(13 ("a" "b" "c"))},
 991 @code{match-pattern} will take @code{(13 ("abc"))} as its value.
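
The effect of that appending step can be illustrated with a few lines of
plain Scheme.  The procedure below is only a sketch of the behaviour
described here, under the assumption that nothing else in the list is
touched; it is not the code @code{match-pattern} actually uses, and
@code{merge-adjacent-strings} is a made-up name.

@lisp
(use-modules (srfi srfi-1))

(define (merge-adjacent-strings lst)
  (fold-right
   (lambda (x acc)
     (if (and (string? x) (pair? acc) (string? (car acc)))
         ;; X and the element after it are both strings: join them.
         (cons (string-append x (car acc)) (cdr acc))
         (cons x acc)))
   '()
   lst))

(merge-adjacent-strings '("a" "b" "c"))      ; @result{} ("abc")
(merge-adjacent-strings '("a" (x) "b" "c"))  ; @result{} ("a" (x) "bc")
@end lisp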
 992
 993 For example, here is a function to match ``ab'' using the actual
 994 interface.
 995
 996 @lisp
 997 (define (match-a-b str len pos)
 998    (and (<= (+ pos 2) len)
 999         (string= str "ab" pos (+ pos 2))
1000         (list (+ pos 2) '()))) ; we return no extra information
1001 @end lisp
1002
1003 The above function can be used to match a string by running
1004 @code{(match-pattern match-a-b "ab")}.
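
The naive @code{match-and-a-b} and @code{match-or-a-b} sketches from the
start of this section can be redone against this interface.  The
following is a minimal sketch in plain Scheme; the names
@code{match-char}, @code{match-seq}, and @code{match-choice} are
invented for this illustration and are not exported by any PEG module.

@lisp
;; Return a parsing function that matches the single character CH.
(define (match-char ch)
  (lambda (str len pos)
    (and (< pos len)
         (char=? (string-ref str pos) ch)
         (list (+ pos 1) '()))))

;; Sequencing: B starts where A stopped, so the end position returned
;; by A is passed on as the start position for B.
(define (match-seq a b)
  (lambda (str len pos)
    (let ((ra (a str len pos)))
      (and ra (b str len (car ra))))))

;; Ordered choice: try A, and fall back to B from the same position.
(define (match-choice a b)
  (lambda (str len pos)
    (or (a str len pos)
        (b str len pos))))

(define match-ab (match-seq (match-char #\a) (match-char #\b)))
(define match-a-or-b (match-choice (match-char #\a) (match-char #\b)))

(match-ab "abc" 3 0)     ; @result{} (2 ())
(match-ab "ba" 2 0)      ; @result{} #f
(match-a-or-b "ba" 2 0)  ; @result{} (1 ())
@end lisp

For brevity, @code{match-seq} returns only the data produced by its
second argument; a fuller version would combine the data from both.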
1005
1006 @subsubheading Code Generators and Extensible Syntax
1007
1008 PEG expressions, such as those in a @code{define-peg-pattern} form, are
1009 interpreted internally in two steps.
1010
1011 First, any string PEG is expanded into an s-expression PEG by the code
1012 in the @code{(ice-9 peg string-peg)} module.
1013
1014 Then, the s-expression PEG that results is compiled into a parsing
1015 function by the @code{(ice-9 peg codegen)} module. In particular, the
1016 function @code{compile-peg-pattern} is called on the s-expression. It then
1017 decides what to do based on the form it is passed.
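
As a small illustration of the first step, the two definitions below
are intended to describe the same pattern, once as a string PEG and once
as the s-expression PEG it corresponds to.  The nonterminal names
@code{abStr} and @code{abSexp} are made up for this sketch, and it
assumes (as in the syntax reference) that @code{<-} corresponds to the
@code{body} capture type.

@lisp
(use-modules (ice-9 peg))

;; String PEG: expanded to an s-expression PEG before compilation.
(define-peg-string-patterns
  "abStr <- 'a' 'b'")

;; The same pattern written directly as an s-expression PEG.
(define-peg-pattern abSexp body (and "a" "b"))

;; Both should stop right after the two characters they matched.
(peg:end (match-pattern abStr "abc"))
(peg:end (match-pattern abSexp "abc"))
@end lisp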
1018
1019 The PEG syntax can be extended by giving @code{compile-peg-pattern} more
1020 options for what to do with its forms. Each extension is a
1021 code-generating function associated with a symbol, for instance
1022 @code{my-parsing-form}, and is called on all PEG expressions of the form
1023 @lisp
1024 (my-parsing-form ...)
1025 @end lisp
1026
1027 The code-generating function should take two arguments. The first will be a
1028 syntax object containing a list with all of the arguments to the form
1029 (but not the form's name), and the second will be the
1030 @code{capture-type} argument that is passed to @code{define-peg-pattern}.
1031
1032 New functions can be registered by calling @code{(add-peg-compiler!
1033 symbol function)}, where @code{symbol} is the symbol that will indicate
1034 a form of this type and @code{function} is the code-generating function
1035 described above. The function @code{add-peg-compiler!} is exported from
1036 the @code{(ice-9 peg codegen)} module.