peg: remove error-val
[bpt/guile.git] / doc / ref / api-peg.texi
CommitLineData
eee0877c
AW
1@c -*-texinfo-*-
2@c This is part of the GNU Guile Reference Manual.
3@c Copyright (C) 2006, 2010
4@c Free Software Foundation, Inc.
5@c See the file guile.texi for copying conditions.
6
7@node PEG Parsing
8@section PEG Parsing
9
10Parsing Expression Grammars (PEGs) are a way of specifying formal languages for text processing. They can be used either for matching (like regular expressions) or for building recursive descent parsers (like lex/yacc). Guile uses a superset of PEG syntax that allows more control over what information is preserved during parsing.
11
12Wikipedia has a clear and concise introduction to PEGs if you want to familiarize yourself with the syntax: @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
13
14The module works by compiling PEGs down to lambda expressions. These can either be stored in variables at compile-time by the define macros (@code{define-nonterm} and @code{define-grammar}) or calculated explicitly at runtime with the compile functions (@code{peg-sexp-compile} and @code{peg-string-compile}).
15
16They can then be used for either parsing (@code{peg-parse}) or matching (@code{peg-match}). For convenience, @code{peg-match} also takes pattern literals in case you want to inline a simple search (people often use regular expressions this way).
17
18The rest of this documentation consists of a syntax reference, an API reference, and a tutorial.
19
20@menu
21* PEG Syntax Reference::
22* PEG API Reference::
23* PEG Tutorial::
24@end menu
25
26@node PEG Syntax Reference
27@subsection PEG Syntax Reference
28
29@subsubheading Normal PEG Syntax:
30
31Format: @*
32Name @*
33Description @*
34String Syntax @*
35S-expression Syntax @*
36
37Sequence @code{a} @code{b}: @*
38Parses @code{a}. If this succeeds, continues to parse @code{b} from the end of the text parsed as @code{a}. Succeeds if both @code{a} and @code{b} succeed. @*
39@code{"a b"} @*
40@code{(and a b)} @*
41
42Ordered choice @code{a} @code{b}: @*
43Parses @code{a}. If this fails, backtracks and parses @code{b}. Succeeds if either @code{a} or @code{b} succeeds. @*
44@code{"a/b"} @*
45@code{(or a b)} @*
46
47Zero or more @code{a}: @*
48Parses @code{a} as many times in a row as it can, starting each @code{a} at the end of the text parsed by the previous @code{a}. Always succeeds. @*
49@code{"a*"} @*
50@code{(body lit a *)} @*
51
52One or more @code{a}: @*
53Parses @code{a} as many times in a row as it can, starting each @code{a} at the end of the text parsed by the previous @code{a}. Succeeds if at least one @code{a} was parsed. @*
54@code{"a+"} @*
55@code{(body lit a +)} @*
56
57Optional @code{a}: @*
58Tries to parse @code{a}. Succeeds if @code{a} succeeds. @*
59@code{"a?"} @*
60@code{(body lit a ?)} @*
61
62And predicate @code{a}: @*
63Makes sure it is possible to parse @code{a}, but does not actually parse it. Succeeds if @code{a} would succeed. @*
64@code{"&a"} @*
65@code{(body & a 1)} @*
66
67Not predicate @code{a}: @*
68Makes sure it is impossible to parse @code{a}, but does not actually parse it. Succeeds if @code{a} would fail. @*
69@code{"!a"} @*
70@code{(body ! a 1)} @*
71
72String literal @code{"abc"}: @*
73Parses the string @code{"abc"}. Succeeds if that parsing succeeds. @*
74@code{"'abc'"} @*
75@code{"abc"} @*
76
77Any character: @*
78Parses any single character. Succeeds unless there is no more text to be parsed. @*
79@code{"."} @*
80@code{peg-any} @*
81
82Character class @code{a} @code{b}: @*
83Alternative syntax for ``Ordered Choice @code{a} @code{b}'' if @code{a} and @code{b} are characters. @*
84@code{"[ab]"} @*
85@code{(or "a" "b")} @*
86
87Range of characters @code{a} to @code{z}: @*
88Parses any character falling between @code{a} and @code{z}. @*
89@code{"[a-z]"} @*
90@code{(range #\a #\z)} @*
91
92Example: @*
93@code{"(a !b / c &d*) 'e'+"} @*
94Would be:
95@lisp
96(and
97 (or
98 (and a (body ! b 1))
99 (and c (body & d *)))
100 (body lit "e" +))
101@end lisp
102
103@subsubheading Extended Syntax:
104There is some extra syntax for S-expressions.
105
106Format: @*
107Description @*
108S-expression syntax @*
109
110Ignore the text matching @code{a}: @*
111@code{(ignore a)} @*
112
113Capture the text matching @code{a}: @*
114@code{(capture a)} @*
115
116Embed the PEG pattern @code{a} using string syntax: @*
117@code{(peg a)} @*
118
119Example: @*
120@code{"!a / 'b'"} @*
121Would be:
122@lisp
123(or (peg "!a") "b")
124@end lisp
125
126@node PEG API Reference
127@subsection PEG API Reference
128
129@subsubheading Define Macros
130
131The most straightforward way to define a PEG is by using one of the define macros (both of these macroexpand into @code{define} expressions). These macros bind parsing functions to variables. These parsing functions may be invoked by @code{peg-parse} or @code{peg-match}, which return a PEG match record. Raw data can be retrieved from this record with the PEG match deconstructor functions. More complicated (and perhaps enlightening) examples can be found in the tutorial.
132
133@deffn {Scheme Macro} define-grammar peg-string
134Defines all the nonterminals in the PEG @var{peg-string}. More precisely, @code{define-grammar} takes a superset of PEGs. A normal PEG has a @code{<-} between the nonterminal and the pattern. @code{define-grammar} uses this symbol to determine what information it should propagate up the parse tree. The normal @code{<-} propagates the matched text up the parse tree, @code{<--} propagates the matched text up the parse tree tagged with the name of the nonterminal, and @code{<} discards that matched text and propagates nothing up the parse tree. Also, nonterminals may consist of any alphanumeric character or a ``-'' character (in normal PEGs nonterminals can only be alphabetic).
135
136For example, if we:
137@lisp
138(define-grammar
139 "as <- 'a'+
140bs <- 'b'+
141as-or-bs <- as/bs")
142(define-grammar
143 "as-tag <-- 'a'+
144bs-tag <-- 'b'+
145as-or-bs-tag <-- as-tag/bs-tag")
146@end lisp
147Then:
148@lisp
149(peg-parse as-or-bs "aabbcc") @result{}
150#<peg start: 0 end: 2 string: aabbcc tree: aa>
151(peg-parse as-or-bs-tag "aabbcc") @result{}
152#<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
153@end lisp
154
155Note that in doing this, we have bound 6 variables at the toplevel (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and @var{as-or-bs-tag}).
156@end deffn
157
158@deffn {Scheme Macro} define-nonterm name capture-type peg-sexp
159Defines a single nonterminal @var{name}. @var{capture-type} determines how much information is passed up the parse tree. @var{peg-sexp} is a PEG in S-expression form.
160
161Possible values for capture-type: @*
162@code{all}: passes the matched text up the parse tree tagged with the name of the nonterminal. @*
163@code{body}: passes the matched text up the parse tree. @*
164@code{none}: passes nothing up the parse tree.
165
166For Example, if we:
167@lisp
168(define-nonterm as body (body lit "a" +))
169(define-nonterm bs body (body lit "b" +))
170(define-nonterm as-or-bs body (or as bs))
171(define-nonterm as-tag all (body lit "a" +))
172(define-nonterm bs-tag all (body lit "b" +))
173(define-nonterm as-or-bs-tag all (or as-tag bs-tag))
174@end lisp
175Then:
176@lisp
177(peg-parse as-or-bs "aabbcc") @result{}
178#<peg start: 0 end: 2 string: aabbcc tree: aa>
179(peg-parse as-or-bs-tag "aabbcc") @result{}
180#<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
181@end lisp
182
183Note that in doing this, we have bound 6 variables at the toplevel (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and @var{as-or-bs-tag}).
184@end deffn
185
186These are macros, with all that entails. If you've built up a list at runtime and want to define a new PEG from it, you should e.g.:
187@lisp
188(define exp '(body lit "a" +))
189(eval `(define-nonterm as body ,exp) (interaction-environment))
190@end lisp
191The @code{eval} function has a bad reputation with regard to efficiency, but this is mostly because of the extra work that has to be done compiling the expressions, which has to be done anyway when compiling the PEGs at runtime.
192
193@subsubheading
194
195@subsubheading Compile Functions
196It is sometimes useful to be able to compile anonymous PEG patterns at runtime. These functions let you do that using either syntax.
197
198@deffn {Scheme Procedure} peg-string-compile peg-string capture-type
199Compiles the PEG pattern in @var{peg-string} propagating according to @var{capture-type} (capture-type can be any of the values from @code{define-nonterm}).
200@end deffn
201
202
203@deffn {Scheme Procedure} peg-sexp-compile peg-sexp capture-type
204Compiles the PEG pattern in @var{peg-sexp} propagating according to @var{capture-type} (capture-type can be any of the values from @code{define-nonterm}).
205@end deffn
206
207
208@subsubheading Parsing & Matching Functions
209
210For our purposes, ``parsing'' means parsing a string into a tree starting from the first character, while ``matching'' means searching through the string for a substring. In practice, the only difference between the two functions is that @code{peg-parse} gives up if it can't find a valid substring starting at index 0 and @code{peg-match} keeps looking. They are both equally capable of ``parsing'' and ``matching'' given those constraints.
211
212@deffn {Scheme Procedure} peg-parse nonterm string
213Parses @var{string} using the PEG stored in @var{nonterm}. If no match was found, @code{peg-parse} returns false. If a match was found, a PEG match record is returned.
214
215The @code{capture-type} argument to @code{define-nonterm} allows you to choose what information to hold on to while parsing. The options are: @*
216@code{all}: tag the matched text with the nonterminal @*
217@code{body}: just the matched text @*
218@code{none}: nothing @*
219
220@lisp
221(define-nonterm as all (body lit "a" +))
222(peg-parse as "aabbcc") @result{}
223#<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
224
225(define-nonterm as body (body lit "a" +))
226(peg-parse as "aabbcc") @result{}
227#<peg start: 0 end: 2 string: aabbcc tree: aa>
228
229(define-nonterm as none (body lit "a" +))
230(peg-parse as "aabbcc") @result{}
231#<peg start: 0 end: 2 string: aabbcc tree: ()>
232
233(define-nonterm bs body (body lit "b" +))
234(peg-parse bs "aabbcc") @result{}
235#f
236@end lisp
237@end deffn
238
239@deffn {Scheme Macro} peg-match nonterm-or-peg string
240Searches through @var{string} looking for a matching subexpression. @var{nonterm-or-peg} can either be a nonterminal or a literal PEG pattern. When a literal PEG pattern is provided, @code{peg-match} works very similarly to the regular expression searches many hackers are used to. If no match was found, @code{peg-match} returns false. If a match was found, a PEG match record is returned.
241
242@lisp
243(define-nonterm as body (body lit "a" +))
244(peg-match as "aabbcc") @result{}
245#<peg start: 0 end: 2 string: aabbcc tree: aa>
246(peg-match (body lit "a" +) "aabbcc") @result{}
247#<peg start: 0 end: 2 string: aabbcc tree: aa>
248(peg-match "'a'+" "aabbcc") @result{}
249#<peg start: 0 end: 2 string: aabbcc tree: aa>
250
251(define-nonterm as all (body lit "a" +))
252(peg-match as "aabbcc") @result{}
253#<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
254
255(define-nonterm bs body (body lit "b" +))
256(peg-match bs "aabbcc") @result{}
257#<peg start: 2 end: 4 string: aabbcc tree: bb>
258(peg-match (body lit "b" +) "aabbcc") @result{}
259#<peg start: 2 end: 4 string: aabbcc tree: bb>
260(peg-match "'b'+" "aabbcc") @result{}
261#<peg start: 2 end: 4 string: aabbcc tree: bb>
262
263(define-nonterm zs body (body lit "z" +))
264(peg-match zs "aabbcc") @result{}
265#f
266(peg-match (body lit "z" +) "aabbcc") @result{}
267#f
268(peg-match "'z'+" "aabbcc") @result{}
269#f
270@end lisp
271@end deffn
272
273@subsubheading PEG Match Records
274The @code{peg-parse} and @code{peg-match} functions both return PEG match records. Actual information can be extracted from these with the following functions.
275
276@deffn {Scheme Procedure} peg:string peg-match
277Returns the original string that was parsed in the creation of @code{peg-match}.
278@end deffn
279
280@deffn {Scheme Procedure} peg:start peg-match
281Returns the index of the first parsed character in the original string (from @code{peg:string}). If this is the same as @code{peg:end}, nothing was parsed.
282@end deffn
283
284@deffn {Scheme Procedure} peg:end peg-match
285Returns one more than the index of the last parsed character in the original string (from @code{peg:string}). If this is the same as @code{peg:start}, nothing was parsed.
286@end deffn
287
288@deffn {Scheme Procedure} peg:substring peg-match
289Returns the substring parsed by @code{peg-match}. This is equivalent to @code{(substring (peg:string peg-match) (peg:start peg-match) (peg:end peg-match))}.
290@end deffn
291
292@deffn {Scheme Procedure} peg:tree peg-match
293Returns the tree parsed by @code{peg-match}.
294@end deffn
295
296@deffn {Scheme Procedure} peg-record? peg-match
297Returns true if @code{peg-match} is a PEG match record, or false otherwise.
298@end deffn
299
300Example:
301@lisp
302(define-nonterm bs all (peg "'b'+"))
303
304(peg-match bs "aabbcc") @result{}
305#<peg start: 2 end: 4 string: aabbcc tree: (bs bb)>
306
307(let ((pm (peg-match bs "aabbcc")))
308 `((string ,(peg:string pm))
309 (start ,(peg:start pm))
310 (end ,(peg:end pm))
311 (substring ,(peg:substring pm))
312 (tree ,(peg:tree pm))
313 (record? ,(peg-record? pm)))) @result{}
314((string "aabbcc")
315 (start 2)
316 (end 4)
317 (substring "bb")
318 (tree (bs "bb"))
319 (record? #t))
320@end lisp
321
322@subsubheading Miscellaneous
323
324@deffn {Scheme Procedure} context-flatten tst lst
325Takes a predicate @var{tst} and a list @var{lst}. Flattens @var{lst} until all elements are either atoms or satisfy @var{tst}. If @var{lst} itself satisfies @var{tst}, @code{(list lst)} is returned (this is a flat list whose only element satisfies @var{tst}).
326
327@lisp
328(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
329(2 2 (1 1 (2 2)) 2 2 (1 1))
330(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
331((1 1 (1 1 (2 2)) (2 2 (1 1))))
332@end lisp
333
334If you're wondering why this is here, take a look at the tutorial.
335@end deffn
336
337@deffn {Scheme Procedure} keyword-flatten terms lst
338A less general form of @code{context-flatten}. Takes a list of terminal atoms @code{terms} and flattens @var{lst} until all elements are either atoms, or lists which have an atom from @code{terms} as their first element.
339@lisp
340(keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{}
341(c a b (a c) (b c) c (b a) c a)
342@end lisp
343
344If you're wondering why this is here, take a look at the tutorial.
345@end deffn
346
347@node PEG Tutorial
348@subsection PEG Tutorial
349
350@subsubheading Parsing /etc/passwd
351This example will show how to parse /etc/passwd using PEGs.
352
353First we define an example /etc/passwd file:
354
355@lisp
356(define *etc-passwd*
357 "root:x:0:0:root:/root:/bin/bash
358daemon:x:1:1:daemon:/usr/sbin:/bin/sh
359bin:x:2:2:bin:/bin:/bin/sh
360sys:x:3:3:sys:/dev:/bin/sh
361nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
362messagebus:x:103:107::/var/run/dbus:/bin/false
363")
364@end lisp
365
366As a first pass at this, we might want to have all the entries in /etc/passwd in a list.
367
368Doing this with string-based PEG syntax would look like this:
369@lisp
370(define-grammar
371 "passwd <- entry* !.
372entry <-- (! NL .)* NL*
373NL < '\n'")
374@end lisp
375A @code{passwd} file is 0 or more entries (@code{entry*}) until the end of the file (@code{!.} (@code{.} is any character, so @code{!.} means ``not anything'')). We want to capture the data in the nonterminal @code{passwd}, but not tag it with the name, so we use @code{<-}.
376An entry is a series of 0 or more characters that aren't newlines (@code{(! NL .)*}) followed by 0 or more newlines (@code{NL*}). We want to tag all the entries with @code{entry}, so we use @code{<--}.
377A newline is just a literal newline (@code{'\n'}). We don't want a bunch of newlines cluttering up the output, so we use @code{<} to throw away the captured data.
378
379Here is the same PEG defined using S-expressions:
380@lisp
381(define-nonterm passwd body (and (body lit entry *) (body ! peg-any 1)))
382(define-nonterm entry all (and (body lit (and (body ! NL 1) peg-any) *)
383 (body lit NL *)))
384(define-nonterm NL none "\n")
385@end lisp
386
387Obviously this is much more verbose. On the other hand, it's more explicit, and thus easier to build automatically. However, there are some tricks that make S-expressions easier to use in some cases. One is the @code{ignore} keyword; the string syntax has no way to say ``throw away this text'' except breaking it out into a separate nonterminal. For instance, to throw away the newlines we had to define @code{NL}. In the S-expression syntax, we could have simply written @code{(ignore "\n")}. Also, for the cases where string syntax is really much cleaner, the @code{peg} keyword can be used to embed string syntax in S-expression syntax. For instance, we could have written:
388@lisp
389(define-nonterm passwd body (peg "entry* !."))
390@end lisp
391
392However we define it, parsing @code{*etc-passwd*} with the @code{passwd} nonterminal yields the same results:
393@lisp
394(peg:tree (peg-parse passwd *etc-passwd*)) @result{}
395((entry "root:x:0:0:root:/root:/bin/bash")
396 (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
397 (entry "bin:x:2:2:bin:/bin:/bin/sh")
398 (entry "sys:x:3:3:sys:/dev:/bin/sh")
399 (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
400 (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
401@end lisp
402
403However, here is something to be wary of:
404@lisp
405(peg:tree (peg-parse passwd "one entry")) @result{}
406(entry "one entry")
407@end lisp
408
409By default, the parse trees generated by PEGs are compressed as much as possible without losing information. It may not look like this is what you want at first, but uncompressed parse trees are an enormous headache (there's no easy way to predict how deep particular lists will nest, there are empty lists littered everywhere, etc. etc.). One side-effect of this, however, is that sometimes the compressor is too aggressive. No information is discarded when @code{((entry "one entry"))} is compressed to @code{(entry "one entry")}, but in this particular case it probably isn't what we want. @*
410
411There are two functions for easily dealing with this: @code{keyword-flatten} and @code{context-flatten}. The @code{keyword-flatten} function takes a list of keywords and a list to flatten, then tries to coerce the list such that the first element of all sublists is one of the keywords. The @code{context-flatten} function is similar, but instead of a list of keywords it takes a predicate that should indicate whether a given sublist is good enough (refer to the API reference for more details). @*
412
413What we want here is @code{keyword-flatten}.
414@lisp
415(keyword-flatten '(entry) (peg:tree (peg-parse passwd *etc-passwd*))) @result{}
416((entry "root:x:0:0:root:/root:/bin/bash")
417 (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
418 (entry "bin:x:2:2:bin:/bin:/bin/sh")
419 (entry "sys:x:3:3:sys:/dev:/bin/sh")
420 (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
421 (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
422(keyword-flatten '(entry) (peg:tree (peg-parse passwd "one entry"))) @result{}
423((entry "one entry"))
424@end lisp
425
426Of course, this is a somewhat contrived example. In practice we would probably just tag the @code{passwd} nonterminal to remove the ambiguity (using either the @code{all} keyword for S-expressions or the @code{<--} symbol for strings)..
427
428@lisp
429(define-nonterm tag-passwd all (peg "entry* !."))
430(peg:tree (peg-parse tag-passwd *etc-passwd*)) @result{}
431(tag-passwd
432 (entry "root:x:0:0:root:/root:/bin/bash")
433 (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
434 (entry "bin:x:2:2:bin:/bin:/bin/sh")
435 (entry "sys:x:3:3:sys:/dev:/bin/sh")
436 (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
437 (entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
438(peg:tree (peg-parse tag-passwd "one entry"))
439(tag-passwd
440 (entry "one entry"))
441@end lisp
442
443If you're ever uncertain about the potential results of parsing something, remember the two absolute rules: @*
4441. No parsing information will ever be discarded. @*
4452. There will never be any lists with fewer than 2 elements. @*
446
447For the purposes of (1), "parsing information" means things tagged with the @code{any} keyword or the @code{<--} symbol. Plain strings will be concatenated. @*
448
449Let's extend this example a bit more and actually pull some useful information out of the passwd file:
450@lisp
451(define-grammar
452 "passwd <-- entry* !.
453entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL*
454login <-- text
455pass <-- text
456uid <-- [0-9]*
457gid <-- [0-9]*
458nameORcomment <-- text
459homedir <-- path
460shell <-- path
461path <-- (SLASH pathELEMENT)*
462pathELEMENT <-- (!NL !C !'/' .)*
463text <- (!NL !C .)*
464C < ':'
465NL < '\n'
466SLASH < '/'")
467@end lisp
468
469This produces rather pretty parse trees:
470@lisp
471(passwd
472 (entry (login "root")
473 (pass "x")
474 (uid "0")
475 (gid "0")
476 (nameORcomment "root")
477 (homedir (path (pathELEMENT "root")))
478 (shell (path (pathELEMENT "bin") (pathELEMENT "bash"))))
479 (entry (login "daemon")
480 (pass "x")
481 (uid "1")
482 (gid "1")
483 (nameORcomment "daemon")
484 (homedir
485 (path (pathELEMENT "usr") (pathELEMENT "sbin")))
486 (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
487 (entry (login "bin")
488 (pass "x")
489 (uid "2")
490 (gid "2")
491 (nameORcomment "bin")
492 (homedir (path (pathELEMENT "bin")))
493 (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
494 (entry (login "sys")
495 (pass "x")
496 (uid "3")
497 (gid "3")
498 (nameORcomment "sys")
499 (homedir (path (pathELEMENT "dev")))
500 (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
501 (entry (login "nobody")
502 (pass "x")
503 (uid "65534")
504 (gid "65534")
505 (nameORcomment "nobody")
506 (homedir (path (pathELEMENT "nonexistent")))
507 (shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
508 (entry (login "messagebus")
509 (pass "x")
510 (uid "103")
511 (gid "107")
512 nameORcomment
513 (homedir
514 (path (pathELEMENT "var")
515 (pathELEMENT "run")
516 (pathELEMENT "dbus")))
517 (shell (path (pathELEMENT "bin") (pathELEMENT "false")))))
518@end lisp
519
520Notice that when there's no entry in a field (e.g. @code{nameORcomment} for messagebus) the symbol is inserted. This is the ``don't throw away any information'' rule---we succesfully matched a @code{nameORcomment} of 0 characters (since we used @code{*} when defining it). This is usually what you want, because it allows you to e.g. use @code{list-ref} to pull out elements (since they all have known offsets). @*
521
522If you'd prefer not to have symbols for empty matches, you can replace the @code{*} with a @code{+} and add a @code{?} after the @code{nameORcomment} in @code{entry}. Then it will try to parse 1 or more characters, fail (inserting nothing into the parse tree), but continue because it didn't have to match the nameORcomment to continue.
523
524
525@subsubheading Embedding Arithmetic Expressions
526
527We can parse simple mathematical expressions with the following PEG:
528
529@lisp
530(define-grammar
531 "expr <- sum
532sum <-- (product ('+' / '-') sum) / product
533product <-- (value ('*' / '/') product) / value
534value <-- number / '(' expr ')'
535number <-- [0-9]+")
536@end lisp
537
538Then:
539@lisp
540(peg:tree (peg-parse expr "1+1/2*3+(1+1)/2")) @result{}
541(sum (product (value (number "1")))
542 "+"
543 (sum (product
544 (value (number "1"))
545 "/"
546 (product
547 (value (number "2"))
548 "*"
549 (product (value (number "3")))))
550 "+"
551 (sum (product
552 (value "("
553 (sum (product (value (number "1")))
554 "+"
555 (sum (product (value (number "1")))))
556 ")")
557 "/"
558 (product (value (number "2")))))))
559@end lisp
560
561There is very little wasted effort in this PEG. The @code{number} nonterminal has to be tagged because otherwise the numbers might run together with the arithmetic expressions during the string concatenation stage of parse-tree compression (the parser will see ``1'' followed by ``/'' and decide to call it ``1/''). When in doubt, tag.
562
563It is very easy to turn these parse trees into lisp expressions:
564@lisp
565(define (parse-sum sum left . rest)
566 (if (null? rest)
567 (apply parse-product left)
568 (list (string->symbol (car rest))
569 (apply parse-product left)
570 (apply parse-sum (cadr rest)))))
571
572(define (parse-product product left . rest)
573 (if (null? rest)
574 (apply parse-value left)
575 (list (string->symbol (car rest))
576 (apply parse-value left)
577 (apply parse-product (cadr rest)))))
578
579(define (parse-value value first . rest)
580 (if (null? rest)
581 (string->number (cadr first))
582 (apply parse-sum (car rest))))
583
584(define parse-expr parse-sum)
585@end lisp
586(Notice all these functions look very similar; for a more complicated PEG, it would be worth abstracting.)
587
588Then:
589@lisp
590(apply parse-expr (peg:tree (peg-parse expr "1+1/2*3+(1+1)/2"))) @result{}
591(+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2)))
592@end lisp
593
594But wait! The associativity is wrong! Where it says @code{(/ 1 (* 2 3))}, it should say @code{(* (/ 1 2) 3)}.
595
596It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-') sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) / product"}, but this is a Bad Idea. PEGs don't support left recursion. To see why, imagine what the parser will do here. When it tries to parse @code{sum}, it first has to try and parse @code{sum}. But to do that, it first has to try and parse @code{sum}. This will continue until the stack gets blown off.
597
598So how does one parse left-associative binary operators with PEGs? Honestly, this is one of their major shortcomings. There's no general-purpose way of doing this, but here the repetition operators are a good choice:
599
600@lisp
601(use-modules (srfi srfi-1))
602
603(define-grammar
604 "expr <- sum
605sum <-- (product ('+' / '-'))* product
606product <-- (value ('*' / '/'))* value
607value <-- number / '(' expr ')'
608number <-- [0-9]+")
609
610;; take a deep breath...
611(define (make-left-parser next-func)
612 (lambda (sum first . rest) ;; general form, comments below assume
613 ;; that we're dealing with a sum expression
614 (if (null? rest) ;; form (sum (product ...))
615 (apply next-func first)
616 (if (string? (cadr first));; form (sum ((product ...) "+") (product ...))
617 (list (string->symbol (cadr first))
618 (apply next-func (car first))
619 (apply next-func (car rest)))
620 ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
621 (car
622 (reduce ;; walk through the list and build a left-associative tree
623 (lambda (l r)
624 (list (list (cadr r) (car r) (apply next-func (car l)))
625 (string->symbol (cadr l))))
626 'ignore
627 (append ;; make a list of all the products
628 ;; the first one should be pre-parsed
629 (list (list (apply next-func (caar first))
630 (string->symbol (cadar first))))
631 (cdr first)
632 ;; the last one has to be added in
633 (list (append rest '("done"))))))))))
634
635(define (parse-value value first . rest)
636 (if (null? rest)
637 (string->number (cadr first))
638 (apply parse-sum (car rest))))
639(define parse-product (make-left-parser parse-value))
640(define parse-sum (make-left-parser parse-product))
641(define parse-expr parse-sum)
642@end lisp
643
644Then:
645@lisp
646(apply parse-expr (peg:tree (peg-parse expr "1+1/2*3+(1+1)/2"))) @result{}
647(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))
648@end lisp
649
650As you can see, this is much uglier (it could be made prettier by using @code{context-flatten}, but the way it's written above makes it clear how we deal with the three ways the zero-or-more @code{*} expression can parse). Fortunately, most of the time we can get away with only using right-associativity.
651
652@subsubheading Simplified Functions
653
654For a more tantalizing example, consider the following grammar that parses (highly) simplified C functions:
655@lisp
656(define-grammar
657 "cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB
658ctype <-- cidentifier
659cname <-- cidentifier
660cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP
661carg <-- cSP ctype cSP cname
662cbody <-- cstatement *
663cidentifier <- [a-zA-z][a-zA-Z0-9_]*
664cstatement <-- (!';'.)*cSC cSP
665cSC < ';'
666cCOMMA < ','
667cLP < '('
668cRP < ')'
669cLB < '@{'
670cRB < '@}'
671cSP < [ \t\n]*")
672@end lisp
673
674Then:
675@lisp
676(peg-parse cfunc "int square(int a) @{ return a*a;@}") @result{}
677(32
678 (cfunc (ctype "int")
679 (cname "square")
680 (cargs (carg (ctype "int") (cname "a")))
681 (cbody (cstatement "return a*a"))))
682@end lisp
683
684And:
685@lisp
686(peg-parse cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{}
687(52
688 (cfunc (ctype "int")
689 (cname "mod")
690 (cargs (carg (ctype "int") (cname "a"))
691 (carg (ctype "int") (cname "b")))
692 (cbody (cstatement "int c = a/b")
693 (cstatement "return a- b*c"))))
694@end lisp
695
696By wrapping all the @code{carg} nonterminals in a @code{cargs} nonterminal, we were able to remove any ambiguity in the parsing structure and avoid having to call @code{context-flatten} on the output of @code{peg-parse}. We used the same trick with the @code{cstatement} nonterminals, wrapping them in a @code{cbody} nonterminal.
697
698The whitespace nonterminal @code{cSP} used here is a (very) useful instantiation of a common pattern for matching syntactically irrelevant information. Since it's tagged with @code{<} and ends with @code{*} it won't clutter up the parse trees (all the empty lists will be discarded during the compression step) and it will never cause parsing to fail.
699