Commit | Line | Data |
---|---|---|
eee0877c AW |
1 | @c -*-texinfo-*- |
2 | @c This is part of the GNU Guile Reference Manual. | |
718bc349 | 3 | @c Copyright (C) 2006, 2010, 2011 |
eee0877c AW |
4 | @c Free Software Foundation, Inc. |
5 | @c See the file guile.texi for copying conditions. | |
6 | ||
7 | @node PEG Parsing | |
8 | @section PEG Parsing | |
9 | ||
718bc349 AW |
10 | Parsing Expression Grammars (PEGs) are a way of specifying formal |
11 | languages for text processing. They can be used either for matching | |
12 | (like regular expressions) or for building recursive descent parsers | |
13 | (like lex/yacc). Guile uses a superset of PEG syntax that allows more | |
14 | control over what information is preserved during parsing. | |
eee0877c | 15 | |
718bc349 AW |
16 | Wikipedia has a clear and concise introduction to PEGs if you want to |
17 | familiarize yourself with the syntax: | |
18 | @url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}. | |
eee0877c | 19 | |
718bc349 AW |
20 | The module works by compiling PEGs down to lambda expressions. These |
21 | can either be stored in variables at compile-time by the define macros | |
3ebd5786 | 22 | (@code{define-peg-pattern} and @code{define-peg-string-patterns}) or calculated |
718bc349 | 23 | explicitly at runtime with the compile functions |
fee87b82 | 24 | (@code{compile-peg-pattern} and @code{peg-string-compile}). |
eee0877c | 25 | |
8022f502 | 26 | They can then be used for either parsing (@code{match-pattern}) or searching |
d7e2f5e3 NL |
27 | (@code{search-for-pattern}). For convenience, @code{search-for-pattern} |
28 | also takes pattern literals in case you want to inline a simple search | |
29 | (people often use regular expressions this way). | |
eee0877c | 30 | |
718bc349 AW |
31 | The rest of this documentation consists of a syntax reference, an API |
32 | reference, and a tutorial. | |
eee0877c AW |
33 | |
34 | @menu | |
35 | * PEG Syntax Reference:: | |
36 | * PEG API Reference:: | |
37 | * PEG Tutorial:: | |
97c84694 | 38 | * PEG Internals:: |
eee0877c AW |
39 | @end menu |
40 | ||
41 | @node PEG Syntax Reference | |
42 | @subsection PEG Syntax Reference | |
43 | ||
44 | @subsubheading Normal PEG Syntax: | |
45 | ||
718bc349 AW |
46 | @deftp {PEG Pattern} sequence a b |
47 | Parses @var{a}. If this succeeds, continues to parse @var{b} from the | |
48 | end of the text parsed as @var{a}. Succeeds if both @var{a} and | |
49 | @var{b} succeed. | |
50 | ||
51 | @code{"a b"} | |
52 | ||
53 | @code{(and a b)} | |
54 | @end deftp | |
55 | ||
56 | @deftp {PEG Pattern} {ordered choice} a b | |
57 | Parses @var{a}. If this fails, backtracks and parses @var{b}. | |
58 | Succeeds if either @var{a} or @var{b} succeeds. | |
59 | ||
60 | @code{"a/b"} | |
61 | ||
62 | @code{(or a b)} | |
63 | @end deftp | |
64 | ||
65 | @deftp {PEG Pattern} {zero or more} a | |
66 | Parses @var{a} as many times in a row as it can, starting each @var{a} | |
67 | at the end of the text parsed by the previous @var{a}. Always | |
68 | succeeds. | |
69 | ||
70 | @code{"a*"} | |
71 | ||
f310a111 | 72 | @code{(* a)} |
718bc349 AW |
73 | @end deftp |
74 | ||
75 | @deftp {PEG Pattern} {one or more} a | |
76 | Parses @var{a} as many times in a row as it can, starting each @var{a} | |
77 | at the end of the text parsed by the previous @var{a}. Succeeds if at | |
78 | least one @var{a} was parsed. | |
79 | ||
80 | @code{"a+"} | |
81 | ||
3d19969d | 82 | @code{(+ a)} |
718bc349 AW |
83 | @end deftp |
84 | ||
85 | @deftp {PEG Pattern} optional a | |
86 | Tries to parse @var{a}. Succeeds if @var{a} succeeds. | |
87 | ||
88 | @code{"a?"} | |
89 | ||
8e97edd5 | 90 | @code{(? a)} |
718bc349 AW |
91 | @end deftp |
92 | ||
66ba3de0 | 93 | @deftp {PEG Pattern} {followed by} a |
718bc349 AW |
94 | Makes sure it is possible to parse @var{a}, but does not actually parse |
95 | it. Succeeds if @var{a} would succeed. | |
96 | ||
97 | @code{"&a"} | |
98 | ||
66ba3de0 | 99 | @code{(followed-by a)} |
718bc349 AW |
100 | @end deftp |
101 | ||
72287411 | 102 | @deftp {PEG Pattern} {not followed by} a |
718bc349 AW |
103 | Makes sure it is impossible to parse @var{a}, but does not actually |
104 | parse it. Succeeds if @var{a} would fail. | |
105 | ||
106 | @code{"!a"} | |
107 | ||
72287411 | 108 | @code{(not-followed-by a)} |
718bc349 AW |
109 | @end deftp |
110 | ||
111 | @deftp {PEG Pattern} {string literal} ``abc'' | |
112 | Parses the string @var{"abc"}. Succeeds if that parsing succeeds. | |
113 | ||
114 | @code{"'abc'"} | |
115 | ||
116 | @code{"abc"} | |
117 | @end deftp | |
118 | ||
119 | @deftp {PEG Pattern} {any character} | |
120 | Parses any single character. Succeeds unless there is no more text to | |
121 | be parsed. | |
122 | ||
123 | @code{"."} | |
124 | ||
125 | @code{peg-any} | |
126 | @end deftp | |
127 | ||
128 | @deftp {PEG Pattern} {character class} a b | |
129 | Alternative syntax for ``Ordered Choice @var{a} @var{b}'' if @var{a} and | |
130 | @var{b} are characters. | |
131 | ||
132 | @code{"[ab]"} | |
133 | ||
134 | @code{(or "a" "b")} | |
135 | @end deftp | |
136 | ||
137 | @deftp {PEG Pattern} {range of characters} a z | |
138 | Parses any character falling between @var{a} and @var{z}. | |
139 | ||
140 | @code{"[a-z]"} | |
141 | ||
142 | @code{(range #\a #\z)} | |
143 | @end deftp | |
144 | ||
145 | Example: | |
146 | ||
147 | @example | |
148 | "(a !b / c &d*) 'e'+" | |
149 | @end example | |
150 | ||
eee0877c | 151 | Would be: |
718bc349 | 152 | |
eee0877c AW |
153 | @lisp |
154 | (and | |
155 | (or | |
e0d14025 NL |
156 | (and a (not-followed-by b)) |
157 | (and c (followed-by (* d)))) | |
158 | (+ "e")) | |
eee0877c AW |
159 | @end lisp |
160 | ||
718bc349 AW |
161 | @subsubheading Extended Syntax |
162 | ||
eee0877c AW |
163 | There is some extra syntax for S-expressions. |
164 | ||
718bc349 AW |
165 | @deftp {PEG Pattern} ignore a |
166 | Ignore the text matching @var{a} | |
167 | @end deftp | |
eee0877c | 168 | |
718bc349 AW |
169 | @deftp {PEG Pattern} capture a |
170 | Capture the text matching @var{a}. | |
171 | @end deftp | |
eee0877c | 172 | |
718bc349 AW |
173 | @deftp {PEG Pattern} peg a |
174 | Embed the PEG pattern @var{a} using string syntax. | |
175 | @end deftp | |
176 | ||
177 | Example: | |
eee0877c | 178 | |
718bc349 AW |
179 | @example |
180 | "!a / 'b'" | |
181 | @end example | |
eee0877c | 182 | |
e0d14025 | 183 | Is equivalent to |
718bc349 | 184 | |
eee0877c AW |
185 | @lisp |
186 | (or (peg "!a") "b") | |
187 | @end lisp | |
188 | ||
e0d14025 NL |
189 | and |
190 | ||
191 | @lisp | |
192 | (or (not-followed-by a) "b") | |
193 | @end lisp | |
194 | ||
eee0877c AW |
195 | @node PEG API Reference |
196 | @subsection PEG API Reference | |
197 | ||
198 | @subsubheading Define Macros | |
199 | ||
718bc349 AW |
200 | The most straightforward way to define a PEG is by using one of the |
201 | define macros (both of these macroexpand into @code{define} | |
202 | expressions). These macros bind parsing functions to variables. These | |
8022f502 | 203 | parsing functions may be invoked by @code{match-pattern} or |
d7e2f5e3 | 204 | @code{search-for-pattern}, which return a PEG match record. Raw data can be |
718bc349 AW |
205 | retrieved from this record with the PEG match deconstructor functions. |
206 | More complicated (and perhaps enlightening) examples can be found in the | |
207 | tutorial. | |
eee0877c | 208 | |
3ebd5786 | 209 | @deffn {Scheme Macro} define-peg-string-patterns peg-string |
718bc349 | 210 | Defines all the nonterminals in the PEG @var{peg-string}. More |
3ebd5786 | 211 | precisely, @code{define-peg-string-patterns} takes a superset of PEGs. A normal PEG |
718bc349 | 212 | has a @code{<-} between the nonterminal and the pattern. |
3ebd5786 | 213 | @code{define-peg-string-patterns} uses this symbol to determine what information it |
718bc349 AW |
214 | should propagate up the parse tree. The normal @code{<-} propagates the |
215 | matched text up the parse tree, @code{<--} propagates the matched text | |
216 | up the parse tree tagged with the name of the nonterminal, and @code{<} | |
217 | discards that matched text and propagates nothing up the parse tree. | |
218 | Also, nonterminals may consist of any alphanumeric character or a ``-'' | |
219 | character (in normal PEGs nonterminals can only be alphabetic). | |
eee0877c AW |
220 | |
221 | For example, if we: | |
222 | @lisp | |
3ebd5786 | 223 | (define-peg-string-patterns |
eee0877c AW |
224 | "as <- 'a'+ |
225 | bs <- 'b'+ | |
226 | as-or-bs <- as/bs") | |
3ebd5786 | 227 | (define-peg-string-patterns |
eee0877c AW |
228 | "as-tag <-- 'a'+ |
229 | bs-tag <-- 'b'+ | |
230 | as-or-bs-tag <-- as-tag/bs-tag") | |
231 | @end lisp | |
232 | Then: | |
233 | @lisp | |
8022f502 | 234 | (match-pattern as-or-bs "aabbcc") @result{} |
eee0877c | 235 | #<peg start: 0 end: 2 string: aabbcc tree: aa> |
8022f502 | 236 | (match-pattern as-or-bs-tag "aabbcc") @result{} |
eee0877c AW |
237 | #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))> |
238 | @end lisp | |
239 | ||
718bc349 AW |
240 | Note that in doing this, we have bound 6 variables at the toplevel |
241 | (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and | |
242 | @var{as-or-bs-tag}). | |
eee0877c AW |
243 | @end deffn |
244 | ||
40ebbd64 | 245 | @deffn {Scheme Macro} define-peg-pattern name capture-type peg-sexp |
718bc349 AW |
246 | Defines a single nonterminal @var{name}. @var{capture-type} determines |
247 | how much information is passed up the parse tree. @var{peg-sexp} is a | |
248 | PEG in S-expression form. | |
249 | ||
250 | Possible values for capture-type: | |
251 | ||
252 | @table @code | |
253 | @item all | |
254 | passes the matched text up the parse tree tagged with the name of the | |
255 | nonterminal. | |
256 | @item body | |
257 | passes the matched text up the parse tree. | |
258 | @item none | |
259 | passes nothing up the parse tree. | |
260 | @end table | |
eee0877c AW |
261 | |
262 | For Example, if we: | |
263 | @lisp | |
40ebbd64 NL |
264 | (define-peg-pattern as body (+ "a")) |
265 | (define-peg-pattern bs body (+ "b")) | |
266 | (define-peg-pattern as-or-bs body (or as bs)) | |
267 | (define-peg-pattern as-tag all (+ "a")) | |
268 | (define-peg-pattern bs-tag all (+ "b")) | |
269 | (define-peg-pattern as-or-bs-tag all (or as-tag bs-tag)) | |
eee0877c AW |
270 | @end lisp |
271 | Then: | |
272 | @lisp | |
8022f502 | 273 | (match-pattern as-or-bs "aabbcc") @result{} |
eee0877c | 274 | #<peg start: 0 end: 2 string: aabbcc tree: aa> |
8022f502 | 275 | (match-pattern as-or-bs-tag "aabbcc") @result{} |
eee0877c AW |
276 | #<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))> |
277 | @end lisp | |
278 | ||
718bc349 AW |
279 | Note that in doing this, we have bound 6 variables at the toplevel |
280 | (@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and | |
281 | @var{as-or-bs-tag}). | |
eee0877c AW |
282 | @end deffn |
283 | ||
eee0877c | 284 | @subsubheading Compile Functions |
718bc349 AW |
285 | It is sometimes useful to be able to compile anonymous PEG patterns at |
286 | runtime. These functions let you do that using either syntax. | |
eee0877c AW |
287 | |
288 | @deffn {Scheme Procedure} peg-string-compile peg-string capture-type | |
718bc349 AW |
289 | Compiles the PEG pattern in @var{peg-string} propagating according to |
290 | @var{capture-type} (capture-type can be any of the values from | |
40ebbd64 | 291 | @code{define-peg-pattern}). |
eee0877c AW |
292 | @end deffn |
293 | ||
294 | ||
fee87b82 | 295 | @deffn {Scheme Procedure} compile-peg-pattern peg-sexp capture-type |
718bc349 AW |
296 | Compiles the PEG pattern in @var{peg-sexp} propagating according to |
297 | @var{capture-type} (capture-type can be any of the values from | |
40ebbd64 | 298 | @code{define-peg-pattern}). |
eee0877c AW |
299 | @end deffn |
300 | ||
ecaa261a NL |
301 | The functions return syntax objects, which can be useful if you want to |
302 | use them in macros. If all you want is to define a new nonterminal, you | |
303 | can do the following: | |
304 | ||
305 | @lisp | |
306 | (define exp '(+ "a")) | |
fee87b82 | 307 | (define as (compile (compile-peg-pattern exp 'body))) |
ecaa261a NL |
308 | @end lisp |
309 | ||
310 | You can use this nonterminal with all of the regular PEG functions: | |
311 | ||
312 | @lisp | |
8022f502 | 313 | (match-pattern as "aaaaa") @result{} |
ecaa261a NL |
314 | #<peg start: 0 end: 5 string: bbbbb tree: bbbbb> |
315 | @end lisp | |
eee0877c AW |
316 | |
317 | @subsubheading Parsing & Matching Functions | |
318 | ||
718bc349 AW |
319 | For our purposes, ``parsing'' means parsing a string into a tree |
320 | starting from the first character, while ``matching'' means searching | |
321 | through the string for a substring. In practice, the only difference | |
8022f502 | 322 | between the two functions is that @code{match-pattern} gives up if it can't |
d7e2f5e3 | 323 | find a valid substring starting at index 0 and @code{search-for-pattern} keeps |
718bc349 AW |
324 | looking. They are both equally capable of ``parsing'' and ``matching'' |
325 | given those constraints. | |
eee0877c | 326 | |
8022f502 | 327 | @deffn {Scheme Procedure} match-pattern nonterm string |
718bc349 | 328 | Parses @var{string} using the PEG stored in @var{nonterm}. If no match |
8022f502 | 329 | was found, @code{match-pattern} returns false. If a match was found, a PEG |
718bc349 AW |
330 | match record is returned. |
331 | ||
40ebbd64 | 332 | The @code{capture-type} argument to @code{define-peg-pattern} allows you to |
718bc349 AW |
333 | choose what information to hold on to while parsing. The options are: |
334 | ||
335 | @table @code | |
336 | @item all | |
337 | tag the matched text with the nonterminal | |
338 | @item body | |
339 | just the matched text | |
340 | @item none | |
341 | nothing | |
342 | @end table | |
eee0877c AW |
343 | |
344 | @lisp | |
40ebbd64 | 345 | (define-peg-pattern as all (+ "a")) |
8022f502 | 346 | (match-pattern as "aabbcc") @result{} |
eee0877c AW |
347 | #<peg start: 0 end: 2 string: aabbcc tree: (as aa)> |
348 | ||
40ebbd64 | 349 | (define-peg-pattern as body (+ "a")) |
8022f502 | 350 | (match-pattern as "aabbcc") @result{} |
eee0877c AW |
351 | #<peg start: 0 end: 2 string: aabbcc tree: aa> |
352 | ||
40ebbd64 | 353 | (define-peg-pattern as none (+ "a")) |
8022f502 | 354 | (match-pattern as "aabbcc") @result{} |
eee0877c AW |
355 | #<peg start: 0 end: 2 string: aabbcc tree: ()> |
356 | ||
40ebbd64 | 357 | (define-peg-pattern bs body (+ "b")) |
8022f502 | 358 | (match-pattern bs "aabbcc") @result{} |
eee0877c AW |
359 | #f |
360 | @end lisp | |
361 | @end deffn | |
362 | ||
d7e2f5e3 | 363 | @deffn {Scheme Macro} search-for-pattern nonterm-or-peg string |
718bc349 AW |
364 | Searches through @var{string} looking for a matching subexpression. |
365 | @var{nonterm-or-peg} can either be a nonterminal or a literal PEG | |
d7e2f5e3 | 366 | pattern. When a literal PEG pattern is provided, @code{search-for-pattern} works |
718bc349 | 367 | very similarly to the regular expression searches many hackers are used |
d7e2f5e3 | 368 | to. If no match was found, @code{search-for-pattern} returns false. If a match |
718bc349 | 369 | was found, a PEG match record is returned. |
eee0877c AW |
370 | |
371 | @lisp | |
40ebbd64 | 372 | (define-peg-pattern as body (+ "a")) |
d7e2f5e3 | 373 | (search-for-pattern as "aabbcc") @result{} |
eee0877c | 374 | #<peg start: 0 end: 2 string: aabbcc tree: aa> |
d7e2f5e3 | 375 | (search-for-pattern (+ "a") "aabbcc") @result{} |
eee0877c | 376 | #<peg start: 0 end: 2 string: aabbcc tree: aa> |
d7e2f5e3 | 377 | (search-for-pattern "'a'+" "aabbcc") @result{} |
eee0877c AW |
378 | #<peg start: 0 end: 2 string: aabbcc tree: aa> |
379 | ||
40ebbd64 | 380 | (define-peg-pattern as all (+ "a")) |
d7e2f5e3 | 381 | (search-for-pattern as "aabbcc") @result{} |
eee0877c AW |
382 | #<peg start: 0 end: 2 string: aabbcc tree: (as aa)> |
383 | ||
40ebbd64 | 384 | (define-peg-pattern bs body (+ "b")) |
d7e2f5e3 | 385 | (search-for-pattern bs "aabbcc") @result{} |
eee0877c | 386 | #<peg start: 2 end: 4 string: aabbcc tree: bb> |
d7e2f5e3 | 387 | (search-for-pattern (+ "b") "aabbcc") @result{} |
eee0877c | 388 | #<peg start: 2 end: 4 string: aabbcc tree: bb> |
d7e2f5e3 | 389 | (search-for-pattern "'b'+" "aabbcc") @result{} |
eee0877c AW |
390 | #<peg start: 2 end: 4 string: aabbcc tree: bb> |
391 | ||
40ebbd64 | 392 | (define-peg-pattern zs body (+ "z")) |
d7e2f5e3 | 393 | (search-for-pattern zs "aabbcc") @result{} |
eee0877c | 394 | #f |
d7e2f5e3 | 395 | (search-for-pattern (+ "z") "aabbcc") @result{} |
eee0877c | 396 | #f |
d7e2f5e3 | 397 | (search-for-pattern "'z'+" "aabbcc") @result{} |
eee0877c AW |
398 | #f |
399 | @end lisp | |
400 | @end deffn | |
401 | ||
402 | @subsubheading PEG Match Records | |
8022f502 | 403 | The @code{match-pattern} and @code{search-for-pattern} functions both return PEG |
718bc349 AW |
404 | match records. Actual information can be extracted from these with the |
405 | following functions. | |
eee0877c | 406 | |
d7e2f5e3 | 407 | @deffn {Scheme Procedure} peg:string match-record |
718bc349 | 408 | Returns the original string that was parsed in the creation of |
d7e2f5e3 | 409 | @code{match-record}. |
eee0877c AW |
410 | @end deffn |
411 | ||
d7e2f5e3 | 412 | @deffn {Scheme Procedure} peg:start match-record |
718bc349 AW |
413 | Returns the index of the first parsed character in the original string |
414 | (from @code{peg:string}). If this is the same as @code{peg:end}, | |
415 | nothing was parsed. | |
eee0877c AW |
416 | @end deffn |
417 | ||
d7e2f5e3 | 418 | @deffn {Scheme Procedure} peg:end match-record |
718bc349 AW |
419 | Returns one more than the index of the last parsed character in the |
420 | original string (from @code{peg:string}). If this is the same as | |
421 | @code{peg:start}, nothing was parsed. | |
eee0877c AW |
422 | @end deffn |
423 | ||
d7e2f5e3 NL |
424 | @deffn {Scheme Procedure} peg:substring match-record |
425 | Returns the substring parsed by @code{match-record}. This is equivalent to | |
426 | @code{(substring (peg:string match-record) (peg:start match-record) (peg:end | |
427 | match-record))}. | |
eee0877c AW |
428 | @end deffn |
429 | ||
d7e2f5e3 NL |
430 | @deffn {Scheme Procedure} peg:tree match-record |
431 | Returns the tree parsed by @code{match-record}. | |
eee0877c AW |
432 | @end deffn |
433 | ||
d7e2f5e3 NL |
434 | @deffn {Scheme Procedure} peg-record? match-record |
435 | Returns true if @code{match-record} is a PEG match record, or false | |
718bc349 | 436 | otherwise. |
eee0877c AW |
437 | @end deffn |
438 | ||
439 | Example: | |
440 | @lisp | |
40ebbd64 | 441 | (define-peg-pattern bs all (peg "'b'+")) |
eee0877c | 442 | |
d7e2f5e3 | 443 | (search-for-pattern bs "aabbcc") @result{} |
eee0877c AW |
444 | #<peg start: 2 end: 4 string: aabbcc tree: (bs bb)> |
445 | ||
d7e2f5e3 | 446 | (let ((pm (search-for-pattern bs "aabbcc"))) |
eee0877c AW |
447 | `((string ,(peg:string pm)) |
448 | (start ,(peg:start pm)) | |
449 | (end ,(peg:end pm)) | |
450 | (substring ,(peg:substring pm)) | |
451 | (tree ,(peg:tree pm)) | |
452 | (record? ,(peg-record? pm)))) @result{} | |
453 | ((string "aabbcc") | |
454 | (start 2) | |
455 | (end 4) | |
456 | (substring "bb") | |
457 | (tree (bs "bb")) | |
458 | (record? #t)) | |
459 | @end lisp | |
460 | ||
461 | @subsubheading Miscellaneous | |
462 | ||
463 | @deffn {Scheme Procedure} context-flatten tst lst | |
718bc349 AW |
464 | Takes a predicate @var{tst} and a list @var{lst}. Flattens @var{lst} |
465 | until all elements are either atoms or satisfy @var{tst}. If @var{lst} | |
466 | itself satisfies @var{tst}, @code{(list lst)} is returned (this is a | |
467 | flat list whose only element satisfies @var{tst}). | |
eee0877c AW |
468 | |
469 | @lisp | |
470 | (context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{} | |
471 | (2 2 (1 1 (2 2)) 2 2 (1 1)) | |
472 | (context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{} | |
473 | ((1 1 (1 1 (2 2)) (2 2 (1 1)))) | |
474 | @end lisp | |
475 | ||
476 | If you're wondering why this is here, take a look at the tutorial. | |
477 | @end deffn | |
478 | ||
479 | @deffn {Scheme Procedure} keyword-flatten terms lst | |
718bc349 AW |
480 | A less general form of @code{context-flatten}. Takes a list of terminal |
481 | atoms @code{terms} and flattens @var{lst} until all elements are either | |
482 | atoms, or lists which have an atom from @code{terms} as their first | |
483 | element. | |
eee0877c AW |
484 | @lisp |
485 | (keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{} | |
486 | (c a b (a c) (b c) c (b a) c a) | |
487 | @end lisp | |
488 | ||
489 | If you're wondering why this is here, take a look at the tutorial. | |
490 | @end deffn | |
491 | ||
492 | @node PEG Tutorial | |
493 | @subsection PEG Tutorial | |
494 | ||
495 | @subsubheading Parsing /etc/passwd | |
496 | This example will show how to parse /etc/passwd using PEGs. | |
497 | ||
498 | First we define an example /etc/passwd file: | |
499 | ||
500 | @lisp | |
501 | (define *etc-passwd* | |
502 | "root:x:0:0:root:/root:/bin/bash | |
503 | daemon:x:1:1:daemon:/usr/sbin:/bin/sh | |
504 | bin:x:2:2:bin:/bin:/bin/sh | |
505 | sys:x:3:3:sys:/dev:/bin/sh | |
506 | nobody:x:65534:65534:nobody:/nonexistent:/bin/sh | |
507 | messagebus:x:103:107::/var/run/dbus:/bin/false | |
508 | ") | |
509 | @end lisp | |
510 | ||
718bc349 AW |
511 | As a first pass at this, we might want to have all the entries in |
512 | /etc/passwd in a list. | |
eee0877c AW |
513 | |
514 | Doing this with string-based PEG syntax would look like this: | |
515 | @lisp | |
3ebd5786 | 516 | (define-peg-string-patterns |
eee0877c AW |
517 | "passwd <- entry* !. |
518 | entry <-- (! NL .)* NL* | |
519 | NL < '\n'") | |
520 | @end lisp | |
718bc349 AW |
521 | |
522 | A @code{passwd} file is 0 or more entries (@code{entry*}) until the end | |
523 | of the file (@code{!.} (@code{.} is any character, so @code{!.} means | |
524 | ``not anything'')). We want to capture the data in the nonterminal | |
525 | @code{passwd}, but not tag it with the name, so we use @code{<-}. | |
526 | ||
527 | An entry is a series of 0 or more characters that aren't newlines | |
528 | (@code{(! NL .)*}) followed by 0 or more newlines (@code{NL*}). We want | |
529 | to tag all the entries with @code{entry}, so we use @code{<--}. | |
530 | ||
531 | A newline is just a literal newline (@code{'\n'}). We don't want a | |
532 | bunch of newlines cluttering up the output, so we use @code{<} to throw | |
533 | away the captured data. | |
eee0877c AW |
534 | |
535 | Here is the same PEG defined using S-expressions: | |
536 | @lisp | |
40ebbd64 NL |
537 | (define-peg-pattern passwd body (and (* entry) (not-followed-by peg-any))) |
538 | (define-peg-pattern entry all (and (* (and (not-followed-by NL) peg-any)) | |
e0d14025 | 539 | (* NL))) |
40ebbd64 | 540 | (define-peg-pattern NL none "\n") |
eee0877c AW |
541 | @end lisp |
542 | ||
718bc349 AW |
543 | Obviously this is much more verbose. On the other hand, it's more |
544 | explicit, and thus easier to build automatically. However, there are | |
545 | some tricks that make S-expressions easier to use in some cases. One is | |
546 | the @code{ignore} keyword; the string syntax has no way to say ``throw | |
547 | away this text'' except breaking it out into a separate nonterminal. | |
548 | For instance, to throw away the newlines we had to define @code{NL}. In | |
549 | the S-expression syntax, we could have simply written @code{(ignore | |
550 | "\n")}. Also, for the cases where string syntax is really much cleaner, | |
551 | the @code{peg} keyword can be used to embed string syntax in | |
552 | S-expression syntax. For instance, we could have written: | |
553 | ||
eee0877c | 554 | @lisp |
40ebbd64 | 555 | (define-peg-pattern passwd body (peg "entry* !.")) |
eee0877c AW |
556 | @end lisp |
557 | ||
718bc349 AW |
558 | However we define it, parsing @code{*etc-passwd*} with the @code{passwd} |
559 | nonterminal yields the same results: | |
560 | ||
eee0877c | 561 | @lisp |
8022f502 | 562 | (peg:tree (match-pattern passwd *etc-passwd*)) @result{} |
eee0877c AW |
563 | ((entry "root:x:0:0:root:/root:/bin/bash") |
564 | (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh") | |
565 | (entry "bin:x:2:2:bin:/bin:/bin/sh") | |
566 | (entry "sys:x:3:3:sys:/dev:/bin/sh") | |
567 | (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh") | |
568 | (entry "messagebus:x:103:107::/var/run/dbus:/bin/false")) | |
569 | @end lisp | |
570 | ||
571 | However, here is something to be wary of: | |
718bc349 | 572 | |
eee0877c | 573 | @lisp |
8022f502 | 574 | (peg:tree (match-pattern passwd "one entry")) @result{} |
eee0877c AW |
575 | (entry "one entry") |
576 | @end lisp | |
577 | ||
718bc349 AW |
578 | By default, the parse trees generated by PEGs are compressed as much as |
579 | possible without losing information. It may not look like this is what | |
580 | you want at first, but uncompressed parse trees are an enormous headache | |
581 | (there's no easy way to predict how deep particular lists will nest, | |
582 | there are empty lists littered everywhere, etc. etc.). One side-effect | |
583 | of this, however, is that sometimes the compressor is too aggressive. | |
584 | No information is discarded when @code{((entry "one entry"))} is | |
585 | compressed to @code{(entry "one entry")}, but in this particular case it | |
586 | probably isn't what we want. | |
587 | ||
588 | There are two functions for easily dealing with this: | |
589 | @code{keyword-flatten} and @code{context-flatten}. The | |
590 | @code{keyword-flatten} function takes a list of keywords and a list to | |
591 | flatten, then tries to coerce the list such that the first element of | |
592 | all sublists is one of the keywords. The @code{context-flatten} | |
593 | function is similar, but instead of a list of keywords it takes a | |
594 | predicate that should indicate whether a given sublist is good enough | |
595 | (refer to the API reference for more details). | |
eee0877c AW |
596 | |
597 | What we want here is @code{keyword-flatten}. | |
598 | @lisp | |
8022f502 | 599 | (keyword-flatten '(entry) (peg:tree (match-pattern passwd *etc-passwd*))) @result{} |
eee0877c AW |
600 | ((entry "root:x:0:0:root:/root:/bin/bash") |
601 | (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh") | |
602 | (entry "bin:x:2:2:bin:/bin:/bin/sh") | |
603 | (entry "sys:x:3:3:sys:/dev:/bin/sh") | |
604 | (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh") | |
605 | (entry "messagebus:x:103:107::/var/run/dbus:/bin/false")) | |
8022f502 | 606 | (keyword-flatten '(entry) (peg:tree (match-pattern passwd "one entry"))) @result{} |
eee0877c AW |
607 | ((entry "one entry")) |
608 | @end lisp | |
609 | ||
718bc349 AW |
610 | Of course, this is a somewhat contrived example. In practice we would |
611 | probably just tag the @code{passwd} nonterminal to remove the ambiguity | |
612 | (using either the @code{all} keyword for S-expressions or the @code{<--} | |
613 | symbol for strings).. | |
eee0877c AW |
614 | |
615 | @lisp | |
40ebbd64 | 616 | (define-peg-pattern tag-passwd all (peg "entry* !.")) |
8022f502 | 617 | (peg:tree (match-pattern tag-passwd *etc-passwd*)) @result{} |
eee0877c AW |
618 | (tag-passwd |
619 | (entry "root:x:0:0:root:/root:/bin/bash") | |
620 | (entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh") | |
621 | (entry "bin:x:2:2:bin:/bin:/bin/sh") | |
622 | (entry "sys:x:3:3:sys:/dev:/bin/sh") | |
623 | (entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh") | |
624 | (entry "messagebus:x:103:107::/var/run/dbus:/bin/false")) | |
8022f502 | 625 | (peg:tree (match-pattern tag-passwd "one entry")) |
eee0877c AW |
626 | (tag-passwd |
627 | (entry "one entry")) | |
628 | @end lisp | |
629 | ||
718bc349 AW |
630 | If you're ever uncertain about the potential results of parsing |
631 | something, remember the two absolute rules: | |
632 | @enumerate | |
633 | @item | |
634 | No parsing information will ever be discarded. | |
635 | @item | |
636 | There will never be any lists with fewer than 2 elements. | |
637 | @end enumerate | |
638 | ||
639 | For the purposes of (1), "parsing information" means things tagged with | |
640 | the @code{any} keyword or the @code{<--} symbol. Plain strings will be | |
641 | concatenated. | |
eee0877c | 642 | |
718bc349 AW |
643 | Let's extend this example a bit more and actually pull some useful |
644 | information out of the passwd file: | |
eee0877c | 645 | |
eee0877c | 646 | @lisp |
3ebd5786 | 647 | (define-peg-string-patterns |
eee0877c AW |
648 | "passwd <-- entry* !. |
649 | entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL* | |
650 | login <-- text | |
651 | pass <-- text | |
652 | uid <-- [0-9]* | |
653 | gid <-- [0-9]* | |
654 | nameORcomment <-- text | |
655 | homedir <-- path | |
656 | shell <-- path | |
657 | path <-- (SLASH pathELEMENT)* | |
658 | pathELEMENT <-- (!NL !C !'/' .)* | |
659 | text <- (!NL !C .)* | |
660 | C < ':' | |
661 | NL < '\n' | |
662 | SLASH < '/'") | |
663 | @end lisp | |
664 | ||
665 | This produces rather pretty parse trees: | |
666 | @lisp | |
667 | (passwd | |
668 | (entry (login "root") | |
669 | (pass "x") | |
670 | (uid "0") | |
671 | (gid "0") | |
672 | (nameORcomment "root") | |
673 | (homedir (path (pathELEMENT "root"))) | |
674 | (shell (path (pathELEMENT "bin") (pathELEMENT "bash")))) | |
675 | (entry (login "daemon") | |
676 | (pass "x") | |
677 | (uid "1") | |
678 | (gid "1") | |
679 | (nameORcomment "daemon") | |
680 | (homedir | |
681 | (path (pathELEMENT "usr") (pathELEMENT "sbin"))) | |
682 | (shell (path (pathELEMENT "bin") (pathELEMENT "sh")))) | |
683 | (entry (login "bin") | |
684 | (pass "x") | |
685 | (uid "2") | |
686 | (gid "2") | |
687 | (nameORcomment "bin") | |
688 | (homedir (path (pathELEMENT "bin"))) | |
689 | (shell (path (pathELEMENT "bin") (pathELEMENT "sh")))) | |
690 | (entry (login "sys") | |
691 | (pass "x") | |
692 | (uid "3") | |
693 | (gid "3") | |
694 | (nameORcomment "sys") | |
695 | (homedir (path (pathELEMENT "dev"))) | |
696 | (shell (path (pathELEMENT "bin") (pathELEMENT "sh")))) | |
697 | (entry (login "nobody") | |
698 | (pass "x") | |
699 | (uid "65534") | |
700 | (gid "65534") | |
701 | (nameORcomment "nobody") | |
702 | (homedir (path (pathELEMENT "nonexistent"))) | |
703 | (shell (path (pathELEMENT "bin") (pathELEMENT "sh")))) | |
704 | (entry (login "messagebus") | |
705 | (pass "x") | |
706 | (uid "103") | |
707 | (gid "107") | |
708 | nameORcomment | |
709 | (homedir | |
710 | (path (pathELEMENT "var") | |
711 | (pathELEMENT "run") | |
712 | (pathELEMENT "dbus"))) | |
713 | (shell (path (pathELEMENT "bin") (pathELEMENT "false"))))) | |
714 | @end lisp | |
715 | ||
718bc349 AW |
716 | Notice that when there's no entry in a field (e.g. @code{nameORcomment} |
717 | for messagebus) the symbol is inserted. This is the ``don't throw away | |
718 | any information'' rule---we succesfully matched a @code{nameORcomment} | |
719 | of 0 characters (since we used @code{*} when defining it). This is | |
720 | usually what you want, because it allows you to e.g. use @code{list-ref} | |
721 | to pull out elements (since they all have known offsets). | |
eee0877c | 722 | |
718bc349 AW |
723 | If you'd prefer not to have symbols for empty matches, you can replace |
724 | the @code{*} with a @code{+} and add a @code{?} after the | |
725 | @code{nameORcomment} in @code{entry}. Then it will try to parse 1 or | |
726 | more characters, fail (inserting nothing into the parse tree), but | |
727 | continue because it didn't have to match the nameORcomment to continue. | |
eee0877c AW |
728 | |
729 | ||
730 | @subsubheading Embedding Arithmetic Expressions | |
731 | ||
732 | We can parse simple mathematical expressions with the following PEG: | |
733 | ||
734 | @lisp | |
3ebd5786 | 735 | (define-peg-string-patterns |
eee0877c AW |
736 | "expr <- sum |
737 | sum <-- (product ('+' / '-') sum) / product | |
738 | product <-- (value ('*' / '/') product) / value | |
739 | value <-- number / '(' expr ')' | |
740 | number <-- [0-9]+") | |
741 | @end lisp | |
742 | ||
743 | Then: | |
744 | @lisp | |
8022f502 | 745 | (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2")) @result{} |
eee0877c AW |
746 | (sum (product (value (number "1"))) |
747 | "+" | |
748 | (sum (product | |
749 | (value (number "1")) | |
750 | "/" | |
751 | (product | |
752 | (value (number "2")) | |
753 | "*" | |
754 | (product (value (number "3"))))) | |
755 | "+" | |
756 | (sum (product | |
757 | (value "(" | |
758 | (sum (product (value (number "1"))) | |
759 | "+" | |
760 | (sum (product (value (number "1"))))) | |
761 | ")") | |
762 | "/" | |
763 | (product (value (number "2"))))))) | |
764 | @end lisp | |
765 | ||
718bc349 AW |
766 | There is very little wasted effort in this PEG. The @code{number} |
767 | nonterminal has to be tagged because otherwise the numbers might run | |
768 | together with the arithmetic expressions during the string concatenation | |
769 | stage of parse-tree compression (the parser will see ``1'' followed by | |
770 | ``/'' and decide to call it ``1/''). When in doubt, tag. | |
eee0877c AW |
771 | |
772 | It is very easy to turn these parse trees into lisp expressions: | |
718bc349 | 773 | |
eee0877c AW |
774 | @lisp |
775 | (define (parse-sum sum left . rest) | |
776 | (if (null? rest) | |
777 | (apply parse-product left) | |
778 | (list (string->symbol (car rest)) | |
779 | (apply parse-product left) | |
780 | (apply parse-sum (cadr rest))))) | |
781 | ||
782 | (define (parse-product product left . rest) | |
783 | (if (null? rest) | |
784 | (apply parse-value left) | |
785 | (list (string->symbol (car rest)) | |
786 | (apply parse-value left) | |
787 | (apply parse-product (cadr rest))))) | |
788 | ||
789 | (define (parse-value value first . rest) | |
790 | (if (null? rest) | |
791 | (string->number (cadr first)) | |
792 | (apply parse-sum (car rest)))) | |
793 | ||
794 | (define parse-expr parse-sum) | |
795 | @end lisp | |
718bc349 AW |
796 | |
797 | (Notice all these functions look very similar; for a more complicated | |
798 | PEG, it would be worth abstracting.) | |
eee0877c AW |
799 | |
800 | Then: | |
801 | @lisp | |
8022f502 | 802 | (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{} |
eee0877c AW |
803 | (+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2))) |
804 | @end lisp | |
805 | ||
718bc349 AW |
806 | But wait! The associativity is wrong! Where it says @code{(/ 1 (* 2 |
807 | 3))}, it should say @code{(* (/ 1 2) 3)}. | |
eee0877c | 808 | |
718bc349 AW |
809 | It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-') |
810 | sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) / | |
811 | product"}, but this is a Bad Idea. PEGs don't support left recursion. | |
812 | To see why, imagine what the parser will do here. When it tries to | |
813 | parse @code{sum}, it first has to try and parse @code{sum}. But to do | |
814 | that, it first has to try and parse @code{sum}. This will continue | |
815 | until the stack gets blown off. | |
eee0877c | 816 | |
718bc349 AW |
817 | So how does one parse left-associative binary operators with PEGs? |
818 | Honestly, this is one of their major shortcomings. There's no | |
819 | general-purpose way of doing this, but here the repetition operators are | |
820 | a good choice: | |
eee0877c AW |
821 | |
822 | @lisp | |
823 | (use-modules (srfi srfi-1)) | |
824 | ||
3ebd5786 | 825 | (define-peg-string-patterns |
eee0877c AW |
826 | "expr <- sum |
827 | sum <-- (product ('+' / '-'))* product | |
828 | product <-- (value ('*' / '/'))* value | |
829 | value <-- number / '(' expr ')' | |
830 | number <-- [0-9]+") | |
831 | ||
832 | ;; take a deep breath... | |
833 | (define (make-left-parser next-func) | |
834 | (lambda (sum first . rest) ;; general form, comments below assume | |
835 | ;; that we're dealing with a sum expression | |
836 | (if (null? rest) ;; form (sum (product ...)) | |
837 | (apply next-func first) | |
838 | (if (string? (cadr first));; form (sum ((product ...) "+") (product ...)) | |
839 | (list (string->symbol (cadr first)) | |
840 | (apply next-func (car first)) | |
841 | (apply next-func (car rest))) | |
842 | ;; form (sum (((product ...) "+") ((product ...) "+")) (product ...)) | |
843 | (car | |
844 | (reduce ;; walk through the list and build a left-associative tree | |
845 | (lambda (l r) | |
846 | (list (list (cadr r) (car r) (apply next-func (car l))) | |
847 | (string->symbol (cadr l)))) | |
848 | 'ignore | |
849 | (append ;; make a list of all the products | |
850 | ;; the first one should be pre-parsed | |
851 | (list (list (apply next-func (caar first)) | |
852 | (string->symbol (cadar first)))) | |
853 | (cdr first) | |
854 | ;; the last one has to be added in | |
855 | (list (append rest '("done")))))))))) | |
856 | ||
857 | (define (parse-value value first . rest) | |
858 | (if (null? rest) | |
859 | (string->number (cadr first)) | |
860 | (apply parse-sum (car rest)))) | |
861 | (define parse-product (make-left-parser parse-value)) | |
862 | (define parse-sum (make-left-parser parse-product)) | |
863 | (define parse-expr parse-sum) | |
864 | @end lisp | |
865 | ||
866 | Then: | |
867 | @lisp | |
8022f502 | 868 | (apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{} |
eee0877c AW |
869 | (+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2)) |
870 | @end lisp | |
871 | ||
718bc349 AW |
872 | As you can see, this is much uglier (it could be made prettier by using |
873 | @code{context-flatten}, but the way it's written above makes it clear | |
874 | how we deal with the three ways the zero-or-more @code{*} expression can | |
875 | parse). Fortunately, most of the time we can get away with only using | |
876 | right-associativity. | |
eee0877c AW |
877 | |
878 | @subsubheading Simplified Functions | |
879 | ||
718bc349 AW |
880 | For a more tantalizing example, consider the following grammar that |
881 | parses (highly) simplified C functions: | |
882 | ||
eee0877c | 883 | @lisp |
3ebd5786 | 884 | (define-peg-string-patterns |
eee0877c AW |
885 | "cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB |
886 | ctype <-- cidentifier | |
887 | cname <-- cidentifier | |
888 | cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP | |
889 | carg <-- cSP ctype cSP cname | |
890 | cbody <-- cstatement * | |
891 | cidentifier <- [a-zA-z][a-zA-Z0-9_]* | |
892 | cstatement <-- (!';'.)*cSC cSP | |
893 | cSC < ';' | |
894 | cCOMMA < ',' | |
895 | cLP < '(' | |
896 | cRP < ')' | |
897 | cLB < '@{' | |
898 | cRB < '@}' | |
899 | cSP < [ \t\n]*") | |
900 | @end lisp | |
901 | ||
902 | Then: | |
903 | @lisp | |
8022f502 | 904 | (match-pattern cfunc "int square(int a) @{ return a*a;@}") @result{} |
eee0877c AW |
905 | (32 |
906 | (cfunc (ctype "int") | |
907 | (cname "square") | |
908 | (cargs (carg (ctype "int") (cname "a"))) | |
909 | (cbody (cstatement "return a*a")))) | |
910 | @end lisp | |
911 | ||
912 | And: | |
913 | @lisp | |
8022f502 | 914 | (match-pattern cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{} |
eee0877c AW |
915 | (52 |
916 | (cfunc (ctype "int") | |
917 | (cname "mod") | |
918 | (cargs (carg (ctype "int") (cname "a")) | |
919 | (carg (ctype "int") (cname "b"))) | |
920 | (cbody (cstatement "int c = a/b") | |
921 | (cstatement "return a- b*c")))) | |
922 | @end lisp | |
923 | ||
718bc349 AW |
924 | By wrapping all the @code{carg} nonterminals in a @code{cargs} |
925 | nonterminal, we were able to remove any ambiguity in the parsing | |
926 | structure and avoid having to call @code{context-flatten} on the output | |
8022f502 | 927 | of @code{match-pattern}. We used the same trick with the @code{cstatement} |
718bc349 AW |
928 | nonterminals, wrapping them in a @code{cbody} nonterminal. |
929 | ||
930 | The whitespace nonterminal @code{cSP} used here is a (very) useful | |
931 | instantiation of a common pattern for matching syntactically irrelevant | |
932 | information. Since it's tagged with @code{<} and ends with @code{*} it | |
933 | won't clutter up the parse trees (all the empty lists will be discarded | |
934 | during the compression step) and it will never cause parsing to fail. | |
97c84694 NL |
935 | |
936 | @node PEG Internals | |
937 | @subsection PEG Internals | |
938 | ||
939 | A PEG parser takes a string as input and attempts to parse it as a given | |
940 | nonterminal. The key idea of the PEG implementation is that every | |
941 | nonterminal is just a function that takes a string as an argument and | |
942 | attempts to parse that string as its nonterminal. The functions always | |
943 | start from the beginning, but a parse is considered successful if there | |
944 | is material left over at the end. | |
945 | ||
946 | This makes it easy to model different PEG parsing operations. For | |
947 | instance, consider the PEG grammar @code{"ab"}, which could also be | |
948 | written @code{(and "a" "b")}. It matches the string ``ab''. Here's how | |
949 | that might be implemented in the PEG style: | |
950 | ||
951 | @lisp | |
952 | (define (match-and-a-b str) | |
953 | (match-a str) | |
954 | (match-b str)) | |
955 | @end lisp | |
956 | ||
957 | As you can see, the use of functions provides an easy way to model | |
958 | sequencing. In a similar way, one could model @code{(or a b)} with | |
959 | something like the following: | |
960 | ||
961 | @lisp | |
962 | (define (match-or-a-b str) | |
963 | (or (match-a str) (match-b str))) | |
964 | @end lisp | |
965 | ||
966 | Here the semantics of a PEG @code{or} expression map naturally onto | |
967 | Scheme's @code{or} operator. This function will attempt to run | |
968 | @code{(match-a str)}, and return its result if it succeeds. Otherwise it | |
969 | will run @code{(match-b str)}. | |
970 | ||
971 | Of course, the code above wouldn't quite work. We need some way for the | |
972 | parsing functions to communicate. The actual interface used is below. | |
973 | ||
974 | @subsubheading Parsing Function Interface | |
975 | ||
976 | A parsing function takes three arguments - a string, the length of that | |
977 | string, and the position in that string it should start parsing at. In | |
978 | effect, the parsing functions pass around substrings in pieces - the | |
979 | first argument is a buffer of characters, and the second two give a | |
980 | range within that buffer that the parsing function should look at. | |
981 | ||
982 | Parsing functions return either #f, if they failed to match their | |
983 | nonterminal, or a list whose first element must be an integer | |
984 | representing the final position in the string they matched and whose cdr | |
985 | can be any other data the function wishes to return, or '() if it | |
986 | doesn't have any more data. | |
987 | ||
988 | The one caveat is that if the extra data it returns is a list, any | |
8022f502 | 989 | adjacent strings in that list will be appended by @code{match-pattern}. For |
97c84694 | 990 | instance, if a parsing function returns @code{(13 ("a" "b" "c"))}, |
8022f502 | 991 | @code{match-pattern} will take @code{(13 ("abc"))} as its value. |
97c84694 NL |
992 | |
993 | For example, here is a function to match ``ab'' using the actual | |
994 | interface. | |
995 | ||
996 | @lisp | |
997 | (define (match-a-b str len pos) | |
998 | (and (<= (+ pos 2) len) | |
999 | (string= str "ab" pos (+ pos 2)) | |
1000 | (list (+ pos 2) '()))) ; we return no extra information | |
1001 | @end lisp | |
1002 | ||
1003 | The above function can be used to match a string by running | |
8022f502 | 1004 | @code{(match-pattern match-a-b "ab")}. |
94e8517c NL |
1005 | |
1006 | @subsubheading Code Generators and Extensible Syntax | |
1007 | ||
40ebbd64 | 1008 | PEG expressions, such as those in a @code{define-peg-pattern} form, are |
94e8517c NL |
1009 | interpreted internally in two steps. |
1010 | ||
1011 | First, any string PEG is expanded into an s-expression PEG by the code | |
1012 | in the @code{(ice-9 peg string-peg)} module. | |
1013 | ||
1014 | Then, then s-expression PEG that results is compiled into a parsing | |
1015 | function by the @code{(ice-9 peg codegen)} module. In particular, the | |
fee87b82 | 1016 | function @code{compile-peg-pattern} is called on the s-expression. It then |
94e8517c NL |
1017 | decides what to do based on the form it is passed. |
1018 | ||
fee87b82 | 1019 | The PEG syntax can be expanded by providing @code{compile-peg-pattern} more |
94e8517c NL |
1020 | options for what to do with its forms. The extended syntax will be |
1021 | associated with a symbol, for instance @code{my-parsing-form}, and will | |
1022 | be called on all PEG expressions of the form | |
1023 | @lisp | |
1024 | (my-parsing-form ...) | |
1025 | @end lisp | |
1026 | ||
1027 | The parsing function should take two arguments. The first will be a | |
1028 | syntax object containing a list with all of the arguments to the form | |
1029 | (but not the form's name), and the second will be the | |
40ebbd64 | 1030 | @code{capture-type} argument that is passed to @code{define-peg-pattern}. |
94e8517c NL |
1031 | |
1032 | New functions can be registered by calling @code{(add-peg-compiler! | |
1033 | symbol function)}, where @code{symbol} is the symbol that will indicate | |
1034 | a form of this type and @code{function} is the code generating function | |
1035 | described above. The function @code{add-peg-compiler!} is exported from | |
1036 | the @code{(ice-9 peg codegen)} module. |