HCoop Git - bpt/guile.git/blame_incremental

... / ...

Commit	Line	Data
	1	@c --texinfo--
	2	@c This is part of the GNU Guile Reference Manual.
	3	@c Copyright (C) 2006, 2010, 2011
	4	@c Free Software Foundation, Inc.
	5	@c See the file guile.texi for copying conditions.
	6
	7	@node PEG Parsing
	8	@section PEG Parsing
	9
	10	Parsing Expression Grammars (PEGs) are a way of specifying formal
	11	languages for text processing. They can be used either for matching
	12	(like regular expressions) or for building recursive descent parsers
	13	(like lex/yacc). Guile uses a superset of PEG syntax that allows more
	14	control over what information is preserved during parsing.
	15
	16	Wikipedia has a clear and concise introduction to PEGs if you want to
	17	familiarize yourself with the syntax:
	18	@url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
	19
	20	The module works by compiling PEGs down to lambda expressions. These
	21	can either be stored in variables at compile-time by the define macros
	22	(@code{define-peg-pattern} and @code{define-peg-string-patterns}) or calculated
	23	explicitly at runtime with the compile functions
	24	(@code{compile-peg-pattern} and @code{peg-string-compile}).
	25
	26	They can then be used for either parsing (@code{match-pattern}) or searching
	27	(@code{search-for-pattern}). For convenience, @code{search-for-pattern}
	28	also takes pattern literals in case you want to inline a simple search
	29	(people often use regular expressions this way).
	30
	31	The rest of this documentation consists of a syntax reference, an API
	32	reference, and a tutorial.
	33
	34	@menu
	35	* PEG Syntax Reference::
	36	* PEG API Reference::
	37	* PEG Tutorial::
	38	* PEG Internals::
	39	@end menu
	40
	41	@node PEG Syntax Reference
	42	@subsection PEG Syntax Reference
	43
	44	@subsubheading Normal PEG Syntax:
	45
	46	@deftp {PEG Pattern} sequence a b
	47	Parses @var{a}. If this succeeds, continues to parse @var{b} from the
	48	end of the text parsed as @var{a}. Succeeds if both @var{a} and
	49	@var{b} succeed.
	50
	51	@code{"a b"}
	52
	53	@code{(and a b)}
	54	@end deftp
	55
	56	@deftp {PEG Pattern} {ordered choice} a b
	57	Parses @var{a}. If this fails, backtracks and parses @var{b}.
	58	Succeeds if either @var{a} or @var{b} succeeds.
	59
	60	@code{"a/b"}
	61
	62	@code{(or a b)}
	63	@end deftp
	64
	65	@deftp {PEG Pattern} {zero or more} a
	66	Parses @var{a} as many times in a row as it can, starting each @var{a}
	67	at the end of the text parsed by the previous @var{a}. Always
	68	succeeds.
	69
	70	@code{"a*"}
	71
	72	@code{(* a)}
	73	@end deftp
	74
	75	@deftp {PEG Pattern} {one or more} a
	76	Parses @var{a} as many times in a row as it can, starting each @var{a}
	77	at the end of the text parsed by the previous @var{a}. Succeeds if at
	78	least one @var{a} was parsed.
	79
	80	@code{"a+"}
	81
	82	@code{(+ a)}
	83	@end deftp
	84
	85	@deftp {PEG Pattern} optional a
	86	Tries to parse @var{a}. Succeeds if @var{a} succeeds.
	87
	88	@code{"a?"}
	89
	90	@code{(? a)}
	91	@end deftp
	92
	93	@deftp {PEG Pattern} {followed by} a
	94	Makes sure it is possible to parse @var{a}, but does not actually parse
	95	it. Succeeds if @var{a} would succeed.
	96
	97	@code{"&a"}
	98
	99	@code{(followed-by a)}
	100	@end deftp
	101
	102	@deftp {PEG Pattern} {not followed by} a
	103	Makes sure it is impossible to parse @var{a}, but does not actually
	104	parse it. Succeeds if @var{a} would fail.
	105
	106	@code{"!a"}
	107
	108	@code{(not-followed-by a)}
	109	@end deftp
	110
	111	@deftp {PEG Pattern} {string literal} ``abc''
	112	Parses the string @var{"abc"}. Succeeds if that parsing succeeds.
	113
	114	@code{"'abc'"}
	115
	116	@code{"abc"}
	117	@end deftp
	118
	119	@deftp {PEG Pattern} {any character}
	120	Parses any single character. Succeeds unless there is no more text to
	121	be parsed.
	122
	123	@code{"."}
	124
	125	@code{peg-any}
	126	@end deftp
	127
	128	@deftp {PEG Pattern} {character class} a b
	129	Alternative syntax for ``Ordered Choice @var{a} @var{b}'' if @var{a} and
	130	@var{b} are characters.
	131
	132	@code{"[ab]"}
	133
	134	@code{(or "a" "b")}
	135	@end deftp
	136
	137	@deftp {PEG Pattern} {range of characters} a z
	138	Parses any character falling between @var{a} and @var{z}.
	139
	140	@code{"[a-z]"}
	141
	142	@code{(range #\a #\z)}
	143	@end deftp
	144
	145	Example:
	146
	147	@example
	148	"(a !b / c &d*) 'e'+"
	149	@end example
	150
	151	Would be:
	152
	153	@lisp
	154	(and
	155	(or
	156	(and a (not-followed-by b))
	157	(and c (followed-by (* d))))
	158	(+ "e"))
	159	@end lisp
	160
	161	@subsubheading Extended Syntax
	162
	163	There is some extra syntax for S-expressions.
	164
	165	@deftp {PEG Pattern} ignore a
	166	Ignore the text matching @var{a}
	167	@end deftp
	168
	169	@deftp {PEG Pattern} capture a
	170	Capture the text matching @var{a}.
	171	@end deftp
	172
	173	@deftp {PEG Pattern} peg a
	174	Embed the PEG pattern @var{a} using string syntax.
	175	@end deftp
	176
	177	Example:
	178
	179	@example
	180	"!a / 'b'"
	181	@end example
	182
	183	Is equivalent to
	184
	185	@lisp
	186	(or (peg "!a") "b")
	187	@end lisp
	188
	189	and
	190
	191	@lisp
	192	(or (not-followed-by a) "b")
	193	@end lisp
	194
	195	@node PEG API Reference
	196	@subsection PEG API Reference
	197
	198	@subsubheading Define Macros
	199
	200	The most straightforward way to define a PEG is by using one of the
	201	define macros (both of these macroexpand into @code{define}
	202	expressions). These macros bind parsing functions to variables. These
	203	parsing functions may be invoked by @code{match-pattern} or
	204	@code{search-for-pattern}, which return a PEG match record. Raw data can be
	205	retrieved from this record with the PEG match deconstructor functions.
	206	More complicated (and perhaps enlightening) examples can be found in the
	207	tutorial.
	208
	209	@deffn {Scheme Macro} define-peg-string-patterns peg-string
	210	Defines all the nonterminals in the PEG @var{peg-string}. More
	211	precisely, @code{define-peg-string-patterns} takes a superset of PEGs. A normal PEG
	212	has a @code{<-} between the nonterminal and the pattern.
	213	@code{define-peg-string-patterns} uses this symbol to determine what information it
	214	should propagate up the parse tree. The normal @code{<-} propagates the
	215	matched text up the parse tree, @code{<--} propagates the matched text
	216	up the parse tree tagged with the name of the nonterminal, and @code{<}
	217	discards that matched text and propagates nothing up the parse tree.
	218	Also, nonterminals may consist of any alphanumeric character or a ``-''
	219	character (in normal PEGs nonterminals can only be alphabetic).
	220
	221	For example, if we:
	222	@lisp
	223	(define-peg-string-patterns
	224	"as <- 'a'+
	225	bs <- 'b'+
	226	as-or-bs <- as/bs")
	227	(define-peg-string-patterns
	228	"as-tag <-- 'a'+
	229	bs-tag <-- 'b'+
	230	as-or-bs-tag <-- as-tag/bs-tag")
	231	@end lisp
	232	Then:
	233	@lisp
	234	(match-pattern as-or-bs "aabbcc") @result{}
	235	#<peg start: 0 end: 2 string: aabbcc tree: aa>
	236	(match-pattern as-or-bs-tag "aabbcc") @result{}
	237	#<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
	238	@end lisp
	239
	240	Note that in doing this, we have bound 6 variables at the toplevel
	241	(@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
	242	@var{as-or-bs-tag}).
	243	@end deffn
	244
	245	@deffn {Scheme Macro} define-peg-pattern name capture-type peg-sexp
	246	Defines a single nonterminal @var{name}. @var{capture-type} determines
	247	how much information is passed up the parse tree. @var{peg-sexp} is a
	248	PEG in S-expression form.
	249
	250	Possible values for capture-type:
	251
	252	@table @code
	253	@item all
	254	passes the matched text up the parse tree tagged with the name of the
	255	nonterminal.
	256	@item body
	257	passes the matched text up the parse tree.
	258	@item none
	259	passes nothing up the parse tree.
	260	@end table
	261
	262	For Example, if we:
	263	@lisp
	264	(define-peg-pattern as body (+ "a"))
	265	(define-peg-pattern bs body (+ "b"))
	266	(define-peg-pattern as-or-bs body (or as bs))
	267	(define-peg-pattern as-tag all (+ "a"))
	268	(define-peg-pattern bs-tag all (+ "b"))
	269	(define-peg-pattern as-or-bs-tag all (or as-tag bs-tag))
	270	@end lisp
	271	Then:
	272	@lisp
	273	(match-pattern as-or-bs "aabbcc") @result{}
	274	#<peg start: 0 end: 2 string: aabbcc tree: aa>
	275	(match-pattern as-or-bs-tag "aabbcc") @result{}
	276	#<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
	277	@end lisp
	278
	279	Note that in doing this, we have bound 6 variables at the toplevel
	280	(@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
	281	@var{as-or-bs-tag}).
	282	@end deffn
	283
	284	@subsubheading Compile Functions
	285	It is sometimes useful to be able to compile anonymous PEG patterns at
	286	runtime. These functions let you do that using either syntax.
	287
	288	@deffn {Scheme Procedure} peg-string-compile peg-string capture-type
	289	Compiles the PEG pattern in @var{peg-string} propagating according to
	290	@var{capture-type} (capture-type can be any of the values from
	291	@code{define-peg-pattern}).
	292	@end deffn
	293
	294
	295	@deffn {Scheme Procedure} compile-peg-pattern peg-sexp capture-type
	296	Compiles the PEG pattern in @var{peg-sexp} propagating according to
	297	@var{capture-type} (capture-type can be any of the values from
	298	@code{define-peg-pattern}).
	299	@end deffn
	300
	301	The functions return syntax objects, which can be useful if you want to
	302	use them in macros. If all you want is to define a new nonterminal, you
	303	can do the following:
	304
	305	@lisp
	306	(define exp '(+ "a"))
	307	(define as (compile (compile-peg-pattern exp 'body)))
	308	@end lisp
	309
	310	You can use this nonterminal with all of the regular PEG functions:
	311
	312	@lisp
	313	(match-pattern as "aaaaa") @result{}
	314	#<peg start: 0 end: 5 string: bbbbb tree: bbbbb>
	315	@end lisp
	316
	317	@subsubheading Parsing & Matching Functions
	318
	319	For our purposes, ``parsing'' means parsing a string into a tree
	320	starting from the first character, while ``matching'' means searching
	321	through the string for a substring. In practice, the only difference
	322	between the two functions is that @code{match-pattern} gives up if it can't
	323	find a valid substring starting at index 0 and @code{search-for-pattern} keeps
	324	looking. They are both equally capable of ``parsing'' and ``matching''
	325	given those constraints.
	326
	327	@deffn {Scheme Procedure} match-pattern nonterm string
	328	Parses @var{string} using the PEG stored in @var{nonterm}. If no match
	329	was found, @code{match-pattern} returns false. If a match was found, a PEG
	330	match record is returned.
	331
	332	The @code{capture-type} argument to @code{define-peg-pattern} allows you to
	333	choose what information to hold on to while parsing. The options are:
	334
	335	@table @code
	336	@item all
	337	tag the matched text with the nonterminal
	338	@item body
	339	just the matched text
	340	@item none
	341	nothing
	342	@end table
	343
	344	@lisp
	345	(define-peg-pattern as all (+ "a"))
	346	(match-pattern as "aabbcc") @result{}
	347	#<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
	348
	349	(define-peg-pattern as body (+ "a"))
	350	(match-pattern as "aabbcc") @result{}
	351	#<peg start: 0 end: 2 string: aabbcc tree: aa>
	352
	353	(define-peg-pattern as none (+ "a"))
	354	(match-pattern as "aabbcc") @result{}
	355	#<peg start: 0 end: 2 string: aabbcc tree: ()>
	356
	357	(define-peg-pattern bs body (+ "b"))
	358	(match-pattern bs "aabbcc") @result{}
	359	#f
	360	@end lisp
	361	@end deffn
	362
	363	@deffn {Scheme Macro} search-for-pattern nonterm-or-peg string
	364	Searches through @var{string} looking for a matching subexpression.
	365	@var{nonterm-or-peg} can either be a nonterminal or a literal PEG
	366	pattern. When a literal PEG pattern is provided, @code{search-for-pattern} works
	367	very similarly to the regular expression searches many hackers are used
	368	to. If no match was found, @code{search-for-pattern} returns false. If a match
	369	was found, a PEG match record is returned.
	370
	371	@lisp
	372	(define-peg-pattern as body (+ "a"))
	373	(search-for-pattern as "aabbcc") @result{}
	374	#<peg start: 0 end: 2 string: aabbcc tree: aa>
	375	(search-for-pattern (+ "a") "aabbcc") @result{}
	376	#<peg start: 0 end: 2 string: aabbcc tree: aa>
	377	(search-for-pattern "'a'+" "aabbcc") @result{}
	378	#<peg start: 0 end: 2 string: aabbcc tree: aa>
	379
	380	(define-peg-pattern as all (+ "a"))
	381	(search-for-pattern as "aabbcc") @result{}
	382	#<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
	383
	384	(define-peg-pattern bs body (+ "b"))
	385	(search-for-pattern bs "aabbcc") @result{}
	386	#<peg start: 2 end: 4 string: aabbcc tree: bb>
	387	(search-for-pattern (+ "b") "aabbcc") @result{}
	388	#<peg start: 2 end: 4 string: aabbcc tree: bb>
	389	(search-for-pattern "'b'+" "aabbcc") @result{}
	390	#<peg start: 2 end: 4 string: aabbcc tree: bb>
	391
	392	(define-peg-pattern zs body (+ "z"))
	393	(search-for-pattern zs "aabbcc") @result{}
	394	#f
	395	(search-for-pattern (+ "z") "aabbcc") @result{}
	396	#f
	397	(search-for-pattern "'z'+" "aabbcc") @result{}
	398	#f
	399	@end lisp
	400	@end deffn
	401
	402	@subsubheading PEG Match Records
	403	The @code{match-pattern} and @code{search-for-pattern} functions both return PEG
	404	match records. Actual information can be extracted from these with the
	405	following functions.
	406
	407	@deffn {Scheme Procedure} peg:string match-record
	408	Returns the original string that was parsed in the creation of
	409	@code{match-record}.
	410	@end deffn
	411
	412	@deffn {Scheme Procedure} peg:start match-record
	413	Returns the index of the first parsed character in the original string
	414	(from @code{peg:string}). If this is the same as @code{peg:end},
	415	nothing was parsed.
	416	@end deffn
	417
	418	@deffn {Scheme Procedure} peg:end match-record
	419	Returns one more than the index of the last parsed character in the
	420	original string (from @code{peg:string}). If this is the same as
	421	@code{peg:start}, nothing was parsed.
	422	@end deffn
	423
	424	@deffn {Scheme Procedure} peg:substring match-record
	425	Returns the substring parsed by @code{match-record}. This is equivalent to
	426	@code{(substring (peg:string match-record) (peg:start match-record) (peg:end
	427	match-record))}.
	428	@end deffn
	429
	430	@deffn {Scheme Procedure} peg:tree match-record
	431	Returns the tree parsed by @code{match-record}.
	432	@end deffn
	433
	434	@deffn {Scheme Procedure} peg-record? match-record
	435	Returns true if @code{match-record} is a PEG match record, or false
	436	otherwise.
	437	@end deffn
	438
	439	Example:
	440	@lisp
	441	(define-peg-pattern bs all (peg "'b'+"))
	442
	443	(search-for-pattern bs "aabbcc") @result{}
	444	#<peg start: 2 end: 4 string: aabbcc tree: (bs bb)>
	445
	446	(let ((pm (search-for-pattern bs "aabbcc")))
	447	`((string ,(peg:string pm))
	448	(start ,(peg:start pm))
	449	(end ,(peg:end pm))
	450	(substring ,(peg:substring pm))
	451	(tree ,(peg:tree pm))
	452	(record? ,(peg-record? pm)))) @result{}
	453	((string "aabbcc")
	454	(start 2)
	455	(end 4)
	456	(substring "bb")
	457	(tree (bs "bb"))
	458	(record? #t))
	459	@end lisp
	460
	461	@subsubheading Miscellaneous
	462
	463	@deffn {Scheme Procedure} context-flatten tst lst
	464	Takes a predicate @var{tst} and a list @var{lst}. Flattens @var{lst}
	465	until all elements are either atoms or satisfy @var{tst}. If @var{lst}
	466	itself satisfies @var{tst}, @code{(list lst)} is returned (this is a
	467	flat list whose only element satisfies @var{tst}).
	468
	469	@lisp
	470	(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
	471	(2 2 (1 1 (2 2)) 2 2 (1 1))
	472	(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
	473	((1 1 (1 1 (2 2)) (2 2 (1 1))))
	474	@end lisp
	475
	476	If you're wondering why this is here, take a look at the tutorial.
	477	@end deffn
	478
	479	@deffn {Scheme Procedure} keyword-flatten terms lst
	480	A less general form of @code{context-flatten}. Takes a list of terminal
	481	atoms @code{terms} and flattens @var{lst} until all elements are either
	482	atoms, or lists which have an atom from @code{terms} as their first
	483	element.
	484	@lisp
	485	(keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{}
	486	(c a b (a c) (b c) c (b a) c a)
	487	@end lisp
	488
	489	If you're wondering why this is here, take a look at the tutorial.
	490	@end deffn
	491
	492	@node PEG Tutorial
	493	@subsection PEG Tutorial
	494
	495	@subsubheading Parsing /etc/passwd
	496	This example will show how to parse /etc/passwd using PEGs.
	497
	498	First we define an example /etc/passwd file:
	499
	500	@lisp
	501	(define etc-passwd
	502	"root:x:0:0:root:/root:/bin/bash
	503	daemon:x:1:1:daemon:/usr/sbin:/bin/sh
	504	bin:x:2:2:bin:/bin:/bin/sh
	505	sys:x:3:3:sys:/dev:/bin/sh
	506	nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
	507	messagebus:x:103:107::/var/run/dbus:/bin/false
	508	")
	509	@end lisp
	510
	511	As a first pass at this, we might want to have all the entries in
	512	/etc/passwd in a list.
	513
	514	Doing this with string-based PEG syntax would look like this:
	515	@lisp
	516	(define-peg-string-patterns
	517	"passwd <- entry* !.
	518	entry <-- (! NL .)* NL*
	519	NL < '\n'")
	520	@end lisp
	521
	522	A @code{passwd} file is 0 or more entries (@code{entry*}) until the end
	523	of the file (@code{!.} (@code{.} is any character, so @code{!.} means
	524	``not anything'')). We want to capture the data in the nonterminal
	525	@code{passwd}, but not tag it with the name, so we use @code{<-}.
	526
	527	An entry is a series of 0 or more characters that aren't newlines
	528	(@code{(! NL .)}) followed by 0 or more newlines (@code{NL}). We want
	529	to tag all the entries with @code{entry}, so we use @code{<--}.
	530
	531	A newline is just a literal newline (@code{'\n'}). We don't want a
	532	bunch of newlines cluttering up the output, so we use @code{<} to throw
	533	away the captured data.
	534
	535	Here is the same PEG defined using S-expressions:
	536	@lisp
	537	(define-peg-pattern passwd body (and (* entry) (not-followed-by peg-any)))
	538	(define-peg-pattern entry all (and (* (and (not-followed-by NL) peg-any))
	539	(* NL)))
	540	(define-peg-pattern NL none "\n")
	541	@end lisp
	542
	543	Obviously this is much more verbose. On the other hand, it's more
	544	explicit, and thus easier to build automatically. However, there are
	545	some tricks that make S-expressions easier to use in some cases. One is
	546	the @code{ignore} keyword; the string syntax has no way to say ``throw
	547	away this text'' except breaking it out into a separate nonterminal.
	548	For instance, to throw away the newlines we had to define @code{NL}. In
	549	the S-expression syntax, we could have simply written @code{(ignore
	550	"\n")}. Also, for the cases where string syntax is really much cleaner,
	551	the @code{peg} keyword can be used to embed string syntax in
	552	S-expression syntax. For instance, we could have written:
	553
	554	@lisp
	555	(define-peg-pattern passwd body (peg "entry* !."))
	556	@end lisp
	557
	558	However we define it, parsing @code{etc-passwd} with the @code{passwd}
	559	nonterminal yields the same results:
	560
	561	@lisp
	562	(peg:tree (match-pattern passwd etc-passwd)) @result{}
	563	((entry "root:x:0:0:root:/root:/bin/bash")
	564	(entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
	565	(entry "bin:x:2:2:bin:/bin:/bin/sh")
	566	(entry "sys:x:3:3:sys:/dev:/bin/sh")
	567	(entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
	568	(entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
	569	@end lisp
	570
	571	However, here is something to be wary of:
	572
	573	@lisp
	574	(peg:tree (match-pattern passwd "one entry")) @result{}
	575	(entry "one entry")
	576	@end lisp
	577
	578	By default, the parse trees generated by PEGs are compressed as much as
	579	possible without losing information. It may not look like this is what
	580	you want at first, but uncompressed parse trees are an enormous headache
	581	(there's no easy way to predict how deep particular lists will nest,
	582	there are empty lists littered everywhere, etc. etc.). One side-effect
	583	of this, however, is that sometimes the compressor is too aggressive.
	584	No information is discarded when @code{((entry "one entry"))} is
	585	compressed to @code{(entry "one entry")}, but in this particular case it
	586	probably isn't what we want.
	587
	588	There are two functions for easily dealing with this:
	589	@code{keyword-flatten} and @code{context-flatten}. The
	590	@code{keyword-flatten} function takes a list of keywords and a list to
	591	flatten, then tries to coerce the list such that the first element of
	592	all sublists is one of the keywords. The @code{context-flatten}
	593	function is similar, but instead of a list of keywords it takes a
	594	predicate that should indicate whether a given sublist is good enough
	595	(refer to the API reference for more details).
	596
	597	What we want here is @code{keyword-flatten}.
	598	@lisp
	599	(keyword-flatten '(entry) (peg:tree (match-pattern passwd etc-passwd))) @result{}
	600	((entry "root:x:0:0:root:/root:/bin/bash")
	601	(entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
	602	(entry "bin:x:2:2:bin:/bin:/bin/sh")
	603	(entry "sys:x:3:3:sys:/dev:/bin/sh")
	604	(entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
	605	(entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
	606	(keyword-flatten '(entry) (peg:tree (match-pattern passwd "one entry"))) @result{}
	607	((entry "one entry"))
	608	@end lisp
	609
	610	Of course, this is a somewhat contrived example. In practice we would
	611	probably just tag the @code{passwd} nonterminal to remove the ambiguity
	612	(using either the @code{all} keyword for S-expressions or the @code{<--}
	613	symbol for strings)..
	614
	615	@lisp
	616	(define-peg-pattern tag-passwd all (peg "entry* !."))
	617	(peg:tree (match-pattern tag-passwd etc-passwd)) @result{}
	618	(tag-passwd
	619	(entry "root:x:0:0:root:/root:/bin/bash")
	620	(entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
	621	(entry "bin:x:2:2:bin:/bin:/bin/sh")
	622	(entry "sys:x:3:3:sys:/dev:/bin/sh")
	623	(entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
	624	(entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
	625	(peg:tree (match-pattern tag-passwd "one entry"))
	626	(tag-passwd
	627	(entry "one entry"))
	628	@end lisp
	629
	630	If you're ever uncertain about the potential results of parsing
	631	something, remember the two absolute rules:
	632	@enumerate
	633	@item
	634	No parsing information will ever be discarded.
	635	@item
	636	There will never be any lists with fewer than 2 elements.
	637	@end enumerate
	638
	639	For the purposes of (1), "parsing information" means things tagged with
	640	the @code{any} keyword or the @code{<--} symbol. Plain strings will be
	641	concatenated.
	642
	643	Let's extend this example a bit more and actually pull some useful
	644	information out of the passwd file:
	645
	646	@lisp
	647	(define-peg-string-patterns
	648	"passwd <-- entry* !.
	649	entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL*
	650	login <-- text
	651	pass <-- text
	652	uid <-- [0-9]*
	653	gid <-- [0-9]*
	654	nameORcomment <-- text
	655	homedir <-- path
	656	shell <-- path
	657	path <-- (SLASH pathELEMENT)*
	658	pathELEMENT <-- (!NL !C !'/' .)*
	659	text <- (!NL !C .)*
	660	C < ':'
	661	NL < '\n'
	662	SLASH < '/'")
	663	@end lisp
	664
	665	This produces rather pretty parse trees:
	666	@lisp
	667	(passwd
	668	(entry (login "root")
	669	(pass "x")
	670	(uid "0")
	671	(gid "0")
	672	(nameORcomment "root")
	673	(homedir (path (pathELEMENT "root")))
	674	(shell (path (pathELEMENT "bin") (pathELEMENT "bash"))))
	675	(entry (login "daemon")
	676	(pass "x")
	677	(uid "1")
	678	(gid "1")
	679	(nameORcomment "daemon")
	680	(homedir
	681	(path (pathELEMENT "usr") (pathELEMENT "sbin")))
	682	(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
	683	(entry (login "bin")
	684	(pass "x")
	685	(uid "2")
	686	(gid "2")
	687	(nameORcomment "bin")
	688	(homedir (path (pathELEMENT "bin")))
	689	(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
	690	(entry (login "sys")
	691	(pass "x")
	692	(uid "3")
	693	(gid "3")
	694	(nameORcomment "sys")
	695	(homedir (path (pathELEMENT "dev")))
	696	(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
	697	(entry (login "nobody")
	698	(pass "x")
	699	(uid "65534")
	700	(gid "65534")
	701	(nameORcomment "nobody")
	702	(homedir (path (pathELEMENT "nonexistent")))
	703	(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
	704	(entry (login "messagebus")
	705	(pass "x")
	706	(uid "103")
	707	(gid "107")
	708	nameORcomment
	709	(homedir
	710	(path (pathELEMENT "var")
	711	(pathELEMENT "run")
	712	(pathELEMENT "dbus")))
	713	(shell (path (pathELEMENT "bin") (pathELEMENT "false")))))
	714	@end lisp
	715
	716	Notice that when there's no entry in a field (e.g. @code{nameORcomment}
	717	for messagebus) the symbol is inserted. This is the ``don't throw away
	718	any information'' rule---we succesfully matched a @code{nameORcomment}
	719	of 0 characters (since we used @code{*} when defining it). This is
	720	usually what you want, because it allows you to e.g. use @code{list-ref}
	721	to pull out elements (since they all have known offsets).
	722
	723	If you'd prefer not to have symbols for empty matches, you can replace
	724	the @code{*} with a @code{+} and add a @code{?} after the
	725	@code{nameORcomment} in @code{entry}. Then it will try to parse 1 or
	726	more characters, fail (inserting nothing into the parse tree), but
	727	continue because it didn't have to match the nameORcomment to continue.
	728
	729
	730	@subsubheading Embedding Arithmetic Expressions
	731
	732	We can parse simple mathematical expressions with the following PEG:
	733
	734	@lisp
	735	(define-peg-string-patterns
	736	"expr <- sum
	737	sum <-- (product ('+' / '-') sum) / product
	738	product <-- (value ('*' / '/') product) / value
	739	value <-- number / '(' expr ')'
	740	number <-- [0-9]+")
	741	@end lisp
	742
	743	Then:
	744	@lisp
	745	(peg:tree (match-pattern expr "1+1/2*3+(1+1)/2")) @result{}
	746	(sum (product (value (number "1")))
	747	"+"
	748	(sum (product
	749	(value (number "1"))
	750	"/"
	751	(product
	752	(value (number "2"))
	753	"*"
	754	(product (value (number "3")))))
	755	"+"
	756	(sum (product
	757	(value "("
	758	(sum (product (value (number "1")))
	759	"+"
	760	(sum (product (value (number "1")))))
	761	")")
	762	"/"
	763	(product (value (number "2")))))))
	764	@end lisp
	765
	766	There is very little wasted effort in this PEG. The @code{number}
	767	nonterminal has to be tagged because otherwise the numbers might run
	768	together with the arithmetic expressions during the string concatenation
	769	stage of parse-tree compression (the parser will see ``1'' followed by
	770	``/'' and decide to call it ``1/''). When in doubt, tag.
	771
	772	It is very easy to turn these parse trees into lisp expressions:
	773
	774	@lisp
	775	(define (parse-sum sum left . rest)
	776	(if (null? rest)
	777	(apply parse-product left)
	778	(list (string->symbol (car rest))
	779	(apply parse-product left)
	780	(apply parse-sum (cadr rest)))))
	781
	782	(define (parse-product product left . rest)
	783	(if (null? rest)
	784	(apply parse-value left)
	785	(list (string->symbol (car rest))
	786	(apply parse-value left)
	787	(apply parse-product (cadr rest)))))
	788
	789	(define (parse-value value first . rest)
	790	(if (null? rest)
	791	(string->number (cadr first))
	792	(apply parse-sum (car rest))))
	793
	794	(define parse-expr parse-sum)
	795	@end lisp
	796
	797	(Notice all these functions look very similar; for a more complicated
	798	PEG, it would be worth abstracting.)
	799
	800	Then:
	801	@lisp
	802	(apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
	803	(+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2)))
	804	@end lisp
	805
	806	But wait! The associativity is wrong! Where it says @code{(/ 1 (* 2
	807	3))}, it should say @code{(* (/ 1 2) 3)}.
	808
	809	It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-')
	810	sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) /
	811	product"}, but this is a Bad Idea. PEGs don't support left recursion.
	812	To see why, imagine what the parser will do here. When it tries to
	813	parse @code{sum}, it first has to try and parse @code{sum}. But to do
	814	that, it first has to try and parse @code{sum}. This will continue
	815	until the stack gets blown off.
	816
	817	So how does one parse left-associative binary operators with PEGs?
	818	Honestly, this is one of their major shortcomings. There's no
	819	general-purpose way of doing this, but here the repetition operators are
	820	a good choice:
	821
	822	@lisp
	823	(use-modules (srfi srfi-1))
	824
	825	(define-peg-string-patterns
	826	"expr <- sum
	827	sum <-- (product ('+' / '-'))* product
	828	product <-- (value ('' / '/')) value
	829	value <-- number / '(' expr ')'
	830	number <-- [0-9]+")
	831
	832	;; take a deep breath...
	833	(define (make-left-parser next-func)
	834	(lambda (sum first . rest) ;; general form, comments below assume
	835	;; that we're dealing with a sum expression
	836	(if (null? rest) ;; form (sum (product ...))
	837	(apply next-func first)
	838	(if (string? (cadr first));; form (sum ((product ...) "+") (product ...))
	839	(list (string->symbol (cadr first))
	840	(apply next-func (car first))
	841	(apply next-func (car rest)))
	842	;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
	843	(car
	844	(reduce ;; walk through the list and build a left-associative tree
	845	(lambda (l r)
	846	(list (list (cadr r) (car r) (apply next-func (car l)))
	847	(string->symbol (cadr l))))
	848	'ignore
	849	(append ;; make a list of all the products
	850	;; the first one should be pre-parsed
	851	(list (list (apply next-func (caar first))
	852	(string->symbol (cadar first))))
	853	(cdr first)
	854	;; the last one has to be added in
	855	(list (append rest '("done"))))))))))
	856
	857	(define (parse-value value first . rest)
	858	(if (null? rest)
	859	(string->number (cadr first))
	860	(apply parse-sum (car rest))))
	861	(define parse-product (make-left-parser parse-value))
	862	(define parse-sum (make-left-parser parse-product))
	863	(define parse-expr parse-sum)
	864	@end lisp
	865
	866	Then:
	867	@lisp
	868	(apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
	869	(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))
	870	@end lisp
	871
	872	As you can see, this is much uglier (it could be made prettier by using
	873	@code{context-flatten}, but the way it's written above makes it clear
	874	how we deal with the three ways the zero-or-more @code{*} expression can
	875	parse). Fortunately, most of the time we can get away with only using
	876	right-associativity.
	877
	878	@subsubheading Simplified Functions
	879
	880	For a more tantalizing example, consider the following grammar that
	881	parses (highly) simplified C functions:
	882
	883	@lisp
	884	(define-peg-string-patterns
	885	"cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB
	886	ctype <-- cidentifier
	887	cname <-- cidentifier
	888	cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP
	889	carg <-- cSP ctype cSP cname
	890	cbody <-- cstatement *
	891	cidentifier <- [a-zA-z][a-zA-Z0-9_]*
	892	cstatement <-- (!';'.)*cSC cSP
	893	cSC < ';'
	894	cCOMMA < ','
	895	cLP < '('
	896	cRP < ')'
	897	cLB < '@{'
	898	cRB < '@}'
	899	cSP < [ \t\n]*")
	900	@end lisp
	901
	902	Then:
	903	@lisp
	904	(match-pattern cfunc "int square(int a) @{ return a*a;@}") @result{}
	905	(32
	906	(cfunc (ctype "int")
	907	(cname "square")
	908	(cargs (carg (ctype "int") (cname "a")))
	909	(cbody (cstatement "return a*a"))))
	910	@end lisp
	911
	912	And:
	913	@lisp
	914	(match-pattern cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{}
	915	(52
	916	(cfunc (ctype "int")
	917	(cname "mod")
	918	(cargs (carg (ctype "int") (cname "a"))
	919	(carg (ctype "int") (cname "b")))
	920	(cbody (cstatement "int c = a/b")
	921	(cstatement "return a- b*c"))))
	922	@end lisp
	923
	924	By wrapping all the @code{carg} nonterminals in a @code{cargs}
	925	nonterminal, we were able to remove any ambiguity in the parsing
	926	structure and avoid having to call @code{context-flatten} on the output
	927	of @code{match-pattern}. We used the same trick with the @code{cstatement}
	928	nonterminals, wrapping them in a @code{cbody} nonterminal.
	929
	930	The whitespace nonterminal @code{cSP} used here is a (very) useful
	931	instantiation of a common pattern for matching syntactically irrelevant
	932	information. Since it's tagged with @code{<} and ends with @code{*} it
	933	won't clutter up the parse trees (all the empty lists will be discarded
	934	during the compression step) and it will never cause parsing to fail.
	935
	936	@node PEG Internals
	937	@subsection PEG Internals
	938
	939	A PEG parser takes a string as input and attempts to parse it as a given
	940	nonterminal. The key idea of the PEG implementation is that every
	941	nonterminal is just a function that takes a string as an argument and
	942	attempts to parse that string as its nonterminal. The functions always
	943	start from the beginning, but a parse is considered successful if there
	944	is material left over at the end.
	945
	946	This makes it easy to model different PEG parsing operations. For
	947	instance, consider the PEG grammar @code{"ab"}, which could also be
	948	written @code{(and "a" "b")}. It matches the string ``ab''. Here's how
	949	that might be implemented in the PEG style:
	950
	951	@lisp
	952	(define (match-and-a-b str)
	953	(match-a str)
	954	(match-b str))
	955	@end lisp
	956
	957	As you can see, the use of functions provides an easy way to model
	958	sequencing. In a similar way, one could model @code{(or a b)} with
	959	something like the following:
	960
	961	@lisp
	962	(define (match-or-a-b str)
	963	(or (match-a str) (match-b str)))
	964	@end lisp
	965
	966	Here the semantics of a PEG @code{or} expression map naturally onto
	967	Scheme's @code{or} operator. This function will attempt to run
	968	@code{(match-a str)}, and return its result if it succeeds. Otherwise it
	969	will run @code{(match-b str)}.
	970
	971	Of course, the code above wouldn't quite work. We need some way for the
	972	parsing functions to communicate. The actual interface used is below.
	973
	974	@subsubheading Parsing Function Interface
	975
	976	A parsing function takes three arguments - a string, the length of that
	977	string, and the position in that string it should start parsing at. In
	978	effect, the parsing functions pass around substrings in pieces - the
	979	first argument is a buffer of characters, and the second two give a
	980	range within that buffer that the parsing function should look at.
	981
	982	Parsing functions return either #f, if they failed to match their
	983	nonterminal, or a list whose first element must be an integer
	984	representing the final position in the string they matched and whose cdr
	985	can be any other data the function wishes to return, or '() if it
	986	doesn't have any more data.
	987
	988	The one caveat is that if the extra data it returns is a list, any
	989	adjacent strings in that list will be appended by @code{match-pattern}. For
	990	instance, if a parsing function returns @code{(13 ("a" "b" "c"))},
	991	@code{match-pattern} will take @code{(13 ("abc"))} as its value.
	992
	993	For example, here is a function to match ``ab'' using the actual
	994	interface.
	995
	996	@lisp
	997	(define (match-a-b str len pos)
	998	(and (<= (+ pos 2) len)
	999	(string= str "ab" pos (+ pos 2))
	1000	(list (+ pos 2) '()))) ; we return no extra information
	1001	@end lisp
	1002
	1003	The above function can be used to match a string by running
	1004	@code{(match-pattern match-a-b "ab")}.
	1005
	1006	@subsubheading Code Generators and Extensible Syntax
	1007
	1008	PEG expressions, such as those in a @code{define-peg-pattern} form, are
	1009	interpreted internally in two steps.
	1010
	1011	First, any string PEG is expanded into an s-expression PEG by the code
	1012	in the @code{(ice-9 peg string-peg)} module.
	1013
	1014	Then, then s-expression PEG that results is compiled into a parsing
	1015	function by the @code{(ice-9 peg codegen)} module. In particular, the
	1016	function @code{compile-peg-pattern} is called on the s-expression. It then
	1017	decides what to do based on the form it is passed.
	1018
	1019	The PEG syntax can be expanded by providing @code{compile-peg-pattern} more
	1020	options for what to do with its forms. The extended syntax will be
	1021	associated with a symbol, for instance @code{my-parsing-form}, and will
	1022	be called on all PEG expressions of the form
	1023	@lisp
	1024	(my-parsing-form ...)
	1025	@end lisp
	1026
	1027	The parsing function should take two arguments. The first will be a
	1028	syntax object containing a list with all of the arguments to the form
	1029	(but not the form's name), and the second will be the
	1030	@code{capture-type} argument that is passed to @code{define-peg-pattern}.
	1031
	1032	New functions can be registered by calling @code{(add-peg-compiler!
	1033	symbol function)}, where @code{symbol} is the symbol that will indicate
	1034	a form of this type and @code{function} is the code generating function
	1035	described above. The function @code{add-peg-compiler!} is exported from
	1036	the @code{(ice-9 peg codegen)} module.