Import wisent manual from CEDET trunk
[bpt/emacs.git] / doc / misc / wisent.texi
CommitLineData
9e7abd17
EL
1\input texinfo @c -*-texinfo-*-
2@c %**start of header
3@setfilename wisent.info
4@set TITLE Wisent Parser Development
5@set AUTHOR Eric M. Ludlam, David Ponce, and Richard Y. Kim
6@settitle @value{TITLE}
7
8@c *************************************************************************
9@c @ Header
10@c *************************************************************************
11
12@c Merge all indexes into a single index for now.
13@c We can always separate them later into two or more as needed.
14@syncodeindex vr cp
15@syncodeindex fn cp
16@syncodeindex ky cp
17@syncodeindex pg cp
18@syncodeindex tp cp
19
20@c @footnotestyle separate
21@c @paragraphindent 2
22@c @@smallbook
23@c %**end of header
24
25@copying
26This manual documents the Wisent parser generator.
27
28Copyright @copyright{} 2001, 2002, 2003, 2004, 2007 David Ponce
29
30Some texts are borrowed or adapted from the manual of Bison version
311.35. The text in section entitled ``Understanding the automaton'' is
32adapted from the section ``Understanding Your Parser'' in the manual
33of Bison version 1.49.
34
35Copyright @copyright{} 1988, 1989, 1990, 1991, 1992, 1993, 1995, 1998,
361999, 2000, 2001, 2002, 2003, 2004 Free Software Foundation, Inc.
37
38@quotation
39Permission is granted to copy, distribute and/or modify this document
40under the terms of the GNU Free Documentation License, Version 1.1 or
41any later version published by the Free Software Foundation; with the
42Invariant Sections being list their titles, with the Front-Cover Texts
43being list, and with the Back-Cover Texts being list. A copy of the
44license is included in the section entitled ``GNU Free Documentation
45License''.
46@end quotation
47@end copying
48
49@ifinfo
50@dircategory Emacs
51@direntry
52* Semantic Wisent parser development: (wisent).
53@end direntry
54@end ifinfo
55
56@iftex
57@finalout
58@end iftex
59
60@c @setchapternewpage odd
61@c @setchapternewpage off
62
63@ifinfo
64This file documents Application Development with Semantic.
65@emph{Infrastructure for parser based text analysis in Emacs}
66
67Copyright @copyright{} 2001, 2002, 2003, 2004 @value{AUTHOR}
68@end ifinfo
69
70@titlepage
71@sp 10
72@title @value{TITLE}
73@author by @value{AUTHOR}
74@vskip 0pt plus 1 fill
75Copyright @copyright{} 2001, 2002, 2003, 2004 @value{AUTHOR}
76@page
77@vskip 0pt plus 1 fill
78@insertcopying
79@end titlepage
80@page
81
82@c MACRO inclusion
83@include semanticheader.texi
84@paragraphindent none
85
86
87@c *************************************************************************
88@c @ Document
89@c *************************************************************************
90@contents
91
92@node top
93@top @value{TITLE}
94
95Wisent (the European Bison ;-) is an Emacs Lisp implementation of the
96GNU Compiler Compiler Bison.
97
98This manual describes how to use Wisent to develop grammars for
99programming languages, and how to use grammars to parse language
100source in Emacs buffers.
101
102It also describes how Wisent is used with the @semantic{} tool set
103described in the @ref{Top, Semantic Manual, Semantic Manual, semantic}.
104
105@menu
106* Wisent Overview::
107* Wisent Grammar::
108* Wisent Parsing::
109* Wisent Semantic::
110* GNU Free Documentation License::
111* Index::
112@end menu
113
114@node Wisent Overview
115@chapter Wisent Overview
116
117@dfn{Wisent} (the European Bison) is an implementation in Emacs Lisp
118of the GNU Compiler Compiler Bison. Its code is a port of the C code
119of GNU Bison 1.28 & 1.31.
120
121For more details on the basic concepts for understanding Wisent, it is
122worthwhile to read the @ref{Top, Bison Manual, bison}.
123@ifhtml
124@uref{http://www.gnu.org/manual/bison/html_node/index.html}.
125@end ifhtml
126
127Wisent can generate compilers compatible with the @semantic{} tool set.
128See the @ref{Top, Semantic Manual, , semantic}.
129
130It benefits from these Bison features:
131
132@itemize @bullet
133@item
134It uses a fast but not so space-efficient encoding for the parse
135tables, described in Corbett's PhD thesis from Berkeley:
136@quotation
137@cite{Static Semantics in Compiler Error Recovery}@*
138June 1985, Report No. UCB/CSD 85/251.
139@end quotation
140
141@item
142For generating the lookahead sets, Wisent uses the well-known
143technique of F. DeRemer and A. Pennello they described in:
144@quotation
145@cite{Efficient Construction of LALR(1) Lookahead Sets}@*
146October 1982, ACM TOPLS Vol 4 No 4.
147@end quotation
148
149@item
150Wisent resolves shift/reduce conflicts using operator precedence and
151associativity.
152
153@item
154Parser error recovery is accomplished using rules which match the
155special token @code{error}.
156@end itemize
157
158Nevertheless there are some fundamental differences between Bison and
159Wisent.
160
161@itemize
162@item
163Wisent is intended to be used in Emacs. It reads and produces Emacs
164Lisp data structures. All the additional code used in grammars is
165Emacs Lisp code.
166
167@item
168Contrary to Bison, Wisent does not generate a parser which combines
169Emacs Lisp code and grammar constructs. They exist separately.
170Wisent reads the grammar from a Lisp data structure and then generates
171grammar constructs as tables. Afterward, the derived tables can be
172included and byte-compiled in separate Emacs Lisp files, and be used
173at a later time by the Wisent's parser engine.
174
175@item
176Wisent allows multiple start nonterminals and allows a call to the
177parsing function to be made for a particular start nonterminal. For
178example, this is particularly useful to parse a region of an Emacs
179buffer. @semantic{} heavily depends on the availability of this feature.
180@end itemize
181
182@node Wisent Grammar
183@chapter Wisent Grammar
184
185@cindex context-free grammar
186@cindex rule
187In order for Wisent to parse a language, it must be described by a
188@dfn{context-free grammar}. That is a grammar specified as rules that
189can be applied regardless of context. For more information, see
190@ref{Language and Grammar, , , bison}, in the Bison manual.
191
192@cindex terminal
193@cindex nonterminal
194The formal grammar is formulated using @dfn{terminal} and
195@dfn{nonterminal} items. Terminals can be Emacs Lisp symbols or
196characters, and nonterminals are symbols only.
197
198@cindex token
199Terminals (also known as @dfn{tokens}) represent the lexical
200elements of the language like numbers, strings, etc..
201
202For example @samp{PLUS} can represent the operator @samp{+}.
203
204Nonterminal symbols are described by rules:
205
206@example
207@group
208RESULT @equiv{} COMPONENTS@dots{}
209@end group
210@end example
211
212@samp{RESULT} is a nonterminal that this rule describes and
213@samp{COMPONENTS} are various terminals and nonterminals that are put
214together by this rule.
215
216For example, this rule:
217
218@example
219@group
220exp @equiv{} exp PLUS exp
221@end group
222@end example
223
224Says that two groupings of type @samp{exp}, with a @samp{PLUS} token
225in between, can be combined into a larger grouping of type @samp{exp}.
226
227@menu
228* Grammar format::
229* Example::
230* Compiling a grammar::
231* Conflicts::
232@end menu
233
234@node Grammar format, Example, Wisent Grammar, Wisent Grammar
235@comment node-name, next, previous, up
236@section Grammar format
237
238@cindex grammar format
239To be acceptable by Wisent a context-free grammar must respect a
240particular format. That is, must be represented as an Emacs Lisp list
241of the form:
242
243@code{(@var{terminals} @var{assocs} . @var{non-terminals})}
244
245@table @var
246@item terminals
247Is the list of terminal symbols used in the grammar.
248
249@cindex associativity
250@item assocs
251Specify the associativity of @var{terminals}. It is @code{nil} when
252there is no associativity defined, or an alist of
253@w{@code{(@var{assoc-type} . @var{assoc-value})}} elements.
254
255@var{assoc-type} must be one of the @code{default-prec},
256@code{nonassoc}, @code{left} or @code{right} symbols. When
257@var{assoc-type} is @code{default-prec}, @var{assoc-value} must be
258@code{nil} or @code{t} (the default). Otherwise it is a list of
259tokens which must have been previously declared in @var{terminals}.
260
261For details, see @ref{Contextual Precedence, , , bison}, in the
262Bison manual.
263
264@item non-terminals
265Is the list of nonterminal definitions. Each definition has the form:
266
267@code{(@var{nonterm} . @var{rules})}
268
269Where @var{nonterm} is the nonterminal symbol defined and
270@var{rules} the list of rules that describe this nonterminal. Each
271rule is a list:
272
273@code{(@var{components} [@var{precedence}] [@var{action}])}
274
275Where:
276
277@table @var
278@item components
279Is a list of various terminals and nonterminals that are put together
280by this rule.
281
282For example,
283
284@example
285@group
286(exp ((exp ?+ exp)) ;; exp: exp '+' exp
287 ) ;; ;
288@end group
289@end example
290
291Says that two groupings of type @samp{exp}, with a @samp{+} token in
292between, can be combined into a larger grouping of type @samp{exp}.
293
294@cindex grammar coding conventions
295By convention, a nonterminal symbol should be in lower case, such as
296@samp{exp}, @samp{stmt} or @samp{declaration}. Terminal symbols
297should be upper case to distinguish them from nonterminals: for
298example, @samp{INTEGER}, @samp{IDENTIFIER}, @samp{IF} or
299@samp{RETURN}. A terminal symbol that represents a particular keyword
300in the language is conventionally the same as that keyword converted
301to upper case. The terminal symbol @code{error} is reserved for error
302recovery.
303
304@cindex middle-rule actions
305Scattered among the components can be @dfn{middle-rule} actions.
306Usually only @var{action} is provided (@pxref{action}).
307
308If @var{components} in a rule is @code{nil}, it means that the rule
309can match the empty string. For example, here is how to define a
310comma-separated sequence of zero or more @samp{exp} groupings:
311
312@example
313@group
314(expseq (nil) ;; expseq: ;; empty
315 ((expseq1)) ;; | expseq1
316 ) ;; ;
317
318(expseq1 ((exp)) ;; expseq1: exp
319 ((expseq1 ?, exp)) ;; | expseq1 ',' exp
320 ) ;; ;
321@end group
322@end example
323
324@cindex precedence level
325@item precedence
326Assign the rule the precedence of the given terminal item, overriding
327the precedence that would be deduced for it, that is the one of the
328last terminal in it. Notice that only terminals declared in
329@var{assocs} have a precedence level. The altered rule precedence
330then affects how conflicts involving that rule are resolved.
331
332@var{precedence} is an optional vector of one terminal item.
333
334Here is how @var{precedence} solves the problem of unary minus.
335First, declare a precedence for a fictitious terminal symbol named
336@code{UMINUS}. There are no tokens of this type, but the symbol
337serves to stand for its precedence:
338
339@example
340@dots{}
341((default-prec t) ;; This is the default
342 (left '+' '-')
343 (left '*')
344 (left UMINUS))
345@end example
346
347Now the precedence of @code{UMINUS} can be used in specific rules:
348
349@example
350@group
351(exp @dots{} ;; exp: @dots{}
352 ((exp ?- exp)) ;; | exp '-' exp
353 @dots{} ;; @dots{}
354 ((?- exp) [UMINUS]) ;; | '-' exp %prec UMINUS
355 @dots{} ;; @dots{}
356 ) ;; ;
357@end group
358@end example
359
360If you forget to append @code{[UMINUS]} to the rule for unary minus,
361Wisent silently assumes that minus has its usual precedence. This
362kind of problem can be tricky to debug, since one typically discovers
363the mistake only by testing the code.
364
365Using @code{(default-prec nil)} declaration makes it easier to
366discover this kind of problem systematically. It causes rules that
367lack a @var{precedence} modifier to have no precedence, even if the
368last terminal symbol mentioned in their components has a declared
369precedence.
370
371If @code{(default-prec nil)} is in effect, you must specify
372@var{precedence} for all rules that participate in precedence conflict
373resolution. Then you will see any shift/reduce conflict until you
374tell Wisent how to resolve it, either by changing your grammar or by
375adding an explicit precedence. This will probably add declarations to
376the grammar, but it helps to protect against incorrect rule
377precedences.
378
379The effect of @code{(default-prec nil)} can be reversed by giving
380@code{(default-prec t)}, which is the default.
381
382For more details, see @ref{Contextual Precedence, , , bison}, in the
383Bison manual.
384
385It is important to understand that @var{assocs} declarations defines
386associativity but also assign a precedence level to terminals. All
387terminals declared in the same @code{left}, @code{right} or
388@code{nonassoc} association get the same precedence level. The
389precedence level is increased at each new association.
390
391On the other hand, @var{precedence} explicitly assign the precedence
392level of the given terminal to a rule.
393
394@cindex semantic actions
395@item @anchor{action}action
396An action is an optional Emacs Lisp function call, like this:
397
398@code{(identity $1)}
399
400The result of an action determines the semantic value of a rule.
401
402From an implementation standpoint, the function call will be embedded
403in a lambda expression, and several useful local variables will be
404defined:
405
406@table @code
407@vindex $N
408@item $@var{n}
409Where @var{n} is a positive integer. Like in Bison, the value of
410@code{$@var{n}} is the semantic value of the @var{n}th element of
411@var{components}, starting from 1. It can be of any Lisp data
412type.
413
414@vindex $region@var{n}
415@item $regionN
416Where @var{n} is a positive integer. For each @code{$@var{n}}
417variable defined there is a corresponding @code{$region@var{n}}
418variable. Its value is a pair @code{(@var{start-pos} .
419@var{end-pos})} that represent the start and end positions (in the
420lexical input stream) of the @code{$@var{n}} value. It can be
421@code{nil} when the component positions are not available, like for an
422empty string component for example.
423
424@vindex $region
425@item $region
426Its value is the leftmost and rightmost positions of input data
427matched by all @var{components} in the rule. This is a pair
428@code{(@var{leftmost-pos} . @var{rightmost-pos})}. It can be
429@code{nil} when components positions are not available.
430
431@vindex $nterm
432@item $nterm
433This variable is initialized with the nonterminal symbol
434(@var{nonterm}) the rule belongs to. It could be useful to improve
435error reporting or debugging. It is also used to automatically
436provide incremental re-parse entry points for @semantic{} tags
437(@pxref{Wisent Semantic}).
438
439@vindex $action
440@item $action
441The value of @code{$action} is the symbolic name of the current
442semantic action (@pxref{Debugging actions}).
443@end table
444
445When an action is not specified a default value is supplied, it is
446@code{(identity $1)}. This means that the default semantic value of a
447rule is the value of its first component. Excepted for a rule
448matching the empty string, for which the default action is to return
449@code{nil}.
450@end table
451@end table
452
453@node Example, Compiling a grammar, Grammar format, Wisent Grammar
454@comment node-name, next, previous, up
455@section Example
456
457@cindex grammar example
458Here is an example to parse simple infix arithmetic expressions. See
459@ref{Infix Calc, , , bison}, in the Bison manual for details.
460
461@lisp
462@group
463'(
464 ;; Terminals
465 (NUM)
466
467 ;; Terminal associativity & precedence
468 ((nonassoc ?=)
469 (left ?- ?+)
470 (left ?* ?/)
471 (left NEG)
472 (right ?^))
473
474 ;; Rules
475 (input
476 ((line))
477 ((input line)
478 (format "%s %s" $1 $2))
479 )
480
481 (line
482 ((?;)
483 (progn ";"))
484 ((exp ?;)
485 (format "%s;" $1))
486 ((error ?;)
487 (progn "Error;")))
488 )
489
490 (exp
491 ((NUM)
492 (string-to-number $1))
493 ((exp ?= exp)
494 (= $1 $3))
495 ((exp ?+ exp)
496 (+ $1 $3))
497 ((exp ?- exp)
498 (- $1 $3))
499 ((exp ?* exp)
500 (* $1 $3))
501 ((exp ?/ exp)
502 (/ $1 $3))
503 ((?- exp) [NEG]
504 (- $2))
505 ((exp ?^ exp)
506 (expt $1 $3))
507 ((?\( exp ?\))
508 (progn $2))
509 )
510 )
511@end group
512@end lisp
513
514In the bison-like @dfn{WY} format (@pxref{Wisent Semantic}) the
515grammar looks like this:
516
517@example
518@group
519%token <number> NUM
520
521%nonassoc '=' ;; comparison
522%left '-' '+'
523%left '*' '/'
524%left NEG ;; negation--unary minus
525%right '^' ;; exponentiation
526
527%%
528
529input:
530 line
531 | input line
532 (format "%s %s" $1 $2)
533 ;
534
535line:
536 ';'
537 @{";"@}
538 | exp ';'
539 (format "%s;" $1)
540 | error ';'
541 @{"Error;"@}
542 ;
543
544exp:
545 NUM
546 (string-to-number $1)
547 | exp '=' exp
548 (= $1 $3)
549 | exp '+' exp
550 (+ $1 $3)
551 | exp '-' exp
552 (- $1 $3)
553 | exp '*' exp
554 (* $1 $3)
555 | exp '/' exp
556 (/ $1 $3)
557 | '-' exp %prec NEG
558 (- $2)
559 | exp '^' exp
560 (expt $1 $3)
561 | '(' exp ')'
562 @{$2@}
563 ;
564
565%%
566@end group
567@end example
568
569@node Compiling a grammar, Conflicts, Example, Wisent Grammar
570@comment node-name, next, previous, up
571@section Compiling a grammar
572
573@cindex automaton
574After providing a context-free grammar in a suitable format, it must
575be translated into a set of tables (an @dfn{automaton}) that will be
576used to derive the parser. Like Bison, Wisent translates grammars that
577must be @dfn{LALR(1)}.
578
579@cindex LALR(1) grammar
580@cindex look-ahead token
581A grammar is @acronym{LALR(1)} if it is possible to tell how to parse
582any portion of an input string with just a single token of look-ahead:
583the @dfn{look-ahead token}. See @ref{Language and Grammar, , ,
584bison}, in the Bison manual for more information.
585
586@cindex grammar compilation
587Grammar translation (compilation) is achieved by the function:
588
589@cindex compiling a grammar
590@vindex wisent-single-start-flag
591@findex wisent-compile-grammar
592@defun wisent-compile-grammar grammar &optional start-list
593Compile @var{grammar} and return an @acronym{LALR(1)} automaton.
594
595Optional argument @var{start-list} is a list of start symbols
596(nonterminals). If @code{nil} the first nonterminal defined in the
597grammar is the default start symbol. If @var{start-list} contains
598only one element, it defines the start symbol. If @var{start-list}
599contains more than one element, all are defined as potential start
600symbols, unless @code{wisent-single-start-flag} is non-@code{nil}. In
601that case the first element of @var{start-list} defines the start
602symbol and others are ignored.
603
604The @acronym{LALR(1)} automaton is a vector of the form:
605
606@code{[@var{actions gotos starts functions}]}
607
608@table @var
609@item actions
610A state/token matrix telling the parser what to do at every state
611based on the current look-ahead token. That is shift, reduce, accept
612or error. See also @ref{Wisent Parsing}.
613
614@item gotos
615A state/nonterminal matrix telling the parser the next state to go to
616after reducing with each rule.
617
618@item starts
619An alist which maps the allowed start symbols (nonterminals) to
620lexical tokens that will be first shifted into the parser stack.
621
622@item functions
623An obarray of semantic action symbols. A semantic action is actually
624an Emacs Lisp function (lambda expression).
625@end table
626@end defun
627
628@node Conflicts, , Compiling a grammar, Wisent Grammar
629@comment node-name, next, previous, up
630@section Conflicts
631
632Normally, a grammar should produce an automaton where at each state
633the parser has only one action to do (@pxref{Wisent Parsing}).
634
635@cindex ambiguous grammar
636In certain cases, a grammar can produce an automaton where, at some
637states, there are more than one action possible. Such a grammar is
638@dfn{ambiguous}, and generates @dfn{conflicts}.
639
640@cindex deterministic automaton
641The parser can't be driven by an automaton which isn't completely
642@dfn{deterministic}, that is which contains conflicts. It is
643necessary to resolve the conflicts to eliminate them. Wisent resolves
644conflicts like Bison does.
645
646@cindex grammar conflicts
647@cindex conflicts resolution
648There are two sorts of conflicts:
649
650@table @dfn
651@cindex shift/reduce conflicts
652@item shift/reduce conflicts
653When either a shift or a reduction would be valid at the same state.
654
655Such conflicts are resolved by choosing to shift, unless otherwise
656directed by operator precedence declarations.
657See @ref{Shift/Reduce , , , bison}, in the Bison manual for more
658information.
659
660@cindex reduce/reduce conflicts
661@item reduce/reduce conflicts
662That occurs if there are two or more rules that apply to the same
663sequence of input. This usually indicates a serious error in the
664grammar.
665
666Such conflicts are resolved by choosing to use the rule that appears
667first in the grammar, but it is very risky to rely on this. Every
668reduce/reduce conflict must be studied and usually eliminated. See
669@ref{Reduce/Reduce , , , bison}, in the Bison manual for more
670information.
671@end table
672
673@menu
674* Grammar Debugging::
675* Understanding the automaton::
676@end menu
677
678@node Grammar Debugging
679@subsection Grammar debugging
680
681@cindex grammar debugging
682@cindex grammar verbose description
683To help writing a new grammar, @code{wisent-compile-grammar} can
684produce a verbose report containing a detailed description of the
685grammar and parser (equivalent to what Bison reports with the
686@option{--verbose} option).
687
688To enable the verbose report you can set to non-@code{nil} the
689variable:
690
691@vindex wisent-verbose-flag
692@deffn Option wisent-verbose-flag
693non-@code{nil} means to report verbose information on generated parser.
694@end deffn
695
696Or interactively use the command:
697
698@findex wisent-toggle-verbose-flag
699@deffn Command wisent-toggle-verbose-flag
700Toggle whether to report verbose information on generated parser.
701@end deffn
702
703The verbose report is printed in the temporary buffer
704@code{*wisent-log*} when running interactively, or in file
705@file{wisent.output} when running in batch mode. Different
706reports are separated from each other by a line like this:
707
708@example
709@group
710*** Wisent @var{source-file} - 2002-06-27 17:33
711@end group
712@end example
713
714where @var{source-file} is the name of the Emacs Lisp file from which
715the grammar was read. See @ref{Understanding the automaton}, for
716details on the verbose report.
717
718@table @strong
719@item Please Note
720To help debugging the grammar compiler itself, you can set this
721variable to print the content of some internal data structures:
722
723@vindex wisent-debug-flag
724@defvar wisent-debug-flag
725non-@code{nil} means enable some debug stuff.
726@end defvar
727@end table
728
729@node Understanding the automaton
730@subsection Understanding the automaton
731
732@cindex understanding the automaton
733This section (took from the manual of Bison 1.49) describes how to use
734the verbose report printed by @code{wisent-compile-grammar} to
735understand the generated automaton, to tune or fix a grammar.
736
737We will use the following example:
738
739@example
740@group
741(let ((wisent-verbose-flag t)) ;; Print a verbose report!
742 (wisent-compile-grammar
743 '((NUM STR) ; %token NUM STR
744
745 ((left ?+ ?-) ; %left '+' '-';
746 (left ?*)) ; %left '*'
747
748 (exp ; exp:
749 ((exp ?+ exp)) ; exp '+' exp
750 ((exp ?- exp)) ; | exp '-' exp
751 ((exp ?* exp)) ; | exp '*' exp
752 ((exp ?/ exp)) ; | exp '/' exp
753 ((NUM)) ; | NUM
754 ) ; ;
755
756 (useless ; useless:
757 ((STR)) ; STR
758 ) ; ;
759 )
760 'nil) ; no %start declarations
761 )
762@end group
763@end example
764
765When evaluating the above expression, grammar compilation first issues
766the following two clear messages:
767
768@example
769@group
770Grammar contains 1 useless nonterminals and 1 useless rules
771Grammar contains 7 shift/reduce conflicts
772@end group
773@end example
774
775The @samp{*wisent-log*} buffer details things!
776
777The first section reports conflicts that were solved using precedence
778and/or associativity:
779
780@example
781@group
782Conflict in state 7 between rule 1 and token '+' resolved as reduce.
783Conflict in state 7 between rule 1 and token '-' resolved as reduce.
784Conflict in state 7 between rule 1 and token '*' resolved as shift.
785Conflict in state 8 between rule 2 and token '+' resolved as reduce.
786Conflict in state 8 between rule 2 and token '-' resolved as reduce.
787Conflict in state 8 between rule 2 and token '*' resolved as shift.
788Conflict in state 9 between rule 3 and token '+' resolved as reduce.
789Conflict in state 9 between rule 3 and token '-' resolved as reduce.
790Conflict in state 9 between rule 3 and token '*' resolved as reduce.
791@end group
792@end example
793
794The next section reports useless tokens, nonterminal and rules (note
795that useless tokens might be used by the scanner):
796
797@example
798@group
799Useless nonterminals:
800
801 useless
802
803
804Terminals which are not used:
805
806 STR
807
808
809Useless rules:
810
811#6 useless: STR;
812@end group
813@end example
814
815The next section lists states that still have conflicts:
816
817@example
818@group
819State 7 contains 1 shift/reduce conflict.
820State 8 contains 1 shift/reduce conflict.
821State 9 contains 1 shift/reduce conflict.
822State 10 contains 4 shift/reduce conflicts.
823@end group
824@end example
825
826The next section reproduces the grammar used:
827
828@example
829@group
830Grammar
831
832 Number, Rule
833 1 exp -> exp '+' exp
834 2 exp -> exp '-' exp
835 3 exp -> exp '*' exp
836 4 exp -> exp '/' exp
837 5 exp -> NUM
838@end group
839@end example
840
841And reports the uses of the symbols:
842
843@example
844@group
845Terminals, with rules where they appear
846
847$EOI (-1)
848error (1)
849NUM (2) 5
850STR (3) 6
851'+' (4) 1
852'-' (5) 2
853'*' (6) 3
854'/' (7) 4
855
856
857Nonterminals, with rules where they appear
858
859exp (8)
860 on left: 1 2 3 4 5, on right: 1 2 3 4
861@end group
862@end example
863
864The report then details the automaton itself, describing each state
865with it set of @dfn{items}, also known as @dfn{pointed rules}. Each
866item is a production rule together with a point (marked by @samp{.})
867that the input cursor.
868
869@example
870@group
871state 0
872
873 NUM shift, and go to state 1
874
875 exp go to state 2
876@end group
877@end example
878
879State 0 corresponds to being at the very beginning of the parsing, in
880the initial rule, right before the start symbol (@samp{exp}). When
881the parser returns to this state right after having reduced a rule
882that produced an @samp{exp}, it jumps to state 2. If there is no such
883transition on a nonterminal symbol, and the lookahead is a @samp{NUM},
884then this token is shifted on the parse stack, and the control flow
885jumps to state 1. Any other lookahead triggers a parse error.
886
887In the state 1...
888
889@example
890@group
891state 1
892
893 exp -> NUM . (rule 5)
894
895 $default reduce using rule 5 (exp)
896@end group
897@end example
898
899the rule 5, @samp{exp: NUM;}, is completed. Whatever the lookahead
900(@samp{$default}), the parser will reduce it. If it was coming from
901state 0, then, after this reduction it will return to state 0, and
902will jump to state 2 (@samp{exp: go to state 2}).
903
904@example
905@group
906state 2
907
908 exp -> exp . '+' exp (rule 1)
909 exp -> exp . '-' exp (rule 2)
910 exp -> exp . '*' exp (rule 3)
911 exp -> exp . '/' exp (rule 4)
912
913 $EOI shift, and go to state 11
914 '+' shift, and go to state 3
915 '-' shift, and go to state 4
916 '*' shift, and go to state 5
917 '/' shift, and go to state 6
918@end group
919@end example
920
921In state 2, the automaton can only shift a symbol. For instance,
922because of the item @samp{exp -> exp . '+' exp}, if the lookahead if
923@samp{+}, it will be shifted on the parse stack, and the automaton
924control will jump to state 3, corresponding to the item
925@samp{exp -> exp . '+' exp}:
926
927@example
928@group
929state 3
930
931 exp -> exp '+' . exp (rule 1)
932
933 NUM shift, and go to state 1
934
935 exp go to state 7
936@end group
937@end example
938
939Since there is no default action, any other token than those listed
940above will trigger a parse error.
941
942The interpretation of states 4 to 6 is straightforward:
943
944@example
945@group
946state 4
947
948 exp -> exp '-' . exp (rule 2)
949
950 NUM shift, and go to state 1
951
952 exp go to state 8
953
954
955
956state 5
957
958 exp -> exp '*' . exp (rule 3)
959
960 NUM shift, and go to state 1
961
962 exp go to state 9
963
964
965
966state 6
967
968 exp -> exp '/' . exp (rule 4)
969
970 NUM shift, and go to state 1
971
972 exp go to state 10
973@end group
974@end example
975
976As was announced in beginning of the report, @samp{State 7 contains 1
977shift/reduce conflict.}:
978
979@example
980@group
981state 7
982
983 exp -> exp . '+' exp (rule 1)
984 exp -> exp '+' exp . (rule 1)
985 exp -> exp . '-' exp (rule 2)
986 exp -> exp . '*' exp (rule 3)
987 exp -> exp . '/' exp (rule 4)
988
989 '*' shift, and go to state 5
990 '/' shift, and go to state 6
991
992 '/' [reduce using rule 1 (exp)]
993 $default reduce using rule 1 (exp)
994@end group
995@end example
996
997Indeed, there are two actions associated to the lookahead @samp{/}:
998either shifting (and going to state 6), or reducing rule 1. The
999conflict means that either the grammar is ambiguous, or the parser
1000lacks information to make the right decision. Indeed the grammar is
1001ambiguous, as, since we did not specify the precedence of @samp{/},
1002the sentence @samp{NUM + NUM / NUM} can be parsed as @samp{NUM + (NUM
1003/ NUM)}, which corresponds to shifting @samp{/}, or as @samp{(NUM +
1004NUM) / NUM}, which corresponds to reducing rule 1.
1005
1006Because in @acronym{LALR(1)} parsing a single decision can be made,
1007Wisent arbitrarily chose to disable the reduction, see
1008@ref{Conflicts}. Discarded actions are reported in between square
1009brackets.
1010
1011Note that all the previous states had a single possible action: either
1012shifting the next token and going to the corresponding state, or
1013reducing a single rule. In the other cases, i.e., when shifting
1014@emph{and} reducing is possible or when @emph{several} reductions are
1015possible, the lookahead is required to select the action. State 7 is
1016one such state: if the lookahead is @samp{*} or @samp{/} then the
1017action is shifting, otherwise the action is reducing rule 1. In other
1018words, the first two items, corresponding to rule 1, are not eligible
1019when the lookahead is @samp{*}, since we specified that @samp{*} has
1020higher precedence that @samp{+}. More generally, some items are
1021eligible only with some set of possible lookaheads.
1022
1023States 8 to 10 are similar:
1024
1025@example
1026@group
1027state 8
1028
1029 exp -> exp . '+' exp (rule 1)
1030 exp -> exp . '-' exp (rule 2)
1031 exp -> exp '-' exp . (rule 2)
1032 exp -> exp . '*' exp (rule 3)
1033 exp -> exp . '/' exp (rule 4)
1034
1035 '*' shift, and go to state 5
1036 '/' shift, and go to state 6
1037
1038 '/' [reduce using rule 2 (exp)]
1039 $default reduce using rule 2 (exp)
1040
1041
1042
1043state 9
1044
1045 exp -> exp . '+' exp (rule 1)
1046 exp -> exp . '-' exp (rule 2)
1047 exp -> exp . '*' exp (rule 3)
1048 exp -> exp '*' exp . (rule 3)
1049 exp -> exp . '/' exp (rule 4)
1050
1051 '/' shift, and go to state 6
1052
1053 '/' [reduce using rule 3 (exp)]
1054 $default reduce using rule 3 (exp)
1055
1056
1057
1058state 10
1059
1060 exp -> exp . '+' exp (rule 1)
1061 exp -> exp . '-' exp (rule 2)
1062 exp -> exp . '*' exp (rule 3)
1063 exp -> exp . '/' exp (rule 4)
1064 exp -> exp '/' exp . (rule 4)
1065
1066 '+' shift, and go to state 3
1067 '-' shift, and go to state 4
1068 '*' shift, and go to state 5
1069 '/' shift, and go to state 6
1070
1071 '+' [reduce using rule 4 (exp)]
1072 '-' [reduce using rule 4 (exp)]
1073 '*' [reduce using rule 4 (exp)]
1074 '/' [reduce using rule 4 (exp)]
1075 $default reduce using rule 4 (exp)
1076@end group
1077@end example
1078
1079Observe that state 10 contains conflicts due to the lack of precedence
1080of @samp{/} wrt @samp{+}, @samp{-}, and @samp{*}, but also because the
1081associativity of @samp{/} is not specified.
1082
1083Finally, the state 11 (plus 12) is named the @dfn{final state}, or the
1084@dfn{accepting state}:
1085
1086@example
1087@group
1088state 11
1089
1090 $EOI shift, and go to state 12
1091
1092
1093
1094state 12
1095
1096 $default accept
1097@end group
1098@end example
1099
1100The end of input is shifted @samp{$EOI shift,} and the parser exits
1101successfully (@samp{go to state 12}, that terminates).
1102
1103@node Wisent Parsing
1104@chapter Wisent Parsing
1105
1106@cindex bottom-up parser
1107@cindex shift-reduce parser
1108The Wisent's parser is what is called a @dfn{bottom-up} or
1109@dfn{shift-reduce} parser which repeatedly:
1110
1111@table @dfn
1112@cindex shift
1113@item shift
1114That is pushes the value of the last lexical token read (the
1115look-ahead token) into a value stack, and reads a new one.
1116
1117@cindex reduce
1118@item reduce
1119That is replaces a nonterminal by its semantic value. The values of
1120the components which form the right hand side of a rule are popped
1121from the value stack and reduced by the semantic action of this rule.
1122The result is pushed back on top of value stack.
1123@end table
1124
1125The parser will stop on:
1126
1127@table @dfn
1128@cindex accept
1129@item accept
1130When all input has been successfully parsed. The semantic value of
1131the start nonterminal is on top of the value stack.
1132
1133@cindex syntax error
1134@item error
1135When a syntax error (an unexpected token in input) has been detected.
1136At this point the parser issues an error message and either stops or
1137calls a recovery routine to try to resume parsing.
1138@end table
1139
1140@cindex table-driven parser
1141The above elementary actions are driven by the @acronym{LALR(1)}
1142automaton built by @code{wisent-compile-grammar} from a context-free
1143grammar.
1144
1145The Wisent's parser is entered by calling the function:
1146
1147@findex wisent-parse
1148@defun wisent-parse automaton lexer &optional error start
1149Parse input using the automaton specified in @var{automaton}.
1150
1151@table @var
1152@item automaton
1153Is an @acronym{LALR(1)} automaton generated by
1154@code{wisent-compile-grammar} (@pxref{Wisent Grammar}).
1155
1156@item lexer
1157Is a function with no argument called by the parser to obtain the next
1158terminal (token) in input (@pxref{Writing a lexer}).
1159
1160@item error
1161Is an optional reporting function called when a parse error occurs.
1162It receives a message string to report. It defaults to the function
1163@code{wisent-message} (@pxref{Report errors}).
1164
1165@item start
1166Specify the start symbol (nonterminal) used by the parser as its goal.
1167It defaults to the start symbol defined in the grammar
1168(@pxref{Wisent Grammar}).
1169@end table
1170@end defun
1171
1172The following two normal hooks permit to do some useful processing
1173respectively before to start parsing, and after the parser terminated.
1174
1175@vindex wisent-pre-parse-hook
1176@defvar wisent-pre-parse-hook
1177Normal hook run just before entering the @var{LR} parser engine.
1178@end defvar
1179
1180@vindex wisent-post-parse-hook
1181@defvar wisent-post-parse-hook
1182Normal hook run just after the @var{LR} parser engine terminated.
1183@end defvar
1184
1185@menu
1186* Writing a lexer::
1187* Actions goodies::
1188* Report errors::
1189* Error recovery::
1190* Debugging actions::
1191@end menu
1192
1193@node Writing a lexer
1194@section What the parser must receive
1195
1196It is important to understand that the parser does not parse
1197characters, but lexical tokens, and does not know anything about
1198characters in text streams!
1199
1200@cindex lexical analysis
1201@cindex lexer
1202@cindex scanner
1203Reading input data to produce lexical tokens is performed by a lexer
1204(also called a scanner) in a lexical analysis step, before the syntax
1205analysis step performed by the parser. The parser automatically calls
1206the lexer when it needs the next token to parse.
1207
1208@cindex lexical tokens
1209A Wisent's lexer is an Emacs Lisp function with no argument. It must
1210return a valid lexical token of the form:
1211
1212@code{(@var{token-class value} [@var{start} . @var{end}])}
1213
1214@table @var
1215@item token-class
1216Is a category of lexical token identifying a terminal as specified in
1217the grammar (@pxref{Wisent Grammar}). It can be a symbol or a character
1218literal.
1219
1220@item value
1221Is the value of the lexical token. It can be of any valid Emacs Lisp
1222data type.
1223
1224@item start
1225@itemx end
1226Are the optionals beginning and end positions of @var{value} in the
1227input stream.
1228@end table
1229
1230When there are no more tokens to read the lexer must return the token
1231@code{(list wisent-eoi-term)} to each request.
1232
1233@vindex wisent-eoi-term
1234@defvar wisent-eoi-term
1235Predefined constant, End-Of-Input terminal symbol.
1236@end defvar
1237
1238@code{wisent-lex} is an example of a lexer that reads lexical tokens
1239produced by a @semantic{} lexer, and translates them into lexical tokens
1240suitable to the Wisent parser. See also @ref{Wisent Lex}.
1241
1242To call the lexer in a semantic action use the function
1243@code{wisent-lexer}. See also @ref{Actions goodies}.
1244
1245@node Actions goodies
1246@section Variables and macros useful in grammar actions.
1247
1248@vindex wisent-input
1249@defvar wisent-input
1250The last token read.
1251This variable only has meaning in the scope of @code{wisent-parse}.
1252@end defvar
1253
1254@findex wisent-lexer
1255@defun wisent-lexer
1256Obtain the next terminal in input.
1257@end defun
1258
1259@findex wisent-region
1260@defun wisent-region &rest positions
1261Return the start/end positions of the region including
1262@var{positions}. Each element of @var{positions} is a pair
1263@w{@code{(@var{start-pos} . @var{end-pos})}} or @code{nil}. The
1264returned value is the pair @w{@code{(@var{min-start-pos} .
1265@var{max-end-pos})}} or @code{nil} if no @var{positions} are
1266available.
1267@end defun
1268
1269@node Report errors
1270@section The error reporting function
1271
1272@cindex error reporting
1273When the parser encounters a syntax error it calls a user-defined
1274function. It must be an Emacs Lisp function with one argument: a
1275string containing the message to report.
1276
1277By default the parser uses this function to report error messages:
1278
1279@findex wisent-message
1280@defun wisent-message string &rest args
1281Print a one-line message if @code{wisent-parse-verbose-flag} is set.
1282Pass @var{string} and @var{args} arguments to @dfn{message}.
1283@end defun
1284
1285@table @strong
1286@item Please Note:
1287@code{wisent-message} uses the following function to print lexical
1288tokens:
1289
1290@defun wisent-token-to-string token
1291Return a printed representation of lexical token @var{token}.
1292@end defun
1293
1294The general printed form of a lexical token is:
1295
1296@w{@code{@var{token}(@var{value})@@@var{location}}}
1297@end table
1298
1299To control the verbosity of the parser you can set to non-@code{nil}
1300this variable:
1301
1302@vindex wisent-parse-verbose-flag
1303@deffn Option wisent-parse-verbose-flag
1304non-@code{nil} means to issue more messages while parsing.
1305@end deffn
1306
1307Or interactively use the command:
1308
1309@findex wisent-parse-toggle-verbose-flag
1310@deffn Command wisent-parse-toggle-verbose-flag
1311Toggle whether to issue more messages while parsing.
1312@end deffn
1313
1314When the error reporting function is entered the variable
1315@code{wisent-input} contains the unexpected token as returned by the
1316lexer.
1317
1318The error reporting function can be called from a semantic action too
1319using the special macro @code{wisent-error}. When called from a
1320semantic action entered by error recovery (@pxref{Error recovery}) the
1321value of the variable @code{wisent-recovering} is non-@code{nil}.
1322
1323@node Error recovery
1324@section Error recovery
1325
1326@cindex error recovery
1327The error recovery mechanism of the Wisent's parser conforms to the
1328one Bison uses. See @ref{Error Recovery, , , bison}, in the Bison
1329manual for details.
1330
1331@cindex error token
1332To recover from a syntax error you must write rules to recognize the
1333special token @code{error}. This is a terminal symbol that is
1334automatically defined and reserved for error handling.
1335
1336When the parser encounters a syntax error, it pops the state stack
1337until it finds a state that allows shifting the @code{error} token.
1338After it has been shifted, if the old look-ahead token is not
1339acceptable to be shifted next, the parser reads tokens and discards
1340them until it finds a token which is acceptable.
1341
1342@cindex error recovery strategy
1343Strategies for error recovery depend on the choice of error rules in
1344the grammar. A simple and useful strategy is simply to skip the rest
1345of the current statement if an error is detected:
1346
1347@example
1348@group
1349(stmnt (( error ?; )) ;; on error, skip until ';' is read
1350 )
1351@end group
1352@end example
1353
1354It is also useful to recover to the matching close-delimiter of an
1355opening-delimiter that has already been parsed:
1356
1357@example
1358@group
1359(primary (( ?@{ expr ?@} ))
1360 (( ?@{ error ?@} ))
1361 @dots{}
1362 )
1363@end group
1364@end example
1365
1366@cindex error recovery actions
1367Note that error recovery rules may have actions, just as any other
1368rules can. Here are some predefined hooks, variables, functions or
1369macros, useful in such actions:
1370
1371@vindex wisent-nerrs
1372@defvar wisent-nerrs
1373The number of parse errors encountered so far.
1374@end defvar
1375
1376@vindex wisent-recovering
1377@defvar wisent-recovering
1378non-@code{nil} means that the parser is recovering.
1379This variable only has meaning in the scope of @code{wisent-parse}.
1380@end defvar
1381
1382@findex wisent-error
1383@defun wisent-error msg
1384Call the user supplied error reporting function with message
1385@var{msg} (@pxref{Report errors}).
1386
1387For an example of use, @xref{wisent-skip-token}.
1388@end defun
1389
1390@findex wisent-errok
1391@defun wisent-errok
1392Resume generating error messages immediately for subsequent syntax
1393errors.
1394
1395The parser suppress error message for syntax errors that happens
1396shortly after the first, until three consecutive input tokens have
1397been successfully shifted.
1398
1399Calling @code{wisent-errok} in an action, make error messages resume
1400immediately. No error messages will be suppressed if you call it in
1401an error rule's action.
1402
1403For an example of use, @xref{wisent-skip-token}.
1404@end defun
1405
1406@findex wisent-clearin
1407@defun wisent-clearin
1408Discard the current lookahead token.
1409This will cause a new lexical token to be read.
1410
1411In an error rule's action the previous lookahead token is reanalyzed
1412immediately. @code{wisent-clearin} may be called to clear this token.
1413
1414For example, suppose that on a parse error, an error handling routine
1415is called that advances the input stream to some point where parsing
1416should once again commence. The next symbol returned by the lexical
1417scanner is probably correct. The previous lookahead token ought to
1418be discarded with @code{wisent-clearin}.
1419
1420For an example of use, @xref{wisent-skip-token}.
1421@end defun
1422
1423@findex wisent-abort
1424@defun wisent-abort
1425Abort parsing and save the lookahead token.
1426@end defun
1427
1428@findex wisent-set-region
1429@defun wisent-set-region start end
1430Change the region of text matched by the current nonterminal.
1431@var{start} and @var{end} are respectively the beginning and end
1432positions of the region occupied by the group of components associated
1433to this nonterminal. If @var{start} or @var{end} values are not a
1434valid positions the region is set to @code{nil}.
1435
1436For an example of use, @xref{wisent-skip-token}.
1437@end defun
1438
1439@vindex wisent-discarding-token-functions
1440@defvar wisent-discarding-token-functions
1441List of functions to be called when discarding a lexical token.
1442These functions receive the lexical token discarded.
1443When the parser encounters unexpected tokens, it can discards them,
1444based on what directed by error recovery rules. Either when the
1445parser reads tokens until one is found that can be shifted, or when an
1446semantic action calls the function @code{wisent-skip-token} or
1447@code{wisent-skip-block}.
1448For language specific hooks, make sure you define this as a local
1449hook.
1450
1451For example, in @semantic{}, this hook is set to the function
1452@code{wisent-collect-unmatched-syntax} to collect unmatched lexical
1453tokens (@pxref{Useful functions}).
1454@end defvar
1455
1456@findex wisent-skip-token
1457@defun wisent-skip-token
1458@anchor{wisent-skip-token}
1459Skip the lookahead token in order to resume parsing.
1460Return nil.
1461Must be used in error recovery semantic actions.
1462
1463It typically looks like this:
1464
1465@lisp
1466@group
1467(wisent-message "%s: skip %s" $action
1468 (wisent-token-to-string wisent-input))
1469(run-hook-with-args
1470 'wisent-discarding-token-functions wisent-input)
1471(wisent-clearin)
1472(wisent-errok)))
1473@end group
1474@end lisp
1475@end defun
1476
1477@findex wisent-skip-block
1478@defun wisent-skip-block
1479Safely skip a block in order to resume parsing.
1480Return nil.
1481Must be used in error recovery semantic actions.
1482
1483A block is data between an open-delimiter (syntax class @code{(}) and
1484a matching close-delimiter (syntax class @code{)}):
1485
1486@example
1487@group
1488(a parenthesized block)
1489[a block between brackets]
1490@{a block between braces@}
1491@end group
1492@end example
1493
1494The following example uses @code{wisent-skip-block} to safely skip a
1495block delimited by @samp{LBRACE} (@code{@{}) and @samp{RBRACE}
1496(@code{@}}) tokens, when a syntax error occurs in
1497@samp{other-components}:
1498
1499@example
1500@group
1501(block ((LBRACE other-components RBRACE))
1502 ((LBRACE RBRACE))
1503 ((LBRACE error)
1504 (wisent-skip-block))
1505 )
1506@end group
1507@end example
1508@end defun
1509
1510@node Debugging actions
1511@section Debugging semantic actions
1512
1513@cindex semantic action symbols
1514Each semantic action is represented by a symbol interned in an
1515@dfn{obarray} that is part of the @acronym{LALR(1)} automaton
1516(@pxref{Compiling a grammar}). @code{symbol-function} on a semantic
1517action symbol return the semantic action lambda expression.
1518
1519A semantic action symbol name has the form
1520@code{@var{nonterminal}:@var{index}}, where @var{nonterminal} is the
1521name of the nonterminal symbol the action belongs to, and @var{index}
1522is an action sequence number within the scope of @var{nonterminal}.
1523For example, this nonterminal definition:
1524
1525@example
1526@group
1527input:
1528 line [@code{input:0}]
1529 | input line
1530 (format "%s %s" $1 $2) [@code{input:1}]
1531 ;
1532@end group
1533@end example
1534
1535Will produce two semantic actions, and associated symbols:
1536
1537@table @code
1538@item input:0
1539A default action that returns @code{$1}.
1540
1541@item input:1
1542That returns @code{(format "%s %s" $1 $2)}.
1543@end table
1544
1545@cindex debugging semantic actions
1546Debugging uses the Lisp debugger to investigate what is happening
1547during execution of semantic actions.
1548Three commands are available to debug semantic actions. They receive
1549two arguments:
1550
1551@itemize @bullet
1552@item The automaton that contains the semantic action.
1553
1554@item The semantic action symbol.
1555@end itemize
1556
1557@findex wisent-debug-on-entry
1558@deffn Command wisent-debug-on-entry automaton function
1559Request @var{automaton}'s @var{function} to invoke debugger each time it is called.
1560@var{function} must be a semantic action symbol that exists in @var{automaton}.
1561@end deffn
1562
1563@findex wisent-cancel-debug-on-entry
1564@deffn Command wisent-cancel-debug-on-entry automaton function
1565Undo effect of @code{wisent-debug-on-entry} on @var{automaton}'s @var{function}.
1566@var{function} must be a semantic action symbol that exists in @var{automaton}.
1567@end deffn
1568
1569@findex wisent-debug-show-entry
1570@deffn Command wisent-debug-show-entry automaton function
1571Show the source of @var{automaton}'s semantic action @var{function}.
1572@var{function} must be a semantic action symbol that exists in @var{automaton}.
1573@end deffn
1574
1575@node Wisent Semantic
1576@chapter How to use Wisent with Semantic
1577
1578@cindex tags
1579This section presents how the Wisent's parser can be used to produce
1580@dfn{tags} for the @semantic{} tool set.
1581
1582@semantic{} tags form a hierarchy of Emacs Lisp data structures that
1583describes a program in a way independent of programming languages.
1584Tags map program declarations, like functions, methods, variables,
1585data types, classes, includes, grammar rules, etc..
1586
1587@cindex WY grammar format
1588To use the Wisent parser with @semantic{} you have to define
1589your grammar in @dfn{WY} form, a grammar format very close
1590to the one used by Bison.
1591
1592Please @inforef{top, Semantic Grammar Framework Manual, grammar-fw}
1593for more information on @semantic{} grammars.
1594
1595@menu
1596* Grammar styles::
1597* Wisent Lex::
1598@end menu
1599
1600@node Grammar styles
1601@section Grammar styles
1602
1603@cindex grammar styles
1604@semantic{} parsing heavily depends on how you wrote the grammar.
1605There are mainly two styles to write a Wisent's grammar intended to be
1606used with the @semantic{} tool set: the @dfn{Iterative style} and the
1607@dfn{Bison style}. Each one has pros and cons, and in certain cases
1608it can be worth a mix of the two styles!
1609
1610@menu
1611* Iterative style::
1612* Bison style::
1613* Mixed style::
1614* Start nonterminals::
1615* Useful functions::
1616@end menu
1617
1618@node Iterative style, Bison style, Grammar styles, Grammar styles
1619@subsection Iterative style
1620
1621@cindex grammar iterative style
1622The @dfn{iterative style} is the preferred style to use with @semantic{}.
1623It relies on an iterative parser back-end mechanism which parses start
1624nonterminals one at a time and automagically skips unexpected lexical
1625tokens in input.
1626
1627Compared to rule-based iterative functions (@pxref{Bison style}),
1628iterative parsers are better in that they can handle obscure errors
1629more cleanly.
1630
1631@cindex raw tag
1632Each start nonterminal must produces a @dfn{raw tag} by calling a
1633@code{TAG}-like grammar macro with appropriate parameters. See also
1634@ref{Start nonterminals}.
1635
1636@cindex expanded tag
1637Then, each parsing iteration automatically translates a raw tag into
1638@dfn{expanded tags}, updating the raw tag structure with internal
1639properties and buffer related data.
1640
1641After parsing completes, it results in a tree of expanded tags.
1642
1643The following example is a snippet of the iterative style Java grammar
1644provided in the @semantic{} distribution in the file
1645@file{semantic/wisent/java-tags.wy}.
1646
1647@example
1648@group
1649@dots{}
1650;; Alternate entry points
1651;; - Needed by partial re-parse
1652%start formal_parameter
1653@dots{}
1654;; - Needed by EXPANDFULL clauses
1655%start formal_parameters
1656@dots{}
1657
1658formal_parameter_list
1659 : PAREN_BLOCK
1660 (EXPANDFULL $1 formal_parameters)
1661 ;
1662
1663formal_parameters
1664 : LPAREN
1665 ()
1666 | RPAREN
1667 ()
1668 | formal_parameter COMMA
1669 | formal_parameter RPAREN
1670 ;
1671
1672formal_parameter
1673 : formal_parameter_modifier_opt type variable_declarator_id
1674 (VARIABLE-TAG $3 $2 nil :typemodifiers $1)
1675 ;
1676@end group
1677@end example
1678
1679@findex EXPANDFULL
1680It shows the use of the @code{EXPANDFULL} grammar macro to parse a
1681@samp{PAREN_BLOCK} which contains a @samp{formal_parameter_list}.
1682@code{EXPANDFULL} tells to recursively parse @samp{formal_parameters}
1683inside @samp{PAREN_BLOCK}. The parser iterates until it digested all
1684available input data inside the @samp{PAREN_BLOCK}, trying to match
1685any of the @samp{formal_parameters} rules:
1686
1687@itemize
1688@item @samp{LPAREN}
1689
1690@item @samp{RPAREN}
1691
1692@item @samp{formal_parameter COMMA}
1693
1694@item @samp{formal_parameter RPAREN}
1695@end itemize
1696
1697At each iteration it will return a @samp{formal_parameter} raw tag,
1698or @code{nil} to skip unwanted (single @samp{LPAREN} or @samp{RPAREN}
1699for example) or unexpected input data. Those raw tags will be
1700automatically expanded by the iterative back-end parser.
1701
1702@node Bison style
1703@subsection Bison style
1704
1705@cindex grammar bison style
1706What we call the @dfn{Bison style} is the traditional style of Bison's
1707grammars. Compared to iterative style, it is not straightforward to
1708use grammars written in Bison style in @semantic{}. Mainly because such
1709grammars are designed to parse the whole input data in one pass, and
1710don't use the iterative parser back-end mechanism (@pxref{Iterative
1711style}). With Bison style the parser is called once to parse the
1712grammar start nonterminal.
1713
1714The following example is a snippet of the Bison style Java grammar
1715provided in the @semantic{} distribution in the file
1716@file{semantic/wisent/java.wy}.
1717
1718@example
1719@group
1720%start formal_parameter
1721@dots{}
1722
1723formal_parameter_list
1724 : formal_parameter_list COMMA formal_parameter
1725 (cons $3 $1)
1726 | formal_parameter
1727 (list $1)
1728 ;
1729
1730formal_parameter
1731 : formal_parameter_modifier_opt type variable_declarator_id
1732 (EXPANDTAG
1733 (VARIABLE-TAG $3 $2 :typemodifiers $1)
1734 )
1735 ;
1736@end group
1737@end example
1738
1739The first consequence is that syntax errors are not automatically
1740handled by @semantic{}. Thus, it is necessary to explicitly handle
1741them at the grammar level, providing error recovery rules to skip
1742unexpected input data.
1743
1744The second consequence is that the iterative parser can't do automatic
1745tag expansion, except for the start nonterminal value. It is
1746necessary to explicitly expand tags from concerned semantic actions by
1747calling the grammar macro @code{EXPANDTAG} with a raw tag as
1748parameter. See also @ref{Start nonterminals}, for incremental
1749re-parse considerations.
1750
1751@node Mixed style
1752@subsection Mixed style
1753
1754@cindex grammar mixed style
1755@example
1756@group
1757%start grammar
1758;; Reparse
1759%start prologue epilogue declaration nonterminal rule
1760@dots{}
1761
1762%%
1763
1764grammar:
1765 prologue
1766 | epilogue
1767 | declaration
1768 | nonterminal
1769 | PERCENT_PERCENT
1770 ;
1771@dots{}
1772
1773nonterminal:
1774 SYMBOL COLON rules SEMI
1775 (TAG $1 'nonterminal :children $3)
1776 ;
1777
1778rules:
1779 lifo_rules
1780 (apply 'nconc (nreverse $1))
1781 ;
1782
1783lifo_rules:
1784 lifo_rules OR rule
1785 (cons $3 $1)
1786 | rule
1787 (list $1)
1788 ;
1789
1790rule:
1791 rhs
1792 (let* ((rhs $1)
1793 name type comps prec action elt)
1794 @dots{}
1795 (EXPANDTAG
1796 (TAG name 'rule :type type :value comps :prec prec :expr action)
1797 ))
1798 ;
1799@end group
1800@end example
1801
1802This example shows how iterative and Bison styles can be combined in
1803the same grammar to obtain a good compromise between grammar
1804complexity and an efficient parsing strategy in an interactive
1805environment.
1806
1807@samp{nonterminal} is parsed using iterative style via the main
1808@samp{grammar} rule. The semantic action uses the @code{TAG} macro to
1809produce a raw tag, automagically expanded by @semantic{}.
1810
1811But @samp{rules} part is parsed in Bison style! Why?
1812
1813Rule delimiters are the colon (@code{:}), that follows the nonterminal
1814name, and a final semicolon (@code{;}). Unfortunately these
1815delimiters are not @code{open-paren}/@code{close-paren} type, and the
1816Emacs' syntactic analyzer can't easily isolate data between them to
1817produce a @samp{RULES_PART} parenthesis-block-like lexical token.
1818Consequently it is not possible to use @code{EXPANDFULL} to iterate in
1819@samp{RULES_PART}, like this:
1820
1821@example
1822@group
1823nonterminal:
1824 SYMBOL COLON rules SEMI
1825 (TAG $1 'nonterminal :children $3)
1826 ;
1827
1828rules:
1829 RULES_PART ;; @strong{Map a parenthesis-block-like lexical token}
1830 (EXPANDFULL $1 'rules)
1831 ;
1832
1833rules:
1834 COLON
1835 ()
1836 OR
1837 ()
1838 SEMI
1839 ()
1840 rhs
1841 rhs
1842 (let* ((rhs $1)
1843 name type comps prec action elt)
1844 @dots{}
1845 (TAG name 'rule :type type :value comps :prec prec :expr action)
1846 )
1847 ;
1848@end group
1849@end example
1850
1851In such cases, when it is difficult for Emacs to obtain
1852parenthesis-block-like lexical tokens, the best solution is to use the
1853traditional Bison style with error recovery!
1854
1855In some extreme cases, it can also be convenient to extend the lexer,
1856to deliver new lexical tokens, to simplify the grammar.
1857
1858@node Start nonterminals
1859@subsection Start nonterminals
1860
1861@cindex start nonterminals
1862@cindex @code{reparse-symbol} property
1863When you write a grammar for @semantic{}, it is important to carefully
1864indicate the start nonterminals. Each one defines an entry point in
1865the grammar, and after parsing its semantic value is returned to the
1866back-end iterative engine. Consequently:
1867
1868@strong{The semantic value of a start nonterminal must be a produced
1869by a TAG like grammar macro}.
1870
1871Start nonterminals are declared by @code{%start} statements. When
1872nothing is specified the first nonterminal that appears in the grammar
1873is the start nonterminal.
1874
1875Generally, the following nonterminals must be declared as start
1876symbols:
1877
1878@itemize @bullet
1879@item The main grammar entry point
1880@quotation
1881Of course!
1882@end quotation
1883
1884@item nonterminals passed to @code{EXPAND}/@code{EXPANDFULL}
1885@quotation
1886These grammar macros recursively parse a part of input data, based on
1887rules of the given nonterminal.
1888
1889For example, the following will parse @samp{PAREN_BLOCK} data using
1890the @samp{formal_parameters} rules:
1891
1892@example
1893@group
1894formal_parameter_list
1895 : PAREN_BLOCK
1896 (EXPANDFULL $1 formal_parameters)
1897 ;
1898@end group
1899@end example
1900
1901The semantic value of @samp{formal_parameters} becomes the value of
1902the @code{EXPANDFULL} expression. It is a list of @semantic{} tags
1903spliced in the tags tree.
1904
1905Because the automaton must know that @samp{formal_parameters} is a
1906start symbol, you must declare it like this:
1907
1908@example
1909@group
1910%start formal_parameters
1911@end group
1912@end example
1913@end quotation
1914@end itemize
1915
1916@cindex incremental re-parse
1917@cindex reparse-symbol
1918The @code{EXPANDFULL} macro has a side effect it is important to know,
1919related to the incremental re-parse mechanism of @semantic{}: the
1920nonterminal symbol parameter passed to @code{EXPANDFULL} also becomes
1921the @code{reparse-symbol} property of the tag returned by the
1922@code{EXPANDFULL} expression.
1923
1924When buffer's data mapped by a tag is modified, @semantic{}
1925schedules an incremental re-parse of that data, using the tag's
1926@code{reparse-symbol} property as start nonterminal.
1927
1928@strong{The rules associated to such start symbols must be carefully
1929reviewed to ensure that the incremental parser will work!}
1930
1931Things are a little bit different when the grammar is written in Bison
1932style.
1933
1934@strong{The @code{reparse-symbol} property is set to the nonterminal
1935symbol the rule that explicitly uses @code{EXPANDTAG} belongs to.}
1936
1937For example:
1938
1939@example
1940@group
1941rule:
1942 rhs
1943 (let* ((rhs $1)
1944 name type comps prec action elt)
1945 @dots{}
1946 (EXPANDTAG
1947 (TAG name 'rule :type type :value comps :prec prec :expr action)
1948 ))
1949 ;
1950@end group
1951@end example
1952
1953Set the @code{reparse-symbol} property of the expanded tag to
1954@samp{rule}. A important consequence is that:
1955
1956@strong{Every nonterminal having any rule that calls @code{EXPANDTAG}
1957in a semantic action, should be declared as a start symbol!}
1958
1959@node Useful functions
1960@subsection Useful functions
1961
1962Here is a description of some predefined functions it might be useful
1963to know when writing new code to use Wisent in @semantic{}:
1964
1965@findex wisent-collect-unmatched-syntax
1966@defun wisent-collect-unmatched-syntax input
1967Add @var{input} lexical token to the cache of unmatched tokens, in
1968variable @code{semantic-unmatched-syntax-cache}.
1969
1970See implementation of the function @code{wisent-skip-token} in
1971@ref{Error recovery}, for an example of use.
1972@end defun
1973
1974@node Wisent Lex
1975@section The Wisent Lex lexer
1976
1977@findex semantic-lex
1978The lexical analysis step of @semantic{} is performed by the general
1979function @code{semantic-lex}. For more information, @inforef{Writing
1980Lexers, ,semantic-langdev}.
1981
1982@code{semantic-lex} produces lexical tokens of the form:
1983
1984@example
1985@group
1986@code{(@var{token-class start} . @var{end})}
1987@end group
1988@end example
1989
1990@table @var
1991@item token-class
1992Is a symbol that identifies a lexical token class, like @code{symbol},
1993@code{string}, @code{number}, or @code{PAREN_BLOCK}.
1994
1995@item start
1996@itemx end
1997Are the start and end positions of mapped data in the input buffer.
1998@end table
1999
2000The Wisent's parser doesn't depend on the nature of analyzed input
2001stream (buffer, string, etc.), and requires that lexical tokens have a
2002different form (@pxref{Writing a lexer}):
2003
2004@example
2005@group
2006@code{(@var{token-class value} [@var{start} . @var{end}])}
2007@end group
2008@end example
2009
2010@cindex lexical token mapping
2011@code{wisent-lex} is the default Wisent's lexer used in @semantic{}.
2012
2013@vindex wisent-lex-istream
2014@findex wisent-lex
2015@defun wisent-lex
2016Return the next available lexical token in Wisent's form.
2017
2018The variable @code{wisent-lex-istream} contains the list of lexical
2019tokens produced by @code{semantic-lex}. Pop the next token available
2020and convert it to a form suitable for the Wisent's parser.
2021@end defun
2022
2023Mapping of lexical tokens as produced by @code{semantic-lex} into
2024equivalent Wisent lexical tokens is straightforward:
2025
2026@example
2027@group
2028(@var{token-class start} . @var{end})
2029 @result{} (@var{token-class value start} . @var{end})
2030@end group
2031@end example
2032
2033@var{value} is the input @code{buffer-substring} from @var{start} to
2034@var{end}.
2035
2036@node GNU Free Documentation License
2037@appendix GNU Free Documentation License
2038
2039@include fdl.texi
2040
2041@node Index
2042@unnumbered Index
2043@printindex cp
2044
2045@iftex
2046@contents
2047@summarycontents
2048@end iftex
2049
2050@bye
2051
2052@c Following comments are for the benefit of ispell.
2053
2054@c LocalWords: Wisent automagically wisent Wisent's LALR obarray