\documentstyle{article}
\title{ ML-Yacc User's Manual \\
Version 2.4
}
\author{ David R. Tarditi$^1$\\
Andrew W. Appel$^2$\\
\\
$^1$Microsoft Research \\
\\
$^2$Department of Computer Science \\
Princeton University \\
Princeton, NJ 08544
}
\date{April 24, 2000}

\begin{document}
\maketitle
\begin{center}
(c) 1989, 1990, 1991, 1994 Andrew W. Appel, David R. Tarditi
\end{center}

{\bf
This software comes with ABSOLUTELY NO WARRANTY. It is subject only to
the terms of the ML-Yacc NOTICE, LICENSE, and DISCLAIMER (in the
file COPYRIGHT distributed with this software).
}

New in this version: Improved error correction directive \verb|%change|
that allows multi-token insertions, deletions, and substitutions.
Explanation of how to build a parser (Section 5) and the Calc example
(Section 7) revised for SML/NJ Version 110 and the use of CM.

\newpage
\tableofcontents
\newpage

\section{Introduction}
\subsection{General}
ML-Yacc is a parser generator for Standard ML modeled after the
Yacc parser generator. Like Yacc, it generates parsers for LALR languages
and has a similar syntax. The generated parsers use a different algorithm
for recovering from syntax errors than parsers generated by Yacc.
The algorithm is a partial implementation of an algorithm described in \cite{bf}.
A parser tries to recover from a syntax error
by making a single token insertion, deletion, or
substitution near the point in the input stream at which the error
was detected. The parsers delay the evaluation of semantic actions until
parses are completed successfully. This makes it possible for
parsers to recover from syntax errors that occur before the point
of error detection, but it does prevent the parsers from
affecting lexers in any significant way. The parsers
can insert tokens with values and substitute tokens with values
for other tokens. All symbols carry left and right position values
which are available to semantic actions and are used in
syntactic error messages.

ML-Yacc uses context-free grammars to specify the syntax of languages to
be parsed. See \cite{ahu} for definitions and information on context-free
grammars and LR parsing. We briefly review some terminology here. A
context-free grammar is defined by a set of terminals $T$, a set of
nonterminals $NT$, a set of productions $P$, and a start
nonterminal $S$.
Terminals are interchangeably referred to as tokens. The terminal
and nonterminal sets are assumed to be disjoint. The set of symbols is the
union of the nonterminal and terminal sets. We use lower case
Greek letters to denote a string of symbols. We use upper case
Roman letters near the beginning of the alphabet to denote nonterminals.
Each production gives a
derivation of a string of symbols from a nonterminal, which we will
write as $A \rightarrow \alpha$. We define a relation between strings of
symbols $\alpha$ and $\beta$, written $\alpha \vdash \beta$ and read
as $\alpha$ derives $\beta$, if and only if $\alpha = \delta A \gamma$,
$\beta = \delta \phi \gamma$, and
there exists some production $A \rightarrow \phi$. We write the
transitive closure of this relation as
$\vdash_*$. We say that a string of terminals $\alpha$ is a valid sentence
of the language, {\em i.e.} it is derivable, if
$S \vdash_* \alpha$, where $S$ is the start symbol. The sequence of derivations is often
visualized as a parse tree.

ML-Yacc uses an attribute grammar scheme with synthesized attributes.
Each symbol in the grammar may have a value (i.e. attribute) associated
with it. Each production has a semantic action associated with it.
A production with a semantic action is called a rule.
Parsers perform bottom-up, left-to-right evaluations of parse trees using semantic
actions to compute values as they do so. Given a production
$P = A \rightarrow \alpha$, the corresponding semantic action is
used to compute a value for $A$ from the values of the symbols in $\alpha$.
If $A$ has no value, the semantic action is still evaluated but the value is ignored.
Each parse returns the value associated with the start symbol $S$ of the
grammar. A parse returns a nullary value if the start symbol does not carry a value.

The synthesized attribute scheme can be adapted easily to inherited
attributes. An inherited attribute is a value which propagates from
a nonterminal to the symbols produced by the nonterminal according to
some rule. Since functions are values in ML,
the semantic actions for the derived symbols
can return functions which take the
inherited value as an argument.
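
For example (a sketch in plain ML, with made-up names), a semantic
action can synthesize a function that receives the inherited value
later:
\begin{quote}
\begin{verbatim}
(* inherited attribute: an environment of bindings *)
type env = (string * int) list

(* action for a single declaration: synthesize env -> env *)
fun decAction (name, v) = fn inherited => (name, v) :: inherited

(* action for a declaration sequence: compose the functions
   synthesized for the subcomponents *)
fun seqAction (d1, d2) = fn inherited => d2 (d1 inherited)

(* the inherited value is supplied when the result is applied *)
val finalEnv = seqAction (decAction ("x", 3), decAction ("y", 4)) []
\end{verbatim}
\end{quote}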

\subsection{Modules}
ML-Yacc uses the ML modules facility to specify the interface between
a parser that it generates and a lexical analyzer that must be supplied
by you. It also uses the ML modules facility to factor out
a set of modules that are common to every generated parser.
These common modules include a parsing structure, which contains
an error-correcting LR parser\footnote{A plain LR parser is also
available.}, an LR table structure, and a structure
which defines the representation of terminals. ML-Yacc produces
a functor for a particular parser parameterized by the LR table
structure and the representation of terminals. This functor
contains values specific to the parser, such as the
LR table for the parser\footnote{The LR table is a value. The
LR table structure defines an abstract LR table type.}, the
semantic actions for the parser, and a structure containing
the terminals for the parser. ML-Yacc produces a signature
for the structure produced by applying this functor
and another signature for the structure containing the terminals for
the parser. You must
supply a functor for the lexing module parameterized by this
structure.

Figure 1 is a dependency diagram of the modules that summarizes this
information. A module at the head of an arrow depends
on the module at the tail.

\begin{figure}
\begin{tabular}{|rcl|}
\hline
parsing structure & $\longrightarrow$ & values for a particular parser\\
values for a particular parser & $\longrightarrow$ & lexical analyzer\\
parsing structure, & $\longrightarrow$ & particular parser\\
values for a particular parser, & & \\
lexical analyzer & & \\
\hline
\end{tabular}
\caption{Module Dependencies}
\end{figure}
\subsection{Error Recovery}

The error recovery algorithm is able to accurately recover from many
single token syntax errors. It tries to make a single token
correction at the token in the input stream at which the syntax error
was detected and at any of the 15 tokens\footnote{An arbitrary number,
chosen because numbers above this do not seem to improve error
correction much.} before that token. The algorithm checks corrections
before the point of error detection because a syntax error is often
not detected until several tokens beyond the token which caused the
error.\footnote{An LR parser detects a syntax error as soon as
possible, but this does not necessarily mean that the token at which
the error was detected caused the error.}

The algorithm works by trying corrections at each
of the 16 tokens up to and including the token at which the
error was detected. At each token in the input stream, it
will try deleting the token, substituting other tokens for the
token, or inserting some other token before the token.

The algorithm uses a parse check to evaluate corrections. A parse
check measures how far a correction allows a parser to
parse without encountering a syntax error.
You pass the parser an upper bound on how many tokens beyond the error
point it may read while doing a parse check. This allows
you to control the amount of lookahead that a parser reads
for different kinds of systems. For an interactive system, you
should set the lookahead to zero. Otherwise, a parser may hang
waiting for input in the case of a syntax error. If the lookahead
is zero, no syntax errors will be corrected. For a batch system,
you should set the lookahead to 15.

The algorithm selects the set of corrections which allows the parse
to proceed the farthest
and parse through at least the error token. It then removes those
corrections involving keywords which do not meet a longer minimum
parse check. If more than one correction is possible after this,
it uses a simple heuristic priority scheme to order the corrections,
and then arbitrarily chooses one of the corrections with the highest priority.
You have some control over the priority scheme: you can
name a set of preferred insertions and a set of preferred substitutions.
The priorities for corrections, ordered from highest to lowest
priority, are
preferred insertions, preferred substitutions, insertions, deletions,
and substitutions.

The error recovery algorithm is guaranteed to terminate since it always
selects fixes which parse through the
error token.

The error-correcting LR parser implements the algorithm by keeping
a queue of its state stacks before shifting tokens and by using
a lazy stream for the lexer.
This makes it possible to restart the
parse from before an error point and try various corrections. The
error-correcting LR parser does not defer semantic actions. Instead,
ML-Yacc creates semantic actions which are free of side-effects
and always terminate.
ML-Yacc uses higher-order functions to defer the
evaluation of all user semantic actions until the parse is successfully
completed, without constructing an explicit parse tree.
You may declare whether your semantic actions are free of side-effects
and always terminate, in which case ML-Yacc does not need to defer
the evaluation of your semantic actions.
\subsection{Precedence}
ML-Yacc uses the same precedence scheme as Yacc for resolving
shift/reduce conflicts. Each terminal may be assigned a precedence and
associativity. Each rule is then assigned the precedence of its rightmost
terminal. If a shift/reduce conflict occurs, the conflict is resolved
silently if the terminal and the rule in the conflict both have
precedences.
If the terminal has the higher precedence, the shift is chosen. If
the rule has the higher precedence, the reduction is chosen. If both
the terminal and the rule have the same precedence, then the associativity
of the terminal is used to resolve the conflict. If the terminal is
left associative, the reduction is chosen. If the terminal is
right associative, the shift is chosen. Terminals may also be declared
nonassociative, in which case an error message is produced
if the associativity is needed to resolve the parsing conflict.

If a terminal or a rule in a shift/reduce conflict does not have
a precedence, then an error message is produced and the shift
is chosen.

In reduce/reduce conflicts, an error message is always produced and
the first rule listed in the specification is chosen for reduction.
\subsection{Notation}

Text surrounded by braces denotes meta-notation. If you see
something like \{parser name\}, you should substitute the actual
name of your parser for the meta-notation. Text in a bold-face
typewriter font ({\tt like this}) denotes text in a specification
or ML code.

\section{ML-Yacc specifications}

An ML-Yacc specification consists of three parts, each of which is
separated from the others by a {\tt \%\%} delimiter. The general format is:
\begin{quote}
\tt
\{user declarations\} \\
\%\% \\
\{ML-Yacc declarations\} \\
\%\% \\
\{rules\}
\end{quote}

You can define values available in the semantic
actions of the rules in the user declarations section.
It is recommended that you keep the size of this
section as small as possible and place large
blocks of code in other modules.
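
For concreteness, here is a sketch of a complete (hypothetical)
specification for a tiny expression language; every declaration used
here is explained in the sections that follow:
\begin{quote}
\begin{verbatim}
(* user declarations: values visible in semantic actions *)
fun note s = TextIO.print (s ^ "\n")
%%
%name Tiny
%pos int
%term NUM of int | PLUS | EOF
%nonterm EXP of int
%left PLUS
%eop EOF
%noshift EOF
%%
EXP : NUM            (NUM)
    | EXP PLUS EXP   (EXP1 + EXP2)
\end{verbatim}
\end{quote}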

The ML-Yacc declarations section is used to make a set
of required declarations and a set of optional declarations.
You must declare the nonterminals and terminals and the
types of the values associated with them there. You must
also name the parser and declare the type of position values.
You should specify the set of terminals which can follow
the start symbol and the set of non-shiftable terminals.
You may optionally declare precedences for terminals,
make declarations that will
improve error-recovery, and suppress the generation of
default reductions in the parser. You may
declare whether the parser generator should create
a verbose description of the parser in a ``.desc'' file. This is useful
for finding the causes of shift/reduce errors and other parsing conflicts.

You may also declare whether the semantic actions are
free of significant side-effects and always terminate. Normally, ML-Yacc
delays the evaluation of semantic actions until the completion of a
successful parse. This ensures that there will be no semantic actions
to ``undo'' if a syntactic error-correction invalidates some semantic
actions. If, however, the semantic actions are free of significant
side-effects and always terminate, the results of semantic actions that
are invalidated by a syntactic error-correction can always be safely
ignored.

Parsers run faster and need less memory when it is not
necessary to delay the evaluation of semantic actions. You are
encouraged to write semantic actions that are free of side-effects and
always terminate and to declare this information to ML-Yacc.

A semantic action is free of significant side-effects if it can be reexecuted
a reasonably small number of times without affecting the result of a
parse. (The reexecution occurs when the error-correcting parser is testing
possible corrections to fix a syntax error, and the number of times
reexecution occurs is roughly bounded, for each syntax error, by the number of
terminals times the amount of lookahead permitted for the error-correcting
parser).

The rules section contains the context-free grammar productions and their
associated semantic actions.

\subsection{Lexical Definitions}

Comments have the same lexical definition as they do in Standard
ML and can be placed anywhere in a specification.

All characters up to the first occurrence of a delimiting
{\tt \%\%} outside of
a comment are placed in the user declarations section. After that, the
following words and symbols are reserved:
\begin{quote}

\verb'of for = { } , * -> : | ( )'

\end{quote}

The following classes of ML symbols are used:
\begin{quote}
\begin{description}
\item[identifiers:]
nonsymbolic ML identifiers, which consist
of an alphabetic character followed by zero or
more alphabetic characters, numeric characters,
primes ``{\tt '}'', or underscores ``{\tt \_}''.
\item[type variables:]
nonsymbolic ML identifiers starting with a prime ``{\tt '}''.
\item[integers:] one or more decimal digits.
\item[qualified identifiers:] an identifier followed by a period.

\end{description}
\end{quote}
The following classes of non-ML symbols are used:
\begin{quote}
\begin{description}
\item[\% identifiers:]
a percent sign followed by one or more lowercase
alphabet letters. The valid \% identifiers
are:
\begin{quote}
\raggedright
\tt
\%arg \%change \%eop \%header \%keyword \%left \%name \%nodefault
\%nonassoc \%nonterm \%noshift \%pos \%prec \%prefer
\%pure \%right \%start \%subst \%term \%value \%verbose
\end{quote}
\item[code:]
This class is meant to hold ML code. The ML code is not
parsed for syntax errors. It consists of a left parenthesis
followed by all characters up to a balancing right
parenthesis. Parentheses in ML comments and ML strings
are excluded from the count of balancing parentheses.

\end{description}
\end{quote}
\subsection{Grammar}

This is the grammar for specifications:
\begin{eqnarray*}
\mbox{spec} & ::= & \mbox{user-declarations {\tt \%\%} cmd-list {\tt \%\%} rule-list} \\
\mbox{ML-type} & ::= & \mbox{nonpolymorphic ML types (see the Standard ML manual)} \\
\mbox{symbol} & ::= & \mbox{identifier} \\
\mbox{symbol-list} & ::= & \mbox{symbol-list symbol} \\
& | & \epsilon \\
\mbox{symbol-type-list} & ::= & \mbox{symbol-type-list {\tt |} symbol {\tt of} ML-type} \\
& | & \mbox{symbol-type-list {\tt |} symbol} \\
& | & \mbox{symbol {\tt of} ML-type} \\
& | & \mbox{symbol} \\
\mbox{subst-list} & ::= & \mbox{subst-list {\tt |} symbol {\tt for} symbol} \\
& | & \epsilon \\
\mbox{cmd} & ::= & \mbox{{\tt \%arg} (Any-ML-pattern) {\tt :} ML-type} \\
& | & \mbox{{\tt \%eop} symbol-list} \\
& | & \mbox{{\tt \%header} code} \\
& | & \mbox{{\tt \%keyword} symbol-list} \\
& | & \mbox{{\tt \%left} symbol-list} \\
& | & \mbox{{\tt \%name} identifier} \\
& | & \mbox{{\tt \%nodefault}} \\
& | & \mbox{{\tt \%nonassoc} symbol-list} \\
& | & \mbox{{\tt \%nonterm} symbol-type-list} \\
& | & \mbox{{\tt \%noshift} symbol-list } \\
& | & \mbox{{\tt \%pos} ML-type} \\
& | & \mbox{{\tt \%prefer} symbol-list} \\
& | & \mbox{\tt \%pure} \\
& | & \mbox{{\tt \%right} symbol-list} \\
& | & \mbox{{\tt \%start} symbol} \\
& | & \mbox{{\tt \%subst} subst-list} \\
& | & \mbox{{\tt \%term} symbol-type-list} \\
& | & \mbox{{\tt \%value} symbol code} \\
& | & \mbox{{\tt \%verbose}} \\
\mbox{cmd-list} & ::= &\mbox{ cmd-list cmd} \\
& | & \mbox{cmd} \\
\mbox{rule-prec} & ::= & \mbox{{\tt \%prec} symbol} \\
& | & \epsilon \\
\mbox{clause-list} & ::= & \mbox{symbol-list rule-prec code} \\
& | & \mbox{clause-list {\tt |} symbol-list rule-prec code} \\
\mbox{rule} & ::= & \mbox{symbol {\tt :} clause-list} \\
\mbox{rule-list} & ::= & \mbox{rule-list rule} \\
& | & \mbox{rule}
\end{eqnarray*}
\subsection{Required ML-Yacc Declarations}
\begin{description}
\item[{\tt \%name}]
You must specify the name of the parser with {\tt \%name} \{name\}.
\item[{\tt \%nonterm} and {\tt \%term}]
You must define the terminal and nonterminal sets using the
{\tt \%term} and {\tt \%nonterm}
declarations, respectively. These declarations are like an ML datatype
definition.
The type of the value that a symbol may carry is defined at the same time
that the symbol is defined. Each declaration consists of the keyword
({\tt \%term} or {\tt \%nonterm})
followed by a list of symbol entries separated by a bar (``{\tt |}'').
Each symbol entry is a symbol name followed by an optional
``of \/ $<$ML-type$>$''. The types cannot be polymorphic.
Those symbol entries without a type carry no value.
Nonterminal and terminal names must be disjoint and no name may be declared
more than once in either declaration.

The symbol names and types are used to construct a datatype union for the
values on the semantic stack in the LR parser and to name the values
associated with subcomponents of a rule. The names and types of
terminals are also used to construct a signature for a structure that
may be passed to the lexer functor.

Because the types and names are used in these ways, do
not use ML keywords as symbol names. The programs produced by ML-Yacc
will not compile if ML keywords are used as symbol names.
Make sure that the types specified in the {\tt \%term} declaration are
fully qualified types or are available in the background
environment when the signatures produced by ML-Yacc are loaded. Do
not use any locally defined types from the user declarations section of
the specification.

These requirements on the types in the {\tt \%term} declaration are not
a burden.
They force the types to be defined in another module,
which is a good idea since these types will
be used in the lexer module.
\item[{\tt \%pos}]
You must declare the type of position values using the {\tt \%pos} declaration.
The syntax is {\tt \%pos} $<$ML-type$>$.
This type MUST be the same type as that which is actually found in the lexer.
It cannot be polymorphic.

\end{description}
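
Taken together, the required declarations for a small hypothetical
grammar might read:
\begin{quote}
\begin{verbatim}
%name Calc
%pos int
%term NUM of int | PLUS | TIMES | EOF
%nonterm EXP of int | START of int
\end{verbatim}
\end{quote}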
\subsection{Optional ML-Yacc Declarations}
\label{optional-def}
\begin{description}
\item[{\tt \%arg}]
You may want each invocation of the entire parser to be parameterized
by a particular argument, such as the file-name of the input
being parsed in an invocation of the parser. The {\tt \%arg} declaration
allows you to specify such an argument.
(This is often cleaner than using ``global'' reference variables.)
The declaration
\begin{quote}

{\tt \%arg} (Any-ML-pattern) {\tt :} $<$ML-type$>$

\end{quote}
specifies the argument to the parser, as well as its type. For example:
\begin{quote}

{\tt \%arg (filename) : string}

\end{quote}

If {\tt \%arg} is not specified, it defaults to {\tt () : unit}.
\item[{\tt \%eop} and {\tt \%noshift}]
You should specify the set of
terminals that may follow the start
symbol, also called end-of-parse symbols, using the {\tt \%eop}
declaration. The {\tt \%eop} keyword should be followed by the list of
terminals. This is useful, for example, in an interactive system
where you want to force the evaluation of a statement before an
end-of-file (remember, a parser delays the execution of semantic
actions until a parse is successful).

ML-Yacc has no concept of an end-of-file. You must
define an end-of-file terminal (EOF, perhaps) in the
{\tt \%term} declaration.
You must declare terminals which cannot be shifted, such as
end-of-file, in the {\tt \%noshift} declaration. The
{\tt \%noshift} keyword should be followed by the list of non-shiftable
terminals. An error message will be printed if a non-shiftable terminal
is found on the right hand side of any rule, but ML-Yacc will not prevent
you from using such grammars.

It is important to emphasize that
\begin{em}
non-shiftable terminals must be declared.
\end{em}
The error-correcting parser may otherwise attempt to read past such terminals
while evaluating a correction to a syntax error. This may
confuse the lexer.
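
For example, an interactive system whose lexer produces {\tt SEMI} and
{\tt EOF} terminals (terminal names here are illustrative) might
declare:
\begin{quote}
\begin{verbatim}
%eop SEMI EOF
%noshift EOF
\end{verbatim}
\end{quote}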
\item[{\tt \%header}]
You may define code to head the functor \{parser name\}LrValsFun here. This
may be useful for adding additional parameter structures to the functor.
The functor must be parameterized by the Token structure, so
the declaration should always have the form:
\begin{quote}
\begin{verbatim}
%header (functor {parser name}LrValsFun(
structure Token : TOKEN
...)
)
\end{verbatim}
\end{quote}

\item[{\tt \%left}, {\tt \%right}, {\tt \%nonassoc}]
You should list the precedence declarations in order of increasing (tighter-binding)
precedence. Each precedence declaration consists
of a \% keyword specifying associativity followed by a list of terminals.
The keywords are {\tt \%left}, {\tt \%right}, and {\tt \%nonassoc},
standing for their respective associativities.
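
For instance, the usual arithmetic precedences, listed from loosest-
to tightest-binding (terminal names are illustrative):
\begin{quote}
\begin{verbatim}
%left PLUS MINUS
%left TIMES DIV
%right EXPON
\end{verbatim}
\end{quote}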
\item[{\tt \%nodefault}]
The {\tt \%nodefault} declaration suppresses the generation of default
reductions. If only one production can be reduced in a given state in
an LR table, it may be made the default action for the state. An incorrect
reduction will be caught later when the parser attempts to shift the lookahead
terminal which caused the reduction. ML-Yacc usually produces programs and
verbose files with default reductions. This saves a great deal of
space in representing the LR tables, but
sometimes it is useful for debugging and advanced
uses of the parser to suppress the generation of default reductions.
\item[{\tt \%pure}]
Include the {\tt \%pure} declaration if the semantic actions
are free of significant side-effects and always terminate.
\item[{\tt \%start}]
You may define the start symbol using
the {\tt \%start} declaration. Otherwise the nonterminal for the
first rule will be used as the start nonterminal.
The keyword {\tt \%start} should be followed by the name of the starting
nonterminal. This nonterminal should not be used on the right hand
side of any rules, to avoid conflicts between reducing to the start
symbol and shifting a terminal. ML-Yacc will not prevent you
from using such grammars, but it will print a warning message.
\item[{\tt \%verbose}]

Include the {\tt \%verbose} declaration to produce a verbose
description of the LALR parser. The name of this file is
the name of the specification file with a ``.desc'' appended to it.

This file has the following format:
\begin{enumerate}

\item A summary of errors found while generating the LALR tables.
\item A detailed description of all errors.
\item A description of the states of the parser. Each state
is preceded by a list of conflicts in the state.

\end{enumerate}
\end{description}

\subsection{Declarations for improving error-recovery}

These optional declarations improve error-recovery:

\begin{description}
\item[{\tt \%keyword}]
Specify all keywords in a grammar here. The {\tt \%keyword}
declaration should be followed by a list
of terminal names. Fixes involving keywords are generally dangerous;
they are prone to substantially altering the syntactic meaning
of the program. They are subject to a more rigorous parse check than
other fixes.

\item[{\tt \%prefer}]
List terminals to prefer for insertion after the {\tt \%prefer} keyword.
Corrections which insert a terminal on this list will be chosen over
other corrections, all other things being equal.
\item[{\tt \%subst}]
This declaration should be followed by a list of clauses of the
form \{terminal\} {\tt for} \{terminal\}, where items on the list are
separated using a {\tt |}. Substitution corrections on this list
will be chosen over all other corrections except preferred insertion
corrections (listed above), all other things being equal.
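
As an illustration (terminal names are made up), a grammar might mark
its keywords and guide the correction heuristics as follows:
\begin{quote}
\begin{verbatim}
%keyword LET IN END IF THEN ELSE
%prefer SEMI
%subst SEMI for COMMA | COMMA for SEMI
\end{verbatim}
\end{quote}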
\item[{\tt \%change}]
This is a generalization of {\tt \%prefer} and {\tt \%subst}.
It takes the following syntax:
\begin{quote}
${\it tokens}_{1a}$ \verb|->| ${\it tokens}_{1b}$ \verb+|+ ${\it tokens}_{2a}$ \verb|->| ${\it tokens}_{2b}$ {\it etc.}
\end{quote}
where each {\it tokens} is a (possibly empty) sequence of tokens. The
idea is that any instance of ${\it tokens}_{1a}$ can be ``corrected'' to
${\it tokens}_{1b}$, and so on. For example, to suggest that a good
error correction to try is inserting \verb|IN ID END| (which is useful for the
ML parser), write:
\begin{verbatim}
%change -> IN ID END
\end{verbatim}
\item[{\tt \%value}]
The error-correction algorithm may also insert terminals with values.
You must supply a value for such a terminal. The keyword
should be followed by a terminal and a piece of
code (enclosed in parentheses) that when evaluated supplies the value.
There must be a separate {\tt \%value} declaration for each terminal with
a value that you wish may be inserted or substituted in an error correction.
The code for the value is not evaluated until the parse is
successful.

Do not specify a {\tt \%value} for terminals without
values. This will result in a type error in the program produced by
ML-Yacc.
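
For example, if the terminal {\tt ID} carries a string value, a
placeholder value lets the error-corrector insert it (the terminal
name and value here are illustrative):
\begin{quote}
\begin{verbatim}
%value ID ("bogus")
\end{verbatim}
\end{quote}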
\end{description}

\subsection{Rules}

All rules are declared in the final section, after the last {\tt \%\%}
delimiter. A rule consists of a left hand side nonterminal, followed by
a colon, followed by a list of right hand side clauses.

The right hand side clauses should be separated by bars (``{\tt |}''). Each
clause consists of a list of nonterminal and terminal symbols, followed
by an optional {\tt \%prec} declaration, and then followed by the code to be
evaluated when the rule is reduced.

The optional {\tt \%prec} declaration consists of the keyword {\tt \%prec} followed by a
terminal whose precedence should be used as the precedence of the
rule.

The values of those symbols on the right hand side which have values are
available inside the code. Positions for all the symbols are also
available.
Each value has the general form \{symbol name\}\{n+1\}, where \{n\} is the
number of occurrences of the symbol to the left of the symbol. If
the symbol occurs only once in the rule, \{symbol name\} may also
be used.
The positions are given by \{symbol~name\}\{n+1\}left and
\{symbol~name\}\{n+1\}right, where \{n\} is defined as before.
The position for an empty right hand side of
a production is assumed to be the leftmost position of the lookahead
terminal which is causing the reduction. This position value is
available in {\tt defaultPos}.

The value to which the code evaluates is used as the value of the
nonterminal. The type of the value and the nonterminal must match.
The value is ignored if the nonterminal has no value, but is still
evaluated for side-effects.
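
For example, in a hypothetical grammar where {\tt EXP} and {\tt NUM}
carry {\tt int} values, the naming conventions work out as follows:
\begin{quote}
\begin{verbatim}
EXP : NUM               (NUM)
    | EXP PLUS EXP      (EXP1 + EXP2)
    | LPAREN EXP RPAREN (EXP)
\end{verbatim}
\end{quote}
In the second clause {\tt EXP} occurs twice on the right hand side, so
its values are named {\tt EXP1} and {\tt EXP2}; their positions are
available as {\tt EXP1left}, {\tt EXP1right}, {\tt EXP2left}, and
{\tt EXP2right}.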

\section{Producing files with ML-Yacc}

ML-Yacc may be used from the interactive system or built as a
stand-alone program which may be run from the Unix command line.
See the file {\bf README} in the mlyacc directory for directions
on installing ML-Yacc. We recommend that ML-Yacc be installed as
a stand-alone program.

If you are using the stand-alone version of ML-Yacc, invoke the
program ``sml-yacc'' with the name of the specification file.
If you are using ML-Yacc in the interactive system, load the file
``smlyacc.sml''. The end result is a structure ParseGen, with one
value parseGen in it. Apply parseGen to a string containing the
name of the specification file.

Two files will be created, one named by
attaching ``.sig'' to the name of the specification, the other named by
attaching ``.sml'' to the name of the specification.

\section{The lexical analyzer}

Let the name for
the parser given in the {\tt \%name} declaration be denoted by \{n\} and
the specification file name be denoted by \{spec name\}.
The parser generator creates a functor named \{n\}LrValsFun for
the values needed for a particular parser. This functor is placed
in \{spec name\}.sml. This
functor contains a structure
Tokens which allows you to construct terminals from the appropriate
values. The structure has a function for each terminal that takes a tuple
consisting of the value for the terminal (if there is any), a leftmost
position for the terminal, and a rightmost position for the terminal, and
constructs the terminal from these values.
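
As a sketch, for a grammar that declares {\tt NUM of int} and
{\tt PLUS} with {\tt int} positions (names are illustrative), a lexer
would construct terminals like this:
\begin{quote}
\begin{verbatim}
(* terminal with a value: (value, left position, right position) *)
val t1 = Tokens.NUM (42, 0, 2)

(* terminal without a value: (left position, right position) *)
val t2 = Tokens.PLUS (3, 4)
\end{verbatim}
\end{quote}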

A signature for the structure Tokens is created and placed in the ``.sig''
file created by ML-Yacc. This signature is \{n\}\_TOKENS,
where \{n\} is
the name given in the parser specification. A signature
\{n\}\_LRVALS is created for the structure produced by
applying \{n\}LrValsFun.

Use the signature \{n\}\_TOKENS to create a functor for the
lexical analyzer which takes the structure Tokens as an argument. The
signature \{n\}\_TOKENS
will not change unless the {\tt \%term} declaration in a
specification is altered by adding terminals or
changing the types of terminals. You do not need to recompile
the lexical analyzer functor each time the specification for
the parser is changed if the
signature \{n\}\_TOKENS does not change.

If you are using ML-Lex to create the lexical analyzer, you
can turn the lexer structure into a functor using the
{\tt \%header} declaration.
{\tt \%header} allows the user to define the header for a structure body.

If the name of the parser in the specification were Calc, you
would add this declaration to the specification for the lexical
analyzer:
\begin{quote}
\tt
\begin{verbatim}
%header (functor CalcLexFun(structure Tokens : Calc_TOKENS))
\end{verbatim}
\end{quote}

You must define the following in the user definitions section:
\begin{quote}
\tt
\begin{verbatim}
type pos
\end{verbatim}
\end{quote}
This is the type of position values for terminals. This type
must be the same as the one declared in the specification for
the grammar. Note, however, that this type is not available
in the Tokens structure that parameterizes the lexer functor.

You must include the following code in the user definitions section of
the ML-Lex specification:
\begin{quote}
\tt
\begin{verbatim}
type svalue = Tokens.svalue
type ('a,'b) token = ('a,'b) Tokens.token
type lexresult = (svalue,pos) token
\end{verbatim}
\end{quote}

These types are used to give lexers signatures.

726 You may use a lexer constructed using ML-Lex with the {\tt \%arg}
727 declaration, but you must follow special instructions for tying the parser
728 and lexer together.
729
\section{Creating the parser}
\label{create-parser}
Let the name of the grammar specification file be denoted by
\{grammar\} and the name of the lexer specification file be
denoted by \{lexer\} (e.g. in our calculator example these would
stand for calc.grm and calc.lex, respectively).
Let the parser name in the specification be represented by \{n\}
(e.g. Calc in our calculator example).

To construct a parser, do the following:
\begin{enumerate}
\item In the appropriate CM description file (e.g. for your main
program or one of its subgroups or libraries), include the lines:
\begin{quote}
\begin{verbatim}
ml-yacc-lib.cm
{lexer}
{grammar}
\end{verbatim}
\end{quote}
This will cause ML-Yacc to be run on \{grammar\}, producing source files
\{grammar\}.sig and \{grammar\}.sml, and ML-Lex to be run on
\{lexer\}, producing a source file \{lexer\}.sml. Then these files
will be compiled after loading the necessary signatures and modules
from the ML-Yacc library as specified by {\tt ml-yacc-lib.cm}.
\item Apply functors to create the parser:
\begin{quote}
\begin{verbatim}
structure {n}LrVals =
   {n}LrValsFun(structure Token = LrParser.Token)
structure {n}Lex =
   {n}LexFun(structure Tokens = {n}LrVals.Tokens)
structure {n}Parser =
   Join(structure ParserData = {n}LrVals.ParserData
        structure Lex = {n}Lex
        structure LrParser = LrParser)
\end{verbatim}
\end{quote}
If the lexer was created using the {\tt \%arg} declaration in ML-Lex,
the definition of \{n\}Parser must be changed to use another functor
called JoinWithArg:
\begin{quote}
\begin{verbatim}
structure {n}Parser =
   JoinWithArg
      (structure ParserData = {n}LrVals.ParserData
       structure Lex = {n}Lex
       structure LrParser = LrParser)
\end{verbatim}
\end{quote}
\end{enumerate}

The following outline summarizes this process:
\begin{quote}
\begin{verbatim}
(* available at top level *)

signature TOKEN
signature LR_TABLE
signature STREAM
signature LR_PARSER
signature PARSER_DATA
structure LrParser : LR_PARSER

(* printed out in .sig file created by parser generator: *)

signature {n}_TOKENS =
  sig
    structure Token : TOKEN
    type svalue
    val PLUS : 'pos * 'pos ->
                 (svalue,'pos) Token.token
    val INTLIT : int * 'pos * 'pos ->
                 (svalue,'pos) Token.token
    ...
  end

signature {n}_LRVALS =
  sig
    structure Tokens : {n}_TOKENS
    structure ParserData : PARSER_DATA
    sharing ParserData.Token = Tokens.Token
    sharing type ParserData.svalue = Tokens.svalue
  end

(* printed out by lexer generator: *)

functor {n}LexFun(structure Tokens : {n}_TOKENS) =
  struct
    ...
  end

(* printed out in .sml file created by parser generator: *)

functor {n}LrValsFun(structure Token : TOKEN) =
  struct
    structure ParserData =
      struct
        structure Token = Token

        (* code in header section of specification *)

        structure Header = ...
        type svalue = ...
        type result = ...
        type pos = ...
        structure Actions = ...
        structure EC = ...
        val table = ...
      end

    structure Tokens : {n}_TOKENS =
      struct
        structure Token = ParserData.Token
        type svalue = ...
        fun PLUS(p1,p2) = ...
        fun INTLIT(i,p1,p2) = ...
      end

  end

(* to be done by the user: *)

structure {n}LrVals =
   {n}LrValsFun(structure Token = LrParser.Token)

structure {n}Lex =
   {n}LexFun(structure Tokens = {n}LrVals.Tokens)

structure {n}Parser =
   Join(structure Lex = {n}Lex
        structure ParserData = {n}LrVals.ParserData
        structure LrParser = LrParser)
\end{verbatim}
\end{quote}

\section{Using the parser}
\subsection{Parser Structure Signatures}
The final structure created will have the signature PARSER:
\begin{quote}
\begin{verbatim}
signature PARSER =
  sig
    structure Token : TOKEN
    structure Stream : STREAM
    exception ParseError

    type pos     (* pos is the type of line numbers *)
    type result  (* value returned by the parser *)
    type arg     (* type of the user-supplied argument *)
    type svalue  (* the types of semantic values *)

    val makeLexer : (int -> string) ->
                      (svalue,pos) Token.token Stream.stream
    val parse :
        int * ((svalue,pos) Token.token Stream.stream) *
        (string * pos * pos -> unit) * arg ->
          result * (svalue,pos) Token.token Stream.stream
    val sameToken :
        (svalue,pos) Token.token * (svalue,pos) Token.token ->
          bool
  end
\end{verbatim}
\end{quote}
or the signature ARG\_PARSER if you used {\tt \%arg} to create the
lexer. ARG\_PARSER differs from PARSER in that it has an additional
type {\tt lexarg} and a different type for {\tt makeLexer}:
\begin{quote}
\begin{verbatim}
type lexarg
val makeLexer : (int -> string) -> lexarg ->
                  (svalue,pos) token stream
\end{verbatim}
\end{quote}

The signature STREAM (providing lazy streams) is:
\begin{quote}
\begin{verbatim}
signature STREAM =
  sig
    type 'a stream
    val streamify : (unit -> 'a) -> 'a stream
    val cons : 'a * 'a stream -> 'a stream
    val get : 'a stream -> 'a * 'a stream
  end
\end{verbatim}
\end{quote}

\subsection{Using the parser structure}

The parser structure converts the lexing function produced by
ML-Lex into a function which creates a lazy stream of tokens. The
function {\tt makeLexer} takes the same values as the corresponding
{\tt makeLexer} created by ML-Lex, but returns a stream of tokens
instead of a function which yields tokens.

The function parse takes the token stream and some other arguments that
are described below and parses the token stream. It returns a pair composed
of the value associated with the start symbol and the rest of
the token stream. The rest of the token stream includes the
end-of-parse symbol which caused the reduction of some rule
to the start symbol. The function parse raises the
exception ParseError if a syntax error occurs which it cannot fix.

The lazy stream is implemented by the {\tt Stream} structure.
The function {\tt streamify} converts a conventional implementation
of a stream into a lazy stream. In a conventional implementation
of a stream, a stream consists of a position in a list of
values. Fetching a value from a stream returns the
value associated with the position and updates the position to
the next element in the list of values. The fetch is a side-effecting
operation. In a lazy stream, a fetch returns a value and a new
stream, without a side-effect which updates the position value.
This means that a stream can be repeatedly re-evaluated without
affecting the values that it returns. If $f$ is the function
that is passed to {\tt streamify}, $f$ is called only as many
times as necessary to construct the portion of the list of values
that is actually used.
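
The following sketch illustrates this property (it assumes the {\tt
Stream} structure described above and a side-effecting character
reader; {\tt valOf} is used only to keep the sketch short):
\begin{quote}
\begin{verbatim}
val s = Stream.streamify
          (fn () => valOf (TextIO.input1 TextIO.stdIn))
val (c1, s') = Stream.get s
val (c2, _)  = Stream.get s
(* c1 = c2: re-evaluating s does not consume further input *)
\end{verbatim}
\end{quote}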

Parse also takes an integer giving the maximum amount of lookahead permitted
for the error-correcting parse, a function to print error messages,
and a value of type arg. The maximum amount of lookahead for interactive
systems should be zero. In this case, no attempt is made to correct any
syntax errors. For non-interactive systems, try 15. The
function to print error messages takes a tuple of values consisting
of the left and right positions of the terminal which caused the error
and an error message. If the {\tt \%arg} declaration is not used, the
value of type arg should be a value of type unit.

The function sameToken can be used to see if two tokens
denote the same terminal, regardless of any values that the
tokens carry. It is useful if you have multiple end-of-parse
symbols and must check which end-of-parse symbol has been left on the
front of the token stream.

The types have the following meanings. The type {\tt arg} is the type
of the additional argument to the parser, which is specified by the
{\tt \%arg} declaration in the ML-Yacc specification. The type
{\tt lexarg} is the optional argument to lexers, and is specified by
the {\tt \%arg} declaration in an ML-Lex specification. The type {\tt pos}
is the type of line numbers, and is specified by the {\tt \%pos} declaration
in an ML-Yacc specification and defined in the user declarations
section of the ML-Lex specification. The type {\tt result} is
the type associated with the start symbol in the ML-Yacc specification.

\section{Examples}

See the directory examples for examples of parsers constructed using
ML-Yacc. Here is a small sample parser and lexer for an interactive
calculator, from the directory examples/calc, along with code for
creating a parsing function. The calculator reads one or more
expressions from the standard input, evaluates the expressions, and
prints their values. Expressions should be separated by semicolons,
and may also be ended by using an end-of-file. This shows how to
construct an interactive parser which reads a top-level declaration
and processes the declaration before reading the next top-level
declaration.

\subsection{Sample Grammar}
\begin{tt}
\begin{verbatim}
(* Sample interactive calculator for ML-Yacc *)

fun lookup "bogus" = 10000
  | lookup s = 0

%%

%eop EOF SEMI

(* %pos declares the type of positions for terminals.
   Each symbol has an associated left and right position. *)

%pos int

%left SUB PLUS
%left TIMES DIV
%right CARAT

%term ID of string | NUM of int | PLUS | TIMES | PRINT |
      SEMI | EOF | CARAT | DIV | SUB
%nonterm EXP of int | START of int option

%name Calc

%subst PRINT for ID
%prefer PLUS TIMES DIV SUB
%keyword PRINT SEMI

%noshift EOF
%value ID ("bogus")
%nodefault
%verbose
%%

(* the parser returns the value associated with the expression *)

START : PRINT EXP (print (Int.toString EXP);
                   print "\n";
                   TextIO.flushOut TextIO.stdOut;
                   SOME EXP)
      | EXP (SOME EXP)
      | (NONE)
EXP : NUM             (NUM)
    | ID              (lookup ID)
    | EXP PLUS EXP    (EXP1+EXP2)
    | EXP TIMES EXP   (EXP1*EXP2)
    | EXP DIV EXP     (EXP1 div EXP2)
    | EXP SUB EXP     (EXP1-EXP2)
    | EXP CARAT EXP   (let fun e (m,0) = 1
                             | e (m,l) = m*e(m,l-1)
                       in e (EXP1,EXP2)
                       end)
\end{verbatim}
\end{tt}
\subsection{Sample Lexer}
\begin{tt}
\begin{verbatim}
structure Tokens = Tokens

type pos = int
type svalue = Tokens.svalue
type ('a,'b) token = ('a,'b) Tokens.token
type lexresult = (svalue,pos) token

val pos = ref 0
val eof = fn () => Tokens.EOF(!pos,!pos)
val error = fn (e,l : int,_) =>
              TextIO.output(TextIO.stdOut,
                 "line " ^ (Int.toString l) ^ ": " ^ e ^ "\n")
%%
%header (functor CalcLexFun(structure Tokens: Calc_TOKENS));
alpha=[A-Za-z];
digit=[0-9];
ws = [\ \t];
%%
\n       => (pos := (!pos) + 1; lex());
{ws}+    => (lex());
{digit}+ => (Tokens.NUM
               (List.foldl (fn (a,r) => ord a - ord #"0" + 10*r)
                           0 (explode yytext),
                !pos,!pos));
"+"      => (Tokens.PLUS(!pos,!pos));
"*"      => (Tokens.TIMES(!pos,!pos));
";"      => (Tokens.SEMI(!pos,!pos));
{alpha}+ => (if yytext="print"
             then Tokens.PRINT(!pos,!pos)
             else Tokens.ID(yytext,!pos,!pos));
"-"      => (Tokens.SUB(!pos,!pos));
"^"      => (Tokens.CARAT(!pos,!pos));
"/"      => (Tokens.DIV(!pos,!pos));
"."      => (error ("ignoring bad character "^yytext,!pos,!pos);
             lex());
\end{verbatim}
\end{tt}
\subsection{Top-level code}

You must follow the instructions in Section~\ref{create-parser}
to create the parser and lexer functors and load them. After you have
done this, you must then apply the functors to produce the {\tt CalcParser}
structure. The code for doing this is shown below.
\begin{quote}
\begin{verbatim}
structure CalcLrVals =
   CalcLrValsFun(structure Token = LrParser.Token)

structure CalcLex =
   CalcLexFun(structure Tokens = CalcLrVals.Tokens);

structure CalcParser =
   Join(structure LrParser = LrParser
        structure ParserData = CalcLrVals.ParserData
        structure Lex = CalcLex)
\end{verbatim}
\end{quote}

Now we need a function which, given a lexer, invokes the parser. The
function {\tt invoke} does this.

\begin{quote}
\begin{verbatim}
fun invoke lexstream =
    let fun print_error (s,i:int,_) =
           TextIO.output(TextIO.stdOut,
              "Error, line " ^ (Int.toString i) ^ ", " ^ s ^ "\n")
     in CalcParser.parse(0,lexstream,print_error,())
    end
\end{verbatim}
\end{quote}

Finally, we need a function which can read one or more expressions from
the standard input. The function {\tt parse}, shown below, does this.
It runs the calculator on the standard input and terminates
when an end-of-file is encountered.

\begin{quote}
\begin{verbatim}
fun parse () =
    let val lexer = CalcParser.makeLexer
                      (fn _ => TextIO.inputLine TextIO.stdIn)
        val dummyEOF = CalcLrVals.Tokens.EOF(0,0)
        val dummySEMI = CalcLrVals.Tokens.SEMI(0,0)
        fun loop lexer =
            let val (result,lexer) = invoke lexer
                val (nextToken,lexer) = CalcParser.Stream.get lexer
             in case result
                  of SOME r =>
                       TextIO.output(TextIO.stdOut,
                          "result = " ^ (Int.toString r) ^ "\n")
                   | NONE => ();
                if CalcParser.sameToken(nextToken,dummyEOF) then ()
                else loop lexer
            end
     in loop lexer
    end
\end{verbatim}
\end{quote}

\section{Signatures}

This section contains signatures used by ML-Yacc for structures in
the file base.sml, functors and structures that it generates, and for
the signatures of lexer structures supplied by you.

\subsection{Parsing structure signatures}

\begin{quote}
\begin{verbatim}
(* STREAM: signature for a lazy stream. *)

signature STREAM =
  sig
    type 'a stream
    val streamify : (unit -> 'a) -> 'a stream
    val cons : 'a * 'a stream -> 'a stream
    val get : 'a stream -> 'a * 'a stream
  end

(* LR_TABLE: signature for an LR table. *)

signature LR_TABLE =
  sig
    datatype ('a,'b) pairlist
      = EMPTY
      | PAIR of 'a * 'b * ('a,'b) pairlist
    datatype state = STATE of int
    datatype term = T of int
    datatype nonterm = NT of int
    datatype action = SHIFT of state
                    | REDUCE of int
                    | ACCEPT
                    | ERROR
    type table

    val numStates : table -> int
    val numRules : table -> int
    val describeActions : table -> state ->
                            (term,action) pairlist * action
    val describeGoto : table -> state ->
                         (nonterm,state) pairlist
    val action : table -> state * term -> action
    val goto : table -> state * nonterm -> state
    val initialState : table -> state
    exception Goto of state * nonterm

    val mkLrTable :
         {actions : ((term,action) pairlist * action) array,
          gotos : (nonterm,state) pairlist array,
          numStates : int, numRules : int,
          initialState : state} -> table
  end

(* TOKEN: signature for the internal structure of a token. *)

signature TOKEN =
  sig
    structure LrTable : LR_TABLE
    datatype ('a,'b) token = TOKEN of LrTable.term *
                                        ('a * 'b * 'b)
    val sameToken : ('a,'b) token * ('a,'b) token -> bool
  end

(* LR_PARSER: signature for a polymorphic LR parser *)

signature LR_PARSER =
  sig
    structure Stream : STREAM
    structure LrTable : LR_TABLE
    structure Token : TOKEN

    sharing LrTable = Token.LrTable

    exception ParseError

    val parse :
        {table : LrTable.table,
         lexer : ('b,'c) Token.token Stream.stream,
         arg : 'arg,
         saction : int *
                   'c *
                   (LrTable.state * ('b * 'c * 'c)) list *
                   'arg ->
                     LrTable.nonterm *
                     ('b * 'c * 'c) *
                     ((LrTable.state * ('b * 'c * 'c)) list),
         void : 'b,
         ec : {is_keyword : LrTable.term -> bool,
               noShift : LrTable.term -> bool,
               preferred_subst : LrTable.term -> LrTable.term list,
               preferred_insert : LrTable.term -> bool,
               errtermvalue : LrTable.term -> 'b,
               showTerminal : LrTable.term -> string,
               terms : LrTable.term list,
               error : string * 'c * 'c -> unit
              },
         lookahead : int (* max amount of lookahead used in
                          * error correction *)
        } -> 'b * (('b,'c) Token.token Stream.stream)
  end
\end{verbatim}
\end{quote}

\subsection{Lexers}

Lexers for use with ML-Yacc's output must match one of these signatures.

\begin{quote}
\begin{verbatim}
signature LEXER =
  sig
    structure UserDeclarations :
      sig
        type ('a,'b) token
        type pos
        type svalue
      end
    val makeLexer : (int -> string) -> unit ->
         (UserDeclarations.svalue, UserDeclarations.pos)
           UserDeclarations.token
  end

(* ARG_LEXER: the %arg option of ML-Lex allows users to
   produce lexers which also take an argument before
   yielding a function from unit to a token.
*)

signature ARG_LEXER =
  sig
    structure UserDeclarations :
      sig
        type ('a,'b) token
        type pos
        type svalue
        type arg
      end
    val makeLexer :
        (int -> string) ->
        UserDeclarations.arg ->
        unit ->
          (UserDeclarations.svalue, UserDeclarations.pos)
            UserDeclarations.token
  end
\end{verbatim}
\end{quote}

\subsection{Signatures for the functor produced by ML-Yacc}

The following signature is used in signatures generated by
ML-Yacc:
\begin{quote}
\begin{verbatim}
(* PARSER_DATA: the signature of ParserData structures in the
   {n}LrValsFun functor produced by ML-Yacc. All such
   structures match this signature. *)

signature PARSER_DATA =
  sig
    type pos     (* the type of line numbers *)
    type svalue  (* the type of semantic values *)
    type arg     (* the type of the user-supplied *)
                 (* argument to the parser *)
    type result

    structure LrTable : LR_TABLE
    structure Token : TOKEN
    sharing Token.LrTable = LrTable

    structure Actions :
      sig
        val actions : int * pos *
              (LrTable.state * (svalue * pos * pos)) list * arg ->
                LrTable.nonterm * (svalue * pos * pos) *
                ((LrTable.state * (svalue * pos * pos)) list)
        val void : svalue
        val extract : svalue -> result
      end

    (* structure EC contains information used to improve
       error recovery in an error-correcting parser *)

    structure EC :
      sig
        val is_keyword : LrTable.term -> bool
        val noShift : LrTable.term -> bool
        val preferred_subst : LrTable.term -> LrTable.term list
        val preferred_insert : LrTable.term -> bool
        val errtermvalue : LrTable.term -> svalue
        val showTerminal : LrTable.term -> string
        val terms : LrTable.term list
      end

    (* table is the LR table for the parser *)

    val table : LrTable.table
  end
\end{verbatim}
\end{quote}

ML-Yacc generates these two signatures:
\begin{quote}
\begin{verbatim}
(* printed out in .sig file created by parser generator: *)

signature {n}_TOKENS =
  sig
    type ('a,'b) token
    type svalue
    ...
  end

signature {n}_LRVALS =
  sig
    structure Tokens : {n}_TOKENS
    structure ParserData : PARSER_DATA
    sharing type ParserData.Token.token = Tokens.token
    sharing type ParserData.svalue = Tokens.svalue
  end
\end{verbatim}
\end{quote}
\subsection{User parser signatures}

Parsers created by applying the Join functor will match this signature:
\begin{quote}
\begin{verbatim}
signature PARSER =
  sig
    structure Token : TOKEN
    structure Stream : STREAM
    exception ParseError

    type pos     (* pos is the type of line numbers *)
    type result  (* value returned by the parser *)
    type arg     (* type of the user-supplied argument *)
    type svalue  (* the types of semantic values *)

    val makeLexer : (int -> string) ->
                      (svalue,pos) Token.token Stream.stream

    val parse :
        int * ((svalue,pos) Token.token Stream.stream) *
        (string * pos * pos -> unit) * arg ->
          result * (svalue,pos) Token.token Stream.stream
    val sameToken :
        (svalue,pos) Token.token * (svalue,pos) Token.token ->
          bool
  end
\end{verbatim}
\end{quote}
Parsers created by applying the JoinWithArg functor will match this
signature:
\begin{quote}
\begin{verbatim}
signature ARG_PARSER =
  sig
    structure Token : TOKEN
    structure Stream : STREAM
    exception ParseError

    type arg
    type lexarg
    type pos
    type result
    type svalue

    val makeLexer : (int -> string) -> lexarg ->
                      (svalue,pos) Token.token Stream.stream
    val parse : int *
                ((svalue,pos) Token.token Stream.stream) *
                (string * pos * pos -> unit) *
                arg ->
                  result * (svalue,pos) Token.token Stream.stream
    val sameToken :
        (svalue,pos) Token.token * (svalue,pos) Token.token ->
          bool
  end
\end{verbatim}
\end{quote}

\section{Sharing constraints}

Let the name of the parser be denoted by \{n\}. If
you have not created a lexer which takes an argument, and
you have followed the directions given earlier for creating the parser, you
will have the following structures with the following signatures:
\begin{quote}
\begin{verbatim}
(* always present *)

signature TOKEN
signature LR_TABLE
signature STREAM
signature LR_PARSER
signature PARSER_DATA
structure LrParser : LR_PARSER

(* signatures generated by ML-Yacc *)

signature {n}_TOKENS
signature {n}_LRVALS

(* structures created by you *)

structure {n}LrVals : {n}_LRVALS
structure Lex : LEXER
structure {n}Parser : PARSER
\end{verbatim}
\end{quote}

The following sharing constraints will exist:
\begin{quote}
\begin{verbatim}
sharing {n}Parser.Token = LrParser.Token =
        {n}LrVals.ParserData.Token
sharing {n}Parser.Stream = LrParser.Stream

sharing type {n}Parser.arg = {n}LrVals.ParserData.arg
sharing type {n}Parser.result = {n}LrVals.ParserData.result
sharing type {n}Parser.pos = {n}LrVals.ParserData.pos =
             Lex.UserDeclarations.pos
sharing type {n}Parser.svalue = {n}LrVals.ParserData.svalue =
             {n}LrVals.Tokens.svalue = Lex.UserDeclarations.svalue
sharing type {n}Parser.Token.token =
             {n}LrVals.ParserData.Token.token =
             LrParser.Token.token =
             Lex.UserDeclarations.token

sharing {n}LrVals.LrTable = LrParser.LrTable
\end{verbatim}
\end{quote}

If you used a lexer which takes an argument, then you will
have:
\begin{quote}
\begin{verbatim}
structure Lex : ARG_LEXER
structure {n}Parser : ARG_PARSER

(* additional sharing constraint *)

sharing type {n}Parser.lexarg = Lex.UserDeclarations.arg
\end{verbatim}
\end{quote}

\section{Hints}
\subsection{Multiple start symbols}
To have multiple start symbols, define a dummy token for each
start symbol. Then define a start symbol which derives the
multiple start symbols with dummy tokens placed in front of
them. When you start the parser you must place a dummy token
on the front of the lexer stream to select a start symbol
from which to begin parsing.

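As a sketch, a grammar with separate expression and declaration entry
points might be arranged like this (the token, nonterminal, and
constructor names here are hypothetical):
\begin{quote}
\begin{verbatim}
%term START_EXP | START_DECL | ...
%nonterm START of result | EXP of ... | DECL of ...

START : START_EXP EXP   (ExpResult EXP)
      | START_DECL DECL (DeclResult DECL)
\end{verbatim}
\end{quote}
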
Assuming that you have followed the naming conventions used before,
create the lexer using the makeLexer function in the \{n\}Parser structure.
Then, place the dummy token on the front of the lexer:
\begin{quote}
\begin{verbatim}
val dummyLexer =
    {n}Parser.Stream.cons
       ({n}LrVals.Tokens.{dummy token name}
           ({dummy lineno},{dummy lineno}),
        lexer)
\end{verbatim}
\end{quote}
You have to pass a Tokens structure to the lexer. This Tokens structure
contains functions which construct tokens from values and line numbers.
So to create your dummy token just apply the appropriate token constructor
function from this Tokens structure to a value (if there is one) and the
line numbers. This is exactly what you do in the lexer to construct tokens.

Then you must place the dummy token on the front of your lex stream.
The structure \{n\}Parser contains a structure Stream which implements
lazy streams. So you just cons the dummy token onto the stream returned
by makeLexer.
\subsection{Functorizing things further}

You may wish to functorize things even further. Two possibilities
are turning the lexer and parser structures into closed functors,
that is, functors which do not refer to types or values defined
outside their body or outside their parameter structures (except
for pervasive types and values), and creating a functor which
encapsulates the code necessary to invoke the parser.

Use the {\tt \%header} declarations in ML-Lex and ML-Yacc to create
closed functors. See section~\ref{optional-def} of this manual
and section 4 of the manual for ML-Lex for complete descriptions of these
declarations. If you do this, you should also parameterize these
structures by the types of line numbers. The type will be an
abstract type, so you will also need to define all the valid
operations on the type. The signature INTERFACE, defined below,
shows one possible signature for a structure defining the line
number type and associated operations.

If you wish to encapsulate the code necessary to invoke the
parser, your functor will generally have the form:
\begin{quote}
\begin{verbatim}
functor Encapsulate(
    structure Parser : PARSER
    structure Interface : INTERFACE
    sharing type Parser.arg = Interface.arg
    sharing type Parser.pos = Interface.pos
    sharing type Parser.result = ...
    structure Tokens : {parser name}_TOKENS
    sharing type Tokens.token = Parser.Token.token
    sharing type Tokens.svalue = Parser.svalue) =
  struct
    ...
  end
\end{verbatim}
\end{quote}

The signature INTERFACE, defined below, is a possible signature for
a structure defining the types of line numbers and arguments (types
pos and arg, respectively) along with operations for them. You need
this structure because these types will be abstract types inside the
body of your functor.
\begin{quote}
\begin{verbatim}
signature INTERFACE =
  sig
    type pos
    val line : pos ref
    val reset : unit -> unit
    val next : unit -> unit
    val error : string * pos * pos -> unit

    type arg
    val nothing : arg
  end
\end{verbatim}
\end{quote}

The directory example/fol contains a sample parser in which
the code for tying together the lexer and parser has been
encapsulated in a functor.

\section{Acknowledgements}

Nick Rothwell wrote an SLR table generator in 1988 which inspired the
initial work on an ML parser generator. Bruce Duba and David
MacQueen made useful suggestions about the design of the error-correcting
parser. Thanks go to all the users at Carnegie Mellon who beta-tested
this version. Their comments and questions led to the creation of
this manual and helped improve it.

\section{Bugs}

There is a slight difference in syntax between ML-Lex and ML-Yacc.
In ML-Lex, semantic actions must be followed by a semicolon, but
in ML-Yacc semantic actions cannot be followed by a semicolon.
The syntax should be the same. ML-Lex also produces structures with
two different signatures, but it should produce structures with just
one signature. This would simplify some things.

\begin{thebibliography}{9}

\bibitem{bf} M. Burke and G. Fisher, ``A Practical Method for LR and
LL Syntactic Error Diagnosis and Recovery'', {\em ACM Transactions
on Programming Languages and Systems}, Vol. 9, No. 2, April 1987,
pp. 164--167.
\bibitem{ahu} A. Aho, R. Sethi, J. Ullman, {\em Compilers: Principles,
Techniques, and Tools}, Addison-Wesley, Reading, MA, 1986.

\end{thebibliography}

\end{document}