\documentstyle{article}
\title{ ML-Yacc User's Manual \\
Version 2.4
}
\author{ David R. Tarditi$^1$\\
Andrew W. Appel$^2$\\
\\
$^1$Microsoft Research \\
\\
$^2$Department of Computer Science \\
Princeton University \\
Princeton, NJ 08544
}
\date{April 24, 2000}

\begin{document}
\maketitle
\begin{center}
(c) 1989, 1990, 1991, 1994 Andrew W. Appel, David R. Tarditi
\end{center}

{\bf
This software comes with ABSOLUTELY NO WARRANTY. It is subject only to
the terms of the ML-Yacc NOTICE, LICENSE, and DISCLAIMER (in the
file COPYRIGHT distributed with this software).
}

New in this version: Improved error correction directive \verb|%change|
that allows multi-token insertions, deletions, and substitutions.
Explanation of how to build a parser (Section 5) and the Calc example
(Section 7) revised for SML/NJ Version 110 and the use of CM.

\newpage
\tableofcontents
\newpage

\section{Introduction}
\subsection{General}
ML-Yacc is a parser generator for Standard ML modeled after the
Yacc parser generator. Like Yacc, it generates parsers for LALR languages
and has a similar syntax. The generated parsers use a different algorithm
for recovering from syntax errors than parsers generated by Yacc.
The algorithm is a partial implementation of an algorithm described in \cite{bf}.
A parser tries to recover from a syntax error
by making a single token insertion, deletion, or
substitution near the point in the input stream at which the error
was detected. The parsers delay the evaluation of semantic actions until
parses are completed successfully. This makes it possible for
parsers to recover from syntax errors that occur before the point
of error detection, but it does prevent the parsers from
affecting lexers in any significant way. The parsers
can insert tokens with values and substitute tokens with values
for other tokens. All symbols carry left and right position values
which are available to semantic actions and are used in
syntactic error messages.

ML-Yacc uses context-free grammars to specify the syntax of languages to
be parsed. See \cite{ahu} for definitions and information on context-free
grammars and LR parsing. We briefly review some terminology here. A
context-free grammar is defined by a set of terminals $T$, a set of
nonterminals $NT$, a set of productions $P$, and a start
nonterminal $S$.
Terminals are interchangeably referred to as tokens. The terminal
and nonterminal sets are assumed to be disjoint. The set of symbols is the
union of the nonterminal and terminal sets. We use lower case
Greek letters to denote a string of symbols. We use upper case
Roman letters near the beginning of the alphabet to denote nonterminals.
Each production gives a
derivation of a string of symbols from a nonterminal, which we will
write as $A \rightarrow \alpha$. We define a relation between strings of
symbols $\alpha$ and $\beta$, written $\alpha \vdash \beta$ and read
as $\alpha$ derives $\beta$, if and only if $\alpha = \delta A \gamma$,
$\beta = \delta \phi \gamma$, and
there exists some production $A \rightarrow \phi$. We write the
transitive closure of this relation as
$\vdash_*$. We say that a string of terminals $\alpha$ is a valid sentence
of the language, {\em i.e.} it is derivable, if
$S \vdash_* \alpha$, where $S$ is the start symbol. The sequence of derivations is often
visualized as a parse tree.

ML-Yacc uses an attribute grammar scheme with synthesized attributes.
Each symbol in the grammar may have a value (i.e. attribute) associated
with it. Each production has a semantic action associated with it.
A production with a semantic action is called a rule.
Parsers perform bottom-up, left-to-right evaluations of parse trees using semantic
actions to compute values as they do so. Given a production
$P = A \rightarrow \alpha$, the corresponding semantic action is
used to compute a value for $A$ from the values of the symbols in $\alpha$.
If $A$ has no value, the semantic action is still evaluated but the value is ignored.
Each parse returns the value associated with the start symbol $S$ of the
grammar. A parse returns a nullary value if the start symbol does not carry a value.

The synthesized attribute scheme can be adapted easily to inherited
attributes. An inherited attribute is a value which propagates from
a nonterminal to the symbols produced by the nonterminal according to
some rule. Since functions are values in ML,
the semantic actions for the derived symbols
can return functions which take the
inherited value as an argument.
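
For example (a sketch in plain ML, with made-up names), a semantic
action can synthesize a function that receives the inherited value
later:
\begin{quote}
\begin{verbatim}
(* inherited attribute: an environment of bindings *)
type env = (string * int) list

(* action for a single declaration: synthesize env -> env *)
fun decAction (name, v) = fn inherited => (name, v) :: inherited

(* action for a declaration sequence: compose the functions
   synthesized for the subcomponents *)
fun seqAction (d1, d2) = fn inherited => d2 (d1 inherited)

(* the inherited value is supplied when the result is applied *)
val finalEnv = seqAction (decAction ("x", 3), decAction ("y", 4)) []
\end{verbatim}
\end{quote}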

\subsection{Modules}
ML-Yacc uses the ML modules facility to specify the interface between
a parser that it generates and a lexical analyzer that must be supplied
by you. It also uses the ML modules facility to factor out
a set of modules that are common to every generated parser.
These common modules include a parsing structure, which contains
an error-correcting LR parser\footnote{A plain LR parser is also
available.}, an LR table structure, and a structure
which defines the representation of terminals. ML-Yacc produces
a functor for a particular parser parameterized by the LR table
structure and the representation of terminals. This functor
contains values specific to the parser, such as the
LR table for the parser\footnote{The LR table is a value. The
LR table structure defines an abstract LR table type.}, the
semantic actions for the parser, and a structure containing
the terminals for the parser. ML-Yacc produces a signature
for the structure produced by applying this functor
and another signature for the structure containing the terminals for
the parser. You must
supply a functor for the lexing module parameterized by this
structure.

Figure 1 is a dependency diagram of the modules that summarizes this
information. A module at the head of an arrow depends
on the module at the tail.

\begin{figure}
\begin{tabular}{|rcl|}
\hline
parsing structure & $\longrightarrow$ & values for a particular parser\\
values for a particular parser & $\longrightarrow$ & lexical analyzer\\
parsing structure, & $\longrightarrow$ & particular parser\\
values for a particular parser, & & \\
lexical analyzer & & \\
\hline
\end{tabular}
\caption{Module Dependencies}
\end{figure}
\subsection{Error Recovery}

The error recovery algorithm is able to accurately recover from many
single token syntax errors. It tries to make a single token
correction at the token in the input stream at which the syntax error
was detected and at any of the 15 tokens\footnote{An arbitrary number,
chosen because numbers above this do not seem to improve error
correction much.} before that token. The algorithm checks corrections
before the point of error detection because a syntax error is often
not detected until several tokens beyond the token which caused the
error.\footnote{An LR parser detects a syntax error as soon as
possible, but this does not necessarily mean that the token at which
the error was detected caused the error.}

The algorithm works by trying corrections at each
of the 16 tokens up to and including the token at which the
error was detected. At each token in the input stream, it
will try deleting the token, substituting other tokens for the
token, or inserting some other token before the token.

The algorithm uses a parse check to evaluate corrections. A parse
check measures how far a correction allows a parser to
parse without encountering a syntax error.
You pass the parser an upper bound on how many tokens beyond the error
point it may read while doing a parse check. This allows
you to control the amount of lookahead that a parser reads
for different kinds of systems. For an interactive system, you
should set the lookahead to zero. Otherwise, a parser may hang
waiting for input in the case of a syntax error. If the lookahead
is zero, no syntax errors will be corrected. For a batch system,
you should set the lookahead to 15.

The algorithm selects the set of corrections which allows the parse
to proceed the farthest
and parse through at least the error token. It then removes those
corrections involving keywords which do not meet a longer minimum
parse check. If more than one correction is possible after this,
it uses a simple heuristic priority scheme to order the corrections,
and then arbitrarily chooses one of the corrections with the highest priority.
You have some control over the priority scheme: you can
name a set of preferred insertions and a set of preferred substitutions.
The priorities for corrections, ordered from highest to lowest
priority, are
preferred insertions, preferred substitutions, insertions, deletions,
and substitutions.

The error recovery algorithm is guaranteed to terminate since it always
selects fixes which parse through the
error token.

The error-correcting LR parser implements the algorithm by keeping
a queue of its state stacks before shifting tokens and by using
a lazy stream for the lexer.
This makes it possible to restart the
parse from before an error point and try various corrections. The
error-correcting LR parser does not defer semantic actions. Instead,
ML-Yacc creates semantic actions which are free of side-effects
and always terminate.
ML-Yacc uses higher-order functions to defer the
evaluation of all user semantic actions until the parse is successfully
completed, without constructing an explicit parse tree.
You may declare whether your semantic actions are free of side-effects
and always terminate, in which case ML-Yacc does not need to defer
the evaluation of your semantic actions.
\subsection{Precedence}
ML-Yacc uses the same precedence scheme as Yacc for resolving
shift/reduce conflicts. Each terminal may be assigned a precedence and
associativity. Each rule is then assigned the precedence of its rightmost
terminal. If a shift/reduce conflict occurs, the conflict is resolved
silently if the terminal and the rule in the conflict both have
precedences.
If the terminal has the higher precedence, the shift is chosen. If
the rule has the higher precedence, the reduction is chosen. If both
the terminal and the rule have the same precedence, then the associativity
of the terminal is used to resolve the conflict. If the terminal is
left associative, the reduction is chosen. If the terminal is
right associative, the shift is chosen. Terminals may also be declared
nonassociative, in which case an error message is produced
if the associativity is needed to resolve the parsing conflict.

If a terminal or a rule in a shift/reduce conflict does not have
a precedence, then an error message is produced and the shift
is chosen.

In reduce/reduce conflicts, an error message is always produced and
the first rule listed in the specification is chosen for reduction.
\subsection{Notation}

Text surrounded by braces denotes meta-notation. If you see
something like \{parser name\}, you should substitute the actual
name of your parser for the meta-notation. Text in a bold-face
typewriter font ({\tt like this}) denotes text in a specification
or ML code.

\section{ML-Yacc specifications}

An ML-Yacc specification consists of three parts, each of which is
separated from the others by a {\tt \%\%} delimiter. The general format is:
\begin{quote}
\tt
\{user declarations\} \\
\%\% \\
\{ML-Yacc declarations\} \\
\%\% \\
\{rules\}
\end{quote}

You can define values available in the semantic
actions of the rules in the user declarations section.
It is recommended that you keep the size of this
section as small as possible and place large
blocks of code in other modules.
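
For concreteness, here is a sketch of a complete (hypothetical)
specification for a tiny expression language; every declaration used
here is explained in the sections that follow:
\begin{quote}
\begin{verbatim}
(* user declarations: values visible in semantic actions *)
fun note s = TextIO.print (s ^ "\n")
%%
%name Tiny
%pos int
%term NUM of int | PLUS | EOF
%nonterm EXP of int
%left PLUS
%eop EOF
%noshift EOF
%%
EXP : NUM            (NUM)
    | EXP PLUS EXP   (EXP1 + EXP2)
\end{verbatim}
\end{quote}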

The ML-Yacc declarations section is used to make a set
of required declarations and a set of optional declarations.
You must declare the nonterminals and terminals and the
types of the values associated with them there. You must
also name the parser and declare the type of position values.
You should specify the set of terminals which can follow
the start symbol and the set of non-shiftable terminals.
You may optionally declare precedences for terminals,
make declarations that will
improve error-recovery, and suppress the generation of
default reductions in the parser. You may
declare whether the parser generator should create
a verbose description of the parser in a ``.desc'' file. This is useful
for finding the causes of shift/reduce errors and other parsing conflicts.

You may also declare whether the semantic actions are
free of significant side-effects and always terminate. Normally, ML-Yacc
delays the evaluation of semantic actions until the completion of a
successful parse. This ensures that there will be no semantic actions
to ``undo'' if a syntactic error-correction invalidates some semantic
actions. If, however, the semantic actions are free of significant
side-effects and always terminate, the results of semantic actions that
are invalidated by a syntactic error-correction can always be safely
ignored.

Parsers run faster and need less memory when it is not
necessary to delay the evaluation of semantic actions. You are
encouraged to write semantic actions that are free of side-effects and
always terminate and to declare this information to ML-Yacc.

A semantic action is free of significant side-effects if it can be reexecuted
a reasonably small number of times without affecting the result of a
parse. (The reexecution occurs when the error-correcting parser is testing
possible corrections to fix a syntax error, and the number of times
reexecution occurs is roughly bounded, for each syntax error, by the number of
terminals times the amount of lookahead permitted for the error-correcting
parser).

The rules section contains the context-free grammar productions and their
associated semantic actions.

\subsection{Lexical Definitions}

Comments have the same lexical definition as they do in Standard
ML and can be placed anywhere in a specification.

All characters up to the first occurrence of a delimiting
{\tt \%\%} outside of
a comment are placed in the user declarations section. After that, the
following words and symbols are reserved:
\begin{quote}

\verb'of for = { } , * -> : | ( )'

\end{quote}

The following classes of ML symbols are used:
\begin{quote}
\begin{description}
\item[identifiers:]
nonsymbolic ML identifiers, which consist
of an alphabetic character followed by zero or
more alphabetic characters, numeric characters,
primes ``{\tt '}'', or underscores ``{\tt \_}''.
\item[type variables:]
nonsymbolic ML identifiers starting with a prime ``{\tt '}''.
\item[integers:] one or more decimal digits.
\item[qualified identifiers:] an identifier followed by a period.

\end{description}
\end{quote}
The following classes of non-ML symbols are used:
\begin{quote}
\begin{description}
\item[\% identifiers:]
a percent sign followed by one or more lowercase
alphabet letters. The valid \% identifiers
are:
\begin{quote}
\raggedright
\tt
\%arg \%change \%eop \%header \%keyword \%left \%name \%nodefault
\%nonassoc \%nonterm \%noshift \%pos \%prec \%prefer
\%pure \%right \%start \%subst \%term \%value \%verbose
\end{quote}
\item[code:]
This class is meant to hold ML code. The ML code is not
parsed for syntax errors. It consists of a left parenthesis
followed by all characters up to a balancing right
parenthesis. Parentheses in ML comments and ML strings
are excluded from the count of balancing parentheses.

\end{description}
\end{quote}
\subsection{Grammar}

This is the grammar for specifications:
\begin{eqnarray*}
\mbox{spec} & ::= & \mbox{user-declarations {\tt \%\%} cmd-list {\tt \%\%} rule-list} \\
\mbox{ML-type} & ::= & \mbox{nonpolymorphic ML types (see the Standard ML manual)} \\
\mbox{symbol} & ::= & \mbox{identifier} \\
\mbox{symbol-list} & ::= & \mbox{symbol-list symbol} \\
& | & \epsilon \\
\mbox{symbol-type-list} & ::= & \mbox{symbol-type-list {\tt |} symbol {\tt of} ML-type} \\
& | & \mbox{symbol-type-list {\tt |} symbol} \\
& | & \mbox{symbol {\tt of} ML-type} \\
& | & \mbox{symbol} \\
\mbox{subst-list} & ::= & \mbox{subst-list {\tt |} symbol {\tt for} symbol} \\
& | & \epsilon \\
\mbox{cmd} & ::= & \mbox{{\tt \%arg} (Any-ML-pattern) {\tt :} ML-type} \\
& | & \mbox{{\tt \%eop} symbol-list} \\
& | & \mbox{{\tt \%header} code} \\
& | & \mbox{{\tt \%keyword} symbol-list} \\
& | & \mbox{{\tt \%left} symbol-list} \\
& | & \mbox{{\tt \%name} identifier} \\
& | & \mbox{{\tt \%nodefault}} \\
& | & \mbox{{\tt \%nonassoc} symbol-list} \\
& | & \mbox{{\tt \%nonterm} symbol-type-list} \\
& | & \mbox{{\tt \%noshift} symbol-list } \\
& | & \mbox{{\tt \%pos} ML-type} \\
& | & \mbox{{\tt \%prefer} symbol-list} \\
& | & \mbox{\tt \%pure} \\
& | & \mbox{{\tt \%right} symbol-list} \\
& | & \mbox{{\tt \%start} symbol} \\
& | & \mbox{{\tt \%subst} subst-list} \\
& | & \mbox{{\tt \%term} symbol-type-list} \\
& | & \mbox{{\tt \%value} symbol code} \\
& | & \mbox{{\tt \%verbose}} \\
\mbox{cmd-list} & ::= &\mbox{ cmd-list cmd} \\
& | & \mbox{cmd} \\
\mbox{rule-prec} & ::= & \mbox{{\tt \%prec} symbol} \\
& | & \epsilon \\
\mbox{clause-list} & ::= & \mbox{symbol-list rule-prec code} \\
& | & \mbox{clause-list {\tt |} symbol-list rule-prec code} \\
\mbox{rule} & ::= & \mbox{symbol {\tt :} clause-list} \\
\mbox{rule-list} & ::= & \mbox{rule-list rule} \\
& | & \mbox{rule}
\end{eqnarray*}
\subsection{Required ML-Yacc Declarations}
\begin{description}
\item[{\tt \%name}]
You must specify the name of the parser with {\tt \%name} \{name\}.
\item[{\tt \%nonterm} and {\tt \%term}]
You must define the terminal and nonterminal sets using the
{\tt \%term} and {\tt \%nonterm}
declarations, respectively. These declarations are like an ML datatype
definition.
The type of the value that a symbol may carry is defined at the same time
that the symbol is defined. Each declaration consists of the keyword
({\tt \%term} or {\tt \%nonterm})
followed by a list of symbol entries separated by a bar (``{\tt |}'').
Each symbol entry is a symbol name followed by an optional
``of \/ $<$ML-type$>$''. The types cannot be polymorphic.
Those symbol entries without a type carry no value.
Nonterminal and terminal names must be disjoint and no name may be declared
more than once in either declaration.

The symbol names and types are used to construct a datatype union for the
values on the semantic stack in the LR parser and to name the values
associated with subcomponents of a rule. The names and types of
terminals are also used to construct a signature for a structure that
may be passed to the lexer functor.

Because the types and names are used in these ways, do
not use ML keywords as symbol names. The programs produced by ML-Yacc
will not compile if ML keywords are used as symbol names.
Make sure that the types specified in the {\tt \%term} declaration are
fully qualified types or are available in the background
environment when the signatures produced by ML-Yacc are loaded. Do
not use any locally defined types from the user declarations section of
the specification.

These requirements on the types in the {\tt \%term} declaration are not
a burden.
They force the types to be defined in another module,
which is a good idea since these types will
be used in the lexer module.
\item[{\tt \%pos}]
You must declare the type of position values using the {\tt \%pos} declaration.
The syntax is {\tt \%pos} $<$ML-type$>$.
This type MUST be the same type as that which is actually found in the lexer.
It cannot be polymorphic.

\end{description}
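
Taken together, the required declarations for a small hypothetical
grammar might read:
\begin{quote}
\begin{verbatim}
%name Calc
%pos int
%term NUM of int | PLUS | TIMES | EOF
%nonterm EXP of int | START of int
\end{verbatim}
\end{quote}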
\subsection{Optional ML-Yacc Declarations}
\label{optional-def}
\begin{description}
\item[{\tt \%arg}]
You may want each invocation of the entire parser to be parameterized
by a particular argument, such as the file-name of the input
being parsed in an invocation of the parser. The {\tt \%arg} declaration
allows you to specify such an argument.
(This is often cleaner than using ``global'' reference variables.)
The declaration
\begin{quote}

{\tt \%arg} (Any-ML-pattern) {\tt :} $<$ML-type$>$

\end{quote}
specifies the argument to the parser, as well as its type. For example:
\begin{quote}

{\tt \%arg (filename) : string}

\end{quote}

If {\tt \%arg} is not specified, it defaults to {\tt () : unit}.
\item[{\tt \%eop} and {\tt \%noshift}]
You should specify the set of
terminals that may follow the start
symbol, also called end-of-parse symbols, using the {\tt \%eop}
declaration. The {\tt \%eop} keyword should be followed by the list of
terminals. This is useful, for example, in an interactive system
where you want to force the evaluation of a statement before an
end-of-file (remember, a parser delays the execution of semantic
actions until a parse is successful).

ML-Yacc has no concept of an end-of-file. You must
define an end-of-file terminal (EOF, perhaps) in the
{\tt \%term} declaration.
You must declare terminals which cannot be shifted, such as
end-of-file, in the {\tt \%noshift} declaration. The
{\tt \%noshift} keyword should be followed by the list of non-shiftable
terminals. An error message will be printed if a non-shiftable terminal
is found on the right hand side of any rule, but ML-Yacc will not prevent
you from using such grammars.

It is important to emphasize that
\begin{em}
non-shiftable terminals must be declared.
\end{em}
The error-correcting parser may otherwise attempt to read past such terminals
while evaluating a correction to a syntax error. This may
confuse the lexer.
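
For example, an interactive system whose lexer produces {\tt SEMI} and
{\tt EOF} terminals (terminal names here are illustrative) might
declare:
\begin{quote}
\begin{verbatim}
%eop SEMI EOF
%noshift EOF
\end{verbatim}
\end{quote}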
\item[{\tt \%header}]
You may define code to head the functor \{parser name\}LrValsFun here. This
may be useful for adding additional parameter structures to the functor.
The functor must be parameterized by the Token structure, so
the declaration should always have the form:
\begin{quote}
\begin{verbatim}
%header (functor {parser name}LrValsFun(
structure Token : TOKEN
...)
)
\end{verbatim}
\end{quote}

\item[{\tt \%left}, {\tt \%right}, {\tt \%nonassoc}]
You should list the precedence declarations in order of increasing (tighter-binding)
precedence. Each precedence declaration consists
of a \% keyword specifying associativity followed by a list of terminals.
The keywords are {\tt \%left}, {\tt \%right}, and {\tt \%nonassoc},
standing for their respective associativities.
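
For instance, the usual arithmetic precedences, listed from loosest-
to tightest-binding (terminal names are illustrative):
\begin{quote}
\begin{verbatim}
%left PLUS MINUS
%left TIMES DIV
%right EXPON
\end{verbatim}
\end{quote}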
\item[{\tt \%nodefault}]
The {\tt \%nodefault} declaration suppresses the generation of default
reductions. If only one production can be reduced in a given state in
an LR table, it may be made the default action for the state. An incorrect
reduction will be caught later when the parser attempts to shift the lookahead
terminal which caused the reduction. ML-Yacc usually produces programs and
verbose files with default reductions. This saves a great deal of
space in representing the LR tables, but
sometimes it is useful for debugging and advanced
uses of the parser to suppress the generation of default reductions.
\item[{\tt \%pure}]
Include the {\tt \%pure} declaration if the semantic actions
are free of significant side-effects and always terminate.
\item[{\tt \%start}]
You may define the start symbol using
the {\tt \%start} declaration. Otherwise the nonterminal for the
first rule will be used as the start nonterminal.
The keyword {\tt \%start} should be followed by the name of the starting
nonterminal. This nonterminal should not be used on the right hand
side of any rules, to avoid conflicts between reducing to the start
symbol and shifting a terminal. ML-Yacc will not prevent you
from using such grammars, but it will print a warning message.
\item[{\tt \%verbose}]

Include the {\tt \%verbose} declaration to produce a verbose
description of the LALR parser. The name of this file is
the name of the specification file with a ``.desc'' appended to it.

This file has the following format:
\begin{enumerate}

\item A summary of errors found while generating the LALR tables.
\item A detailed description of all errors.
\item A description of the states of the parser. Each state
is preceded by a list of conflicts in the state.

\end{enumerate}
\end{description}

\subsection{Declarations for improving error-recovery}

These optional declarations improve error-recovery:

\begin{description}
\item[{\tt \%keyword}]
Specify all keywords in a grammar here. The {\tt \%keyword}
declaration should be followed by a list
of terminal names. Fixes involving keywords are generally dangerous;
they are prone to substantially altering the syntactic meaning
of the program. They are subject to a more rigorous parse check than
other fixes.

\item[{\tt \%prefer}]
List terminals to prefer for insertion after the {\tt \%prefer} keyword.
Corrections which insert a terminal on this list will be chosen over
other corrections, all other things being equal.
\item[{\tt \%subst}]
This declaration should be followed by a list of clauses of the
form \{terminal\} {\tt for} \{terminal\}, where items on the list are
separated using a {\tt |}. Substitution corrections on this list
will be chosen over all other corrections except preferred insertion
corrections (listed above), all other things being equal.
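
As an illustration (terminal names are made up), a grammar might mark
its keywords and guide the correction heuristics as follows:
\begin{quote}
\begin{verbatim}
%keyword LET IN END IF THEN ELSE
%prefer SEMI
%subst SEMI for COMMA | COMMA for SEMI
\end{verbatim}
\end{quote}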
\item[{\tt \%change}]
This is a generalization of {\tt \%prefer} and {\tt \%subst}.
It takes the following syntax:
\begin{quote}
${\it tokens}_{1a}$ \verb|->| ${\it tokens}_{1b}$ \verb+|+ ${\it tokens}_{2a}$ \verb|->| ${\it tokens}_{2b}$ {\it etc.}
\end{quote}
where each {\it tokens} is a (possibly empty) sequence of tokens. The
idea is that any instance of ${\it tokens}_{1a}$ can be ``corrected'' to
${\it tokens}_{1b}$, and so on. For example, to suggest that a good
error correction to try is inserting \verb|IN ID END| (which is useful for the
ML parser), write:
\begin{verbatim}
%change -> IN ID END
\end{verbatim}
\item[{\tt \%value}]
The error-correction algorithm may also insert terminals with values.
You must supply a value for such a terminal. The keyword
should be followed by a terminal and a piece of
code (enclosed in parentheses) that when evaluated supplies the value.
There must be a separate {\tt \%value} declaration for each terminal with
a value that you wish may be inserted or substituted in an error correction.
The code for the value is not evaluated until the parse is
successful.

Do not specify a {\tt \%value} for terminals without
values. This will result in a type error in the program produced by
ML-Yacc.
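
For example, if the terminal {\tt ID} carries a string value, a
placeholder value lets the error-corrector insert it (the terminal
name and value here are illustrative):
\begin{quote}
\begin{verbatim}
%value ID ("bogus")
\end{verbatim}
\end{quote}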
\end{description}

\subsection{Rules}

All rules are declared in the final section, after the last {\tt \%\%}
delimiter. A rule consists of a left hand side nonterminal, followed by
a colon, followed by a list of right hand side clauses.

The right hand side clauses should be separated by bars (``{\tt |}''). Each
clause consists of a list of nonterminal and terminal symbols, followed
by an optional {\tt \%prec} declaration, and then followed by the code to be
evaluated when the rule is reduced.

The optional {\tt \%prec} declaration consists of the keyword {\tt \%prec} followed by a
terminal whose precedence should be used as the precedence of the
rule.

The values of those symbols on the right hand side which have values are
available inside the code. Positions for all the symbols are also
available.
Each value has the general form \{symbol name\}\{n+1\}, where \{n\} is the
number of occurrences of the symbol to the left of the symbol. If
the symbol occurs only once in the rule, \{symbol name\} may also
be used.
The positions are given by \{symbol~name\}\{n+1\}left and
\{symbol~name\}\{n+1\}right, where \{n\} is defined as before.
The position for an empty right hand side of
a production is assumed to be the leftmost position of the lookahead
terminal which is causing the reduction. This position value is
available in {\tt defaultPos}.

The value to which the code evaluates is used as the value of the
nonterminal. The type of the value and the nonterminal must match.
The value is ignored if the nonterminal has no value, but is still
evaluated for side-effects.
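
For example, in a hypothetical grammar where {\tt EXP} and {\tt NUM}
carry {\tt int} values, the naming conventions work out as follows:
\begin{quote}
\begin{verbatim}
EXP : NUM               (NUM)
    | EXP PLUS EXP      (EXP1 + EXP2)
    | LPAREN EXP RPAREN (EXP)
\end{verbatim}
\end{quote}
In the second clause {\tt EXP} occurs twice on the right hand side, so
its values are named {\tt EXP1} and {\tt EXP2}; their positions are
available as {\tt EXP1left}, {\tt EXP1right}, {\tt EXP2left}, and
{\tt EXP2right}.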

\section{Producing files with ML-Yacc}

ML-Yacc may be used from the interactive system or built as a
stand-alone program which may be run from the Unix command line.
See the file {\bf README} in the mlyacc directory for directions
on installing ML-Yacc. We recommend that ML-Yacc be installed as
a stand-alone program.

If you are using the stand-alone version of ML-Yacc, invoke the
program ``sml-yacc'' with the name of the specification file.
If you are using ML-Yacc in the interactive system, load the file
``smlyacc.sml''. The end result is a structure ParseGen, with one
value parseGen in it. Apply parseGen to a string containing the
name of the specification file.

Two files will be created, one named by
attaching ``.sig'' to the name of the specification, the other named by
attaching ``.sml'' to the name of the specification.

\section{The lexical analyzer}

Let the name for
the parser given in the {\tt \%name} declaration be denoted by \{n\} and
the specification file name be denoted by \{spec name\}.
The parser generator creates a functor named \{n\}LrValsFun for
the values needed for a particular parser. This functor is placed
in \{spec name\}.sml. This
functor contains a structure
Tokens which allows you to construct terminals from the appropriate
values. The structure has a function for each terminal that takes a tuple
consisting of the value for the terminal (if there is any), a leftmost
position for the terminal, and a rightmost position for the terminal, and
constructs the terminal from these values.
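
As a sketch, for a grammar that declares {\tt NUM of int} and
{\tt PLUS} with {\tt int} positions (names are illustrative), a lexer
would construct terminals like this:
\begin{quote}
\begin{verbatim}
(* terminal with a value: (value, left position, right position) *)
val t1 = Tokens.NUM (42, 0, 2)

(* terminal without a value: (left position, right position) *)
val t2 = Tokens.PLUS (3, 4)
\end{verbatim}
\end{quote}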

A signature for the structure Tokens is created and placed in the ``.sig''
file created by ML-Yacc. This signature is \{n\}\_TOKENS,
where \{n\} is
the name given in the parser specification. A signature
\{n\}\_LRVALS is created for the structure produced by
applying \{n\}LrValsFun.

Use the signature \{n\}\_TOKENS to create a functor for the
lexical analyzer which takes the structure Tokens as an argument. The
signature \{n\}\_TOKENS
will not change unless the {\tt \%term} declaration in a
specification is altered by adding terminals or
changing the types of terminals. You do not need to recompile
the lexical analyzer functor each time the specification for
the parser is changed if the
signature \{n\}\_TOKENS does not change.

If you are using ML-Lex to create the lexical analyzer, you
can turn the lexer structure into a functor using the
{\tt \%header} declaration.
{\tt \%header} allows the user to define the header for a structure body.

If the name of the parser in the specification were Calc, you
would add this declaration to the specification for the lexical
analyzer:
\begin{quote}
\tt
\begin{verbatim}
%header (functor CalcLexFun(structure Tokens : Calc_TOKENS))
\end{verbatim}
\end{quote}

You must define the following in the user definitions section:
\begin{quote}
\tt
\begin{verbatim}
type pos
\end{verbatim}
\end{quote}
This is the type of position values for terminals. This type
must be the same as the one declared in the specification for
the grammar. Note, however, that this type is not available
in the Tokens structure that parameterizes the lexer functor.

You must include the following code in the user definitions section of
the ML-Lex specification:
\begin{quote}
\tt
\begin{verbatim}
type svalue = Tokens.svalue
type ('a,'b) token = ('a,'b) Tokens.token
type lexresult = (svalue,pos) token
\end{verbatim}
\end{quote}

These types are used to give lexers signatures.

726 You may use a lexer constructed using ML-Lex with the {\tt \%arg}
727 declaration, but you must follow special instructions for tying the parser
728 and lexer together.
729
\section{Creating the parser}
\label{create-parser}
Let the name of the grammar specification file be denoted by
\{grammar\} and the name of the lexer specification file be
denoted by \{lexer\} (e.g. in our calculator example these would
stand for calc.grm and calc.lex, respectively).
Let the parser name in the specification be represented by \{n\}
(e.g. Calc in our calculator example).

To construct a parser, do the following:
\begin{enumerate}
\item In the appropriate CM description file (e.g. for your main
program or one of its subgroups or libraries), include the lines:
\begin{quote}
\begin{verbatim}
ml-yacc-lib.cm
{lexer}
{grammar}
\end{verbatim}
\end{quote}
This will cause ML-Yacc to be run on \{grammar\}, producing source files
\{grammar\}.sig and \{grammar\}.sml, and ML-Lex to be run on
\{lexer\}, producing a source file \{lexer\}.sml. Then these files
will be compiled after loading the necessary signatures and modules
from the ML-Yacc library as specified by {\tt ml-yacc-lib.cm}.
\item Apply functors to create the parser:
\begin{quote}
\begin{verbatim}
structure {n}LrVals =
   {n}LrValsFun(structure Token = LrParser.Token)
structure {n}Lex =
   {n}LexFun(structure Tokens = {n}LrVals.Tokens)
structure {n}Parser =
   Join(structure ParserData = {n}LrVals.ParserData
        structure Lex = {n}Lex
        structure LrParser = LrParser)
\end{verbatim}
\end{quote}
If the lexer was created using the {\tt \%arg} declaration in ML-Lex,
the definition of \{n\}Parser must be changed to use another functor
called JoinWithArg:
\begin{quote}
\begin{verbatim}
structure {n}Parser =
   JoinWithArg
      (structure ParserData = {n}LrVals.ParserData
       structure Lex = {n}Lex
       structure LrParser = LrParser)
\end{verbatim}
\end{quote}
\end{enumerate}

The following outline summarizes this process:
\begin{quote}
\begin{verbatim}
(* available at top level *)

signature TOKEN
signature LR_TABLE
signature STREAM
signature LR_PARSER
signature PARSER_DATA
structure LrParser : LR_PARSER

(* printed out in .sig file created by parser generator: *)

signature {n}_TOKENS =
  sig
    structure Token : TOKEN
    type svalue
    val PLUS : 'pos * 'pos ->
                 (svalue,'pos) Token.token
    val INTLIT : int * 'pos * 'pos ->
                 (svalue,'pos) Token.token
    ...
  end

signature {n}_LRVALS =
  sig
    structure Tokens : {n}_TOKENS
    structure ParserData : PARSER_DATA
    sharing ParserData.Token = Tokens.Token
    sharing type ParserData.svalue = Tokens.svalue
  end

(* printed out by lexer generator: *)

functor {n}LexFun(structure Tokens : {n}_TOKENS) =
  struct
    ...
  end

(* printed out in .sml file created by parser generator: *)

functor {n}LrValsFun(structure Token : TOKEN) =
  struct
    structure ParserData =
      struct
        structure Token = Token

        (* code in header section of specification *)

        structure Header = ...
        type svalue = ...
        type result = ...
        type pos = ...
        structure Actions = ...
        structure EC = ...
        val table = ...
      end

    structure Tokens : {n}_TOKENS =
      struct
        structure Token = ParserData.Token
        type svalue = ...
        fun PLUS(p1,p2) = ...
        fun INTLIT(i,p1,p2) = ...
      end

  end

(* to be done by the user: *)

structure {n}LrVals =
   {n}LrValsFun(structure Token = LrParser.Token)

structure {n}Lex =
   {n}LexFun(structure Tokens = {n}LrVals.Tokens)

structure {n}Parser =
   Join(structure Lex = {n}Lex
        structure ParserData = {n}LrVals.ParserData
        structure LrParser = LrParser)
\end{verbatim}
\end{quote}

\section{Using the parser}
\subsection{Parser Structure Signatures}
The final structure created will have the signature PARSER:
\begin{quote}
\begin{verbatim}
signature PARSER =
  sig
    structure Token : TOKEN
    structure Stream : STREAM
    exception ParseError

    type pos     (* pos is the type of line numbers *)
    type result  (* value returned by the parser *)
    type arg     (* type of the user-supplied argument *)
    type svalue  (* the types of semantic values *)

    val makeLexer : (int -> string) ->
                      (svalue,pos) Token.token Stream.stream
    val parse :
        int * ((svalue,pos) Token.token Stream.stream) *
        (string * pos * pos -> unit) * arg ->
          result * (svalue,pos) Token.token Stream.stream
    val sameToken :
        (svalue,pos) Token.token * (svalue,pos) Token.token ->
          bool
  end
\end{verbatim}
\end{quote}
or the signature ARG\_PARSER if you used {\tt \%arg} to create the
lexer. ARG\_PARSER differs from PARSER in that it has an additional
type {\tt lexarg} and a different type for {\tt makeLexer}:
\begin{quote}
\begin{verbatim}
type lexarg
val makeLexer : (int -> string) -> lexarg ->
                  (svalue,pos) token stream
\end{verbatim}
\end{quote}

The signature STREAM (providing lazy streams) is:
\begin{quote}
\begin{verbatim}
signature STREAM =
  sig
    type 'a stream
    val streamify : (unit -> 'a) -> 'a stream
    val cons : 'a * 'a stream -> 'a stream
    val get : 'a stream -> 'a * 'a stream
  end
\end{verbatim}
\end{quote}

\subsection{Using the parser structure}

The parser structure converts the lexing function produced by
ML-Lex into a function which creates a lazy stream of tokens. The
function {\tt makeLexer} takes the same values as the corresponding
{\tt makeLexer} created by ML-Lex, but returns a stream of tokens
instead of a function which yields tokens.

The function parse takes the token stream and some other arguments that
are described below and parses the token stream. It returns a pair composed
of the value associated with the start symbol and the rest of
the token stream. The rest of the token stream includes the
end-of-parse symbol which caused the reduction of some rule
to the start symbol. The function parse raises the
exception ParseError if a syntax error occurs which it cannot fix.

The lazy stream is implemented by the {\tt Stream} structure.
The function {\tt streamify} converts a conventional implementation
of a stream into a lazy stream. In a conventional implementation
of a stream, a stream consists of a position in a list of
values. Fetching a value from a stream returns the
value associated with the position and updates the position to
the next element in the list of values. The fetch is a side-effecting
operation. In a lazy stream, a fetch returns a value and a new
stream, without a side-effect which updates the position value.
This means that a stream can be repeatedly re-evaluated without
affecting the values that it returns. If $f$ is the function
that is passed to {\tt streamify}, $f$ is called only as many
times as necessary to construct the portion of the list of values
that is actually used.
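
The following sketch illustrates this property (it assumes the {\tt
Stream} structure described above and a side-effecting character
reader; {\tt valOf} is used only to keep the sketch short):
\begin{quote}
\begin{verbatim}
val s = Stream.streamify
          (fn () => valOf (TextIO.input1 TextIO.stdIn))
val (c1, s') = Stream.get s
val (c2, _)  = Stream.get s
(* c1 = c2: re-evaluating s does not consume further input *)
\end{verbatim}
\end{quote}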

Parse also takes an integer giving the maximum amount of lookahead permitted
for the error-correcting parse, a function to print error messages,
and a value of type arg. The maximum amount of lookahead for interactive
systems should be zero. In this case, no attempt is made to correct any
syntax errors. For non-interactive systems, try 15. The
function to print error messages takes a tuple of values consisting
of the left and right positions of the terminal which caused the error
and an error message. If the {\tt \%arg} declaration is not used, the
value of type arg should be a value of type unit.

The function sameToken can be used to see if two tokens
denote the same terminal, regardless of any values that the
tokens carry. It is useful if you have multiple end-of-parse
symbols and must check which end-of-parse symbol has been left on the
front of the token stream.

The types have the following meanings. The type {\tt arg} is the type
of the additional argument to the parser, which is specified by the
{\tt \%arg} declaration in the ML-Yacc specification. The type
{\tt lexarg} is the optional argument to lexers, and is specified by
the {\tt \%arg} declaration in an ML-Lex specification. The type {\tt pos}
is the type of line numbers, and is specified by the {\tt \%pos} declaration
in an ML-Yacc specification and defined in the user declarations
section of the ML-Lex specification. The type {\tt result} is
the type associated with the start symbol in the ML-Yacc specification.

\section{Examples}

See the directory examples for examples of parsers constructed using
ML-Yacc. Here is a small sample parser and lexer for an interactive
calculator, from the directory examples/calc, along with code for
creating a parsing function. The calculator reads one or more
expressions from the standard input, evaluates the expressions, and
prints their values. Expressions should be separated by semicolons,
and may also be ended by using an end-of-file. This shows how to
construct an interactive parser which reads a top-level declaration
and processes the declaration before reading the next top-level
declaration.

\subsection{Sample Grammar}
\begin{tt}
\begin{verbatim}
(* Sample interactive calculator for ML-Yacc *)

fun lookup "bogus" = 10000
  | lookup s = 0

%%

%eop EOF SEMI

(* %pos declares the type of positions for terminals.
   Each symbol has an associated left and right position. *)

%pos int

%left SUB PLUS
%left TIMES DIV
%right CARAT

%term ID of string | NUM of int | PLUS | TIMES | PRINT |
      SEMI | EOF | CARAT | DIV | SUB
%nonterm EXP of int | START of int option

%name Calc

%subst PRINT for ID
%prefer PLUS TIMES DIV SUB
%keyword PRINT SEMI

%noshift EOF
%value ID ("bogus")
%nodefault
%verbose
%%

(* the parser returns the value associated with the expression *)

START : PRINT EXP (print (Int.toString EXP);
                   print "\n";
                   TextIO.flushOut TextIO.stdOut;
                   SOME EXP)
      | EXP (SOME EXP)
      | (NONE)
EXP : NUM             (NUM)
    | ID              (lookup ID)
    | EXP PLUS EXP    (EXP1+EXP2)
    | EXP TIMES EXP   (EXP1*EXP2)
    | EXP DIV EXP     (EXP1 div EXP2)
    | EXP SUB EXP     (EXP1-EXP2)
    | EXP CARAT EXP   (let fun e (m,0) = 1
                             | e (m,l) = m*e(m,l-1)
                       in e (EXP1,EXP2)
                       end)
\end{verbatim}
\end{tt}
\subsection{Sample Lexer}
\begin{tt}
\begin{verbatim}
structure Tokens = Tokens

type pos = int
type svalue = Tokens.svalue
type ('a,'b) token = ('a,'b) Tokens.token
type lexresult = (svalue,pos) token

val pos = ref 0
val eof = fn () => Tokens.EOF(!pos,!pos)
val error = fn (e,l : int,_) =>
              TextIO.output(TextIO.stdOut,
                 "line " ^ (Int.toString l) ^ ": " ^ e ^ "\n")
%%
%header (functor CalcLexFun(structure Tokens: Calc_TOKENS));
alpha=[A-Za-z];
digit=[0-9];
ws = [\ \t];
%%
\n       => (pos := (!pos) + 1; lex());
{ws}+    => (lex());
{digit}+ => (Tokens.NUM
               (List.foldl (fn (a,r) => ord a - ord #"0" + 10*r)
                           0 (explode yytext),
                !pos,!pos));
"+"      => (Tokens.PLUS(!pos,!pos));
"*"      => (Tokens.TIMES(!pos,!pos));
";"      => (Tokens.SEMI(!pos,!pos));
{alpha}+ => (if yytext="print"
             then Tokens.PRINT(!pos,!pos)
             else Tokens.ID(yytext,!pos,!pos));
"-"      => (Tokens.SUB(!pos,!pos));
"^"      => (Tokens.CARAT(!pos,!pos));
"/"      => (Tokens.DIV(!pos,!pos));
"."      => (error ("ignoring bad character "^yytext,!pos,!pos);
             lex());
\end{verbatim}
\end{tt}
\subsection{Top-level code}

You must follow the instructions in Section~\ref{create-parser}
to create the parser and lexer functors and load them. After you have
done this, you must then apply the functors to produce the {\tt CalcParser}
structure. The code for doing this is shown below.
\begin{quote}
\begin{verbatim}
structure CalcLrVals =
   CalcLrValsFun(structure Token = LrParser.Token)

structure CalcLex =
   CalcLexFun(structure Tokens = CalcLrVals.Tokens);

structure CalcParser =
   Join(structure LrParser = LrParser
        structure ParserData = CalcLrVals.ParserData
        structure Lex = CalcLex)
\end{verbatim}
\end{quote}

Now we need a function which, given a lexer, invokes the parser. The
function {\tt invoke} does this.

\begin{quote}
\begin{verbatim}
fun invoke lexstream =
    let fun print_error (s,i:int,_) =
           TextIO.output(TextIO.stdOut,
              "Error, line " ^ (Int.toString i) ^ ", " ^ s ^ "\n")
     in CalcParser.parse(0,lexstream,print_error,())
    end
\end{verbatim}
\end{quote}

Finally, we need a function which can read one or more expressions from
the standard input. The function {\tt parse}, shown below, does this.
It runs the calculator on the standard input and terminates
when an end-of-file is encountered.

\begin{quote}
\begin{verbatim}
fun parse () =
    let val lexer = CalcParser.makeLexer
                      (fn _ => TextIO.inputLine TextIO.stdIn)
        val dummyEOF = CalcLrVals.Tokens.EOF(0,0)
        val dummySEMI = CalcLrVals.Tokens.SEMI(0,0)
        fun loop lexer =
            let val (result,lexer) = invoke lexer
                val (nextToken,lexer) = CalcParser.Stream.get lexer
             in case result
                  of SOME r =>
                       TextIO.output(TextIO.stdOut,
                          "result = " ^ (Int.toString r) ^ "\n")
                   | NONE => ();
                if CalcParser.sameToken(nextToken,dummyEOF) then ()
                else loop lexer
            end
     in loop lexer
    end
\end{verbatim}
\end{quote}

\section{Signatures}

This section contains signatures used by ML-Yacc for structures in
the file base.sml, functors and structures that it generates, and for
the signatures of lexer structures supplied by you.

\subsection{Parsing structure signatures}

\begin{quote}
\begin{verbatim}
(* STREAM: signature for a lazy stream. *)

signature STREAM =
  sig
    type 'a stream
    val streamify : (unit -> 'a) -> 'a stream
    val cons : 'a * 'a stream -> 'a stream
    val get : 'a stream -> 'a * 'a stream
  end

(* LR_TABLE: signature for an LR table. *)

signature LR_TABLE =
  sig
    datatype ('a,'b) pairlist
      = EMPTY
      | PAIR of 'a * 'b * ('a,'b) pairlist
    datatype state = STATE of int
    datatype term = T of int
    datatype nonterm = NT of int
    datatype action = SHIFT of state
                    | REDUCE of int
                    | ACCEPT
                    | ERROR
    type table

    val numStates : table -> int
    val numRules : table -> int
    val describeActions : table -> state ->
                            (term,action) pairlist * action
    val describeGoto : table -> state ->
                         (nonterm,state) pairlist
    val action : table -> state * term -> action
    val goto : table -> state * nonterm -> state
    val initialState : table -> state
    exception Goto of state * nonterm

    val mkLrTable :
         {actions : ((term,action) pairlist * action) array,
          gotos : (nonterm,state) pairlist array,
          numStates : int, numRules : int,
          initialState : state} -> table
  end

(* TOKEN: signature for the internal structure of a token. *)

signature TOKEN =
  sig
    structure LrTable : LR_TABLE
    datatype ('a,'b) token = TOKEN of LrTable.term *
                                        ('a * 'b * 'b)
    val sameToken : ('a,'b) token * ('a,'b) token -> bool
  end

(* LR_PARSER: signature for a polymorphic LR parser *)

signature LR_PARSER =
  sig
    structure Stream : STREAM
    structure LrTable : LR_TABLE
    structure Token : TOKEN

    sharing LrTable = Token.LrTable

    exception ParseError

    val parse :
        {table : LrTable.table,
         lexer : ('b,'c) Token.token Stream.stream,
         arg : 'arg,
         saction : int *
                   'c *
                   (LrTable.state * ('b * 'c * 'c)) list *
                   'arg ->
                     LrTable.nonterm *
                     ('b * 'c * 'c) *
                     ((LrTable.state * ('b * 'c * 'c)) list),
         void : 'b,
         ec : {is_keyword : LrTable.term -> bool,
               noShift : LrTable.term -> bool,
               preferred_subst : LrTable.term -> LrTable.term list,
               preferred_insert : LrTable.term -> bool,
               errtermvalue : LrTable.term -> 'b,
               showTerminal : LrTable.term -> string,
               terms : LrTable.term list,
               error : string * 'c * 'c -> unit
              },
         lookahead : int (* max amount of lookahead used in
                          * error correction *)
        } -> 'b * (('b,'c) Token.token Stream.stream)
  end
\end{verbatim}
\end{quote}

\subsection{Lexers}

Lexers for use with ML-Yacc's output must match one of these signatures.

\begin{quote}
\begin{verbatim}
signature LEXER =
  sig
    structure UserDeclarations :
      sig
        type ('a,'b) token
        type pos
        type svalue
      end
    val makeLexer : (int -> string) -> unit ->
         (UserDeclarations.svalue, UserDeclarations.pos)
           UserDeclarations.token
  end

(* ARG_LEXER: the %arg option of ML-Lex allows users to
   produce lexers which also take an argument before
   yielding a function from unit to a token.
*)

signature ARG_LEXER =
  sig
    structure UserDeclarations :
      sig
        type ('a,'b) token
        type pos
        type svalue
        type arg
      end
    val makeLexer :
        (int -> string) ->
        UserDeclarations.arg ->
        unit ->
          (UserDeclarations.svalue, UserDeclarations.pos)
            UserDeclarations.token
  end
\end{verbatim}
\end{quote}

\subsection{Signatures for the functor produced by ML-Yacc}

The following signature is used in signatures generated by
ML-Yacc:
\begin{quote}
\begin{verbatim}
(* PARSER_DATA: the signature of ParserData structures in the
   {n}LrValsFun functor produced by ML-Yacc. All such
   structures match this signature. *)

signature PARSER_DATA =
  sig
    type pos     (* the type of line numbers *)
    type svalue  (* the type of semantic values *)
    type arg     (* the type of the user-supplied *)
                 (* argument to the parser *)
    type result

    structure LrTable : LR_TABLE
    structure Token : TOKEN
    sharing Token.LrTable = LrTable

    structure Actions :
      sig
        val actions : int * pos *
              (LrTable.state * (svalue * pos * pos)) list * arg ->
                LrTable.nonterm * (svalue * pos * pos) *
                ((LrTable.state * (svalue * pos * pos)) list)
        val void : svalue
        val extract : svalue -> result
      end

    (* structure EC contains information used to improve
       error recovery in an error-correcting parser *)

    structure EC :
      sig
        val is_keyword : LrTable.term -> bool
        val noShift : LrTable.term -> bool
        val preferred_subst : LrTable.term -> LrTable.term list
        val preferred_insert : LrTable.term -> bool
        val errtermvalue : LrTable.term -> svalue
        val showTerminal : LrTable.term -> string
        val terms : LrTable.term list
      end

    (* table is the LR table for the parser *)

    val table : LrTable.table
  end
\end{verbatim}
\end{quote}

ML-Yacc generates these two signatures:
\begin{quote}
\begin{verbatim}
(* printed out in .sig file created by parser generator: *)

signature {n}_TOKENS =
  sig
    type ('a,'b) token
    type svalue
    ...
  end

signature {n}_LRVALS =
  sig
    structure Tokens : {n}_TOKENS
    structure ParserData : PARSER_DATA
    sharing type ParserData.Token.token = Tokens.token
    sharing type ParserData.svalue = Tokens.svalue
  end
\end{verbatim}
\end{quote}
\subsection{User parser signatures}

Parsers created by applying the Join functor will match this signature:
\begin{quote}
\begin{verbatim}
signature PARSER =
  sig
    structure Token : TOKEN
    structure Stream : STREAM
    exception ParseError

    type pos     (* pos is the type of line numbers *)
    type result  (* value returned by the parser *)
    type arg     (* type of the user-supplied argument *)
    type svalue  (* the types of semantic values *)

    val makeLexer : (int -> string) ->
                      (svalue,pos) Token.token Stream.stream

    val parse :
        int * ((svalue,pos) Token.token Stream.stream) *
        (string * pos * pos -> unit) * arg ->
          result * (svalue,pos) Token.token Stream.stream
    val sameToken :
        (svalue,pos) Token.token * (svalue,pos) Token.token ->
          bool
  end
\end{verbatim}
\end{quote}
Parsers created by applying the JoinWithArg functor will match this
signature:
\begin{quote}
\begin{verbatim}
signature ARG_PARSER =
  sig
    structure Token : TOKEN
    structure Stream : STREAM
    exception ParseError

    type arg
    type lexarg
    type pos
    type result
    type svalue

    val makeLexer : (int -> string) -> lexarg ->
                      (svalue,pos) Token.token Stream.stream
    val parse : int *
                ((svalue,pos) Token.token Stream.stream) *
                (string * pos * pos -> unit) *
                arg ->
                  result * (svalue,pos) Token.token Stream.stream
    val sameToken :
        (svalue,pos) Token.token * (svalue,pos) Token.token ->
          bool
  end
\end{verbatim}
\end{quote}

\section{Sharing constraints}

Let the name of the parser be denoted by \{n\}. If
you have not created a lexer which takes an argument, and
you have followed the directions given earlier for creating the parser, you
will have the following structures with the following signatures:
\begin{quote}
\begin{verbatim}
(* always present *)

signature TOKEN
signature LR_TABLE
signature STREAM
signature LR_PARSER
signature PARSER_DATA
structure LrParser : LR_PARSER

(* signatures generated by ML-Yacc *)

signature {n}_TOKENS
signature {n}_LRVALS

(* structures created by you *)

structure {n}LrVals : {n}_LRVALS
structure Lex : LEXER
structure {n}Parser : PARSER
\end{verbatim}
\end{quote}

The following sharing constraints will exist:
\begin{quote}
\begin{verbatim}
sharing {n}Parser.Token = LrParser.Token =
        {n}LrVals.ParserData.Token
sharing {n}Parser.Stream = LrParser.Stream

sharing type {n}Parser.arg = {n}LrVals.ParserData.arg
sharing type {n}Parser.result = {n}LrVals.ParserData.result
sharing type {n}Parser.pos = {n}LrVals.ParserData.pos =
             Lex.UserDeclarations.pos
sharing type {n}Parser.svalue = {n}LrVals.ParserData.svalue =
             {n}LrVals.Tokens.svalue = Lex.UserDeclarations.svalue
sharing type {n}Parser.Token.token =
             {n}LrVals.ParserData.Token.token =
             LrParser.Token.token =
             Lex.UserDeclarations.token

sharing {n}LrVals.LrTable = LrParser.LrTable
\end{verbatim}
\end{quote}

If you used a lexer which takes an argument, then you will
have:
\begin{quote}
\begin{verbatim}
structure Lex : ARG_LEXER
structure {n}Parser : ARG_PARSER

(* additional sharing constraint *)

sharing type {n}Parser.lexarg = Lex.UserDeclarations.arg
\end{verbatim}
\end{quote}

\section{Hints}
\subsection{Multiple start symbols}
To have multiple start symbols, define a dummy token for each
start symbol. Then define a start symbol which derives the
multiple start symbols with dummy tokens placed in front of
them. When you start the parser you must place a dummy token
on the front of the lexer stream to select a start symbol
from which to begin parsing.

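As a sketch, a grammar with separate expression and declaration entry
points might be arranged like this (the token, nonterminal, and
constructor names here are hypothetical):
\begin{quote}
\begin{verbatim}
%term START_EXP | START_DECL | ...
%nonterm START of result | EXP of ... | DECL of ...

START : START_EXP EXP   (ExpResult EXP)
      | START_DECL DECL (DeclResult DECL)
\end{verbatim}
\end{quote}
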
Assuming that you have followed the naming conventions used before,
create the lexer using the makeLexer function in the \{n\}Parser structure.
Then, place the dummy token on the front of the lexer:
\begin{quote}
\begin{verbatim}
val dummyLexer =
    {n}Parser.Stream.cons
       ({n}LrVals.Tokens.{dummy token name}
           ({dummy lineno},{dummy lineno}),
        lexer)
\end{verbatim}
\end{quote}
You have to pass a Tokens structure to the lexer. This Tokens structure
contains functions which construct tokens from values and line numbers.
So to create your dummy token just apply the appropriate token constructor
function from this Tokens structure to a value (if there is one) and the
line numbers. This is exactly what you do in the lexer to construct tokens.

Then you must place the dummy token on the front of your lex stream.
The structure \{n\}Parser contains a structure Stream which implements
lazy streams. So you just cons the dummy token onto the stream returned
by makeLexer.
\subsection{Functorizing things further}

You may wish to functorize things even further. Two possibilities
are turning the lexer and parser structures into closed functors,
that is, functors which do not refer to types or values defined
outside their body or outside their parameter structures (except
for pervasive types and values), and creating a functor which
encapsulates the code necessary to invoke the parser.

Use the {\tt \%header} declarations in ML-Lex and ML-Yacc to create
closed functors. See section~\ref{optional-def} of this manual
and section 4 of the manual for ML-Lex for complete descriptions of these
declarations. If you do this, you should also parameterize these
structures by the types of line numbers. The type will be an
abstract type, so you will also need to define all the valid
operations on the type. The signature INTERFACE, defined below,
shows one possible signature for a structure defining the line
number type and associated operations.

If you wish to encapsulate the code necessary to invoke the
parser, your functor will generally have the form:
\begin{quote}
\begin{verbatim}
functor Encapsulate(
    structure Parser : PARSER
    structure Interface : INTERFACE
    sharing type Parser.arg = Interface.arg
    sharing type Parser.pos = Interface.pos
    sharing type Parser.result = ...
    structure Tokens : {parser name}_TOKENS
    sharing type Tokens.token = Parser.Token.token
    sharing type Tokens.svalue = Parser.svalue) =
  struct
    ...
  end
\end{verbatim}
\end{quote}

The signature INTERFACE, defined below, is a possible signature for
a structure defining the types of line numbers and arguments (types
pos and arg, respectively) along with operations for them. You need
this structure because these types will be abstract types inside the
body of your functor.
\begin{quote}
\begin{verbatim}
signature INTERFACE =
  sig
    type pos
    val line : pos ref
    val reset : unit -> unit
    val next : unit -> unit
    val error : string * pos * pos -> unit

    type arg
    val nothing : arg
  end
\end{verbatim}
\end{quote}

The directory example/fol contains a sample parser in which
the code for tying together the lexer and parser has been
encapsulated in a functor.

\section{Acknowledgements}

Nick Rothwell wrote an SLR table generator in 1988 which inspired the
initial work on an ML parser generator. Bruce Duba and David
MacQueen made useful suggestions about the design of the error-correcting
parser. Thanks go to all the users at Carnegie Mellon who beta-tested
this version. Their comments and questions led to the creation of
this manual and helped improve it.

\section{Bugs}

There is a slight difference in syntax between ML-Lex and ML-Yacc.
In ML-Lex, semantic actions must be followed by a semicolon, but
in ML-Yacc semantic actions cannot be followed by a semicolon.
The syntax should be the same. ML-Lex also produces structures with
two different signatures, but it should produce structures with just
one signature. This would simplify some things.

\begin{thebibliography}{9}

\bibitem{bf} M. Burke and G. Fisher, ``A Practical Method for LR and
LL Syntactic Error Diagnosis and Recovery'', {\em ACM Transactions
on Programming Languages and Systems}, Vol. 9, No. 2, April 1987,
pp. 164--167.
\bibitem{ahu} A. Aho, R. Sethi, J. Ullman, {\em Compilers: Principles,
Techniques, and Tools}, Addison-Wesley, Reading, MA, 1986.

\end{thebibliography}

\end{document}