Import Upstream version 20180207
[hcoop/debian/mlton.git] / mllex / lexgen.tex
CommitLineData
7f918cf1
CE
1% Modified by Matthew Fluet on 2007-11-07.
2% Add %posint command.
3%
4% Modified by Matthew Fluet on 2007-10-31.
5% Add \r escape sequence (from Florian Weimer).
6% Fix TeX formatting bug (from Florian Weimer).
7%
8\documentstyle{article}
9\title{ A lexical analyzer generator for Standard ML.\\
10 Version 1.6.0, October 1994
11 }
12\author{ Andrew W. Appel$^1$\\
13 James S. Mattson\\
14 David R. Tarditi$^2$\\
15\\
16\small
17$^1$Department of Computer Science, Princeton University \\
18\small
19$^2$School of Computer Science, Carnegie Mellon University
20}
21\date{}
22\begin{document}
23\maketitle
24\begin{center}
25(c) 1989-94 Andrew W. Appel, James S. Mattson, David R. Tarditi
26\end{center}
27
28{\bf
29This software comes with ABSOLUTELY NO WARRANTY. It is subject only to
30the terms of the ML-Yacc NOTICE, LICENSE, and DISCLAIMER (in the
31file COPYRIGHT distributed with this software).
32}
33
34\vspace{1in}
35
36New in this version:
37\begin{itemize}
38\item REJECT is much less costly than before.
39\item Lexical analyzers with more than 255 states can now compile in your
40lifetime.
41\end{itemize}
42
43\newpage
44\tableofcontents
45\newpage
46
47\section{General Description}
48
49Computer programs often need to divide their input into words and
50distinguish between different kinds of words. Compilers, for
51example, need to distinguish between integers, reserved words, and
52identifiers. Applications programs often need to be able to
53recognize components of typed commands from users.
54
55The problem of segmenting input into words and recognizing classes of
56words is known as lexical analysis. Small cases of this problem,
57such as reading text strings separated by spaces, can be solved by
58using hand-written programs. Larger cases of this problem, such as
59tokenizing an input stream for a compiler, can also be solved using
60hand-written programs.
61
62A hand-written program for a large lexical analysis problem, however,
63suffers from two major problems. First, the program requires a fair
64amount of programmer time to create. Second, the description of
65classes of words is not explicit in the program. It must be inferred
66from the program code. This makes it difficult to verify if the
67program recognizes the correct words for each class. It also makes
68future maintenance of the program difficult.
69
70Lex, a programming tool for the Unix system, is a successful solution
71to the general problem of lexical analysis. It uses regular
72expressions to describe classes of words. A program fragment is
73associated with each class of words. This information is given to
74Lex as a specification (a Lex program). Lex produces a program for a
75function that can be used to perform lexical analysis.
76
77The function operates as follows. It finds the longest word starting
78from the current position in the input stream that is in one of the
79word classes. It executes the program fragment associated with the
80class, and sets the current position in the input stream to be the
81character after the word. The program fragment has the actual text
82of the word available to it, and may be any piece of code. For many
83applications it returns some kind of value.
84
85Lex allows the programmer to make the language description explicit,
86and to concentrate on what to do with the recognized words, not how
87to recognize the words. It saves programmer time and increases
88program maintainability.
89
90Unfortunately, Lex is targeted only C. It also places artificial
91limits on the size of strings that can be recognized.
92
93ML-Lex is a variant of Lex for the ML programming language. ML-Lex
94has a syntax similar to Lex, and produces an ML program instead of a
95C program. ML-Lex produces a program that runs very efficiently.
96Typically the program will be as fast or even faster than a
97hand-coded lexer implemented in Standard ML.
98
99The program typically uses only a small amount of space.
100ML-Lex thus allows ML programmers the same benefits that Lex allows C
101programmers. It also does not place artificial limits on the size of
102recognized strings.
103
104\section{ML-Lex specifications}
105
106An ML-Lex specification has the general format:
107
108\begin{quote}
109 {user declarations}
110 \verb|%%|
111 {ML-Lex definitions}
112 \verb|%%|
113 {rules}
114\end{quote}
115
116Each section is separated from the others by a \verb|%%| delimiter.
117
118The rules are used to define the lexical analysis function. Each
119rule has two parts---a regular expression and an action. The regular
120expression defines the word class that a rule matches. The action is
121a program fragment to be executed when a rule matches the input. The
122actions are used to compute values, and must all return values of the
123same type.
124
125The user can define values available to all rule actions in the user
126declarations section. The user must define two values in this
127section---a type lexresult and a function eof. Lexresult defines the
128type of values returned by the rule actions. The function "eof" is
129called by the lexer when the end of the input stream is reached. It
130will typically return a value signalling eof or raise an exception.
131It is called with the same argument as lex (see \verb|%arg|, below),
132and must return a value of type lexresult.
133
134In the definitions section, the user can define named regular
135expressions, a set of start states, and specify which of the various
136bells and whistles of ML-Lex are desired.
137
138The start states allow the user to control when certain rules are
139matched. Rules may be defined to match only when the lexer is in
140specific start states. The user may change the lexer's start state
141in a rule action. This allows the user to specify special handling
142of lexical objects.
143
144This feature is typically used to handle quoted strings with escapes
145to denote special characters. The rules to recognize the inside
146contents of a string are defined for only one start state. This
147start state is entered when the beginning of a string is recognized,
148and exited when the end of the string is recognized.
149
150\section{Regular expressions}
151
152Regular expressions are a simple language for denoting classes of
153strings. A regular expression is defined inductively over an
154alphabet with a set of basic operations. The alphabet for ML-Lex is
155the Ascii character set (character codes 0--127; or if
156\verb|%full| is used, 0--255).
157
158The syntax and semantics of regular expressions will be described in
159order of decreasing precedence (from the most tightly binding operators
160to the most weakly binding):
161
162\begin{itemize}
163\item An individual character stands for itself, except for the
164 reserved characters \verb@? * + | ( ) ^ $ / ; . = < > [ { " \@
165
166\item[\\] A backslash followed by one of the reserved characters stands
167 for that character.
168
169\item A set of characters enclosed in square brackets [ ] stands
170 for any one of those characters. Inside the brackets, only
171 the symbols \verb|\ - ^| are reserved. An initial up-arrow
172 \verb|^| stands
173 for the complement of the characters listed, e.g. \verb|[^abc]|
174 stands any character except a, b, or c. The hyphen - denotes
175 a range of characters, e.g. \verb|[a-z]| stands for any lower-case
176 alphabetic character, and \verb|[0-9a-fA-F]| stands for any hexadecimal
177 digit. To include \verb|^| literally in a bracketed set, put it anywhere
178 but first; to include \verb|-| literally in a set, put it first or last.
179
180\item[\verb|.|] The dot \verb|.| character stands for any character except newline,
181 i.e. the same as \verb|[^\n]|
182
183\item The following special escape sequences are available, inside
184 or outside of square-brackets:
185
186 \begin{tabular}{ll}
187 \verb|\b|& backspace\\
188 \verb|\n|& newline\\
189 \verb|\r|& carriage return\\
190 \verb|\t|& tab\\
191 \verb|\h|& stands for all characters with codes $>127$,\\
192 &~~~~ when 7-bit characters are used.\\
193 \verb|\ddd|& where \verb|ddd| is a 3 digit decimal escape.\\
194
195 \end{tabular}
196
197\item[\verb|"|] A sequence of characters will stand for itself (reserved
198 characters will be taken literally) if it is enclosed in
199 double quotes \verb|" "|.
200
201\item[\{\}] A named regular expression (defined in the ``definitions"
202 section) may be referred to by enclosing its name in
203 braces \verb|{ }|.
204
205\item[()] Any regular expression may be enclosed in parentheses \verb|( )|
206 for syntactic (but, as usual, not semantic) effect.
207
208\item[\verb|*|] The postfix operator \verb|*| stands for Kleene closure:
209 zero or more repetitions of the preceding expression.
210
211\item[\verb|+|] The postfix operator \verb|+| stands for one or more repetitions
212 of the preceding expression.
213
214\item[\verb|?|] The postfix operator \verb|?| stands for zero or one occurrence of
215 the preceding expression.
216
217\item A postfix repetition range $\{n_1,n_2\}$ where $n_1$ and $n_2$ are small
218 integers stands for any number of repetitions between $n_1$ and $n_2$
219 of the preceding expression. The notation $\{n_1\}$ stands for
220 exactly $n_1$ repetitions.
221
222\item Concatenation of expressions denotes concatenation of strings.
223 The expression $e_1 e_2$ stands for any string that results from
224 the concatenation of one string that matches $e_1$ with another
225 string that matches $e_2$.
226
227\item\verb-|- The infix operator \verb-|- stands for alternation. The expression
228 $e_1$~\verb"|"~$e_2$ stands for anything that either $e_1$ or $e_2$ stands for.
229
230\item[\verb|/|] The infix operator \verb|/| denotes lookahead. Lookahead is not
231 implemented and cannot be used, because there is a bug
232 in the algorithm for generating lexers with lookahead. If
233 it could be used, the expression $e_1 / e_2$ would match any string
234 that $e_1$ stands for, but only when that string is followed by a
235 string that matches $e_2$.
236
237\item When the up-arrow \verb|^| occurs at the beginning of an expression,
238 that expression will only match strings that occur at the
239 beginning of a line (right after a newline character).
240
241\item[\$] The dollar sign of C Lex \$ is not implemented, since it is an abbreviation
242 for lookahead involving the newline character (that is, it
243 is an abbreviation for \verb|/\n|).
244\end{itemize}
245
246Here are some examples of regular expressions, and descriptions of the
247set of strings they denote:
248
249\begin{tabular}{ll}
250\verb~0 | 1 | 2 | 3~& A single digit between 0 and 3\\
251\verb|[0123]|& A single digit between 0 and 3\\
252\verb|0123|& The string ``0123"\\
253\verb|0*|& All strings of 0 or more 0's\\
254\verb|00*|& All strings of 1 or more 0's\\
255\verb|0+|& All strings of 1 or more 0's\\
256\verb|[0-9]{3}|& Any three-digit decimal number.\\
257\verb|\\[ntb]|& A newline, tab, or backspace.\\
258\verb|(00)*|& Any string with an even number of 0's.
259\end{tabular}
260
261\section{ML-Lex syntax summary}
262
263\subsection{User declarations}
264
265Anything up to the first \verb|%%| is in the user declarations section. The
266user should note that no symbolic identifier containing
267\verb|%%| can be
268used in this section.
269
270\subsection{ML-Lex definitions}
271
272Start states can be defined with
273\begin{quote}
274\verb|%s| {identifier list} \verb|;|
275\end{quote}
276
277An identifier list consists of one or more identifiers.
278
279An identifier consists of one or more letters, digits, underscores,
280or primes, and must begin with a letter.
281
282Named expressions can be defined with
283
284\begin{quote}
285 {identifier} = {regular expression} ;
286\end{quote}
287
288Regular expressions are defined below.
289
290The following \% commands are also available:
291
292\begin{description}
293\item[\tt \%reject] create REJECT() function
294\item[\tt \%count] count newlines using yylineno
295\item[\tt \%posarg] pass initial-position argument to makeLexer
296\item[\tt \%full] create lexer for the full 8-bit character set,
297 with characters in the range 0--255 permitted
298 as input.
299\item[\tt \%structure \{identifier\}] name the structure in the output program
300 {identifier} instead of Mlex
301\item[\tt \%header] use code following it to create header for lexer
302 structure
303\item[\tt \%arg] extra (curried) formal parameter argument to be
304 passed to the lex functions, and to be passed
305 to the eof function in place of ()
306\item[\tt \%posint \{identifier\}] use the {\tt INTEGER} structure for the
307 type of {\tt yypos}; use {\tt Int64} or {\tt Position}
308 to allow lexing of multi-gigabyte input files
309\end{description}
310 These functions are discussed in section~\ref{avail}.
311
312\subsection{Rules}
313
314Each rule has the format:
315
316\begin{quote}
317 \verb|<|{\it start state list}\verb|>| {\it regular expression} \verb|=> (| {\it code} \verb|);|
318\end{quote}
319
320All parentheses in {\it code} must be balanced, including those
321used in strings and comments.
322
323The {\it start state list} is optional. It consists of a list of
324identifiers separated by commas, and is delimited by triangle
325brackets \verb|< >|. Each identifier must be a start state defined in the
326\verb|%s| section above.
327
328The regular expression is only recognized when the lexer is in one of
329the start states in the start state list. If no start state list is
330given, the expression is recognized in all start states.
331
332The lexer begins in a pre-defined start state called \verb|INITIAL|.
333
334The lexer resolves conflicts among rules by choosing the rule with
335the longest match, and in the case two rules match the same string,
336choosing the rule listed first in the specification.
337
338The rules should match all possible input. If some input occurs that
339does not match any rule, the lexer created by ML-Lex will raise an
340exception LexError. Note that this differs from C Lex, which prints
341any unmatched input on the standard output.
342
343\section{Values available inside the code associated with a rule.}
344\label{avail}
345
346ML-Lex places the value of the string matched by a regular expression
347in \verb|yytext|, a string variable.
348
349The user may recursively
350call the lexing function with \verb|lex()|. (If \verb|%arg| is used, the
351lexing function may be re-invoked with the same argument by using
352continue().) This is convenient for ignoring white space or comments silently:
353
354\begin{verbatim}
355 [\ \t\n]+ => ( lex());
356\end{verbatim}
357
358To switch start states, the user may call \verb|YYBEGIN| with the name of a
359start state.
360
361The following values will be available only if the corresponding \verb|%|
362command is in the ML-Lex definitions sections:
363
364\begin{tabular}{lll}
365\\
366{\bf Value}&{\bf \% command}&{\bf description}\\
367\hline
368{\tt REJECT} &{\tt\%reject}&\parbox[t]{2.6in}{{\tt REJECT()} causes the current
369 rule to be ``rejected.''
370 The lexer behaves as if the
371 current rule had not matched;
372 another rule that matches this
373 string, or that matches the longest
374 possible prefix of this string,
375 is used instead.} \\
376{\tt yypos} & & \parbox[t]{2.6in}{The position of the first character
377of {\tt yytext}, relative to the beginning of the file.}\\
378{\tt yylineno } & {\tt \%count} & Current line number\\
379\\
380\end{tabular}
381
382
383These values should be used only if necessary. Adding {\tt REJECT} to a
384lexer will slow it down by 20\%; adding {\tt yylineno} will slow it down by
385another 20\%, or more. (It is much more efficient to
386recognize \verb|\n| and
387have an action that increments the line-number variable.) The use of
388the lookahead operator {\tt /} will also slow down the entire lexer.
389The character-position, {\tt yypos}, is not costly to maintain, however.
390
391\paragraph{Bug.} The position of the first character in the file
392is reported as 2 (unless the {\tt \%posarg} feature is used).
393To preserve compatibility, this bug has not been fixed.
394
395\section{Running ML-Lex}
396
397From the Unix shell, run {\tt sml-lex~myfile.lex}
398The output file will be myfile.lex.sml. The extension {\tt .lex} is not
399required but is recommended.
400
401Within an interactive system [not the preferred method]:
402Use {\tt lexgen.sml}; this will create a structure LexGen. The function
403LexGen.lexGen creates a program for a lexer from an input
404specification. It takes a string argument -- the name of the file
405containing the input specification. The output file name is
406determined by appending ``{\tt .sml}'' to the input file name.
407
408\section{Using the program produced by ML-Lex}
409
410When the output file is loaded, it will create a structure Mlex that
411contains the function {\tt makeLexer} which takes a function from
412${\it int} \rightarrow {\it string}$ and returns a lexing function:
413
414\begin{verbatim}
415 val makeLexer : (int->string) -> yyarg -> lexresult
416\end{verbatim}
417where {\tt yyarg} is the type given in the {\tt \%yyarg} directive,
418or {\tt unit} if there is no {\tt \%yyarg} directive.
419
420For example,
421
422\begin{verbatim}
423 val lexer = Mlex.makeLexer (inputc (open_in "f"))
424\end{verbatim}
425
426creates a lexer that operates on the file whose name is f.
427
428When the {\tt \%posarg} directive is used, the type of
429{\tt makeLexer} is
430\begin{verbatim}
431 val makeLexer : ((int->string)*int) -> yyarg -> lexresult
432\end{verbatim}
433where the extra {\tt int} argument is one less than the {\tt yypos}
434of the first character in the input. The value $k$ would be used,
435for example, when creating
436a lexer to start in the middle of a file, when $k$ characters have
437already been read. At the beginning of the file, $k=0$ should be used.
438
439The ${\it int} \rightarrow {\it string}$ function
440should read a string of characters
441from the input stream. It should return a null string to indicate
442that the end of the stream has been reached. The integer is the
443number of characters that the lexer wishes to read; the function may
444return any non-zero number of characters. For example,
445
446\begin{verbatim}
447 val lexer =
448 let val input_line = fn f =>
449 let fun loop result =
450 let val c = input (f,1)
451 val result = c :: result
452 in if String.size c = 0 orelse c = "\n" then
453 String.implode (rev result)
454 else loop result
455 end
456 in loop nil
457 end
458 in Mlex.makeLexer (fn n => input_line std_in)
459 end
460\end{verbatim}
461
462is appropriate for interactive streams where prompting, etc. occurs;
463the lexer won't care that \verb|input_line| might return a string of more
464than or less than $n$ characters.
465
466The lexer tries to read a large number of characters from the input
467function at once, and it is desirable that the input function return
468as many as possible. Reading many characters at once makes the lexer
469more efficient. Fewer input calls and buffering operations are
470needed, and input is more efficient in large block reads. For
471interactive streams this is less of a concern, as the limiting factor
472is the speed at which the user can type.
473
474To obtain a value, invoke the lexer by passing it a unit:
475
476\begin{verbatim}
477 val nextToken = lexer()
478\end{verbatim}
479
480If one wanted to restart the lexer, one would just discard {\tt lexer}
481and create a new lexer on the same stream with another call to
482{\tt makeLexer}. This is the best way to discard any characters buffered
483internally by the lexer.
484
485All code in the user declarations section is placed inside a
486structure UserDeclarations. To access this structure, use the path name
487{\tt Mlex.UserDeclarations}.
488
489If any input cannot be matched, the program will raise the exception
490{\tt Mlex.LexError}. An internal error (i.e. bug) will cause the
491exception {\tt Internal.LexerError} to be raised.
492
493If {\tt \%structure} is used, remember that the structure name will no
494longer be Mlex, but the one specified in the command.
495
496\section{Sample}
497
498Here is a sample lexer for a calculator program:
499
500\small
501\begin{verbatim}
502datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
503 NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES
504
505val linenum = ref 1
506val error = fn x => output(std_out,x ^ "\n")
507val eof = fn () => EOF
508%%
509%structure CalcLex
510alpha=[A-Za-z];
511digit=[0-9];
512ws = [\ \t];
513%%
514\n => (inc linenum; lex());
515{ws}+ => (lex());
516"/" => (DIV);
517";" => (EOS);
518"(" => (LPAREN);
519{digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
520")" => (RPAREN);
521"+" => (PLUS);
522{alpha}+ => (if yytext="print" then PRINT else ID yytext);
523"-" => (SUB);
524"*" => (TIMES);
525. => (error ("calc: ignoring bad character "^yytext); lex());
526\end{verbatim}
527
528
529Here is the parser for the calculator:
530\begin{verbatim}
531
532(* Sample interactive calculator to demonstrate use of lexer
533
534 The original grammar was
535
536 stmt_list -> stmt_list stmt
537 stmt -> print exp ; | exp ;
538 exp -> exp + t | exp - t | t
539 t -> t * f | t/f | f
540 f -> (exp) | id | num
541
542 The function parse takes a stream and parses it for the calculator
543 program.
544
545 If a syntax error occurs, parse prints an error message and calls
546 itself on the stream. On this system that has the effect of ignoring
547 all input to the end of a line.
548*)
549
550structure Calc =
551 struct
552 open CalcLex
553 open UserDeclarations
554 exception Error
555 fun parse strm =
556 let
557 val say = fn s => output(std_out,s)
558 val input_line = fn f =>
559 let fun loop result =
560 let val c = input (f,1)
561 val result = c :: result
562 in if String.size c = 0 orelse c = "\n" then
563 String.implode (rev result)
564 else loop result
565 end
566 in loop nil
567 end
568 val lexer = makeLexer (fn n => input_line strm)
569 val nexttok = ref (lexer())
570 val advance = fn () => (nexttok := lexer(); !nexttok)
571 val error = fn () => (say ("calc: syntax error on line" ^
572 (makestring(!linenum)) ^ "\n"); raise Error)
573 val lookup = fn i =>
574 if i = "ONE" then 1
575 else if i = "TWO" then 2
576 else (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
577 fun STMT_LIST () =
578 case !nexttok of
579 EOF => ()
580 | _ => (STMT(); STMT_LIST())
581
582 and STMT() =
583 (case !nexttok
584 of EOS => ()
585 | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
586 | _ => (E(); ());
587 case !nexttok
588 of EOS => (advance())
589 | _ => error())
590 and E () = E' (T())
591 and E' (i : int ) =
592 case !nexttok of
593 PLUS => (advance (); E'(i+T()))
594 | SUB => (advance (); E'(i-T()))
595 | RPAREN => i
596 | EOF => i
597 | EOS => i
598 | _ => error()
599 and T () = T'(F())
600 and T' i =
601 case !nexttok of
602 PLUS => i
603 | SUB => i
604 | TIMES => (advance(); T'(i*F()))
605 | DIV => (advance (); T'(i div F()))
606 | EOF => i
607 | EOS => i
608 | RPAREN => i
609 | _ => error()
610 and F () =
611 case !nexttok of
612 ID i => (advance(); lookup i)
613 | LPAREN =>
614 let val v = (advance(); E())
615 in if !nexttok = RPAREN then (advance (); v) else error()
616 end
617 | NUM i => (advance(); i)
618 | _ => error()
619 in STMT_LIST () handle Error => parse strm
620 end
621 end
622\end{verbatim}
623\end{document}