% Modified by Matthew Fluet on 2007-11-07.
%
% Modified by Matthew Fluet on 2007-10-31.
% Add \r escape sequence (from Florian Weimer).
% Fix TeX formatting bug (from Florian Weimer).
%
\documentstyle{article}
\title{A lexical analyzer generator for Standard ML.\\
Version 1.6.0, October 1994
}
\author{Andrew W. Appel$^1$\\
James S. Mattson\\
David R. Tarditi$^2$\\
\\
\small
$^1$Department of Computer Science, Princeton University \\
\small
$^2$School of Computer Science, Carnegie Mellon University
}
\date{}
\begin{document}
\maketitle
\begin{center}
(c) 1989-94 Andrew W. Appel, James S. Mattson, David R. Tarditi
\end{center}

{\bf
This software comes with ABSOLUTELY NO WARRANTY. It is subject only to
the terms of the ML-Yacc NOTICE, LICENSE, and DISCLAIMER (in the
file COPYRIGHT distributed with this software).
}
\vspace{1in}

New in this version:
\begin{itemize}
\item REJECT is much less costly than before.
\item Lexical analyzers with more than 255 states can now compile in your
ML system.
\end{itemize}
\newpage
\tableofcontents
\newpage
\section{General Description}

Computer programs often need to divide their input into words and
distinguish between different kinds of words. Compilers, for example,
need to distinguish between integers, reserved words, and identifiers.
Application programs often need to recognize components of typed
commands from users.

The problem of segmenting input into words and recognizing classes of
words is known as lexical analysis. Small cases of this problem, such
as reading text strings separated by spaces, can be solved with
hand-written programs. Larger cases, such as tokenizing an input
stream for a compiler, can also be solved by hand.
A hand-written program for a large lexical analysis problem, however,
suffers from two major problems. First, the program requires a fair
amount of programmer time to create. Second, the description of the
classes of words is not explicit in the program; it must be inferred
from the program code. This makes it difficult to verify that the
program recognizes the correct words for each class, and it makes
future maintenance of the program difficult.
Lex, a programming tool for the Unix system, is a successful solution
to the general problem of lexical analysis. It uses regular
expressions to describe classes of words, and associates a program
fragment with each class of words. This information is given to Lex as
a specification (a Lex program). Lex produces a program for a function
that performs lexical analysis.

The function operates as follows. It finds the longest word, starting
from the current position in the input stream, that is in one of the
word classes. It executes the program fragment associated with that
class, and sets the current position in the input stream to the
character after the word. The program fragment has the actual text of
the word available to it, and may be any piece of code. For many
applications it returns some kind of value.
Lex allows the programmer to make the language description explicit,
and to concentrate on what to do with the recognized words rather than
how to recognize them. It saves programmer time and increases program
maintainability.

Unfortunately, Lex targets only C. It also places artificial limits on
the size of strings that can be recognized.
ML-Lex is a variant of Lex for the ML programming language. ML-Lex has
a syntax similar to Lex's, and produces an ML program instead of a C
program. The program ML-Lex produces runs very efficiently: typically
it will be as fast as, or even faster than, a hand-coded lexer written
in Standard ML, and it uses only a small amount of space. ML-Lex thus
gives ML programmers the same benefits that Lex gives C programmers.
It also places no artificial limits on the size of recognized strings.
\section{ML-Lex specifications}

An ML-Lex specification has the general format:

\begin{quote}
{\it user declarations}\\
\verb|%%|\\
{\it ML-Lex definitions}\\
\verb|%%|\\
{\it rules}
\end{quote}

Each section is separated from the others by a \verb|%%| delimiter.
The rules are used to define the lexical analysis function. Each rule
has two parts---a regular expression and an action. The regular
expression defines the word class that the rule matches. The action is
a program fragment to be executed when the rule matches the input. The
actions are used to compute values, and must all return values of the
same type.

The user can define values available to all rule actions in the user
declarations section. Two values must be defined in this section---a
type {\tt lexresult} and a function {\tt eof}. The type
{\tt lexresult} defines the type of values returned by the rule
actions. The function {\tt eof} is called by the lexer when the end of
the input stream is reached; it will typically return a value
signalling eof or raise an exception. It is called with the same
argument as lex (see \verb|%arg|, below), and must return a value of
type {\tt lexresult}.
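
For instance, a minimal complete specification satisfying these
requirements might read as follows (a sketch; the token type and the
rules are invented for illustration):

\begin{verbatim}
datatype lexresult = WORD of string | EOF
val eof = fn () => EOF
%%
%%
[a-zA-Z]+ => (WORD yytext);
.|\n      => (lex());
\end{verbatim}

Here every run of letters is returned as a \verb|WORD| token, and any
other character is skipped by invoking the lexer recursively.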
In the definitions section, the user can define named regular
expressions, define a set of start states, and specify which of the
various bells and whistles of ML-Lex are desired.
The start states allow the user to control when certain rules are
matched. Rules may be defined to match only when the lexer is in
specific start states, and the user may change the lexer's start state
in a rule action. This allows the user to specify special handling of
lexical objects.

This feature is typically used to handle quoted strings with escapes
that denote special characters. The rules recognizing the contents of
a string are defined for only one start state; this start state is
entered when the beginning of a string is recognized, and exited when
the end of the string is recognized.
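
As a sketch of this idiom (the helper functions \verb|startString|,
\verb|addToString|, and \verb|endString| are hypothetical), the rules
for a quoted string might read:

\begin{verbatim}
%s STRING;
%%
<INITIAL>\"  => (YYBEGIN STRING; startString(); lex());
<STRING>\\n  => (addToString "\n"; lex());
<STRING>\"   => (YYBEGIN INITIAL; endString());
<STRING>.    => (addToString yytext; lex());
\end{verbatim}

The first rule enters the \verb|STRING| start state when an opening
quote is seen; the remaining \verb|STRING| rules accumulate characters
until the closing quote returns the lexer to \verb|INITIAL|.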
\section{Regular expressions}

Regular expressions are a simple language for denoting classes of
strings. A regular expression is defined inductively over an alphabet
with a set of basic operations. The alphabet for ML-Lex is the ASCII
character set (character codes 0--127; or, if \verb|%full| is used,
0--255).

The syntax and semantics of regular expressions are described in order
of decreasing precedence (from the most tightly binding operators to
the most weakly binding):
\begin{itemize}
\item An individual character stands for itself, except for the
reserved characters \verb@? * + | ( ) ^ $ / ; . = < > [ { " \@

\item[$\backslash$] A backslash followed by one of the reserved
characters stands for that character.

\item A set of characters enclosed in square brackets \verb|[ ]|
stands for any one of those characters. Inside the brackets, only the
symbols \verb|\ - ^| are reserved. An initial up-arrow \verb|^| stands
for the complement of the characters listed, e.g. \verb|[^abc]| stands
for any character except a, b, or c. The hyphen \verb|-| denotes a
range of characters, e.g. \verb|[a-z]| stands for any lower-case
alphabetic character, and \verb|[0-9a-fA-F]| stands for any
hexadecimal digit. To include \verb|^| literally in a bracketed set,
put it anywhere but first; to include \verb|-| literally in a set, put
it first or last.

\item[\verb|.|] The dot \verb|.| character stands for any character
except newline, i.e. the same as \verb|[^\n]|.

\item The following special escape sequences are available, inside or
outside of square brackets:

\begin{tabular}{ll}
\verb|\b|& backspace\\
\verb|\n|& newline\\
\verb|\r|& carriage return\\
\verb|\t|& tab\\
\verb|\h|& stands for all characters with codes $>127$,\\
&~~~~ when 7-bit characters are used.\\
\verb|\ddd|& where \verb|ddd| is a 3-digit decimal escape.\\
\end{tabular}

\item[\verb|"|] A sequence of characters stands for itself (reserved
characters are taken literally) if it is enclosed in double quotes
\verb|" "|.

\item[\{\}] A named regular expression (defined in the definitions
section) may be referred to by enclosing its name in braces
\verb|{ }|.

\item[()] Any regular expression may be enclosed in parentheses
\verb|( )| for syntactic (but, as usual, not semantic) effect.

\item[\verb|*|] The postfix operator \verb|*| stands for Kleene
closure: zero or more repetitions of the preceding expression.
\item[\verb|+|] The postfix operator \verb|+| stands for one or more
repetitions of the preceding expression.

\item[\verb|?|] The postfix operator \verb|?| stands for zero or one
occurrence of the preceding expression.

\item A postfix repetition range $\{n_1,n_2\}$ where $n_1$ and $n_2$
are small integers stands for any number of repetitions between $n_1$
and $n_2$ of the preceding expression. The notation $\{n_1\}$ stands
for exactly $n_1$ repetitions.

\item Concatenation of expressions denotes concatenation of strings.
The expression $e_1 e_2$ stands for any string that results from the
concatenation of one string that matches $e_1$ with another string
that matches $e_2$.

\item[\verb-|-] The infix operator \verb-|- stands for alternation.
The expression $e_1$~\verb"|"~$e_2$ stands for anything that either
$e_1$ or $e_2$ stands for.

\item[\verb|/|] The infix operator \verb|/| denotes lookahead.
Lookahead is not implemented and cannot be used, because there is a
bug in the algorithm for generating lexers with lookahead. If it could
be used, the expression $e_1 / e_2$ would match any string that $e_1$
stands for, but only when that string is followed by a string that
matches $e_2$.

\item When the up-arrow \verb|^| occurs at the beginning of an
expression, that expression will only match strings that occur at the
beginning of a line (right after a newline character).

\item[\$] The dollar sign \verb|$| of C Lex is not implemented, since
it is an abbreviation for lookahead involving the newline character
(that is, it is an abbreviation for \verb|/\n|).
\end{itemize}

Here are some examples of regular expressions, and descriptions of the
sets of strings they denote:

\begin{tabular}{ll}
\verb~0 | 1 | 2 | 3~& A single digit between 0 and 3\\
\verb|[0123]|& A single digit between 0 and 3\\
\verb|0123|& The string ``0123''\\
\verb|0*|& All strings of 0 or more 0's\\
\verb|00*|& All strings of 1 or more 0's\\
\verb|0+|& All strings of 1 or more 0's\\
\verb|[0-9]{3}|& Any three-digit decimal number\\
\verb|\\[ntb]|& A backslash followed by n, t, or b\\
\verb|(00)*|& Any string with an even number of 0's
\end{tabular}

\section{ML-Lex syntax summary}

\subsection{User declarations}

Anything up to the first \verb|%%| is in the user declarations
section. Note that no symbolic identifier containing \verb|%%| can be
used in this section.

\subsection{ML-Lex definitions}

Start states can be defined with
\begin{quote}
\verb|%s| {\it identifier list} \verb|;|
\end{quote}

An identifier list consists of one or more identifiers. An identifier
consists of one or more letters, digits, underscores, or primes, and
must begin with a letter.

Named expressions can be defined with
\begin{quote}
{\it identifier} \verb|=| {\it regular expression} \verb|;|
\end{quote}

Regular expressions are described in the previous section.

The following \verb|%| commands are also available:

\begin{description}
\item[\tt \%reject] create the REJECT() function
\item[\tt \%count] count newlines using yylineno
\item[\tt \%posarg] pass an initial-position argument to makeLexer
\item[\tt \%full] create a lexer for the full 8-bit character set,
with characters in the range 0--255 permitted as input
\item[\tt \%structure \{identifier\}] name the structure in the output
program {\it identifier} instead of Mlex
\item[\tt \%header] use the code following it to create the header for
the lexer structure
\item[\tt \%arg] extra (curried) formal parameter argument to be
passed to the lex functions, and to be passed to the eof function in
place of ()
\item[\tt \%posint \{identifier\}] use the {\tt INTEGER} structure
{\it identifier} for the type of {\tt yypos}; use {\tt Int64} or
{\tt Position} to allow lexing of multi-gigabyte input files
\end{description}
These features are discussed in section~\ref{avail}.

\subsection{Rules}

Each rule has the format:

\begin{quote}
\verb|<|{\it start state list}\verb|>| {\it regular expression} \verb|=> (| {\it code} \verb|);|
\end{quote}

All parentheses in {\it code} must be balanced, including those used
in strings and comments.

The {\it start state list} is optional. It consists of a list of
identifiers separated by commas, and is delimited by angle brackets
\verb|< >|. Each identifier must be a start state defined with the
\verb|%s| command described above.

The regular expression is only recognized when the lexer is in one of
the start states in the start state list. If no start state list is
given, the expression is recognized in all start states.

The lexer begins in a pre-defined start state called \verb|INITIAL|.

The lexer resolves conflicts among rules by choosing the rule with the
longest match, and in the case that two rules match the same string,
choosing the rule listed first in the specification.

The rules should match all possible input. If some input occurs that
does not match any rule, the lexer created by ML-Lex will raise the
exception LexError. Note that this differs from C Lex, which prints
any unmatched input on the standard output.
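
For example, given the two rules below (a sketch; the tokens IF and ID
are assumed to be defined in the user declarations), the input
\verb|if| matches the first rule because it is listed first, while
\verb|iffy| matches the second rule by the longest-match preference:

\begin{verbatim}
"if"   => (IF);
[a-z]+ => (ID yytext);
\end{verbatim}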
\section{Values available inside the code associated with a rule}
\label{avail}

ML-Lex places the value of the string matched by a regular expression
in \verb|yytext|, a string variable.

The user may recursively call the lexing function with \verb|lex()|.
(If \verb|%arg| is used, the lexing function may be re-invoked with
the same argument by using \verb|continue()|.) This is convenient for
silently ignoring white space or comments:

\begin{verbatim}
[\ \t\n]+ => (lex());
\end{verbatim}

To switch start states, the user may call \verb|YYBEGIN| with the name
of a start state.

The following values will be available only if the corresponding
\verb|%| command is in the ML-Lex definitions section:

\begin{tabular}{lll}
\\
{\bf Value}&{\bf \% command}&{\bf Description}\\
\hline
{\tt REJECT} &{\tt \%reject}&\parbox[t]{2.6in}{{\tt REJECT()} causes
the current rule to be ``rejected.'' The lexer behaves as if the
current rule had not matched; another rule that matches this string,
or that matches the longest possible prefix of this string, is used
instead.} \\
{\tt yypos} & & \parbox[t]{2.6in}{The position of the first character
of {\tt yytext}, relative to the beginning of the file.}\\
{\tt yylineno} & {\tt \%count} & Current line number\\
\\
\end{tabular}

These values should be used only if necessary. Adding {\tt REJECT} to
a lexer will slow it down by 20\%; adding {\tt yylineno} will slow it
down by another 20\%, or more. (It is much more efficient to recognize
\verb|\n| and have an action that increments the line-number
variable.) The use of the lookahead operator {\tt /} will also slow
down the entire lexer. The character position, {\tt yypos}, is not
costly to maintain, however.

\paragraph{Bug.} The position of the first character in the file is
reported as 2 (unless the {\tt \%posarg} feature is used).
To preserve compatibility, this bug has not been fixed.

\section{Running ML-Lex}

From the Unix shell, run {\tt sml-lex~myfile.lex}.
The output file will be {\tt myfile.lex.sml}. The extension
{\tt .lex} is not required, but is recommended.

Within an interactive system [not the preferred method]:
use {\tt lexgen.sml}; this will create a structure LexGen. The
function LexGen.lexGen creates a program for a lexer from an input
specification. It takes a string argument---the name of the file
containing the input specification. The output file name is
determined by appending ``{\tt .sml}'' to the input file name.

\section{Using the program produced by ML-Lex}

When the output file is loaded, it will create a structure Mlex that
contains the function {\tt makeLexer}, which takes a function from
${\it int} \rightarrow {\it string}$ and returns a lexing function:

\begin{verbatim}
val makeLexer : (int->string) -> yyarg -> lexresult
\end{verbatim}
where {\tt yyarg} is the type given in the {\tt \%arg} directive, or
{\tt unit} if there is no {\tt \%arg} directive.

For example,

\begin{verbatim}
val lexer = Mlex.makeLexer (inputc (open_in "f"))
\end{verbatim}

creates a lexer that operates on the file named {\tt f}.

When the {\tt \%posarg} directive is used, the type of
{\tt makeLexer} is
\begin{verbatim}
val makeLexer : ((int->string)*int) -> yyarg -> lexresult
\end{verbatim}
where the extra {\tt int} argument, call it $k$, is one less than the
{\tt yypos} of the first character in the input. This would be used,
for example, when creating a lexer to start in the middle of a file,
when $k$ characters have already been read. At the beginning of the
file, $k=0$ should be used.

The ${\it int} \rightarrow {\it string}$ function should read a
string of characters from the input stream.
It should return a null string to indicate that the end of the stream
has been reached. The integer argument is the number of characters
that the lexer wishes to read; the function may return any non-zero
number of characters. For example,

\begin{verbatim}
val lexer =
  let val input_line = fn f =>
        let fun loop result =
              let val c = input (f,1)
                  val result = c :: result
              in if String.size c = 0 orelse c = "\n" then
                   String.implode (rev result)
                 else loop result
              end
        in loop nil
        end
  in Mlex.makeLexer (fn n => input_line std_in)
  end
\end{verbatim}

is appropriate for interactive streams where prompting, etc.\ occurs;
the lexer won't care that \verb|input_line| might return a string of
more than or fewer than $n$ characters.
The lexer tries to read a large number of characters from the input
function at once, and it is desirable that the input function return
as many as possible. Reading many characters at once makes the lexer
more efficient: fewer input calls and buffering operations are
needed, and input is more efficient in large block reads. For
interactive streams this is less of a concern, as the limiting factor
is the speed at which the user can type.
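
For reading from a file, an input function that honors the requested
count can be written with the Standard ML Basis Library (which
postdates this manual); here \verb|strm| is assumed to be an open
\verb|TextIO.instream|:

\begin{verbatim}
val lexer = Mlex.makeLexer (fn n => TextIO.inputN (strm, n))
\end{verbatim}

\verb|TextIO.inputN| returns up to $n$ characters and returns the
empty string at end of stream, which is exactly what the lexer
expects.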
To obtain a value, invoke the lexer by passing it a unit:

\begin{verbatim}
val nextToken = lexer()
\end{verbatim}
To restart the lexer, simply discard {\tt lexer} and create a new
lexer on the same stream with another call to {\tt makeLexer}. This
is the best way to discard any characters buffered internally by the
lexer.

All code in the user declarations section is placed inside a
structure UserDeclarations. To access this structure, use the path
name {\tt Mlex.UserDeclarations}.
If any input cannot be matched, the program will raise the exception
{\tt Mlex.LexError}. An internal error (i.e., a bug) will cause the
exception {\tt Internal.LexerError} to be raised.

If {\tt \%structure} is used, remember that the structure name will
no longer be Mlex, but the one specified in the command.
\section{Sample}

Here is a sample lexer for a calculator program:

\small
\begin{verbatim}
datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
                    NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES
val linenum = ref 1
val error = fn x => output(std_out,x ^ "\n")
val eof = fn () => EOF
%%
%structure CalcLex
alpha=[A-Za-z];
digit=[0-9];
ws = [\ \t];
%%
\n       => (inc linenum; lex());
{ws}+    => (lex());
"/"      => (DIV);
";"      => (EOS);
"("      => (LPAREN);
{digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
")"      => (RPAREN);
"+"      => (PLUS);
{alpha}+ => (if yytext="print" then PRINT else ID yytext);
"-"      => (SUB);
"*"      => (TIMES);
.        => (error ("calc: ignoring bad character "^yytext); lex());
\end{verbatim}
Here is the parser for the calculator:
\begin{verbatim}
(* Sample interactive calculator to demonstrate use of the lexer.

   The original grammar was:

     stmt_list -> stmt_list stmt
     stmt -> print exp ; | exp ;
     exp -> exp + t | exp - t | t
     t -> t * f | t / f | f
     f -> ( exp ) | id | num

   The function parse takes a stream and parses it for the calculator
   program.

   If a syntax error occurs, parse prints an error message and calls
   itself on the stream. On this system that has the effect of
   ignoring all input to the end of a line.
*)
structure Calc =
  struct
    open CalcLex
    open UserDeclarations
    exception Error
    fun parse strm =
      let
        val say = fn s => output(std_out,s)
        val input_line = fn f =>
          let fun loop result =
                let val c = input (f,1)
                    val result = c :: result
                in if String.size c = 0 orelse c = "\n" then
                     String.implode (rev result)
                   else loop result
                end
          in loop nil
          end
        val lexer = makeLexer (fn n => input_line strm)
        val nexttok = ref (lexer())
        val advance = fn () => (nexttok := lexer(); !nexttok)
        val error = fn () => (say ("calc: syntax error on line " ^
                              (makestring(!linenum)) ^ "\n"); raise Error)
        val lookup = fn i =>
          if i = "ONE" then 1
          else if i = "TWO" then 2
          else (say ("calc: unknown identifier '" ^ i ^ "'\n");
                raise Error)
        fun STMT_LIST () =
          case !nexttok of
            EOF => ()
          | _ => (STMT(); STMT_LIST())
        and STMT() =
          (case !nexttok
            of EOS => ()
             | PRINT => (advance();
                         say ((makestring (E():int)) ^ "\n"); ())
             | _ => (E(); ());
           case !nexttok
            of EOS => (advance(); ())
             | _ => error())
        and E () = E' (T())
        and E' (i : int) =
          case !nexttok of
            PLUS => (advance (); E'(i+T()))
          | SUB => (advance (); E'(i-T()))
          | RPAREN => i
          | EOF => i
          | EOS => i
          | _ => error()
        and T () = T'(F())
        and T' i =
          case !nexttok of
            PLUS => i
          | SUB => i
          | TIMES => (advance(); T'(i*F()))
          | DIV => (advance (); T'(i div F()))
          | EOF => i
          | EOS => i
          | RPAREN => i
          | _ => error()
        and F () =
          case !nexttok of
            ID i => (advance(); lookup i)
          | LPAREN =>
              let val v = (advance(); E())
              in if !nexttok = RPAREN then (advance (); v) else error()
              end
          | NUM i => (advance(); i)
          | _ => error()
      in STMT_LIST () handle Error => parse strm
      end
  end
\end{verbatim}
\end{document}