Import Debian changes 20180207-1
[hcoop/debian/mlton.git] / mllex / lexgen.doc
CommitLineData
7f918cf1
CE
1 A lexical analyzer generator for Standard ML.
2
3THIS TEXT FILE IS OBSOLETE and IS NOT MAINTAINED.
4The current (maintained) documentation is in lexgen.tex.
5
6
7 Andrew W. Appel
8 James S. Mattson
9 David R. Tarditi
10
11 Princeton University
12
13 Version 1.6, October 1994
14
15Copyright (c) 1989-1992 by Andrew W. Appel, James S. Mattson, David R. Tarditi
16
17This software comes with ABSOLUTELY NO WARRANTY.
18This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
19COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
20distributed with this software). You may copy and distribute this software;
21see the COPYRIGHT NOTICE for details and restrictions.
22
23I. General Description
24
25Computer programs often need to divide their input into words and
26distinguish between different kinds of words. Compilers, for
27example, need to distinguish between integers, reserved words, and
28identifiers. Applications programs often need to be able to
29recognize components of typed commands from users.
30
31The problem of segmenting input into words and recognizing classes of
32words is known as lexical analysis. Small cases of this problem,
33such as reading text strings separated by spaces, can be solved by
34using hand-written programs. Larger cases of this problem, such as
35tokenizing an input stream for a compiler, can also be solved using
36hand-written programs.
37
38A hand-written program for a large lexical analysis problem, however,
39suffers from two major problems. First, the program requires a fair
40amount of programmer time to create. Second, the description of
41classes of words is not explicit in the program. It must be inferred
42from the program code. This makes it difficult to verify if the
43program recognizes the correct words for each class. It also makes
44future maintenance of the program difficult.
45
46Lex, a programming tool for the Unix system, is a successful solution
47to the general problem of lexical analysis. It uses regular
48expressions to describe classes of words. A program fragment is
49associated with each class of words. This information is given to
50Lex as a specification (a Lex program). Lex produces a program for a
51function that can be used to perform lexical analysis.
52
53The function operates as follows. It finds the longest word starting
54from the current position in the input stream that is in one of the
55word classes. It executes the program fragment associated with the
56class, and sets the current position in the input stream to be the
57character after the word. The program fragment has the actual text
58of the word available to it, and may be any piece of code. For many
59applications it returns some kind of value.
60
61Lex allows the programmer to make the language description explicit,
62and to concentrate on what to do with the recognized words, not how
63to recognize the words. It saves programmer time and increases
64program maintainability.
65
66Unfortunately, Lex is targeted only C. It also places artificial
67limits on the size of strings that can be recognized.
68
69ML-Lex is a variant of Lex for the ML programming language. ML-Lex
70has a syntax similar to Lex, and produces an ML program instead of a
71C program. ML-Lex produces a program that runs very efficiently.
72Typically the program will be as fast or even faster than a
73hand-coded lexer implemented in Standard ML.
74
75The program typically uses only a small amount of space.
76ML-Lex thus allows ML programmers the same benefits that Lex allows C
77programmers. It also does not place artificial limits on the size of
78recognized strings.
79
80II. ML-Lex specifications
81
82An ML-Lex specification has the general format:
83
84 {user declarations}
85 %%
86 {ML-Lex definitions}
87 %%
88 {rules}
89
90Each section is separated from the others by a '%%' delimiter.
91
92The rules are used to define the lexical analysis function. Each
93rule has two parts - a regular expression and an action. The regular
94expression defines the word class that a rule matches. The action is
95a program fragment to be executed when a rule matches the input. The
96actions are used to compute values, and must all return values of the
97same type.
98
99The user can define values available to all rule actions in the user
100declarations section. The user must define two values in this
101section - a type lexresult and a function eof. Lexresult defines the
102type of values returned by the rule actions. The function "eof" is
103called by the lexer when the end of the input stream is reached. It
104will typically return a value signalling eof or raise an exception.
105It is called with the same argument as lex (see %arg, below),
106and must return a value of type lexresult.
107
108In the definitions section, the user can define named regular
109expressions, a set of start states, and specify which of the various
110bells and whistles of ML-Lex are desired.
111
112The start states allow the user to control when certain rules are
113matched. Rules may be defined to match only when the lexer is in
114specific start states. The user may change the lexer's start state
115in a rule action. This allows the user to specify special handling
116of lexical objects.
117
118This feature is typically used to handle quoted strings with escapes
119to denote special characters. The rules to recognize the inside
120contents of a string are defined for only one start state. This
121start state is entered when the beginning of a string is recognized,
122and exited when the end of the string is recognized.
123
124III. Regular expressions.
125
126Regular expressions are a simple language for denoting classes of
127strings. A regular expression is defined inductively over an
128alphabet with a set of basic operations. The alphabet for ML-Lex is
129the Ascii character set (character codes 0-127; or if %full is used, 0-255).
130
131The syntax and semantics of regular expressions will be described in
132order of decreasing precedence (from the most tightly-binding operators
133to the most weakly-binding):
134
135 An individual character stands for itself, except for the
136 reserved characters ? * + | ( ) ^ $ / ; . = < > [ { " \
137
138 A backslash followed by one of the reserved characters stands
139 for that character.
140
141 A set of characters enclosed in square brackets [ ] stands
142 for any one of those characters. Inside the brackets, only
143 the symbols \ - ^ are reserved. An initial up-arrow ^ stands
144 for the complement of the characters listed, e.g. [^abc]
145 stands any character except a, b, or c. The hyphen - denotes
146 a range of characters, e.g. [a-z] stands for any lower-case
147 alphabetic character, and [0-9a-fA-F] stands for any hexadecimal
148 digit. To include ^ literally in a bracketed set, put it anywhere
149 but first; to include - literally in a set, put it first or last.
150
151 The dot . character stands for any character except newline,
152 i.e. the same as [^\n]
153
154 The following special escape sequences are available, inside
155 or outside of square-brackets:
156 \b - backspace
157 \n - newline
158 \t - tab
159 \h - stands for all characters with codes >127,
160 when 7-bit characters are used.
161 \ddd - where ddd is a 3 digit decimal escape.
162
163 A sequence of characters will stand for itself (reserved
164 characters will be taken literally) if it is enclosed in
165 double quotes " ".
166
167 A named regular expression (defined in the "definitions"
168 section) may be referred to by enclosing its name in
169 braces { }.
170
171 Any regular expression may be enclosed in parentheses ( )
172 for syntactic (but, as usual, not semantic) effect.
173
174 The postfix operator * stands for Kleene closure:
175 zero or more repetitions of the preceding expression.
176
177 The postfix operator + stands for one or more repetitions
178 of the preceding expression.
179
180 The postfix operator ? stands for zero or one occurrence of
181 the preceding expression.
182
183 A postfix repetition range {n1,n2} where n1 and n2 are small
184 integers stands for any number of repetitions between n1 and n2
185 of the preceding expression. The notation {n1} stands for
186 exactly n1 repetitions.
187
188 Concatenation of expressions denotes concatenation of strings.
189 The expression e1 e2 stands for any string that results from
190 the concatenation of one string that matches e1 with another
191 string that matches e2.
192
193 The infix operator | stands for alternation. The expression
194 e1 | e2 stands for anything that either e1 or e2 stands for.
195
196 The infix operator / denotes lookahead. Lookahead is not
197 implemented and cannot be used, because there is a bug
198 in the algorithm for generating lexers with lookahead. If
199 it could be used, the expression e1 / e2 would match any string
200 that e1 stands for, but only when that string is followed by a
201 string that matches e2.
202
203 When the up-arrow ^ occurs at the beginning of an expression,
204 that expression will only match strings that occur at the
205 beginning of a line (right after a newline character).
206
207 The dollar sign $ is not implemented, since it is an abbreviation
208 for lookahead involving the newline character (that is, it
209 is an abbreviation /\n). If it could be used, when the dollar
210 sign $ occurred at the end of an expression, that expression
211 would only match strings that occur at the end of a line
212 (right before a newline character).
213
214Here are some examples of regular expressions, and descriptions of the
215set of strings they denote:
216
217 0 | 1 | 2 | 3 A single digit between 0 and 3
218 [0123] A single digit between 0 and 3
219 0123 The string "0123"
220 0* All strings of 0 or more 0's
221 00* All strings of 1 or more 0's
222 0+ All strings of 1 or more 0's
223 [0-9]{3} Any three-digit decimal number.
224 \\[ntb] The strings "\n" "\t" "\b"
225 (00)* Any string with an even number of 0's.
226
227IV. ML-Lex syntax summary
228
229A. User declarations
230
231Anything up to the first %% is in the user declarations section. The
232user should note that no symbolic identifier containing '%%' can be
233used in this section.
234
235B. ML-Lex definitions
236
237Start states can be defined with
238
239 %s {identifier list} ;
240
241 or %S {identifier list} ;
242
243An identifier list consists of one or more identifiers.
244
245An identifier consists of one or more letters, digits, underscores,
246or primes. It must begin with a letter.
247
248Named expressions can be defined with
249
250 {identifier} = {regular expression} ;
251
252Regular expressions are defined below.
253
254The following % commands are also available:
255
256 %reject - create REJECT() function
257 %count - count newlines using yylineno
258 %full - create lexer for the full 8-bit character set,
259 with characters in the range 0-255 permitted
260 as input.
261 %structure {identifier} - name the structure in the output program
262 {identifier} instead of Mlex
263 %header - use code following it to create header for lexer
264 structure
265 %arg - extra (curried) formal parameter argument to be
266 passed to the lex functions, and to be passed
267 to the eof function in place of ()
268 These functions are discussed below, under values available to
269 actions.
270
271C. Rules
272
273Each rule has the format:
274
275 <start state list> {regular expression} => ( ... code ... );
276
277All parentheses in ... code ... must be balanced, including those
278used in strings and comments.
279
280The start state list is optional. It consists of a list of
281identifiers separated by commas, and is delimited by triangle
282brackets < >. Each identifier must be a start state defined in the
283%s section above.
284
285The regular expression is only recognized when the lexer is in one of
286the start states in the start state list. If no start state list is
287given, the expression is recognized in all start states.
288
289The lexer begins in a pre-defined start state called INITIAL.
290
291The lexer resolves conflicts among rules by choosing the rule with
292the longest match, and in the case two rules match the same string,
293choosing the rule listed first in the specification.
294
295The rules should match all possible input. If some input occurs that
296does not match any rule, the lexer created by ML-Lex will raise an
297exception LexError. Note that this differs from C Lex, which prints
298any unmatched input on the standard output.
299
300V. Values available inside the code associated with a rule.
301
302Mlex places the value of the string matched by a regular expression
303in yytext, a string variable.
304
305The user may recursively
306call the lexing function with lex(). (If %arg is used, the
307lexing function may be re-invoked with the same argument by using
308continue().) This is convenient for ignoring white space or comments silently:
309
310 [\ \t\n]+ => ( lex());
311
312To switch start states, the user may call YYBEGIN with the name of a
313start state.
314
315The following values will be available only if the corresponding %
316command is in the ML-Lex definitions sections:
317
318 value %command description
319 ----- -------- -----------
320 REJECT %reject REJECT() causes the current
321 rule to be "rejected."
322 The lexer behaves as if the
323 current rule had not matched;
324 another rule that matches this
325 string, or that matches the longest
326 possible prefix of this string,
327 is used instead.
328
329 yypos Current character position from
330 beginning of file.
331
332 yylineno %count Current line number
333
334
335These values should be used only if necessary. Adding REJECT to a
336lexer will slow it down by 20%; adding yylineno will slow it down by
337another 20%, or more. (It is much more efficient to recognize \n and
338have an action that increments the line-number variable.) The use of
339the lookahead operator / will also slow down the entire lexer.
340The character-position, yypos, is not costly to maintain, however.
341
342VI. Running ML-Lex
343
344From the Unix shell, run sml-lex myfile.lex
345The output file will be myfile.lex.sml. The extension ".lex" is not
346required but is recommended.
347
348Within an interactive system [not the preferred method]:
349Use "lexgen.sml"; this will create a structure LexGen. The function
350LexGen.lexGen creates a program for a lexer from an input
351specification. It takes a string argument -- the name of the file
352containing the input specification. The output file name is
353determined by appending ".sml" to the input file name.
354
355VII. Using the program produced by ML-Lex.
356
357When the output file is loaded, it will create a structure Mlex that
358contains the function makeLexer. makeLexer takes a function from int
359-> string and returns a lexing function.
360
361For example,
362
363 val lexer = Mlex.makeLexer (inputc (open_in "f"))
364
365creates a lexer that operates on the file whose name is f.
366
367The function from int -> string should read a string of characters
368from the input stream. It should return a null string to indicate
369that the end of the stream has been reached. The integer is the
370number of characters that the lexer wishes to read; the function may
371return any non-zero number of characters. For example,
372
373 val lexer =
374 let val input_line = fn f =>
375 let fun loop result =
376 let val c = input (f,1)
377 val result = c :: result
378 in if String.size c = 0 orelse c = "\n" then
379 String.implode (rev result)
380 else loop result
381 end
382 in loop nil
383 end
384 in Mlex.makeLexer (fn n => input_line std_in)
385 end
386is appropriate for interactive streams where prompting, etc. occurs;
387the lexer won't care that input_line might return a string of more
388than or less than n characters.
389
390The lexer tries to read a large number of characters from the input
391function at once, and it is desirable that the input function return
392as many as possible. Reading many characters at once makes the lexer
393more efficient. Fewer input calls and buffering operations are
394needed, and input is more efficient in large block reads. For
395interactive streams this is less of a concern, as the limiting factor
396is the speed at which the user can type.
397
398To obtain a value, invoke the lexer by passing it a unit:
399
400 val nextToken = lexer()
401
402If one wanted to restart the lexer, one would just discard "lexer"
403and create a new lexer on the same stream with another call to
404makeLexer. This is the best way to discard any characters buffered
405internally by the lexer.
406
407All code in the user declarations section is placed inside a
408structure UserDeclarations. To access this structure, use the path name
409Mlex.UserDeclarations.
410
411If any input cannot be matched, the program will raise the exception
412Mlex.LexError. An internal error (i.e. bug) will cause the
413exception Internal.LexerError to be raised.
414
415If %structure is used, remember that the structure name will no
416longer be Mlex, but the one specified in the command.
417
418VIII. Sample
419
420Here is a sample lexer for a calculator program:
421
422datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
423 NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES
424
425val linenum = ref 1
426val error = fn x => output(std_out,x ^ "\n")
427val eof = fn () => EOF
428%%
429%structure CalcLex
430alpha=[A-Za-z];
431digit=[0-9];
432ws = [\ \t];
433%%
434\n => (inc linenum; lex());
435{ws}+ => (lex());
436"/" => (DIV);
437";" => (EOS);
438"(" => (LPAREN);
439{digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
440")" => (RPAREN);
441"+" => (PLUS);
442{alpha}+ => (if yytext="print" then PRINT else ID yytext);
443"-" => (SUB);
444"*" => (TIMES);
445. => (error ("calc: ignoring bad character "^yytext); lex());
446
447
448Here is the parser for the calculator:
449
450(* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex
451
452 The original grammar was
453
454 stmt_list -> stmt_list stmt
455 stmt -> print exp ; | exp ;
456 exp -> exp + t | exp - t | t
457 t -> t * f | t/f | f
458 f -> (exp) | id | num
459
460 The function parse takes a stream and parses it for the calculator
461 program.
462
463 If a syntax error occurs, parse prints an error message and calls itself
464 on the stream. On this system that has the effect of ignoring all input
465 to the end of a line.
466*)
467
468structure Calc =
469 struct
470 open CalcLex
471 open UserDeclarations
472 exception Error
473 fun parse strm =
474 let
475 val say = fn s => output(std_out,s)
476 val input_line = fn f =>
477 let fun loop result =
478 let val c = input (f,1)
479 val result = c :: result
480 in if String.size c = 0 orelse c = "\n" then
481 String.implode (rev result)
482 else loop result
483 end
484 in loop nil
485 end
486 val lexer = makeLexer (fn n => input_line strm)
487 val nexttok = ref (lexer())
488 val advance = fn () => (nexttok := lexer(); !nexttok)
489 val error = fn () => (say ("calc: syntax error on line" ^
490 (makestring(!linenum)) ^ "\n"); raise Error)
491 val lookup = fn i =>
492 if i = "ONE" then 1
493 else if i = "TWO" then 2
494 else (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
495 fun STMT_LIST () =
496 case !nexttok of
497 EOF => ()
498 | _ => (STMT(); STMT_LIST())
499
500 and STMT() =
501 (case !nexttok
502 of EOS => ()
503 | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
504 | _ => (E(); ());
505 case !nexttok
506 of EOS => (advance())
507 | _ => error())
508 and E () = E' (T())
509 and E' (i : int ) =
510 case !nexttok of
511 PLUS => (advance (); E'(i+T()))
512 | SUB => (advance (); E'(i-T()))
513 | RPAREN => i
514 | EOF => i
515 | EOS => i
516 | _ => error()
517 and T () = T'(F())
518 and T' i =
519 case !nexttok of
520 PLUS => i
521 | SUB => i
522 | TIMES => (advance(); T'(i*F()))
523 | DIV => (advance (); T'(i div F()))
524 | EOF => i
525 | EOS => i
526 | RPAREN => i
527 | _ => error()
528 and F () =
529 case !nexttok of
530 ID i => (advance(); lookup i)
531 | LPAREN =>
532 let val v = (advance(); E())
533 in if !nexttok = RPAREN then (advance (); v) else error()
534 end
535 | NUM i => (advance(); i)
536 | _ => error()
537 in STMT_LIST () handle Error => parse strm
538 end
539 end