Backport from sid to buster
[hcoop/debian/mlton.git] / mllex / lexgen.doc
1 A lexical analyzer generator for Standard ML.
2
3 THIS TEXT FILE IS OBSOLETE and IS NOT MAINTAINED.
4 The current (maintained) documentation is in lexgen.tex.
5
6
7 Andrew W. Appel
8 James S. Mattson
9 David R. Tarditi
10
11 Princeton University
12
13 Version 1.6, October 1994
14
15 Copyright (c) 1989-1992 by Andrew W. Appel, James S. Mattson, David R. Tarditi
16
17 This software comes with ABSOLUTELY NO WARRANTY.
18 This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
19 COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
20 distributed with this software). You may copy and distribute this software;
21 see the COPYRIGHT NOTICE for details and restrictions.
22
23 I. General Description
24
25 Computer programs often need to divide their input into words and
26 distinguish between different kinds of words. Compilers, for
27 example, need to distinguish between integers, reserved words, and
28 identifiers. Applications programs often need to be able to
29 recognize components of typed commands from users.
30
31 The problem of segmenting input into words and recognizing classes of
32 words is known as lexical analysis. Small cases of this problem,
33 such as reading text strings separated by spaces, can be solved by
34 using hand-written programs. Larger cases of this problem, such as
35 tokenizing an input stream for a compiler, can also be solved using
36 hand-written programs.
37
38 A hand-written program for a large lexical analysis problem, however,
39 suffers from two major problems. First, the program requires a fair
40 amount of programmer time to create. Second, the description of
41 classes of words is not explicit in the program. It must be inferred
42 from the program code. This makes it difficult to verify if the
43 program recognizes the correct words for each class. It also makes
44 future maintenance of the program difficult.
45
46 Lex, a programming tool for the Unix system, is a successful solution
47 to the general problem of lexical analysis. It uses regular
48 expressions to describe classes of words. A program fragment is
49 associated with each class of words. This information is given to
50 Lex as a specification (a Lex program). Lex produces a program for a
51 function that can be used to perform lexical analysis.
52
53 The function operates as follows. It finds the longest word starting
54 from the current position in the input stream that is in one of the
55 word classes. It executes the program fragment associated with the
56 class, and sets the current position in the input stream to be the
57 character after the word. The program fragment has the actual text
58 of the word available to it, and may be any piece of code. For many
59 applications it returns some kind of value.
60
61 Lex allows the programmer to make the language description explicit,
62 and to concentrate on what to do with the recognized words, not how
63 to recognize the words. It saves programmer time and increases
64 program maintainability.
65
66 Unfortunately, Lex is targeted only C. It also places artificial
67 limits on the size of strings that can be recognized.
68
69 ML-Lex is a variant of Lex for the ML programming language. ML-Lex
70 has a syntax similar to Lex, and produces an ML program instead of a
71 C program. ML-Lex produces a program that runs very efficiently.
72 Typically the program will be as fast or even faster than a
73 hand-coded lexer implemented in Standard ML.
74
75 The program typically uses only a small amount of space.
76 ML-Lex thus allows ML programmers the same benefits that Lex allows C
77 programmers. It also does not place artificial limits on the size of
78 recognized strings.
79
80 II. ML-Lex specifications
81
82 An ML-Lex specification has the general format:
83
84 {user declarations}
85 %%
86 {ML-Lex definitions}
87 %%
88 {rules}
89
90 Each section is separated from the others by a '%%' delimiter.
91
92 The rules are used to define the lexical analysis function. Each
93 rule has two parts - a regular expression and an action. The regular
94 expression defines the word class that a rule matches. The action is
95 a program fragment to be executed when a rule matches the input. The
96 actions are used to compute values, and must all return values of the
97 same type.
98
99 The user can define values available to all rule actions in the user
100 declarations section. The user must define two values in this
101 section - a type lexresult and a function eof. Lexresult defines the
102 type of values returned by the rule actions. The function "eof" is
103 called by the lexer when the end of the input stream is reached. It
104 will typically return a value signalling eof or raise an exception.
105 It is called with the same argument as lex (see %arg, below),
106 and must return a value of type lexresult.
107
108 In the definitions section, the user can define named regular
109 expressions, a set of start states, and specify which of the various
110 bells and whistles of ML-Lex are desired.
111
112 The start states allow the user to control when certain rules are
113 matched. Rules may be defined to match only when the lexer is in
114 specific start states. The user may change the lexer's start state
115 in a rule action. This allows the user to specify special handling
116 of lexical objects.
117
118 This feature is typically used to handle quoted strings with escapes
119 to denote special characters. The rules to recognize the inside
120 contents of a string are defined for only one start state. This
121 start state is entered when the beginning of a string is recognized,
122 and exited when the end of the string is recognized.
123
124 III. Regular expressions.
125
126 Regular expressions are a simple language for denoting classes of
127 strings. A regular expression is defined inductively over an
128 alphabet with a set of basic operations. The alphabet for ML-Lex is
129 the Ascii character set (character codes 0-127; or if %full is used, 0-255).
130
131 The syntax and semantics of regular expressions will be described in
132 order of decreasing precedence (from the most tightly-binding operators
133 to the most weakly-binding):
134
135 An individual character stands for itself, except for the
136 reserved characters ? * + | ( ) ^ $ / ; . = < > [ { " \
137
138 A backslash followed by one of the reserved characters stands
139 for that character.
140
141 A set of characters enclosed in square brackets [ ] stands
142 for any one of those characters. Inside the brackets, only
143 the symbols \ - ^ are reserved. An initial up-arrow ^ stands
144 for the complement of the characters listed, e.g. [^abc]
145 stands any character except a, b, or c. The hyphen - denotes
146 a range of characters, e.g. [a-z] stands for any lower-case
147 alphabetic character, and [0-9a-fA-F] stands for any hexadecimal
148 digit. To include ^ literally in a bracketed set, put it anywhere
149 but first; to include - literally in a set, put it first or last.
150
151 The dot . character stands for any character except newline,
152 i.e. the same as [^\n]
153
154 The following special escape sequences are available, inside
155 or outside of square-brackets:
156 \b - backspace
157 \n - newline
158 \t - tab
159 \h - stands for all characters with codes >127,
160 when 7-bit characters are used.
161 \ddd - where ddd is a 3 digit decimal escape.
162
163 A sequence of characters will stand for itself (reserved
164 characters will be taken literally) if it is enclosed in
165 double quotes " ".
166
167 A named regular expression (defined in the "definitions"
168 section) may be referred to by enclosing its name in
169 braces { }.
170
171 Any regular expression may be enclosed in parentheses ( )
172 for syntactic (but, as usual, not semantic) effect.
173
174 The postfix operator * stands for Kleene closure:
175 zero or more repetitions of the preceding expression.
176
177 The postfix operator + stands for one or more repetitions
178 of the preceding expression.
179
180 The postfix operator ? stands for zero or one occurrence of
181 the preceding expression.
182
183 A postfix repetition range {n1,n2} where n1 and n2 are small
184 integers stands for any number of repetitions between n1 and n2
185 of the preceding expression. The notation {n1} stands for
186 exactly n1 repetitions.
187
188 Concatenation of expressions denotes concatenation of strings.
189 The expression e1 e2 stands for any string that results from
190 the concatenation of one string that matches e1 with another
191 string that matches e2.
192
193 The infix operator | stands for alternation. The expression
194 e1 | e2 stands for anything that either e1 or e2 stands for.
195
196 The infix operator / denotes lookahead. Lookahead is not
197 implemented and cannot be used, because there is a bug
198 in the algorithm for generating lexers with lookahead. If
199 it could be used, the expression e1 / e2 would match any string
200 that e1 stands for, but only when that string is followed by a
201 string that matches e2.
202
203 When the up-arrow ^ occurs at the beginning of an expression,
204 that expression will only match strings that occur at the
205 beginning of a line (right after a newline character).
206
207 The dollar sign $ is not implemented, since it is an abbreviation
208 for lookahead involving the newline character (that is, it
209 is an abbreviation /\n). If it could be used, when the dollar
210 sign $ occurred at the end of an expression, that expression
211 would only match strings that occur at the end of a line
212 (right before a newline character).
213
214 Here are some examples of regular expressions, and descriptions of the
215 set of strings they denote:
216
217 0 | 1 | 2 | 3 A single digit between 0 and 3
218 [0123] A single digit between 0 and 3
219 0123 The string "0123"
220 0* All strings of 0 or more 0's
221 00* All strings of 1 or more 0's
222 0+ All strings of 1 or more 0's
223 [0-9]{3} Any three-digit decimal number.
224 \\[ntb] The strings "\n" "\t" "\b"
225 (00)* Any string with an even number of 0's.
226
227 IV. ML-Lex syntax summary
228
229 A. User declarations
230
231 Anything up to the first %% is in the user declarations section. The
232 user should note that no symbolic identifier containing '%%' can be
233 used in this section.
234
235 B. ML-Lex definitions
236
237 Start states can be defined with
238
239 %s {identifier list} ;
240
241 or %S {identifier list} ;
242
243 An identifier list consists of one or more identifiers.
244
245 An identifier consists of one or more letters, digits, underscores,
246 or primes. It must begin with a letter.
247
248 Named expressions can be defined with
249
250 {identifier} = {regular expression} ;
251
252 Regular expressions are defined below.
253
254 The following % commands are also available:
255
256 %reject - create REJECT() function
257 %count - count newlines using yylineno
258 %full - create lexer for the full 8-bit character set,
259 with characters in the range 0-255 permitted
260 as input.
261 %structure {identifier} - name the structure in the output program
262 {identifier} instead of Mlex
263 %header - use code following it to create header for lexer
264 structure
265 %arg - extra (curried) formal parameter argument to be
266 passed to the lex functions, and to be passed
267 to the eof function in place of ()
268 These functions are discussed below, under values available to
269 actions.
270
271 C. Rules
272
273 Each rule has the format:
274
275 <start state list> {regular expression} => ( ... code ... );
276
277 All parentheses in ... code ... must be balanced, including those
278 used in strings and comments.
279
280 The start state list is optional. It consists of a list of
281 identifiers separated by commas, and is delimited by triangle
282 brackets < >. Each identifier must be a start state defined in the
283 %s section above.
284
285 The regular expression is only recognized when the lexer is in one of
286 the start states in the start state list. If no start state list is
287 given, the expression is recognized in all start states.
288
289 The lexer begins in a pre-defined start state called INITIAL.
290
291 The lexer resolves conflicts among rules by choosing the rule with
292 the longest match, and in the case two rules match the same string,
293 choosing the rule listed first in the specification.
294
295 The rules should match all possible input. If some input occurs that
296 does not match any rule, the lexer created by ML-Lex will raise an
297 exception LexError. Note that this differs from C Lex, which prints
298 any unmatched input on the standard output.
299
300 V. Values available inside the code associated with a rule.
301
302 Mlex places the value of the string matched by a regular expression
303 in yytext, a string variable.
304
305 The user may recursively
306 call the lexing function with lex(). (If %arg is used, the
307 lexing function may be re-invoked with the same argument by using
308 continue().) This is convenient for ignoring white space or comments silently:
309
310 [\ \t\n]+ => ( lex());
311
312 To switch start states, the user may call YYBEGIN with the name of a
313 start state.
314
315 The following values will be available only if the corresponding %
316 command is in the ML-Lex definitions sections:
317
318 value %command description
319 ----- -------- -----------
320 REJECT %reject REJECT() causes the current
321 rule to be "rejected."
322 The lexer behaves as if the
323 current rule had not matched;
324 another rule that matches this
325 string, or that matches the longest
326 possible prefix of this string,
327 is used instead.
328
329 yypos Current character position from
330 beginning of file.
331
332 yylineno %count Current line number
333
334
335 These values should be used only if necessary. Adding REJECT to a
336 lexer will slow it down by 20%; adding yylineno will slow it down by
337 another 20%, or more. (It is much more efficient to recognize \n and
338 have an action that increments the line-number variable.) The use of
339 the lookahead operator / will also slow down the entire lexer.
340 The character-position, yypos, is not costly to maintain, however.
341
342 VI. Running ML-Lex
343
344 From the Unix shell, run sml-lex myfile.lex
345 The output file will be myfile.lex.sml. The extension ".lex" is not
346 required but is recommended.
347
348 Within an interactive system [not the preferred method]:
349 Use "lexgen.sml"; this will create a structure LexGen. The function
350 LexGen.lexGen creates a program for a lexer from an input
351 specification. It takes a string argument -- the name of the file
352 containing the input specification. The output file name is
353 determined by appending ".sml" to the input file name.
354
355 VII. Using the program produced by ML-Lex.
356
357 When the output file is loaded, it will create a structure Mlex that
358 contains the function makeLexer. makeLexer takes a function from int
359 -> string and returns a lexing function.
360
361 For example,
362
363 val lexer = Mlex.makeLexer (inputc (open_in "f"))
364
365 creates a lexer that operates on the file whose name is f.
366
367 The function from int -> string should read a string of characters
368 from the input stream. It should return a null string to indicate
369 that the end of the stream has been reached. The integer is the
370 number of characters that the lexer wishes to read; the function may
371 return any non-zero number of characters. For example,
372
373 val lexer =
374 let val input_line = fn f =>
375 let fun loop result =
376 let val c = input (f,1)
377 val result = c :: result
378 in if String.size c = 0 orelse c = "\n" then
379 String.implode (rev result)
380 else loop result
381 end
382 in loop nil
383 end
384 in Mlex.makeLexer (fn n => input_line std_in)
385 end
386 is appropriate for interactive streams where prompting, etc. occurs;
387 the lexer won't care that input_line might return a string of more
388 than or less than n characters.
389
390 The lexer tries to read a large number of characters from the input
391 function at once, and it is desirable that the input function return
392 as many as possible. Reading many characters at once makes the lexer
393 more efficient. Fewer input calls and buffering operations are
394 needed, and input is more efficient in large block reads. For
395 interactive streams this is less of a concern, as the limiting factor
396 is the speed at which the user can type.
397
398 To obtain a value, invoke the lexer by passing it a unit:
399
400 val nextToken = lexer()
401
402 If one wanted to restart the lexer, one would just discard "lexer"
403 and create a new lexer on the same stream with another call to
404 makeLexer. This is the best way to discard any characters buffered
405 internally by the lexer.
406
407 All code in the user declarations section is placed inside a
408 structure UserDeclarations. To access this structure, use the path name
409 Mlex.UserDeclarations.
410
411 If any input cannot be matched, the program will raise the exception
412 Mlex.LexError. An internal error (i.e. bug) will cause the
413 exception Internal.LexerError to be raised.
414
415 If %structure is used, remember that the structure name will no
416 longer be Mlex, but the one specified in the command.
417
418 VIII. Sample
419
420 Here is a sample lexer for a calculator program:
421
422 datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
423 NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES
424
425 val linenum = ref 1
426 val error = fn x => output(std_out,x ^ "\n")
427 val eof = fn () => EOF
428 %%
429 %structure CalcLex
430 alpha=[A-Za-z];
431 digit=[0-9];
432 ws = [\ \t];
433 %%
434 \n => (inc linenum; lex());
435 {ws}+ => (lex());
436 "/" => (DIV);
437 ";" => (EOS);
438 "(" => (LPAREN);
439 {digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
440 ")" => (RPAREN);
441 "+" => (PLUS);
442 {alpha}+ => (if yytext="print" then PRINT else ID yytext);
443 "-" => (SUB);
444 "*" => (TIMES);
445 . => (error ("calc: ignoring bad character "^yytext); lex());
446
447
448 Here is the parser for the calculator:
449
450 (* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex
451
452 The original grammar was
453
454 stmt_list -> stmt_list stmt
455 stmt -> print exp ; | exp ;
456 exp -> exp + t | exp - t | t
457 t -> t * f | t/f | f
458 f -> (exp) | id | num
459
460 The function parse takes a stream and parses it for the calculator
461 program.
462
463 If a syntax error occurs, parse prints an error message and calls itself
464 on the stream. On this system that has the effect of ignoring all input
465 to the end of a line.
466 *)
467
468 structure Calc =
469 struct
470 open CalcLex
471 open UserDeclarations
472 exception Error
473 fun parse strm =
474 let
475 val say = fn s => output(std_out,s)
476 val input_line = fn f =>
477 let fun loop result =
478 let val c = input (f,1)
479 val result = c :: result
480 in if String.size c = 0 orelse c = "\n" then
481 String.implode (rev result)
482 else loop result
483 end
484 in loop nil
485 end
486 val lexer = makeLexer (fn n => input_line strm)
487 val nexttok = ref (lexer())
488 val advance = fn () => (nexttok := lexer(); !nexttok)
489 val error = fn () => (say ("calc: syntax error on line" ^
490 (makestring(!linenum)) ^ "\n"); raise Error)
491 val lookup = fn i =>
492 if i = "ONE" then 1
493 else if i = "TWO" then 2
494 else (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
495 fun STMT_LIST () =
496 case !nexttok of
497 EOF => ()
498 | _ => (STMT(); STMT_LIST())
499
500 and STMT() =
501 (case !nexttok
502 of EOS => ()
503 | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
504 | _ => (E(); ());
505 case !nexttok
506 of EOS => (advance())
507 | _ => error())
508 and E () = E' (T())
509 and E' (i : int ) =
510 case !nexttok of
511 PLUS => (advance (); E'(i+T()))
512 | SUB => (advance (); E'(i-T()))
513 | RPAREN => i
514 | EOF => i
515 | EOS => i
516 | _ => error()
517 and T () = T'(F())
518 and T' i =
519 case !nexttok of
520 PLUS => i
521 | SUB => i
522 | TIMES => (advance(); T'(i*F()))
523 | DIV => (advance (); T'(i div F()))
524 | EOF => i
525 | EOS => i
526 | RPAREN => i
527 | _ => error()
528 and F () =
529 case !nexttok of
530 ID i => (advance(); lookup i)
531 | LPAREN =>
532 let val v = (advance(); E())
533 in if !nexttok = RPAREN then (advance (); v) else error()
534 end
535 | NUM i => (advance(); i)
536 | _ => error()
537 in STMT_LIST () handle Error => parse strm
538 end
539 end