A lexical analyzer generator for Standard ML.

THIS TEXT FILE IS OBSOLETE and IS NOT MAINTAINED.
The current (maintained) documentation is in lexgen.tex.


Andrew W. Appel
James S. Mattson
David R. Tarditi

Princeton University

Version 1.6, October 1994

Copyright (c) 1989-1992 by Andrew W. Appel, James S. Mattson, David R. Tarditi

This software comes with ABSOLUTELY NO WARRANTY.
This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
COPYRIGHT NOTICE, LICENSE AND DISCLAIMER (in the file "COPYRIGHT",
distributed with this software).  You may copy and distribute this software;
see the COPYRIGHT NOTICE for details and restrictions.

I. General Description

Computer programs often need to divide their input into words and
distinguish between different kinds of words.  Compilers, for
example, need to distinguish between integers, reserved words, and
identifiers.  Application programs often need to recognize the
components of typed commands from users.

The problem of segmenting input into words and recognizing classes of
words is known as lexical analysis.  Small cases of this problem,
such as reading text strings separated by spaces, can be solved with
hand-written programs.  Larger cases, such as tokenizing an input
stream for a compiler, can also be solved by hand.

A hand-written program for a large lexical analysis problem, however,
suffers from two major drawbacks.  First, it requires a fair amount
of programmer time to create.  Second, the description of the classes
of words is not explicit in the program; it must be inferred from the
program code.  This makes it difficult to verify that the program
recognizes the correct words for each class, and it makes future
maintenance of the program difficult.

Lex, a programming tool for the Unix system, is a successful solution
to the general problem of lexical analysis.  It uses regular
expressions to describe classes of words, and associates a program
fragment with each class.  This information is given to Lex as a
specification (a Lex program).  Lex produces a program for a function
that performs the lexical analysis.

The function operates as follows.  It finds the longest word, starting
from the current position in the input stream, that belongs to one of
the word classes.  It executes the program fragment associated with
that class, and sets the current position in the input stream to the
character after the word.  The program fragment has the actual text
of the word available to it, and may be any piece of code.  For many
applications it returns some kind of value.

Lex allows the programmer to make the language description explicit,
and to concentrate on what to do with the recognized words, not on
how to recognize them.  It saves programmer time and increases
program maintainability.

Unfortunately, Lex targets only C.  It also places artificial limits
on the size of the strings that can be recognized.

ML-Lex is a variant of Lex for the ML programming language.  ML-Lex
has a syntax similar to Lex, and produces an ML program instead of a
C program.  The program it produces runs very efficiently; typically
it will be as fast as, or even faster than, a hand-coded lexer
written in Standard ML, and it uses only a small amount of space.
ML-Lex thus gives ML programmers the same benefits that Lex gives C
programmers.  It also places no artificial limits on the size of
recognized strings.

II. ML-Lex specifications

An ML-Lex specification has the general format:

    {user declarations}
    %%
    {ML-Lex definitions}
    %%
    {rules}

Each section is separated from the next by a '%%' delimiter.

The rules define the lexical analysis function.  Each rule has two
parts: a regular expression and an action.  The regular expression
defines the word class that the rule matches.  The action is a
program fragment to be executed when the rule matches the input.
The actions are used to compute values, and must all return values
of the same type.

The user can define values available to all rule actions in the user
declarations section.  Two values must be defined in this section: a
type lexresult and a function eof.  The type lexresult defines the
type of values returned by the rule actions.  The function eof is
called by the lexer when the end of the input stream is reached; it
will typically return a value signalling eof or raise an exception.
It is called with the same argument as lex (see %arg, below), and
must return a value of type lexresult.
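
For example, a minimal user declarations section might read as
follows (the datatype shown here is purely illustrative; any type
will do, as long as every rule action returns a value of that type):

    datatype lexresult = EOF | WORD of string
    val eof = fn () => EOF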

In the definitions section, the user can define named regular
expressions and a set of start states, and can specify which of the
various bells and whistles of ML-Lex are desired.

Start states allow the user to control when certain rules are
matched.  Rules may be defined to match only when the lexer is in
specific start states, and a rule action may change the lexer's
start state.  This allows the user to specify special handling of
lexical objects.

This feature is typically used to handle quoted strings with escapes
that denote special characters.  The rules that recognize the
contents of a string are defined for only one start state.  This
start state is entered when the beginning of a string is recognized,
and exited when the end of the string is recognized.
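
As a sketch, the rules of such a specification might look like this
(the start state STRING would be declared with %s in the definitions
section; the helpers addString and getString and the token
constructor STR are hypothetical):

    \"          => (YYBEGIN STRING; lex());
    <STRING>\\n => (addString "\n"; lex());
    <STRING>\"  => (YYBEGIN INITIAL; STR (getString()));
    <STRING>.   => (addString yytext; lex());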

III. Regular expressions

Regular expressions are a simple language for denoting classes of
strings.  A regular expression is defined inductively over an
alphabet with a set of basic operations.  The alphabet for ML-Lex is
the ASCII character set (character codes 0-127; or, if %full is
used, 0-255).

The syntax and semantics of regular expressions are described below
in order of decreasing precedence (from the most tightly-binding
operators to the most weakly-binding):

An individual character stands for itself, except for the
reserved characters  ? * + | ( ) ^ $ / ; . = < > [ { " \

A backslash followed by one of the reserved characters stands
for that character.

A set of characters enclosed in square brackets [ ] stands
for any one of those characters.  Inside the brackets, only
the symbols \ - ^ are reserved.  An initial up-arrow ^ stands
for the complement of the characters listed, e.g. [^abc]
stands for any character except a, b, or c.  The hyphen -
denotes a range of characters, e.g. [a-z] stands for any
lower-case alphabetic character, and [0-9a-fA-F] stands for
any hexadecimal digit.  To include ^ literally in a bracketed
set, put it anywhere but first; to include - literally, put it
first or last.

The dot character . stands for any character except newline,
i.e. the same as [^\n].

The following special escape sequences are available, inside
or outside of square brackets:

    \b   - backspace
    \n   - newline
    \t   - tab
    \h   - stands for all characters with codes >127,
           when 7-bit characters are used
    \ddd - where ddd is a 3-digit decimal escape

A sequence of characters stands for itself (with reserved
characters taken literally) when it is enclosed in double
quotes " ".

A named regular expression (defined in the definitions
section) may be referred to by enclosing its name in
braces { }.

Any regular expression may be enclosed in parentheses ( )
for syntactic (but, as usual, not semantic) effect.

The postfix operator * stands for Kleene closure:
zero or more repetitions of the preceding expression.

The postfix operator + stands for one or more repetitions
of the preceding expression.

The postfix operator ? stands for zero or one occurrence of
the preceding expression.

A postfix repetition range {n1,n2}, where n1 and n2 are small
integers, stands for any number of repetitions between n1 and
n2 of the preceding expression.  The notation {n1} stands for
exactly n1 repetitions.

Concatenation of expressions denotes concatenation of strings.
The expression e1 e2 stands for any string that results from
the concatenation of a string that matches e1 with a string
that matches e2.

The infix operator | stands for alternation.  The expression
e1 | e2 stands for anything that either e1 or e2 stands for.

The infix operator / denotes lookahead.  Lookahead is not
implemented and cannot be used, because there is a bug in the
algorithm for generating lexers with lookahead.  If it could
be used, the expression e1 / e2 would match any string that e1
stands for, but only when that string is followed by a string
that matches e2.

When the up-arrow ^ occurs at the beginning of an expression,
that expression will only match strings that occur at the
beginning of a line (right after a newline character).

The dollar sign $ is not implemented, since it is an
abbreviation for lookahead involving the newline character
(that is, it is an abbreviation for /\n).  If it could be
used, a dollar sign $ at the end of an expression would make
that expression match only strings that occur at the end of a
line (right before a newline character).

Here are some examples of regular expressions, and descriptions of
the sets of strings they denote:

    0 | 1 | 2 | 3    A single digit between 0 and 3
    [0123]           A single digit between 0 and 3
    0123             The string "0123"
    0*               All strings of zero or more 0's
    00*              All strings of one or more 0's
    0+               All strings of one or more 0's
    [0-9]{3}         Any three-digit decimal number
    \\[ntb]          The strings "\n", "\t", and "\b"
    (00)*            Any string with an even number of 0's

IV. ML-Lex syntax summary

A. User declarations

Anything up to the first %% is in the user declarations section.
Note that no symbolic identifier containing '%%' can be used in
this section.

B. ML-Lex definitions

Start states can be defined with

    %s {identifier list} ;
or
    %S {identifier list} ;

An identifier list consists of one or more identifiers.  An
identifier consists of one or more letters, digits, underscores, or
primes, and must begin with a letter.

Named expressions can be defined with

    {identifier} = {regular expression} ;

Regular expressions are defined in section III above.
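
For example (the names here are illustrative):

    digit = [0-9] ;
    int   = {digit}+ ;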

The following % commands are also available:

    %reject    - create the REJECT() function
    %count     - count newlines using yylineno
    %full      - create a lexer for the full 8-bit character
                 set, with characters in the range 0-255
                 permitted as input
    %structure {identifier}
               - name the structure in the output program
                 {identifier} instead of Mlex
    %header    - use the code following it to create the
                 header for the lexer structure
    %arg       - an extra (curried) formal parameter to be
                 passed to the lex functions, and to be passed
                 to the eof function in place of ()

The values these commands provide are discussed below, under the
values available to actions.

C. Rules

Each rule has the format:

    <start state list> {regular expression} => ( ... code ... );

All parentheses in ... code ... must be balanced, including those
in strings and comments.

The start state list is optional.  It consists of a list of
identifiers separated by commas, delimited by angle brackets < >.
Each identifier must be a start state defined in the %s section
above.

The regular expression is recognized only when the lexer is in one
of the start states in the start state list.  If no start state
list is given, the expression is recognized in all start states.

The lexer begins in a pre-defined start state called INITIAL.

The lexer resolves conflicts among rules by choosing the rule with
the longest match and, when two rules match the same string, by
choosing the rule listed first in the specification.
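
For instance, given the two rules below (the token values IF and ID
are illustrative), the input "if" matches both rules, and the first
rule wins because it is listed first; the input "iffy" is matched by
the second rule, since that rule gives the longer match:

    "if"   => (IF);
    [a-z]+ => (ID yytext);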

The rules should match all possible input.  If some input occurs
that does not match any rule, the lexer created by ML-Lex will
raise the exception LexError.  Note that this differs from C Lex,
which prints any unmatched input on the standard output.

V. Values available inside the code associated with a rule

ML-Lex places the text matched by the regular expression in yytext,
a string variable.

The user may recursively call the lexing function with lex().  (If
%arg is used, the lexing function may be re-invoked with the same
argument by using continue().)  This is convenient for silently
ignoring white space or comments:

    [\ \t\n]+ => (lex());

To switch start states, the user may call YYBEGIN with the name of
a start state.
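
For example, rules to skip ML-style comments might use a
hypothetical COMMENT start state (declared with %s COMMENT ; in the
definitions section); note the separate rule for newline, which the
dot does not match:

    "(*"          => (YYBEGIN COMMENT; lex());
    <COMMENT>"*)" => (YYBEGIN INITIAL; lex());
    <COMMENT>\n   => (lex());
    <COMMENT>.    => (lex());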

The following values are available only if the corresponding %
command appears in the ML-Lex definitions section:

    value     %command   description
    -----     --------   -----------
    REJECT    %reject    REJECT() causes the current rule to
                         be "rejected."  The lexer behaves as
                         if the current rule had not matched;
                         another rule that matches this
                         string, or that matches the longest
                         possible prefix of this string, is
                         used instead.

    yypos                Current character position from the
                         beginning of the file.

    yylineno  %count     Current line number.

These values should be used only if necessary.  Adding REJECT to a
lexer will slow it down by 20%; adding yylineno will slow it down
by another 20%, or more.  (It is much more efficient to recognize
\n and have an action that increments a line-number variable.)  The
use of the lookahead operator / would also slow down the entire
lexer.  The character position yypos, however, is not costly to
maintain.
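
That is, instead of %count one can declare a counter such as
val linenum = ref 1 in the user declarations section and bump it in
a rule, as the sample in section VIII does:

    \n => (inc linenum; lex());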

VI. Running ML-Lex

From the Unix shell, run

    sml-lex myfile.lex

The output file will be myfile.lex.sml.  The extension ".lex" is
not required, but is recommended.

Within an interactive system [not the preferred method]: use
"lexgen.sml"; this will create a structure LexGen.  The function
LexGen.lexGen creates a program for a lexer from an input
specification.  It takes a string argument -- the name of the file
containing the input specification.  The output file name is
determined by appending ".sml" to the input file name.

VII. Using the program produced by ML-Lex

When the output file is loaded, it creates a structure Mlex that
contains the function makeLexer.  makeLexer takes a function of
type int -> string and returns a lexing function.  For example,

    val lexer = Mlex.makeLexer (inputc (open_in "f"))

creates a lexer that operates on the file named f.

The function of type int -> string should read a string of
characters from the input stream.  It should return the null string
to indicate that the end of the stream has been reached.  The
integer is the number of characters that the lexer wishes to read;
the function may return any non-zero number of characters.  For
example,

    val lexer =
        let val input_line = fn f =>
                let fun loop result =
                        let val c = input (f,1)
                            val result = c :: result
                        in if String.size c = 0 orelse c = "\n" then
                               String.implode (rev result)
                           else loop result
                        end
                in loop nil
                end
        in Mlex.makeLexer (fn n => input_line std_in)
        end

is appropriate for interactive streams where prompting, etc.,
occurs; the lexer does not care that input_line may return a string
of more or fewer than n characters.

The lexer tries to read a large number of characters from the input
function at once, and it is desirable that the input function
return as many as possible.  Reading many characters at once makes
the lexer more efficient: fewer input calls and buffering
operations are needed, and input is more efficient in large block
reads.  For interactive streams this is less of a concern, since
the limiting factor is the speed at which the user can type.
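
When reading from a file, then, it is reasonable to pass the
lexer's request size straight through to the input primitive, which
returns at most n characters (this sketch uses the same I/O
primitives as the other examples in this document):

    val f = open_in "f"
    val lexer = Mlex.makeLexer (fn n => input (f, n))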

To obtain a value, invoke the lexer by passing it a unit:

    val nextToken = lexer()

To restart the lexer, discard "lexer" and create a new lexer on the
same stream with another call to makeLexer.  This is the best way
to discard any characters buffered internally by the lexer.

All code in the user declarations section is placed inside a
structure UserDeclarations.  To access this structure, use the path
name Mlex.UserDeclarations.

If any input cannot be matched, the program will raise the
exception Mlex.LexError.  An internal error (i.e. a bug) will cause
the exception Internal.LexerError to be raised.

If %structure is used, remember that the structure name will no
longer be Mlex, but the one specified in that command.

VIII. Sample

Here is a sample lexer for a calculator program:

datatype lexresult = DIV | EOF | EOS | ID of string | LPAREN |
                     NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES

val linenum = ref 1
val error = fn x => output(std_out, x ^ "\n")
val eof = fn () => EOF
%%
%structure CalcLex
alpha=[A-Za-z];
digit=[0-9];
ws = [\ \t];
%%
\n       => (inc linenum; lex());
{ws}+    => (lex());
"/"      => (DIV);
";"      => (EOS);
"("      => (LPAREN);
{digit}+ => (NUM (revfold (fn (a,r) => ord(a) - ord("0") + 10*r)
                          (explode yytext) 0));
")"      => (RPAREN);
"+"      => (PLUS);
{alpha}+ => (if yytext = "print" then PRINT else ID yytext);
"-"      => (SUB);
"*"      => (TIMES);
.        => (error ("calc: ignoring bad character " ^ yytext); lex());

Here is the parser for the calculator:

(* Sample interactive calculator to demonstrate use of the lexer
   produced by ML-Lex.

   The original grammar was

        stmt_list -> stmt_list stmt
        stmt      -> print exp ; | exp ;
        exp       -> exp + t | exp - t | t
        t         -> t * f | t / f | f
        f         -> ( exp ) | id | num

   The function parse takes a stream and parses it for the
   calculator program.

   If a syntax error occurs, parse prints an error message and calls
   itself on the stream.  On this system that has the effect of
   ignoring all input to the end of a line.
*)

structure Calc =
  struct
    open CalcLex
    open UserDeclarations
    exception Error

    fun parse strm =
      let
        val say = fn s => output(std_out, s)
        val input_line = fn f =>
              let fun loop result =
                      let val c = input (f,1)
                          val result = c :: result
                      in if String.size c = 0 orelse c = "\n" then
                             String.implode (rev result)
                         else loop result
                      end
              in loop nil
              end
        val lexer = makeLexer (fn n => input_line strm)
        val nexttok = ref (lexer())
        val advance = fn () => (nexttok := lexer(); !nexttok)
        val error = fn () =>
              (say ("calc: syntax error on line " ^
                    (makestring (!linenum)) ^ "\n");
               raise Error)
        val lookup = fn i =>
              if i = "ONE" then 1
              else if i = "TWO" then 2
              else (say ("calc: unknown identifier '" ^ i ^ "'\n");
                    raise Error)
        fun STMT_LIST () =
              case !nexttok of
                  EOF => ()
                | _ => (STMT(); STMT_LIST())
        and STMT () =
              (case !nexttok
                 of EOS => ()
                  | PRINT => (advance();
                              say ((makestring (E():int)) ^ "\n"); ())
                  | _ => (E(); ());
               case !nexttok
                 of EOS => (advance(); ())
                  | _ => error())
        and E () = E' (T())
        and E' (i : int) =
              case !nexttok of
                  PLUS => (advance(); E' (i + T()))
                | SUB => (advance(); E' (i - T()))
                | RPAREN => i
                | EOF => i
                | EOS => i
                | _ => error()
        and T () = T' (F())
        and T' i =
              case !nexttok of
                  PLUS => i
                | SUB => i
                | TIMES => (advance(); T' (i * F()))
                | DIV => (advance(); T' (i div F()))
                | EOF => i
                | EOS => i
                | RPAREN => i
                | _ => error()
        and F () =
              case !nexttok of
                  ID i => (advance(); lookup i)
                | LPAREN =>
                      let val v = (advance(); E())
                      in if !nexttok = RPAREN then (advance(); v)
                         else error()
                      end
                | NUM i => (advance(); i)
                | _ => error()
      in STMT_LIST () handle Error => parse strm
      end
  end
539 | end |