% Modified by Matthew Fluet on 2007-11-07.
% Add %posint command.
%
% Modified by Matthew Fluet on 2007-10-31.
% Add \r escape sequence (from Florian Weimer).
% Fix TeX formatting bug (from Florian Weimer).
%
\documentstyle{article}
\title{ A lexical analyzer generator for Standard ML.\\
Version 1.6.0, October 1994
}
\author{ Andrew W. Appel$^1$\\
James S. Mattson\\
David R. Tarditi$^2$\\
\\
\small
$^1$Department of Computer Science, Princeton University \\
\small
$^2$School of Computer Science, Carnegie Mellon University
}
\date{}
\begin{document}
\maketitle
\begin{center}
(c) 1989-94 Andrew W. Appel, James S. Mattson, David R. Tarditi
\end{center}

{\bf
This software comes with ABSOLUTELY NO WARRANTY. It is subject only to
the terms of the ML-Yacc NOTICE, LICENSE, and DISCLAIMER (in the
file COPYRIGHT distributed with this software).
}

\vspace{1in}

New in this version:
\begin{itemize}
\item REJECT is much less costly than before.
\item Lexical analyzers with more than 255 states can now compile in your
lifetime.
\end{itemize}

\newpage
\tableofcontents
\newpage

\section{General Description}

Computer programs often need to divide their input into words and
distinguish between different kinds of words. Compilers, for
example, need to distinguish between integers, reserved words, and
identifiers. Application programs often need to be able to
recognize components of typed commands from users.

The problem of segmenting input into words and recognizing classes of
words is known as lexical analysis. Small cases of this problem,
such as reading text strings separated by spaces, can be solved with
hand-written programs. Larger cases, such as tokenizing an input
stream for a compiler, can also be solved with hand-written programs.

A hand-written program for a large lexical analysis problem, however,
suffers from two major problems. First, the program requires a fair
amount of programmer time to create. Second, the description of
classes of words is not explicit in the program; it must be inferred
from the program code. This makes it difficult to verify that the
program recognizes the correct words for each class. It also makes
future maintenance of the program difficult.

Lex, a programming tool for the Unix system, is a successful solution
to the general problem of lexical analysis. It uses regular
expressions to describe classes of words. A program fragment is
associated with each class of words. This information is given to
Lex as a specification (a Lex program). Lex produces a program for a
function that can be used to perform lexical analysis.

The function operates as follows. It finds the longest word starting
from the current position in the input stream that is in one of the
word classes. It executes the program fragment associated with the
class, and sets the current position in the input stream to be the
character after the word. The program fragment has the actual text
of the word available to it, and may be any piece of code. For many
applications it returns some kind of value.

Lex allows the programmer to make the language description explicit,
and to concentrate on what to do with the recognized words, not how
to recognize the words. It saves programmer time and increases
program maintainability.

Unfortunately, Lex targets only C. It also places artificial
limits on the size of strings that can be recognized.

ML-Lex is a variant of Lex for the ML programming language. ML-Lex
has a syntax similar to Lex, and produces an ML program instead of a
C program. ML-Lex produces a program that runs very efficiently.
Typically the program will be as fast as or even faster than a
hand-coded lexer implemented in Standard ML.

The program typically uses only a small amount of space.
ML-Lex thus allows ML programmers the same benefits that Lex allows C
programmers. It also does not place artificial limits on the size of
recognized strings.

\section{ML-Lex specifications}

An ML-Lex specification has the general format:

\begin{quote}
{\it user declarations}\\
\verb|%%|\\
{\it ML-Lex definitions}\\
\verb|%%|\\
{\it rules}
\end{quote}

Each section is separated from the others by a \verb|%%| delimiter.

The rules are used to define the lexical analysis function. Each
rule has two parts---a regular expression and an action. The regular
expression defines the word class that a rule matches. The action is
a program fragment to be executed when a rule matches the input. The
actions are used to compute values, and must all return values of the
same type.

The user can define values available to all rule actions in the user
declarations section. The user must define two values in this
section---a type {\tt lexresult} and a function {\tt eof}. The type
{\tt lexresult} defines the
type of values returned by the rule actions. The function {\tt eof} is
called by the lexer when the end of the input stream is reached. It
will typically return a value signalling eof or raise an exception.
It is called with the same argument as lex (see \verb|%arg|, below),
and must return a value of type {\tt lexresult}.
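
For example, a minimal user declarations section might read as
follows (the token type and its constructors here are only
illustrative):

\begin{verbatim}
datatype token = EOF | ID of string | NUM of int
type lexresult = token
val eof = fn () => EOF
\end{verbatim}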

In the definitions section, the user can define named regular
expressions, a set of start states, and specify which of the various
bells and whistles of ML-Lex are desired.

The start states allow the user to control when certain rules are
matched. Rules may be defined to match only when the lexer is in
specific start states. The user may change the lexer's start state
in a rule action. This allows the user to specify special handling
of lexical objects.

This feature is typically used to handle quoted strings with escapes
to denote special characters. The rules to recognize the inside
contents of a string are defined for only one start state. This
start state is entered when the beginning of a string is recognized,
and exited when the end of the string is recognized.
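
This idiom can be sketched as follows (assuming a start state named
STRING; the helper functions used in the actions are hypothetical):

\begin{verbatim}
%s STRING;
%%
<INITIAL>\" => (YYBEGIN STRING; startString(); lex());
<STRING>\"  => (YYBEGIN INITIAL; finishString());
<STRING>.   => (addToString yytext; lex());
\end{verbatim}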

\section{Regular expressions}

Regular expressions are a simple language for denoting classes of
strings. A regular expression is defined inductively over an
alphabet with a set of basic operations. The alphabet for ML-Lex is
the Ascii character set (character codes 0--127; or if
\verb|%full| is used, 0--255).

The syntax and semantics of regular expressions will be described in
order of decreasing precedence (from the most tightly binding operators
to the most weakly binding):

\begin{itemize}
\item An individual character stands for itself, except for the
reserved characters \verb@? * + | ( ) ^ $ / ; . = < > [ { " \@

\item[\\] A backslash followed by one of the reserved characters stands
for that character.

\item A set of characters enclosed in square brackets [ ] stands
for any one of those characters. Inside the brackets, only
the symbols \verb|\ - ^| are reserved. An initial up-arrow
\verb|^| stands
for the complement of the characters listed, e.g. \verb|[^abc]|
stands for any character except a, b, or c. The hyphen - denotes
a range of characters, e.g. \verb|[a-z]| stands for any lower-case
alphabetic character, and \verb|[0-9a-fA-F]| stands for any hexadecimal
digit. To include \verb|^| literally in a bracketed set, put it anywhere
but first; to include \verb|-| literally in a set, put it first or last.

\item[\verb|.|] The dot \verb|.| character stands for any character except newline,
i.e. the same as \verb|[^\n]|.

\item The following special escape sequences are available, inside
or outside of square brackets:

\begin{tabular}{ll}
\verb|\b|& backspace\\
\verb|\n|& newline\\
\verb|\r|& carriage return\\
\verb|\t|& tab\\
\verb|\h|& stands for all characters with codes $>127$,\\
&~~~~ when 7-bit characters are used.\\
\verb|\ddd|& the character with 3-digit decimal code \verb|ddd|.\\
\end{tabular}

\item[\verb|"|] A sequence of characters will stand for itself (reserved
characters will be taken literally) if it is enclosed in
double quotes \verb|" "|.

\item[\{\}] A named regular expression (defined in the ``definitions''
section) may be referred to by enclosing its name in
braces \verb|{ }|.

\item[()] Any regular expression may be enclosed in parentheses \verb|( )|
for syntactic (but, as usual, not semantic) effect.

\item[\verb|*|] The postfix operator \verb|*| stands for Kleene closure:
zero or more repetitions of the preceding expression.

\item[\verb|+|] The postfix operator \verb|+| stands for one or more repetitions
of the preceding expression.

\item[\verb|?|] The postfix operator \verb|?| stands for zero or one occurrence of
the preceding expression.

\item A postfix repetition range $\{n_1,n_2\}$ where $n_1$ and $n_2$ are small
integers stands for any number of repetitions between $n_1$ and $n_2$
of the preceding expression. The notation $\{n_1\}$ stands for
exactly $n_1$ repetitions.

\item Concatenation of expressions denotes concatenation of strings.
The expression $e_1 e_2$ stands for any string that results from
the concatenation of one string that matches $e_1$ with another
string that matches $e_2$.

\item\verb-|- The infix operator \verb-|- stands for alternation. The expression
$e_1$~\verb"|"~$e_2$ stands for anything that either $e_1$ or $e_2$ stands for.

\item[\verb|/|] The infix operator \verb|/| denotes lookahead. Lookahead is not
implemented and cannot be used, because there is a bug
in the algorithm for generating lexers with lookahead. If
it could be used, the expression $e_1 / e_2$ would match any string
that $e_1$ stands for, but only when that string is followed by a
string that matches $e_2$.

\item When the up-arrow \verb|^| occurs at the beginning of an expression,
that expression will only match strings that occur at the
beginning of a line (right after a newline character).

\item[\$] The dollar sign \$ of C Lex is not implemented, since it is an
abbreviation for lookahead involving the newline character (that is, it
is an abbreviation for \verb|/\n|).
\end{itemize}

Here are some examples of regular expressions, and descriptions of the
set of strings they denote:

\begin{tabular}{ll}
\verb~0 | 1 | 2 | 3~& A single digit between 0 and 3\\
\verb|[0123]|& A single digit between 0 and 3\\
\verb|0123|& The string ``0123''\\
\verb|0*|& All strings of 0 or more 0's\\
\verb|00*|& All strings of 1 or more 0's\\
\verb|0+|& All strings of 1 or more 0's\\
\verb|[0-9]{3}|& Any three-digit decimal number.\\
\verb|[\n\t\b]|& A newline, tab, or backspace.\\
\verb|(00)*|& Any string with an even number of 0's.
\end{tabular}

\section{ML-Lex syntax summary}

\subsection{User declarations}

Anything up to the first \verb|%%| is in the user declarations section. The
user should note that no symbolic identifier containing
\verb|%%| can be
used in this section.

\subsection{ML-Lex definitions}

Start states can be defined with
\begin{quote}
\verb|%s| {\it identifier list} \verb|;|
\end{quote}

An identifier list consists of one or more identifiers.

An identifier consists of one or more letters, digits, underscores,
or primes, and must begin with a letter.

Named expressions can be defined with

\begin{quote}
{\it identifier} \verb|=| {\it regular expression} \verb|;|
\end{quote}

Regular expressions are defined above.

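For example, named expressions for scanning identifiers might be
defined as follows (the {\tt id} definition is illustrative):

\begin{verbatim}
alpha=[A-Za-z];
digit=[0-9];
id={alpha}({alpha}|{digit})*;
\end{verbatim}
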
The following \% commands are also available:

\begin{description}
\item[\tt \%reject] create REJECT() function
\item[\tt \%count] count newlines using yylineno
\item[\tt \%posarg] pass initial-position argument to makeLexer
\item[\tt \%full] create lexer for the full 8-bit character set,
with characters in the range 0--255 permitted
as input.
\item[\tt \%structure \{identifier\}] name the structure in the output program
{\it identifier} instead of Mlex
\item[\tt \%header] use code following it to create header for lexer
structure
\item[\tt \%arg] extra (curried) formal parameter argument to be
passed to the lex functions, and to be passed
to the eof function in place of ()
\item[\tt \%posint \{identifier\}] use the named structure (which must
match the {\tt INTEGER} signature) for the
type of {\tt yypos}; use {\tt Int64} or {\tt Position}
to allow lexing of multi-gigabyte input files
\end{description}
These features are discussed in section~\ref{avail}.

\subsection{Rules}

Each rule has the format:

\begin{quote}
\verb|<|{\it start state list}\verb|>| {\it regular expression} \verb|=> (| {\it code} \verb|);|
\end{quote}

All parentheses in {\it code} must be balanced, including those
used in strings and comments.

The {\it start state list} is optional. It consists of a list of
identifiers separated by commas, and is delimited by angle
brackets \verb|< >|. Each identifier must be a start state defined in the
\verb|%s| section above.

The regular expression is only recognized when the lexer is in one of
the start states in the start state list. If no start state list is
given, the expression is recognized in all start states.

The lexer begins in a pre-defined start state called \verb|INITIAL|.

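For example, a rule that applies only in a user-defined start state
COMMENT might be written as follows (the state name and action are
illustrative):

\begin{verbatim}
<COMMENT>"*)" => (YYBEGIN INITIAL; lex());
\end{verbatim}
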
The lexer resolves conflicts among rules by choosing the rule with
the longest match, and in the case that two rules match the same string,
choosing the rule listed first in the specification.

The rules should match all possible input. If some input occurs that
does not match any rule, the lexer created by ML-Lex will raise an
exception {\tt LexError}. Note that this differs from C Lex, which prints
any unmatched input on the standard output.

\section{Values available inside the code associated with a rule.}
\label{avail}

ML-Lex places the value of the string matched by a regular expression
in \verb|yytext|, a string variable.

The user may recursively
call the lexing function with \verb|lex()|. (If \verb|%arg| is used, the
lexing function may be re-invoked with the same argument by using
\verb|continue()|.) This is convenient for ignoring white space or comments silently:

\begin{verbatim}
[\ \t\n]+ => ( lex());
\end{verbatim}

To switch start states, the user may call \verb|YYBEGIN| with the name of a
start state.

The following values will be available only if the corresponding \verb|%|
command is in the ML-Lex definitions section:

\begin{tabular}{lll}
\\
{\bf Value}&{\bf \% command}&{\bf description}\\
\hline
{\tt REJECT} &{\tt\%reject}&\parbox[t]{2.6in}{{\tt REJECT()} causes the current
rule to be ``rejected.''
The lexer behaves as if the
current rule had not matched;
another rule that matches this
string, or that matches the longest
possible prefix of this string,
is used instead.} \\
{\tt yypos} & & \parbox[t]{2.6in}{The position of the first character
of {\tt yytext}, relative to the beginning of the file.}\\
{\tt yylineno} & {\tt \%count} & Current line number\\
\\
\end{tabular}


These values should be used only if necessary. Adding {\tt REJECT} to a
lexer will slow it down by 20\%; adding {\tt yylineno} will slow it down by
another 20\%, or more. (It is much more efficient to
recognize \verb|\n| and
have an action that increments the line-number variable.) The use of
the lookahead operator {\tt /} would also slow down the entire lexer.
The character-position, {\tt yypos}, is not costly to maintain, however.
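
That line-counting idiom can be sketched as follows, assuming a
variable such as {\tt linenum}, declared in the user declarations
section as {\tt val linenum = ref 1} (as in the sample at the end of
this manual):

\begin{verbatim}
\n => (linenum := !linenum + 1; lex());
\end{verbatim}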

\paragraph{Bug.} The position of the first character in the file
is reported as 2 (unless the {\tt \%posarg} feature is used).
To preserve compatibility, this bug has not been fixed.

\section{Running ML-Lex}

From the Unix shell, run {\tt sml-lex~myfile.lex}.
The output file will be {\tt myfile.lex.sml}. The extension {\tt .lex} is not
required but is recommended.

Within an interactive system [not the preferred method]:
use {\tt lexgen.sml}; this will create a structure LexGen. The function
LexGen.lexGen creates a program for a lexer from an input
specification. It takes a string argument -- the name of the file
containing the input specification. The output file name is
determined by appending ``{\tt .sml}'' to the input file name.

\section{Using the program produced by ML-Lex}

When the output file is loaded, it will create a structure Mlex that
contains the function {\tt makeLexer}, which takes a function from
${\it int} \rightarrow {\it string}$ and returns a lexing function:

\begin{verbatim}
val makeLexer : (int->string) -> yyarg -> lexresult
\end{verbatim}
where {\tt yyarg} is the type given in the {\tt \%arg} directive,
or {\tt unit} if there is no {\tt \%arg} directive.

For example,

\begin{verbatim}
val lexer = Mlex.makeLexer (inputc (open_in "f"))
\end{verbatim}

creates a lexer that operates on the file whose name is {\tt f}.

When the {\tt \%posarg} directive is used, the type of
{\tt makeLexer} is
\begin{verbatim}
val makeLexer : ((int->string)*int) -> yyarg -> lexresult
\end{verbatim}
where the extra {\tt int} argument is one less than the {\tt yypos}
of the first character in the input. A value $k$ would be passed,
for example, when creating
a lexer to start in the middle of a file, when $k$ characters have
already been read. At the beginning of the file, $k=0$ should be used.
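
For instance, a lexer resuming after $k$ characters have already been
consumed might be created with (a sketch, assuming {\tt \%posarg} is
in effect and {\tt \%arg} is not):

\begin{verbatim}
val lexer = Mlex.makeLexer (inputc (open_in "f"), k)
\end{verbatim}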

The ${\it int} \rightarrow {\it string}$ function
should read a string of characters
from the input stream. It should return a null string to indicate
that the end of the stream has been reached. The integer is the
number of characters that the lexer wishes to read; the function may
return any non-zero number of characters. For example,

\begin{verbatim}
val lexer =
  let val input_line = fn f =>
        let fun loop result =
              let val c = input (f,1)
                  val result = c :: result
              in if String.size c = 0 orelse c = "\n" then
                   String.implode (rev result)
                 else loop result
              end
        in loop nil
        end
  in Mlex.makeLexer (fn n => input_line std_in)
  end
\end{verbatim}

is appropriate for interactive streams where prompting, etc. occurs;
the lexer won't care that \verb|input_line| might return a string of more
or fewer than $n$ characters.

The lexer tries to read a large number of characters from the input
function at once, and it is desirable that the input function return
as many as possible. Reading many characters at once makes the lexer
more efficient. Fewer input calls and buffering operations are
needed, and input is more efficient in large block reads. For
interactive streams this is less of a concern, as the limiting factor
is the speed at which the user can type.

To obtain a value, invoke the lexer by passing it a unit:

\begin{verbatim}
val nextToken = lexer()
\end{verbatim}

If one wanted to restart the lexer, one would just discard {\tt lexer}
and create a new lexer on the same stream with another call to
{\tt makeLexer}. This is the best way to discard any characters buffered
internally by the lexer.

All code in the user declarations section is placed inside a
structure UserDeclarations. To access this structure, use the path name
{\tt Mlex.UserDeclarations}.

If any input cannot be matched, the program will raise the exception
{\tt Mlex.LexError}. An internal error (i.e. a bug) will cause the
exception {\tt Internal.LexerError} to be raised.

If {\tt \%structure} is used, remember that the structure name will no
longer be Mlex, but the one specified in the command.

\section{Sample}

Here is a sample lexer for a calculator program:

\small
\begin{verbatim}
datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
                    NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES

val linenum = ref 1
val error = fn x => output(std_out,x ^ "\n")
val eof = fn () => EOF
%%
%structure CalcLex
alpha=[A-Za-z];
digit=[0-9];
ws = [\ \t];
%%
\n => (inc linenum; lex());
{ws}+ => (lex());
"/" => (DIV);
";" => (EOS);
"(" => (LPAREN);
{digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
")" => (RPAREN);
"+" => (PLUS);
{alpha}+ => (if yytext="print" then PRINT else ID yytext);
"-" => (SUB);
"*" => (TIMES);
. => (error ("calc: ignoring bad character "^yytext); lex());
\end{verbatim}

Here is the parser for the calculator:
\begin{verbatim}

(* Sample interactive calculator to demonstrate use of lexer

   The original grammar was

   stmt_list -> stmt_list stmt
   stmt -> print exp ; | exp ;
   exp -> exp + t | exp - t | t
   t -> t * f | t/f | f
   f -> (exp) | id | num

   The function parse takes a stream and parses it for the calculator
   program.

   If a syntax error occurs, parse prints an error message and calls
   itself on the stream. On this system that has the effect of ignoring
   all input to the end of a line.
*)

structure Calc =
struct
  open CalcLex
  open UserDeclarations
  exception Error
  fun parse strm =
    let
      val say = fn s => output(std_out,s)
      val input_line = fn f =>
        let fun loop result =
              let val c = input (f,1)
                  val result = c :: result
              in if String.size c = 0 orelse c = "\n" then
                   String.implode (rev result)
                 else loop result
              end
        in loop nil
        end
      val lexer = makeLexer (fn n => input_line strm)
      val nexttok = ref (lexer())
      val advance = fn () => (nexttok := lexer(); !nexttok)
      val error = fn () => (say ("calc: syntax error on line " ^
                            (makestring(!linenum)) ^ "\n"); raise Error)
      val lookup = fn i =>
        if i = "ONE" then 1
        else if i = "TWO" then 2
        else (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
      fun STMT_LIST () =
        case !nexttok of
          EOF => ()
        | _ => (STMT(); STMT_LIST())

      and STMT() =
        (case !nexttok
           of EOS => ()
            | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
            | _ => (E(); ());
         case !nexttok
           of EOS => (advance())
            | _ => error())
      and E () = E' (T())
      and E' (i : int) =
        case !nexttok of
          PLUS => (advance (); E'(i+T()))
        | SUB => (advance (); E'(i-T()))
        | RPAREN => i
        | EOF => i
        | EOS => i
        | _ => error()
      and T () = T'(F())
      and T' i =
        case !nexttok of
          PLUS => i
        | SUB => i
        | TIMES => (advance(); T'(i*F()))
        | DIV => (advance (); T'(i div F()))
        | EOF => i
        | EOS => i
        | RPAREN => i
        | _ => error()
      and F () =
        case !nexttok of
          ID i => (advance(); lookup i)
        | LPAREN =>
            let val v = (advance(); E())
            in if !nexttok = RPAREN then (advance (); v) else error()
            end
        | NUM i => (advance(); i)
        | _ => error()
    in STMT_LIST () handle Error => parse strm
    end
end
\end{verbatim}
\end{document}