mllex/lexgen.doc

   1         A lexical analyzer generator for Standard ML.
   2
   3 THIS TEXT FILE IS OBSOLETE and IS NOT MAINTAINED.
   4 The current (maintained) documentation is in lexgen.tex.
   5
   6
   7                     Andrew W. Appel
   8                     James S. Mattson
   9                     David R. Tarditi
  10
  11                     Princeton University
  12
  13                     Version 1.6, October 1994
  14
  15 Copyright (c) 1989-1992 by Andrew W. Appel, James S. Mattson, David R. Tarditi
  16
  17 This software comes with ABSOLUTELY NO WARRANTY.
  18 This software is subject only to the PRINCETON STANDARD ML SOFTWARE LIBRARY
  19 COPYRIGHT NOTICE, LICENSE AND DISCLAIMER, (in the file "COPYRIGHT",
  20 distributed with this software). You may copy and distribute this software;
  21 see the COPYRIGHT NOTICE for details and restrictions.
  22
  23 I. General Description
  24
  25 Computer programs often need to divide their input into words and
  26 distinguish between different kinds of words.  Compilers, for
  27 example, need to distinguish between integers, reserved words, and
  28 identifiers.  Applications programs often need to be able to
  29 recognize components of typed commands from users.
  30
  31 The problem of segmenting input into words and recognizing classes of
  32 words is known as lexical analysis.  Small cases of this problem,
  33 such as reading text strings separated by spaces, can be solved by
  34 using hand-written programs.  Larger cases of this problem, such as
  35 tokenizing an input stream for a compiler, can also be solved using
  36 hand-written programs.
  37
  38 A hand-written program for a large lexical analysis problem, however,
  39 suffers from two major problems.  First, the program requires a fair
  40 amount of programmer time to create.  Second, the description of
  41 classes of words is not explicit in the program.  It must be inferred
  42 from the program code.  This makes it difficult to verify if the
  43 program recognizes the correct words for each class.  It also makes
  44 future maintenance of the program difficult.
  45
  46 Lex, a programming tool for the Unix system, is a successful solution
  47 to the general problem of lexical analysis.  It uses regular
  48 expressions to describe classes of words.  A program fragment is
  49 associated with each class of words.  This information is given to
  50 Lex as a specification (a Lex program).  Lex produces a program for a
  51 function that can be used to perform lexical analysis.
  52
  53 The function operates as follows.  It finds the longest word starting
  54 from the current position in the input stream that is in one of the
  55 word classes.  It executes the program fragment associated with the
  56 class, and sets the current position in the input stream to be the
  57 character after the word.  The program fragment has the actual text
  58 of the word available to it, and may be any piece of code.  For many
  59 applications it returns some kind of value.
  60
  61 Lex allows the programmer to make the language description explicit,
  62 and to concentrate on what to do with the recognized words, not how
  63 to recognize the words.  It saves programmer time and increases
  64 program maintainability.
  65
  66 Unfortunately, Lex is targeted only C.  It also places artificial
  67 limits on the size of strings that can be recognized.
  68
  69 ML-Lex is a variant of Lex for the ML programming language.  ML-Lex
  70 has a syntax similar to Lex, and produces an ML program instead of a
  71 C program.  ML-Lex produces a program that runs very efficiently.
  72 Typically the program will be as fast or even faster than a
  73 hand-coded lexer implemented in Standard ML.
  74
  75 The program typically uses only a small amount of space.
  76 ML-Lex thus allows ML programmers the same benefits that Lex allows C
  77 programmers.  It also does not place artificial limits on the size of
  78 recognized strings.
  79
  80 II. ML-Lex specifications
  81
  82 An ML-Lex specification has the general format:
  83
  84         {user declarations}
  85         %%
  86         {ML-Lex definitions}
  87         %%
  88         {rules}
  89
  90 Each section is separated from the others by a '%%' delimiter.
  91
  92 The rules are used to define the lexical analysis function.  Each
  93 rule has two parts - a regular expression and an action.  The regular
  94 expression defines the word class that a rule matches.  The action is
  95 a program fragment to be executed when a rule matches the input.  The
  96 actions are used to compute values, and must all return values of the
  97 same type.
  98
  99 The user can define values available to all rule actions in the user
 100 declarations section.  The user must define two values in this
 101 section - a type lexresult and a function eof.  Lexresult defines the
 102 type of values returned by the rule actions.  The function "eof" is
 103 called by the lexer when the end of the input stream is reached.  It
 104 will typically return a value signalling eof or raise an exception.
 105 It is called with the same argument as lex (see %arg, below),
 106 and must return a value of type lexresult.
 107
 108 In the definitions section, the user can define named regular
 109 expressions, a set of start states, and specify which of the various
 110 bells and whistles of ML-Lex are desired.
 111
 112 The start states allow the user to control when certain rules are
 113 matched.  Rules may be defined to match only when the lexer is in
 114 specific start states.  The user may change the lexer's start state
 115 in a rule action.  This allows the user to specify special handling
 116 of lexical objects.
 117
 118 This feature is typically used to handle quoted strings with escapes
 119 to denote special characters.  The rules to recognize the inside
 120 contents of a string are defined for only one start state.  This
 121 start state is entered when the beginning of a string is recognized,
 122 and exited when the end of the string is recognized.
 123
 124 III. Regular expressions.
 125
 126 Regular expressions are a simple language for denoting classes of
 127 strings.  A regular expression is defined inductively over an
 128 alphabet with a set of basic operations.  The alphabet for ML-Lex is
 129 the Ascii character set (character codes 0-127; or if %full is used, 0-255).
 130
 131 The syntax and semantics of regular expressions will be described in
 132 order of decreasing precedence (from the most tightly-binding operators
 133 to the most weakly-binding):
 134
 135         An individual character stands for itself, except for the
 136         reserved characters  ? * + | ( ) ^ $ / ; . = < > [ { " \
 137
 138         A backslash followed by one of the reserved characters stands
 139         for that character.
 140
 141         A set of characters enclosed in square brackets [ ] stands
 142         for any one of those characters.  Inside the brackets, only
 143         the symbols  \ - ^ are reserved.  An initial up-arrow ^ stands
 144         for the complement of the characters listed, e.g. [^abc]
 145         stands any character except a, b, or c.  The hyphen - denotes
 146         a range of characters, e.g. [a-z] stands for any lower-case
 147         alphabetic character, and [0-9a-fA-F] stands for any hexadecimal
 148         digit.  To include ^ literally in a bracketed set, put it anywhere
 149         but first; to include - literally in a set, put it first or last.
 150
 151         The dot . character stands for any character except newline,
 152         i.e. the same as [^\n]
 153
 154         The following special escape sequences are available, inside
 155         or outside of square-brackets:
 156                 \b      - backspace
 157                 \n      - newline
 158                 \t      - tab
 159                 \h      - stands for all characters with codes >127,
 160                                   when 7-bit characters are used.
 161                 \ddd    - where ddd is a 3 digit decimal escape.
 162
 163         A sequence of characters will stand for itself (reserved
 164         characters will be taken literally) if it is enclosed in
 165         double quotes " ".
 166
 167         A named regular expression (defined in the "definitions"
 168         section) may be referred to by enclosing its name in
 169         braces { }.
 170
 171         Any regular expression may be enclosed in parentheses ( )
 172         for syntactic (but, as usual, not semantic) effect.
 173
 174         The postfix operator * stands for Kleene closure:
 175         zero or more repetitions of the preceding expression.
 176
 177         The postfix operator + stands for one or more repetitions
 178         of the preceding expression.
 179
 180         The postfix operator ? stands for zero or one occurrence of
 181         the preceding expression.
 182
 183         A postfix repetition range {n1,n2} where n1 and n2 are small
 184         integers stands for any number of repetitions between n1 and n2
 185         of the preceding expression.  The notation {n1} stands for
 186         exactly n1 repetitions.
 187
 188         Concatenation of expressions denotes concatenation of strings.
 189         The expression e1 e2 stands for any string that results from
 190         the concatenation of one string that matches e1 with another
 191         string that matches e2.
 192
 193         The infix operator | stands for alternation.  The expression
 194         e1 | e2  stands for anything that either e1 or e2 stands for.
 195
 196         The infix operator / denotes lookahead.  Lookahead is not
 197         implemented and cannot be used, because there is a bug
 198         in the algorithm for generating lexers with lookahead.  If
 199         it could be used, the expression e1 / e2 would match any string
 200         that e1 stands for, but only when that string is followed by a
 201         string that matches e2.
 202
 203         When the up-arrow ^ occurs at the beginning of an expression,
 204         that expression will only match strings that occur at the
 205         beginning of a line (right after a newline character).
 206
 207         The dollar sign $ is not implemented, since it is an abbreviation
 208         for lookahead involving the newline character (that is, it
 209         is an abbreviation /\n).   If it could be used, when the dollar
 210         sign $ occurred at the end of an expression, that expression
 211         would only match strings that occur at the end of a line
 212         (right before a newline character).
 213
 214 Here are some examples of regular expressions, and descriptions of the
 215 set of strings they denote:
 216
 217         0 | 1 | 2 | 3           A single digit between 0 and 3
 218         [0123]                  A single digit between 0 and 3
 219         0123                    The string "0123"
 220         0*                      All strings of 0 or more 0's
 221         00*                     All strings of 1  or more 0's
 222         0+                      All strings of 1  or more 0's
 223         [0-9]{3}                Any three-digit decimal number.
 224         \\[ntb]                 The strings "\n" "\t" "\b"
 225         (00)*                   Any string with an even number of 0's.
 226
 227 IV. ML-Lex syntax summary
 228
 229 A. User declarations
 230
 231 Anything up to the first %% is in the user declarations section.  The
 232 user should note that no symbolic identifier containing '%%' can be
 233 used in this section.
 234
 235 B. ML-Lex definitions
 236
 237 Start states can be defined with
 238
 239                 %s {identifier list} ;
 240
 241         or      %S {identifier list} ;
 242
 243 An identifier list consists of one or more identifiers.
 244
 245 An identifier consists of one or more letters, digits, underscores,
 246 or primes.  It must begin with a letter.
 247
 248 Named expressions can be defined with
 249
 250         {identifier} = {regular expression} ;
 251
 252 Regular expressions are defined below.
 253
 254 The following % commands are also available:
 255
 256         %reject         - create REJECT() function
 257         %count          - count newlines using yylineno
 258         %full           - create lexer for the full 8-bit character set,
 259                           with characters in the range 0-255 permitted
 260                           as input.
 261         %structure {identifier} - name the structure in the output program
 262                           {identifier} instead of Mlex
 263         %header         - use code following it to create header for lexer
 264                           structure
 265         %arg            - extra (curried) formal parameter argument to be
 266                           passed to the lex functions, and to be passed
 267                           to the eof function in place of ()
 268         These functions are discussed below, under values available to
 269         actions.
 270
 271 C.  Rules
 272
 273 Each rule has the format:
 274
 275         <start state list> {regular expression} => ( ... code ... );
 276
 277 All parentheses in ...  code ...  must be balanced, including those
 278 used in strings and comments.
 279
 280 The start state list is optional.  It consists of a list of
 281 identifiers separated by commas, and is delimited by triangle
 282 brackets < >.  Each identifier must be a start state defined in the
 283 %s section above.
 284
 285 The regular expression is only recognized when the lexer is in one of
 286 the start states in the start state list.  If no start state list is
 287 given, the expression is recognized in all start states.
 288
 289 The lexer begins in a pre-defined start state called INITIAL.
 290
 291 The lexer resolves conflicts among rules by choosing the rule with
 292 the longest match, and in the case two rules match the same string,
 293 choosing the rule listed first in the specification.
 294
 295 The rules should match all possible input.  If some input occurs that
 296 does not match any rule, the lexer created by ML-Lex will raise an
 297 exception LexError.  Note that this differs from C Lex, which prints
 298 any unmatched input on the standard output.
 299
 300 V. Values available inside the code associated with a rule.
 301
 302 Mlex places the value of the string matched by a regular expression
 303 in yytext, a string variable.
 304
 305 The user may recursively
 306 call the lexing function with lex().  (If %arg is used, the
 307 lexing function may be re-invoked with the same argument by using
 308 continue().) This is convenient for ignoring white space or comments silently:
 309
 310         [\ \t\n]+       => ( lex());
 311
 312 To switch start states, the user may call YYBEGIN with the name of a
 313 start state.
 314
 315 The following values will be available only if the corresponding %
 316 command is in the ML-Lex definitions sections:
 317
 318         value           %command        description
 319         -----           --------        -----------
 320         REJECT          %reject         REJECT() causes the current
 321                                         rule to be "rejected."
 322                                         The lexer behaves as if the
 323                                         current rule had not matched;
 324                                         another rule that matches this
 325                                         string, or that matches the longest
 326                                         possible prefix of this string,
 327                                         is used instead.
 328
 329         yypos                           Current character position from
 330                                         beginning of file.
 331
 332         yylineno        %count          Current line number
 333
 334
 335 These values should be used only if necessary.  Adding REJECT to a
 336 lexer will slow it down by 20%; adding yylineno will slow it down by
 337 another 20%, or more.  (It is much more efficient to recognize \n and
 338 have an action that increments the line-number variable.)  The use of
 339 the lookahead operator / will also slow down the entire lexer.
 340 The character-position, yypos, is not costly to maintain, however.
 341
 342 VI. Running ML-Lex
 343
 344 From the Unix shell, run       sml-lex myfile.lex
 345 The output file will be myfile.lex.sml.  The extension ".lex" is not
 346 required but is recommended.
 347
 348 Within an interactive system [not the preferred method]:
 349 Use "lexgen.sml"; this will create a structure LexGen.  The function
 350 LexGen.lexGen creates a program for a lexer from an input
 351 specification.  It takes a string argument -- the name of the file
 352 containing the input specification.  The output file name is
 353 determined by appending ".sml" to the input file name.
 354
 355 VII. Using the program produced by ML-Lex.
 356
 357 When the output file is loaded, it will create a structure Mlex that
 358 contains the function makeLexer.  makeLexer takes a function from int
 359 -> string and returns a lexing function.
 360
 361 For example,
 362
 363         val lexer = Mlex.makeLexer (inputc (open_in "f"))
 364
 365 creates a lexer that operates on the file whose name is f.
 366
 367 The function from int -> string should read a string of characters
 368 from the input stream.  It should return a null string to indicate
 369 that the end of the stream has been reached.  The integer is the
 370 number of characters that the lexer wishes to read; the function may
 371 return any non-zero number of characters.  For example,
 372
 373   val lexer =
 374     let val input_line = fn f =>
 375           let fun loop result =
 376              let val c = input (f,1)
 377                  val result = c :: result
 378              in if String.size c = 0 orelse c = "\n" then
 379                    String.implode (rev result)
 380                  else loop result
 381              end
 382           in loop nil
 383           end
 384      in Mlex.makeLexer (fn n => input_line std_in)
 385      end
 386 is appropriate for interactive streams where prompting, etc.  occurs;
 387 the lexer won't care that input_line might return a string of more
 388 than or less than n characters.
 389
 390 The lexer tries to read a large number of characters from the input
 391 function at once, and it is desirable that the input function return
 392 as many as possible.  Reading many characters at once makes the lexer
 393 more efficient.  Fewer input calls and buffering operations are
 394 needed, and input is more efficient in large block reads.  For
 395 interactive streams this is less of a concern, as the limiting factor
 396 is the speed at which the user can type.
 397
 398 To obtain a value, invoke the lexer by passing it a unit:
 399
 400         val nextToken = lexer()
 401
 402 If one wanted to restart the lexer, one would just discard "lexer"
 403 and create a new lexer on the same stream with another call to
 404 makeLexer.  This is the best way to discard any characters buffered
 405 internally by the lexer.
 406
 407 All code in the user declarations section is placed inside a
 408 structure UserDeclarations.  To access this structure, use the path name
 409 Mlex.UserDeclarations.
 410
 411 If any input cannot be matched, the program will raise the exception
 412 Mlex.LexError.  An internal error (i.e.  bug) will cause the
 413 exception Internal.LexerError to be raised.
 414
 415 If %structure is used, remember that the structure name will no
 416 longer be Mlex, but the one specified in the command.
 417
 418 VIII. Sample
 419
 420 Here is a sample lexer for a calculator program:
 421
 422 datatype lexresult= DIV | EOF | EOS | ID of string | LPAREN |
 423                      NUM of int | PLUS | PRINT | RPAREN | SUB | TIMES
 424
 425 val linenum = ref 1
 426 val error = fn x => output(std_out,x ^ "\n")
 427 val eof = fn () => EOF
 428 %%
 429 %structure CalcLex
 430 alpha=[A-Za-z];
 431 digit=[0-9];
 432 ws = [\ \t];
 433 %%
 434 \n       => (inc linenum; lex());
 435 {ws}+    => (lex());
 436 "/"      => (DIV);
 437 ";"      => (EOS);
 438 "("      => (LPAREN);
 439 {digit}+ => (NUM (revfold (fn(a,r)=>ord(a)-ord("0")+10*r) (explode yytext) 0));
 440 ")"      => (RPAREN);
 441 "+"      => (PLUS);
 442 {alpha}+ => (if yytext="print" then PRINT else ID yytext);
 443 "-"      => (SUB);
 444 "*"      => (TIMES);
 445 .        => (error ("calc: ignoring bad character "^yytext); lex());
 446
 447
 448 Here is the parser for the calculator:
 449
 450 (* Sample interactive calculator to demonstrate use of lexer produced by ML-Lex
 451
 452    The original grammar was
 453
 454        stmt_list -> stmt_list stmt
 455        stmt -> print exp ;  | exp ;
 456        exp -> exp + t | exp - t | t
 457        t -> t * f | t/f | f
 458        f -> (exp) | id | num
 459
 460   The function parse takes a stream and parses it for the calculator
 461   program.
 462
 463   If a syntax error occurs, parse prints an error message and calls itself
 464   on the stream.  On this system that has the effect of ignoring all input
 465   to the end of a line.
 466 *)
 467
 468 structure Calc =
 469  struct
 470    open CalcLex
 471    open UserDeclarations
 472    exception Error
 473    fun parse strm =
 474     let
 475       val say = fn s => output(std_out,s)
 476       val input_line = fn f =>
 477           let fun loop result =
 478              let val c = input (f,1)
 479                  val result = c :: result
 480              in if String.size c = 0 orelse c = "\n" then
 481                    String.implode (rev result)
 482                  else loop result
 483              end
 484           in loop nil
 485           end
 486       val lexer = makeLexer (fn n => input_line strm)
 487       val nexttok = ref (lexer())
 488       val advance = fn () => (nexttok := lexer(); !nexttok)
 489       val error = fn () => (say ("calc: syntax error on line" ^
 490                            (makestring(!linenum)) ^ "\n"); raise Error)
 491       val lookup = fn i =>
 492         if i = "ONE" then 1
 493         else if i = "TWO" then 2
 494         else  (say ("calc: unknown identifier '" ^ i ^ "'\n"); raise Error)
 495      fun STMT_LIST () =
 496          case !nexttok of
 497             EOF => ()
 498           | _ => (STMT(); STMT_LIST())
 499
 500      and STMT() =
 501          (case !nexttok
 502            of EOS  => ()
 503             | PRINT => (advance(); say ((makestring (E():int)) ^ "\n"); ())
 504             | _ => (E(); ());
 505          case !nexttok
 506            of EOS => (advance())
 507             | _ => error())
 508      and E () = E' (T())
 509      and E' (i : int ) =
 510          case !nexttok of
 511             PLUS => (advance (); E'(i+T()))
 512           | SUB => (advance (); E'(i-T()))
 513           | RPAREN => i
 514           | EOF => i
 515           | EOS => i
 516           | _ => error()
 517      and T () =  T'(F())
 518      and T' i =
 519         case !nexttok of
 520             PLUS => i
 521           | SUB => i
 522           | TIMES => (advance(); T'(i*F()))
 523           | DIV => (advance (); T'(i div F()))
 524           | EOF => i
 525           | EOS => i
 526           | RPAREN => i
 527           | _ => error()
 528      and F () =
 529         case !nexttok of
 530             ID i => (advance(); lookup i)
 531           | LPAREN =>
 532               let val v = (advance(); E())
 533               in if !nexttok = RPAREN then (advance (); v) else error()
 534               end
 535           | NUM i => (advance(); i)
 536           | _ => error()
 537     in STMT_LIST () handle Error => parse strm
 538     end
 539  end