Import Upstream version 20180207
[hcoop/debian/mlton.git] / mllex / mlex_int.doc
CommitLineData
7f918cf1
CE
1This is minimal documentation for the lexer driver produced by ml-lex.
2
3Main data structures:
4
5 The transition table is stored in tab. Tab is an array of records, indexed
6by state number. The first field of the record, fin, is a list of final leaves
7assocated with it. The second field of the record, trans, is a transition
8table for the state indexed by character number. It gives the next state
9for a given input character.
10
11 The usual initial start state is state #1. State 0 is a dead state, which
12has transitions only to itself.
13
14 The field yyfin has type yyfinstate list. yyfinstate consists of the
15following three constructors:
16
17 * N of int - indicates normal end leaf.
18 * D of int - dummy end leaf - for indicating when an end state for
19 a trailing context regular expression has been reached. These are
20 stored and propagated backwards when action is executed.
21 * T of int - indicates an actual end leaf for a trailing context reg.
22 expression, which should be executed only if D i was encountered
23 after this end leaf while scanning forward. The dummy end leaf is
24 removed from the backward propagating list after this node is
25 encountered.
26
27
28 The function scan inside the function lex operates as a transition
29function, scanning the input until it is no longer possible to take any
30more transitions. It accumulates a list of the accepting leaf list
31associated with each accepting state passed through.
32
33 Scan operates as follows:
34
35 Input: * s - current state
36 * AcceptingLeaves - list of accepting leave lists. Each state
37 has a list of accepting leaves associated with it. This list
38 may be nil if the state is not a final state.
39 * l - position of the next character in the buffer b to read
40 * i0 - starting position in the buffer.
41
42 Output: If no match is found, it raises the exception LexError.
43 Otherwise, it returns a value of type lexresult.
44
45 It operates as a transtion function:
46 It (1) adds the list of accepting leaves for the current state to
47 the list of accepting leave lists
48 (2) tries to make a transition on the current input character
49 to the next state. If it can't make a transition, it
50 executes the action function.
51 (a) - if it is past the end of the buffer, it
52 (1) checks if it as at end eof. If it is then:
53 It checks to see if it has made any
54 transitions since it was first called -
55 (l>i0 when this is true.) If it hasn't
56 this implies that scan was called at
57 the end of file. It thus executes
58 eof function declared by the user.
59 Otherwise it must execute action w/
60 the current accepting state list.
61 (2) otherwise it reads a block of up to 1024
62 characters, and appends this block to the
63 useful suffix of characters left in the
64 buffer (those character which have been
65 scanned in this call to lex()). The buffer
66 operation should be altered if one intends
67 to process reg. expressions whose lexemes'
68 length will be >> 1024. For most normal
69 applications, the buffer update operation
70 will be fine.
71
72 This buffer update operation requires
73 O(n^2/1024) char. copies for lexemes > 1024
74 characters in length, and O(n) char. copies
75 for lexemes <= 1024 characters in length.
76 It can be made O(n) using linked list
77 buffers & a Byte.array of size n (not the
78 ^operator!) for concatenating the buffers
79 to return a value for yytext when a lexeme
80 is longer than the typical buffer length.
81
82 (3) If the transition is to a dead state (0 is used
83 for the dead state), action is executed instead.