Commit | Line | Data |
---|---|---|
7f918cf1 CE |
1 | This is minimal documentation for the lexer driver produced by ml-lex. |
2 | ||
3 | Main data structures: | |
4 | ||
5 | The transition table is stored in tab. Tab is an array of records, indexed | |
6 | by state number. The first field of the record, fin, is a list of final leaves | |
7 | assocated with it. The second field of the record, trans, is a transition | |
8 | table for the state indexed by character number. It gives the next state | |
9 | for a given input character. | |
10 | ||
11 | The usual initial start state is state #1. State 0 is a dead state, which | |
12 | has transitions only to itself. | |
13 | ||
14 | The field yyfin has type yyfinstate list. yyfinstate consists of the | |
15 | following three constructors: | |
16 | ||
17 | * N of int - indicates normal end leaf. | |
18 | * D of int - dummy end leaf - for indicating when an end state for | |
19 | a trailing context regular expression has been reached. These are | |
20 | stored and propagated backwards when action is executed. | |
21 | * T of int - indicates an actual end leaf for a trailing context reg. | |
22 | expression, which should be executed only if D i was encountered | |
23 | after this end leaf while scanning forward. The dummy end leaf is | |
24 | removed from the backward propagating list after this node is | |
25 | encountered. | |
26 | ||
27 | ||
28 | The function scan inside the function lex operates as a transition | |
29 | function, scanning the input until it is no longer possible to take any | |
30 | more transitions. It accumulates a list of the accepting leaf list | |
31 | associated with each accepting state passed through. | |
32 | ||
33 | Scan operates as follows: | |
34 | ||
35 | Input: * s - current state | |
36 | * AcceptingLeaves - list of accepting leave lists. Each state | |
37 | has a list of accepting leaves associated with it. This list | |
38 | may be nil if the state is not a final state. | |
39 | * l - position of the next character in the buffer b to read | |
40 | * i0 - starting position in the buffer. | |
41 | ||
42 | Output: If no match is found, it raises the exception LexError. | |
43 | Otherwise, it returns a value of type lexresult. | |
44 | ||
45 | It operates as a transtion function: | |
46 | It (1) adds the list of accepting leaves for the current state to | |
47 | the list of accepting leave lists | |
48 | (2) tries to make a transition on the current input character | |
49 | to the next state. If it can't make a transition, it | |
50 | executes the action function. | |
51 | (a) - if it is past the end of the buffer, it | |
52 | (1) checks if it as at end eof. If it is then: | |
53 | It checks to see if it has made any | |
54 | transitions since it was first called - | |
55 | (l>i0 when this is true.) If it hasn't | |
56 | this implies that scan was called at | |
57 | the end of file. It thus executes | |
58 | eof function declared by the user. | |
59 | Otherwise it must execute action w/ | |
60 | the current accepting state list. | |
61 | (2) otherwise it reads a block of up to 1024 | |
62 | characters, and appends this block to the | |
63 | useful suffix of characters left in the | |
64 | buffer (those character which have been | |
65 | scanned in this call to lex()). The buffer | |
66 | operation should be altered if one intends | |
67 | to process reg. expressions whose lexemes' | |
68 | length will be >> 1024. For most normal | |
69 | applications, the buffer update operation | |
70 | will be fine. | |
71 | ||
72 | This buffer update operation requires | |
73 | O(n^2/1024) char. copies for lexemes > 1024 | |
74 | characters in length, and O(n) char. copies | |
75 | for lexemes <= 1024 characters in length. | |
76 | It can be made O(n) using linked list | |
77 | buffers & a Byte.array of size n (not the | |
78 | ^operator!) for concatenating the buffers | |
79 | to return a value for yytext when a lexeme | |
80 | is longer than the typical buffer length. | |
81 | ||
82 | (3) If the transition is to a dead state (0 is used | |
83 | for the dead state), action is executed instead. |