| 1 | This is minimal documentation for the lexer driver produced by ml-lex. |
| 2 | |
| 3 | Main data structures: |
| 4 | |
| 5 | The transition table is stored in tab. Tab is an array of records, indexed |
| 6 | by state number. The first field of the record, fin, is a list of final leaves |
| 7 | assocated with it. The second field of the record, trans, is a transition |
| 8 | table for the state indexed by character number. It gives the next state |
| 9 | for a given input character. |
| 10 | |
| 11 | The usual initial start state is state #1. State 0 is a dead state, which |
| 12 | has transitions only to itself. |
| 13 | |
| 14 | The field yyfin has type yyfinstate list. yyfinstate consists of the |
| 15 | following three constructors: |
| 16 | |
| 17 | * N of int - indicates normal end leaf. |
| 18 | * D of int - dummy end leaf - for indicating when an end state for |
| 19 | a trailing context regular expression has been reached. These are |
| 20 | stored and propagated backwards when action is executed. |
| 21 | * T of int - indicates an actual end leaf for a trailing context reg. |
| 22 | expression, which should be executed only if D i was encountered |
| 23 | after this end leaf while scanning forward. The dummy end leaf is |
| 24 | removed from the backward propagating list after this node is |
| 25 | encountered. |
| 26 | |
| 27 | |
| 28 | The function scan inside the function lex operates as a transition |
| 29 | function, scanning the input until it is no longer possible to take any |
| 30 | more transitions. It accumulates a list of the accepting leaf list |
| 31 | associated with each accepting state passed through. |
| 32 | |
| 33 | Scan operates as follows: |
| 34 | |
| 35 | Input: * s - current state |
| 36 | * AcceptingLeaves - list of accepting leave lists. Each state |
| 37 | has a list of accepting leaves associated with it. This list |
| 38 | may be nil if the state is not a final state. |
| 39 | * l - position of the next character in the buffer b to read |
| 40 | * i0 - starting position in the buffer. |
| 41 | |
| 42 | Output: If no match is found, it raises the exception LexError. |
| 43 | Otherwise, it returns a value of type lexresult. |
| 44 | |
| 45 | It operates as a transtion function: |
| 46 | It (1) adds the list of accepting leaves for the current state to |
| 47 | the list of accepting leave lists |
| 48 | (2) tries to make a transition on the current input character |
| 49 | to the next state. If it can't make a transition, it |
| 50 | executes the action function. |
| 51 | (a) - if it is past the end of the buffer, it |
| 52 | (1) checks if it as at end eof. If it is then: |
| 53 | It checks to see if it has made any |
| 54 | transitions since it was first called - |
| 55 | (l>i0 when this is true.) If it hasn't |
| 56 | this implies that scan was called at |
| 57 | the end of file. It thus executes |
| 58 | eof function declared by the user. |
| 59 | Otherwise it must execute action w/ |
| 60 | the current accepting state list. |
| 61 | (2) otherwise it reads a block of up to 1024 |
| 62 | characters, and appends this block to the |
| 63 | useful suffix of characters left in the |
| 64 | buffer (those character which have been |
| 65 | scanned in this call to lex()). The buffer |
| 66 | operation should be altered if one intends |
| 67 | to process reg. expressions whose lexemes' |
| 68 | length will be >> 1024. For most normal |
| 69 | applications, the buffer update operation |
| 70 | will be fine. |
| 71 | |
| 72 | This buffer update operation requires |
| 73 | O(n^2/1024) char. copies for lexemes > 1024 |
| 74 | characters in length, and O(n) char. copies |
| 75 | for lexemes <= 1024 characters in length. |
| 76 | It can be made O(n) using linked list |
| 77 | buffers & a Byte.array of size n (not the |
| 78 | ^operator!) for concatenating the buffers |
| 79 | to return a value for yytext when a lexeme |
| 80 | is longer than the typical buffer length. |
| 81 | |
| 82 | (3) If the transition is to a dead state (0 is used |
| 83 | for the dead state), action is executed instead. |