Backport from sid to buster
[hcoop/debian/mlton.git] / mlyacc / doc / tech.doc
1 A Hacker's guide ML-Yacc itself
2
3 The program for computing the LALR(1) table can be divided into 3 separate
4 parts. The first part computes the LR(0) graph. The second part attaches
5 lookahead to the LR(0) graph to get the LALR(1) graph. The third part
6 computes the parse tables from the LALR(1) graph.
7
8 Look at the file sigs.sml to see how the modules are layed out.
9 The file graph.sml contains the Graph functor, which produces a structure
10 containing a function mkGraph. mkGraph takes a grammar and returns a
11 some useful values and functions, including the LR(0) graph. It renumbers
12 the rules to an internal form to make the LR(0) graph generation more
13 efficient. The LR(0) graph includes only core items in its set of items.
14
15 The file look.sml takes some of theses values and produces functions
16 which tell whether a nonterm is nullable and the first set of a symbol
17 list.
18
19 The functor mkLalr creates a structure with a function that takes an LR(0)
20 graph and some other values (notably the first and nullable) functions
21 produced by Look and creates a stripped down version of an LR(0) graph with
22 lookaheads attached. Nullable items (which usually aren't core items) are
23 added and all other items without dots at the end (i.e. non-reduction items)
24 are removed.
25
26 The functor MkTable produces a function with takes the LR(0) graph
27 produced by the function in mkGraph and the LR(0) graph with lookaheads
28 produced by Lalr and creates an LALR(1) table from these graphs.
29
30
31 -----------------------------------------------------------------------
32 An overview of the algorithms used in LR(0) graph generation and
33 LALR(1) lookahead creation.
34
35 LR(0) graph
36 -----------
37
38 The LR(0) graph consists of sets of items. Each set of items will be
39 called a core set. The basic algorithm is:
40
41 let fun add_gotos(graph,f,nil,r) = (graph,r)
42 | add_gotos(graph,f,(a,symbol)::b,r)
43 let newgraph = graph + edge from f to a labelled
44 with symbol
45 in if a exists in graph then
46 add_gotos(newgraph,f,b,r)
47 else add_gotos(newgraph,f,b,a::r)
48 end
49 fun f(graph,nil) = graph
50 | f(graph,a::b) = f(add_gotos(graph,a,gotos of closure a,b))
51 in f(empty-graph,[initial core set])
52 end
53
54 For each core, we compute the new cores which result from doing a shift
55 or goto, and then add these new cores with the symbol used in the shift
56 or goto to the graph. We continue doing this until there are no more cores
57 to adds to the graph.
58
59 We have to take the closure of a core to include those items which are
60 derived from nonterminals with a dot before them. If item A -> 'a .B 'c
61 is in a core, the all productions derived by B must also be in the core.
62
63 We want to be able to do the following operations efficently:
64 (1) check if a core is in the graph already
65 (2) compute the closure of a core
66 (3) compute the cores resulting from goto/shift operations.
67
68 (1) This can be done efficiently if a complete order exists for the cores. This
69 can be done by imposing an ordering on items, giving each item a unique
70 integer and using the place in an item. This can be used to order a
71 set of items.
72
73 (2) Much of the computation for the closure can be done ahead of time.
74 The set of nonterminals to add for a given a nonterminal can be pre-computed
75 using a transitive closure algorithm (the transitive closure is sparse
76 in practice). One can then compute the closure for a core in the following
77 manner. First, compute the set of nonterminals with . in front of them.
78 This can be done in (m ln m) time. Next, use the results from the
79 transitive closure to compute the complete set of nonterminals that
80 should be used. Finally, for each nonterminal, merge its set of
81 productions (sort all rules by the nonterminals from which they
82 are derived before numbering them, then all we have to do is just
83 prepend the rules while scanning the list in reverse order).
84
85 (3) To do this, just scan the core closure, sorting rules by their
86 symbols into lists. Then reverse all the lists, and we have the
87 new core sets.
88
89 Lookahead representation
90 ------------------------
91
92 The previous part throws away the result of the closure operations.
93 It is used only to compute new cores for use in the goto operation.
94 These intermediate results should be saved because they will be useful
95 here.
96
97 Lookaheads are attached to an item when
98
99 (1) an item is the result of a shift/goto. The item
100 must have the same lookahead as the item from which it
101 is derived.
102 (2) an item is added as the result of a closure. Note that
103 in fact all productions derived from a given nonterminal
104 are added here. This can be used (perhaps) to our
105 advantage, as we can represent a closure using just the
106 nonterminal.
107
108 This can be divided into two cases:
109
110 (a) A -> 'a .B 'c , where 'c derives epsilon,
111 (b) A -> 'a .B 'c , where 'c does not derive epsilon
112
113 In (a), lookahead(items derived from B) includes first('c)
114 and lookahead(A -> 'a .B 'c)
115
116 In (b), lookahead(items derived from B) includes only first('c).
117
118 This is an example of back propagation.
119
120 Note that an item is either the result of a closure or the
121 result of a shift/goto. It is never the result of both (that
122 would be a contradiction).
123
124 The following representation will be used:
125
126 goto/shift items:
127 an ordered list of item * lookahead ref *
128 lookahead ref for the resulting
129 shift/goto item in another core.
130
131 closure items:
132 for each nonterminal:
133 (1) lookahead ref
134 (2) a list of item * lookahead ref for the
135 resulting shift/goto item in another
136 core.
137
138 Lookahead algorithms
139 --------------------
140
141 After computing the LR(0) graph, lookaheads must be attached to the items in
142 the graph. An item i may receive lookaheads in two ways. If item i
143 was the result of a shift or goto from some item j, then lookahead(i) includes
144 lookahead(j). If item i is a production of some nonterminal B, and there
145 exists some item j of the form A -> x .B y, then item i will be added through
146 closure(j). This implies that lookahead(i) includes first(y). If y =>
147 epsilon, then lookahead(i) includes lookahead(j).
148
149 Lookahead must be recorded for completion items, which are items of the
150 form A -> x., non-closure items of the form A -> y . B z, where z is
151 not nullable, and closure items of the form A -> epsilon. (comment:
152 items of the form A -> .x can appear in the start state as non-closure items.
153 A must be the start symbol, which should not appear in the right hand side
154 of any rule. This implies that lookaheads will never be propagated to
155 such items)
156
157 We chose to omit closure items that do not have the form A -> epsilon.
158 It is possible to add lookaheads to closure items, but we have not
159 done so because it would greatly slow down the addition of lookaheads.
160
161 Instead we precompute the nonterminals whose productions are
162 added through the closure operation, the lookaheads for these
163 nonterminals, and whether the lookahead for these nonterminals
164 should include first(y) and lookahead(j) for some item j of the
165 form A -> x .B y. This information depends only on the particular
166 nonterminal whose closure is being taken.
167
168 Some notation is necessary to describe what is happening here. Let
169 =c=> denote items added in one closure step that are derived from some
170 nonterminal B in a production A -> x .B y. Let =c+=> denote items
171 added in one or more =c=> steps.
172
173 Consider the following productions
174
175 B -> S ;
176 S -> E
177 E -> F * E
178 F -> num
179
180 in a kernal with the item
181
182 B -> .S
183
184 The following derivations are possible:
185
186 B -> .S =c=> S -> .E =c+=> S -> .E, E -> .F * E, F -> .num
187
188 The nonterminals that are added through the closure operation
189 are the nonterminals for some item j = A -> .B x such that j =c+=> .C y.
190 Lookahead(C) includes first(y). If y =*=> epsilon then
191 lookahead (C) includes first (x). If x=*=> epsilon and y =*=> epsilon
192 then lookahead(C) includes first(j).
193
194 The following algorithm computes the information for each nonterminal:
195
196 (1) nonterminals such that c =c+=> .C y and y =*=> epsilon
197
198 Let s = the set of nonterminals added through closure = B
199
200 repeat
201 for all B which are elements of s,
202 if B -> .C z and z =*=> epsilon then
203 add B to s.
204 until s does not change.
205
206 (2) nonterminals added through closure and their lookaheads
207
208 Let s = the set of nonterminals added through closure = B
209 where A -> x . B y
210
211 repeat
212 for all B which are elements of s,
213 if B -> .C z then add C to s, and
214 add first(z) to lookahead(C)
215 until nothing changes.
216
217 Now, for each nonterminal A in s, find the set of nonterminals
218 such that A =c+=> .B z, and z =*=> epsilon (i.e. use the results
219 from 1). Add the lookahead for nonterminal A to the lookahead
220 for each nonterminal in this set.
221
222 These algorithms can be restated as either breadth-first or depth-first search
223 algorithms. The loop invariant of the algorithms is that whenever a
224 nonterminal is added to the set being calculated, all the productions
225 for the nonterminal are checked.
226
227 This algorithm computes the lookahead for each item:
228
229 for each state,
230 for each item of the form A -> u .B v in the state, where u may be
231 nullable,
232 let first_v = first(v)
233 l-ref = ref for A -> u .B v
234 s = the set of nonterminals added through the closure of B.
235
236 for each element X of s,
237
238 let r = the rules produced by an element X of s
239 l = the lookahead ref cells for each rule, i.e.
240 all items of A -> x. or A -> x .B y, where
241 y =*=> epsilon, and x is not epsilon
242
243 add the lookahead we have computed for X to the
244 elements of l
245
246 if B =c+=> X z, where z is nullable, add first(y) to
247 the l. If y =*=> epsilon, save l with the ref for
248 A -> x .B y in a list.
249
250 Now take the list of (lookahead ref, list of lookahead refs) and propagate
251 each lookahead ref cell's contents to the elements of the list of lookahead
252 ref cells associated with it. Iterate until no lookahead set changes.