[hcoop/debian/mlton.git] / doc / guide / src / CommonArg.adoc

CommonArg
=========

<:CommonArg:> is an optimization pass for the <:SSA:>
<:IntermediateLanguage:>, invoked from <:SSASimplify:>.

== Description ==

It optimizes instances of `Goto` transfers that pass the same
arguments to the same label; e.g.
----
L_1 ()
  ...
  z1 = ?
  ...
  L_3 (x, y, z1)
L_2 ()
  ...
  z2 = ?
  ...
  L_3 (x, y, z2)
L_3 (a, b, c)
  ...
----

This code can be simplified to:
----
L_1 ()
  ...
  z1 = ?
  ...
  L_3 (z1)
L_2 ()
  ...
  z2 = ?
  ...
  L_3 (z2)
L_3 (c)
  a = x
  b = y
----
which saves a number of resources: time of setting up the arguments
for the jump to `L_3`, space (either stack or pseudo-registers) for
the arguments of `L_3`, etc.  It may also expose some other
optimizations, if more information is known about `x` or `y`.

== Implementation ==

* <!ViewGitFile(mlton,master,mlton/ssa/common-arg.fun)>

== Details and Notes ==

Three analyses were originally proposed to drive the optimization
transformation.  Only the _Dominator Analysis_ is currently
implemented.  (Implementations of the other analyses are available in
the <:Sources:repository history>.)

=== Syntactic Analysis ===

The simplest analysis I could think of maintains
----
varInfo: Var.t -> Var.t option list ref
----
initialized to `[]`.

* For each variable `v` bound in a `Statement.t` or in the
`Function.t` args, then `List.push(varInfo v, NONE)`.
* For each `L (x1, ..., xn)` transfer where `(a1, ..., an)` are the
formals of `L`, then `List.push(varInfo ai, SOME xi)`.
* For each block argument a used in an unknown context (e.g.,
arguments of blocks used as continuations, handlers, arith success,
runtime return, or case switch labels), then
`List.push(varInfo a, NONE)`.

Now, any block argument `a` such that `varInfo a = xs`, where all of
the elements of `xs` are equal to `SOME x`, can be optimized by
setting `a = x` at the beginning of the block and dropping the
argument from `Goto` transfers.

That takes care of the example above.  We can clearly do slightly
better, by changing the transformation criteria to the following: any
block argument a such that `varInfo a = xs`, where all of the elements
of `xs` are equal to `SOME x` _or_ are equal to `SOME a`, can be
optimized by setting `a = x` at the beginning of the block and
dropping the argument from `Goto` transfers.  This optimizes a case
like:
----
L_1 ()
  ... z1 = ? ...
  L_3 (x, y, z1)
L_2 ()
  ... z2 = ? ...
  L_3(x, y, z2)
L_3 (a, b, c)
  ... w = ? ...
  case w of
    true => L_4 | false => L_5
L_4 ()
   ...
   L_3 (a, b, w)
L_5 ()
   ...
----
where a common argument is passed to a loop (and is invariant through
the loop).  Of course, the <:LoopInvariant:> optimization pass would
normally introduce a local loop and essentially reduce this to the
first example, but I have seen this in practice, which suggests that
some optimizations after <:LoopInvariant:> do enough simplifications
to introduce (new) loop invariant arguments.

=== Fixpoint Analysis ===

However, the above analysis and transformation doesn't cover the cases
where eliminating one common argument exposes the opportunity to
eliminate other common arguments.  For example:
----
L_1 ()
  ...
  L_3 (x)
L_2 ()
  ...
  L_3 (x)
L_3 (a)
  ...
  L_5 (a)
L_4 ()
  ...
  L_5 (x)
L_5 (b)
  ...
----

One pass of analysis and transformation would eliminate the argument
to `L_3` and rewrite the `L_5(a)` transfer to `L_5 (x)`, thereby
exposing the opportunity to eliminate the common argument to `L_5`.

The interdependency the arguments to `L_3` and `L_5` suggest
performing some sort of fixed-point analysis.  This analysis is
relatively simple; maintain
----
varInfo: Var.t -> VarLattice.t
----
{empty}where
----
VarLattice.t ~=~ Bot | Point of Var.t | Top
----
(but is implemented by the <:FlatLattice:> functor with a `lessThan`
list and `value ref` under the hood), initialized to `Bot`.

* For each variable `v` bound in a `Statement.t` or in the
`Function.t` args, then `VarLattice.<= (Point v, varInfo v)`
* For each `L (x1, ..., xn)` transfer where `(a1, ..., an)` are the
formals of `L`}, then `VarLattice.<= (varInfo xi, varInfo ai)`.
* For each block argument a used in an unknown context, then
`VarLattice.<= (Point a, varInfo a)`.

Now, any block argument a such that `varInfo a = Point x` can be
optimized by setting `a = x` at the beginning of the block and
dropping the argument from `Goto` transfers.

Now, with the last example, we introduce the ordering constraints:
----
varInfo x <= varInfo a
varInfo a <= varInfo b
varInfo x <= varInfo b
----

Assuming that `varInfo x = Point x`, then we get `varInfo a = Point x`
and `varInfo b = Point x`, and we optimize the example as desired.

But, that is a rather weak assumption.  It's quite possible for
`varInfo x = Top`.  For example, consider:
----
G_1 ()
  ... n = 1 ...
  L_0 (n)
G_2 ()
  ... m = 2 ...
  L_0 (m)
L_0 (x)
  ...
L_1 ()
  ...
  L_3 (x)
L_2 ()
  ...
  L_3 (x)
L_3 (a)
  ...
  L_5(a)
L_4 ()
  ...
  L_5(x)
L_5 (b)
   ...
----

Now `varInfo x = varInfo a = varInfo b = Top`.  What went wrong here?
When `varInfo x` went to `Top`, it got propagated all the way through
to `a` and `b`, and prevented the elimination of any common arguments.
What we'd like to do instead is when `varInfo x` goes to `Top`,
propagate on `Point x` -- we have no hope of eliminating `x`, but if
we hold `x` constant, then we have a chance of eliminating arguments
for which `x` is passed as an actual.

=== Dominator Analysis ===

Does anyone see where this is going yet?  Pausing for a little
thought, <:MatthewFluet:> realized that he had once before tried
proposing this kind of "fix" to a fixed-point analysis -- when we were
first investigating the <:Contify:> optimization in light of John
Reppy's CWS paper.  Of course, that "fix" failed because it defined a
non-monotonic function and one couldn't take the fixed point.  But,
<:StephenWeeks:> suggested a dominator based approach, and we were
able to show that, indeed, the dominator analysis subsumed both the
previous call based analysis and the cont based analysis.  And, a
moment's reflection reveals further parallels: when
`varInfo: Var.t -> Var.t option list ref`, we have something analogous
to the call analysis, and when `varInfo: Var.t -> VarLattice.t`, we
have something analogous to the cont analysis.  Maybe there is
something analogous to the dominator approach (and therefore superior
to the previous analyses).

And this turns out to be the case.  Construct the graph `G` as follows:
----
nodes(G) = {Root} U Var.t
edges(G) = {Root -> v | v bound in a Statement.t or
                                in the Function.t args} U
           {xi -> ai | L(x1, ..., xn) transfer where (a1, ..., an)
                                      are the formals of L} U
           {Root -> a | a is a block argument used in an unknown context}
----

Let `idom(x)` be the immediate dominator of `x` in `G` with root
`Root`.  Now, any block argument a such that `idom(a) = x <> Root` can
be optimized by setting `a = x` at the beginning of the block and
dropping the argument from `Goto` transfers.

Furthermore, experimental evidence suggests (and we are confident that
a formal presentation could prove) that the dominator analysis
subsumes the "syntactic" and "fixpoint" based analyses in this context
as well and that the dominator analysis gets "everything" in one go.

=== Final Thoughts ===

I must admit, I was rather surprised at this progression and final
result.  At the outset, I never would have thought of a connection
between <:Contify:> and <:CommonArg:> optimizations.  They would seem
to be two completely different optimizations.  Although, this may not
really be the case.  As one of the reviewers of the ICFP paper said:
____
I understand that such a form of CPS might be convenient in some
cases, but when we're talking about analyzing code to detect that some
continuation is constant, I think it makes a lot more sense to make
all the continuation arguments completely explicit.

I believe that making all the continuation arguments explicit will
show that the optimization can be generalized to eliminating constant
arguments, whether continuations or not.
____

What I think the common argument optimization shows is that the
dominator analysis does slightly better than the reviewer puts it: we
find more than just constant continuations, we find common
continuations.  And I think this is further justified by the fact that
I have observed common argument eliminate some `env_X` arguments which
would appear to correspond to determining that while the closure being
executed isn't constant it is at least the same as the closure being
passed elsewhere.

At first, I was curious whether or not we had missed a bigger picture
with the dominator analysis.  When we wrote the contification paper, I
assumed that the dominator analysis was a specialized solution to a
specialized problem; we never suggested that it was a technique suited
to a larger class of analyses.  After initially finding a connection
between <:Contify:> and <:CommonArg:> (and thinking that the only
connection was the technique), I wondered if the dominator technique
really was applicable to a larger class of analyses.  That is still a
question, but after writing up the above, I'm suspecting that the
"real story" is that the dominator analysis is a solution to the
common argument optimization, and that the <:Contify:> optimization is
specializing <:CommonArg:> to the case of continuation arguments (with
a different transformation at the end).  (Note, a whole-program,
inter-procedural common argument analysis doesn't really make sense
(in our <:SSA:> <:IntermediateLanguage:>), because the only way of
passing values between functions is as arguments.  (Unless of course
in the case that the common argument is also a constant argument, in
which case <:ConstantPropagation:> could lift it to a global.)  The
inter-procedural <:Contify:> optimization works out because there we
move the function to the argument.)

Anyways, it's still unclear to me whether or not the dominator based
approach solves other kinds of problems.

=== Phase Ordering ===

On the downside, the optimization doesn't have a huge impact on
runtime, although it does predictably saved some code size.  I stuck
it in the optimization sequence after <:Flatten:> and (the third round
of) <:LocalFlatten:>, since it seems to me that we could have cases
where some components of a tuple used as an argument are common, but
the whole tuple isn't.  I think it makes sense to add it after
<:IntroduceLoops:> and <:LoopInvariant:> (even though <:CommonArg:>
get some things that <:LoopInvariant:> gets, it doesn't get all of
them).  I also think that it makes sense to add it before
<:CommonSubexp:>, since identifying variables could expose more common
subexpressions.  I would think a similar thought applies to
<:RedundantTests:>.
Commit	Line	Data
7f918cf1 CE	1	CommonArg
	2	=========
	3
	4	<:CommonArg:> is an optimization pass for the <:SSA:>
	5	<:IntermediateLanguage:>, invoked from <:SSASimplify:>.
	6
	7	== Description ==
	8
	9	It optimizes instances of `Goto` transfers that pass the same
	10	arguments to the same label; e.g.
	11	----
	12	L_1 ()
	13	...
	14	z1 = ?
	15	...
	16	L_3 (x, y, z1)
	17	L_2 ()
	18	...
	19	z2 = ?
	20	...
	21	L_3 (x, y, z2)
	22	L_3 (a, b, c)
	23	...
	24	----
	25
	26	This code can be simplified to:
	27	----
	28	L_1 ()
	29	...
	30	z1 = ?
	31	...
	32	L_3 (z1)
	33	L_2 ()
	34	...
	35	z2 = ?
	36	...
	37	L_3 (z2)
	38	L_3 (c)
	39	a = x
	40	b = y
	41	----
	42	which saves a number of resources: time of setting up the arguments
	43	for the jump to `L_3`, space (either stack or pseudo-registers) for
	44	the arguments of `L_3`, etc. It may also expose some other
	45	optimizations, if more information is known about `x` or `y`.
	46
	47	== Implementation ==
	48
	49	* <!ViewGitFile(mlton,master,mlton/ssa/common-arg.fun)>
	50
	51	== Details and Notes ==
	52
	53	Three analyses were originally proposed to drive the optimization
	54	transformation. Only the _Dominator Analysis_ is currently
	55	implemented. (Implementations of the other analyses are available in
	56	the <:Sources:repository history>.)
	57
	58	=== Syntactic Analysis ===
	59
	60	The simplest analysis I could think of maintains
	61	----
	62	varInfo: Var.t -> Var.t option list ref
	63	----
	64	initialized to `[]`.
65
66	* For each variable `v` bound in a `Statement.t` or in the
67	`Function.t` args, then `List.push(varInfo v, NONE)`.
68	* For each `L (x1, ..., xn)` transfer where `(a1, ..., an)` are the
69	formals of `L`, then `List.push(varInfo ai, SOME xi)`.
70	* For each block argument a used in an unknown context (e.g.,
71	arguments of blocks used as continuations, handlers, arith success,
72	runtime return, or case switch labels), then
73	`List.push(varInfo a, NONE)`.
74
75	Now, any block argument `a` such that `varInfo a = xs`, where all of
76	the elements of `xs` are equal to `SOME x`, can be optimized by
77	setting `a = x` at the beginning of the block and dropping the
78	argument from `Goto` transfers.
79
80	That takes care of the example above. We can clearly do slightly
81	better, by changing the transformation criteria to the following: any
82	block argument a such that `varInfo a = xs`, where all of the elements
83	of `xs` are equal to `SOME x` _or_ are equal to `SOME a`, can be
84	optimized by setting `a = x` at the beginning of the block and
85	dropping the argument from `Goto` transfers. This optimizes a case
86	like:
87	----
88	L_1 ()
89	... z1 = ? ...
90	L_3 (x, y, z1)
91	L_2 ()
92	... z2 = ? ...
93	L_3(x, y, z2)
94	L_3 (a, b, c)
95	... w = ? ...
96	case w of
97	true => L_4 \| false => L_5
98	L_4 ()
99	...
100	L_3 (a, b, w)
101	L_5 ()
102	...
103	----
104	where a common argument is passed to a loop (and is invariant through
105	the loop). Of course, the <:LoopInvariant:> optimization pass would
106	normally introduce a local loop and essentially reduce this to the
107	first example, but I have seen this in practice, which suggests that
108	some optimizations after <:LoopInvariant:> do enough simplifications
109	to introduce (new) loop invariant arguments.
110
111	=== Fixpoint Analysis ===
112
113	However, the above analysis and transformation doesn't cover the cases
114	where eliminating one common argument exposes the opportunity to
115	eliminate other common arguments. For example:
116	----
117	L_1 ()
118	...
119	L_3 (x)
120	L_2 ()
121	...
122	L_3 (x)
123	L_3 (a)
124	...
125	L_5 (a)
126	L_4 ()
127	...
128	L_5 (x)
129	L_5 (b)
130	...
131	----
132
133	One pass of analysis and transformation would eliminate the argument
134	to `L_3` and rewrite the `L_5(a)` transfer to `L_5 (x)`, thereby
135	exposing the opportunity to eliminate the common argument to `L_5`.
136
137	The interdependency the arguments to `L_3` and `L_5` suggest
138	performing some sort of fixed-point analysis. This analysis is
139	relatively simple; maintain
140	----
141	varInfo: Var.t -> VarLattice.t
142	----
143	{empty}where
144	----
145	VarLattice.t ~=~ Bot \| Point of Var.t \| Top
146	----
147	(but is implemented by the <:FlatLattice:> functor with a `lessThan`
148	list and `value ref` under the hood), initialized to `Bot`.
149
150	* For each variable `v` bound in a `Statement.t` or in the
151	`Function.t` args, then `VarLattice.<= (Point v, varInfo v)`
152	* For each `L (x1, ..., xn)` transfer where `(a1, ..., an)` are the
153	formals of `L`}, then `VarLattice.<= (varInfo xi, varInfo ai)`.
154	* For each block argument a used in an unknown context, then
155	`VarLattice.<= (Point a, varInfo a)`.
156
157	Now, any block argument a such that `varInfo a = Point x` can be
158	optimized by setting `a = x` at the beginning of the block and
159	dropping the argument from `Goto` transfers.
160
161	Now, with the last example, we introduce the ordering constraints:
162	----
163	varInfo x <= varInfo a
164	varInfo a <= varInfo b
165	varInfo x <= varInfo b
166	----
167
168	Assuming that `varInfo x = Point x`, then we get `varInfo a = Point x`
169	and `varInfo b = Point x`, and we optimize the example as desired.
170
171	But, that is a rather weak assumption. It's quite possible for
172	`varInfo x = Top`. For example, consider:
173	----
174	G_1 ()
175	... n = 1 ...
176	L_0 (n)
177	G_2 ()
178	... m = 2 ...
179	L_0 (m)
180	L_0 (x)
181	...
182	L_1 ()
183	...
184	L_3 (x)
185	L_2 ()
186	...
187	L_3 (x)
188	L_3 (a)
189	...
190	L_5(a)
191	L_4 ()
192	...
193	L_5(x)
194	L_5 (b)
195	...
196	----
197
198	Now `varInfo x = varInfo a = varInfo b = Top`. What went wrong here?
199	When `varInfo x` went to `Top`, it got propagated all the way through
200	to `a` and `b`, and prevented the elimination of any common arguments.
201	What we'd like to do instead is when `varInfo x` goes to `Top`,
202	propagate on `Point x` -- we have no hope of eliminating `x`, but if
203	we hold `x` constant, then we have a chance of eliminating arguments
204	for which `x` is passed as an actual.
205
206	=== Dominator Analysis ===
207
208	Does anyone see where this is going yet? Pausing for a little
209	thought, <:MatthewFluet:> realized that he had once before tried
210	proposing this kind of "fix" to a fixed-point analysis -- when we were
211	first investigating the <:Contify:> optimization in light of John
212	Reppy's CWS paper. Of course, that "fix" failed because it defined a
213	non-monotonic function and one couldn't take the fixed point. But,
214	<:StephenWeeks:> suggested a dominator based approach, and we were
215	able to show that, indeed, the dominator analysis subsumed both the
216	previous call based analysis and the cont based analysis. And, a
217	moment's reflection reveals further parallels: when
218	`varInfo: Var.t -> Var.t option list ref`, we have something analogous
219	to the call analysis, and when `varInfo: Var.t -> VarLattice.t`, we
220	have something analogous to the cont analysis. Maybe there is
221	something analogous to the dominator approach (and therefore superior
222	to the previous analyses).
223
224	And this turns out to be the case. Construct the graph `G` as follows:
225	----
226	nodes(G) = {Root} U Var.t
227	edges(G) = {Root -> v \| v bound in a Statement.t or
228	in the Function.t args} U
229	{xi -> ai \| L(x1, ..., xn) transfer where (a1, ..., an)
230	are the formals of L} U
231	{Root -> a \| a is a block argument used in an unknown context}
232	----
233
234	Let `idom(x)` be the immediate dominator of `x` in `G` with root
235	`Root`. Now, any block argument a such that `idom(a) = x <> Root` can
236	be optimized by setting `a = x` at the beginning of the block and
237	dropping the argument from `Goto` transfers.
238
239	Furthermore, experimental evidence suggests (and we are confident that
240	a formal presentation could prove) that the dominator analysis
241	subsumes the "syntactic" and "fixpoint" based analyses in this context
242	as well and that the dominator analysis gets "everything" in one go.
243
244	=== Final Thoughts ===
245
246	I must admit, I was rather surprised at this progression and final
247	result. At the outset, I never would have thought of a connection
248	between <:Contify:> and <:CommonArg:> optimizations. They would seem
249	to be two completely different optimizations. Although, this may not
250	really be the case. As one of the reviewers of the ICFP paper said:
251	____
252	I understand that such a form of CPS might be convenient in some
253	cases, but when we're talking about analyzing code to detect that some
254	continuation is constant, I think it makes a lot more sense to make
255	all the continuation arguments completely explicit.
256
257	I believe that making all the continuation arguments explicit will
258	show that the optimization can be generalized to eliminating constant
259	arguments, whether continuations or not.
260	____
261
262	What I think the common argument optimization shows is that the
263	dominator analysis does slightly better than the reviewer puts it: we
264	find more than just constant continuations, we find common
265	continuations. And I think this is further justified by the fact that
266	I have observed common argument eliminate some `env_X` arguments which
267	would appear to correspond to determining that while the closure being
268	executed isn't constant it is at least the same as the closure being
269	passed elsewhere.
270
271	At first, I was curious whether or not we had missed a bigger picture
272	with the dominator analysis. When we wrote the contification paper, I
273	assumed that the dominator analysis was a specialized solution to a
274	specialized problem; we never suggested that it was a technique suited
275	to a larger class of analyses. After initially finding a connection
276	between <:Contify:> and <:CommonArg:> (and thinking that the only
277	connection was the technique), I wondered if the dominator technique
278	really was applicable to a larger class of analyses. That is still a
279	question, but after writing up the above, I'm suspecting that the
280	"real story" is that the dominator analysis is a solution to the
281	common argument optimization, and that the <:Contify:> optimization is
282	specializing <:CommonArg:> to the case of continuation arguments (with
283	a different transformation at the end). (Note, a whole-program,
284	inter-procedural common argument analysis doesn't really make sense
285	(in our <:SSA:> <:IntermediateLanguage:>), because the only way of
286	passing values between functions is as arguments. (Unless of course
287	in the case that the common argument is also a constant argument, in
288	which case <:ConstantPropagation:> could lift it to a global.) The
289	inter-procedural <:Contify:> optimization works out because there we
290	move the function to the argument.)
291
292	Anyways, it's still unclear to me whether or not the dominator based
293	approach solves other kinds of problems.
294
295	=== Phase Ordering ===
296
297	On the downside, the optimization doesn't have a huge impact on
298	runtime, although it does predictably saved some code size. I stuck
299	it in the optimization sequence after <:Flatten:> and (the third round
300	of) <:LocalFlatten:>, since it seems to me that we could have cases
301	where some components of a tuple used as an argument are common, but
302	the whole tuple isn't. I think it makes sense to add it after
303	<:IntroduceLoops:> and <:LoopInvariant:> (even though <:CommonArg:>
304	get some things that <:LoopInvariant:> gets, it doesn't get all of
305	them). I also think that it makes sense to add it before
306	<:CommonSubexp:>, since identifying variables could expose more common
307	subexpressions. I would think a similar thought applies to
308	<:RedundantTests:>.