1 Copyright (c) 2001, 2002, Lucent Technologies, Bell Laboratories
2
3 author: Matthias Blume (blume@research.bell-labs.com)
4
5 This directory contains ML-NLFFI-Gen, a glue-code generator for
6 the new "NLFFI" foreign function interface. The generator reads
7 C source code and emits ML code along with a description file for CM.
8
9 Compiling this generator requires the C-Kit ($/ckit-lib.cm) to be
10 installed.
11
12 ---------------------------------------------------------------------
13
14 February 21, 2002: Major changes:
15
16 I reworked the glue code generator in a way that lets generated code
17 scale better -- at the expense of some (mostly academic) generality.
18
19 Changes involve the following:
20
21 1. The functorization is gone.
22
23 2. Every top-level C declaration results in a separate top-level
24 ML equivalent (implemented by its own ML source file).
25
26 3. Incomplete pointer types are treated just like their complete
27 versions -- the only difference being that no RTTI will be
28 available for them. In the "light" interface, this rules out
29 precisely those operations over them that C would disallow.
30
31 4. All related C sources must be supplied to ml-nlffigen together.
32 Types incomplete in one source but complete in another get
33 automatically completed in a cross-file fashion.
34
35 5. The handle for the shared library to link to is now abstracted as
36 a function closure. Moreover, it must be supplied as a top-level
37 variable (by the programmer). For this purpose, ml-nlffigen has
38 corresponding command-line options.
39
40 These changes mean that even very large (in number of exported definitions)
41 libraries such as, e.g., GTK can now be handled gracefully without
42 reaching the limits of the ML compiler's abilities.
43
44 [The example of GTK -- for which ml-nlffigen creates several thousands (!)
45 of separate ML source files -- puts an unusual burden on CM, though.
46 However, aside from running a bit longer than usual, CM handles loads
47 of this magnitude just fine. Stabilizing the resulting library solves
48 the problem entirely as far as later clients are concerned.]
49
50
51 Sketch of translation- (and naming-) scheme:
52
53 struct foo { ... }
54 --> structure ST_foo in st-foo.sml (not exported)
55 basic type info (name, size)
56 & structure S_foo in s-foo.sml
57 abstract interface to the type
58 field accessors f_xxx (unless -light)
59 and f_xxx' (unless -heavy)
60 field types t_f_xxx
61 field RTTI typ_f_xxx
62
63 & (unless "-nosucvt" was set)
64 structures IS_foo in <a>/is-foo.sml
65 (see discussion of struct *foo below)
66
67 union foo { ... }
68 --> structure UT_foo in ut-foo.sml (not exported)
69 basic type info (name, size)
70 & structure U_foo in u-foo.sml
71 abstract interface to the type
72 field accessors f_xxx (unless -light)
73 and f_xxx' (unless -heavy)
74 field types t_f_xxx
75 field RTTI typ_f_xxx
76
77 & (unless "-nosucvt" was set)
78 structures IU_foo in <a>/iu-foo.sml
79 (see discussion of union *foo below)
80
81 struct { ... }
82 like struct <n> { ... }, where <n> is a fresh integer or 'bar
83 if 'struct { ... }' occurs in the context of a
84 'typedef struct { ... } bar'
85
86 union { ... }
87 like union <n> { ... }, where <n> is a fresh integer or 'bar
88 if 'union { ... }' occurs in the context of a
89 'typedef union { ... } bar'
90
91
92 enum foo { ... }
93 --> structure E_foo in e-foo.sml
94 external type mlrep with
95 enum constants e_xxx
96 conversion functions between tag enum and mlrep
97 between mlrep and sint
98 access functions (get/set) that operate on mlrep
99 (as an alternative to C.Get.enum/C.Set.enum which
100 operate on sint)
101
102 If the command-line option "-ec" ("-enum-constructors") was set
103 and the values of all enum constants are different from each
104 other, then mlrep will be a datatype (thus making it possible
105 to pattern-match).
106
107 enum { ... }
108 If this construct appears in the context of a surrounding
109 (non-anonymous) struct or union or typedef, the enumeration gets
110 assigned an artificial tag (just like similar structs and unions,
111 see above).
112
113 Unless the command-line option "-nocollect" was specified, all
114 constants in other (truly) unnamed enumerations will be
115 collected into a single enumeration represented by structure E_'.
116 This single enumeration is then treated like a regular enumeration
117 (including handling of "-ec" -- see above).
118
119 If "-nocollect" was specified, each such enumeration is instead
120 assigned a fresh integer tag (again, just like in the struct/union case).
121
122 T foo (T, ..., T) (global function/function prototype)
123 --> structure F_foo in f-foo.sml
124 containing three/four members:
125 typ : RTTI
126 fptr: thunkified fptr representing the C function
127 maybe f' : light-weight function wrapper around fptr
128 Turned off by -heavy (see below).
129 maybe f : heavy-weight function wrapper around fptr
130 Turned off by -light (see below).
131
132 T foo; (global variable)
133 --> structure G_foo in g-foo.sml
134 containing three members:
135 t : type
136 typ : RTTI
137 obj : thunkified object representing the C variable
138
139 struct foo * (without existing definition of struct foo; incomplete type)
140 --> an internal structure ST_foo with a type "tag" (just like in
141 the struct foo { ... } case)
142 The difference is that no structure S_foo will be generated,
143 so there is no field-access interface and no RTTI (size or typ)
144 for this. All "light-weight" functions referring to this
145 pointer type will be generated, heavy-weight functions will
146 be generated only if they do not require access to RTTI.
147
148 If "-heavy" was specified but a heavy interface function
149 cannot be generated because of incomplete types, then its
150 light counterpart will be generated anyway.
151
152 union foo * Same as with struct foo *, but replace S_foo with U_foo
153 and ST_foo with UT_foo.
154
155 Additional files for implementing function entry sequences are created
156 and used internally. They do not contribute exports, though.
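
As an illustration of the scheme sketched above, here is a small
hypothetical C fragment in which each declaration is annotated with the
ML structures that the rules above say would be generated for it. All
names (point, node, color, mode, origin, psum, node_is_null) are invented
for this example, and the comments only restate the naming scheme; they
are not captured tool output. Definitions are included so that the
fragment compiles on its own; a real header would normally contain only
declarations.

```c
/* Hypothetical input for ml-nlffigen; every name here is invented. */

/* struct point { ... }
 *   --> ST_point in st-point.sml (not exported)
 *     & S_point  in s-point.sml with
 *         field accessors f_x, f_y   (unless -light)
 *                     and f_x', f_y' (unless -heavy)
 *         field types     t_f_x, t_f_y
 *         field RTTI      typ_f_x, typ_f_y */
struct point {
    int x;
    int y;
};

/* struct node is never completed, so "struct node *" is an incomplete
 * pointer type:
 *   --> ST_node only; no S_node, hence no field access and no RTTI */
struct node;

/* enum color { ... }
 *   --> E_color in e-color.sml with constants e_RED, e_GREEN, e_BLUE.
 *       All values differ, so with "-ec" mlrep can be a datatype. */
enum color { RED, GREEN, BLUE };

/* enum mode { ... }
 *   --> E_mode in e-mode.sml.
 *       FIRST and DEFAULT share the value 0, so even with "-ec" mlrep
 *       falls back to MLRep.Signed.int. */
enum mode { FIRST = 0, DEFAULT = 0, LAST = 1 };

/* T foo;  (global variable)
 *   --> G_origin in g-origin.sml with members t, typ, obj */
struct point origin = { 0, 0 };

/* T foo (T, ..., T)  (global function)
 *   --> F_psum in f-psum.sml with members typ, fptr,
 *       f' (unless -heavy) and f (unless -light) */
int psum(const struct point *p)
{
    return p->x + p->y;
}

/* Function over the incomplete pointer type:
 *   --> F_node_is_null; the light wrapper f' is always available, the
 *       heavy wrapper f only if it needs no RTTI for struct node */
int node_is_null(const struct node *n)
{
    return n == 0;
}
```

Passing such a file to ml-nlffigen together with any other related C
sources (see item 4 of the list of changes above) lets incomplete types
such as struct node be completed across files when a definition exists
elsewhere.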
157
158
159 Command-line options for ml-nlffigen:
160
161 General syntax: ml-nlffigen <option> ... [--] <C-file> ...
162
163 Environment variables:
164
165 Ml-nlffigen looks at the environment variable FFIGEN_CPP to obtain
166 the template string for the cpp command line. If FFIGEN_CPP is not
167 set, the template defaults to "gcc -E -U__GNUC__ %o %s > %t".
168 The actual command line is obtained by substituting occurrences of
169 %s with the name of the source, and %t with the name of a temporary
170 file holding the pre-processed code.
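
To make the substitution concrete, the following is a minimal sketch in C
of how such a template could be expanded. It is not ml-nlffigen's actual
implementation (the tool itself is written in ML); the function name
expand_cpp_template and its parameters are assumptions made for this
example. The %o placeholder stands for the list of cpp options described
under -cppopt below.

```c
#include <stdio.h>
#include <string.h>

/* Expand a cpp command template (illustrative helper, not part of
 * ml-nlffigen):
 *   %o -> option list, %s -> source file, %t -> temporary file.
 * Every other character is copied unchanged.
 * Returns 0 on success, -1 if the result does not fit into out. */
int expand_cpp_template(const char *tmpl, const char *opts,
                        const char *src, const char *tmp,
                        char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    for (const char *p = tmpl; *p != '\0'; p++) {
        char single[2] = { *p, '\0' };  /* current character as a string */
        const char *piece;

        if (p[0] == '%' && p[1] == 'o') {
            piece = opts;
            p++;                        /* skip the placeholder letter */
        } else if (p[0] == '%' && p[1] == 's') {
            piece = src;
            p++;
        } else if (p[0] == '%' && p[1] == 't') {
            piece = tmp;
            p++;
        } else {
            piece = single;
        }

        size_t len = strlen(piece);
        if (used + len >= outsz)        /* keep room for the final NUL */
            return -1;
        memcpy(out + used, piece, len + 1);
        used += len;
    }
    return 0;
}
```

With the default template, the options "-DX=1 -Iinc", the source foo.h,
and a temporary file /tmp/foo.i (all made-up values), this sketch produces
the string "gcc -E -U__GNUC__ -DX=1 -Iinc foo.h > /tmp/foo.i".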
171
172 Options:
173
174 -dir <dir> output directory where all generated files are placed
175 -d <dir> default: "NLFFI-Generated"
176
177 -allSU instructs ml-nlffigen to include all structs and unions,
178 even those that are defined in included files (as opposed
179 to files explicitly listed as arguments)
180 default: off
181
182 -width <w> sets output line width (just a guess) to <w>
183 -w <w> default: 75
184
185 -smloption <x> instructs ml-nlffigen to include <x> into the list
186 of options to annotate .sml entries in the generated .cm
187 file with. By default, the list consists just of "noguid".
188 -guid Removes the default "noguid" from the list of sml options.
189 (This re-enables strict handling of type- and object-identity
190 but can have negative impact on CM cutoff recompilation
191 performance if the programmer routinely removes the entire
192 tree of ml-nlffigen-generated files during development.)
193
194 (*
195 -lambdasplit <x> instructs ml-nlffigen to generate "lambdasplit"
196 -ls <x> options for all ML files (see CM manual for what this means;
197 it does not currently work anyway because cross-module
198 inlining is broken).
199 default: nothing
200 *)
201
202 -target <t> Sets the target to <t> (which must be one of "sparc-unix",
203 -t <t> "x86-unix", or "x86-win32").
204 default: current architecture
205
206 -light suppress "heavy" versions of function wrappers and
207 -l field accessors; also resets any earlier -heavy to default
208 default: not suppressed
209
210 -heavy suppress "light" versions of function wrappers and
211 -h field accessors; also resets any earlier -light to default
212 default: not suppressed
213
214 -namedargs instruct ml-nlffigen to generate function wrappers that
215 -na use named arguments (ML records) instead of tuples if
216 there is enough information for this in the C source;
217 (this is not always very useful)
218 default: off
219
220 -nocollect Do not do the following:
221 Collect enum constants from truly unnamed enumerations
222 (those without tags that occur at toplevel or in an
223 unnamed context, i.e., not in a typedef or another
224 named struct or union) into a single artificial
225 enumeration tagged by ' (single apostrophe). The corresponding
226 ML-side representative will be a structure named E_'.
227
228 -enum-constructors
229 -ec When possible (i.e., if all values of a given enumeration
230 are different from each other), make the ML representation
231 type of the enumeration a datatype. The default (and
232 fallback) is to make that type the same as MLRep.Signed.int.
233
234 -libhandle <h> Use the variable <h> to refer to the handle to the
235 -lh <h> shared library object. Given the constraints of CM, <h>
236 must have the form of a long ML identifier, e.g.,
237 MyLibrary.libhandle.
238 default: Library.libh
239
240 -include <f> Mention file <f> in the generated .cm file. This option
241 -add <f> is necessary at least once for providing the library handle.
242 It can be used arbitrarily many times, resulting in more
243 than one such programmer-supplied file to be mentioned.
244 If <f> is relative, then it must be relative to the directory
245 specified in the -dir <dir> option.
246
247 -cmfile <f> Specify name of the generated .cm file, relative to
248 -cm <f> the directory specified by the -dir <dir> option.
249 default: nlffi-generated.cm
250
251 -cppopt <o> The string <o> gets added to the list of options to be
252 passed to cpp (the C preprocessor). The list of options
253 gets substituted for %o in the cpp command line template.
254
255 -U<x> The string -U<x> gets added to the list of cpp options.
256
257 -D<x> The string -D<x> gets added to the list of cpp options.
258
259 -I<x> The string -I<x> gets added to the list of cpp options.
260
261 -version Just write the version number of ml-nlffigen to standard
262 output and then quit.
263
264 -match <r> Normally ml-nlffigen will include ML definitions for a C
265 -m <r> declaration if the C declaration textually appears in
266 one of the files specified at the command line. Definitions
267 in #include-d files will normally not appear (unless
268 their absence would lead to inconsistencies).
269 By specifying -match <r>, ml-nlffigen will also include
270 definitions that occur in recursively #include-d files
271 for which the AWK-style regular expression <r> matches
272 their names.
273
274 -prefix <p> Generated ML structure names will all have prefix <p>
275 -p <p> (in addition to the usual "S_" or "U_" or "F_" ...)
276
277 -gensym <g> Names "gensym-ed" by ml-nlffigen (for anonymous struct/union/
278 -g <g> enums) will get an additional suffix _<g>. (This should
279 be used if output from several independent runs of
280 ml-nlffigen are to coexist in the same ML program.)
281
282 -- Terminate processing of options, remaining arguments are
283 taken to be C sources.
284
285 ----------------------------------------------------------------------
286
287 Sample usage:
288
289 Suppose we have a C interface defined in foo.h.
290
291 1. Running ml-nlffigen:
292
293 It is best to let a tool such as Unix' "make" handle the invocation of
294 ml-nlffigen. The following "Makefile" can be used as a template for
295 other projects:
296
297 +----------------------------------------------------------
298 |FILES = foo.h
299 |H = FooH.libh
300 |D = FFI
301 |HF = ../foo-h.sml
302 |CF = foo.cm
303 |
304 |$(D)/$(CF): $(FILES)
305 | ml-nlffigen -include $(HF) -libhandle $(H) -dir $(D) -cmfile $(CF) $^
306 +----------------------------------------------------------
307
308 Suppose the above file is stored as "foo.make". Running
309
310 $ make -f foo.make
311
312 will generate a subdirectory "FFI" full of ML files corresponding to
313 the definitions in foo.h. Access to the generated ML code is gained
314 by referring to the CM library FFI/foo.cm; the .cm-file (foo.cm) is
315 also produced by ml-nlffigen.
316
317 2. The ML code uses the library handle specified in the command line
318 (here: FooH.libh) for dynamic linking. The type of FooH.libh must
319 be:
320
321 FooH.libh : string -> unit -> CMemory.addr
322
323 That is, FooH.libh takes the name of a symbol and produces that
324 symbol's suspended address.
325
326 The code that implements FooH.libh must be provided by the programmer.
327 In the above example, we assume that it is stored in file foo-h.sml.
328 The name of that file must appear in the generated .cm-file, hence the
329 "-include" command-line argument.
330
331 Notice that the name provided to ml-nlffigen must be relative to the
332 output directory. Therefore, in our case it is "../foo-h.sml" and not
333 just foo-h.sml (because the full path would be FFI/../foo-h.sml).
334
335 3. To actually implement FooH.libh, use the "DynLinkage" module.
336 Suppose the shared library's name is "/usr/lib/foo.so". Here is
337 the corresponding contents of foo-h.sml:
338
339 +-------------------------------------------------------------
340 |structure FooH = struct
341 | local
342 | val lh = DynLinkage.open_lib
343 | { name = "/usr/lib/foo.so", global = true, lazy = true }
344 | in
345 | fun libh s = let
346 | val sh = DynLinkage.lib_symbol (lh, s)
347 | in
348 | fn () => DynLinkage.addr sh
349 | end
350 | end
351 |end
352 +-------------------------------------------------------------
353
354 If all the symbols you are linking to are already available within
355 the ML runtime system, then you don't need to open a new shared
356 object. As a result, your FooH implementation would look like this:
357
358 +-------------------------------------------------------------
359 |structure FooH = struct
360 | fun libh s = let
361 | val sh = DynLinkage.lib_symbol (DynLinkage.main_lib, s)
362 | in
363 | fn () => DynLinkage.addr sh
364 | end
365 |end
366 +-------------------------------------------------------------
367
368 If the symbols you are accessing are strewn across several separate
369 shared objects, then there are two possible solutions:
370
371 a) Open several shared libraries and perform a trial-and-error search
372 for every symbol you are looking up. (The DynLinkage module raises
373 an exception (DynLinkError of string) if the lookup fails. This
374 could be used to daisy-chain lookup operations.)
375
376 [Be careful: Sometimes there are non-obvious inter-dependencies
377 between shared libraries. Consider using DynLinkage.open_lib'
378 to express those.]
379
380 b) A simpler and more robust way of accessing several shared libraries
381 is to create a new "summary" library object at the OS level.
382 Suppose you are trying to access /usr/lib/foo.so and /usr/lib/bar.so.
383 The solution is to make a "foobar.so" object by saying:
384
385 $ ld -shared -o foobar.so /usr/lib/foo.so /usr/lib/bar.so
386
387 The ML code then refers to foobar.so and the Linux dynamic loader
388 does the rest.
389
390 4. To put it all together, let's wrap it up in a .cm-file. For example,
391 if we simply want to directly make the ml-nlffigen-generated definitions
392 available to the "end user", we could write this wrapper .cm-file
393 (let's call it foo.cm):
394
395 +-------------------------------------------------------------
396 |library
397 | library(FFI/foo.cm)
398 |is
399 | $/basis.cm
400 | $/c.cm
401 | FFI/foo.cm : make (-f foo.make)
402 +-------------------------------------------------------------
403
404 Now, saying
405
406 $ sml -m foo.cm
407
408 is all one needs to do in order to compile. (CM will automatically
409 invoke "make", so you don't have to run "make" separately.)
410
411 If the goal is not to export the "raw" ml-nlffigen-generated stuff
412 but rather something more nicely "wrapped", consider writing wrapper
413 ML code. Suppose you have wrapper definitions for structure Foo_a
414 and structure Foo_b with code for those in wrapper-foo-a.sml and
415 wrapper-foo-b.sml. In this case the corresponding .cm-file would
416 look like the following:
417
418 +-------------------------------------------------------------
419 |library
420 | structure Foo_a
421 | structure Foo_b
422 |is
423 | $/basis.cm
424 | $/c.cm
425 | FFI/foo.cm : make (-f foo.make)
426 | wrapper-foo-a.sml
427 | wrapper-foo-b.sml
428 +-------------------------------------------------------------