Copyright (c) 2001, 2002, Lucent Technologies, Bell Laboratories

  author: Matthias Blume (blume@research.bell-labs.com)

This directory contains ML-NLFFI-Gen, a glue-code generator for
the new "NLFFI" foreign function interface.  The generator reads
C source code and emits ML code along with a description file for CM.

Compiling this generator requires the C-Kit ($/ckit-lib.cm) to be
installed.

---------------------------------------------------------------------

February 21, 2002:  Major changes:

I reworked the glue code generator in a way that lets generated code
scale better -- at the expense of some (mostly academic) generality.

Changes involve the following:

1. The functorization is gone.

2. Every top-level C declaration results in a separate top-level
   ML equivalent (implemented by its own ML source file).

3. Incomplete pointer types are treated just like their complete
   versions -- the only difference being that no RTTI will be
   available for them.  In the "light" interface, this rules out
   precisely those operations over them that C would disallow.

4. All related C sources must be supplied to ml-nlffigen together.
   Types incomplete in one source but complete in another get
   automatically completed in a cross-file fashion.

5. The handle for the shared library to link to is now abstracted as
   a function closure.  Moreover, it must be supplied as a top-level
   variable (by the programmer).  For this purpose, ml-nlffigen has
   corresponding command-line options.

These changes mean that even very large (in number of exported definitions)
libraries such as, e.g., GTK can now be handled gracefully without
reaching the limits of the ML compiler's abilities.

[The example of GTK -- for which ml-nlffigen creates several thousands (!)
of separate ML source files -- puts an unusal burden on CM, though.
However, aside from running a bit longer than usual, CM handles loads
of this magnitute just fine.  Stabilizing the resulting library solves
the problem entirely as far as later clients are concerned.]


Sketch of translation- (and naming-) scheme:

  struct foo { ... }
      -->   structure ST_foo   in    st-foo.sml  (not exported)
               basic type info (name, size)
       &    structure S_foo    in    s-foo.sml
               abstract interface to the type
                    field accessors f_xxx  (unless -light)
                                and f_xxx' (unless -heavy)
                    field types     t_f_xxx
                    field RTTI      typ_f_xxx

       & (unless "-nosucvt" was set)
            structures IS_foo  in    <a>/is-foo.sml
            (see discussion of struct *foo below)

  union foo { ... }
      -->   structure UT_foo   in    ut-foo.sml  (not exported)
               basic type info (name, size)
       &    structure U_foo    in    u-foo.sml
               abstract interface to the type
                    field accessors f_xxx  (unless -light)
                                and f_xxx' (unless -heavy)
                    field types     t_f_xxx
                    field RTTI      typ_f_xxx

       & (unless "-nosucvt" was set)
            structures IU_foo  in    <a>/iu-foo.sml
            (see discussion of union *foo below)

  struct { ... }
      like struct <n> { ... }, where <n> is a fresh integer or 'bar
      if 'struct { ... }' occurs in the context of a
      'typedef struct { ... } bar'

  union { ... }
      like union <n> { ... }, where <n> is a fresh integer or 'bar
      if 'union { ... }' occurs in the context of a
      'typedef union { ... } bar'


  enum foo { ... }
      -->   structure E_foo   in     e-foo.sml
               external type mlrep with
                   enum constants    e_xxx
               conversion functions between tag enum and mlrep
                                    between mlrep and sint
               access functions (get/set) that operate on mlrep
                (as an alternative to C.Get.enum/C.Set.enum which
                 operate on sint)

           If the command-line optino "-ec" ("-enum-constructors") was set
           and the values of all enum constants are different from each
           other, then mlrep will be a datatype (thus making it possible
           to pattern-match).

  enum { ... }
      If this construct appears in the context of a surrounding
      (non-anonymous) struct or union or typedef, the enumeration gets
      assigned an artificial tag (just like similar structs and unions,
      see above).

      Unless the command-line option "-nocollect" was specified, then
      all constants in other (truly) unnamed enumerations will be
      collected into a single enumeration represented by structure E_'.
      This single enumeration is then treated like a regular enumeration
      (including handling of "-ec" -- see above).

      The default behavior ("collect") is to assign a fresh integer
      tag (again, just like in the struct/union case).

  T foo (T, ..., T)  (global function/function prototype)
      -->   structure F_foo   in     f-foo.sml
               containing three/four members:
                    typ :  RTTI
                    fptr:  thunkified fptr representing the C function
            maybe   f'  :  light-weight function wrapper around fptr
                              Turned off by -heavy (see below).
            maybe   f   :  heavy-weight function wrapper around fptr
                              Turned off by -light (see below).

  T foo;  (global variable)
      -->   structure G_foo   in     g-foo.sml
               containing three members:
                    t   :  type
                    typ :  RTTI
                    obj :  thunkified object representing the C variable

  struct foo *  (without existing definition of struct foo; incomplete type)
      -->   an internal structure ST_foo with a type "tag" (just like in
            the struct foo { ... } case)
            The difference is that no structure S_foo will be generated,
            so there is no field-access interface and no RTTI (size or typ)
            for this.  All "light-weight" functions referring to this
            pointer type will be generated, heavy-weight functions will
            be generated only if they do not require access to RTTI.

            If "-heavy" was specified but a heavy interface function
            cannot be generated because of incomplete types, then its
            light counterpart will be issued generated anyway.

  union foo *   Same as with struct foo *, but replace S_foo with U_foo
            and ST_foo with UT_foo.

  Additional files for implementing function entry sequences are created
  and used internally.  They do not contribute exports, though.


Command-line options for ml-nlffigen:

  General syntax:   ml-nlffigen <option> ... [--] <C-file> ...

  Environment variables:

    Ml-nlffigen looks at the environment variable FFIGEN_CPP to obtain
    the template string for the cpp command line.  If FFIGEN_CPP is not
    set, the template defaults to "gcc -E -U__GNUC__ %o %s > %t".
    The actual command line is obtained by substituting occurences of
    %s with the name of the source, and %t with the name of a temporary
    file holding the pre-processed code.

  Options:

   -dir <dir>   output directory where all generated files are placed
   -d <dir>     default:  "NLFFI-Generated"

   -allSU       instructs ml-nlffigen to include all structs and unions,
                even those that are defined in included files (as opposed
                to files explicitly listed as arguments)
                default: off

   -width <w>   sets output line width (just a guess) to <w>
   -w <w>       default: 75

   -smloption <x>   instructs ml-nlffigen to include <x> into the list
                of options to annotate .sml entries in the generated .cm
                file with.  By default, the list consists just of "noguid".
   -guid        Removes the default "noguid" from the list of sml options.
                (This re-enables strict handling of type- and object-identity
                but can have negative impact on CM cutoff recompilation
                performance if the programmer routinely removes the entire
                tree of ml-nlffigen-generated files during development.)

(*
   -lambdasplit <x>   instructs ml-nlffigen to generate "lambdasplit"
   -ls <x>      options for all ML files (see CM manual for what this means;
                it does not currently work anyway because cross-module
                inlining is broken).
                default: nothing
*)

   -target <t>  Sets the target to <t> (which must be one of "sparc-unix",
   -t <t>       "x86-unix", or "x86-win32").
                default: current architecture

   -light       suppress "heavy" versions of function wrappers and
   -l           field accessors; also resets any earlier -heavy to default
                default: not suppressed

   -heavy       suppress "light" versions of function wrappers and
   -h           field accessors; also resets any earlier -light to default
                default: not suppressed

   -namedargs   instruct ml-nlffigen to generated function wrappers that
   -na          use named arguments (ML records) instead of tuples if
                there is enough information for this in the C source;
                (this is not always very useful)
                default: off

   -nocollect   Do not do the following:
                Collect enum constants from truly unnamed enumerations
                (those without tags that occur at toplevel or in an
                unnamed context, i.e., not in a typedef or another
                named struct or union) into a single artificial
                enumeration tagged by ' (single apostrohe).  The corresponding
                ML-side representative will be a structure named E_'.

   -enum-constructors
   -ec          When possible (i.e., if all values of a given enumeration
                are different from each other), make the ML representation
                type of the enumeration a datatype.  The default (and
                fallback) is to make that type the same as MLRep.Signed.int.

   -libhandle <h>   Use the variable <h> to refer to the handle to the
   -lh <h>      shared library object.  Given the constraints of CM, <h>
                must have the form of a long ML identifier, e.g.,
                MyLibrary.libhandle.
                default: Library.libh

   -include <f> Mention file <f> in the generated .cm file.  This option
   -add <f>     is necessary at least once for providing the library handle.
                It can be used arbitrarily many times, resulting in more
                than one such programmer-supplied file to be mentioned.
                If <f> is relative, then it must be relative to the directory
                specified in the -dir <dir> option.

   -cmfile <f>  Specify name of the generated .cm file, relative to
   -cm <f>      the directory specified by the -dir <dir> option.
                default: nlffi-generated.cm

   -cppopt <o>  The string <o> gets added to the list of options to be
                passed to cpp (the C preprocessor).  The list of options
                gets substituted for %o in the cpp command line template.

   -U<x>        The string -U<x> gets added to the list of cpp options.

   -D<x>        The string -D<x> gets added to the list of cpp options.

   -I<x>        The string -I<x> gets added to the list of cpp options.

   -version     Just write the version number of ml-nlffigen to standard
                output and then quit.

   -match <r>   Normally ml-nlffigen will include ML definitions for a C
   -m <r>       declaration if the C declaration textually appears in
                one of the files specified at the command line.  Definitions
                in #include-d files will normally not appear (unless
                their absence would lead to inconsistencies).
                By specifying -match <r>, ml-nlffigen will also include
                definitions that occur in recursively #include-d files
                for which the AWK-style regular expression <r> matches
                their names.

   -prefix <p>  Generated ML structure names will all have prefix <p>
   -p <p>       (in addition to the usual "S_" or "U_" or "F_" ...)

   -gensym <g>  Names "gensym-ed" by ml-nlffigen (for anonymous struct/union/
   -g <g>       enums) will get an additional suffix _<g>.  (This should
                be used if output from several indepdendent runs of 
                ml-nlffigen are to coexist in the same ML program.)

   --           Terminate processing of options, remaining arguments are
                taken to be C sources.

----------------------------------------------------------------------

Sample usage:

Suppose we have a C interface defined in foo.h.

1. Running ml-nlffigen:

   It is best to let a tool such as Unix' "make" handle the invocation of
   ml-nlffigen.  The following "Makefile" can be used as a template for
   other projects:

  +----------------------------------------------------------
  |FILES = foo.h
  |H = FooH.libh
  |D = FFI
  |HF = ../foo-h.sml
  |CF = foo.cm
  |
  |$(D)/$(CF): $(FILES)
  |     ml-nlffigen -include $(HF) -libhandle $(H) -dir $(D) -cmfile $(CF) $^
  +----------------------------------------------------------

   Suppose the above file is stored as "foo.make".  Running

     $ make -f foo.make

   will generate a subdirectory "FFI" full of ML files corresponding to
   the definitions in foo.h.  Access to the generated ML code is gained
   by refering to the CM library FFI/foo.cm; the .cm-file (foo.cm) is
   also produced by ml-nlffigen.

2. The ML code uses the library handle specified in the command line
   (here: FooH.libh) for dynamic linking.  The type of FooH.libh must
   be:

        FooH.libh : string -> unit -> CMemory.addr

   That is, FooH.libh takes the name of a symbol and produces that
   symbol's suspended address.

   The code that implements FooH.libh must be provided by the programmer.
   In the above example, we assume that it is stored in file foo-h.sml.
   The name of that file must appear in the generated .cm-file, hence the
   "-include" command-line argument.

   Notice that the name provided to ml-nlffigen must be relative to the
   output directory.  Therefore, in our case it is "../foo-h.sml" and not
   just foo-h.sml (because the full path would be FFI/../foo-h.sml).

3. To actually implement FooH.libh, use the "DynLinkage" module.
   Suppose the shared library's name is "/usr/lib/foo.so".  Here is
   the corresponding contents of foo-h.sml:

  +-------------------------------------------------------------
  |structure FooH = struct
  |    local 
  |        val lh = DynLinkage.open_lib
  |             { name = "/usr/lib/foo.so", global = true, lazy = true }
  |    in
  |        fun libh s = let
  |            val sh = DynLinkage.lib_symbol (lh, s)
  |        in
  |            fn () => DynLinkage.addr sh
  |        end
  |    end
  |end
  +-------------------------------------------------------------

   If all the symbols you are linking to are already available within
   the ML runtime system, then you don't need to open a new shared
   object.  As a result, your FooH implementation would look like this:

  +-------------------------------------------------------------
  |structure FooH = struct
  |    fun libh s = let
  |        val sh = DynLinkage.lib_symbol (DynLinkage.main_lib, s)
  |    in
  |        fn () => DynLinkage.addr sh
  |    end
  |end
  +-------------------------------------------------------------

   If the symbols your are accessing are strewn across several separate
   shared objects, then there are two possible solutions:

   a)  Open several shared libraries and perform a trial-and-error search
       for every symbol you are looking up.  (The DynLinkage module raises
       an exception (DynLinkError of string) if the lookup fails.  This
       could be used to daisy-chain lookup operations.)

       [Be careful:  Sometimes there are non-obvious inter-dependencies
       between shared libraries.  Consider using DynLinkage.open_lib'
       to express those.]

   b)  A simpler and more robust way of accessing several shared libraries
       is to create a new "summary" library object at the OS level.
       Supposed you are trying to access /usr/lib/foo.so and /usr/lib/bar.so.
       The solution is to make a "foobar.so" object by saying:

        $ ld -shared -o foobar.so /usr/lib/foo.so /usr/lib/bar.so

       The ML code then referes to foobar.so and the Linux dynamic loader
       does the rest.

4. To put it all together, let's wrap it up in a .cm-file.  For example,
   if we simply want to directly make the ml-nlffigen-generated definitions
   available to the "end user", we could write this wrapper .cm-file
   (let's call it foo.cm):

  +-------------------------------------------------------------
  |library
  |     library(FFI/foo.cm)
  |is
  |     $/basis.cm
  |     $/c.cm
  |     FFI/foo.cm : make (-f foo.make)
  +-------------------------------------------------------------

   Now, saying

     $ sml -m foo.cm

   is all one need's to do in order to compile.  (CM will automatically
   invoke "make", so you don't have to run "make" separately.)

   If the goal is not to export the "raw" ml-nlffigen-generated stuff
   but rather something more nicely "wrapped", consider writing wrapper
   ML code.  Suppose you have wrapper definitions for structure Foo_a
   and structure Foo_b with code for those in wrap-foo-a.sml and
   wrap-foo-b.sml.  In this case the corresponding .cm-file would
   look like the following:

  +-------------------------------------------------------------
  |library
  |     structure Foo_a
  |     structure Foo_b
  |is
  |     $/basis.cm
  |     $/c.cm
  |     FFI/foo.cm : make (-f foo.make)
  |     wrapper-foo-a.sml
  |     wrapper-foo-b.sml
  +-------------------------------------------------------------