2 @c This is part of the GNU Guile Reference Manual.
3 @c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2010
4 @c Free Software Foundation, Inc.
5 @c See the file guile.texi for copying conditions.
7 @node Data Representation
8 @section Data Representation
Scheme is a latently-typed language; this means that the system cannot,
in general, determine the type of a given expression at compile time.
Types only become apparent at run time. Variables do not have fixed
types; a variable may hold a pair at one point, an integer at the next,
and a thousand-element vector later. Instead, values, not variables,
have fixed types.
17 In order to implement standard Scheme functions like @code{pair?} and
18 @code{string?} and provide garbage collection, the representation of
19 every value must contain enough information to accurately determine its
20 type at run time. Often, Scheme systems also use this information to
21 determine whether a program has attempted to apply an operation to an
22 inappropriately typed value (such as taking the @code{car} of a string).
24 Because variables, pairs, and vectors may hold values of any type,
25 Scheme implementations use a uniform representation for values --- a
26 single type large enough to hold either a complete value or a pointer
27 to a complete value, along with the necessary typing information.
29 The following sections will present a simple typing system, and then
30 make some refinements to correct its major weaknesses. We then conclude
31 with a discussion of specific choices that Guile has made regarding
32 garbage collection and data representation.
35 * A Simple Representation::
39 * The SCM Type in Guile::
42 @node A Simple Representation
43 @subsection A Simple Representation
45 The simplest way to represent Scheme values in C would be to represent
46 each value as a pointer to a structure containing a type indicator,
47 followed by a union carrying the real value. Assuming that @code{SCM} is
48 the name of our universal type, we can write:
enum type @{ integer, pair, string, vector, ... @};

typedef struct value *SCM;

struct value @{
  enum type type;
  union @{
    int integer;
    struct @{ SCM car, cdr; @} pair;
    struct @{ int length; char *elts; @} string;
    struct @{ int length; SCM *elts; @} vector;
    ...
  @} value;
@};
66 with the ellipses replaced with code for the remaining Scheme types.
68 This representation is sufficient to implement all of Scheme's
69 semantics. If @var{x} is an @code{SCM} value:
72 To test if @var{x} is an integer, we can write @code{@var{x}->type == integer}.
74 To find its value, we can write @code{@var{x}->value.integer}.
76 To test if @var{x} is a vector, we can write @code{@var{x}->type == vector}.
78 If we know @var{x} is a vector, we can write
79 @code{@var{x}->value.vector.elts[0]} to refer to its first element.
81 If we know @var{x} is a pair, we can write
82 @code{@var{x}->value.pair.car} to extract its car.
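The following is a minimal, self-contained sketch of this first representation. The constructors @code{make_integer} and @code{make_pair} and the helper @code{add_integers} are hypothetical names introduced only for illustration, and the ellipses are simply dropped so that the code compiles. Note that every integer result costs one call to @code{malloc}.

```c
#include <assert.h>
#include <stdlib.h>

enum type { integer, pair, string, vector };

typedef struct value *SCM;

struct value {
  enum type type;
  union {
    int integer;
    struct { SCM car, cdr; } pair;
    struct { int length; char *elts; } string;
    struct { int length; SCM *elts; } vector;
  } value;
};

/* Hypothetical constructor: each integer needs its own heap object.  */
SCM make_integer (int n)
{
  SCM x = malloc (sizeof *x);
  assert (x != NULL);
  x->type = integer;
  x->value.integer = n;
  return x;
}

/* Hypothetical constructor for a pair.  */
SCM make_pair (SCM car, SCM cdr)
{
  SCM x = malloc (sizeof *x);
  assert (x != NULL);
  x->type = pair;
  x->value.pair.car = car;
  x->value.pair.cdr = cdr;
  return x;
}

/* Add two values that are known to be integers.  Reading each
   operand is a memory reference, and the result is allocated.  */
SCM add_integers (SCM a, SCM b)
{
  assert (a->type == integer && b->type == integer);
  return make_integer (a->value.integer + b->value.integer);
}
```

Nothing in this sketch ever frees a value; reclaiming storage is the job of the garbage collector.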
87 @subsection Faster Integers
89 Unfortunately, the above representation has a serious disadvantage. In
90 order to return an integer, an expression must allocate a @code{struct
91 value}, initialize it to represent that integer, and return a pointer to
92 it. Furthermore, fetching an integer's value requires a memory
93 reference, which is much slower than a register reference on most
94 processors. Since integers are extremely common, this representation is
95 too costly, in both time and space. Integers should be very cheap to
96 create and manipulate.
98 One possible solution comes from the observation that, on many
99 architectures, heap-allocated data (i.e., what you get when you call
100 @code{malloc}) must be aligned on an eight-byte boundary. (Whether or
101 not the machine actually requires it, we can write our own allocator for
102 @code{struct value} objects that assures this is true.) In this case,
103 the lower three bits of the structure's address are known to be zero.
105 This gives us the room we need to provide an improved representation
106 for integers. We make the following rules:
If the lower three bits of an @code{SCM} value are zero, then the
@code{SCM} value is a pointer to a @code{struct value}, and everything
proceeds as before.
113 Otherwise, the @code{SCM} value represents an integer, whose value
114 appears in its upper bits.
117 Here is C code implementing this convention:
enum type @{ pair, string, vector, ... @};

typedef struct value *SCM;

struct value @{
  enum type type;
  union @{
    struct @{ SCM car, cdr; @} pair;
    struct @{ int length; char *elts; @} string;
    struct @{ int length; SCM *elts; @} vector;
    ...
  @} value;
@};

#define POINTER_P(x) (((int) (x) & 7) == 0)
#define INTEGER_P(x) (! POINTER_P (x))

#define GET_INTEGER(x)  ((int) (x) >> 3)
#define MAKE_INTEGER(x) ((SCM) (((x) << 3) | 1))
140 Notice that @code{integer} no longer appears as an element of @code{enum
141 type}, and the union has lost its @code{integer} member. Instead, we
142 use the @code{POINTER_P} and @code{INTEGER_P} macros to make a coarse
143 classification of values into integers and non-integers, and do further
144 type testing as before.
146 Here's how we would answer the questions posed above (again, assume
147 @var{x} is an @code{SCM} value):
150 To test if @var{x} is an integer, we can write @code{INTEGER_P (@var{x})}.
152 To find its value, we can write @code{GET_INTEGER (@var{x})}.
154 To test if @var{x} is a vector, we can write:
156 @code{POINTER_P (@var{x}) && @var{x}->type == vector}
158 Given the new representation, we must make sure @var{x} is truly a
159 pointer before we dereference it to determine its complete type.
If we know @var{x} is a vector, we can write
@code{@var{x}->value.vector.elts[0]} to refer to its first element, as
before.
165 If we know @var{x} is a pair, we can write
166 @code{@var{x}->value.pair.car} to extract its car, just as before.
169 This representation allows us to operate more efficiently on integers
170 than the first. For example, if @var{x} and @var{y} are known to be
171 integers, we can compute their sum as follows:
MAKE_INTEGER (GET_INTEGER (@var{x}) + GET_INTEGER (@var{y}))
175 Now, integer math requires no allocation or memory references. Most real
176 Scheme systems actually implement addition and other operations using an
177 even more efficient algorithm, but this essay isn't about
178 bit-twiddling. (Hint: how do you decide when to overflow to a bignum?
179 How would you do it in assembly?)
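As a hedged variation on the macros above, the sketch below casts through @code{intptr_t} rather than @code{int}, so that the cast is wide enough to hold a pointer, and it multiplies and divides by eight instead of shifting, so that negative integers do not depend on how the compiler shifts signed values. These are choices made for this example only; the function @code{add_integers} is a hypothetical name, and overflow is not handled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct value *SCM;

/* A zero low three bits means "pointer"; anything else is an integer.  */
#define POINTER_P(x) (((intptr_t) (x) & 7) == 0)
#define INTEGER_P(x) (! POINTER_P (x))

/* Store the integer in the upper bits and set the lowest bit as a tag.  */
#define MAKE_INTEGER(n) ((SCM) ((intptr_t) (n) * 8 + 1))

/* Remove the tag, then scale back down.  */
#define GET_INTEGER(x)  (((intptr_t) (x) - 1) / 8)

/* Add two tagged integers without allocating or touching memory.
   Overflow into a bignum is deliberately left out of this sketch.  */
SCM add_integers (SCM x, SCM y)
{
  assert (INTEGER_P (x) && INTEGER_P (y));
  return MAKE_INTEGER (GET_INTEGER (x) + GET_INTEGER (y));
}
```

Because three bits are given up to the tag, the range of integers that fit is eight times smaller than the range of @code{intptr_t}.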
183 @subsection Cheaper Pairs
However, there is yet another issue to confront. Most Scheme heaps
contain more pairs than any other type of object; Jonathan Rees said at
one point that pairs occupy 45% of the heap in his Scheme
implementation, Scheme 48. However, our representation above spends
three @code{SCM}-sized words per pair --- one for the type, and two for
the @sc{car} and @sc{cdr}. Is there any way to represent pairs using
only two words?
Let us refine the convention we established earlier. Let us assert
that:
197 If the bottom three bits of an @code{SCM} value are @code{#b000}, then
198 it is a pointer, as before.
200 If the bottom three bits are @code{#b001}, then the upper bits are an
201 integer. This is a bit more restrictive than before.
If the bottom three bits are @code{#b010}, then the value, with the bottom
three bits masked out, is the address of a pair.
207 Here is the new C code:
enum type @{ string, vector, ... @};

typedef struct value *SCM;

struct value @{
  enum type type;
  union @{
    struct @{ int length; char *elts; @} string;
    struct @{ int length; SCM *elts; @} vector;
    ...
  @} value;
@};

struct pair @{
  SCM car, cdr;
@};

#define POINTER_P(x) (((int) (x) & 7) == 0)

#define INTEGER_P(x)  (((int) (x) & 7) == 1)
#define GET_INTEGER(x)  ((int) (x) >> 3)
#define MAKE_INTEGER(x) ((SCM) (((x) << 3) | 1))

#define PAIR_P(x) (((int) (x) & 7) == 2)
#define GET_PAIR(x) ((struct pair *) ((int) (x) & ~7))
236 Notice that @code{enum type} and @code{struct value} now only contain
237 provisions for vectors and strings; both integers and pairs have become
238 special cases. The code above also assumes that an @code{int} is large
239 enough to hold a pointer, which isn't generally true.
242 Our list of examples is now as follows:
245 To test if @var{x} is an integer, we can write @code{INTEGER_P
246 (@var{x})}; this is as before.
To find its value, we can write @code{GET_INTEGER (@var{x})}, as
before.
251 To test if @var{x} is a vector, we can write:
253 @code{POINTER_P (@var{x}) && @var{x}->type == vector}
255 We must still make sure that @var{x} is a pointer to a @code{struct
256 value} before dereferencing it to find its type.
If we know @var{x} is a vector, we can write
@code{@var{x}->value.vector.elts[0]} to refer to its first element, as
before.
We can write @code{PAIR_P (@var{x})} to determine if @var{x} is a
pair, and then write @code{GET_PAIR (@var{x})->car} to refer to its
car.
This change in representation reduces our heap size by 15%: pairs made
up 45% of the heap, and each pair now takes two words instead of three.
It also makes it cheaper to decide if a value is a pair, because no
memory references are necessary; it suffices to check the bottom three
bits of the @code{SCM} value. This may be significant when traversing
lists, a common activity in a Scheme system.
Again, most real Scheme systems use a slightly different implementation;
for example, if @code{GET_PAIR} subtracts off the low bits of @code{x},
instead of masking them off, the optimizer will often be able to combine
that subtraction with the addition of the offset of the structure member
we are referencing, making a modified pointer as fast to use as an
ordinary pointer.
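The sketch below illustrates the pair tag and both ways of recovering the address, by masking and by subtracting. It is an illustration under stated assumptions rather than a real implementation: it casts through @code{uintptr_t} instead of @code{int}, the names @code{tag_pair}, @code{get_pair_mask}, @code{get_pair_sub} and @code{cons} are hypothetical, and it relies on @code{malloc} returning storage aligned to at least eight bytes, which it checks with an assertion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct value *SCM;

struct pair { SCM car, cdr; };

#define PAIR_TAG 2
#define PAIR_P(x) (((uintptr_t) (x) & 7) == PAIR_TAG)

/* Tag the address of a pair whose low three bits are zero.  */
SCM tag_pair (struct pair *p)
{
  assert (((uintptr_t) p & 7) == 0);
  return (SCM) ((uintptr_t) p | PAIR_TAG);
}

/* Recover the address by masking off the low three bits.  */
struct pair *get_pair_mask (SCM x)
{
  return (struct pair *) ((uintptr_t) x & ~(uintptr_t) 7);
}

/* Recover the address by subtracting the known tag.  A compiler can
   often fold this constant into the offset of the member accessed.  */
struct pair *get_pair_sub (SCM x)
{
  return (struct pair *) ((uintptr_t) x - PAIR_TAG);
}

/* Hypothetical allocator: two words per pair, no type word.  */
SCM cons (SCM car, SCM cdr)
{
  struct pair *p = malloc (sizeof *p);
  assert (p != NULL);
  p->car = car;
  p->cdr = cdr;
  return tag_pair (p);
}
```

Both accessors return the same address for any value that satisfies @code{PAIR_P}; they differ only in the arithmetic the compiler sees.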
281 @node Conservative GC
282 @subsection Conservative Garbage Collection
284 Aside from the latent typing, the major source of constraints on a
285 Scheme implementation's data representation is the garbage collector.
286 The collector must be able to traverse every live object in the heap, to
287 determine which objects are not live, and thus collectable.
289 There are many ways to implement this. Guile's garbage collection is
290 built on a library, the Boehm-Demers-Weiser conservative garbage
291 collector (BDW-GC). The BDW-GC ``just works'', for the most part. But
292 since it is interesting to know how these things work, we include here a
293 high-level description of what the BDW-GC does.
295 Garbage collection has two logical phases: a @dfn{mark} phase, in which
296 the set of live objects is enumerated, and a @dfn{sweep} phase, in which
297 objects not traversed in the mark phase are collected. Correct
298 functioning of the collector depends on being able to traverse the
299 entire set of live objects.
In the mark phase, the collector scans the system's global variables and
the local variables on the stack to determine which objects are
immediately accessible by the C code. It then scans those objects to
find the objects they point to, and so on. The collector logically sets
a @dfn{mark bit} on each object it finds, so each object is traversed
only once.
308 When the collector can find no unmarked objects pointed to by marked
309 objects, it assumes that any objects that are still unmarked will never
310 be used by the program (since there is no path of dereferences from any
311 global or local variable that reaches them) and deallocates them.
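The two phases can be sketched on a toy heap. Everything here is hypothetical and much simpler than what the BDW-GC actually does: the heap is a fixed array, each object has at most two outgoing references, and the roots are handed to @code{collect} explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_SIZE 8

struct object {
  int allocated;            /* nonzero while the slot is in use */
  int marked;               /* the mark bit */
  struct object *child[2];  /* outgoing references, or NULL */
};

struct object heap[HEAP_SIZE];

/* Mark phase: visit everything reachable from OBJ.  The mark bit
   stops the traversal from visiting an object twice, so cycles
   terminate.  */
void mark (struct object *obj)
{
  if (obj == NULL || obj->marked)
    return;
  obj->marked = 1;
  mark (obj->child[0]);
  mark (obj->child[1]);
}

/* Sweep phase: release every allocated object that was not marked,
   and clear the marks for the next collection.  Returns the number
   of objects released.  */
int sweep (void)
{
  int freed = 0;
  for (int i = 0; i < HEAP_SIZE; i++)
    {
      if (heap[i].allocated && !heap[i].marked)
        {
          heap[i].allocated = 0;
          freed++;
        }
      heap[i].marked = 0;
    }
  return freed;
}

/* Run one collection from an explicit list of roots.  */
int collect (struct object **roots, int n_roots)
{
  for (int i = 0; i < n_roots; i++)
    mark (roots[i]);
  return sweep ();
}
```

An object that is referenced only by other unreachable objects is still released, because the mark phase never reaches it.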
313 In the above paragraphs, we did not specify how the garbage collector
314 finds the global and local variables; as usual, there are many different
315 approaches. Frequently, the programmer must maintain a list of pointers
316 to all global variables that refer to the heap, and another list
317 (adjusted upon entry to and exit from each function) of local variables,
318 for the collector's benefit.
The list of global variables is usually not too difficult to maintain,
since global variables are relatively rare. However, an explicitly
maintained list of local variables (in the author's personal experience)
is a nightmare to maintain. Thus, the BDW-GC uses a technique called
@dfn{conservative garbage collection}, to make the local variable list
unnecessary.
327 The trick to conservative collection is to treat the stack as an
328 ordinary range of memory, and assume that @emph{every} word on the stack
329 is a pointer into the heap. Thus, the collector marks all objects whose
330 addresses appear anywhere in the stack, without knowing for sure how
331 that word is meant to be interpreted.
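A toy version of this scan might look as follows. It is only a sketch of the idea, not a description of the BDW-GC's code: the heap is a single hypothetical array, the ``stack'' is whatever range of words the caller passes in, and the address comparison assumes a flat address space in which pointers converted to @code{uintptr_t} keep their order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 8

struct object { int marked; long payload; };

struct object heap[HEAP_SIZE];

/* Return the heap object that WORD points into, or NULL if WORD
   does not fall inside the heap at all.  */
struct object *find_object (uintptr_t word)
{
  uintptr_t lo = (uintptr_t) &heap[0];
  uintptr_t hi = (uintptr_t) &heap[HEAP_SIZE];
  if (word < lo || word >= hi)
    return NULL;
  return &heap[(word - lo) / sizeof (struct object)];
}

/* Scan a range of raw words and mark every heap object that any
   word might refer to, without knowing what the word really means.
   Returns the number of objects newly marked.  */
int mark_conservatively (const uintptr_t *words, size_t n_words)
{
  int marked = 0;
  for (size_t i = 0; i < n_words; i++)
    {
      struct object *obj = find_object (words[i]);
      if (obj != NULL && !obj->marked)
        {
          obj->marked = 1;
          marked++;
        }
    }
  return marked;
}
```

An integer that happens to equal a heap address would keep the corresponding object alive; that is the price of not knowing how each word is meant to be interpreted.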
333 In addition to the stack, the BDW-GC will also scan static data
334 sections. This means that global variables are also scanned when looking
335 for live Scheme objects.
Obviously, such a system will occasionally retain objects that are
actually garbage, and should be freed. In practice, this is not a
problem. The alternative, an explicitly maintained list of local
variable addresses, is effectively much less reliable, due to programmer
error. Interested readers should see the BDW-GC web page at
@uref{http://www.hpl.hp.com/personal/Hans_Boehm/gc}, for more
information.
346 @node The SCM Type in Guile
347 @subsection The SCM Type in Guile
349 Guile classifies Scheme objects into two kinds: those that fit entirely
350 within an @code{SCM}, and those that require heap storage.
352 The former class are called @dfn{immediates}. The class of immediates
353 includes small integers, characters, boolean values, the empty list, the
354 mysterious end-of-file object, and some others.
The remaining types are called, not surprisingly, @dfn{non-immediates}.
They include pairs, procedures, strings, vectors, and all other data
types in Guile. For non-immediates, the @code{SCM} word contains a
pointer to data on the heap, and further information about the object
in question is stored in that data.
This section describes how the @code{SCM} type is actually represented
and used at the C level. Interested readers should see
@code{libguile/tags.h} for an exposition of how Guile stores type
information.
367 In fact, there are two basic C data types to represent objects in
368 Guile: @code{SCM} and @code{scm_t_bits}.
371 * Relationship between SCM and scm_t_bits::
372 * Immediate objects::
373 * Non-immediate objects::
375 * Heap Cell Type Information::
376 * Accessing Cell Entries::
380 @node Relationship between SCM and scm_t_bits
381 @subsubsection Relationship between @code{SCM} and @code{scm_t_bits}
A variable of type @code{SCM} is guaranteed to hold a valid Scheme
object. A variable of type @code{scm_t_bits}, on the other hand, may
hold a representation of a @code{SCM} value as a C integral type, but
may also hold any C value, even if it does not correspond to a valid
Scheme object.
For a variable @var{x} of type @code{SCM}, the Scheme object's type
information is stored in a form that is not directly usable. To be able
to work on the type encoding of the Scheme value, the @code{SCM}
variable has to be transformed into the corresponding representation as
a @code{scm_t_bits} variable @var{y} by using the @code{SCM_UNPACK}
macro. Once this has been done, the type of the Scheme object @var{x}
can be derived from the content of the bits of the @code{scm_t_bits}
value @var{y}, in the way illustrated by the example earlier in this
chapter (@pxref{Cheaper Pairs}). Conversely, a valid bit encoding of a
Scheme value as a @code{scm_t_bits} variable can be transformed into the
corresponding @code{SCM} value using the @code{SCM_PACK} macro.
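To make the division of labour concrete, here is a small model of the same idea. None of these names belong to Guile: @code{my_scm}, @code{my_bits}, @code{my_pack} and @code{my_unpack} are hypothetical stand-ins, and the tag layout is the one from the earlier example, not Guile's real encoding. Wrapping the object type in a structure is one way, assumed here for illustration, to make the compiler reject arithmetic on it until it has been unpacked.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not Guile's definitions.  */
typedef uintptr_t my_bits;              /* raw bits, open to arithmetic */
typedef struct { my_bits n; } my_scm;   /* object type, closed to it */

/* Convert between the two views of the same word.  */
my_bits my_unpack (my_scm x) { return x.n; }
my_scm my_pack (my_bits b) { my_scm x = { b }; return x; }

/* Encode a non-negative integer with the tag #b001.  */
my_scm my_make_integer (unsigned long n)
{
  return my_pack (((my_bits) n << 3) | 1);
}

/* Type tests look at the unpacked bits, never at the object itself.  */
int my_integer_p (my_scm x)
{
  return (my_unpack (x) & 7) == 1;
}

/* Decode an integer that was built by my_make_integer.  */
unsigned long my_integer_value (my_scm x)
{
  assert (my_integer_p (x));
  return (unsigned long) (my_unpack (x) >> 3);
}
```

With this arrangement an expression such as @code{x & 7}, where @code{x} has type @code{my_scm}, does not compile; the programmer has to write the unpacking step explicitly.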
401 @node Immediate objects
402 @subsubsection Immediate objects
404 A Scheme object may either be an immediate, i.e.@: carrying all necessary
405 information by itself, or it may contain a reference to a @dfn{cell}
406 with additional information on the heap. Although in general it should
407 be irrelevant for user code whether an object is an immediate or not,
408 within Guile's own code the distinction is sometimes of importance.
409 Thus, the following low level macro is provided:
411 @deftypefn Macro int SCM_IMP (SCM @var{x})
A Scheme object is an immediate if it fulfills the @code{SCM_IMP}
predicate, otherwise it holds an encoded reference to a heap cell. The
result of the predicate is delivered as a C style boolean value. User
code and code that extends Guile should normally not be required to use
this macro.
423 Given a Scheme object @var{x} of unknown type, check first
424 with @code{SCM_IMP (@var{x})} if it is an immediate object.
If so, all of the type and value information can be determined from the
@code{scm_t_bits} value that is delivered by @code{SCM_UNPACK
(@var{x})}.
431 There are a number of special values in Scheme, most of them documented
432 elsewhere in this manual. It's not quite the right place to put them,
433 but for now, here's a list of the C names given to some of these values:
435 @deftypefn Macro SCM SCM_EOL
436 The Scheme empty list object, or ``End Of List'' object, usually written
437 in Scheme as @code{'()}.
440 @deftypefn Macro SCM SCM_EOF_VAL
441 The Scheme end-of-file value. It has no standard written
442 representation, for obvious reasons.
445 @deftypefn Macro SCM SCM_UNSPECIFIED
446 The value returned by some (but not all) expressions that the Scheme
447 standard says return an ``unspecified'' value.
449 This is sort of a weirdly literal way to take things, but the standard
450 read-eval-print loop prints nothing when the expression returns this
451 value, so it's not a bad idea to return this when you can't think of
452 anything else helpful.
455 @deftypefn Macro SCM SCM_UNDEFINED
The ``undefined'' value. Its most important property is that it is not
equal to any valid Scheme value. This is put to various internal uses
by C code interacting with Guile.
460 For example, when you write a C function that is callable from Scheme
461 and which takes optional arguments, the interpreter passes
462 @code{SCM_UNDEFINED} for any arguments you did not receive.
464 We also use this to mark unbound variables.
467 @deftypefn Macro int SCM_UNBNDP (SCM @var{x})
468 Return true if @var{x} is @code{SCM_UNDEFINED}. Note that this is not a
469 check to see if @var{x} is @code{SCM_UNBOUND}. History will not be kind
474 @node Non-immediate objects
475 @subsubsection Non-immediate objects
477 A Scheme object of type @code{SCM} that does not fulfill the
478 @code{SCM_IMP} predicate holds an encoded reference to a heap cell.
479 This reference can be decoded to a C pointer to a heap cell using the
480 @code{SCM2PTR} macro. The encoding of a pointer to a heap cell into a
481 @code{SCM} value is done using the @code{PTR2SCM} macro.
483 @c (FIXME:: this name should be changed)
484 @deftypefn Macro {scm_t_cell *} SCM2PTR (SCM @var{x})
Extract and return the heap cell pointer from a non-immediate @code{SCM}
object @var{x}.
489 @c (FIXME:: this name should be changed)
490 @deftypefn Macro SCM PTR2SCM (scm_t_cell * @var{x})
Return a @code{SCM} value that encodes a reference to the heap cell
pointer @var{x}.
Note that it is also possible to transform a non-immediate @code{SCM}
value by using @code{SCM_UNPACK} into a @code{scm_t_bits} variable.
However, the result of @code{SCM_UNPACK} may not be used as a pointer to
a @code{scm_t_cell}: only @code{SCM2PTR} is guaranteed to transform a
@code{SCM} object into a valid pointer to a heap cell. Also, it is not
allowed to apply @code{PTR2SCM} to anything that is not a valid pointer
to a heap cell.
Only use @code{SCM2PTR} on @code{SCM} values for which @code{SCM_IMP} is
false!
Don't use @code{(scm_t_cell *) SCM_UNPACK (@var{x})}! Use @code{SCM2PTR
(@var{x})} instead!
513 Don't use @code{PTR2SCM} for anything but a cell pointer!
516 @node Allocating Cells
517 @subsubsection Allocating Cells
Guile provides both ordinary cells with two slots, and double cells
with four slots. The following two functions are the most primitive
way to allocate such cells.
If the caller intends to use it as a header for some other type, she
must pass an appropriate magic value in @var{word_0}, to mark it as a
member of that type, and pass whatever values the type expects in
@var{word_1} and any further slots. You should generally not need these
functions, unless you are implementing a new datatype, and thoroughly
understand the code in @code{<libguile/tags.h>}.
530 If you just want to allocate pairs, use @code{scm_cons}.
532 @deftypefn Function SCM scm_cell (scm_t_bits word_0, scm_t_bits word_1)
533 Allocate a new cell, initialize the two slots with @var{word_0} and
534 @var{word_1}, and return it.
Note that @var{word_0} and @var{word_1} are of type @code{scm_t_bits}.
If you want to pass a @code{SCM} object, you need to use
@code{SCM_UNPACK}.
541 @deftypefn Function SCM scm_double_cell (scm_t_bits word_0, scm_t_bits word_1, scm_t_bits word_2, scm_t_bits word_3)
Like @code{scm_cell}, but allocates a double cell with four
slots.
546 @node Heap Cell Type Information
547 @subsubsection Heap Cell Type Information
Heap cells contain a number of entries, each of which is either a Scheme
object of type @code{SCM} or a raw C value of type @code{scm_t_bits}.
Which of the cell entries contain Scheme objects and which contain raw C
values is determined by the first entry of the cell, which holds the
cell type information.
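As a rough model of this arrangement, consider the sketch below. The structure, macros and type codes are all hypothetical, chosen only to show how the first word can decide the meaning of the others; they are unrelated to the real definitions and tag values in @code{libguile/tags.h}.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t my_bits;

/* Hypothetical two-slot cell; slot 0 holds the type word.  */
struct my_cell { my_bits word[2]; };

/* Hypothetical type codes, invented for this example.  */
enum { MY_TYPE_BOX = 0x7, MY_TYPE_COUNTER = 0xf };

#define MY_CELL_TYPE(c)           ((c)->word[0])
#define MY_CELL_WORD(c, n)        ((c)->word[n])
#define MY_SET_CELL_WORD(c, n, w) ((c)->word[n] = (my_bits) (w))

/* In this model, slot 1 of a counter holds a raw C integer, while
   slot 1 of a box holds a reference to another object.  Only the
   type word says which reading is correct.  */
int my_slot1_is_raw (const struct my_cell *c)
{
  return MY_CELL_TYPE (c) == MY_TYPE_COUNTER;
}
```

Code that traverses cells, such as a collector, would consult the type word first and only then decide whether a slot should be followed as a reference.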
555 @deftypefn Macro scm_t_bits SCM_CELL_TYPE (SCM @var{x})
556 For a non-immediate Scheme object @var{x}, deliver the content of the
557 first entry of the heap cell referenced by @var{x}. This value holds
558 the information about the cell type.
561 @deftypefn Macro void SCM_SET_CELL_TYPE (SCM @var{x}, scm_t_bits @var{t})
562 For a non-immediate Scheme object @var{x}, write the value @var{t} into
563 the first entry of the heap cell referenced by @var{x}. The value
564 @var{t} must hold a valid cell type.
568 @node Accessing Cell Entries
569 @subsubsection Accessing Cell Entries
For a non-immediate Scheme object @var{x}, the object type can be
determined by reading the cell type entry using the @code{SCM_CELL_TYPE}
macro. For each different type of cell it is known which cell entries
hold Scheme objects and which cell entries hold raw C data. To access
the different cell entries appropriately, the following macros are
provided.
578 @deftypefn Macro scm_t_bits SCM_CELL_WORD (SCM @var{x}, unsigned int @var{n})
Deliver the cell entry @var{n} of the heap cell referenced by the
non-immediate Scheme object @var{x} as raw data. It is illegal to
access cell entries that hold Scheme objects by using these macros. For
convenience, the following macros are also provided.
SCM_CELL_WORD_0 (@var{x}) @result{} SCM_CELL_WORD (@var{x}, 0)
SCM_CELL_WORD_1 (@var{x}) @result{} SCM_CELL_WORD (@var{x}, 1)
SCM_CELL_WORD_@var{n} (@var{x}) @result{} SCM_CELL_WORD (@var{x}, @var{n})
595 @deftypefn Macro SCM SCM_CELL_OBJECT (SCM @var{x}, unsigned int @var{n})
Deliver the cell entry @var{n} of the heap cell referenced by the
non-immediate Scheme object @var{x} as a Scheme object. It is illegal
to access cell entries that do not hold Scheme objects by using these
macros. For convenience, the following macros are also provided.
SCM_CELL_OBJECT_0 (@var{x}) @result{} SCM_CELL_OBJECT (@var{x}, 0)
SCM_CELL_OBJECT_1 (@var{x}) @result{} SCM_CELL_OBJECT (@var{x}, 1)
SCM_CELL_OBJECT_@var{n} (@var{x}) @result{} SCM_CELL_OBJECT (@var{x},
@var{n})
613 @deftypefn Macro void SCM_SET_CELL_WORD (SCM @var{x}, unsigned int @var{n}, scm_t_bits @var{w})
Write the raw C value @var{w} into entry number @var{n} of the heap cell
referenced by the non-immediate Scheme value @var{x}. Values that are
written into cells this way may only be read from the cells using the
@code{SCM_CELL_WORD} macros or, in case cell entry 0 is written, using
the @code{SCM_CELL_TYPE} macro. For the special case of cell entry 0,
make sure that @var{w} contains cell type information that does not
describe a Scheme object. For convenience, the following macros are
also provided.
SCM_SET_CELL_WORD_0 (@var{x}, @var{w}) @result{} SCM_SET_CELL_WORD
(@var{x}, 0, @var{w})
SCM_SET_CELL_WORD_1 (@var{x}, @var{w}) @result{} SCM_SET_CELL_WORD
(@var{x}, 1, @var{w})
SCM_SET_CELL_WORD_@var{n} (@var{x}, @var{w}) @result{} SCM_SET_CELL_WORD
(@var{x}, @var{n}, @var{w})
637 @deftypefn Macro void SCM_SET_CELL_OBJECT (SCM @var{x}, unsigned int @var{n}, SCM @var{o})
Write the Scheme object @var{o} into entry number @var{n} of the heap
cell referenced by the non-immediate Scheme value @var{x}. Values that
are written into cells this way may only be read from the cells using
the @code{SCM_CELL_OBJECT} macros or, in case cell entry 0 is written,
using the @code{SCM_CELL_TYPE} macro. For the special case of cell
entry 0, writing a Scheme object into this entry is only allowed if the
cell forms a Scheme pair. For convenience, the following macros are
also provided.
SCM_SET_CELL_OBJECT_0 (@var{x}, @var{o}) @result{} SCM_SET_CELL_OBJECT
(@var{x}, 0, @var{o})
SCM_SET_CELL_OBJECT_1 (@var{x}, @var{o}) @result{} SCM_SET_CELL_OBJECT
(@var{x}, 1, @var{o})
SCM_SET_CELL_OBJECT_@var{n} (@var{x}, @var{o}) @result{}
SCM_SET_CELL_OBJECT (@var{x}, @var{n}, @var{o})
665 For a non-immediate Scheme object @var{x} of unknown type, get the type
666 information by using @code{SCM_CELL_TYPE (@var{x})}.
As soon as the cell type information is available, only use the
appropriate access methods to read and write data to the different cell
entries.
675 @c TeX-master: "guile.texi"