1\input texinfo
2@setfilename mbapi.info
3@settitle Multibyte API
4@setchapternewpage off
5
6@c Open issues:
7
8@c What's the best way to report errors? Should functions return a
9@c magic value, according to C tradition, or should they signal a
10@c Guile exception?
11
12@c
13
14
15@node Working With Multibyte Strings in C
16@chapter Working With Multibyte Strings in C
17
18Guile allows strings to contain characters drawn from a wide variety of
19languages, including many Asian, Eastern European, and Middle Eastern
20languages, in a uniform and unrestricted way. The string representation
21normally used in C code --- an array of @sc{ASCII} characters --- is not
22sufficient for Guile strings, since they may contain characters not
23present in @sc{ASCII}.
24
25Instead, Guile uses a very large character set, and encodes each
26character as a sequence of one or more bytes. We call this
27variable-width encoding a @dfn{multibyte} encoding. Guile uses this
28single encoding internally for all strings, symbol names, error
29messages, etc., and performs appropriate conversions upon input and
30output.
31
32The use of this variable-width encoding is almost invisible to Scheme
33code. Strings are still indexed by character number, not by byte
34offset; @code{string-length} still returns the length of a string in
35characters, not in bytes. @code{string-ref} and @code{string-set!} are
36no longer guaranteed to be constant-time operations, but Guile uses
37various strategies to reduce the impact of this change.
38
39However, the encoding is visible via Guile's C interface, which gives
40the user direct access to a string's bytes. This chapter explains how
41to work with Guile multibyte text in C code. Since variable-width
42encodings are clumsier to work with than simple fixed-width encodings,
43Guile provides a set of standard macros and functions for manipulating
44multibyte text to make the job easier. Furthermore, Guile makes some
45promises about the encoding which you can use in writing your own text
46processing code.
47
48While we discuss guaranteed properties of Guile's encoding, and provide
49functions to operate on its character set, we do not actually specify
50either the character set or encoding here. This is because we expect
51both of them to change in the future: currently, Guile uses the same
52encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
53as well) to use Unicode and UTF-8, with some extensions. This will make
54it more comfortable to use Guile with other systems which use UTF-8,
like the GTK user interface toolkit.
56
57@menu
58* Multibyte String Terminology::
59* Promised Properties of the Guile Multibyte Encoding::
60* Functions for Operating on Multibyte Text::
61* Multibyte Text Processing Errors::
62* Why Guile Does Not Use a Fixed-Width Encoding::
63@end menu
64
65
66@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
67@section Multibyte String Terminology
68
69In the descriptions which follow, we make the following definitions:
70@table @dfn
71
72@item byte
73A @dfn{byte} is a number between 0 and 255. It has no inherent textual
74interpretation. So 65 is a byte, not a character.
75
76@item character
77A @dfn{character} is a unit of text. It has no inherent numeric value.
78@samp{A} and @samp{.} are characters, not bytes. (This is different
79from the C language's definition of @dfn{character}; in this chapter, we
80will always use a phrase like ``the C language's @code{char} type'' when
81that's what we mean.)
82
83@item character set
84A @dfn{character set} is an invertible mapping between numbers and a
85given set of characters. @sc{ASCII} is a character set assigning
86characters to the numbers 0 through 127. It maps @samp{A} onto the
87number 65, and @samp{.} onto 46.
88
89Note that a character set maps characters onto numbers, @emph{not
90necessarily} onto bytes. For example, the Unicode character set maps
91the Greek lower-case @samp{alpha} character onto the number 945, which
92is not a byte.
93
(This is what Internet standards would call a ``coded character set''.)
95
96@item encoding
An @dfn{encoding} maps numbers onto sequences of bytes. For example, the
UTF-8 encoding, defined in the Unicode Standard, would map the number
945 onto the sequence of bytes @samp{206 177}. When using the
@sc{ASCII} character set, every number assigned also happens to be a
byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.

(This is what Internet standards would call a ``character encoding
scheme''.)
105
106@end table
107
108Thus, to turn a character into a sequence of bytes, you need a character
109set to assign a number to that character, and then an encoding to turn
110that number into a sequence of bytes.
111
112Likewise, to interpret a sequence of bytes as a sequence of characters,
113you use an encoding to extract a sequence of numbers from the bytes, and
114then a character set to turn the numbers into characters.
115
Errors can occur while carrying out either of these processes. For
example, under a particular encoding, a given sequence of bytes might not
correspond to any number: the byte sequence @samp{128 128} is not a
valid encoding of any number under UTF-8.
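
As a concrete illustration of the two steps, here is a minimal sketch in
C that takes Unicode as the character set and UTF-8 as the encoding.
This is not Guile's own encoding and not part of Guile's interface; the
names @code{sketch_utf8_encode} and @code{sketch_utf8_decode} are
hypothetical, and the sketch handles only numbers below 65536.

```c
#include <stddef.h>

/* Hypothetical helper: encode the number C as UTF-8 at OUT.
   Return the number of bytes written, or 0 if C is too large for
   this sketch.  (Surrogate values are not rejected, for brevity.)  */
int sketch_utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    return 0;
}

/* Hypothetical helper: return the number encoded at the start of the
   LEN bytes at P, or -1 if those bytes do not start with a valid
   one-, two-, or three-byte UTF-8 encoding.  */
long sketch_utf8_decode(const unsigned char *p, size_t len)
{
    if (len >= 1 && p[0] < 0x80)
        return p[0];
    if (len >= 2 && (p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        long c = ((long)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return c >= 0x80 ? c : -1;      /* reject overlong forms */
    }
    if (len >= 3 && (p[0] & 0xF0) == 0xE0
        && (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80) {
        long c = ((long)(p[0] & 0x0F) << 12)
                 | ((long)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return c >= 0x800 ? c : -1;     /* reject overlong forms */
    }
    return -1;
}
```

With these definitions, encoding 945 yields the two bytes @samp{206 177},
and decoding the byte sequence @samp{128 128} reports an error.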
120
121Having carefully defined our terminology, we will now abuse it.
122
123We will sometimes use the word @dfn{character} to refer to the number
124assigned to a character by a character set, in contexts where it's
125obvious we mean a number.
126
127Sometimes there is a close association between a particular encoding and
128a particular character set. Thus, we may sometimes refer to the
129character set and encoding together as an @dfn{encoding}.
130
131
132@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
133@section Promised Properties of the Guile Multibyte Encoding
134
135Internally, Guile uses a single encoding for all text --- symbols,
136strings, error messages, etc. Here we list a number of helpful
137properties of Guile's encoding. It is correct to write code which
138assumes these properties; code which uses these assumptions will be
139portable to all future versions of Guile, as far as we know.
140
141@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
142the obvious way.} This means that a standard C string containing only
143@sc{ASCII} characters is a valid Guile string (except for the terminator;
144Guile strings store the length explicitly, so they can contain null
145characters).
146
147@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
148and 255.} That is, when we turn a non-@sc{ASCII} character into a
149series of bytes, none of those bytes can ever be mistaken for the
150encoding of an @sc{ASCII} character. This means that you can search a
151Guile string for an @sc{ASCII} character using the standard
152@code{memchr} library function. By extension, you can search for an
153@sc{ASCII} substring in a Guile string using a traditional substring
154search algorithm --- you needn't add special checks to verify encoding
155boundaries, etc.
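
A minimal sketch of such a search follows. Since this chapter does not
specify Guile's encoding, the sketch can only be exercised on an
encoding with the same property; UTF-8 is one. The helper name
@code{sketch_find_ascii} is hypothetical and is not a Guile function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of the ASCII character CH in the LEN bytes at TEXT, or -1 if it
   does not occur.  No decoding is needed, because bytes 0..127 only
   ever appear as the encodings of ASCII characters.  */
long sketch_find_ascii(const unsigned char *text, size_t len, char ch)
{
    const unsigned char *hit = memchr(text, (unsigned char)ch, len);
    return hit ? (long)(hit - text) : -1;
}
```

For the four bytes @samp{206 177 61 49} (a two-byte character followed
by @samp{=} and @samp{1}), searching for @samp{=} finds byte offset 2.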
156
157@b{No character encoding is a subsequence of any other character
158encoding.} (This is just a stronger version of the previous promise.)
159This means that you can search for occurrences of one Guile string
160within another Guile string just as if they were raw byte strings. You
161can use the stock @code{memmem} function (provided on GNU systems, at
162least) for such searches. If you don't need the ability to represent
163null characters in your text, you can still use null-termination for
164strings, and use the traditional string-handling functions like
165@code{strlen}, @code{strstr}, and @code{strcat}.
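
The following sketch shows a substring search on null-terminated
multibyte text using nothing but @code{strstr}. It assumes text that
contains no null characters, and the helper name
@code{sketch_find_substring} is hypothetical. As before, it can be
tried with UTF-8 data, which also keeps this promise.

```c
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of the null-terminated multibyte string NEEDLE in the
   null-terminated multibyte string HAYSTACK, or -1 if there is none.
   A byte-level match is always a character-level match, because no
   character's encoding occurs inside another character's encoding.  */
long sketch_find_substring(const unsigned char *haystack,
                           const unsigned char *needle)
{
    const char *start = (const char *)haystack;
    const char *hit = strstr(start, (const char *)needle);
    return hit ? (long)(hit - start) : -1;
}
```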
166
167@b{You can always determine the full length of a character's encoding
168from its first byte.} Guile provides the macro @code{scm_mb_len} which
169computes the encoding's length from its first byte. Given the first
170rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
171@var{b} <= 127}, returns 1.
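
To show what computing a length from the first byte can look like, here
is a sketch for UTF-8, which has the same property. It is only an
illustration: Guile's actual table of leading bytes is not specified in
this chapter, and @code{sketch_len_from_first_byte} is a hypothetical
name, not the @code{scm_mb_len} macro.

```c
/* Hypothetical stand-in for scm_mb_len, using UTF-8 leading bytes.
   Return the length in bytes of the encoding that starts with B, or 0
   if B cannot start an encoding.  (scm_mb_len itself leaves that last
   case undefined.)  */
int sketch_len_from_first_byte(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* every ASCII character is one byte */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;                   /* continuation or unused byte */
}
```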
172
173@b{Given an arbitrary byte position in a Guile string, you can always
174find the beginning and end of the character containing that byte without
175scanning too far in either direction.} This means that, if you are sure
176a byte sequence is a valid encoding of a character sequence, you can
177find character boundaries without keeping track of the beginning and
178ending of the overall string. This promise relies on the fact that, in
179addition to storing the string's length explicitly, Guile always either
180terminates the string's storage with a zero byte, or shares it with
181another string which is terminated this way.
182
183
184@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
185@section Functions for Operating on Multibyte Text
186
187Guile provides a variety of functions, variables, and types for working
188with multibyte text.
189
190@menu
191* Basic Multibyte Character Processing::
192* Finding Character Encoding Boundaries::
193* Multibyte String Functions::
194* Exchanging Guile Text With the Outside World in C::
195* Implementing Your Own Text Conversions::
196@end menu
197
198
199@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
200@subsection Basic Multibyte Character Processing
201
202Here are the essential types and functions for working with Guile text.
203Guile uses the C type @code{unsigned char *} to refer to text encoded
204with Guile's encoding.
205
206Note that any operation marked here as a ``Libguile Macro'' might
207evaluate its argument multiple times.
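
The practical consequence is that arguments to such macros should be
free of side effects. The sketch below uses a hypothetical macro,
@code{SKETCH_IS_ASCII_LETTER}, which is not part of Guile, to show the
safe pattern: fetch the byte into a local variable first, then pass the
variable to the macro.

```c
/* Hypothetical macro that, like a ``Libguile Macro'', mentions its
   argument more than once.  */
#define SKETCH_IS_ASCII_LETTER(b) \
    (((b) >= 'a' && (b) <= 'z') || ((b) >= 'A' && (b) <= 'Z'))

/* Count the ASCII letters in the LEN bytes at P.  Writing
   SKETCH_IS_ASCII_LETTER (*p++) instead would advance P more than
   once per test, so the increment is kept outside the macro call.  */
int sketch_count_letters(const unsigned char *p, int len)
{
    const unsigned char *end = p + len;
    int count = 0;

    while (p < end) {
        unsigned char b = *p++;         /* side effect happens once */
        if (SKETCH_IS_ASCII_LETTER(b))
            count++;
    }
    return count;
}
```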
208
209@deftp {Libguile Type} scm_char_t
This is a signed integral type large enough to hold any character in
Guile's character set. All character numbers are non-negative.
212@end deftp
213
214@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
215Return the character whose encoding starts at @var{p}. If @var{p} does
216not point at a valid character encoding, the behavior is undefined.
217@end deftypefn
218
219@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
220Place the encoded form of the Guile character @var{c} at @var{p}, and
221return its length in bytes. If @var{c} is not a Guile character, the
222behavior is undefined.
223@end deftypefn
224
225@deftypevr {Libguile Constant} int scm_mb_max_len
226The maximum length of any character's encoding, in bytes. You may
227assume this is relatively small --- less than a dozen or so.
228@end deftypevr
229
230@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
231If @var{b} is the first byte of a character's encoding, return the full
232length of the character's encoding, in bytes. If @var{b} is not a valid
233leading byte, the behavior is undefined.
234@end deftypefn
235
236@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
237Return the length of the encoding of the character @var{c}, in bytes.
238If @var{c} is not a valid Guile character, the behavior is undefined.
239@end deftypefn
240
241@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
242@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
243@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
244@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
These are functions identical to the corresponding macros. You can use
them in situations where the overhead of a function call is acceptable,
and the cleaner semantics of function application are desirable.
248@end deftypefn
249
250
251@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
252@subsection Finding Character Encoding Boundaries
253
254These are functions for finding the boundaries between characters in
255multibyte text.
256
257Note that any operation marked here as a ``Libguile Macro'' might
258evaluate its argument multiple times, unless the definition promises
259otherwise.
260
261@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
Return non-zero if and only if @var{p} points to the start of a
character in multibyte text.
264
265This macro will evaluate its argument only once.
266@end deftypefn
267
268@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
269``Round'' @var{p} to the previous character boundary. That is, if
270@var{p} points to the middle of the encoding of a Guile character,
271return a pointer to the first byte of the encoding. If @var{p} points
272to the start of the encoding of a Guile character, return @var{p}
273unchanged.
274@end deftypefn
275
@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
277``Round'' @var{p} to the next character boundary. That is, if @var{p}
278points to the middle of the encoding of a Guile character, return a
279pointer to the first byte of the encoding of the next character. If
280@var{p} points to the start of the encoding of a Guile character, return
281@var{p} unchanged.
282@end deftypefn
283
284Note that it is usually not friendly for functions to silently correct
285byte offsets that point into the middle of a character's encoding. Such
286offsets almost always indicate a programming error, and they should be
287reported as early as possible. So, when you write code which operates
288on multibyte text, you should not use functions like these to ``clean
289up'' byte offsets which the originator believes to be correct; instead,
290your code should signal a @code{text:not-char-boundary} error as soon as
291it detects an invalid offset. @xref{Multibyte Text Processing Errors}.
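
The sketch below illustrates both halves of this advice for UTF-8,
which stands in for Guile's unspecified encoding. All names beginning
with @code{sketch_} are hypothetical. The sketch assumes valid,
zero-terminated text, and @code{sketch_check_offset} returns -1 where
real code would signal @code{text:not-char-boundary}.

```c
/* In UTF-8, every byte that is not of the form 10xxxxxx starts a
   character.  The terminating zero byte counts as a boundary too.  */
int sketch_boundary_p(const unsigned char *p)
{
    return (*p & 0xC0) != 0x80;
}

/* Round P down to the start of the character containing it.
   Assumes P points into valid text.  */
const unsigned char *sketch_floor(const unsigned char *p)
{
    while (!sketch_boundary_p(p))
        p--;
    return p;
}

/* Round P up to the next character boundary.  The scan stops at the
   terminating zero byte at the latest.  */
const unsigned char *sketch_ceiling(const unsigned char *p)
{
    while (!sketch_boundary_p(p))
        p++;
    return p;
}

/* Validate a caller-supplied byte offset instead of silently
   rounding it.  Return 0 if OFFSET is a character boundary in TEXT,
   and -1 otherwise.  */
int sketch_check_offset(const unsigned char *text, int offset)
{
    return sketch_boundary_p(text + offset) ? 0 : -1;
}
```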
292
293
294@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
295@subsection Multibyte String Functions
296
297These functions allow you to operate on multibyte strings: sequences of
298character encodings.
299
300@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
301Return the number of Guile characters encoded by the @var{len} bytes at
302@var{p}.
303
304If the sequence contains any invalid character encodings, or ends with
305an incomplete character encoding, signal a @code{text:bad-encoding}
306error.
307@end deftypefn
308
309@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
310Return the character whose encoding starts at @code{*@var{pp}}, and
311advance @code{*@var{pp}} to the start of the next character. Return -1
312if @code{*@var{pp}} does not point to a valid character encoding.
313@end deftypefn
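
A walking decoder of this shape makes left-to-right loops simple to
write. The sketch below implements one for UTF-8 (one- to three-byte
encodings only) as a stand-in for Guile's encoding, and builds a
character count on top of it. The names @code{sketch_walk} and
@code{sketch_count} are hypothetical; the sketch assumes zero-terminated
text and returns -1 where @code{scm_mb_count} would signal
@code{text:bad-encoding}.

```c
/* Return the character whose encoding starts at *PP and advance *PP
   past it.  Return -1, leaving *PP unchanged, if *PP does not point
   at a valid encoding.  Reading P[1] and P[2] is safe because the
   text is zero-terminated and zero is never a continuation byte.  */
long sketch_walk(const unsigned char **pp)
{
    const unsigned char *p = *pp;
    long c;

    if (p[0] < 0x80) {
        *pp = p + 1;
        return p[0];
    }
    if (p[0] >= 0xC2 && p[0] <= 0xDF && (p[1] & 0xC0) == 0x80) {
        c = ((long)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        *pp = p + 2;
        return c;
    }
    if ((p[0] & 0xF0) == 0xE0
        && (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80) {
        c = ((long)(p[0] & 0x0F) << 12)
            | ((long)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        if (c < 0x800)
            return -1;          /* overlong form */
        *pp = p + 3;
        return c;
    }
    return -1;
}

/* Count the characters encoded by the LEN bytes at P, or return -1
   if an invalid encoding is found.  */
int sketch_count(const unsigned char *p, int len)
{
    const unsigned char *end = p + len;
    int n = 0;

    while (p < end) {
        if (sketch_walk(&p) < 0)
            return -1;
        n++;
    }
    return n;
}
```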
314
315@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
316If @var{p} points to the middle of the encoding of a Guile character,
317return a pointer to the first byte of the encoding. If @var{p} points
318to the start of the encoding of a Guile character, return the start of
319the previous character's encoding.
320
321This is like @code{scm_mb_floor}, but the returned pointer will always
322be before @var{p}. If you use this function to drive an iteration, it
323guarantees backward progress.
324@end deftypefn
325
326@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
327If @var{p} points to the encoding of a Guile character, return a pointer
328to the first byte of the encoding of the next character.
329
330This is like @code{scm_mb_ceiling}, but the returned pointer will always
331be after @var{p}. If you use this function to drive an iteration, it
332guarantees forward progress.
333@end deftypefn
334
335@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
336Assuming that the @var{len} bytes starting at @var{p} are a
337concatenation of valid character encodings, return a pointer to the
338start of the @var{i}'th character encoding in the sequence.
339
340This function scans the sequence from the beginning to find the
341@var{i}'th character, and will generally require time proportional to
342the distance from @var{p} to the returned address.
343
344If the sequence contains any invalid character encodings, or ends with
345an incomplete character encoding, signal a @code{text:bad-encoding}
346error.
347@end deftypefn
348
349It is common to process the characters in a string from left to right.
350However, if you fetch each character using @code{scm_mb_index}, each
351call will scan the text from the beginning, so your loop will require
352time proportional to at least the square of the length of the text. To
353avoid this poor performance, you can use an @code{scm_mb_cache}
354structure and the @code{scm_mb_index_cached} macro.
355
356@deftp {Libguile Type} {struct scm_mb_cache}
357This structure holds information that allows a string scanning operation
358to use the results from a previous scan of the string. It has the
359following members:
360@table @code
361
362@item character
363An index, in characters, into the string.
364
365@item byte
366The index, in bytes, of the start of that character.
367
368@end table
369
370In other words, @code{byte} is the byte offset of the
371@code{character}'th character of the string. Note that if @code{byte}
372and @code{character} are equal, then all characters before that point
373must have encodings exactly one byte long, and the string can be indexed
374normally.
375
376All elements of a @code{struct scm_mb_cache} structure should be
377initialized to zero before its first use, and whenever the string's text
378changes.
379@end deftp
380
@deftypefn {Libguile Macro} {const unsigned char *} scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
@deftypefnx {Libguile Function} {const unsigned char *} scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
This macro and this function are identical to @code{scm_mb_index},
except that they may consult and update @code{*@var{cache}} in order to
avoid scanning the string from the beginning. @code{scm_mb_index_cached}
is a macro, so it may have less overhead than
@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
than once.
389
390Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
391can scan a string from left to right, or from right to left, in time
392proportional to the length of the string. As long as each character
393fetched is less than some constant distance before or after the previous
394character fetched with @var{cache}, each access will require constant
395time.
396@end deftypefn
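
To make the caching idea concrete, here is a sketch of how such a cache
could work, written for UTF-8 as a stand-in encoding. It is not Guile's
implementation: @code{struct sketch_cache} and
@code{sketch_index_cached} are hypothetical names, the function returns
a byte offset rather than a pointer, and it returns -1 instead of
signalling an error. The text is assumed to be valid.

```c
/* Mirrors the two members described for struct scm_mb_cache.  */
struct sketch_cache {
    int character;      /* an index, in characters, into the string */
    int byte;           /* byte offset of the start of that character */
};

/* Length of a UTF-8 encoding, from its first byte (valid text only).  */
static int sketch_len(unsigned char b)
{
    return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
}

/* Return the byte offset of the I'th character in the LEN bytes at P,
   scanning from the cached position instead of from the beginning.
   Return -1, leaving CACHE untouched, if I is out of range.  */
int sketch_index_cached(const unsigned char *p, int len, int i,
                        struct sketch_cache *cache)
{
    int ch = cache->character;
    int by = cache->byte;

    if (i < 0)
        return -1;
    while (ch < i && by < len) {        /* scan forward */
        by += sketch_len(p[by]);
        ch++;
    }
    while (ch > i) {                    /* scan backward */
        do
            by--;
        while ((p[by] & 0xC0) == 0x80);
        ch--;
    }
    if (ch != i || by >= len)
        return -1;
    cache->character = ch;
    cache->byte = by;
    return by;
}
```

Each call costs time proportional to the distance, in characters,
between the requested index and the cached one, so a left-to-right or
right-to-left scan takes linear time overall.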
397
398Guile also provides functions to convert between an encoded sequence of
399characters, and an array of @code{scm_char_t} objects.
400
@deftypefn {Libguile Function} {scm_char_t *} scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
402Convert the variable-width text in the @var{len} bytes at @var{p}
403to an array of @code{scm_char_t} values. Return a pointer to the array,
404and set @code{*@var{result_len}} to the number of elements it contains.
405The returned array is allocated with @code{malloc}, and it is the
406caller's responsibility to free it.
407
408If the text is not a sequence of valid character encodings, this
409function will signal a @code{text:bad-encoding} error.
410@end deftypefn
411
@deftypefn {Libguile Function} {unsigned char *} scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
413Convert the array of @code{scm_char_t} values to a sequence of
414variable-width character encodings. Return a pointer to the array of
415bytes, and set @code{*@var{result_len}} to its length, in bytes.
416
417The returned byte sequence is terminated with a zero byte, which is not
418counted in the length returned in @code{*@var{result_len}}.
419
420The returned byte sequence is allocated with @code{malloc}; it is the
421caller's responsibility to free it.
422
If @var{fixed} contains a value which is not a valid Guile character,
this function will signal a @code{text:bad-encoding} error.
425@end deftypefn
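
The following sketch shows the general shape of a
variable-width-to-fixed-width conversion and its ownership rule. It is
an illustration under stated assumptions, not Guile's code: it decodes
UTF-8 rather than Guile's encoding, handles only one- and two-byte
encodings, uses the hypothetical names @code{sketch_char_t} and
@code{sketch_multibyte_to_fixed}, and returns a null pointer where the
Guile function would signal @code{text:bad-encoding}.

```c
#include <stdlib.h>

typedef long sketch_char_t;     /* hypothetical stand-in for scm_char_t */

/* Convert the LEN bytes at P to a malloc'd array of sketch_char_t
   values and store the element count in *RESULT_LEN.  Return NULL on
   allocation failure or on an encoding this sketch cannot decode.
   The caller must free the returned array.  */
sketch_char_t *sketch_multibyte_to_fixed(const unsigned char *p, int len,
                                         int *result_len)
{
    /* Each character takes at least one byte, so LEN elements suffice. */
    sketch_char_t *out = malloc((len > 0 ? len : 1) * sizeof *out);
    int i = 0, n = 0;

    if (!out)
        return NULL;
    while (i < len) {
        unsigned char b = p[i];
        if (b < 0x80) {
            out[n++] = b;
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF
                   && i + 1 < len && (p[i + 1] & 0xC0) == 0x80) {
            out[n++] = ((sketch_char_t)(b & 0x1F) << 6) | (p[i + 1] & 0x3F);
            i += 2;
        } else {
            free(out);          /* bad, incomplete, or unsupported */
            return NULL;
        }
    }
    *result_len = n;
    return out;
}
```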
426
427
428@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
429@subsection Exchanging Guile Text With the Outside World in C
430
431[[This is kind of a heavy-weight model, given that one end of the
432conversion is always going to be the Guile encoding. Any way to shorten
433things a bit?]]
434
435Guile provides functions for converting between Guile's internal text
436representation and encodings popular in the outside world. These
437functions are closely modeled after the @code{iconv} functions available
438on some systems.
439
440To convert text between two encodings, you should first call
441@code{scm_mb_iconv_open} to indicate the source and destination
442encodings; this function returns a context object which records the
443conversion to perform.
444
Then, you should call @code{scm_mb_iconv} to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from @code{scm_mb_iconv_open}. You don't need to pass
all your input to @code{scm_mb_iconv} at once; you can invoke it on
successive blocks of input (as you read it from a file, say), and it
will convert as much as it can each time, indicating when you should
grow your output buffer.
452
453An encoding may be @dfn{stateless}, or @dfn{stateful}. In most
454encodings, a contiguous group of bytes from the sequence completely
455specifies a particular character; these are stateless encodings.
456However, some encodings require you to look back an unbounded number of
457bytes in the stream to assign a meaning to a particular byte sequence;
458such encodings are stateful.
459
For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
byte sequence @samp{27 36 66} indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS X 0208
character set. An arbitrary number of byte pairs may follow this
sequence. The byte sequence @samp{27 40 66} indicates that subsequent
bytes should be interpreted as @sc{ASCII}. In this encoding, you cannot
tell whether a given byte is an @sc{ASCII} character without looking
back an arbitrary distance for the most recent escape sequence, so it is
a stateful encoding.
469
470In Guile, if a conversion involves a stateful encoding, the context
471object carries any necessary state. Thus, you can have many independent
472conversions to or from stateful encodings taking place simultaneously,
473as long as each data stream uses its own context object for the
474conversion.
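
The sketch below shows why a context object is needed, using the two
@samp{ISO-2022-JP} escape sequences just described. It is a toy, not
Guile's converter: @code{struct sketch_context} and
@code{sketch_count_chars} are hypothetical names, the function only
counts characters, and it assumes that each chunk ends on a character or
escape-sequence boundary.

```c
#include <stddef.h>

enum sketch_mode { SKETCH_ASCII, SKETCH_JIS };

/* The state that must survive from one chunk of input to the next.  */
struct sketch_context {
    enum sketch_mode mode;
};

/* Count the characters in the LEN bytes at P, updating CTX as escape
   sequences are seen.  */
int sketch_count_chars(struct sketch_context *ctx,
                       const unsigned char *p, size_t len)
{
    size_t i = 0;
    int count = 0;

    while (i < len) {
        if (p[i] == 27 && i + 2 < len && p[i + 1] == 36 && p[i + 2] == 66) {
            ctx->mode = SKETCH_JIS;     /* 27 36 66: byte pairs follow */
            i += 3;
        } else if (p[i] == 27 && i + 2 < len
                   && p[i + 1] == 40 && p[i + 2] == 66) {
            ctx->mode = SKETCH_ASCII;   /* 27 40 66: ASCII follows */
            i += 3;
        } else if (ctx->mode == SKETCH_JIS && i + 1 < len) {
            count++;                    /* one character per byte pair */
            i += 2;
        } else {
            count++;                    /* one character per byte */
            i += 1;
        }
    }
    return count;
}
```

The same two bytes count as one character or as two, depending on the
state left behind by earlier input, which is why each data stream needs
its own context.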
475
476@deftp {Libguile Type} {struct scm_mb_iconv}
477This is the type for context objects, which represent the encodings and
478current state of an ongoing text conversion. A @code{struct
479scm_mb_iconv} records the source and destination encodings, and keeps
480track of any information needed to handle stateful encodings.
481@end deftp
482
483@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
484Return a pointer to a new @code{struct scm_mb_iconv} context object,
485ready to convert from the encoding named @var{fromcode} to the encoding
486named @var{tocode}. For stateful encodings, the context object is in
487some appropriate initial state, ready for use with the
488@code{scm_mb_iconv} function.
489
490When you are done using a context object, you may call
491@code{scm_mb_iconv_close} to free it.
492
493If either @var{tocode} or @var{fromcode} is not the name of a known
494encoding, this function will signal the @code{text:unknown-conversion}
495error, described below.
496
497@c Try to use names here from the IANA list:
498@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
499Guile supports at least these encodings:
500@table @samp
501
502@item US-ASCII
503@sc{US-ASCII}, in the standard one-character-per-byte encoding.
504
505@item ISO-8859-1
506The usual character set for Western European languages, in its usual
507one-character-per-byte encoding.
508
509@item Guile-MB
510Guile's current internal multibyte encoding. The actual encoding this
511name refers to will change from one version of Guile to the next. You
512should use this when converting data between external sources and the
513encoding used by Guile objects.
514
You should @emph{not} use this as the encoding for data presented to the
outside world, for two reasons. 1) Its meaning will change over time,
so data written using the @samp{Guile-MB} encoding with one version of
Guile might not be readable with the @samp{Guile-MB} encoding in another
version of Guile. 2) It currently corresponds to @samp{Emacs-Mule},
which was invented for Emacs's internal use, and was never intended to
serve as an exchange medium.
522
523@item Guile-Wide
524Guile's character set, as an array of @code{scm_char_t} values.
525
Note that this encoding is even less suitable for public use than
@samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
size and endianness that the host system uses for @code{scm_char_t}.
Using this encoding is very much like calling the
@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
functions, except that @code{scm_mb_iconv} gives you more control over
buffer allocation and management.
533
534@item Emacs-Mule
This is the variable-length encoding for multilingual text used by GNU
Emacs, at least through version 20.4. You probably should not use this
encoding, as it is designed only for Emacs's internal use. However, we
provide it here because it's trivial to support, and some people
probably do have @samp{Emacs-Mule}-format files lying around.
540
541@end table
542
543(At the moment, this list doesn't include any character sets suitable for
544external use that can actually handle multilingual data; this is
545unfortunate, as it encourages users to write data in Emacs-Mule format,
546which nobody but Emacs and Guile understands. We hope to add support
547for Unicode in UTF-8 soon, which should solve this problem.)
548
549Case is not significant in encoding names.
550
551You can define your own conversions; see @ref{Implementing Your Own Text
552Conversions}.
553@end deftypefn
554
555@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
Return a non-zero value if Guile supports the encoding named @var{encoding}.
557@end deftypefn
558
559@deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
560Convert a sequence of characters from one encoding to another. The
561argument @var{context} specifies the encodings to use for the input and
562output, and carries state for stateful encodings; use
563@code{scm_mb_iconv_open} to create a @var{context} object for a
564particular conversion.
565
566Upon entry to the function, @code{*@var{inbuf}} should point to the
567input buffer, and @code{*@var{inbytesleft}} should hold the number of
568input bytes present in the buffer; @code{*@var{outbuf}} should point to
569the output buffer, and @code{*@var{outbytesleft}} should hold the number
570of bytes available to hold the conversion results in that buffer.
571
Upon exit from the function, @code{*@var{inbuf}} points to the first
unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
the last output byte, and @code{*@var{outbytesleft}} holds the number of
bytes left unused in the output buffer.
577
For stateful encodings, @var{context} carries encoding state from one
call to @code{scm_mb_iconv} to the next. Thus, successive calls to
@code{scm_mb_iconv} which use the same context object can convert a
stream of data one chunk at a time.
582
583If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
584taken as a request to reset the states of the input and the output
585encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
586non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
587buffer to put the output encoding in its initial state. If the output
588buffer is not large enough to hold this byte sequence,
589@code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
590the shift states of @var{context}'s input and output encodings
591unchanged.
592
593The @code{scm_mb_iconv} function always consumes only complete
594characters or shift sequences from the input buffer, and the output
595buffer always contains a sequence of complete characters or escape
596sequences.
597
If the input sequence contains characters which are not expressible in
the output encoding, @code{scm_mb_iconv} converts them in an
implementation-defined way. It may simply delete such characters.
601
602Some encodings use byte sequences which do not correspond to any textual
603character. For example, the escape sequence of a stateful encoding has
604no textual meaning. When converting from such an encoding, a call to
605@code{scm_mb_iconv} might consume input but produce no output, since the
606input sequence might contain only escape sequences.
607
Normally, @code{scm_mb_iconv} returns the number of input characters it
could not convert perfectly to the output encoding. However, it may
return one of the @code{scm_mb_iconv_} codes described below, to
indicate an error. All of these codes are negative values.
612
613If the input sequence contains an invalid character encoding, conversion
614stops before the invalid input character, and @code{scm_mb_iconv}
615returns the constant value @code{scm_mb_iconv_bad_encoding}.
616
617If the input sequence ends with an incomplete character encoding,
618@code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
619return the constant value @code{scm_mb_iconv_incomplete_encoding}. This
620is not necessarily an error, if you expect to call @code{scm_mb_iconv}
621again with more data which might contain the rest of the encoding
622fragment.
623
624If the output buffer does not contain enough room to hold the converted
625form of the complete input text, @code{scm_mb_iconv} converts as much as
626it can, changes the input and output pointers to reflect the amount of
627text successfully converted, and then returns
628@code{scm_mb_iconv_too_big}.
629@end deftypefn

Here are the status codes that might be returned by @code{scm_mb_iconv}.
They are all negative integers.
@table @code

@item scm_mb_iconv_too_big
The conversion needs more room in the output buffer.  Some characters
may have been consumed from the input buffer, and some characters may
have been placed in the available space in the output buffer.

@item scm_mb_iconv_bad_encoding
@code{scm_mb_iconv} encountered an invalid character encoding in the
input buffer.  Conversion stopped before the invalid character, so there
may be some characters consumed from the input buffer, and some
converted text in the output buffer.

@item scm_mb_iconv_incomplete_encoding
The input buffer ends with an incomplete character encoding.  The
incomplete encoding is left in the input buffer, unconsumed.  This is
not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the incomplete
encoding.

@end table
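
A caller typically reacts to these codes in a loop.  The following is
only a sketch of one way to do that: it grows the output buffer whenever
the converter reports that it is out of room, and gives up on an invalid
encoding.  Everything prefixed @code{demo_} is hypothetical.  In
particular, the numeric values of the status codes are invented,
@code{demo_copy_convert} is a stand-in converter that merely copies
bytes, and the conversion context argument is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the scm_mb_iconv_ status codes; the text
   only promises that the real codes are negative.  */
enum { demo_too_big = -1, demo_bad_encoding = -2, demo_incomplete = -3 };

typedef int (*demo_iconv_fn) (const char **inbuf, size_t *inbytesleft,
                              char **outbuf, size_t *outbytesleft);

/* Stand-in converter: copies bytes, treating 0xFF as invalid input.  */
int
demo_copy_convert (const char **inbuf, size_t *inbytesleft,
                   char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      if ((unsigned char) **inbuf == 0xFF)
        return demo_bad_encoding;       /* stop before the bad byte */
      if (*outbytesleft == 0)
        return demo_too_big;            /* pointers show progress so far */
      *(*outbuf)++ = *(*inbuf)++;
      (*inbytesleft)--;
      (*outbytesleft)--;
    }
  return 0;
}

/* Convert the text at IN into a freshly allocated buffer, doubling the
   buffer each time the converter runs out of room.  Returns NULL on
   invalid input or allocation failure.  An incomplete trailing
   sequence is left unconverted; *INLEFT reports how much remains.  */
char *
demo_convert_all (demo_iconv_fn convert, const char *in, size_t *inleft,
                  size_t *outlen)
{
  size_t cap = 4, used = 0;
  char *out = malloc (cap);

  assert (convert != NULL && in != NULL);
  if (out == NULL)
    return NULL;
  for (;;)
    {
      char *op = out + used;
      size_t oleft = cap - used;
      int status = convert (&in, inleft, &op, &oleft);

      used = cap - oleft;
      if (status == demo_too_big)
        {
          char *bigger = realloc (out, cap * 2);
          if (bigger == NULL)
            {
              free (out);
              return NULL;
            }
          out = bigger;
          cap *= 2;
          continue;                     /* resume where we stopped */
        }
      if (status == demo_bad_encoding)
        {
          free (out);
          return NULL;
        }
      *outlen = used;                   /* success, or incomplete tail */
      return out;
    }
}
```

The loop can resume after a ``too big'' status only because the
converter updates the input and output pointers to reflect the text it
has already converted.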


Finally, Guile provides a function for destroying conversion contexts.

@deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
Deallocate the conversion context object @var{context}, and all other
resources allocated by the call to @code{scm_mb_iconv_open} which
returned @var{context}.
@end deftypefn


@node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
@subsection Implementing Your Own Text Conversions

[[note that conversions to and from Guile must produce streams
containing only valid character encodings, or else Guile will crash]]

This section describes the interface for adding your own encoding
conversions for use with @code{scm_mb_iconv}.  The interface here is
borrowed from the GNOME Project's @file{libunicode} library.

Guile's @code{scm_mb_iconv} function works by converting the input text
to a stream of @code{scm_char_t} characters, and then converting
those characters to the desired output encoding.  This makes it easy
for Guile to choose the appropriate conversion back ends for an
arbitrary pair of input and output encodings, but it also means that the
accuracy and quality of the conversions depend on the fidelity of
Guile's internal character set to the source and destination encodings.
Since @code{scm_mb_iconv} will be used almost exclusively for converting
to and from Guile's internal character set, this shouldn't be a problem.

To add support for a particular encoding to Guile, you must provide one
function (called the @dfn{read} function) which converts from your
encoding to an array of @code{scm_char_t} values, and another function
(called the @dfn{write} function) to convert from an array of
@code{scm_char_t} values back into your encoding.  To convert from some
encoding @var{a} to some other encoding @var{b}, Guile pairs up
@var{a}'s read function with @var{b}'s write function.  Each call to
@code{scm_mb_iconv} passes text in encoding @var{a} through the read
function, to produce an array of @code{scm_char_t} values, and then passes
that array to the write function, to produce text in encoding @var{b}.
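
As a rough illustration of this pairing, the sketch below drives a
Latin-1 read function and an @sc{ASCII} write function through a
one-character intermediate buffer.  It is not Guile's implementation.
The @code{demo_} names, the @code{scm_char_t} typedef, the driver's
return codes, and the choice to move one character per step are all
assumptions made to keep the example short and to let it rewind the
input when the output fills up.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long scm_char_t;   /* assumption: stand-in for libguile's type */

enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

typedef enum scm_mb_read_result (*demo_read_fn)
  (void *, const char **, size_t *, scm_char_t **, size_t *);
typedef enum scm_mb_write_result (*demo_write_fn)
  (void *, scm_char_t **, size_t *, char **, size_t *);

/* Encoding "a": Latin-1, where each byte is one character.  */
enum scm_mb_read_result
demo_latin1_read (void *cookie, const char **inbuf, size_t *inbytesleft,
                  scm_char_t **outbuf, size_t *outcharsleft)
{
  (void) cookie;
  if (*inbytesleft == 0)
    return scm_mb_read_incomplete;
  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      *(*outbuf)++ = (unsigned char) *(*inbuf)++;
      (*inbytesleft)--;
      (*outcharsleft)--;
    }
  return scm_mb_read_ok;
}

/* Encoding "b": ASCII; anything it cannot represent becomes '?'.  */
enum scm_mb_write_result
demo_ascii_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
                  char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      scm_char_t c;
      if (*outbytesleft == 0)
        return scm_mb_write_too_big;
      c = *(*inbuf)++;
      *(*outbuf)++ = c < 128 ? (char) c : '?';
      (*incharsleft)--;
      (*outbytesleft)--;
    }
  return scm_mb_write_ok;
}

/* Pair a's read function with b's write function.  Returns 0 when all
   input was converted, -1 when the output filled up, -2 when the read
   function reported a problem (codes local to this sketch).  */
int
demo_convert (demo_read_fn read_a, demo_write_fn write_b,
              const char **inbuf, size_t *inbytesleft,
              char **outbuf, size_t *outbytesleft)
{
  assert (read_a != NULL && write_b != NULL);
  while (*inbytesleft > 0)
    {
      const char *saved_in = *inbuf;
      size_t saved_left = *inbytesleft;
      scm_char_t pivot, *fill = &pivot, *drain = &pivot;
      size_t room = 1, pending;

      if (read_a (NULL, inbuf, inbytesleft, &fill, &room) != scm_mb_read_ok)
        return -2;
      pending = 1 - room;
      if (write_b (NULL, &drain, &pending, outbuf, outbytesleft)
          == scm_mb_write_too_big)
        {
          /* Un-read the character so no text is lost in the pivot.  */
          *inbuf = saved_in;
          *inbytesleft = saved_left;
          return -1;
        }
    }
  return 0;
}
```

Rewinding the input this way is only safe for stateless encodings; a
stateful read function would also have to restore its shift state.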

For stateful encodings, a read or write function can hang its own data
structures off the conversion object, and provide its own functions to
allocate and destroy them; this allows read and write functions to
maintain whatever state they like.

The Guile conversion back end represents each available encoding with a
@code{struct scm_mb_encoding} object.

@deftp {Libguile Type} {struct scm_mb_encoding}
This data structure describes an encoding.  It has the following
members:

@table @code

@item char **names
An array of strings, giving the various names for this encoding.  The
array should be terminated by a zero pointer.  Case is not significant
in encoding names.

The @code{scm_mb_iconv_open} function searches the list of registered
encodings for an encoding whose @code{names} array matches its
@var{tocode} or @var{fromcode} argument.

@item int (*init) (void **@var{cookie})
An initialization function for the encoding's private data.
@code{scm_mb_iconv_open} will call this function, passing it the address
of the cookie for this encoding in this context.  (We explain cookies
below.)  There is no way for the @code{init} function to tell whether
the encoding will be used for reading or writing.

Note that @code{init} receives a @emph{pointer} to the cookie, not the
cookie itself.  Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.

The @code{init} member may be zero, indicating that no initialization is
necessary for this encoding.

@item int (*destroy) (void **@var{cookie})
A deallocation function for the encoding's private data.
@code{scm_mb_iconv_close} calls this function, passing it the address of
the cookie for this encoding in this context.  The @code{destroy}
function should free any data the @code{init} function allocated.

Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
cookie itself.  Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.

The @code{destroy} member may be zero, indicating that this encoding
doesn't need to perform any special action to destroy its local data.

@item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
Put the encoding into its initial shift state.  Guile calls this
function whether the encoding is being used for input or output, so this
should take appropriate steps for both directions.  If @var{outbuf} and
@var{outbytesleft} are valid, the reset function should emit an escape
sequence to reset the output stream to its initial state; @var{outbuf}
and @var{outbytesleft} should be handled just as for
@code{scm_mb_iconv}.

This function can return an @code{scm_mb_iconv_} error code
(@pxref{Exchanging Guile Text With the Outside World in C}).  If it
returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
state must be left unchanged.

Note that @code{reset} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

The @code{reset} member may be zero, indicating that this encoding
doesn't use a shift state.

@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
Read some bytes and convert into an array of Guile characters.  This is
the encoding's read function.

On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
be converted, and *@var{outcharsleft} characters available at
*@var{outbuf} to hold the results.

On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
still not consumed.  *@var{outcharsleft} and *@var{outbuf} indicate the
output buffer space still not filled.  (By exclusion, these indicate
which input bytes were consumed, and which output characters were
produced.)

Return one of the @code{enum scm_mb_read_result} values, described below.

Note that @code{read} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert an array of Guile characters to output bytes.  This is
the encoding's write function.

On entry, there are *@var{incharsleft} Guile characters available at
*@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
*@var{outbuf}.

On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of
Guile characters left unconverted (because there was insufficient room
in the output buffer to hold their converted forms), and
*@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the
output buffer.

Return one of the @code{enum scm_mb_write_result} values, described below.

Note that @code{write} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

@item struct scm_mb_encoding *next
This is used by Guile to maintain a linked list of encodings.  It is
filled in when you call @code{scm_mb_register_encoding} to add your
encoding to the list.

@end table
@end deftp

Here is the enumerated type for the values an encoding's read function
can return:

@deftp {Libguile Type} {enum scm_mb_read_result}
This type represents the result of a call to an encoding's read
function.  It has the following values:

@table @code

@item scm_mb_read_ok
The read function consumed at least one byte of input.

@item scm_mb_read_incomplete
The data present in the input buffer does not contain a complete
character encoding.  No input was consumed, and no characters were
produced as output.  This is not necessarily an error status, if there
is more data to pass through.

@item scm_mb_read_error
The input contains an invalid character encoding.

@end table
@end deftp

Here is the enumerated type for the values an encoding's write function
can return:

@deftp {Libguile Type} {enum scm_mb_write_result}
This type represents the result of a call to an encoding's write
function.  It has the following values:

@table @code

@item scm_mb_write_ok
The write function was able to convert all the characters in @var{inbuf}
successfully.

@item scm_mb_write_too_big
The write function filled the output buffer, but there are still
characters in @var{inbuf} left unconsumed; @var{inbuf} and
@var{incharsleft} indicate the unconsumed portion of the input buffer.

@end table
@end deftp
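
To make the result codes concrete, here is a sketch of a read and a
write function for a made-up encoding that stores every character as
two bytes, most significant byte first.  The @code{demo_} names and the
@code{scm_char_t} typedef are assumptions, as are two policy choices:
the read function rejects code units in the range 0xD800 to 0xDFFF, and
the write function substitutes @samp{?} for characters that do not fit
in two bytes.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long scm_char_t;   /* assumption: stand-in for libguile's type */

enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

enum scm_mb_read_result
demo_ucs2be_read (void *cookie, const char **inbuf, size_t *inbytesleft,
                  scm_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                /* stateless: the cookie is unused */
  assert (*outcharsleft > 0);   /* the caller must supply output room */

  while (*inbytesleft >= 2 && *outcharsleft > 0)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;
      scm_char_t c = ((scm_char_t) p[0] << 8) | p[1];

      /* Stop before an invalid unit; the pointers still describe
         whatever was converted ahead of it.  */
      if (c >= 0xD800 && c <= 0xDFFF)
        return scm_mb_read_error;

      *(*outbuf)++ = c;
      (*outcharsleft)--;
      *inbuf += 2;
      *inbytesleft -= 2;
      consumed += 2;
    }

  /* A lone trailing byte counts as incomplete only if nothing else
     was consumed during this call.  */
  return consumed > 0 ? scm_mb_read_ok : scm_mb_read_incomplete;
}

enum scm_mb_write_result
demo_ucs2be_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
                   char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      scm_char_t c = **inbuf;
      unsigned char *q = (unsigned char *) *outbuf;

      if (*outbytesleft < 2)
        return scm_mb_write_too_big;    /* character stays in inbuf */
      if (c > 0xFFFF)
        c = '?';                        /* not representable here */
      q[0] = (unsigned char) ((c >> 8) & 0xFF);
      q[1] = (unsigned char) (c & 0xFF);
      *outbuf += 2;
      *outbytesleft -= 2;
      (*inbuf)++;
      (*incharsleft)--;
    }
  return scm_mb_write_ok;
}
```

Neither function keeps text anywhere except in the caller's buffers, so
the pointers alone tell the caller how far the conversion got.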


Conversions to or from stateful encodings need to keep track of each
encoding's current state.  Each conversion context contains two
@code{void *} variables called @dfn{cookies}, one for the input
encoding, and one for the output encoding.  These cookies are passed to
the encodings' functions, for them to use however they please.  A
stateful encoding can use its cookie to hold a pointer to some object
which maintains the context's current shift state.  Stateless encodings
will probably not use their cookies.

The cookies' lifetime is the same as that of the context object.  When
the user calls @code{scm_mb_iconv_close} to destroy a context object,
@code{scm_mb_iconv_close} calls the input and output encodings'
@code{destroy} functions, passing them their respective cookies, so each
encoding can free any data it allocated for that context.
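
The cookie's life cycle might look like the following sketch for a
made-up encoding with a single shift flag.  All @code{demo_} names are
hypothetical, as are the numeric status values, the convention that
@code{init} and @code{destroy} return zero on success, the reading of
``valid'' as ``not a null pointer'', and the one-byte escape sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: the real status codes come from libguile
   and their numeric values are not specified here.  */
enum { demo_ok = 0, demo_too_big = -1 };

/* Private state for a made-up encoding with two shift states.  */
struct demo_state
{
  int shifted;
};

/* init receives the ADDRESS of the cookie and stores into it.  */
int
demo_init (void **cookie)
{
  struct demo_state *s = malloc (sizeof *s);

  if (s == NULL)
    return -1;                  /* assumption: nonzero reports failure */
  s->shifted = 0;
  *cookie = s;
  return 0;
}

/* destroy also receives the address, and frees what init allocated.  */
int
demo_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* reset receives the cookie's VALUE.  When an output buffer is
   supplied and the stream is shifted, emit a one-byte escape.  */
int
demo_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct demo_state *s = cookie;

  assert (s != NULL);
  if (outbuf != NULL && outbytesleft != NULL && s->shifted)
    {
      if (*outbytesleft < 1)
        return demo_too_big;    /* shift state left unchanged */
      *(*outbuf)++ = 0x0F;
      (*outbytesleft)--;
    }
  s->shifted = 0;
  return demo_ok;
}
```

Because @code{init} and @code{destroy} take a @code{void **} while
@code{reset} takes a @code{void *}, mixing the two up compiles without
complaint; the sketch keeps the distinction visible in the comments.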

Note that, if a read or write function returns a successful result code
like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
input and the output must together represent the complete
input text; the encoding may not store any text temporarily in its
cookie.  This is because, if @code{scm_mb_iconv} returns a successful
result to the user, it is correct for the user to assume that all the
consumed input has been converted and placed in the output buffer.
There is no ``flush'' operation to push any final results out of the
encodings' buffers.

Here is the function you call to register a new encoding with the
conversion system:

@deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
Add the encoding described by @code{*@var{encoding}} to the set
understood by @code{scm_mb_iconv_open}.  Once you have registered your
encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
the names in @code{@var{encoding}->names}.
@end deftypefn
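
One plausible shape for such a registry, given the @code{names} and
@code{next} members described earlier, is a singly linked list searched
without regard to case.  The sketch below is a guess at that shape, not
Guile's code: @code{struct demo_encoding} is a reduced stand-in for
@code{struct scm_mb_encoding}, and every @code{demo_} function is
hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reduced stand-in for struct scm_mb_encoding: only the members the
   registry itself needs.  */
struct demo_encoding
{
  char **names;                 /* zero-terminated array of names */
  struct demo_encoding *next;   /* filled in by demo_register_encoding */
};

static struct demo_encoding *demo_encodings = NULL;

/* Compare two names, ignoring case.  */
static int
demo_names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;              /* equal only if both ended together */
}

/* Push ENCODING onto the front of the list.  */
void
demo_register_encoding (struct demo_encoding *encoding)
{
  assert (encoding != NULL && encoding->names != NULL);
  encoding->next = demo_encodings;
  demo_encodings = encoding;
}

/* Return the registered encoding that answers to NAME, or NULL.  */
struct demo_encoding *
demo_find_encoding (const char *name)
{
  struct demo_encoding *e;
  char **n;

  for (e = demo_encodings; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; n++)
      if (demo_names_equal (*n, name))
        return e;
  return NULL;
}
```

Since the list stores the caller's pointer rather than a copy, the
structure passed to the registration function has to outlive every
later lookup.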


@node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
@section Multibyte Text Processing Errors

This section describes error conditions which code can signal to
indicate problems encountered while processing multibyte text.  In each
case, the arguments @var{message} and @var{args} are an error format
string and arguments to be substituted into the string, as accepted by
the @code{display-error} function.

@deffn Condition text:not-char-boundary func message args object offset
By calling @var{func}, the program attempted to access a character at
byte offset @var{offset} in the Guile object @var{object}, but
@var{offset} is not the start of a character's encoding in @var{object}.

Typically, @var{object} is a string or symbol.  If the function signalling
the error cannot find the Guile object that contains the text it is
inspecting, it should use @code{#f} for @var{object}.
@end deffn

@deffn Condition text:bad-encoding func message args object
By calling @var{func}, the program attempted to interpret the text in
@var{object}, but @var{object} contains a byte sequence which is not a
valid encoding for any character.
@end deffn

@deffn Condition text:not-guile-char func message args number
By calling @var{func}, the program attempted to treat @var{number} as the
number of a character in the Guile character set, but @var{number} does
not correspond to any character in the Guile character set.
@end deffn

@deffn Condition text:unknown-conversion func message args from to
By calling @var{func}, the program attempted to convert from an encoding
named @var{from} to an encoding named @var{to}, but Guile does not
support such a conversion.
@end deffn

@deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
@deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
@deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the
corresponding condition symbols above.  You can use these when signalling
these errors, instead of looking them up yourself.
@end deftypevr


@node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
@section Why Guile Does Not Use a Fixed-Width Encoding

Multibyte encodings are clumsier to work with than encodings which use a
fixed number of bytes for every character.  For example, using a
fixed-width encoding, we can extract the @var{i}th character of a string
in constant time, and we can always substitute the @var{i}th character
of a string with any other character without reallocating or copying the
string.

However, there are no fixed-width encodings which include the characters
we wish to include, and also fit in a reasonable amount of space.
Despite the Unicode standard's claims to the contrary, Unicode is not
really a fixed-width encoding.  Unicode uses surrogate pairs to
represent characters outside the 16-bit range; a surrogate pair must be
treated as a single character, but occupies two 16-bit spaces.  As of
this writing, there are already plans to assign characters to the
surrogate character codes.  Three- and four-byte encodings are
too wasteful for a majority of Guile's users, who only need @sc{ASCII}
and a few accented characters.
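
The effect of surrogate pairs on indexing can be seen in a small
sketch.  The function below counts the characters in an array of 16-bit
code units, and it has to scan the whole array to do so, because each
surrogate pair occupies two units but counts as one character.  The
function name is made up for this illustration; the surrogate ranges
are those defined by the Unicode standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the characters in N_UNITS 16-bit code units.  An unpaired
   surrogate is counted as one character of its own.  */
size_t
demo_utf16_length (const uint16_t *units, size_t n_units)
{
  size_t i = 0, chars = 0;

  assert (units != NULL || n_units == 0);
  while (i < n_units)
    {
      /* A high surrogate followed by a low surrogate is one character.  */
      if (units[i] >= 0xD800 && units[i] <= 0xDBFF
          && i + 1 < n_units
          && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
        i += 2;
      else
        i += 1;
      chars++;
    }
  return chars;
}
```

For instance, the four code units 0x0041, 0xD834, 0xDD1E, 0x0042 hold
three characters, so the character count cannot be read off the array
size.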

Another alternative would be to have several different fixed-width
string representations, each with a different element size.  For each
string, Guile would use the smallest element size capable of
accommodating the string's text.  This would allow users of English and
the Western European languages to use the traditional memory-efficient
encodings.  However, if Guile has @var{n} string representations, then
users must write @var{n} versions of any code which manipulates text
directly, one for each element size.  And if a user wants to operate
on two strings simultaneously, and wants to avoid testing the string
sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
Most users will simply not bother.  Instead, they will write code which
supports only one string size, leaving us back where we started.  By
using a single internal representation, Guile makes it easier for users
to write multilingual code.

[[What about tagging each string with its encoding?
"Every extension must be written to deal with every encoding"]]

[[You don't really want to index strings anyway.]]

Finally, Guile's multibyte encoding is not so bad.  Unlike a two- or
four-byte encoding, it is efficient in space for American and European
users.  Furthermore, the properties described above mean that many
functions can be coded just as they would for a single-byte encoding;
see @ref{Promised Properties of the Guile Multibyte Encoding}.

@bye