Commit | Line | Data |
---|---|---|
38a93523 NJ |
1 | \input texinfo |
2 | @setfilename mbapi.info | |
3 | @settitle Multibyte API | |
4 | @setchapternewpage off | |
5 | ||
6 | @c Open issues: | |
7 | ||
8 | @c What's the best way to report errors? Should functions return a | |
9 | @c magic value, according to C tradition, or should they signal a | |
10 | @c Guile exception? | |
11 | ||
12 | @c | |
13 | ||
14 | ||
15 | @node Working With Multibyte Strings in C | |
16 | @chapter Working With Multibyte Strings in C | |
17 | ||
18 | Guile allows strings to contain characters drawn from a wide variety of | |
19 | languages, including many Asian, Eastern European, and Middle Eastern | |
20 | languages, in a uniform and unrestricted way. The string representation | |
21 | normally used in C code --- an array of @sc{ASCII} characters --- is not | |
22 | sufficient for Guile strings, since they may contain characters not | |
23 | present in @sc{ASCII}. | |
24 | ||
25 | Instead, Guile uses a very large character set, and encodes each | |
26 | character as a sequence of one or more bytes. We call this | |
27 | variable-width encoding a @dfn{multibyte} encoding. Guile uses this | |
28 | single encoding internally for all strings, symbol names, error | |
29 | messages, etc., and performs appropriate conversions upon input and | |
30 | output. | |
31 | ||
32 | The use of this variable-width encoding is almost invisible to Scheme | |
33 | code. Strings are still indexed by character number, not by byte | |
34 | offset; @code{string-length} still returns the length of a string in | |
35 | characters, not in bytes. @code{string-ref} and @code{string-set!} are | |
36 | no longer guaranteed to be constant-time operations, but Guile uses | |
37 | various strategies to reduce the impact of this change. | |
38 | ||
39 | However, the encoding is visible via Guile's C interface, which gives | |
40 | the user direct access to a string's bytes. This chapter explains how | |
41 | to work with Guile multibyte text in C code. Since variable-width | |
42 | encodings are clumsier to work with than simple fixed-width encodings, | |
43 | Guile provides a set of standard macros and functions for manipulating | |
44 | multibyte text to make the job easier. Furthermore, Guile makes some | |
45 | promises about the encoding which you can use in writing your own text | |
46 | processing code. | |
47 | ||
48 | While we discuss guaranteed properties of Guile's encoding, and provide | |
49 | functions to operate on its character set, we do not actually specify | |
50 | either the character set or encoding here. This is because we expect | |
51 | both of them to change in the future: currently, Guile uses the same | |
52 | encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs | |
53 | as well) to use Unicode and UTF-8, with some extensions. This will make | |
54 | it more comfortable to use Guile with other systems which use UTF-8, | |
55 | like the GTK user interface toolkit. | |
56 | ||
57 | @menu | |
58 | * Multibyte String Terminology:: | |
59 | * Promised Properties of the Guile Multibyte Encoding:: | |
60 | * Functions for Operating on Multibyte Text:: | |
61 | * Multibyte Text Processing Errors:: | |
62 | * Why Guile Does Not Use a Fixed-Width Encoding:: | |
63 | @end menu | |
64 | ||
65 | ||
66 | @node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C | |
67 | @section Multibyte String Terminology | |
68 | ||
69 | In the descriptions which follow, we make the following definitions: | |
70 | @table @dfn | |
71 | ||
72 | @item byte | |
73 | A @dfn{byte} is a number between 0 and 255. It has no inherent textual | |
74 | interpretation. So 65 is a byte, not a character. | |
75 | ||
76 | @item character | |
77 | A @dfn{character} is a unit of text. It has no inherent numeric value. | |
78 | @samp{A} and @samp{.} are characters, not bytes. (This is different | |
79 | from the C language's definition of @dfn{character}; in this chapter, we | |
80 | will always use a phrase like ``the C language's @code{char} type'' when | |
81 | that's what we mean.) | |
82 | ||
83 | @item character set | |
84 | A @dfn{character set} is an invertible mapping between numbers and a | |
85 | given set of characters. @sc{ASCII} is a character set assigning | |
86 | characters to the numbers 0 through 127. It maps @samp{A} onto the | |
87 | number 65, and @samp{.} onto 46. | |
88 | ||
89 | Note that a character set maps characters onto numbers, @emph{not | |
90 | necessarily} onto bytes. For example, the Unicode character set maps | |
91 | the Greek lower-case @samp{alpha} character onto the number 945, which | |
92 | is not a byte. | |
93 | ||
94 | (This is what Internet standards would call a ``coded character set''.) | |
95 | ||
96 | @item encoding | |
97 | An encoding maps numbers onto sequences of bytes. For example, the | |
98 | UTF-8 encoding, defined in the Unicode Standard, would map the number | |
99 | 945 onto the sequence of bytes @samp{206 177}. When using the | |
100 | @sc{ASCII} character set, every number assigned also happens to be a | |
101 | byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes. | |
102 | ||
103 | (This is what Internet standards would call a ``character encoding | |
104 | scheme''.) | |
105 | ||
106 | @end table | |
107 | ||
108 | Thus, to turn a character into a sequence of bytes, you need a character | |
109 | set to assign a number to that character, and then an encoding to turn | |
110 | that number into a sequence of bytes. | |
111 | ||
112 | Likewise, to interpret a sequence of bytes as a sequence of characters, | |
113 | you use an encoding to extract a sequence of numbers from the bytes, and | |
114 | then a character set to turn the numbers into characters. | |
115 | ||
116 | Errors can occur while carrying out either of these processes. Under a | |
117 | particular encoding, a given sequence of bytes might not correspond to | |
118 | any number. For example, the byte sequence @samp{128 128} is not a | |
119 | valid encoding of any number under UTF-8. | |
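
The two-step mapping described above (character set, then encoding) can be sketched with UTF-8, which is the encoding used in this section's examples. This is an illustration only: Guile's own encoding is deliberately left unspecified here, and the name @code{utf8_encode} is hypothetical, not part of libguile.

```c
#include <stddef.h>

/* Illustrative sketch, not a libguile function: encode the number `c`
   (a Unicode code point) as UTF-8 into `out`, which must have room for
   at least 4 bytes.  Return the number of bytes written, or 0 if `c`
   cannot be encoded (out of range, or a surrogate). */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {                       /* ASCII: one byte, unchanged */
        out[0] = (unsigned char) c;
        return 1;
    }
    if (c < 0x800) {                      /* two bytes */
        out[0] = (unsigned char) (0xC0 | (c >> 6));
        out[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {                    /* three bytes */
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (c >> 12));
        out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char) (0xF0 | (c >> 18));
        out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling @code{utf8_encode (945, buf)} stores the bytes 206 and 177 in @code{buf} and returns 2, matching the Greek alpha example given earlier.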
120 | ||
121 | Having carefully defined our terminology, we will now abuse it. | |
122 | ||
123 | We will sometimes use the word @dfn{character} to refer to the number | |
124 | assigned to a character by a character set, in contexts where it's | |
125 | obvious we mean a number. | |
126 | ||
127 | Sometimes there is a close association between a particular encoding and | |
128 | a particular character set. Thus, we may sometimes refer to the | |
129 | character set and encoding together as an @dfn{encoding}. | |
130 | ||
131 | ||
132 | @node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C | |
133 | @section Promised Properties of the Guile Multibyte Encoding | |
134 | ||
135 | Internally, Guile uses a single encoding for all text --- symbols, | |
136 | strings, error messages, etc. Here we list a number of helpful | |
137 | properties of Guile's encoding. It is correct to write code which | |
138 | assumes these properties; code which uses these assumptions will be | |
139 | portable to all future versions of Guile, as far as we know. | |
140 | ||
141 | @b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in | |
142 | the obvious way.} This means that a standard C string containing only | |
143 | @sc{ASCII} characters is a valid Guile string (except for the terminator; | |
144 | Guile strings store the length explicitly, so they can contain null | |
145 | characters). | |
146 | ||
147 | @b{The encodings of non-@sc{ASCII} characters use only bytes between 128 | |
148 | and 255.} That is, when we turn a non-@sc{ASCII} character into a | |
149 | series of bytes, none of those bytes can ever be mistaken for the | |
150 | encoding of an @sc{ASCII} character. This means that you can search a | |
151 | Guile string for an @sc{ASCII} character using the standard | |
152 | @code{memchr} library function. By extension, you can search for an | |
153 | @sc{ASCII} substring in a Guile string using a traditional substring | |
154 | search algorithm --- you needn't add special checks to verify encoding | |
155 | boundaries, etc. | |
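
As a minimal sketch of this promise, the helper below searches multibyte text for an @sc{ASCII} character with plain @code{memchr}. The name @code{find_ascii} is hypothetical, and any test data would have to come from an encoding that keeps this promise; UTF-8 is one such encoding and is assumed here only as a stand-in for Guile's.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of the ASCII character `ascii` in the `len` bytes at `text`, or -1
   if it does not occur.  No decoding is needed, because bytes below
   128 never appear inside the encoding of a non-ASCII character. */
long find_ascii(const unsigned char *text, size_t len, char ascii)
{
    const unsigned char *hit = memchr(text, ascii, len);
    return hit ? (long) (hit - text) : -1;
}
```

Note that the result is a byte offset, not a character index.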
156 | ||
157 | @b{No character encoding is a subsequence of any other character | |
158 | encoding.} (This is just a stronger version of the previous promise.) | |
159 | This means that you can search for occurrences of one Guile string | |
160 | within another Guile string just as if they were raw byte strings. You | |
161 | can use the stock @code{memmem} function (provided on GNU systems, at | |
162 | least) for such searches. If you don't need the ability to represent | |
163 | null characters in your text, you can still use null-termination for | |
164 | strings, and use the traditional string-handling functions like | |
165 | @code{strlen}, @code{strstr}, and @code{strcat}. | |
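
For null-terminated text, the same idea extends to whole substrings. The sketch below uses the standard @code{strstr}; the function name @code{find_substring} is hypothetical, and the example again assumes an encoding (such as UTF-8) in which one character's encoding never appears inside another's.

```c
#include <string.h>

/* Hypothetical helper: return the byte offset of `needle` within the
   null-terminated multibyte string `haystack`, or -1 if it is absent.
   A byte-level match is always a character-level match, given the
   promise that no encoding is a subsequence of another. */
long find_substring(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long) (hit - haystack) : -1;
}
```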
166 | ||
167 | @b{You can always determine the full length of a character's encoding | |
168 | from its first byte.} Guile provides the macro @code{scm_mb_len} which | |
169 | computes the encoding's length from its first byte. Given the first | |
170 | rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <= | |
171 | @var{b} <= 127}, returns 1. | |
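
To make the idea concrete, here is what a length-from-first-byte lookup looks like for UTF-8. This is not the definition of @code{scm_mb_len}, whose encoding is unspecified; @code{utf8_len_from_first_byte} is a hypothetical stand-in, and unlike @code{scm_mb_len} it returns 0 for an invalid leading byte rather than leaving the behavior undefined.

```c
/* Hypothetical UTF-8 analogue of a length-from-first-byte lookup.
   Return the length in bytes of the encoding that starts with `b`,
   or 0 if `b` cannot start a character in UTF-8. */
int utf8_len_from_first_byte(unsigned char b)
{
    if (b < 0x80)
        return 1;                 /* ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;                     /* continuation byte or invalid lead */
}
```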
172 | ||
173 | @b{Given an arbitrary byte position in a Guile string, you can always | |
174 | find the beginning and end of the character containing that byte without | |
175 | scanning too far in either direction.} This means that, if you are sure | |
176 | a byte sequence is a valid encoding of a character sequence, you can | |
177 | find character boundaries without keeping track of the beginning and | |
178 | ending of the overall string. This promise relies on the fact that, in | |
179 | addition to storing the string's length explicitly, Guile always either | |
180 | terminates the string's storage with a zero byte, or shares it with | |
181 | another string which is terminated this way. | |
182 | ||
183 | ||
184 | @node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C | |
185 | @section Functions for Operating on Multibyte Text | |
186 | ||
187 | Guile provides a variety of functions, variables, and types for working | |
188 | with multibyte text. | |
189 | ||
190 | @menu | |
191 | * Basic Multibyte Character Processing:: | |
192 | * Finding Character Encoding Boundaries:: | |
193 | * Multibyte String Functions:: | |
194 | * Exchanging Guile Text With the Outside World in C:: | |
195 | * Implementing Your Own Text Conversions:: | |
196 | @end menu | |
197 | ||
198 | ||
199 | @node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text | |
200 | @subsection Basic Multibyte Character Processing | |
201 | ||
202 | Here are the essential types and functions for working with Guile text. | |
203 | Guile uses the C type @code{unsigned char *} to refer to text encoded | |
204 | with Guile's encoding. | |
205 | ||
206 | Note that any operation marked here as a ``Libguile Macro'' might | |
207 | evaluate its argument multiple times. | |
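
The hazard of multiple evaluation is easiest to see with an argument that has a side effect. The macro @code{MB_LEN_SKETCH} below is hypothetical and exists only for this illustration; it is not how any libguile macro is defined.

```c
#include <stddef.h>

/* Hypothetical macro, for illustration only: it mentions its argument
   more than once, so the argument expression may run more than once. */
#define MB_LEN_SKETCH(b) ((b) < 0x80 ? 1 : ((b) < 0xE0 ? 2 : 3))

/* Unsafe: the side effect in `*q++` runs once per evaluation of `b`,
   so the pointer may advance twice and the length is computed from
   the wrong byte. */
size_t advance_count_unsafe(const unsigned char *p)
{
    const unsigned char *q = p;
    int len = MB_LEN_SKETCH(*q++);
    (void) len;
    return (size_t) (q - p);      /* 1 or 2, depending on the byte */
}

/* Safe: perform the side effect once, then pass a plain variable. */
size_t advance_count_safe(const unsigned char *p)
{
    const unsigned char *q = p;
    unsigned char b = *q++;
    int len = MB_LEN_SKETCH(b);
    (void) len;
    return (size_t) (q - p);      /* always 1 */
}
```

When in doubt, store the byte or pointer in a local variable first, or use the corresponding function form.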
208 | ||
209 | @deftp {Libguile Type} scm_char_t | |
210 | This is a signed integral type large enough to hold any character in | |
211 | Guile's character set. All character numbers are non-negative. | |
212 | @end deftp | |
213 | ||
214 | @deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p}) | |
215 | Return the character whose encoding starts at @var{p}. If @var{p} does | |
216 | not point at a valid character encoding, the behavior is undefined. | |
217 | @end deftypefn | |
218 | ||
219 | @deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c}) | |
220 | Place the encoded form of the Guile character @var{c} at @var{p}, and | |
221 | return its length in bytes. If @var{c} is not a Guile character, the | |
222 | behavior is undefined. | |
223 | @end deftypefn | |
224 | ||
225 | @deftypevr {Libguile Constant} int scm_mb_max_len | |
226 | The maximum length of any character's encoding, in bytes. You may | |
227 | assume this is relatively small --- less than a dozen or so. | |
228 | @end deftypevr | |
229 | ||
230 | @deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b}) | |
231 | If @var{b} is the first byte of a character's encoding, return the full | |
232 | length of the character's encoding, in bytes. If @var{b} is not a valid | |
233 | leading byte, the behavior is undefined. | |
234 | @end deftypefn | |
235 | ||
236 | @deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c}) | |
237 | Return the length of the encoding of the character @var{c}, in bytes. | |
238 | If @var{c} is not a valid Guile character, the behavior is undefined. | |
239 | @end deftypefn | |
240 | ||
241 | @deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p}) | |
242 | @deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c}) | |
243 | @deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b}) | |
244 | @deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c}) | |
245 | These are functions identical to the corresponding macros. You can use | |
246 | them in situations where the overhead of a function call is acceptable, | |
247 | and the cleaner semantics of function application are desirable. | |
248 | @end deftypefn | |
249 | ||
250 | ||
251 | @node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text | |
252 | @subsection Finding Character Encoding Boundaries | |
253 | ||
254 | These are functions for finding the boundaries between characters in | |
255 | multibyte text. | |
256 | ||
257 | Note that any operation marked here as a ``Libguile Macro'' might | |
258 | evaluate its argument multiple times, unless the definition promises | |
259 | otherwise. | |
260 | ||
261 | @deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p}) | |
262 | Return non-zero iff @var{p} points to the start of a character in | |
263 | multibyte text. | |
264 | ||
265 | This macro will evaluate its argument only once. | |
266 | @end deftypefn | |
267 | ||
268 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p}) | |
269 | ``Round'' @var{p} to the previous character boundary. That is, if | |
270 | @var{p} points to the middle of the encoding of a Guile character, | |
271 | return a pointer to the first byte of the encoding. If @var{p} points | |
272 | to the start of the encoding of a Guile character, return @var{p} | |
273 | unchanged. | |
274 | @end deftypefn | |
275 | ||
276 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p}) | |
277 | ``Round'' @var{p} to the next character boundary. That is, if @var{p} | |
278 | points to the middle of the encoding of a Guile character, return a | |
279 | pointer to the first byte of the encoding of the next character. If | |
280 | @var{p} points to the start of the encoding of a Guile character, return | |
281 | @var{p} unchanged. | |
282 | @end deftypefn | |
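
The rounding behavior of these two functions can be sketched for UTF-8, where every non-leading byte has the bit pattern @code{10xxxxxx}. The names @code{utf8_floor} and @code{utf8_ceiling} are hypothetical stand-ins, not the libguile implementations, and they assume that @var{p} points into valid text that is terminated by a zero byte.

```c
/* In UTF-8, bytes of the form 10xxxxxx never start a character. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical UTF-8 analogue of rounding down to a character
   boundary.  Assumes `p` points into valid text, so a leading byte
   is found after stepping back at most three bytes. */
const unsigned char *utf8_floor(const unsigned char *p)
{
    while (is_continuation(*p))
        p--;
    return p;
}

/* Hypothetical UTF-8 analogue of rounding up to a character
   boundary.  Assumes the text ends with a zero byte, which is not a
   continuation byte and therefore stops the scan. */
const unsigned char *utf8_ceiling(const unsigned char *p)
{
    while (is_continuation(*p))
        p++;
    return p;
}
```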
283 | ||
284 | Note that it is usually not friendly for functions to silently correct | |
285 | byte offsets that point into the middle of a character's encoding. Such | |
286 | offsets almost always indicate a programming error, and they should be | |
287 | reported as early as possible. So, when you write code which operates | |
288 | on multibyte text, you should not use functions like these to ``clean | |
289 | up'' byte offsets which the originator believes to be correct; instead, | |
290 | your code should signal a @code{text:not-char-boundary} error as soon as | |
291 | it detects an invalid offset. @xref{Multibyte Text Processing Errors}. | |
292 | ||
293 | ||
294 | @node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text | |
295 | @subsection Multibyte String Functions | |
296 | ||
297 | These functions allow you to operate on multibyte strings: sequences of | |
298 | character encodings. | |
299 | ||
300 | @deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len}) | |
301 | Return the number of Guile characters encoded by the @var{len} bytes at | |
302 | @var{p}. | |
303 | ||
304 | If the sequence contains any invalid character encodings, or ends with | |
305 | an incomplete character encoding, signal a @code{text:bad-encoding} | |
306 | error. | |
307 | @end deftypefn | |
308 | ||
309 | @deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp}) | |
310 | Return the character whose encoding starts at @code{*@var{pp}}, and | |
311 | advance @code{*@var{pp}} to the start of the next character. Return -1 | |
312 | if @code{*@var{pp}} does not point to a valid character encoding. | |
313 | @end deftypefn | |
314 | ||
315 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p}) | |
316 | If @var{p} points to the middle of the encoding of a Guile character, | |
317 | return a pointer to the first byte of the encoding. If @var{p} points | |
318 | to the start of the encoding of a Guile character, return the start of | |
319 | the previous character's encoding. | |
320 | ||
321 | This is like @code{scm_mb_floor}, but the returned pointer will always | |
322 | be before @var{p}. If you use this function to drive an iteration, it | |
323 | guarantees backward progress. | |
324 | @end deftypefn | |
325 | ||
326 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p}) | |
327 | If @var{p} points to the encoding of a Guile character, return a pointer | |
328 | to the first byte of the encoding of the next character. | |
329 | ||
330 | This is like @code{scm_mb_ceiling}, but the returned pointer will always | |
331 | be after @var{p}. If you use this function to drive an iteration, it | |
332 | guarantees forward progress. | |
333 | @end deftypefn | |
334 | ||
335 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i}) | |
336 | Assuming that the @var{len} bytes starting at @var{p} are a | |
337 | concatenation of valid character encodings, return a pointer to the | |
338 | start of the @var{i}'th character encoding in the sequence. | |
339 | ||
340 | This function scans the sequence from the beginning to find the | |
341 | @var{i}'th character, and will generally require time proportional to | |
342 | the distance from @var{p} to the returned address. | |
343 | ||
344 | If the sequence contains any invalid character encodings, or ends with | |
345 | an incomplete character encoding, signal a @code{text:bad-encoding} | |
346 | error. | |
347 | @end deftypefn | |
348 | ||
349 | It is common to process the characters in a string from left to right. | |
350 | However, if you fetch each character using @code{scm_mb_index}, each | |
351 | call will scan the text from the beginning, so your loop will require | |
352 | time proportional to at least the square of the length of the text. To | |
353 | avoid this poor performance, you can use an @code{scm_mb_cache} | |
354 | structure and the @code{scm_mb_index_cached} macro. | |
355 | ||
356 | @deftp {Libguile Type} {struct scm_mb_cache} | |
357 | This structure holds information that allows a string scanning operation | |
358 | to use the results from a previous scan of the string. It has the | |
359 | following members: | |
360 | @table @code | |
361 | ||
362 | @item character | |
363 | An index, in characters, into the string. | |
364 | ||
365 | @item byte | |
366 | The index, in bytes, of the start of that character. | |
367 | ||
368 | @end table | |
369 | ||
370 | In other words, @code{byte} is the byte offset of the | |
371 | @code{character}'th character of the string. Note that if @code{byte} | |
372 | and @code{character} are equal, then all characters before that point | |
373 | must have encodings exactly one byte long, and the string can be indexed | |
374 | normally. | |
375 | ||
376 | All elements of a @code{struct scm_mb_cache} structure should be | |
377 | initialized to zero before its first use, and whenever the string's text | |
378 | changes. | |
379 | @end deftp | |
380 | ||
381 | @deftypefn {Libguile Macro} const unsigned char *scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache}) | |
382 | @deftypefnx {Libguile Function} const unsigned char *scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache}) | |
383 | This macro and this function are identical to @code{scm_mb_index}, | |
384 | except that they may consult and update @code{*@var{cache}} in order to avoid | |
385 | scanning the string from the beginning. @code{scm_mb_index_cached} is a | |
386 | macro, so it may have less overhead than | |
387 | @code{scm_mb_index_cached_func}, but it may evaluate its arguments more | |
388 | than once. | |
389 | ||
390 | Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you | |
391 | can scan a string from left to right, or from right to left, in time | |
392 | proportional to the length of the string. As long as each character | |
393 | fetched is less than some constant distance before or after the previous | |
394 | character fetched with @var{cache}, each access will require constant | |
395 | time. | |
396 | @end deftypefn | |
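
The caching strategy can be sketched as follows, using UTF-8 as a stand-in encoding. The structure @code{struct index_cache} mirrors the two members described above, but both it and @code{utf8_index_cached} are hypothetical illustrations, not the libguile definitions. Unlike the functions documented here, the sketch returns @code{NULL} for an out-of-range index instead of signalling an error, and it assumes the text is valid.

```c
#include <stddef.h>

/* Hypothetical cache: `byte` is the byte offset of the
   `character`'th character.  Initialize both members to zero. */
struct index_cache {
    size_t character;
    size_t byte;
};

static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Return a pointer to the start of the i'th character (counting from
   zero) in the `len` bytes of valid UTF-8 at `p`, or NULL if `i` is
   out of range.  Scanning starts from the cached position, so nearby
   accesses cost time proportional to the distance moved. */
const unsigned char *utf8_index_cached(const unsigned char *p, size_t len,
                                       size_t i, struct index_cache *cache)
{
    size_t c = cache->character;
    size_t b = cache->byte;

    while (c < i) {                       /* walk forward */
        if (b >= len)
            return NULL;
        b++;
        while (b < len && is_continuation(p[b]))
            b++;
        c++;
    }
    while (c > i) {                       /* walk backward */
        b--;
        while (b > 0 && is_continuation(p[b]))
            b--;
        c--;
    }
    if (b >= len)
        return NULL;                      /* i is past the last character */

    cache->character = c;                 /* remember where we ended up */
    cache->byte = b;
    return p + b;
}
```

A left-to-right loop that calls this with @var{i} = 0, 1, 2, @dots{} moves the cache forward by one character per call, so the whole scan takes time proportional to the length of the text.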
397 | ||
398 | Guile also provides functions to convert between an encoded sequence of | |
399 | characters, and an array of @code{scm_char_t} objects. | |
400 | ||
401 | @deftypefn {Libguile Function} scm_char_t *scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len}) | |
402 | Convert the variable-width text in the @var{len} bytes at @var{p} | |
403 | to an array of @code{scm_char_t} values. Return a pointer to the array, | |
404 | and set @code{*@var{result_len}} to the number of elements it contains. | |
405 | The returned array is allocated with @code{malloc}, and it is the | |
406 | caller's responsibility to free it. | |
407 | ||
408 | If the text is not a sequence of valid character encodings, this | |
409 | function will signal a @code{text:bad-encoding} error. | |
410 | @end deftypefn | |
411 | ||
412 | @deftypefn {Libguile Function} unsigned char *scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len}) | |
413 | Convert the array of @code{scm_char_t} values to a sequence of | |
414 | variable-width character encodings. Return a pointer to the array of | |
415 | bytes, and set @code{*@var{result_len}} to its length, in bytes. | |
416 | ||
417 | The returned byte sequence is terminated with a zero byte, which is not | |
418 | counted in the length returned in @code{*@var{result_len}}. | |
419 | ||
420 | The returned byte sequence is allocated with @code{malloc}; it is the | |
421 | caller's responsibility to free it. | |
422 | ||
423 | If the text is not a sequence of valid character encodings, this | |
424 | function will signal a @code{text:bad-encoding} error. | |
425 | @end deftypefn | |
426 | ||
427 | ||
428 | @node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text | |
429 | @subsection Exchanging Guile Text With the Outside World in C | |
430 | ||
431 | [[This is kind of a heavy-weight model, given that one end of the | |
432 | conversion is always going to be the Guile encoding. Any way to shorten | |
433 | things a bit?]] | |
434 | ||
435 | Guile provides functions for converting between Guile's internal text | |
436 | representation and encodings popular in the outside world. These | |
437 | functions are closely modeled after the @code{iconv} functions available | |
438 | on some systems. | |
439 | ||
440 | To convert text between two encodings, you should first call | |
441 | @code{scm_mb_iconv_open} to indicate the source and destination | |
442 | encodings; this function returns a context object which records the | |
443 | conversion to perform. | |
444 | ||
445 | Then, you should call @code{scm_mb_iconv} to actually convert the text. | |
446 | This function expects input and output buffers, and a pointer to the | |
447 | context you got from @code{scm_mb_iconv_open}. You don't need to pass | |
448 | all your input to @code{scm_mb_iconv} at once; you can invoke it on | |
449 | successive blocks of input (as you read it from a file, say), and it | |
450 | will convert as much as it can each time, indicating when you should | |
451 | grow your output buffer. | |
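
Since these functions are modeled after @code{iconv}, the calling pattern can be illustrated with the POSIX @code{iconv} functions themselves. The sketch below does not use the Guile functions; it assumes a system whose @code{iconv} knows the names @samp{ISO-8859-1} and @samp{UTF-8}, and the helper name @code{latin1_to_utf8} is hypothetical. It converts a single block and does not show growing the output buffer.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper built on POSIX iconv: convert `in_len` bytes of
   ISO-8859-1 text at `in` to UTF-8 in `out`.  Return the number of
   bytes written, or (size_t) -1 if the conversion could not be opened
   or failed (for example, because `out_size` was too small). */
size_t latin1_to_utf8(const char *in, size_t in_len,
                      char *out, size_t out_size)
{
    /* As with scm_mb_iconv_open, the destination encoding comes first. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1)
        return (size_t) -1;

    /* iconv takes char **, but does not modify the input bytes. */
    char *inbuf = (char *) in;
    size_t inbytesleft = in_len;
    char *outbuf = out;
    size_t outbytesleft = out_size;

    /* The call advances both pointers and decrements both counts. */
    size_t rc = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    iconv_close(cd);

    if (rc == (size_t) -1)
        return (size_t) -1;
    return out_size - outbytesleft;
}
```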
452 | ||
453 | An encoding may be @dfn{stateless}, or @dfn{stateful}. In most | |
454 | encodings, a contiguous group of bytes from the sequence completely | |
455 | specifies a particular character; these are stateless encodings. | |
456 | However, some encodings require you to look back an unbounded number of | |
457 | bytes in the stream to assign a meaning to a particular byte sequence; | |
458 | such encodings are stateful. | |
459 | ||
460 | For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the | |
461 | byte sequence @samp{27 36 66} indicates that subsequent bytes should be | |
462 | taken in pairs and interpreted as characters from the JIS-0208 character | |
463 | set. An arbitrary number of byte pairs may follow this sequence. The | |
464 | byte sequence @samp{27 40 66} indicates that subsequent bytes should be | |
465 | interpreted as @sc{ASCII}. In this encoding, you cannot tell whether a | |
466 | given byte is an @sc{ASCII} character without looking back an arbitrary | |
467 | distance for the most recent escape sequence, so it is a stateful | |
468 | encoding. | |
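
The following sketch shows why such text cannot be interpreted without carrying state. It is a deliberately simplified scanner that understands only the two escape sequences mentioned above, not a complete @samp{ISO-2022-JP} decoder, and the names in it are hypothetical.

```c
#include <stddef.h>

enum charset_state { STATE_ASCII, STATE_JIS0208 };

/* Simplified illustration: count the characters in the `len` bytes at
   `p`, recognizing only ESC $ B (27 36 66) and ESC ( B (27 40 66).
   Store the number of ASCII characters in `*ascii_count`. */
size_t count_chars_iso2022jp(const unsigned char *p, size_t len,
                             size_t *ascii_count)
{
    enum charset_state state = STATE_ASCII;   /* initial state */
    size_t total = 0, ascii = 0, i = 0;

    while (i < len) {
        if (p[i] == 27 && i + 2 < len && p[i + 1] == 36 && p[i + 2] == 66) {
            state = STATE_JIS0208;            /* switch to byte pairs */
            i += 3;
        } else if (p[i] == 27 && i + 2 < len
                   && p[i + 1] == 40 && p[i + 2] == 66) {
            state = STATE_ASCII;              /* switch back to ASCII */
            i += 3;
        } else if (state == STATE_JIS0208) {
            if (i + 1 >= len)
                break;                        /* incomplete pair: stop */
            total++;                          /* one two-byte character */
            i += 2;
        } else {
            total++;                          /* one ASCII character */
            ascii++;
            i++;
        }
    }
    *ascii_count = ascii;
    return total;
}
```

In the input @samp{65 27 36 66 36 34 27 40 66 66}, the byte 36 that follows the first escape sequence is half of a two-byte character, not the @sc{ASCII} character @samp{$}; only the state tells the two apart.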
469 | ||
470 | In Guile, if a conversion involves a stateful encoding, the context | |
471 | object carries any necessary state. Thus, you can have many independent | |
472 | conversions to or from stateful encodings taking place simultaneously, | |
473 | as long as each data stream uses its own context object for the | |
474 | conversion. | |
475 | ||
476 | @deftp {Libguile Type} {struct scm_mb_iconv} | |
477 | This is the type for context objects, which represent the encodings and | |
478 | current state of an ongoing text conversion. A @code{struct | |
479 | scm_mb_iconv} records the source and destination encodings, and keeps | |
480 | track of any information needed to handle stateful encodings. | |
481 | @end deftp | |
482 | ||
483 | @deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode}) | |
484 | Return a pointer to a new @code{struct scm_mb_iconv} context object, | |
485 | ready to convert from the encoding named @var{fromcode} to the encoding | |
486 | named @var{tocode}. For stateful encodings, the context object is in | |
487 | some appropriate initial state, ready for use with the | |
488 | @code{scm_mb_iconv} function. | |
489 | ||
490 | When you are done using a context object, you may call | |
491 | @code{scm_mb_iconv_close} to free it. | |
492 | ||
493 | If either @var{tocode} or @var{fromcode} is not the name of a known | |
494 | encoding, this function will signal the @code{text:unknown-conversion} | |
495 | error, described below. | |
496 | ||
497 | @c Try to use names here from the IANA list: | |
498 | @c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets | |
499 | Guile supports at least these encodings: | |
500 | @table @samp | |
501 | ||
502 | @item US-ASCII | |
503 | @sc{US-ASCII}, in the standard one-character-per-byte encoding. | |
504 | ||
505 | @item ISO-8859-1 | |
506 | The usual character set for Western European languages, in its usual | |
507 | one-character-per-byte encoding. | |
508 | ||
509 | @item Guile-MB | |
510 | Guile's current internal multibyte encoding. The actual encoding this | |
511 | name refers to will change from one version of Guile to the next. You | |
512 | should use this when converting data between external sources and the | |
513 | encoding used by Guile objects. | |
514 | ||
515 | You should @emph{not} use this as the encoding for data presented to the | |
516 | outside world, for two reasons. 1) Its meaning will change over time, | |
517 | so data written using the @samp{Guile-MB} encoding with one version of | |
518 | Guile might not be readable with the @samp{Guile-MB} encoding in another | |
519 | version of Guile. 2) It currently corresponds to @samp{Emacs-Mule}, | |
520 | which was invented for Emacs's internal use, and was never intended to serve | |
521 | as an exchange medium. | |
522 | ||
523 | @item Guile-Wide | |
524 | Guile's character set, as an array of @code{scm_char_t} values. | |
525 | ||
526 | Note that this encoding is even less suitable for public use than | |
527 | @samp{Guile-MB}, since the exact sequence of bytes depends heavily on the | |
528 | size and endianness the host system uses for @code{scm_char_t}. Using | |
529 | this encoding is very much like calling the | |
530 | @code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte} | |
531 | functions, except that @code{scm_mb_iconv} gives you more control over | |
532 | buffer allocation and management. | |
533 | ||
534 | @item Emacs-Mule | |
535 | This is the variable-length encoding for multi-lingual text used by GNU | |
536 | Emacs, at least through version 20.4. You probably should not use this | |
537 | encoding, as it is designed only for Emacs's internal use. However, we | |
538 | provide it here because it's trivial to support, and some people | |
539 | probably do have @samp{emacs-mule}-format files lying around. | |
540 | ||
541 | @end table | |
542 | ||
543 | (At the moment, this list doesn't include any character sets suitable for | |
544 | external use that can actually handle multilingual data; this is | |
545 | unfortunate, as it encourages users to write data in Emacs-Mule format, | |
546 | which nobody but Emacs and Guile understands. We hope to add support | |
547 | for Unicode in UTF-8 soon, which should solve this problem.) | |
548 | ||
549 | Case is not significant in encoding names. | |
550 | ||
551 | You can define your own conversions; see @ref{Implementing Your Own Text | |
552 | Conversions}. | |
553 | @end deftypefn | |
554 | ||
555 | @deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding}) | |
556 | Return a non-zero value if Guile supports the encoding named @var{encoding}, and zero otherwise. | |
557 | @end deftypefn | |
558 | ||
559 | @deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) | |
560 | Convert a sequence of characters from one encoding to another. The | |
561 | argument @var{context} specifies the encodings to use for the input and | |
562 | output, and carries state for stateful encodings; use | |
563 | @code{scm_mb_iconv_open} to create a @var{context} object for a | |
564 | particular conversion. | |
565 | ||
566 | Upon entry to the function, @code{*@var{inbuf}} should point to the | |
567 | input buffer, and @code{*@var{inbytesleft}} should hold the number of | |
568 | input bytes present in the buffer; @code{*@var{outbuf}} should point to | |
569 | the output buffer, and @code{*@var{outbytesleft}} should hold the number | |
570 | of bytes available to hold the conversion results in that buffer. | |
571 | ||
572 | Upon exit from the function, @code{*@var{inbuf}} points to the first | |
573 | unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number | |
574 | of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after | |
575 | the last output byte, and @code{*@var{outbytesleft}} holds the number of | |
576 | bytes left unused in the output buffer. | |
577 | ||
578 | For stateful encodings, @var{context} carries encoding state from one | |
579 | call to @code{scm_mb_iconv} to the next. Thus, successive calls to | |
580 | @code{scm_mb_iconv} which use the same context object can convert a | |
581 | stream of data one chunk at a time. | |
582 | ||
583 | If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is | |
584 | taken as a request to reset the states of the input and the output | |
585 | encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is | |
586 | non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output | |
587 | buffer to put the output encoding in its initial state. If the output | |
588 | buffer is not large enough to hold this byte sequence, | |
589 | @code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves | |
590 | the shift states of @var{context}'s input and output encodings | |
591 | unchanged. | |
592 | ||
593 | The @code{scm_mb_iconv} function always consumes only complete | |
594 | characters or shift sequences from the input buffer, and the output | |
595 | buffer always contains a sequence of complete characters or escape | |
596 | sequences. | |
597 | ||
598 | If the input sequence contains characters which are not expressible in | |
599 | the output encoding, @code{scm_mb_iconv} converts them in an | |
600 | implementation-defined way. It may simply delete such characters. | |
601 | ||
602 | Some encodings use byte sequences which do not correspond to any textual | |
603 | character. For example, the escape sequence of a stateful encoding has | |
604 | no textual meaning. When converting from such an encoding, a call to | |
605 | @code{scm_mb_iconv} might consume input but produce no output, since the | |
606 | input sequence might contain only escape sequences. | |
607 | ||
608 | Normally, @code{scm_mb_iconv} returns the number of input characters it | |
609 | could not convert perfectly to the output encoding. However, it may | |
610 | return one of the @code{scm_mb_iconv_} codes described below, to | |
611 | indicate an error. All of these codes are negative values. | |
612 | ||
613 | If the input sequence contains an invalid character encoding, conversion | |
614 | stops before the invalid input character, and @code{scm_mb_iconv} | |
615 | returns the constant value @code{scm_mb_iconv_bad_encoding}. | |
616 | ||
617 | If the input sequence ends with an incomplete character encoding, | |
618 | @code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and | |
619 | return the constant value @code{scm_mb_iconv_incomplete_encoding}. This | |
620 | is not necessarily an error, if you expect to call @code{scm_mb_iconv} | |
621 | again with more data which might contain the rest of the encoding | |
622 | fragment. | |
623 | ||
624 | If the output buffer does not contain enough room to hold the converted | |
625 | form of the complete input text, @code{scm_mb_iconv} converts as much as | |
626 | it can, changes the input and output pointers to reflect the amount of | |
627 | text successfully converted, and then returns | |
628 | @code{scm_mb_iconv_too_big}. | |
629 | @end deftypefn | |
630 | ||
631 | Here are the status codes that might be returned by @code{scm_mb_iconv}. | |
632 | They are all negative integers. | |
633 | @table @code | |
634 | ||
635 | @item scm_mb_iconv_too_big | |
636 | The conversion needs more room in the output buffer. Some characters | |
637 | may have been consumed from the input buffer, and some characters may | |
638 | have been placed in the available space in the output buffer. | |
639 | ||
640 | @item scm_mb_iconv_bad_encoding | |
641 | @code{scm_mb_iconv} encountered an invalid character encoding in the | |
642 | input buffer. Conversion stopped before the invalid character, so there | |
643 | may be some characters consumed from the input buffer, and some | |
644 | converted text in the output buffer. | |
645 | ||
646 | @item scm_mb_iconv_incomplete_encoding | |
647 | The input buffer ends with an incomplete character encoding. The | |
648 | incomplete encoding is left in the input buffer, unconsumed. This is | |
649 | not necessarily an error, if you expect to call @code{scm_mb_iconv} | |
650 | again with more data which might contain the rest of the incomplete | |
651 | encoding. | |
652 | ||
653 | @end table | |
654 | ||
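The calling convention described above closely follows that of the POSIX
@code{iconv} function, so the shape of a chunked conversion loop can be
sketched with @code{iconv} as a stand-in. This is only an illustration
and not Guile code: the helper name @code{convert_latin1_to_utf8} is
hypothetical, and the @code{errno} values @code{E2BIG}, @code{EILSEQ},
and @code{EINVAL} play roughly the roles of @code{scm_mb_iconv_too_big},
@code{scm_mb_iconv_bad_encoding}, and
@code{scm_mb_iconv_incomplete_encoding}.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert INLEN bytes of ISO-8859-1 text at IN to
   UTF-8, storing at most CAP bytes in RESULT.  POSIX iconv stands in
   for scm_mb_iconv here.  Returns the number of bytes stored, or
   (size_t) -1 on failure.  */
static size_t
convert_latin1_to_utf8 (const char *in, size_t inlen, char *result, size_t cap)
{
  iconv_t cd;
  char *inbuf = (char *) in;    /* iconv does not write through this */
  size_t inleft = inlen;
  size_t used = 0;

  assert (in != NULL && result != NULL);

  cd = iconv_open ("UTF-8", "ISO-8859-1");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  while (inleft > 0)
    {
      char chunk[8];            /* deliberately small, to force several calls */
      char *outbuf = chunk;
      size_t outleft = sizeof chunk;
      size_t rc = iconv (cd, &inbuf, &inleft, &outbuf, &outleft);
      int err = errno;
      size_t produced = sizeof chunk - outleft;

      /* Whatever the status, the updated pointers and counts describe
         the work that was completed, so collect it first.  */
      if (produced > cap - used)
        {
          iconv_close (cd);
          return (size_t) -1;
        }
      memcpy (result + used, chunk, produced);
      used += produced;

      /* E2BIG only means "call again with more output room"; any
         other failure ends the conversion.  */
      if (rc == (size_t) -1 && err != E2BIG)
        {
          iconv_close (cd);
          return (size_t) -1;
        }
    }

  iconv_close (cd);
  return used;
}
```

The loop treats a full output buffer as a normal event: it drains the
chunk and calls the converter again with the same context, which is the
pattern the ``too big'' status is meant to support.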
655 | ||
656 | Finally, Guile provides a function for destroying conversion contexts. | |
657 | ||
658 | @deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context}) | |
659 | Deallocate the conversion context object @var{context}, and all other | |
660 | resources allocated by the call to @code{scm_mb_iconv_open} which | |
661 | returned @var{context}. | |
662 | @end deftypefn | |
663 | ||
664 | ||
665 | @node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text | |
666 | @subsection Implementing Your Own Text Conversions | |
667 | ||
668 | [[note that conversions to and from Guile must produce streams | |
669 | containing only valid character encodings, or else Guile will crash]] | |
670 | ||
671 | This section describes the interface for adding your own encoding | |
672 | conversions for use with @code{scm_mb_iconv}. The interface here is | |
673 | borrowed from the GNOME Project's @file{libunicode} library. | |
674 | ||
675 | Guile's @code{scm_mb_iconv} function works by converting the input text | |
676 | to a stream of @code{scm_char_t} characters, and then converting | |
677 | those characters to the desired output encoding. This makes it easy | |
678 | for Guile to choose the appropriate conversion back ends for an | |
679 | arbitrary pair of input and output encodings, but it also means that the | |
680 | accuracy and quality of the conversions depend on the fidelity of | |
681 | Guile's internal character set to the source and destination encodings. | |
682 | Since @code{scm_mb_iconv} will be used almost exclusively for converting | |
683 | to and from Guile's internal character set, this shouldn't be a problem. | |
684 | ||
685 | To add support for a particular encoding to Guile, you must provide one | |
686 | function (called the @dfn{read} function) which converts from your | |
687 | encoding to an array of @code{scm_char_t} values, and another function | |
688 | (called the @dfn{write} function) to convert from an array of | |
689 | @code{scm_char_t} values back into your encoding. To convert from some | |
690 | encoding @var{a} to some other encoding @var{b}, Guile pairs up | |
691 | @var{a}'s read function with @var{b}'s write function. Each call to | |
692 | @code{scm_mb_iconv} passes text in encoding @var{a} through the read | |
693 | function, to produce an array of @code{scm_char_t} values, and then passes | |
694 | that array to the write function, to produce text in encoding @var{b}. | |
695 | ||
696 | For stateful encodings, a read or write function can hang its own data | |
697 | structures off the conversion object, and provide its own functions to | |
698 | allocate and destroy them; this allows read and write functions to | |
699 | maintain whatever state they like. | |
700 | ||
701 | The Guile conversion back end represents each available encoding with a | |
702 | @code{struct scm_mb_encoding} object. | |
703 | ||
704 | @deftp {Libguile Type} {struct scm_mb_encoding} | |
705 | This data structure describes an encoding. It has the following | |
706 | members: | |
707 | ||
708 | @table @code | |
709 | ||
710 | @item char **names | |
711 | An array of strings, giving the various names for this encoding. The | |
712 | array should be terminated by a zero pointer. Case is not significant | |
713 | in encoding names. | |
714 | ||
715 | The @code{scm_mb_iconv_open} function searches the list of registered | |
716 | encodings for an encoding whose @code{names} array matches its | |
717 | @var{tocode} or @var{fromcode} argument. | |
718 | ||
719 | @item int (*init) (void **@var{cookie}) | |
720 | An initialization function for the encoding's private data. | |
721 | @code{scm_mb_iconv_open} will call this function, passing it the address | |
722 | of the cookie for this encoding in this context. (We explain cookies | |
723 | below.) There is no way for the @code{init} function to tell whether | |
724 | the encoding will be used for reading or writing. | |
725 | ||
726 | Note that @code{init} receives a @emph{pointer} to the cookie, not the | |
727 | cookie itself. Because the type of @var{cookie} is @code{void **}, the | |
728 | C compiler will not check it as carefully as it would other types. | |
729 | ||
730 | The @code{init} member may be zero, indicating that no initialization is | |
731 | necessary for this encoding. | |
732 | ||
733 | @item int (*destroy) (void **@var{cookie}) | |
734 | A deallocation function for the encoding's private data. | |
735 | @code{scm_mb_iconv_close} calls this function, passing it the address of | |
736 | the cookie for this encoding in this context. The @code{destroy} | |
737 | function should free any data the @code{init} function allocated. | |
738 | ||
739 | Note that @code{destroy} receives a @emph{pointer} to the cookie, not the | |
740 | cookie itself. Because the type of @var{cookie} is @code{void **}, the | |
741 | C compiler will not check it as carefully as it would other types. | |
742 | ||
743 | The @code{destroy} member may be zero, indicating that this encoding | |
744 | doesn't need to perform any special action to destroy its local data. | |
745 | ||
746 | @item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft}) | |
747 | Put the encoding into its initial shift state. Guile calls this | |
748 | function whether the encoding is being used for input or output, so this | |
749 | should take appropriate steps for both directions. If @var{outbuf} and | |
750 | @var{outbytesleft} are valid, the reset function should emit an escape | |
751 | sequence to reset the output stream to its initial state; @var{outbuf} | |
752 | and @var{outbytesleft} should be handled just as for | |
753 | @code{scm_mb_iconv}. | |
754 | ||
755 | This function can return an @code{scm_mb_iconv_} error code | |
756 | (@pxref{Exchanging Guile Text With the Outside World in C}). If it | |
757 | returns @code{scm_mb_iconv_too_big}, then the output buffer's shift | |
758 | state must be left unchanged. | |
759 | ||
760 | Note that @code{reset} receives the cookie's value itself, not a pointer | |
761 | to the cookie, as the @code{init} and @code{destroy} functions do. | |
762 | ||
763 | The @code{reset} member may be zero, indicating that this encoding | |
764 | doesn't use a shift state. | |
765 | ||
766 | @item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft}) | |
767 | Read some bytes and convert them into an array of Guile characters. This is | |
768 | the encoding's read function. | |
769 | ||
770 | On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to | |
771 | be converted, and *@var{outcharsleft} characters available at | |
772 | *@var{outbuf} to hold the results. | |
773 | ||
774 | On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes | |
775 | still not consumed. *@var{outcharsleft} and *@var{outbuf} indicate the | |
776 | output buffer space still not filled. (By exclusion, these indicate | |
777 | which input bytes were consumed, and which output characters were | |
778 | produced.) | |
779 | ||
780 | Return one of the @code{enum scm_mb_read_result} values, described below. | |
781 | ||
782 | Note that @code{read} receives the cookie's value itself, not a pointer | |
783 | to the cookie, as the @code{init} and @code{destroy} functions do. | |
784 | ||
785 | @item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) | |
786 | Convert an array of Guile characters to output bytes. This is | |
787 | the encoding's write function. | |
788 | ||
789 | On entry, there are *@var{incharsleft} Guile characters available at | |
790 | *@var{inbuf}, and *@var{outbytesleft} bytes available to store output at | |
791 | *@var{outbuf}. | |
792 | ||
793 | On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of | |
794 | Guile characters left unconverted (because there was insufficient room | |
795 | in the output buffer to hold their converted forms), and | |
796 | *@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the | |
797 | output buffer. | |
798 | ||
799 | Return one of the @code{enum scm_mb_write_result} values, described below. | |
800 | ||
801 | Note that @code{write} receives the cookie's value itself, not a pointer | |
802 | to the cookie, as the @code{init} and @code{destroy} functions do. | |
803 | ||
804 | @item struct scm_mb_encoding *next | |
805 | This is used by Guile to maintain a linked list of encodings. It is | |
806 | filled in when you call @code{scm_mb_register_encoding} to add your | |
807 | encoding to the list. | |
808 | ||
809 | @end table | |
810 | @end deftp | |
811 | ||
812 | Here is the enumerated type for the values an encoding's read function | |
813 | can return: | |
814 | ||
815 | @deftp {Libguile Type} {enum scm_mb_read_result} | |
816 | This type represents the result of a call to an encoding's read | |
817 | function. It has the following values: | |
818 | ||
819 | @table @code | |
820 | ||
821 | @item scm_mb_read_ok | |
822 | The read function consumed at least one byte of input. | |
823 | ||
824 | @item scm_mb_read_incomplete | |
825 | The data present in the input buffer does not contain a complete | |
826 | character encoding. No input was consumed, and no characters were | |
827 | produced as output. This is not necessarily an error status, if there | |
828 | is more data to pass through. | |
829 | ||
830 | @item scm_mb_read_error | |
831 | The input contains an invalid character encoding. | |
832 | ||
833 | @end table | |
834 | @end deftp | |
835 | ||
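As an illustration of the read-function contract, here is a minimal
sketch of a read function for 7-bit @sc{ASCII}. Several things in it are
assumptions rather than part of Guile: the name @code{ascii_read} is
hypothetical, the type and enumeration are declared locally as stand-ins
for whatever the Guile header provides, the width of @code{scm_char_t}
is a guess, and the sketch assumes an @sc{ASCII} byte value is also its
Guile character number. It also chooses to convert the valid prefix
before reporting an invalid byte, which the text above does not
require.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for declarations the Guile header would supply.  The
   width of scm_char_t is an assumption.  */
typedef unsigned int scm_char_t;

enum scm_mb_read_result
{
  scm_mb_read_ok,
  scm_mb_read_incomplete,
  scm_mb_read_error
};

/* Hypothetical read function for 7-bit ASCII.  The encoding is
   stateless, so COOKIE is ignored.  The caller is assumed to supply
   room for at least one output character.  */
static enum scm_mb_read_result
ascii_read (void *cookie, const char **inbuf, size_t *inbytesleft,
            scm_char_t **outbuf, size_t *outcharsleft)
{
  (void) cookie;
  assert (inbuf && *inbuf && inbytesleft);
  assert (outbuf && *outbuf && outcharsleft);

  /* Nothing to read: no complete character is present.  */
  if (*inbytesleft == 0)
    return scm_mb_read_incomplete;

  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      unsigned char byte = (unsigned char) **inbuf;

      /* Bytes above 127 are not valid ASCII.  Stop with the input
         pointer still at the offending byte.  */
      if (byte > 127)
        return scm_mb_read_error;

      **outbuf = byte;
      ++*inbuf;
      --*inbytesleft;
      ++*outbuf;
      --*outcharsleft;
    }

  return scm_mb_read_ok;
}
```

Because the function updates the caller's pointers and counts as it
goes, the caller can always tell how much was consumed and produced,
whatever status comes back.
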
836 | Here is the enumerated type for the values an encoding's write function | |
837 | can return: | |
838 | ||
839 | @deftp {Libguile Type} {enum scm_mb_write_result} | |
840 | This type represents the result of a call to an encoding's write | |
841 | function. It has the following values: | |
842 | ||
843 | @table @code | |
844 | ||
845 | @item scm_mb_write_ok | |
846 | The write function was able to convert all the characters in @var{inbuf} | |
847 | successfully. | |
848 | ||
849 | @item scm_mb_write_too_big | |
850 | The write function filled the output buffer, but there are still | |
851 | characters in @var{inbuf} left unconsumed; @var{inbuf} and | |
852 | @var{incharsleft} indicate the unconsumed portion of the input buffer. | |
853 | ||
854 | @end table | |
855 | @end deftp | |
856 | ||
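A matching write function for 7-bit @sc{ASCII} might look like the
following sketch. As before, the name @code{ascii_write} is
hypothetical, the type and enumeration are local stand-ins for the
Guile declarations, and the width of @code{scm_char_t} is an
assumption. Replacing an inexpressible character with @samp{?} is just
one possible choice; the manual only says such characters are handled
in an implementation-defined way.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for declarations the Guile header would supply.  The
   width of scm_char_t is an assumption.  */
typedef unsigned int scm_char_t;

enum scm_mb_write_result
{
  scm_mb_write_ok,
  scm_mb_write_too_big
};

/* Hypothetical write function for 7-bit ASCII.  The encoding is
   stateless, so COOKIE is ignored.  */
static enum scm_mb_write_result
ascii_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
             char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  assert (inbuf && *inbuf && incharsleft);
  assert (outbuf && *outbuf && outbytesleft);

  while (*incharsleft > 0)
    {
      scm_char_t c = **inbuf;

      /* Output is full but input remains: report it, leaving the
         pointers at the first unconverted character.  */
      if (*outbytesleft == 0)
        return scm_mb_write_too_big;

      /* Characters outside ASCII cannot be expressed; substitute '?'.  */
      **outbuf = (c < 128) ? (char) c : '?';
      ++*inbuf;
      --*incharsleft;
      ++*outbuf;
      --*outbytesleft;
    }

  return scm_mb_write_ok;
}
```

A caller that receives @code{scm_mb_write_too_big} can supply a fresh
output buffer and call the function again with the same input
pointers.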
857 | ||
858 | Conversions to or from stateful encodings need to keep track of each | |
859 | encoding's current state. Each conversion context contains two | |
860 | @code{void *} variables called @dfn{cookies}, one for the input | |
861 | encoding, and one for the output encoding. These cookies are passed to | |
862 | the encodings' functions, for them to use however they please. A | |
863 | stateful encoding can use its cookie to hold a pointer to some object | |
864 | which maintains the context's current shift state. Stateless encodings | |
865 | will probably not use their cookies. | |
866 | ||
867 | The cookies' lifetime is the same as that of the context object. When | |
868 | the user calls @code{scm_mb_iconv_close} to destroy a context object, | |
869 | @code{scm_mb_iconv_close} calls the input and output encodings' | |
870 | @code{destroy} functions, passing them their respective cookies, so each | |
871 | encoding can free any data it allocated for that context. | |
872 | ||
873 | Note that, if a read or write function returns a successful result code | |
874 | like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining | |
875 | input and the output must together represent the complete | |
876 | input text; the encoding may not store any text temporarily in its | |
877 | cookie. This is because, if @code{scm_mb_iconv} returns a successful | |
878 | result to the user, it is correct for the user to assume that all the | |
879 | consumed input has been converted and placed in the output buffer. | |
880 | There is no ``flush'' operation to push any final results out of the | |
881 | encodings' buffers. | |
882 | ||
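The cookie life cycle can be sketched as a pair of @code{init} and
@code{destroy} functions for an imaginary stateful encoding. The names
@code{struct shift_state}, @code{shift_init}, and @code{shift_destroy}
are hypothetical, and so is the return convention (zero for success,
non-zero for failure), since the text above does not say what these
functions should return.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-context state for a stateful encoding.  */
struct shift_state
{
  int current_charset;          /* 0 means the initial shift state */
};

/* Candidate `init' member.  It receives the ADDRESS of the cookie, so
   it can store a pointer to freshly allocated state there.  Returning
   0 for success is an assumption.  */
static int
shift_init (void **cookie)
{
  struct shift_state *state;

  assert (cookie != NULL);
  state = malloc (sizeof *state);
  if (state == NULL)
    return -1;

  state->current_charset = 0;
  *cookie = state;
  return 0;
}

/* Candidate `destroy' member.  It also receives the address of the
   cookie, frees what `init' allocated, and clears the cookie so a
   stale pointer is not left behind.  */
static int
shift_destroy (void **cookie)
{
  assert (cookie != NULL);
  free (*cookie);
  *cookie = NULL;
  return 0;
}
```

The read, write, and reset functions for such an encoding would then
cast the cookie value they receive back to @code{struct shift_state *}
to consult or update the shift state.
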
883 | Here is the function you call to register a new encoding with the | |
884 | conversion system: | |
885 | ||
886 | @deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding}) | |
887 | Add the encoding described by @code{*@var{encoding}} to the set | |
888 | understood by @code{scm_mb_iconv_open}. Once you have registered your | |
889 | encoding, you can use it by calling @code{scm_mb_iconv_open} with one of | |
890 | the names in @code{@var{encoding}->names}. | |
891 | @end deftypefn | |
892 | ||
893 | ||
894 | @node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C | |
895 | @section Multibyte Text Processing Errors | |
896 | ||
897 | This section describes error conditions which code can signal to | |
898 | indicate problems encountered while processing multibyte text. In each | |
899 | case, the arguments @var{message} and @var{args} are an error format | |
900 | string and arguments to be substituted into the string, as accepted by | |
901 | the @code{display-error} function. | |
902 | ||
903 | @deffn Condition text:not-char-boundary func message args object offset | |
904 | By calling @var{func}, the program attempted to access a character at | |
905 | byte offset @var{offset} in the Guile object @var{object}, but | |
906 | @var{offset} is not the start of a character's encoding in @var{object}. | |
907 | ||
908 | Typically, @var{object} is a string or symbol. If the function signalling | |
909 | the error cannot find the Guile object that contains the text it is | |
910 | inspecting, it should use @code{#f} for @var{object}. | |
911 | @end deffn | |
912 | ||
913 | @deffn Condition text:bad-encoding func message args object | |
914 | By calling @var{func}, the program attempted to interpret the text in | |
915 | @var{object}, but @var{object} contains a byte sequence which is not a | |
916 | valid encoding for any character. | |
917 | @end deffn | |
918 | ||
919 | @deffn Condition text:not-guile-char func message args number | |
920 | By calling @var{func}, the program attempted to treat @var{number} as the | |
921 | number of a character in the Guile character set, but @var{number} does | |
922 | not correspond to any character in the Guile character set. | |
923 | @end deffn | |
924 | ||
925 | @deffn Condition text:unknown-conversion func message args from to | |
926 | By calling @var{func}, the program attempted to convert from an encoding | |
927 | named @var{from} to an encoding named @var{to}, but Guile does not | |
928 | support such a conversion. | |
929 | @end deffn | |
930 | ||
931 | @deftypevr {Libguile Variable} SCM scm_text_not_char_boundary | |
932 | @deftypevrx {Libguile Variable} SCM scm_text_bad_encoding | |
933 | @deftypevrx {Libguile Variable} SCM scm_text_not_guile_char | |
934 | These variables hold the Scheme symbol objects whose names are the | |
935 | condition symbols above. You can use these when signalling these | |
936 | errors, instead of looking them up yourself. | |
937 | @end deftypevr | |
938 | ||
939 | ||
940 | @node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C | |
941 | @section Why Guile Does Not Use a Fixed-Width Encoding | |
942 | ||
943 | Multibyte encodings are clumsier to work with than encodings which use a | |
944 | fixed number of bytes for every character. For example, using a | |
945 | fixed-width encoding, we can extract the @var{i}th character of a string | |
946 | in constant time, and we can always substitute the @var{i}th character | |
947 | of a string with any other character without reallocating or copying the | |
948 | string. | |
949 | ||
950 | However, there are no fixed-width encodings which include the characters | |
951 | we wish to include, and also fit in a reasonable amount of space. | |
952 | Despite the Unicode standard's claims to the contrary, Unicode is not | |
953 | really a fixed-width encoding. Unicode uses surrogate pairs to | |
954 | represent characters outside the 16-bit range; a surrogate pair must be | |
955 | treated as a single character, but occupies two 16-bit spaces. As of | |
956 | this writing, there are already plans to assign characters to the | |
957 | surrogate character codes. Three- and four-byte encodings are | |
958 | too wasteful for a majority of Guile's users, who only need @sc{ASCII} | |
959 | and a few accented characters. | |
960 | ||
961 | Another alternative would be to have several different fixed-width | |
962 | string representations, each with a different element size. For each | |
963 | string, Guile would use the smallest element size capable of | |
964 | accommodating the string's text. This would allow users of English and | |
965 | the Western European languages to use the traditional memory-efficient | |
966 | encodings. However, if Guile has @var{n} string representations, then | |
967 | users must write @var{n} versions of any code which manipulates text | |
968 | directly --- one for each element size. And if a user wants to operate | |
969 | on two strings simultaneously, and wants to avoid testing the string | |
970 | sizes within the loop, she must make @var{n}*@var{n} copies of the loop. | |
971 | Most users will simply not bother. Instead, they will write code which | |
972 | supports only one string size, leaving us back where we started. By | |
973 | using a single internal representation, Guile makes it easier for users | |
974 | to write multilingual code. | |
975 | ||
976 | [[What about tagging each string with its encoding? | |
977 | "Every extension must be written to deal with every encoding"]] | |
978 | ||
979 | [[You don't really want to index strings anyway.]] | |
980 | ||
981 | Finally, Guile's multibyte encoding is not so bad. Unlike a two- or | |
982 | four-byte encoding, it is efficient in space for American and European | |
983 | users. Furthermore, the properties described above mean that many | |
984 | functions can be coded just as they would for a single-byte encoding; | |
985 | see @ref{Promised Properties of the Guile Multibyte Encoding}. | |
986 | ||
987 | @bye |