| 1 | \input texinfo |
| 2 | @setfilename mbapi.info |
| 3 | @settitle Multibyte API |
| 4 | @setchapternewpage off |
| 5 | |
| 6 | @c Open issues: |
| 7 | |
| 8 | @c What's the best way to report errors? Should functions return a |
| 9 | @c magic value, according to C tradition, or should they signal a |
| 10 | @c Guile exception? |
| 11 | |
| 12 | @c |
| 13 | |
| 14 | |
| 15 | @node Working With Multibyte Strings in C |
| 16 | @chapter Working With Multibyte Strings in C |
| 17 | |
| 18 | Guile allows strings to contain characters drawn from a wide variety of |
| 19 | languages, including many Asian, Eastern European, and Middle Eastern |
| 20 | languages, in a uniform and unrestricted way. The string representation |
| 21 | normally used in C code --- an array of @sc{ASCII} characters --- is not |
| 22 | sufficient for Guile strings, since they may contain characters not |
| 23 | present in @sc{ASCII}. |
| 24 | |
| 25 | Instead, Guile uses a very large character set, and encodes each |
| 26 | character as a sequence of one or more bytes. We call this |
| 27 | variable-width encoding a @dfn{multibyte} encoding. Guile uses this |
| 28 | single encoding internally for all strings, symbol names, error |
| 29 | messages, etc., and performs appropriate conversions upon input and |
| 30 | output. |
| 31 | |
| 32 | The use of this variable-width encoding is almost invisible to Scheme |
| 33 | code. Strings are still indexed by character number, not by byte |
| 34 | offset; @code{string-length} still returns the length of a string in |
| 35 | characters, not in bytes. @code{string-ref} and @code{string-set!} are |
| 36 | no longer guaranteed to be constant-time operations, but Guile uses |
| 37 | various strategies to reduce the impact of this change. |
| 38 | |
| 39 | However, the encoding is visible via Guile's C interface, which gives |
| 40 | the user direct access to a string's bytes. This chapter explains how |
| 41 | to work with Guile multibyte text in C code. Since variable-width |
| 42 | encodings are clumsier to work with than simple fixed-width encodings, |
| 43 | Guile provides a set of standard macros and functions for manipulating |
| 44 | multibyte text to make the job easier. Furthermore, Guile makes some |
| 45 | promises about the encoding which you can use in writing your own text |
| 46 | processing code. |
| 47 | |
| 48 | While we discuss guaranteed properties of Guile's encoding, and provide |
| 49 | functions to operate on its character set, we do not actually specify |
| 50 | either the character set or encoding here. This is because we expect |
| 51 | both of them to change in the future: currently, Guile uses the same |
| 52 | encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs |
| 53 | as well) to use Unicode and UTF-8, with some extensions. This will make |
| 54 | it more comfortable to use Guile with other systems which use UTF-8, |
| 55 | like the GTK user interface toolkit. |
| 56 | |
| 57 | @menu |
| 58 | * Multibyte String Terminology:: |
| 59 | * Promised Properties of the Guile Multibyte Encoding:: |
| 60 | * Functions for Operating on Multibyte Text:: |
| 61 | * Multibyte Text Processing Errors:: |
| 62 | * Why Guile Does Not Use a Fixed-Width Encoding:: |
| 63 | @end menu |
| 64 | |
| 65 | |
| 66 | @node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C |
| 67 | @section Multibyte String Terminology |
| 68 | |
| 69 | In the descriptions which follow, we make the following definitions: |
| 70 | @table @dfn |
| 71 | |
| 72 | @item byte |
| 73 | A @dfn{byte} is a number between 0 and 255. It has no inherent textual |
| 74 | interpretation. So 65 is a byte, not a character. |
| 75 | |
| 76 | @item character |
| 77 | A @dfn{character} is a unit of text. It has no inherent numeric value. |
| 78 | @samp{A} and @samp{.} are characters, not bytes. (This is different |
| 79 | from the C language's definition of @dfn{character}; in this chapter, we |
| 80 | will always use a phrase like ``the C language's @code{char} type'' when |
| 81 | that's what we mean.) |
| 82 | |
| 83 | @item character set |
| 84 | A @dfn{character set} is an invertible mapping between numbers and a |
| 85 | given set of characters. @sc{ASCII} is a character set assigning |
| 86 | characters to the numbers 0 through 127. It maps @samp{A} onto the |
| 87 | number 65, and @samp{.} onto 46. |
| 88 | |
| 89 | Note that a character set maps characters onto numbers, @emph{not |
| 90 | necessarily} onto bytes. For example, the Unicode character set maps |
| 91 | the Greek lower-case @samp{alpha} character onto the number 945, which |
| 92 | is not a byte. |
| 93 | |
| 94 | (This is what Internet standards would call a ``coded character set''.) |
| 95 | |
| 96 | @item encoding |
| 97 | An encoding maps numbers onto sequences of bytes. For example, the |
| 98 | UTF-8 encoding, defined in the Unicode Standard, would map the number |
| 99 | 945 onto the sequence of bytes @samp{206 177}. When using the |
| 100 | @sc{ASCII} character set, every number assigned also happens to be a |
| 101 | byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes. |
| 102 | |
| 103 | (This is what Internet standards would call a ``character encoding |
| 104 | scheme''.) |
| 105 | |
| 106 | @end table |
| 107 | |
| 108 | Thus, to turn a character into a sequence of bytes, you need a character |
| 109 | set to assign a number to that character, and then an encoding to turn |
| 110 | that number into a sequence of bytes. |
| 111 | |
| 112 | Likewise, to interpret a sequence of bytes as a sequence of characters, |
| 113 | you use an encoding to extract a sequence of numbers from the bytes, and |
| 114 | then a character set to turn the numbers into characters. |
| 115 | |
| 116 | Errors can occur while carrying out either of these processes. For |
| 117 | example, under a particular encoding, a given sequence of bytes might |
| 118 | not correspond to any number: the byte sequence @samp{128 128} is not a |
| 119 | valid encoding of any number under UTF-8. |
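
As a minimal sketch of the two steps described above, the following C
code takes Unicode as the character set and UTF-8 as the encoding. The
names @code{demo_utf8_encode} and @code{demo_utf8_can_start} are
hypothetical and are not part of Libguile; the sketch assumes numbers no
larger than 0x10FFFF and does not treat surrogate values specially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Libguile: encode the number C
   (0 <= C <= 0x10FFFF) as UTF-8 at OUT, which must have room for four
   bytes.  Return the number of bytes written.  */
size_t
demo_utf8_encode (unsigned long c, unsigned char *out)
{
  assert (c <= 0x10FFFFUL);
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (c >> 18));
  out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Return non-zero if the byte B can begin a UTF-8 sequence.  Bytes
   128 through 191 are continuation bytes, so a sequence such as
   `128 128' cannot be decoded to any number.  */
int
demo_utf8_can_start (unsigned char b)
{
  return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}
```

Passing 945 (the Greek lower-case alpha) to @code{demo_utf8_encode}
stores the bytes @samp{206 177}, matching the example in the definition
of @dfn{encoding} above.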
| 120 | |
| 121 | Having carefully defined our terminology, we will now abuse it. |
| 122 | |
| 123 | We will sometimes use the word @dfn{character} to refer to the number |
| 124 | assigned to a character by a character set, in contexts where it's |
| 125 | obvious we mean a number. |
| 126 | |
| 127 | Sometimes there is a close association between a particular encoding and |
| 128 | a particular character set. Thus, we may sometimes refer to the |
| 129 | character set and encoding together as an @dfn{encoding}. |
| 130 | |
| 131 | |
| 132 | @node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C |
| 133 | @section Promised Properties of the Guile Multibyte Encoding |
| 134 | |
| 135 | Internally, Guile uses a single encoding for all text --- symbols, |
| 136 | strings, error messages, etc. Here we list a number of helpful |
| 137 | properties of Guile's encoding. It is correct to write code which |
| 138 | assumes these properties; code which uses these assumptions will be |
| 139 | portable to all future versions of Guile, as far as we know. |
| 140 | |
| 141 | @b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in |
| 142 | the obvious way.} This means that a standard C string containing only |
| 143 | @sc{ASCII} characters is a valid Guile string (except for the terminator; |
| 144 | Guile strings store the length explicitly, so they can contain null |
| 145 | characters). |
| 146 | |
| 147 | @b{The encodings of non-@sc{ASCII} characters use only bytes between 128 |
| 148 | and 255.} That is, when we turn a non-@sc{ASCII} character into a |
| 149 | series of bytes, none of those bytes can ever be mistaken for the |
| 150 | encoding of an @sc{ASCII} character. This means that you can search a |
| 151 | Guile string for an @sc{ASCII} character using the standard |
| 152 | @code{memchr} library function. By extension, you can search for an |
| 153 | @sc{ASCII} substring in a Guile string using a traditional substring |
| 154 | search algorithm --- you needn't add special checks to verify encoding |
| 155 | boundaries, etc. |
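
The following sketch illustrates this promise with the standard
@code{memchr} function. The wrapper name @code{demo_find_ascii} is
hypothetical, and the sample bytes in the usage note assume a UTF-8-like
encoding purely for illustration; Guile's actual encoding is not
specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Libguile: return the byte offset of
   the first occurrence of the ASCII character C in the LEN bytes at
   TEXT, or -1 if C does not occur.  Because no byte of a non-ASCII
   character's encoding lies in the range 0..127, a plain byte search
   cannot produce a false match.  */
long
demo_find_ascii (const unsigned char *text, size_t len, char c)
{
  const unsigned char *hit;

  assert ((unsigned char) c < 128);     /* only meaningful for ASCII */
  hit = memchr (text, c, len);
  return hit ? (long) (hit - text) : -1;
}
```

For instance, in the six bytes @samp{206 177 61 49 0 46}, searching for
@samp{=} (61) yields offset 2. Searching for @samp{.} (46) yields offset
5, even though a null byte precedes it, because the length is passed
explicitly.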
| 156 | |
| 157 | @b{No character encoding is a subsequence of any other character |
| 158 | encoding.} (This is just a stronger version of the previous promise.) |
| 159 | This means that you can search for occurrences of one Guile string |
| 160 | within another Guile string just as if they were raw byte strings. You |
| 161 | can use the stock @code{memmem} function (provided on GNU systems, at |
| 162 | least) for such searches. If you don't need the ability to represent |
| 163 | null characters in your text, you can still use null-termination for |
| 164 | strings, and use the traditional string-handling functions like |
| 165 | @code{strlen}, @code{strstr}, and @code{strcat}. |
| 166 | |
| 167 | @b{You can always determine the full length of a character's encoding |
| 168 | from its first byte.} Guile provides the macro @code{scm_mb_len} which |
| 169 | computes the encoding's length from its first byte. Given the first |
| 170 | rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <= |
| 171 | @var{b} <= 127}, returns 1. |
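
To show how such a length function can be used, here is a sketch that
counts characters by hopping from one leading byte to the next. It uses
UTF-8 as a stand-in encoding; @code{demo_mb_len} and @code{demo_count}
are hypothetical names, not the Libguile definitions, and the sketch
does not check the bytes that follow each leading byte.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a length-from-first-byte function, using UTF-8 rules.
   Return the encoding length implied by the leading byte B, or -1 if B
   cannot begin an encoding.  */
int
demo_mb_len (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* ASCII: always a single byte */
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return -1;
}

/* Count the characters in the LEN bytes at P by looking only at
   leading bytes.  Return -1 if a byte in leading position is invalid or
   the final character is truncated.  */
long
demo_count (const unsigned char *p, size_t len)
{
  size_t i = 0;
  long count = 0;

  assert (p != NULL || len == 0);
  while (i < len)
    {
      int n = demo_mb_len (p[i]);

      if (n < 0 || (size_t) n > len - i)
        return -1;
      i += (size_t) n;
      count++;
    }
  return count;
}
```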
| 172 | |
| 173 | @b{Given an arbitrary byte position in a Guile string, you can always |
| 174 | find the beginning and end of the character containing that byte without |
| 175 | scanning too far in either direction.} This means that, if you are sure |
| 176 | a byte sequence is a valid encoding of a character sequence, you can |
| 177 | find character boundaries without keeping track of the beginning and |
| 178 | ending of the overall string. This promise relies on the fact that, in |
| 179 | addition to storing the string's length explicitly, Guile always either |
| 180 | terminates the string's storage with a zero byte, or shares it with |
| 181 | another string which is terminated this way. |
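
As an illustration of this last promise, the sketch below finds
character boundaries in UTF-8, used here as a stand-in for Guile's
encoding. In UTF-8, every non-leading byte has the bit pattern
@code{10xxxxxx}, and no character has more than three such bytes, so the
scan is bounded. The names @code{demo_floor} and @code{demo_ceiling} are
hypothetical; the code assumes that the text is valid and that it is
followed by a zero byte.

```c
#include <assert.h>
#include <stddef.h>

/* Return the start of the character containing the byte at P.
   Assumes valid UTF-8, so at most three steps backward are needed.  */
const unsigned char *
demo_floor (const unsigned char *p)
{
  int steps = 0;

  assert (p != NULL);
  while ((*p & 0xC0) == 0x80 && steps < 3)
    {
      p--;
      steps++;
    }
  return p;
}

/* Return P if it is at a character boundary, else the start of the
   next character.  The terminating zero byte is not a continuation
   byte, so the scan stops there at the latest.  */
const unsigned char *
demo_ceiling (const unsigned char *p)
{
  int steps = 0;

  assert (p != NULL);
  while ((*p & 0xC0) == 0x80 && steps < 3)
    {
      p++;
      steps++;
    }
  return p;
}
```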
| 182 | |
| 183 | |
| 184 | @node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C |
| 185 | @section Functions for Operating on Multibyte Text |
| 186 | |
| 187 | Guile provides a variety of functions, variables, and types for working |
| 188 | with multibyte text. |
| 189 | |
| 190 | @menu |
| 191 | * Basic Multibyte Character Processing:: |
| 192 | * Finding Character Encoding Boundaries:: |
| 193 | * Multibyte String Functions:: |
| 194 | * Exchanging Guile Text With the Outside World in C:: |
| 195 | * Implementing Your Own Text Conversions:: |
| 196 | @end menu |
| 197 | |
| 198 | |
| 199 | @node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text |
| 200 | @subsection Basic Multibyte Character Processing |
| 201 | |
| 202 | Here are the essential types and functions for working with Guile text. |
| 203 | Guile uses the C type @code{unsigned char *} to refer to text encoded |
| 204 | with Guile's encoding. |
| 205 | |
| 206 | Note that any operation marked here as a ``Libguile Macro'' might |
| 207 | evaluate its argument multiple times. |
| 208 | |
| 209 | @deftp {Libguile Type} scm_char_t |
| 210 | This is a signed integral type large enough to hold any character in |
| 211 | Guile's character set. All character numbers are non-negative. |
| 212 | @end deftp |
| 213 | |
| 214 | @deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p}) |
| 215 | Return the character whose encoding starts at @var{p}. If @var{p} does |
| 216 | not point at a valid character encoding, the behavior is undefined. |
| 217 | @end deftypefn |
| 218 | |
| 219 | @deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c}) |
| 220 | Place the encoded form of the Guile character @var{c} at @var{p}, and |
| 221 | return its length in bytes. If @var{c} is not a Guile character, the |
| 222 | behavior is undefined. |
| 223 | @end deftypefn |
| 224 | |
| 225 | @deftypevr {Libguile Constant} int scm_mb_max_len |
| 226 | The maximum length of any character's encoding, in bytes. You may |
| 227 | assume this is relatively small --- less than a dozen or so. |
| 228 | @end deftypevr |
| 229 | |
| 230 | @deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b}) |
| 231 | If @var{b} is the first byte of a character's encoding, return the full |
| 232 | length of the character's encoding, in bytes. If @var{b} is not a valid |
| 233 | leading byte, the behavior is undefined. |
| 234 | @end deftypefn |
| 235 | |
| 236 | @deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c}) |
| 237 | Return the length of the encoding of the character @var{c}, in bytes. |
| 238 | If @var{c} is not a valid Guile character, the behavior is undefined. |
| 239 | @end deftypefn |
| 240 | |
| 241 | @deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p}) |
| 242 | @deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c}) |
| 243 | @deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b}) |
| 244 | @deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c}) |
| 245 | These are functions identical to the corresponding macros. You can use |
| 246 | them in situations where the overhead of a function call is acceptable, |
| 247 | and the cleaner semantics of function application are desirable. |
| 248 | @end deftypefn |
| 249 | |
| 250 | |
| 251 | @node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text |
| 252 | @subsection Finding Character Encoding Boundaries |
| 253 | |
| 254 | These are functions for finding the boundaries between characters in |
| 255 | multibyte text. |
| 256 | |
| 257 | Note that any operation marked here as a ``Libguile Macro'' might |
| 258 | evaluate its argument multiple times, unless the definition promises |
| 259 | otherwise. |
| 260 | |
| 261 | @deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p}) |
| 262 | Return non-zero iff @var{p} points to the start of a character in |
| 263 | multibyte text. |
| 264 | |
| 265 | This macro will evaluate its argument only once. |
| 266 | @end deftypefn |
| 267 | |
| 268 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p}) |
| 269 | ``Round'' @var{p} to the previous character boundary. That is, if |
| 270 | @var{p} points to the middle of the encoding of a Guile character, |
| 271 | return a pointer to the first byte of the encoding. If @var{p} points |
| 272 | to the start of the encoding of a Guile character, return @var{p} |
| 273 | unchanged. |
| 274 | @end deftypefn |
| 275 | |
| 276 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p}) |
| 277 | ``Round'' @var{p} to the next character boundary. That is, if @var{p} |
| 278 | points to the middle of the encoding of a Guile character, return a |
| 279 | pointer to the first byte of the encoding of the next character. If |
| 280 | @var{p} points to the start of the encoding of a Guile character, return |
| 281 | @var{p} unchanged. |
| 282 | @end deftypefn |
| 283 | |
| 284 | Note that it is usually not friendly for functions to silently correct |
| 285 | byte offsets that point into the middle of a character's encoding. Such |
| 286 | offsets almost always indicate a programming error, and they should be |
| 287 | reported as early as possible. So, when you write code which operates |
| 288 | on multibyte text, you should not use functions like these to ``clean |
| 289 | up'' byte offsets which the originator believes to be correct; instead, |
| 290 | your code should signal a @code{text:not-char-boundary} error as soon as |
| 291 | it detects an invalid offset. @xref{Multibyte Text Processing Errors}. |
| 292 | |
| 293 | |
| 294 | @node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text |
| 295 | @subsection Multibyte String Functions |
| 296 | |
| 297 | These functions allow you to operate on multibyte strings: sequences of |
| 298 | character encodings. |
| 299 | |
| 300 | @deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len}) |
| 301 | Return the number of Guile characters encoded by the @var{len} bytes at |
| 302 | @var{p}. |
| 303 | |
| 304 | If the sequence contains any invalid character encodings, or ends with |
| 305 | an incomplete character encoding, signal a @code{text:bad-encoding} |
| 306 | error. |
| 307 | @end deftypefn |
| 308 | |
| 309 | @deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp}) |
| 310 | Return the character whose encoding starts at @code{*@var{pp}}, and |
| 311 | advance @code{*@var{pp}} to the start of the next character. Return -1 |
| 312 | if @code{*@var{pp}} does not point to a valid character encoding. |
| 313 | @end deftypefn |
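
The ``fetch and advance'' pattern can be sketched as follows, again
using UTF-8 as a stand-in encoding. The function @code{demo_walk} is
hypothetical and is not the Libguile macro; it assumes the text is
followed by a zero byte, and it does not reject every ill-formed
sequence (over-long forms, for example, are accepted).

```c
#include <assert.h>
#include <stddef.h>

/* Decode the character whose encoding starts at *PP and advance *PP
   past it.  Return -1, leaving *PP unchanged, if *PP does not point at
   a plausible encoding.  */
long
demo_walk (const unsigned char **pp)
{
  const unsigned char *p;
  long c;
  int n, k;

  assert (pp != NULL && *pp != NULL);
  p = *pp;
  if (p[0] < 0x80)
    {
      c = p[0];
      n = 1;
    }
  else if ((p[0] & 0xE0) == 0xC0)
    {
      c = p[0] & 0x1F;
      n = 2;
    }
  else if ((p[0] & 0xF0) == 0xE0)
    {
      c = p[0] & 0x0F;
      n = 3;
    }
  else if ((p[0] & 0xF8) == 0xF0)
    {
      c = p[0] & 0x07;
      n = 4;
    }
  else
    return -1;                  /* a continuation or unused byte */

  for (k = 1; k < n; k++)
    {
      /* The terminating zero byte fails this test, so the loop never
         reads past the end of zero-terminated text.  */
      if ((p[k] & 0xC0) != 0x80)
        return -1;
      c = (c << 6) | (p[k] & 0x3F);
    }
  *pp = p + n;
  return c;
}
```

Calling such a function repeatedly until the pointer reaches the end of
the text visits each character once, in time proportional to the length
of the text.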
| 314 | |
| 315 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p}) |
| 316 | If @var{p} points to the middle of the encoding of a Guile character, |
| 317 | return a pointer to the first byte of the encoding. If @var{p} points |
| 318 | to the start of the encoding of a Guile character, return the start of |
| 319 | the previous character's encoding. |
| 320 | |
| 321 | This is like @code{scm_mb_floor}, but the returned pointer will always |
| 322 | be before @var{p}. If you use this function to drive an iteration, it |
| 323 | guarantees backward progress. |
| 324 | @end deftypefn |
| 325 | |
| 326 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p}) |
| 327 | If @var{p} points to the encoding of a Guile character, return a pointer |
| 328 | to the first byte of the encoding of the next character. |
| 329 | |
| 330 | This is like @code{scm_mb_ceiling}, but the returned pointer will always |
| 331 | be after @var{p}. If you use this function to drive an iteration, it |
| 332 | guarantees forward progress. |
| 333 | @end deftypefn |
| 334 | |
| 335 | @deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i}) |
| 336 | Assuming that the @var{len} bytes starting at @var{p} are a |
| 337 | concatenation of valid character encodings, return a pointer to the |
| 338 | start of the @var{i}'th character encoding in the sequence. |
| 339 | |
| 340 | This function scans the sequence from the beginning to find the |
| 341 | @var{i}'th character, and will generally require time proportional to |
| 342 | the distance from @var{p} to the returned address. |
| 343 | |
| 344 | If the sequence contains any invalid character encodings, or ends with |
| 345 | an incomplete character encoding, signal a @code{text:bad-encoding} |
| 346 | error. |
| 347 | @end deftypefn |
| 348 | |
| 349 | It is common to process the characters in a string from left to right. |
| 350 | However, if you fetch each character using @code{scm_mb_index}, each |
| 351 | call will scan the text from the beginning, so your loop will require |
| 352 | time proportional to at least the square of the length of the text. To |
| 353 | avoid this poor performance, you can use an @code{scm_mb_cache} |
| 354 | structure and the @code{scm_mb_index_cached} macro. |
| 355 | |
| 356 | @deftp {Libguile Type} {struct scm_mb_cache} |
| 357 | This structure holds information that allows a string scanning operation |
| 358 | to use the results from a previous scan of the string. It has the |
| 359 | following members: |
| 360 | @table @code |
| 361 | |
| 362 | @item character |
| 363 | An index, in characters, into the string. |
| 364 | |
| 365 | @item byte |
| 366 | The index, in bytes, of the start of that character. |
| 367 | |
| 368 | @end table |
| 369 | |
| 370 | In other words, @code{byte} is the byte offset of the |
| 371 | @code{character}'th character of the string. Note that if @code{byte} |
| 372 | and @code{character} are equal, then all characters before that point |
| 373 | must have encodings exactly one byte long, and the string can be indexed |
| 374 | normally. |
| 375 | |
| 376 | All elements of a @code{struct scm_mb_cache} structure should be |
| 377 | initialized to zero before its first use, and whenever the string's text |
| 378 | changes. |
| 379 | @end deftp |
| 380 | |
| 381 | @deftypefn {Libguile Macro} {const unsigned char *} scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache}) |
| 382 | @deftypefnx {Libguile Function} {const unsigned char *} scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache}) |
| 383 | This macro and this function are identical to @code{scm_mb_index}, |
| 384 | except that they may consult and update @code{*@var{cache}} in order to avoid |
| 385 | scanning the string from the beginning. @code{scm_mb_index_cached} is a |
| 386 | macro, so it may have less overhead than |
| 387 | @code{scm_mb_index_cached_func}, but it may evaluate its arguments more |
| 388 | than once. |
| 389 | |
| 390 | Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you |
| 391 | can scan a string from left to right, or from right to left, in time |
| 392 | proportional to the length of the string. As long as each character |
| 393 | fetched is less than some constant distance before or after the previous |
| 394 | character fetched with @var{cache}, each access will require constant |
| 395 | time. |
| 396 | @end deftypefn |
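
One way such a cache could work is sketched below, with UTF-8 as a
stand-in encoding. The names @code{struct demo_mb_cache} and
@code{demo_index_cached} are hypothetical; the structure mirrors the
@code{character} and @code{byte} members described above, but the code
is not the Libguile implementation. It assumes the text is valid, and it
returns a byte offset (or -1) rather than a pointer or an error.

```c
#include <assert.h>
#include <stddef.h>

struct demo_mb_cache
{
  int character;                /* an index, in characters */
  int byte;                     /* byte offset of that character */
};

/* Encoding length implied by the leading byte B (UTF-8 rules), or -1. */
static int
demo_len (unsigned char b)
{
  if (b < 0x80)
    return 1;
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return -1;
}

/* Return the byte offset of character number I (counting from zero) in
   the LEN bytes at P, or -1 if there is no such character.  The scan
   starts from the position remembered in CACHE, not from offset 0.  */
int
demo_index_cached (const unsigned char *p, int len, int i,
                   struct demo_mb_cache *cache)
{
  int ch, by;

  assert (cache != NULL && len >= 0);
  if (i < 0)
    return -1;
  ch = cache->character;
  by = cache->byte;

  while (ch < i)                /* walk forward, one character at a time */
    {
      int n;

      if (by >= len)
        return -1;
      n = demo_len (p[by]);
      if (n < 0 || by + n > len)
        return -1;
      by += n;
      ch++;
    }
  while (ch > i)                /* walk backward over continuation bytes */
    {
      do
        by--;
      while (by > 0 && (p[by] & 0xC0) == 0x80);
      ch--;
    }
  if (by >= len)
    return -1;

  cache->character = ch;        /* remember where this lookup ended */
  cache->byte = by;
  return by;
}
```

With the cache initialized to zero, looking up characters 0, 1, 2, and
so on in order moves forward by exactly one character per call, so a
full left-to-right scan takes time proportional to the length of the
text.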
| 397 | |
| 398 | Guile also provides functions to convert between an encoded sequence of |
| 399 | characters, and an array of @code{scm_char_t} objects. |
| 400 | |
| 401 | @deftypefn {Libguile Function} {scm_char_t *} scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len}) |
| 402 | Convert the variable-width text in the @var{len} bytes at @var{p} |
| 403 | to an array of @code{scm_char_t} values. Return a pointer to the array, |
| 404 | and set @code{*@var{result_len}} to the number of elements it contains. |
| 405 | The returned array is allocated with @code{malloc}, and it is the |
| 406 | caller's responsibility to free it. |
| 407 | |
| 408 | If the text is not a sequence of valid character encodings, this |
| 409 | function will signal a @code{text:bad-encoding} error. |
| 410 | @end deftypefn |
| 411 | |
| 412 | @deftypefn {Libguile Function} {unsigned char *} scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len}) |
| 413 | Convert the array of @code{scm_char_t} values to a sequence of |
| 414 | variable-width character encodings. Return a pointer to the array of |
| 415 | bytes, and set @code{*@var{result_len}} to its length, in bytes. |
| 416 | |
| 417 | The returned byte sequence is terminated with a zero byte, which is not |
| 418 | counted in the length returned in @code{*@var{result_len}}. |
| 419 | |
| 420 | The returned byte sequence is allocated with @code{malloc}; it is the |
| 421 | caller's responsibility to free it. |
| 422 | |
| 423 | If @var{fixed} contains a value which is not a valid Guile character, |
| 424 | this function will signal a @code{text:bad-encoding} error. |
| 425 | @end deftypefn |
| 426 | |
| 427 | |
| 428 | @node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text |
| 429 | @subsection Exchanging Guile Text With the Outside World in C |
| 430 | |
| 431 | [[This is kind of a heavy-weight model, given that one end of the |
| 432 | conversion is always going to be the Guile encoding. Any way to shorten |
| 433 | things a bit?]] |
| 434 | |
| 435 | Guile provides functions for converting between Guile's internal text |
| 436 | representation and encodings popular in the outside world. These |
| 437 | functions are closely modeled after the @code{iconv} functions available |
| 438 | on some systems. |
| 439 | |
| 440 | To convert text between two encodings, you should first call |
| 441 | @code{scm_mb_iconv_open} to indicate the source and destination |
| 442 | encodings; this function returns a context object which records the |
| 443 | conversion to perform. |
| 444 | |
| 445 | Then, you should call @code{scm_mb_iconv} to actually convert the text. |
| 446 | This function expects input and output buffers, and a pointer to the |
| 447 | context you got from @code{scm_mb_iconv_open}. You don't need to pass |
| 448 | all your input to @code{scm_mb_iconv} at once; you can invoke it on |
| 449 | successive blocks of input (as you read it from a file, say), and it |
| 450 | will convert as much as it can each time, indicating when you should |
| 451 | grow your output buffer. |
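
The buffer protocol can be sketched with a small, self-contained
converter from ISO-8859-1 to UTF-8 that follows the same conventions:
the caller passes pointers to its buffer pointers and byte counts, and
the converter updates them as it goes. The names
@code{demo_latin1_to_utf8}, @code{DEMO_DONE}, and
@code{DEMO_OUTPUT_FULL} are hypothetical and stand in for the Libguile
interface; this is an illustration of the calling pattern, not of
Guile's implementation.

```c
#include <assert.h>
#include <stddef.h>

enum demo_status
{
  DEMO_DONE,                    /* all input was consumed */
  DEMO_OUTPUT_FULL              /* caller should supply more output space */
};

/* Convert ISO-8859-1 bytes at *INBUF to UTF-8 at *OUTBUF.  On return,
   *INBUF and *INBYTESLEFT describe the unconsumed input, and *OUTBUF
   and *OUTBYTESLEFT describe the unused output space.  */
enum demo_status
demo_latin1_to_utf8 (const unsigned char **inbuf, size_t *inbytesleft,
                     unsigned char **outbuf, size_t *outbytesleft)
{
  assert (inbuf != NULL && *inbuf != NULL);
  assert (outbuf != NULL && *outbuf != NULL);

  while (*inbytesleft > 0)
    {
      unsigned char b = **inbuf;
      size_t need = b < 0x80 ? 1 : 2;

      /* Never emit part of a character: stop before this one.  */
      if (*outbytesleft < need)
        return DEMO_OUTPUT_FULL;

      if (need == 1)
        *(*outbuf)++ = b;
      else
        {
          *(*outbuf)++ = (unsigned char) (0xC0 | (b >> 6));
          *(*outbuf)++ = (unsigned char) (0x80 | (b & 0x3F));
        }
      *outbytesleft -= need;
      (*inbuf)++;
      (*inbytesleft)--;
    }
  return DEMO_DONE;
}
```

When the converter reports that the output is full, the caller can flush
or enlarge its output buffer and call again with the same input
pointers; conversion resumes at the first unconsumed byte.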
| 452 | |
| 453 | An encoding may be @dfn{stateless}, or @dfn{stateful}. In most |
| 454 | encodings, a contiguous group of bytes from the sequence completely |
| 455 | specifies a particular character; these are stateless encodings. |
| 456 | However, some encodings require you to look back an unbounded number of |
| 457 | bytes in the stream to assign a meaning to a particular byte sequence; |
| 458 | such encodings are stateful. |
| 459 | |
| 460 | For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the |
| 461 | byte sequence @samp{27 36 66} indicates that subsequent bytes should be |
| 462 | taken in pairs and interpreted as characters from the JIS-0208 character |
| 463 | set. An arbitrary number of byte pairs may follow this sequence. The |
| 464 | byte sequence @samp{27 40 66} indicates that subsequent bytes should be |
| 465 | interpreted as @sc{ASCII}. In this encoding, you cannot tell whether a |
| 466 | given byte is an @sc{ASCII} character without looking back an arbitrary |
| 467 | distance for the most recent escape sequence, so it is a stateful |
| 468 | encoding. |
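
The sketch below makes this concrete. To decide how the byte at a given
offset should be interpreted, it has to scan from the start of the text,
tracking the most recent escape sequence. The names
@code{demo_mode_at}, @code{DEMO_ASCII}, and @code{DEMO_JIS0208} are
hypothetical, and the code recognizes only the two escape sequences
mentioned above, not the rest of @samp{ISO-2022-JP}.

```c
#include <assert.h>
#include <stddef.h>

enum demo_mode
{
  DEMO_ASCII,                   /* one byte per character */
  DEMO_JIS0208                  /* two bytes per character */
};

/* Return the mode in effect at byte offset POS in the LEN bytes at
   TEXT.  If POS falls inside an escape sequence, the mode selected by
   that sequence is returned.  */
enum demo_mode
demo_mode_at (const unsigned char *text, size_t len, size_t pos)
{
  enum demo_mode mode = DEMO_ASCII;     /* the text starts out in ASCII */
  size_t i = 0;

  assert (pos <= len);
  while (i < pos)
    {
      int is_escape = text[i] == 27 && i + 2 < len && text[i + 2] == 66;

      if (is_escape && text[i + 1] == 36)
        {
          mode = DEMO_JIS0208;          /* 27 36 66: switch to JIS-0208 */
          i += 3;
        }
      else if (is_escape && text[i + 1] == 40)
        {
          mode = DEMO_ASCII;            /* 27 40 66: switch to ASCII */
          i += 3;
        }
      else if (mode == DEMO_JIS0208)
        i += 2;                         /* skip one byte pair */
      else
        i += 1;                         /* skip one ASCII byte */
    }
  return mode;
}
```

The cost of each query grows with the offset, which is why a converter
for a stateful encoding keeps the current mode in a context object
instead of rediscovering it for every byte.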
| 469 | |
| 470 | In Guile, if a conversion involves a stateful encoding, the context |
| 471 | object carries any necessary state. Thus, you can have many independent |
| 472 | conversions to or from stateful encodings taking place simultaneously, |
| 473 | as long as each data stream uses its own context object for the |
| 474 | conversion. |
| 475 | |
| 476 | @deftp {Libguile Type} {struct scm_mb_iconv} |
| 477 | This is the type for context objects, which represent the encodings and |
| 478 | current state of an ongoing text conversion. A @code{struct |
| 479 | scm_mb_iconv} records the source and destination encodings, and keeps |
| 480 | track of any information needed to handle stateful encodings. |
| 481 | @end deftp |
| 482 | |
| 483 | @deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode}) |
| 484 | Return a pointer to a new @code{struct scm_mb_iconv} context object, |
| 485 | ready to convert from the encoding named @var{fromcode} to the encoding |
| 486 | named @var{tocode}. For stateful encodings, the context object is in |
| 487 | some appropriate initial state, ready for use with the |
| 488 | @code{scm_mb_iconv} function. |
| 489 | |
| 490 | When you are done using a context object, you may call |
| 491 | @code{scm_mb_iconv_close} to free it. |
| 492 | |
| 493 | If either @var{tocode} or @var{fromcode} is not the name of a known |
| 494 | encoding, this function will signal the @code{text:unknown-conversion} |
| 495 | error, described below. |
| 496 | |
| 497 | @c Try to use names here from the IANA list: |
| 498 | @c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets |
| 499 | Guile supports at least these encodings: |
| 500 | @table @samp |
| 501 | |
| 502 | @item US-ASCII |
| 503 | @sc{US-ASCII}, in the standard one-character-per-byte encoding. |
| 504 | |
| 505 | @item ISO-8859-1 |
| 506 | The usual character set for Western European languages, in its usual |
| 507 | one-character-per-byte encoding. |
| 508 | |
| 509 | @item Guile-MB |
| 510 | Guile's current internal multibyte encoding. The actual encoding this |
| 511 | name refers to will change from one version of Guile to the next. You |
| 512 | should use this when converting data between external sources and the |
| 513 | encoding used by Guile objects. |
| 514 | |
| 515 | You should @emph{not} use this as the encoding for data presented to the |
| 516 | outside world, for two reasons. 1) Its meaning will change over time, |
| 517 | so data written using the @samp{Guile-MB} encoding with one version of |
| 518 | Guile might not be readable with the @samp{Guile-MB} encoding in another |
| 519 | version of Guile. 2) It currently corresponds to @samp{Emacs-Mule}, |
| 520 | which was invented for Emacs's internal use, and was never intended to |
| 521 | serve as an exchange medium. |
| 522 | |
| 523 | @item Guile-Wide |
| 524 | Guile's character set, as an array of @code{scm_char_t} values. |
| 525 | |
| 526 | Note that this encoding is even less suitable for public use than |
| 527 | @samp{Guile-MB}, since the exact sequence of bytes depends heavily on the |
| 528 | size and endianness the host system uses for @code{scm_char_t}. Using |
| 529 | this encoding is very much like calling the |
| 530 | @code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte} |
| 531 | functions, except that @code{scm_mb_iconv} gives you more control over |
| 532 | buffer allocation and management. |
| 533 | |
| 534 | @item Emacs-Mule |
| 535 | This is the variable-length encoding for multi-lingual text used by GNU |
| 536 | Emacs, at least through version 20.4. You probably should not use this |
| 537 | encoding, as it is designed only for Emacs's internal use. However, we |
| 538 | provide it here because it's trivial to support, and some people |
| 539 | probably do have @samp{emacs-mule}-format files lying around. |
| 540 | |
| 541 | @end table |
| 542 | |
| 543 | (At the moment, this list doesn't include any character sets suitable for |
| 544 | external use that can actually handle multilingual data; this is |
| 545 | unfortunate, as it encourages users to write data in Emacs-Mule format, |
| 546 | which nobody but Emacs and Guile understands. We hope to add support |
| 547 | for Unicode in UTF-8 soon, which should solve this problem.) |
| 548 | |
| 549 | Case is not significant in encoding names. |
| 550 | |
| 551 | You can define your own conversions; see @ref{Implementing Your Own Text |
| 552 | Conversions}. |
| 553 | @end deftypefn |
| 554 | |
| 555 | @deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding}) |
| 556 | Return a non-zero value if Guile supports the encoding named @var{encoding}, and zero otherwise. |
| 557 | @end deftypefn |
| 558 | |
| 559 | @deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) |
| 560 | Convert a sequence of characters from one encoding to another. The |
| 561 | argument @var{context} specifies the encodings to use for the input and |
| 562 | output, and carries state for stateful encodings; use |
| 563 | @code{scm_mb_iconv_open} to create a @var{context} object for a |
| 564 | particular conversion. |
| 565 | |
| 566 | Upon entry to the function, @code{*@var{inbuf}} should point to the |
| 567 | input buffer, and @code{*@var{inbytesleft}} should hold the number of |
| 568 | input bytes present in the buffer; @code{*@var{outbuf}} should point to |
| 569 | the output buffer, and @code{*@var{outbytesleft}} should hold the number |
| 570 | of bytes available to hold the conversion results in that buffer. |
| 571 | |
| 572 | Upon exit from the function, @code{*@var{inbuf}} points to the first |
| 573 | unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number |
| 574 | of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after |
| 575 | the last output byte, and @code{*@var{outbytesleft}} holds the number of |
| 576 | bytes left unused in the output buffer. |
| 577 | |
| 578 | For stateful encodings, @var{context} carries encoding state from one |
| 579 | call to @code{scm_mb_iconv} to the next. Thus, successive calls to |
| 580 | @code{scm_mb_iconv} which use the same context object can convert a |
| 581 | stream of data one chunk at a time. |
| 582 | |
| 583 | If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is |
| 584 | taken as a request to reset the states of the input and the output |
| 585 | encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is |
| 586 | non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output |
| 587 | buffer to put the output encoding in its initial state. If the output |
| 588 | buffer is not large enough to hold this byte sequence, |
| 589 | @code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves |
| 590 | the shift states of @var{context}'s input and output encodings |
| 591 | unchanged. |
| 592 | |
| 593 | The @code{scm_mb_iconv} function always consumes only complete |
| 594 | characters or shift sequences from the input buffer, and the output |
| 595 | buffer always contains a sequence of complete characters or escape |
| 596 | sequences. |
| 597 | |
| 598 | If the input sequence contains characters which are not expressible in |
| 599 | the output encoding, @code{scm_mb_iconv} converts them in an |
| 600 | implementation-defined way. It may simply delete such characters. |
| 601 | |
| 602 | Some encodings use byte sequences which do not correspond to any textual |
| 603 | character. For example, the escape sequence of a stateful encoding has |
| 604 | no textual meaning. When converting from such an encoding, a call to |
| 605 | @code{scm_mb_iconv} might consume input but produce no output, since the |
| 606 | input sequence might contain only escape sequences. |
| 607 | |
| 608 | Normally, @code{scm_mb_iconv} returns the number of input characters it |
| 609 | could not convert perfectly to the output encoding. However, it may |
| 610 | return one of the @code{scm_mb_iconv_} codes described below, to |
| 611 | indicate an error. All of these codes are negative values. |
| 612 | |
| 613 | If the input sequence contains an invalid character encoding, conversion |
| 614 | stops before the invalid input character, and @code{scm_mb_iconv} |
| 615 | returns the constant value @code{scm_mb_iconv_bad_encoding}. |
| 616 | |
| 617 | If the input sequence ends with an incomplete character encoding, |
| 618 | @code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and |
| 619 | return the constant value @code{scm_mb_iconv_incomplete_encoding}. This |
| 620 | is not necessarily an error, if you expect to call @code{scm_mb_iconv} |
| 621 | again with more data which might contain the rest of the encoding |
| 622 | fragment. |
| 623 | |
| 624 | If the output buffer does not contain enough room to hold the converted |
| 625 | form of the complete input text, @code{scm_mb_iconv} converts as much as |
| 626 | it can, changes the input and output pointers to reflect the amount of |
| 627 | text successfully converted, and then returns |
| 628 | @code{scm_mb_iconv_too_big}. |
| 629 | @end deftypefn |
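
The chunk-at-a-time calling pattern described above can be sketched with the POSIX @code{iconv} function, which uses the same pointer-and-count buffer arguments. This is only an analogy, not Guile's implementation: POSIX @code{iconv} reports errors by returning @code{(size_t) -1} and setting @code{errno} (@code{E2BIG} playing the role of @code{scm_mb_iconv_too_big}), and it takes a non-const input pointer. The function name @code{convert_latin1_to_utf8} and the deliberately tiny intermediate buffer are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert INLEN bytes of ISO-8859-1 text at IN to UTF-8, storing the
   result in OUT (capacity OUTSIZE).  Returns the number of bytes
   written, or (size_t) -1 on failure.  Hypothetical helper: it uses
   POSIX iconv as a stand-in for scm_mb_iconv.  */
size_t
convert_latin1_to_utf8 (const char *in, size_t inlen,
                        char *out, size_t outsize)
{
  assert (in != NULL && out != NULL);

  iconv_t cd = iconv_open ("UTF-8", "ISO-8859-1");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inbuf = (char *) in;    /* POSIX iconv wants char **, not const */
  size_t inbytesleft = inlen;
  size_t total = 0;
  char chunk[4];                /* tiny on purpose, to force "too big" */

  while (inbytesleft > 0)
    {
      char *outbuf = chunk;
      size_t outbytesleft = sizeof chunk;
      size_t result = iconv (cd, &inbuf, &inbytesleft,
                             &outbuf, &outbytesleft);
      size_t produced = sizeof chunk - outbytesleft;

      /* Only "output buffer full" is recoverable here; if nothing was
         produced either, we could never make progress.  */
      if (result == (size_t) -1 && (errno != E2BIG || produced == 0))
        {
          iconv_close (cd);
          return (size_t) -1;
        }
      if (total + produced > outsize)
        {
          iconv_close (cd);
          return (size_t) -1;
        }

      /* Drain the chunk, then loop: the updated inbuf and inbytesleft
         already describe the unconsumed input.  */
      memcpy (out + total, chunk, produced);
      total += produced;
    }

  iconv_close (cd);
  return total;
}
```

Both encodings in this sketch are stateless, so no reset call is needed at the end; with a stateful output encoding, a final call with a null input buffer would be required to emit the closing shift sequence.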
| 630 | |
| 631 | Here are the status codes that might be returned by @code{scm_mb_iconv}. |
| 632 | They are all negative integers. |
| 633 | @table @code |
| 634 | |
| 635 | @item scm_mb_iconv_too_big |
| 636 | The conversion needs more room in the output buffer. Some characters |
| 637 | may have been consumed from the input buffer, and some characters may |
| 638 | have been placed in the available space in the output buffer. |
| 639 | |
| 640 | @item scm_mb_iconv_bad_encoding |
| 641 | @code{scm_mb_iconv} encountered an invalid character encoding in the |
| 642 | input buffer. Conversion stopped before the invalid character, so there |
| 643 | may be some characters consumed from the input buffer, and some |
| 644 | converted text in the output buffer. |
| 645 | |
| 646 | @item scm_mb_iconv_incomplete_encoding |
| 647 | The input buffer ends with an incomplete character encoding. The |
| 648 | incomplete encoding is left in the input buffer, unconsumed. This is |
| 649 | not necessarily an error, if you expect to call @code{scm_mb_iconv} |
| 650 | again with more data which might contain the rest of the incomplete |
| 651 | encoding. |
| 652 | |
| 653 | @end table |
| 654 | |
| 655 | |
| 656 | Finally, Guile provides a function for destroying conversion contexts. |
| 657 | |
| 658 | @deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context}) |
| 659 | Deallocate the conversion context object @var{context}, and all other |
| 660 | resources allocated by the call to @code{scm_mb_iconv_open} which |
| 661 | returned @var{context}. |
| 662 | @end deftypefn |
| 663 | |
| 664 | |
| 665 | @node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text |
| 666 | @subsection Implementing Your Own Text Conversions |
| 667 | |
| 668 | [[note that conversions to and from Guile must produce streams |
| 669 | containing only valid character encodings, or else Guile will crash]] |
| 670 | |
| 671 | This section describes the interface for adding your own encoding |
| 672 | conversions for use with @code{scm_mb_iconv}. The interface here is |
| 673 | borrowed from the GNOME Project's @file{libunicode} library. |
| 674 | |
| 675 | Guile's @code{scm_mb_iconv} function works by converting the input text |
| 676 | to a stream of @code{scm_char_t} characters, and then converting |
| 677 | those characters to the desired output encoding. This makes it easy |
| 678 | for Guile to choose the appropriate conversion back ends for an |
| 679 | arbitrary pair of input and output encodings, but it also means that the |
| 680 | accuracy and quality of the conversions depend on the fidelity of |
| 681 | Guile's internal character set to the source and destination encodings. |
| 682 | Since @code{scm_mb_iconv} will be used almost exclusively for converting |
| 683 | to and from Guile's internal character set, this shouldn't be a problem. |
| 684 | |
| 685 | To add support for a particular encoding to Guile, you must provide one |
| 686 | function (called the @dfn{read} function) which converts from your |
| 687 | encoding to an array of @code{scm_char_t}'s, and another function |
| 688 | (called the @dfn{write} function) to convert from an array of |
| 689 | @code{scm_char_t}'s back into your encoding. To convert from some |
| 690 | encoding @var{a} to some other encoding @var{b}, Guile pairs up |
| 691 | @var{a}'s read function with @var{b}'s write function. Each call to |
| 692 | @code{scm_mb_iconv} passes text in encoding @var{a} through the read |
| 693 | function, to produce an array of @code{scm_char_t}'s, and then passes |
| 694 | that array to the write function, to produce text in encoding @var{b}. |
| 695 | |
| 696 | For stateful encodings, a read or write function can hang its own data |
| 697 | structures off the conversion object, and provide its own functions to |
| 698 | allocate and destroy them; this allows read and write functions to |
| 699 | maintain whatever state they like. |
| 700 | |
| 701 | The Guile conversion back end represents each available encoding with a |
| 702 | @code{struct scm_mb_encoding} object. |
| 703 | |
| 704 | @deftp {Libguile Type} {struct scm_mb_encoding} |
| 705 | This data structure describes an encoding. It has the following |
| 706 | members: |
| 707 | |
| 708 | @table @code |
| 709 | |
| 710 | @item char **names |
| 711 | An array of strings, giving the various names for this encoding. The |
| 712 | array should be terminated by a zero pointer. Case is not significant |
| 713 | in encoding names. |
| 714 | |
| 715 | The @code{scm_mb_iconv_open} function searches the list of registered |
| 716 | encodings for an encoding whose @code{names} array matches its |
| 717 | @var{tocode} or @var{fromcode} argument. |
| 718 | |
| 719 | @item int (*init) (void **@var{cookie}) |
| 720 | An initialization function for the encoding's private data. |
| 721 | @code{scm_mb_iconv_open} will call this function, passing it the address |
| 722 | of the cookie for this encoding in this context. (We explain cookies |
| 723 | below.) There is no way for the @code{init} function to tell whether |
| 724 | the encoding will be used for reading or writing. |
| 725 | |
| 726 | Note that @code{init} receives a @emph{pointer} to the cookie, not the |
| 727 | cookie itself. Because the type of @var{cookie} is @code{void **}, the |
| 728 | C compiler will not check it as carefully as it would other types. |
| 729 | |
| 730 | The @code{init} member may be zero, indicating that no initialization is |
| 731 | necessary for this encoding. |
| 732 | |
| 733 | @item int (*destroy) (void **@var{cookie}) |
| 734 | A deallocation function for the encoding's private data. |
| 735 | @code{scm_mb_iconv_close} calls this function, passing it the address of |
| 736 | the cookie for this encoding in this context. The @code{destroy} |
| 737 | function should free any data the @code{init} function allocated. |
| 738 | |
| 739 | Note that @code{destroy} receives a @emph{pointer} to the cookie, not the |
| 740 | cookie itself. Because the type of @var{cookie} is @code{void **}, the |
| 741 | C compiler will not check it as carefully as it would other types. |
| 742 | |
| 743 | The @code{destroy} member may be zero, indicating that this encoding |
| 744 | doesn't need to perform any special action to destroy its local data. |
| 745 | |
| 746 | @item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft}) |
| 747 | Put the encoding into its initial shift state. Guile calls this |
| 748 | function whether the encoding is being used for input or output, so this |
| 749 | should take appropriate steps for both directions. If @var{outbuf} and |
| 750 | @var{outbytesleft} are valid, the reset function should emit an escape |
| 751 | sequence to reset the output stream to its initial state; @var{outbuf} |
| 752 | and @var{outbytesleft} should be handled just as for |
| 753 | @code{scm_mb_iconv}. |
| 754 | |
| 755 | This function can return an @code{scm_mb_iconv_} error code |
| 756 | (@pxref{Exchanging Guile Text With the Outside World in C}). If it |
| 757 | returns @code{scm_mb_iconv_too_big}, then the output buffer's shift |
| 758 | state must be left unchanged. |
| 759 | |
| 760 | Note that @code{reset} receives the cookie's value itself, not a pointer |
| 761 | to the cookie, as the @code{init} and @code{destroy} functions do. |
| 762 | |
| 763 | The @code{reset} member may be zero, indicating that this encoding |
| 764 | doesn't use a shift state. |
| 765 | |
| 766 | @item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft}) |
| 767 | Read some bytes and convert into an array of Guile characters. This is |
| 768 | the encoding's read function. |
| 769 | |
| 770 | On entry, there are @code{*@var{inbytesleft}} bytes of text at @code{*@var{inbuf}} to |
| 771 | be converted, and @code{*@var{outcharsleft}} characters available at |
| 772 | @code{*@var{outbuf}} to hold the results. |
| 773 | |
| 774 | On exit, @code{*@var{inbytesleft}} and @code{*@var{inbuf}} indicate the input bytes |
| 775 | still not consumed. @code{*@var{outcharsleft}} and @code{*@var{outbuf}} indicate the |
| 776 | output buffer space still not filled. (By exclusion, these indicate |
| 777 | which input bytes were consumed, and which output characters were |
| 778 | produced.) |
| 779 | |
| 780 | Return one of the @code{enum scm_mb_read_result} values, described below. |
| 781 | |
| 782 | Note that @code{read} receives the cookie's value itself, not a pointer |
| 783 | to the cookie, as the @code{init} and @code{destroy} functions do. |
| 784 | |
| 785 | @item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) |
| 786 | Convert an array of Guile characters to output bytes. This is |
| 787 | the encoding's write function. |
| 788 | |
| 789 | On entry, there are @code{*@var{incharsleft}} Guile characters available at |
| 790 | @code{*@var{inbuf}}, and @code{*@var{outbytesleft}} bytes available to store output at |
| 791 | @code{*@var{outbuf}}. |
| 792 | |
| 793 | On exit, @code{*@var{incharsleft}} and @code{*@var{inbuf}} indicate the number of |
| 794 | Guile characters left unconverted (because there was insufficient room |
| 795 | in the output buffer to hold their converted forms), and |
| 796 | @code{*@var{outbytesleft}} and @code{*@var{outbuf}} indicate the unused portion of the |
| 797 | output buffer. |
| 798 | |
| 799 | Return one of the @code{enum scm_mb_write_result} values, described below. |
| 800 | |
| 801 | Note that @code{write} receives the cookie's value itself, not a pointer |
| 802 | to the cookie, as the @code{init} and @code{destroy} functions do. |
| 803 | |
| 804 | @item struct scm_mb_encoding *next |
| 805 | This is used by Guile to maintain a linked list of encodings. It is |
| 806 | filled in when you call @code{scm_mb_register_encoding} to add your |
| 807 | encoding to the list. |
| 808 | |
| 809 | @end table |
| 810 | @end deftp |
| 811 | |
| 812 | Here is the enumerated type for the values an encoding's read function |
| 813 | can return: |
| 814 | |
| 815 | @deftp {Libguile Type} {enum scm_mb_read_result} |
| 816 | This type represents the result of a call to an encoding's read |
| 817 | function. It has the following values: |
| 818 | |
| 819 | @table @code |
| 820 | |
| 821 | @item scm_mb_read_ok |
| 822 | The read function consumed at least one byte of input. |
| 823 | |
| 824 | @item scm_mb_read_incomplete |
| 825 | The data present in the input buffer does not contain a complete |
| 826 | character encoding. No input was consumed, and no characters were |
| 827 | produced as output. This is not necessarily an error status, if there |
| 828 | is more data to pass through. |
| 829 | |
| 830 | @item scm_mb_read_error |
| 831 | The input contains an invalid character encoding. |
| 832 | |
| 833 | @end table |
| 834 | @end deftp |
| 835 | |
| 836 | Here is the enumerated type for the values an encoding's write function |
| 837 | can return: |
| 838 | |
| 839 | @deftp {Libguile Type} {enum scm_mb_write_result} |
| 840 | This type represents the result of a call to an encoding's write |
| 841 | function. It has the following values: |
| 842 | |
| 843 | @table @code |
| 844 | |
| 845 | @item scm_mb_write_ok |
| 846 | The write function was able to convert all the characters in @var{inbuf} |
| 847 | successfully. |
| 848 | |
| 849 | @item scm_mb_write_too_big |
| 850 | The write function filled the output buffer, but there are still |
| 851 | characters in @var{inbuf} left unconsumed; @var{inbuf} and |
| 852 | @var{incharsleft} indicate the unconsumed portion of the input buffer. |
| 853 | |
| 854 | @end table |
| 855 | @end deftp |
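
As an illustration of the read and write contracts described above, here is a minimal sketch of a stateless seven-bit @sc{ASCII} back end. Several things are assumptions, not part of the interface: the local definitions of @code{scm_char_t} and the two result enums stand in for declarations the Guile headers would supply, the names @code{ascii_read} and @code{ascii_write} are hypothetical, and the sketch assumes that Guile character numbers below 128 coincide with @sc{ASCII}. The text does not say what a read function should return when the output array is full before any input is consumed; this sketch reports @code{scm_mb_read_incomplete} in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for declarations the Guile headers would provide.  */
typedef unsigned long scm_char_t;
enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result
  { scm_mb_write_ok, scm_mb_write_too_big };

/* Read function: each input byte below 128 becomes one character.
   Conversion stops before the first invalid byte.  */
enum scm_mb_read_result
ascii_read (void *cookie, const char **inbuf, size_t *inbytesleft,
            scm_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                /* stateless encoding: cookie unused */
  assert (inbuf && inbytesleft && outbuf && outcharsleft);

  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      unsigned char byte = (unsigned char) **inbuf;

      if (byte > 127)
        /* Report the valid prefix first; the error surfaces on the
           next call, when the invalid byte is first in the buffer.  */
        return consumed > 0 ? scm_mb_read_ok : scm_mb_read_error;

      *(*outbuf)++ = byte;
      (*outcharsleft)--;
      (*inbuf)++;
      (*inbytesleft)--;
      consumed++;
    }

  return consumed > 0 ? scm_mb_read_ok : scm_mb_read_incomplete;
}

/* Write function: characters outside ASCII are replaced with '?',
   one possible implementation-defined treatment.  */
enum scm_mb_write_result
ascii_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
             char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  assert (inbuf && incharsleft && outbuf && outbytesleft);

  while (*incharsleft > 0)
    {
      if (*outbytesleft == 0)
        return scm_mb_write_too_big;

      scm_char_t c = **inbuf;
      *(*outbuf)++ = (c <= 127) ? (char) c : '?';
      (*outbytesleft)--;
      (*inbuf)++;
      (*incharsleft)--;
    }

  return scm_mb_write_ok;
}
```

Neither function keeps text in its cookie, so after a successful return the unconsumed input plus the output always account for the whole input text.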
| 856 | |
| 857 | |
| 858 | Conversions to or from stateful encodings need to keep track of each |
| 859 | encoding's current state. Each conversion context contains two |
| 860 | @code{void *} variables called @dfn{cookies}, one for the input |
| 861 | encoding, and one for the output encoding. These cookies are passed to |
| 862 | the encodings' functions, for them to use however they please. A |
| 863 | stateful encoding can use its cookie to hold a pointer to some object |
| 864 | which maintains the context's current shift state. Stateless encodings |
| 865 | will probably not use their cookies. |
| 866 | |
| 867 | The cookies' lifetime is the same as that of the context object. When |
| 868 | the user calls @code{scm_mb_iconv_close} to destroy a context object, |
| 869 | @code{scm_mb_iconv_close} calls the input and output encodings' |
| 870 | @code{destroy} functions, passing them their respective cookies, so each |
| 871 | encoding can free any data it allocated for that context. |
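
The cookie life cycle might look like the following sketch for a hypothetical stateful encoding. Everything named @code{example_} is an assumption made for illustration, as are the @code{struct shift_state} layout, the convention that @code{init} and @code{destroy} return zero on success, the stand-in value for @code{scm_mb_iconv_too_big}, and the three-byte escape sequence. Note how @code{init} and @code{destroy} take the address of the cookie, while @code{reset} takes its value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-context state for a stateful encoding.  */
struct shift_state
{
  int current_charset;          /* 0 means the initial shift state */
};

/* Stand-in for scm_mb_iconv_too_big; the real value is not given
   here, only that it is negative.  */
enum { example_too_big = -1 };

/* init receives the ADDRESS of the cookie and stores a pointer to
   freshly allocated state in it.  */
int
example_init (void **cookie)
{
  assert (cookie != NULL);

  struct shift_state *state = malloc (sizeof *state);
  if (state == NULL)
    return -1;                  /* assumed failure convention */
  state->current_charset = 0;
  *cookie = state;
  return 0;
}

/* destroy also receives the ADDRESS of the cookie.  */
int
example_destroy (void **cookie)
{
  assert (cookie != NULL);
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* reset receives the cookie VALUE.  If the output buffer cannot hold
   the escape sequence, the shift state is left unchanged.  */
int
example_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  static const char to_initial[3] = { 0x1b, '(', 'B' };
  struct shift_state *state = cookie;

  if (state->current_charset != 0 && outbuf && *outbuf && outbytesleft)
    {
      if (*outbytesleft < sizeof to_initial)
        return example_too_big;
      memcpy (*outbuf, to_initial, sizeof to_initial);
      *outbuf += sizeof to_initial;
      *outbytesleft -= sizeof to_initial;
    }

  state->current_charset = 0;
  return 0;
}
```

Clearing the cookie in @code{example_destroy} is a defensive choice of this sketch, so that a stale pointer cannot be reused by accident.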
| 872 | |
| 873 | Note that, if a read or write function returns a successful result code |
| 874 | like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining |
| 875 | input and the output must together represent the complete |
| 876 | input text; the encoding may not store any text temporarily in its |
| 877 | cookie. This is because, if @code{scm_mb_iconv} returns a successful |
| 878 | result to the user, it is correct for the user to assume that all the |
| 879 | consumed input has been converted and placed in the output buffer. |
| 880 | There is no ``flush'' operation to push any final results out of the |
| 881 | encodings' buffers. |
| 882 | |
| 883 | Here is the function you call to register a new encoding with the |
| 884 | conversion system: |
| 885 | |
| 886 | @deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding}) |
| 887 | Add the encoding described by @code{*@var{encoding}} to the set |
| 888 | understood by @code{scm_mb_iconv_open}. Once you have registered your |
| 889 | encoding, you can use it by calling @code{scm_mb_iconv_open} with one of |
| 890 | the names in @code{@var{encoding}->names}. |
| 891 | @end deftypefn |
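
To make the roles of the @code{names} and @code{next} members concrete, here is one way registration and lookup could work. This is a guess at the mechanism, not Guile's code: @code{struct example_encoding} is a cut-down stand-in holding only the two members used here (with @code{const} added to @code{names}), and @code{example_register}, @code{example_lookup}, and @code{names_equal} are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Cut-down stand-in for struct scm_mb_encoding.  */
struct example_encoding
{
  const char **names;                   /* zero-terminated array */
  struct example_encoding *next;        /* filled in by registration */
};

/* Head of the linked list of registered encodings.  */
static struct example_encoding *registered = NULL;

/* Compare two encoding names, ignoring case.  */
static int
names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;                      /* both must end together */
}

/* Push ENCODING onto the front of the list.  */
void
example_register (struct example_encoding *encoding)
{
  assert (encoding != NULL && encoding->names != NULL);
  encoding->next = registered;
  registered = encoding;
}

/* Return the registered encoding that lists NAME, or NULL.  */
struct example_encoding *
example_lookup (const char *name)
{
  struct example_encoding *e;
  const char **n;

  for (e = registered; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; n++)
      if (names_equal (*n, name))
        return e;
  return NULL;
}
```

Because the list links through the structure itself, the object passed to the registration function must stay alive for as long as the encoding is in use.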
| 892 | |
| 893 | |
| 894 | @node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C |
| 895 | @section Multibyte Text Processing Errors |
| 896 | |
| 897 | This section describes error conditions which code can signal to |
| 898 | indicate problems encountered while processing multibyte text. In each |
| 899 | case, the arguments @var{message} and @var{args} are an error format |
| 900 | string and arguments to be substituted into the string, as accepted by |
| 901 | the @code{display-error} function. |
| 902 | |
| 903 | @deffn Condition text:not-char-boundary func message args object offset |
| 904 | By calling @var{func}, the program attempted to access a character at |
| 905 | byte offset @var{offset} in the Guile object @var{object}, but |
| 906 | @var{offset} is not the start of a character's encoding in @var{object}. |
| 907 | |
| 908 | Typically, @var{object} is a string or symbol. If the function signalling |
| 909 | the error cannot find the Guile object that contains the text it is |
| 910 | inspecting, it should use @code{#f} for @var{object}. |
| 911 | @end deffn |
| 912 | |
| 913 | @deffn Condition text:bad-encoding func message args object |
| 914 | By calling @var{func}, the program attempted to interpret the text in |
| 915 | @var{object}, but @var{object} contains a byte sequence which is not a |
| 916 | valid encoding for any character. |
| 917 | @end deffn |
| 918 | |
| 919 | @deffn Condition text:not-guile-char func message args number |
| 920 | By calling @var{func}, the program attempted to treat @var{number} as the |
| 921 | number of a character in the Guile character set, but @var{number} does |
| 922 | not correspond to any character in the Guile character set. |
| 923 | @end deffn |
| 924 | |
| 925 | @deffn Condition text:unknown-conversion func message args from to |
| 926 | By calling @var{func}, the program attempted to convert from an encoding |
| 927 | named @var{from} to an encoding named @var{to}, but Guile does not |
| 928 | support such a conversion. |
| 929 | @end deffn |
| 930 | |
| 931 | @deftypevr {Libguile Variable} SCM scm_text_not_char_boundary |
| 932 | @deftypevrx {Libguile Variable} SCM scm_text_bad_encoding |
| 933 | @deftypevrx {Libguile Variable} SCM scm_text_not_guile_char |
| 934 | These variables hold the Scheme symbol objects whose names are the |
| 935 | condition symbols above. You can use these when signalling these |
| 936 | errors, instead of looking them up yourself. |
| 937 | @end deftypevr |
| 938 | |
| 939 | |
| 940 | @node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C |
| 941 | @section Why Guile Does Not Use a Fixed-Width Encoding |
| 942 | |
| 943 | Multibyte encodings are clumsier to work with than encodings which use a |
| 944 | fixed number of bytes for every character. For example, using a |
| 945 | fixed-width encoding, we can extract the @var{i}th character of a string |
| 946 | in constant time, and we can always substitute the @var{i}th character |
| 947 | of a string with any other character without reallocating or copying the |
| 948 | string. |
| 949 | |
| 950 | However, there are no fixed-width encodings which include the characters |
| 951 | we wish to include, and also fit in a reasonable amount of space. |
| 952 | Despite the Unicode standard's claims to the contrary, Unicode is not |
| 953 | really a fixed-width encoding. Unicode uses surrogate pairs to |
| 954 | represent characters outside the 16-bit range; a surrogate pair must be |
| 955 | treated as a single character, but occupies two 16-bit spaces. As of |
| 956 | this writing, there are already plans to assign characters to the |
| 957 | surrogate character codes. Three- and four-byte encodings are |
| 958 | too wasteful for a majority of Guile's users, who only need @sc{ASCII} |
| 959 | and a few accented characters. |
| 960 | |
| 961 | Another alternative would be to have several different fixed-width |
| 962 | string representations, each with a different element size. For each |
| 963 | string, Guile would use the smallest element size capable of |
| 964 | accommodating the string's text. This would allow users of English and |
| 965 | the Western European languages to use the traditional memory-efficient |
| 966 | encodings. However, if Guile has @var{n} string representations, then |
| 967 | users must write @var{n} versions of any code which manipulates text |
| 968 | directly --- one for each element size. And if a user wants to operate |
| 969 | on two strings simultaneously, and wants to avoid testing the string |
| 970 | sizes within the loop, she must make @var{n}*@var{n} copies of the loop. |
| 971 | Most users will simply not bother. Instead, they will write code which |
| 972 | supports only one string size, leaving us back where we started. By |
| 973 | using a single internal representation, Guile makes it easier for users |
| 974 | to write multilingual code. |
| 975 | |
| 976 | [[What about tagging each string with its encoding? |
| 977 | "Every extension must be written to deal with every encoding"]] |
| 978 | |
| 979 | [[You don't really want to index strings anyway.]] |
| 980 | |
| 981 | Finally, Guile's multibyte encoding is not so bad. Unlike a two- or |
| 982 | four-byte encoding, it is efficient in space for American and European |
| 983 | users. Furthermore, the properties described above mean that many |
| 984 | functions can be coded just as they would for a single-byte encoding; |
| 985 | see @ref{Promised Properties of the Guile Multibyte Encoding}. |
| 986 | |
| 987 | @bye |