HCoop Git - bpt/guile.git/blame_incremental

Commit	Line	Data
	1	\input texinfo
	2	@setfilename mbapi.info
	3	@settitle Multibyte API
	4	@setchapternewpage off
	5
	6	@c Open issues:
	7
	8	@c What's the best way to report errors? Should functions return a
	9	@c magic value, according to C tradition, or should they signal a
	10	@c Guile exception?
	11
	12	@c
	13
	14
	15	@node Working With Multibyte Strings in C
	16	@chapter Working With Multibyte Strings in C
	17
	18	Guile allows strings to contain characters drawn from a wide variety of
	19	languages, including many Asian, Eastern European, and Middle Eastern
	20	languages, in a uniform and unrestricted way. The string representation
	21	normally used in C code --- an array of @sc{ASCII} characters --- is not
	22	sufficient for Guile strings, since they may contain characters not
	23	present in @sc{ASCII}.
	24
	25	Instead, Guile uses a very large character set, and encodes each
	26	character as a sequence of one or more bytes. We call this
	27	variable-width encoding a @dfn{multibyte} encoding. Guile uses this
	28	single encoding internally for all strings, symbol names, error
	29	messages, etc., and performs appropriate conversions upon input and
	30	output.
	31
	32	The use of this variable-width encoding is almost invisible to Scheme
	33	code. Strings are still indexed by character number, not by byte
	34	offset; @code{string-length} still returns the length of a string in
	35	characters, not in bytes. @code{string-ref} and @code{string-set!} are
	36	no longer guaranteed to be constant-time operations, but Guile uses
	37	various strategies to reduce the impact of this change.
	38
	39	However, the encoding is visible via Guile's C interface, which gives
	40	the user direct access to a string's bytes. This chapter explains how
	41	to work with Guile multibyte text in C code. Since variable-width
	42	encodings are clumsier to work with than simple fixed-width encodings,
	43	Guile provides a set of standard macros and functions for manipulating
	44	multibyte text to make the job easier. Furthermore, Guile makes some
	45	promises about the encoding which you can use in writing your own text
	46	processing code.
	47
	48	While we discuss guaranteed properties of Guile's encoding, and provide
	49	functions to operate on its character set, we do not actually specify
	50	either the character set or encoding here. This is because we expect
	51	both of them to change in the future: currently, Guile uses the same
	52	encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
	53	as well) to use Unicode and UTF-8, with some extensions. This will make
	54	it more comfortable to use Guile with other systems which use UTF-8,
	55	like the GTK user interface toolkit.
	56
	57	@menu
	58	* Multibyte String Terminology::
	59	* Promised Properties of the Guile Multibyte Encoding::
	60	* Functions for Operating on Multibyte Text::
	61	* Multibyte Text Processing Errors::
	62	* Why Guile Does Not Use a Fixed-Width Encoding::
	63	@end menu
	64
	65
	66	@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
	67	@section Multibyte String Terminology
	68
	69	In the descriptions which follow, we make the following definitions:
	70	@table @dfn
	71
	72	@item byte
	73	A @dfn{byte} is a number between 0 and 255. It has no inherent textual
	74	interpretation. So 65 is a byte, not a character.
	75
	76	@item character
	77	A @dfn{character} is a unit of text. It has no inherent numeric value.
	78	@samp{A} and @samp{.} are characters, not bytes. (This is different
	79	from the C language's definition of @dfn{character}; in this chapter, we
	80	will always use a phrase like ``the C language's @code{char} type'' when
	81	that's what we mean.)
	82
	83	@item character set
	84	A @dfn{character set} is an invertible mapping between numbers and a
	85	given set of characters. @sc{ASCII} is a character set assigning
	86	characters to the numbers 0 through 127. It maps @samp{A} onto the
	87	number 65, and @samp{.} onto 46.
	88
	89	Note that a character set maps characters onto numbers, @emph{not
	90	necessarily} onto bytes. For example, the Unicode character set maps
	91	the Greek lower-case @samp{alpha} character onto the number 945, which
	92	is not a byte.
	93
	94	(This is what Internet standards would call a ``coded character set''.)
	95
	96	@item encoding
	97	An encoding maps numbers onto sequences of bytes. For example, the
	98	UTF-8 encoding, defined in the Unicode Standard, would map the number
	99	945 onto the sequence of bytes @samp{206 177}. When using the
	100	@sc{ASCII} character set, every number assigned also happens to be a
	101	byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.
	102
	103	(This is what Internet standards would call a ``character encoding
	104	scheme''.)
	105
	106	@end table
	107
	108	Thus, to turn a character into a sequence of bytes, you need a character
	109	set to assign a number to that character, and then an encoding to turn
	110	that number into a sequence of bytes.
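
To make the two-step mapping concrete, here is a minimal sketch in C of the encoding step, using UTF-8 as the example encoding. The helper name @code{encode_utf8_small} is hypothetical and is not part of libguile; it handles only numbers up to 2047, which is enough to show how the number 945 becomes the bytes @samp{206 177}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libguile: encode the number c
   (0 <= c <= 0x7FF) as UTF-8 into out, and return the byte count.
   Larger numbers need 3 or 4 bytes and are omitted for brevity.  */
static size_t
encode_utf8_small (unsigned int c, unsigned char *out)
{
  assert (out != NULL && c <= 0x7FF);
  if (c < 0x80)
    {
      out[0] = (unsigned char) c;                /* ASCII: one byte, unchanged */
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (c >> 6));    /* leading byte: 110xxxxx */
  out[1] = (unsigned char) (0x80 | (c & 0x3F));  /* trailing byte: 10xxxxxx */
  return 2;
}
```

The character set supplies the number (945 for Greek lower-case alpha in Unicode); the function above performs only the second step, turning that number into bytes.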
	111
	112	Likewise, to interpret a sequence of bytes as a sequence of characters,
	113	you use an encoding to extract a sequence of numbers from the bytes, and
	114	then a character set to turn the numbers into characters.
	115
	116	Errors can occur while carrying out either of these processes. Under a
	117	particular encoding, a given sequence of bytes might not correspond to
	118	any number. For example, the byte sequence @samp{128 128} is not a
	119	valid encoding of any number under UTF-8.
	120
	121	Having carefully defined our terminology, we will now abuse it.
	122
	123	We will sometimes use the word @dfn{character} to refer to the number
	124	assigned to a character by a character set, in contexts where it's
	125	obvious we mean a number.
	126
	127	Sometimes there is a close association between a particular encoding and
	128	a particular character set. Thus, we may sometimes refer to the
	129	character set and encoding together as an @dfn{encoding}.
	130
	131
	132	@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
	133	@section Promised Properties of the Guile Multibyte Encoding
	134
	135	Internally, Guile uses a single encoding for all text --- symbols,
	136	strings, error messages, etc. Here we list a number of helpful
	137	properties of Guile's encoding. It is correct to write code which
	138	assumes these properties; code which uses these assumptions will be
	139	portable to all future versions of Guile, as far as we know.
	140
	141	@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
	142	the obvious way.} This means that a standard C string containing only
	143	@sc{ASCII} characters is a valid Guile string (except for the terminator;
	144	Guile strings store the length explicitly, so they can contain null
	145	characters).
	146
	147	@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
	148	and 255.} That is, when we turn a non-@sc{ASCII} character into a
	149	series of bytes, none of those bytes can ever be mistaken for the
	150	encoding of an @sc{ASCII} character. This means that you can search a
	151	Guile string for an @sc{ASCII} character using the standard
	152	@code{memchr} library function. By extension, you can search for an
	153	@sc{ASCII} substring in a Guile string using a traditional substring
	154	search algorithm --- you needn't add special checks to verify encoding
	155	boundaries, etc.
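
As an illustration of this promise, the following sketch searches multibyte text for an @sc{ASCII} character with plain @code{memchr}. It assumes text in UTF-8, which has the same property, as a stand-in for Guile's encoding; the helper name @code{find_ascii} is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of the ASCII character ch in the len bytes at text, or -1 if there
   is none.  No decoding is needed, because no byte of a non-ASCII
   character's encoding falls in the range 0..127.  */
static long
find_ascii (const unsigned char *text, size_t len, char ch)
{
  const unsigned char *hit;

  assert (text != NULL && (unsigned char) ch < 128);
  hit = memchr (text, ch, len);
  return hit ? (long) (hit - text) : -1;
}
```

Note that the result is a byte offset, not a character index; converting between the two requires scanning the text.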
	156
	157	@b{No character encoding is a subsequence of any other character
	158	encoding.} (This is just a stronger version of the previous promise.)
	159	This means that you can search for occurrences of one Guile string
	160	within another Guile string just as if they were raw byte strings. You
	161	can use the stock @code{memmem} function (provided on GNU systems, at
	162	least) for such searches. If you don't need the ability to represent
	163	null characters in your text, you can still use null-termination for
	164	strings, and use the traditional string-handling functions like
	165	@code{strlen}, @code{strstr}, and @code{strcat}.
	166
	167	@b{You can always determine the full length of a character's encoding
	168	from its first byte.} Guile provides the macro @code{scm_mb_len} which
	169	computes the encoding's length from its first byte. Given the first
	170	rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
	171	@var{b} <= 127}, returns 1.
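
The sketch below shows what such a length computation can look like, again using UTF-8 as an assumed stand-in for Guile's encoding. Both function names are hypothetical; @code{utf8_len_from_lead} plays the role that @code{scm_mb_len} plays for Guile text, and @code{count_chars} uses it to hop from one leading byte to the next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical UTF-8 analogue of scm_mb_len: return the length of the
   encoding that starts with byte b, or -1 if b cannot start one.  */
static int
utf8_len_from_lead (unsigned char b)
{
  if (b < 0x80)
    return 1;                   /* 0xxxxxxx: ASCII */
  if ((b & 0xE0) == 0xC0)
    return 2;                   /* 110xxxxx */
  if ((b & 0xF0) == 0xE0)
    return 3;                   /* 1110xxxx */
  if ((b & 0xF8) == 0xF0)
    return 4;                   /* 11110xxx */
  return -1;                    /* trailing byte or invalid byte */
}

/* Count the characters in the len bytes at p by looking only at
   leading bytes.  Returns -1 on a bad leading byte or a truncated
   final character.  Trailing bytes are not validated here.  */
static int
count_chars (const unsigned char *p, int len)
{
  int n = 0, i = 0;

  assert (p != NULL && len >= 0);
  while (i < len)
    {
      int k = utf8_len_from_lead (p[i]);
      if (k < 0 || i + k > len)
        return -1;
      i += k;
      n++;
    }
  return n;
}
```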
	172
	173	@b{Given an arbitrary byte position in a Guile string, you can always
	174	find the beginning and end of the character containing that byte without
	175	scanning too far in either direction.} This means that, if you are sure
	176	a byte sequence is a valid encoding of a character sequence, you can
	177	find character boundaries without keeping track of the beginning and
	178	ending of the overall string. This promise relies on the fact that, in
	179	addition to storing the string's length explicitly, Guile always either
	180	terminates the string's storage with a zero byte, or shares it with
	181	another string which is terminated this way.
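
A minimal sketch of finding the start of the character that contains a given byte, assuming UTF-8 as a stand-in for Guile's encoding. The name @code{utf8_floor} is hypothetical. Unlike the promise above, this version takes the start of the buffer as an extra argument, purely as a defensive bound for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return a pointer to the first byte of the
   character containing *p.  In UTF-8 every trailing byte has the form
   10xxxxxx, so we step backward until we reach a byte that is not a
   trailing byte.  The text is assumed to be valid.  */
static const unsigned char *
utf8_floor (const unsigned char *start, const unsigned char *p)
{
  assert (start != NULL && p >= start);
  while (p > start && (*p & 0xC0) == 0x80)
    p--;
  return p;
}
```

Because an encoding has a bounded length, the loop steps back over at most a few bytes on valid text.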
	182
	183
	184	@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
	185	@section Functions for Operating on Multibyte Text
	186
	187	Guile provides a variety of functions, variables, and types for working
	188	with multibyte text.
	189
	190	@menu
	191	* Basic Multibyte Character Processing::
	192	* Finding Character Encoding Boundaries::
	193	* Multibyte String Functions::
	194	* Exchanging Guile Text With the Outside World in C::
	195	* Implementing Your Own Text Conversions::
	196	@end menu
	197
	198
	199	@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
	200	@subsection Basic Multibyte Character Processing
	201
	202	Here are the essential types and functions for working with Guile text.
	203	Guile uses the C type @code{unsigned char *} to refer to text encoded
	204	with Guile's encoding.
	205
	206	Note that any operation marked here as a ``Libguile Macro'' might
	207	evaluate its argument multiple times.
	208
	209	@deftp {Libguile Type} scm_char_t
	210	This is a signed integral type large enough to hold any character in
	211	Guile's character set. All character numbers are non-negative.
	212	@end deftp
	213
	214	@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
	215	Return the character whose encoding starts at @var{p}. If @var{p} does
	216	not point at a valid character encoding, the behavior is undefined.
	217	@end deftypefn
	218
	219	@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
	220	Place the encoded form of the Guile character @var{c} at @var{p}, and
	221	return its length in bytes. If @var{c} is not a Guile character, the
	222	behavior is undefined.
	223	@end deftypefn
	224
	225	@deftypevr {Libguile Constant} int scm_mb_max_len
	226	The maximum length of any character's encoding, in bytes. You may
	227	assume this is relatively small --- less than a dozen or so.
	228	@end deftypevr
	229
	230	@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
	231	If @var{b} is the first byte of a character's encoding, return the full
	232	length of the character's encoding, in bytes. If @var{b} is not a valid
	233	leading byte, the behavior is undefined.
	234	@end deftypefn
	235
	236	@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
	237	Return the length of the encoding of the character @var{c}, in bytes.
	238	If @var{c} is not a valid Guile character, the behavior is undefined.
	239	@end deftypefn
	240
	241	@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
	242	@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
	243	@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
	244	@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
	245	These are functions identical to the corresponding macros. You can use
	246	them in situations where the overhead of a function call is acceptable,
	247	and the cleaner semantics of function application are desirable.
	248	@end deftypefn
	249
	250
	251	@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
	252	@subsection Finding Character Encoding Boundaries
	253
	254	These are functions for finding the boundaries between characters in
	255	multibyte text.
	256
	257	Note that any operation marked here as a ``Libguile Macro'' might
	258	evaluate its argument multiple times, unless the definition promises
	259	otherwise.
	260
	261	@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
	262	Return non-zero iff @var{p} points to the start of a character in
	263	multibyte text.
	264
	265	This macro will evaluate its argument only once.
	266	@end deftypefn
	267
	268	@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
	269	``Round'' @var{p} to the previous character boundary. That is, if
	270	@var{p} points to the middle of the encoding of a Guile character,
	271	return a pointer to the first byte of the encoding. If @var{p} points
	272	to the start of the encoding of a Guile character, return @var{p}
	273	unchanged.
	274	@end deftypefn
	275
	276	@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
	277	``Round'' @var{p} to the next character boundary. That is, if @var{p}
	278	points to the middle of the encoding of a Guile character, return a
	279	pointer to the first byte of the encoding of the next character. If
	280	@var{p} points to the start of the encoding of a Guile character, return
	281	@var{p} unchanged.
	282	@end deftypefn
	283
	284	Note that it is usually not friendly for functions to silently correct
	285	byte offsets that point into the middle of a character's encoding. Such
	286	offsets almost always indicate a programming error, and they should be
	287	reported as early as possible. So, when you write code which operates
	288	on multibyte text, you should not use functions like these to ``clean
	289	up'' byte offsets which the originator believes to be correct; instead,
	290	your code should signal a @code{text:not-char-boundary} error as soon as
	291	it detects an invalid offset. @xref{Multibyte Text Processing Errors}.
	292
	293
	294	@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
	295	@subsection Multibyte String Functions
	296
	297	These functions allow you to operate on multibyte strings: sequences of
	298	character encodings.
	299
	300	@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
	301	Return the number of Guile characters encoded by the @var{len} bytes at
	302	@var{p}.
	303
	304	If the sequence contains any invalid character encodings, or ends with
	305	an incomplete character encoding, signal a @code{text:bad-encoding}
	306	error.
	307	@end deftypefn
	308
	309	@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
	310	Return the character whose encoding starts at @code{*@var{pp}}, and
	311	advance @code{*@var{pp}} to the start of the next character. Return -1
	312	if @code{*@var{pp}} does not point to a valid character encoding.
	313	@end deftypefn
	314
	315	@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
	316	If @var{p} points to the middle of the encoding of a Guile character,
	317	return a pointer to the first byte of the encoding. If @var{p} points
	318	to the start of the encoding of a Guile character, return the start of
	319	the previous character's encoding.
	320
	321	This is like @code{scm_mb_floor}, but the returned pointer will always
	322	be before @var{p}. If you use this function to drive an iteration, it
	323	guarantees backward progress.
	324	@end deftypefn
	325
	326	@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
	327	If @var{p} points to the encoding of a Guile character, return a pointer
	328	to the first byte of the encoding of the next character.
	329
	330	This is like @code{scm_mb_ceiling}, but the returned pointer will always
	331	be after @var{p}. If you use this function to drive an iteration, it
	332	guarantees forward progress.
	333	@end deftypefn
	334
	335	@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
	336	Assuming that the @var{len} bytes starting at @var{p} are a
	337	concatenation of valid character encodings, return a pointer to the
	338	start of the @var{i}'th character encoding in the sequence.
	339
	340	This function scans the sequence from the beginning to find the
	341	@var{i}'th character, and will generally require time proportional to
	342	the distance from @var{p} to the returned address.
	343
	344	If the sequence contains any invalid character encodings, or ends with
	345	an incomplete character encoding, signal a @code{text:bad-encoding}
	346	error.
	347	@end deftypefn
	348
	349	It is common to process the characters in a string from left to right.
	350	However, if you fetch each character using @code{scm_mb_index}, each
	351	call will scan the text from the beginning, so your loop will require
	352	time proportional to at least the square of the length of the text. To
	353	avoid this poor performance, you can use an @code{scm_mb_cache}
	354	structure and the @code{scm_mb_index_cached} macro.
	355
	356	@deftp {Libguile Type} {struct scm_mb_cache}
	357	This structure holds information that allows a string scanning operation
	358	to use the results from a previous scan of the string. It has the
	359	following members:
	360	@table @code
	361
	362	@item character
	363	An index, in characters, into the string.
	364
	365	@item byte
	366	The index, in bytes, of the start of that character.
	367
	368	@end table
	369
	370	In other words, @code{byte} is the byte offset of the
	371	@code{character}'th character of the string. Note that if @code{byte}
	372	and @code{character} are equal, then all characters before that point
	373	must have encodings exactly one byte long, and the string can be indexed
	374	normally.
	375
	376	All elements of a @code{struct scm_mb_cache} structure should be
	377	initialized to zero before its first use, and whenever the string's text
	378	changes.
	379	@end deftp
	380
	381	@deftypefn {Libguile Macro} {const unsigned char *} scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
	382	@deftypefnx {Libguile Function} {const unsigned char *} scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
	383	This macro and this function are identical to @code{scm_mb_index},
	384	except that they may consult and update *@var{cache} in order to avoid
	385	scanning the string from the beginning. @code{scm_mb_index_cached} is a
	386	macro, so it may have less overhead than
	387	@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
	388	than once.
	389
	390	Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
	391	can scan a string from left to right, or from right to left, in time
	392	proportional to the length of the string. As long as each character
	393	fetched is less than some constant distance before or after the previous
	394	character fetched with @var{cache}, each access will require constant
	395	time.
	396	@end deftypefn
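
The idea behind cached indexing can be sketched as follows. This is not Guile's implementation: @code{struct mb_cache} and @code{index_cached} are hypothetical stand-ins, the text is assumed to be valid UTF-8 rather than Guile's encoding, and the function returns a byte offset instead of a pointer. The scan starts at the cached position and walks forward or backward, so fetching neighboring characters costs constant time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scm_mb_cache: byte is the byte
   offset of the character'th character.  Initialize both to zero.  */
struct mb_cache { int character; int byte; };

/* Return the byte offset of the i'th character in the len bytes at p,
   updating *c.  If i is past the end of the text, return len.  */
static int
index_cached (const unsigned char *p, int len, int i, struct mb_cache *c)
{
  int ch = c->character, b = c->byte;

  assert (p != NULL && i >= 0 && b >= 0 && b <= len);
  while (ch < i && b < len)             /* walk forward */
    {
      b++;
      while (b < len && (p[b] & 0xC0) == 0x80)
        b++;                            /* skip trailing bytes */
      ch++;
    }
  while (ch > i)                        /* walk backward */
    {
      b--;
      while (b > 0 && (p[b] & 0xC0) == 0x80)
        b--;
      ch--;
    }
  c->character = ch;
  c->byte = b;
  return b;
}
```

A left-to-right loop that calls this function with @code{i}, @code{i + 1}, and so on advances the cache by one character per call, so the whole scan takes time proportional to the length of the text.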
	397
	398	Guile also provides functions to convert between an encoded sequence of
	399	characters, and an array of @code{scm_char_t} objects.
	400
	401	@deftypefn {Libguile Function} {scm_char_t *} scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
	402	Convert the variable-width text in the @var{len} bytes at @var{p}
	403	to an array of @code{scm_char_t} values. Return a pointer to the array,
	404	and set @code{*@var{result_len}} to the number of elements it contains.
	405	The returned array is allocated with @code{malloc}, and it is the
	406	caller's responsibility to free it.
	407
	408	If the text is not a sequence of valid character encodings, this
	409	function will signal a @code{text:bad-encoding} error.
	410	@end deftypefn
	411
	412	@deftypefn {Libguile Function} {unsigned char *} scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
	413	Convert the array of @code{scm_char_t} values to a sequence of
	414	variable-width character encodings. Return a pointer to the array of
	415	bytes, and set @code{*@var{result_len}} to its length, in bytes.
	416
	417	The returned byte sequence is terminated with a zero byte, which is not
	418	counted in the length returned in @code{*@var{result_len}}.
	419
	420	The returned byte sequence is allocated with @code{malloc}; it is the
	421	caller's responsibility to free it.
	422
	423	If any element of @var{fixed} is not a valid Guile character, this
	424	function will signal a @code{text:bad-encoding} error.
	425	@end deftypefn
	426
	427
	428	@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
	429	@subsection Exchanging Guile Text With the Outside World in C
	430
	431	[[This is kind of a heavy-weight model, given that one end of the
	432	conversion is always going to be the Guile encoding. Any way to shorten
	433	things a bit?]]
	434
	435	Guile provides functions for converting between Guile's internal text
	436	representation and encodings popular in the outside world. These
	437	functions are closely modeled after the @code{iconv} functions available
	438	on some systems.
	439
	440	To convert text between two encodings, you should first call
	441	@code{scm_mb_iconv_open} to indicate the source and destination
	442	encodings; this function returns a context object which records the
	443	conversion to perform.
	444
	445	Then, you should call @code{scm_mb_iconv} to actually convert the text.
	446	This function expects input and output buffers, and a pointer to the
	447	context you got from @code{scm_mb_iconv_open}. You don't need to pass
	448	all your input to @code{scm_mb_iconv} at once; you can invoke it on
	449	successive blocks of input (as you read it from a file, say), and it
	450	will convert as much as it can each time, indicating when you should
	451	grow your output buffer.
	452
	453	An encoding may be @dfn{stateless}, or @dfn{stateful}. In most
	454	encodings, a contiguous group of bytes from the sequence completely
	455	specifies a particular character; these are stateless encodings.
	456	However, some encodings require you to look back an unbounded number of
	457	bytes in the stream to assign a meaning to a particular byte sequence;
	458	such encodings are stateful.
	459
	460	For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
	461	byte sequence @samp{27 36 66} indicates that subsequent bytes should be
	462	taken in pairs and interpreted as characters from the JIS-0208 character
	463	set. An arbitrary number of byte pairs may follow this sequence. The
	464	byte sequence @samp{27 40 66} indicates that subsequent bytes should be
	465	interpreted as @sc{ASCII}. In this encoding, you cannot tell whether a
	466	given byte is an @sc{ASCII} character without looking back an arbitrary
	467	distance for the most recent escape sequence, so it is a stateful
	468	encoding.
	469
	470	In Guile, if a conversion involves a stateful encoding, the context
	471	object carries any necessary state. Thus, you can have many independent
	472	conversions to or from stateful encodings taking place simultaneously,
	473	as long as each data stream uses its own context object for the
	474	conversion.
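
The role of the context object can be illustrated with a toy decoder for the two escape sequences mentioned above. Everything here is hypothetical and deliberately simplified: @code{struct toy_context} and @code{toy_count} are not Guile interfaces, only the sequences @samp{27 36 66} and @samp{27 40 66} are recognized, and a chunk must not split an escape sequence or a byte pair.

```c
#include <assert.h>
#include <stddef.h>

enum toy_mode { TOY_ASCII, TOY_JIS0208 };

/* Hypothetical context: the only state this toy decoder needs.  */
struct toy_context { enum toy_mode mode; };

/* Count the characters in one chunk of text, carrying the shift state
   in *ctx from one call to the next.  Returns -1 on an unknown escape
   sequence or a truncated byte pair.  */
static int
toy_count (struct toy_context *ctx, const unsigned char *p, size_t len)
{
  int n = 0;
  size_t i = 0;

  assert (ctx != NULL && p != NULL);
  while (i < len)
    {
      if (p[i] == 27)                   /* escape sequence: no character */
        {
          if (i + 3 > len || p[i + 2] != 66)
            return -1;
          if (p[i + 1] == 36)
            ctx->mode = TOY_JIS0208;    /* 27 36 66 */
          else if (p[i + 1] == 40)
            ctx->mode = TOY_ASCII;      /* 27 40 66 */
          else
            return -1;
          i += 3;
        }
      else if (ctx->mode == TOY_ASCII)
        {
          i += 1;                       /* one byte per character */
          n++;
        }
      else
        {
          if (i + 2 > len)
            return -1;
          i += 2;                       /* one byte pair per character */
          n++;
        }
    }
  return n;
}
```

The same two bytes count as one character or as two, depending on the mode left behind by earlier input, which is why each data stream needs its own context.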
	475
	476	@deftp {Libguile Type} {struct scm_mb_iconv}
	477	This is the type for context objects, which represent the encodings and
	478	current state of an ongoing text conversion. A @code{struct
	479	scm_mb_iconv} records the source and destination encodings, and keeps
	480	track of any information needed to handle stateful encodings.
	481	@end deftp
	482
	483	@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
	484	Return a pointer to a new @code{struct scm_mb_iconv} context object,
	485	ready to convert from the encoding named @var{fromcode} to the encoding
	486	named @var{tocode}. For stateful encodings, the context object is in
	487	some appropriate initial state, ready for use with the
	488	@code{scm_mb_iconv} function.
	489
	490	When you are done using a context object, you may call
	491	@code{scm_mb_iconv_close} to free it.
	492
	493	If either @var{tocode} or @var{fromcode} is not the name of a known
	494	encoding, this function will signal the @code{text:unknown-conversion}
	495	error, described below.
	496
	497	@c Try to use names here from the IANA list:
	498	@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
	499	Guile supports at least these encodings:
	500	@table @samp
	501
	502	@item US-ASCII
	503	@sc{US-ASCII}, in the standard one-character-per-byte encoding.
	504
	505	@item ISO-8859-1
	506	The usual character set for Western European languages, in its usual
	507	one-character-per-byte encoding.
	508
	509	@item Guile-MB
	510	Guile's current internal multibyte encoding. The actual encoding this
	511	name refers to will change from one version of Guile to the next. You
	512	should use this when converting data between external sources and the
	513	encoding used by Guile objects.
	514
	515	You should @emph{not} use this as the encoding for data presented to the
	516	outside world, for two reasons. 1) Its meaning will change over time,
	517	so data written using the @samp{Guile-MB} encoding with one version of
	518	Guile might not be readable with the @samp{Guile-MB} encoding in another
	519	version of Guile. 2) It currently corresponds to @samp{Emacs-Mule},
	520	which was invented for Emacs's internal use, and was never intended to serve
	521	as an exchange medium.
	522
	523	@item Guile-Wide
	524	Guile's character set, as an array of @code{scm_char_t} values.
	525
	526	Note that this encoding is even less suitable for public use than
	527	@samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
	528	size and endianness the host system uses for @code{scm_char_t}. Using
	529	this encoding is very much like calling the
	530	@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
	531	functions, except that @code{scm_mb_iconv} gives you more control over
	532	buffer allocation and management.
	533
	534	@item Emacs-Mule
	535	This is the variable-length encoding used for multilingual text by GNU
	536	Emacs, at least through version 20.4. You probably should not use this
	537	encoding, as it is designed only for Emacs's internal use. However, we
	538	provide it here because it's trivial to support, and some people
	539	probably do have @samp{emacs-mule}-format files lying around.
	540
	541	@end table
	542
	543	(At the moment, this list doesn't include any character sets suitable for
	544	external use that can actually handle multilingual data; this is
	545	unfortunate, as it encourages users to write data in Emacs-Mule format,
	546	which nobody but Emacs and Guile understands. We hope to add support
	547	for Unicode in UTF-8 soon, which should solve this problem.)
	548
	549	Case is not significant in encoding names.
	550
	551	You can define your own conversions; see @ref{Implementing Your Own Text
	552	Conversions}.
	553	@end deftypefn
	554
	555	@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
	556	Return a non-zero value if Guile supports the encoding named @var{encoding}, and zero otherwise.
	557	@end deftypefn
	558
	559	@deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
	560	Convert a sequence of characters from one encoding to another. The
	561	argument @var{context} specifies the encodings to use for the input and
	562	output, and carries state for stateful encodings; use
	563	@code{scm_mb_iconv_open} to create a @var{context} object for a
	564	particular conversion.
	565
	566	Upon entry to the function, @code{*@var{inbuf}} should point to the
	567	input buffer, and @code{*@var{inbytesleft}} should hold the number of
	568	input bytes present in the buffer; @code{*@var{outbuf}} should point to
	569	the output buffer, and @code{*@var{outbytesleft}} should hold the number
	570	of bytes available to hold the conversion results in that buffer.
	571
	572	Upon exit from the function, @code{*@var{inbuf}} points to the first
	573	unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
	574	of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
	575	the last output byte, and @code{*@var{outbytesleft}} holds the number of
	576	bytes left unused in the output buffer.
	577
	578	For stateful encodings, @var{context} carries encoding state from one
	579	call to @code{scm_mb_iconv} to the next. Thus, successive calls to
	580	@code{scm_mb_iconv} which use the same context object can convert a
	581	stream of data one chunk at a time.
	582
	583	If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
	584	taken as a request to reset the states of the input and the output
	585	encodings. If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
	586	non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
	587	buffer to put the output encoding in its initial state. If the output
	588	buffer is not large enough to hold this byte sequence,
	589	@code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
	590	the shift states of @var{context}'s input and output encodings
	591	unchanged.
	592
	593	The @code{scm_mb_iconv} function always consumes only complete
	594	characters or shift sequences from the input buffer, and the output
	595	buffer always contains a sequence of complete characters or escape
	596	sequences.
	597
	598	If the input sequence contains characters which are not expressible in
	599	the output encoding, @code{scm_mb_iconv} converts it in an
	600	implementation-defined way. It may simply delete the character.
	601
	602	Some encodings use byte sequences which do not correspond to any textual
	603	character. For example, the escape sequence of a stateful encoding has
	604	no textual meaning. When converting from such an encoding, a call to
	605	@code{scm_mb_iconv} might consume input but produce no output, since the
	606	input sequence might contain only escape sequences.
	607
	608	Normally, @code{scm_mb_iconv} returns the number of input characters it
	609	could not convert perfectly to the output encoding. However, it may
	610	return one of the @code{scm_mb_iconv_} codes described below, to
	611	indicate an error. All of these codes are negative values.
	612
	613	If the input sequence contains an invalid character encoding, conversion
	614	stops before the invalid input character, and @code{scm_mb_iconv}
	615	returns the constant value @code{scm_mb_iconv_bad_encoding}.
	616
	617	If the input sequence ends with an incomplete character encoding,
	618	@code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
	619	return the constant value @code{scm_mb_iconv_incomplete_encoding}. This
	620	is not necessarily an error, if you expect to call @code{scm_mb_iconv}
	621	again with more data which might contain the rest of the encoding
	622	fragment.
	623
	624	If the output buffer does not contain enough room to hold the converted
	625	form of the complete input text, @code{scm_mb_iconv} converts as much as
	626	it can, changes the input and output pointers to reflect the amount of
	627	text successfully converted, and then returns
	628	@code{scm_mb_iconv_too_big}.
	629	@end deftypefn
	630
	631	Here are the status codes that might be returned by @code{scm_mb_iconv}.
	632	They are all negative integers.
	633	@table @code
	634
	635	@item scm_mb_iconv_too_big
	636	The conversion needs more room in the output buffer. Some characters
	637	may have been consumed from the input buffer, and some characters may
	638	have been placed in the available space in the output buffer.
	639
	640	@item scm_mb_iconv_bad_encoding
	641	@code{scm_mb_iconv} encountered an invalid character encoding in the
	642	input buffer. Conversion stopped before the invalid character, so there
	643	may be some characters consumed from the input buffer, and some
	644	converted text in the output buffer.
	645
	646	@item scm_mb_iconv_incomplete_encoding
	647	The input buffer ends with an incomplete character encoding. The
	648	incomplete encoding is left in the input buffer, unconsumed. This is
	649	not necessarily an error, if you expect to call @code{scm_mb_iconv}
	650	again with more data which might contain the rest of the incomplete
	651	encoding.
	652
	653	@end table
	654
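As a rough illustration of how a caller might act on these codes, the
following C sketch maps a return value of @code{scm_mb_iconv} to a next
step. The numeric values and every @code{example_} name are assumptions
made for this sketch; the text above promises only that the real codes
are negative.

```c
#include <assert.h>

/* Hypothetical stand-in values.  The manual promises only that the real
   scm_mb_iconv_ codes are negative, not what their values are.  */
enum example_status
{
  EXAMPLE_TOO_BIG = -1,             /* stands in for scm_mb_iconv_too_big */
  EXAMPLE_BAD_ENCODING = -2,        /* scm_mb_iconv_bad_encoding */
  EXAMPLE_INCOMPLETE_ENCODING = -3  /* scm_mb_iconv_incomplete_encoding */
};

/* What a caller's conversion loop might do next (hypothetical).  */
enum example_action
{
  EXAMPLE_DONE,          /* non-negative result: the call succeeded */
  EXAMPLE_DRAIN_OUTPUT,  /* empty or enlarge the output buffer, then retry */
  EXAMPLE_READ_MORE,     /* fetch more input after the unconsumed fragment */
  EXAMPLE_GIVE_UP        /* invalid or truncated input */
};

static enum example_action
example_next_action (long status, int more_input_expected)
{
  /* A non-negative value counts imperfectly converted characters.  */
  if (status >= 0)
    return EXAMPLE_DONE;

  switch (status)
    {
    case EXAMPLE_TOO_BIG:
      return EXAMPLE_DRAIN_OUTPUT;
    case EXAMPLE_INCOMPLETE_ENCODING:
      /* Only an error if no further data can complete the fragment.  */
      return more_input_expected ? EXAMPLE_READ_MORE : EXAMPLE_GIVE_UP;
    case EXAMPLE_BAD_ENCODING:
    default:
      return EXAMPLE_GIVE_UP;
    }
}
```

In this sketch an incomplete encoding is fatal only when the caller
knows that no more input is coming, which mirrors the description of
@code{scm_mb_iconv_incomplete_encoding} above.
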
	655
	656	Finally, Guile provides a function for destroying conversion contexts.
	657
	658	@deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
	659	Deallocate the conversion context object @var{context}, and all other
	660	resources allocated by the call to @code{scm_mb_iconv_open} which
	661	returned @var{context}.
	662	@end deftypefn
	663
	664
	665	@node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
	666	@subsection Implementing Your Own Text Conversions
	667
	668	[[note that conversions to and from Guile must produce streams
	669	containing only valid character encodings, or else Guile will crash]]
	670
	671	This section describes the interface for adding your own encoding
	672	conversions for use with @code{scm_mb_iconv}. The interface here is
	673	borrowed from the GNOME Project's @file{libunicode} library.
	674
	675	Guile's @code{scm_mb_iconv} function works by converting the input text
	676	to a stream of @code{scm_char_t} characters, and then converting
	677	those characters to the desired output encoding. This makes it easy
	678	for Guile to choose the appropriate conversion back ends for an
	679	arbitrary pair of input and output encodings, but it also means that the
	680	accuracy and quality of the conversions depend on the fidelity of
	681	Guile's internal character set to the source and destination encodings.
	682	Since @code{scm_mb_iconv} will be used almost exclusively for converting
	683	to and from Guile's internal character set, this shouldn't be a problem.
	684
	685	To add support for a particular encoding to Guile, you must provide one
	686	function (called the @dfn{read} function) which converts from your
	687	encoding to an array of @code{scm_char_t} values, and another function
	688	(called the @dfn{write} function) to convert from an array of
	689	@code{scm_char_t} values back into your encoding. To convert from some
	690	encoding @var{a} to some other encoding @var{b}, Guile pairs up
	691	@var{a}'s read function with @var{b}'s write function. Each call to
	692	@code{scm_mb_iconv} passes text in encoding @var{a} through the read
	693	function, to produce an array of @code{scm_char_t} values, and then passes
	694	that array to the write function, to produce text in encoding @var{b}.
	695
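The pairing of one encoding's read function with another's write
function can be sketched in C as follows. This is a simplified
illustration, not Guile's implementation: the function signatures are
reduced to whole-buffer, stateless forms, the two toy encodings and all
@code{example_} names are invented here, and @code{scm_char_t} is a
local stand-in typedef.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long scm_char_t;  /* assumption: stand-in for Guile's type */

/* Simplified back ends: whole buffer at once, no state, no errors.  */
typedef size_t (*example_read_fn) (const unsigned char *in, size_t inlen,
                                   scm_char_t *out);
typedef size_t (*example_write_fn) (const scm_char_t *in, size_t inlen,
                                    unsigned char *out);

/* Toy encoding A: one byte per character.  */
static size_t
example_a_read (const unsigned char *in, size_t inlen, scm_char_t *out)
{
  size_t i;
  for (i = 0; i < inlen; i++)
    out[i] = in[i];
  return inlen;                 /* characters produced */
}

/* Toy encoding B: two bytes per character, high byte first.  */
static size_t
example_b_write (const scm_char_t *in, size_t inlen, unsigned char *out)
{
  size_t i;
  for (i = 0; i < inlen; i++)
    {
      out[2 * i] = (unsigned char) ((in[i] >> 8) & 0xFF);
      out[2 * i + 1] = (unsigned char) (in[i] & 0xFF);
    }
  return 2 * inlen;             /* bytes produced */
}

#define EXAMPLE_PIVOT_MAX 64

/* Convert by reading into a pivot array of characters, then writing it.  */
static size_t
example_convert (example_read_fn read_a, example_write_fn write_b,
                 const unsigned char *in, size_t inlen, unsigned char *out)
{
  scm_char_t pivot[EXAMPLE_PIVOT_MAX];
  size_t nchars;

  /* Sketch-only limit; assumes at most one character per input byte.  */
  assert (inlen <= EXAMPLE_PIVOT_MAX);
  nchars = read_a (in, inlen, pivot);
  return write_b (pivot, nchars, out);
}
```

Because every conversion passes through the pivot array, supporting a
new encoding takes one read and one write function rather than one
converter per pair of encodings.
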
	696	For stateful encodings, a read or write function can hang its own data
	697	structures off the conversion object, and provide its own functions to
	698	allocate and destroy them; this allows read and write functions to
	699	maintain whatever state they like.
	700
	701	The Guile conversion back end represents each available encoding with a
	702	@code{struct scm_mb_encoding} object.
	703
	704	@deftp {Libguile Type} {struct scm_mb_encoding}
	705	This data structure describes an encoding. It has the following
	706	members:
	707
	708	@table @code
	709
	710	@item char **names
	711	An array of strings, giving the various names for this encoding. The
	712	array should be terminated by a zero pointer. Case is not significant
	713	in encoding names.
	714
	715	The @code{scm_mb_iconv_open} function searches the list of registered
	716	encodings for an encoding whose @code{names} array matches its
	717	@var{tocode} or @var{fromcode} argument.
	718
	719	@item int (*init) (void **@var{cookie})
	720	An initialization function for the encoding's private data.
	721	@code{scm_mb_iconv_open} will call this function, passing it the address
	722	of the cookie for this encoding in this context. (We explain cookies
	723	below.) There is no way for the @code{init} function to tell whether
	724	the encoding will be used for reading or writing.
	725
	726	Note that @code{init} receives a @emph{pointer} to the cookie, not the
	727	cookie itself. Because the type of @var{cookie} is @code{void **}, the
	728	C compiler will not check it as carefully as it would other types.
	729
	730	The @code{init} member may be zero, indicating that no initialization is
	731	necessary for this encoding.
	732
	733	@item int (*destroy) (void **@var{cookie})
	734	A deallocation function for the encoding's private data.
	735	@code{scm_mb_iconv_close} calls this function, passing it the address of
	736	the cookie for this encoding in this context. The @code{destroy}
	737	function should free any data the @code{init} function allocated.
	738
	739	Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
	740	cookie itself. Because the type of @var{cookie} is @code{void **}, the
	741	C compiler will not check it as carefully as it would other types.
	742
	743	The @code{destroy} member may be zero, indicating that this encoding
	744	doesn't need to perform any special action to destroy its local data.
	745
	746	@item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
	747	Put the encoding into its initial shift state. Guile calls this
	748	function whether the encoding is being used for input or output, so this
	749	should take appropriate steps for both directions. If @var{outbuf} and
	750	@var{outbytesleft} are valid, the reset function should emit an escape
	751	sequence to reset the output stream to its initial state; @var{outbuf}
	752	and @var{outbytesleft} should be handled just as for
	753	@code{scm_mb_iconv}.
	754
	755	This function can return an @code{scm_mb_iconv_} error code
	756	(@pxref{Exchanging Guile Text With the Outside World in C}). If it
	757	returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
	758	state must be left unchanged.
	759
	760	Note that @code{reset} receives the cookie's value itself, not a pointer
	761	to the cookie, as the @code{init} and @code{destroy} functions do.
	762
	763	The @code{reset} member may be zero, indicating that this encoding
	764	doesn't use a shift state.
	765
	766	@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
	767	Read some bytes and convert into an array of Guile characters. This is
	768	the encoding's read function.
	769
	770	On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to
	771	be converted, and *@var{outcharsleft} characters available at
	772	*@var{outbuf} to hold the results.
	773
	774	On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes
	775	still not consumed. *@var{outcharsleft} and *@var{outbuf} indicate the
	776	output buffer space still not filled. (By exclusion, these indicate
	777	which input bytes were consumed, and which output characters were
	778	produced.)
	779
	780	Return one of the @code{enum scm_mb_read_result} values, described below.
	781
	782	Note that @code{read} receives the cookie's value itself, not a pointer
	783	to the cookie, as the @code{init} and @code{destroy} functions do.
	784
	785	@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
	786	Convert an array of Guile characters to output bytes. This is
	787	the encoding's write function.
	788
	789	On entry, there are *@var{incharsleft} Guile characters available at
	790	*@var{inbuf}, and *@var{outbytesleft} bytes available to store output at
	791	*@var{outbuf}.
	792
	793	On exit, *@var{incharsleft} and *@var{inbuf} indicate the
	794	Guile characters left unconverted (because there was insufficient room
	795	in the output buffer to hold their converted forms), and
	796	*@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the
	797	output buffer.
	798
	799	Return one of the @code{scm_mb_write_result} values, described below.
	800
	801	Note that @code{write} receives the cookie's value itself, not a pointer
	802	to the cookie, as the @code{init} and @code{destroy} functions do.
	803
	804	@item struct scm_mb_encoding *next
	805	This is used by Guile to maintain a linked list of encodings. It is
	806	filled in when you call @code{scm_mb_register_encoding} to add your
	807	encoding to the list.
	808
	809	@end table
	810	@end deftp
	811
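Read as a C declaration, the members described above might look like the
following sketch. It is reconstructed from the prose, not copied from
Guile's headers: the @code{scm_char_t} typedef, the order of the
enumeration values, and the @code{example_} names are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long scm_char_t;  /* assumption: stand-in typedef */

/* Local stand-ins; the order of the values is an assumption.  */
enum scm_mb_read_result
{ scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result
{ scm_mb_write_ok, scm_mb_write_too_big };

/* Sketch of the structure, following the member descriptions.  */
struct scm_mb_encoding
{
  char **names;                       /* zero-terminated list of names */
  int (*init) (void **cookie);        /* receives the cookie's address */
  int (*destroy) (void **cookie);     /* receives the cookie's address */
  int (*reset) (void *cookie, char **outbuf, size_t *outbytesleft);
  enum scm_mb_read_result (*read) (void *cookie,
                                   const char **inbuf, size_t *inbytesleft,
                                   scm_char_t **outbuf, size_t *outcharsleft);
  enum scm_mb_write_result (*write) (void *cookie,
                                     scm_char_t **inbuf, size_t *incharsleft,
                                     char **outbuf, size_t *outbytesleft);
  struct scm_mb_encoding *next;       /* filled in on registration */
};

/* A hypothetical stateless encoding: init, destroy and reset are zero.
   The read and write members are left zero here only to keep the sketch
   short; a usable encoding must supply both.  */
static char example_name_1[] = "X-EXAMPLE";
static char example_name_2[] = "x-example-alias";
static char *example_names[] = { example_name_1, example_name_2, 0 };

static struct scm_mb_encoding example_encoding =
  { example_names, 0, 0, 0, 0, 0, 0 };

/* Count the names of an encoding by walking to the zero terminator.  */
static size_t
example_count_names (const struct scm_mb_encoding *e)
{
  size_t n = 0;
  while (e->names[n])
    n++;
  return n;
}
```
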
	812	Here is the enumerated type for the values an encoding's read function
	813	can return:
	814
	815	@deftp {Libguile Type} {enum scm_mb_read_result}
	816	This type represents the result of a call to an encoding's read
	817	function. It has the following values:
	818
	819	@table @code
	820
	821	@item scm_mb_read_ok
	822	The read function consumed at least one byte of input.
	823
	824	@item scm_mb_read_incomplete
	825	The data present in the input buffer does not contain a complete
	826	character encoding. No input was consumed, and no characters were
	827	produced as output. This is not necessarily an error status, if there
	828	is more data to pass through.
	829
	830	@item scm_mb_read_error
	831	The input contains an invalid character encoding.
	832
	833	@end table
	834	@end deftp
	835
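A read function for a made-up encoding might be sketched as below. The
toy encoding stores each character number in two bytes, high byte first,
and declares the value 0xFFFF invalid; both rules are invented for
illustration and say nothing about Guile's real character numbering.
The local type definitions are stand-ins, and the sketch assumes the
caller always supplies room for at least one output character, a case
the description above leaves open.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long scm_char_t;  /* assumption: stand-in typedef */

/* Local stand-in; the order of the values is an assumption.  */
enum scm_mb_read_result
{ scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };

/* Hypothetical read function for a toy two-byte encoding.  */
static enum scm_mb_read_result
example_read (void *cookie, const char **inbuf, size_t *inbytesleft,
              scm_char_t **outbuf, size_t *outcharsleft)
{
  size_t consumed = 0;

  (void) cookie;                /* stateless encoding: cookie unused */
  assert (*outcharsleft > 0);   /* sketch-only precondition */

  while (*inbytesleft >= 2 && *outcharsleft > 0)
    {
      const unsigned char *p = (const unsigned char *) *inbuf;
      scm_char_t c = ((scm_char_t) p[0] << 8) | p[1];

      if (c == 0xFFFF)          /* toy rule: this value is invalid */
        return scm_mb_read_error;

      **outbuf = c;             /* store one character */
      ++*outbuf;
      --*outcharsleft;
      *inbuf += 2;              /* consume its two bytes */
      *inbytesleft -= 2;
      consumed += 2;
    }

  /* A lone trailing byte stays in the input buffer, unconsumed.  */
  return consumed > 0 ? scm_mb_read_ok : scm_mb_read_incomplete;
}
```

On an invalid value the function stops with the pointers left just
before the offending bytes, so characters converted earlier in the same
call remain in the output buffer.
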
	836	Here is the enumerated type for the values an encoding's write function
	837	can return:
	838
	839	@deftp {Libguile Type} {enum scm_mb_write_result}
	840	This type represents the result of a call to an encoding's write
	841	function. It has the following values:
	842
	843	@table @code
	844
	845	@item scm_mb_write_ok
	846	The write function was able to convert all the characters in @var{inbuf}
	847	successfully.
	848
	849	@item scm_mb_write_too_big
	850	The write function filled the output buffer, but there are still
	851	characters in @var{inbuf} left unconsumed; @var{inbuf} and
	852	@var{incharsleft} indicate the unconsumed portion of the input buffer.
	853
	854	@end table
	855	@end deftp
	856
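The matching write function for the same kind of toy encoding (two bytes
per character, high byte first) might look like this sketch. The
encoding, the substitution of @samp{?} for characters that do not fit in
two bytes, and the @code{example_} names are assumptions made for
illustration; the type definitions are local stand-ins.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long scm_char_t;  /* assumption: stand-in typedef */

/* Local stand-in; the order of the values is an assumption.  */
enum scm_mb_write_result
{ scm_mb_write_ok, scm_mb_write_too_big };

/* Hypothetical write function for a toy two-byte encoding.  */
static enum scm_mb_write_result
example_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
               char **outbuf, size_t *outbytesleft)
{
  (void) cookie;                /* stateless encoding: cookie unused */

  while (*incharsleft > 0)
    {
      scm_char_t c = **inbuf;
      unsigned char *q = (unsigned char *) *outbuf;

      /* Never emit half a character: stop while the output is whole.  */
      if (*outbytesleft < 2)
        return scm_mb_write_too_big;

      if (c > 0xFFFF)           /* not expressible: substitute (assumed) */
        c = '?';

      q[0] = (unsigned char) ((c >> 8) & 0xFF);
      q[1] = (unsigned char) (c & 0xFF);

      *outbuf += 2;             /* advance past the bytes written */
      *outbytesleft -= 2;
      ++*inbuf;                 /* consume one character */
      --*incharsleft;
    }
  return scm_mb_write_ok;
}
```
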
	857
	858	Conversions to or from stateful encodings need to keep track of each
	859	encoding's current state. Each conversion context contains two
	860	@code{void *} variables called @dfn{cookies}, one for the input
	861	encoding, and one for the output encoding. These cookies are passed to
	862	the encodings' functions, for them to use however they please. A
	863	stateful encoding can use its cookie to hold a pointer to some object
	864	which maintains the context's current shift state. Stateless encodings
	865	will probably not use their cookies.
	866
	867	The cookies' lifetime is the same as that of the context object. When
	868	the user calls @code{scm_mb_iconv_close} to destroy a context object,
	869	@code{scm_mb_iconv_close} calls the input and output encodings'
	870	@code{destroy} functions, passing them their respective cookies, so each
	871	encoding can free any data it allocated for that context.
	872
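The cookie's life cycle for a stateful encoding can be sketched in C as
follows. Everything here is hypothetical: the state structure, the
one-byte escape sequence, the stand-in value for
@code{scm_mb_iconv_too_big}, and the convention that @code{init} and
@code{destroy} return zero on success, which the description above does
not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define EXAMPLE_TOO_BIG (-1)    /* stand-in for scm_mb_iconv_too_big */

/* Hypothetical per-context state kept behind the cookie.  */
struct example_state
{
  int shifted;                  /* nonzero when not in the initial state */
};

/* init receives the ADDRESS of the cookie and stores the state in it.  */
static int
example_init (void **cookie)
{
  struct example_state *s = malloc (sizeof *s);
  if (s == NULL)
    return -1;                  /* failure convention is an assumption */
  s->shifted = 0;
  *cookie = s;
  return 0;
}

/* destroy also receives the cookie's address and frees what init made.  */
static int
example_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* reset receives the cookie's VALUE.  If an output buffer is supplied
   and the encoding is shifted, emit a toy one-byte escape sequence.  */
static int
example_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct example_state *s = cookie;

  if (s->shifted && outbuf && *outbuf && outbytesleft)
    {
      if (*outbytesleft < 1)
        return EXAMPLE_TOO_BIG; /* shift state deliberately unchanged */
      *(*outbuf)++ = 0x0F;      /* hypothetical "shift in" byte */
      --*outbytesleft;
    }
  s->shifted = 0;
  return 0;
}
```
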
	873	Note that, if a read or write function returns a successful result code
	874	like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
	875	input and the output must together represent the complete
	876	input text; the encoding may not store any text temporarily in its
	877	cookie. This is because, if @code{scm_mb_iconv} returns a successful
	878	result to the user, it is correct for the user to assume that all the
	879	consumed input has been converted and placed in the output buffer.
	880	There is no ``flush'' operation to push any final results out of the
	881	encodings' buffers.
	882
	883	Here is the function you call to register a new encoding with the
	884	conversion system:
	885
	886	@deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
	887	Add the encoding described by @code{*@var{encoding}} to the set
	888	understood by @code{scm_mb_iconv_open}. Once you have registered your
	889	encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
	890	the names in @code{@var{encoding}->names}.
	891	@end deftypefn
	892
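One way registration and lookup could work is sketched below: a singly
linked list threaded through the @code{next} member, searched with a
case-insensitive comparison of names. This illustrates the description
above and is not Guile's code; the structure is reduced to the two
members the sketch needs, and all @code{example_} names are invented.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reduced stand-in for struct scm_mb_encoding.  */
struct example_encoding
{
  char **names;                     /* zero-terminated list of names */
  struct example_encoding *next;    /* filled in on registration */
};

static struct example_encoding *example_registry = NULL;

/* Push the encoding onto the front of the list.  */
static void
example_register (struct example_encoding *e)
{
  e->next = example_registry;
  example_registry = e;
}

/* Case is not significant in encoding names.  */
static int
example_name_equal (const char *a, const char *b)
{
  while (*a && *b)
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      ++a;
      ++b;
    }
  return *a == *b;                  /* both must end together */
}

/* Return the first registered encoding with a matching name, or NULL.  */
static struct example_encoding *
example_lookup (const char *name)
{
  struct example_encoding *e;
  char **n;

  for (e = example_registry; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; ++n)
      if (example_name_equal (*n, name))
        return e;
  return NULL;
}
```
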
	893
	894	@node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
	895	@section Multibyte Text Processing Errors
	896
	897	This section describes error conditions which code can signal to
	898	indicate problems encountered while processing multibyte text. In each
	899	case, the arguments @var{message} and @var{args} are an error format
	900	string and arguments to be substituted into the string, as accepted by
	901	the @code{display-error} function.
	902
	903	@deffn Condition text:not-char-boundary func message args object offset
	904	By calling @var{func}, the program attempted to access a character at
	905	byte offset @var{offset} in the Guile object @var{object}, but
	906	@var{offset} is not the start of a character's encoding in @var{object}.
	907
	908	Typically, @var{object} is a string or symbol. If the function signalling
	909	the error cannot find the Guile object that contains the text it is
	910	inspecting, it should use @code{#f} for @var{object}.
	911	@end deffn
	912
	913	@deffn Condition text:bad-encoding func message args object
	914	By calling @var{func}, the program attempted to interpret the text in
	915	@var{object}, but @var{object} contains a byte sequence which is not a
	916	valid encoding for any character.
	917	@end deffn
	918
	919	@deffn Condition text:not-guile-char func message args number
	920	By calling @var{func}, the program attempted to treat @var{number} as the
	921	number of a character in the Guile character set, but @var{number} does
	922	not correspond to any character in the Guile character set.
	923	@end deffn
	924
	925	@deffn Condition text:unknown-conversion func message args from to
	926	By calling @var{func}, the program attempted to convert from an encoding
	927	named @var{from} to an encoding named @var{to}, but Guile does not
	928	support such a conversion.
	929	@end deffn
	930
	931	@deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
	932	@deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
	933	@deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
	934	These variables hold the Scheme symbol objects whose names are the
	935	condition symbols above. You can use these when signalling these
	936	errors, instead of looking them up yourself.
	937	@end deftypevr
	938
	939
	940	@node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
	941	@section Why Guile Does Not Use a Fixed-Width Encoding
	942
	943	Multibyte encodings are clumsier to work with than encodings which use a
	944	fixed number of bytes for every character. For example, using a
	945	fixed-width encoding, we can extract the @var{i}th character of a string
	946	in constant time, and we can always substitute the @var{i}th character
	947	of a string with any other character without reallocating or copying the
	948	string.
	949
	950	However, there are no fixed-width encodings which include the characters
	951	we wish to include, and also fit in a reasonable amount of space.
	952	Despite the Unicode standard's claims to the contrary, Unicode is not
	953	really a fixed-width encoding. Unicode uses surrogate pairs to
	954	represent characters outside the 16-bit range; a surrogate pair must be
	955	treated as a single character, but occupies two 16-bit spaces. As of
	956	this writing, there are already plans to assign characters to the
	957	surrogate character codes. Three- and four-byte encodings are
	958	too wasteful for a majority of Guile's users, who only need @sc{ASCII}
	959	and a few accented characters.
	960
	961	Another alternative would be to have several different fixed-width
	962	string representations, each with a different element size. For each
	963	string, Guile would use the smallest element size capable of
	964	accommodating the string's text. This would allow users of English and
	965	the Western European languages to use the traditional memory-efficient
	966	encodings. However, if Guile has @var{n} string representations, then
	967	users must write @var{n} versions of any code which manipulates text
	968	directly --- one for each element size. And if a user wants to operate
	969	on two strings simultaneously, and wants to avoid testing the string
	970	sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
	971	Most users will simply not bother. Instead, they will write code which
	972	supports only one string size, leaving us back where we started. By
	973	using a single internal representation, Guile makes it easier for users
	974	to write multilingual code.
	975
	976	[[What about tagging each string with its encoding?
	977	"Every extension must be written to deal with every encoding"]]
	978
	979	[[You don't really want to index strings anyway.]]
	980
	981	Finally, Guile's multibyte encoding is not so bad. Unlike a two- or
	982	four-byte encoding, it is efficient in space for American and European
	983	users. Furthermore, the properties described above mean that many
	984	functions can be coded just as they would for a single-byte encoding;
	985	see @ref{Promised Properties of the Guile Multibyte Encoding}.
	986
	987	@bye