@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998-1999, 2001-2013 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@node Non-ASCII Characters
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to characters and
how they are stored in strings and buffers.

@menu
* Text Representations::        How Emacs represents text.
* Disabling Multibyte::         Controlling whether to use multibyte characters.
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::             How unibyte and multibyte relate to
                                  codes of individual characters.
* Character Properties::        Character attributes that define their
                                  behavior and handling.
* Character Sets::              The space of possible character codes
                                  is divided into various character sets.
* Scanning Charsets::           Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::              Coding systems are conversions for saving files.
* Input Methods::               Input methods allow users to enter various
                                  non-ASCII characters without special keyboards.
* Locales::                     Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representation

  Emacs buffers and strings support a large repertoire of characters
from many different scripts, allowing users to type and display text
in almost any known written language.

@cindex character codepoint
@cindex codespace
@cindex Unicode
  To support this multitude of characters and scripts, Emacs closely
follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
unique number, called a @dfn{codepoint}, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
inclusive. Emacs extends this range with codepoints in the range
@code{#x110000..#x3FFFFF}, which it uses for representing characters
that are not unified with Unicode and @dfn{raw 8-bit bytes} that
cannot be interpreted as characters. Thus, a character codepoint in
Emacs is a 22-bit integer number.

@cindex internal representation of characters
@cindex characters, representation in buffers and strings
@cindex multibyte text
  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint@footnote{
This internal representation is based on one of the encodings defined
by the Unicode Standard, called @dfn{UTF-8}, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
codepoints it uses for raw 8-bit bytes and characters not unified with
Unicode.}. For example, any @acronym{ASCII} character takes up only 1
byte, a Latin-1 character takes up 2 bytes, etc. We call this
representation of text @dfn{multibyte}.

  Outside Emacs, characters can be represented in many different
encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
between these external encodings and its internal representation, as
appropriate, when it reads text into a buffer or a string, or when it
writes text to a disk file or passes it to some other process.

  Occasionally, Emacs needs to hold and manipulate encoded text or
binary non-text data in its buffers or strings. For example, when
Emacs visits a file, it first reads the file's text verbatim into a
buffer, and only then converts it to the internal representation.
Before the conversion, the buffer holds encoded text.

@cindex unibyte text
  Encoded text is not really text, as far as Emacs is concerned, but
rather a sequence of raw 8-bit bytes. We call buffers and strings
that hold encoded text @dfn{unibyte} buffers and strings, because
Emacs treats them as a sequence of individual bytes. Usually, Emacs
displays unibyte buffers and strings as octal codes such as
@code{\237}. We recommend that you never use unibyte buffers and
strings except for manipulating encoded text or binary non-text data.

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte encoded text or binary non-text data.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defun position-bytes position
Buffer positions are measured in character units. This function
returns the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
Return the buffer position, in character units, corresponding to the
given @var{byte-position} in the current buffer. If @var{byte-position}
is out of range, the value is @code{nil}. In a multibyte buffer, an
arbitrary value of @var{byte-position} might fall not on a character
boundary, but inside a multibyte sequence representing a single
character; in this case, this function returns the buffer position of
the character whose multibyte sequence includes @var{byte-position}.
In other words, the value does not change for all byte positions that
belong to the same character.
@end defun
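
  As an illustration of how character positions and byte positions
diverge, here is a minimal sketch. It assumes a temporary buffer that
is made multibyte explicitly and holds the two-byte character U+00E9
followed by the @acronym{ASCII} letter @samp{x}; the buffer contents
are chosen only for this example.

@example
@group
(with-temp-buffer
  (set-buffer-multibyte t)
  (insert "\u00e9x")
  (list (position-bytes 1)     ; start of the buffer
        (position-bytes 2)     ; just past the two-byte character
        (byte-to-position 2)   ; byte 2 lies inside the first character
        (byte-to-position 3)))
     @result{} (1 3 1 2)
@end group
@end example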

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string, @code{nil}
otherwise. This function also returns @code{nil} if @var{string} is
some object other than a string.
@end defun

@defun string-bytes string
@cindex string, number of bytes
This function returns the number of bytes in @var{string}.
If @var{string} is a multibyte string, this can be greater than
@code{(length @var{string})}.
@end defun

@defun unibyte-string &rest bytes
This function concatenates all its argument @var{bytes} and makes the
result a unibyte string.
@end defun
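
  The following sketch shows these three functions on small strings.
The strings are arbitrary examples; the second one is written with a
@samp{\u00e9} escape so that the Lisp reader makes it multibyte.

@example
@group
(multibyte-string-p "abc")
     @result{} nil
@end group
@group
(let ((s "caf\u00e9"))
  (list (multibyte-string-p s) (length s) (string-bytes s)))
     @result{} (t 4 5)
@end group
@group
(unibyte-string 65 66 67)
     @result{} "ABC"
@end group
@end example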

@node Disabling Multibyte
@section Disabling Multibyte Characters
@cindex disabling multibyte

  By default, Emacs starts in multibyte mode: it stores the contents
of buffers and strings using an internal encoding that represents
non-@acronym{ASCII} characters using multi-byte sequences. Multibyte
mode allows you to use all the supported languages and scripts without
limitations.

@cindex turn multibyte support on or off
  Under very special circumstances, you may want to disable multibyte
character support for a specific buffer.
When multibyte characters are disabled in a buffer, we call
that @dfn{unibyte mode}. In unibyte mode, each character in the
buffer has a character code ranging from 0 through 255 (0377 octal); 0
through 127 (0177 octal) represent @acronym{ASCII} characters, and 128
(0200 octal) through 255 (0377 octal) represent non-@acronym{ASCII}
characters.

  To edit a particular file in unibyte representation, visit it using
@code{find-file-literally}. @xref{Visiting Functions}. You can
convert a multibyte buffer to unibyte by saving it to a file, killing
the buffer, and visiting the file again with
@code{find-file-literally}. Alternatively, you can use @kbd{C-x
@key{RET} c} (@code{universal-coding-system-argument}) and specify
@samp{raw-text} as the coding system with which to visit or save a
file. @xref{Text Coding, , Specifying a Coding System for File Text,
emacs, GNU Emacs Manual}. Unlike @code{find-file-literally}, finding
a file as @samp{raw-text} doesn't disable format conversion,
uncompression, or auto mode selection.

@c See http://debbugs.gnu.org/11226 for lack of unibyte tooltip.
@vindex enable-multibyte-characters
The buffer-local variable @code{enable-multibyte-characters} is
non-@code{nil} in multibyte buffers, and @code{nil} in unibyte ones.
The mode line also indicates whether a buffer is multibyte or not.
With a graphical display, in a multibyte buffer, the portion of the
mode line that indicates the character set has a tooltip that (amongst
other things) says that the buffer is multibyte. In a unibyte buffer,
the character set indicator is absent. Thus, in a unibyte buffer
(when using a graphical display) there is normally nothing before the
indication of the visited file's end-of-line convention (colon,
backslash, etc.), unless you are using an input method.

@findex toggle-enable-multibyte-characters
You can turn off multibyte support in a specific buffer by invoking the
command @code{toggle-enable-multibyte-characters} in that buffer.

@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, provided that the multibyte text contains
only @acronym{ASCII} and 8-bit raw bytes. In general, these
conversions happen when inserting text into a buffer, or when putting
text from several strings together in one string. You can also
explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text from
which it is constructed. The general rule is to convert unibyte text
to multibyte text when combining it with other multibyte text, because
the multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text. The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII}
characters unchanged, and converts bytes with codes 128 through 255 to
the multibyte representation of raw eight-bit bytes.

  Converting multibyte text to unibyte converts all @acronym{ASCII}
and eight-bit characters to their single-byte form, but loses
information for non-@acronym{ASCII} characters by discarding all but
the low 8 bits of each character's codepoint. Converting unibyte text
to multibyte and back to unibyte reproduces the original unibyte text.

The next two functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of characters as @var{string}. If @var{string} is a multibyte string,
it is returned unchanged. The function assumes that @var{string}
includes only @acronym{ASCII} characters and raw 8-bit bytes; the
latter are converted to their multibyte representation corresponding
to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
(@pxref{Text Representations, codepoints}).
@end defun

@defun string-to-unibyte string
This function returns a unibyte string containing the same sequence of
characters as @var{string}. It signals an error if @var{string}
contains a non-@acronym{ASCII} character. If @var{string} is a
unibyte string, it is returned unchanged. Use this function for
@var{string} arguments that contain only @acronym{ASCII} and eight-bit
characters.
@end defun
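
  A short sketch of these two conversions follows. The input strings
are arbitrary; @code{"\200"} is a unibyte string holding the single
byte 128, and the symbol @code{cannot-convert} is only a marker chosen
for this example.

@example
@group
;; Byte 128 becomes the raw-byte character #x3FFF80.
(aref (string-to-multibyte "\200") 0)
     @result{} 4194176
@end group
@group
;; Unibyte -> multibyte -> unibyte reproduces the original bytes.
(let ((bytes "\200abc"))
  (equal (string-to-unibyte (string-to-multibyte bytes)) bytes))
     @result{} t
@end group
@group
;; U+20AC is neither ASCII nor a raw byte, so an error is signaled.
(condition-case nil
    (string-to-unibyte "\u20ac")
  (error 'cannot-convert))
     @result{} cannot-convert
@end group
@end example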

@defun byte-to-string byte
@cindex byte to string
This function returns a unibyte string containing a single byte of
character data, @var{byte}. It signals an error if
@var{byte} is not an integer between 0 and 255.
@end defun

@defun multibyte-char-to-unibyte char
This converts the multibyte character @var{char} to a unibyte
character, and returns that character. If @var{char} is neither
@acronym{ASCII} nor eight-bit, the function returns -1.
@end defun

@defun unibyte-char-to-multibyte char
This converts the unibyte character @var{char} to a multibyte
character, assuming @var{char} is either @acronym{ASCII} or a raw 8-bit
byte.
@end defun
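
  As a rough illustration of the three functions above, the following
sketch converts the byte 200 to its multibyte form and back. The
values 65, 200, and U+20AC are arbitrary choices for this example.

@example
@group
(byte-to-string 65)
     @result{} "A"
@end group
@group
;; Byte 200 maps to the raw-byte character #x3FFFC8.
(unibyte-char-to-multibyte 200)
     @result{} 4194248
@end group
@group
(multibyte-char-to-unibyte #x3FFFC8)
     @result{} 200
@end group
@group
;; U+20AC is neither ASCII nor eight-bit.
(multibyte-char-to-unibyte ?\u20ac)
     @result{} -1
@end group
@end example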

@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer. If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents
viewed as characters; for instance, a sequence of three bytes which is
treated as one character in multibyte representation will count as
three characters in unibyte representation. Eight-bit characters
representing raw bytes are an exception. They are represented by one
byte in a unibyte buffer, but when the buffer is set to multibyte,
they are converted to two-byte sequences, and vice versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

This function signals an error if the buffer is narrowed, since the
narrowing might have occurred in the middle of multibyte character
sequences.

This function also signals an error if the buffer is an indirect
buffer. An indirect buffer always inherits the representation of its
base buffer.
@end defun
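
  The effect on the character count can be sketched as follows. This
example assumes a temporary buffer that holds the two bytes
@code{\303\251}, which happen to form the multibyte sequence for
U+00E9; the variable name @code{unibyte-chars} is used only here.

@example
@group
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\303\251")          ; two separate bytes in a unibyte buffer
  (let ((unibyte-chars (- (point-max) (point-min))))
    (set-buffer-multibyte t)   ; same bytes, now viewed as one character
    (list unibyte-chars
          (- (point-max) (point-min))
          (char-after (point-min)))))
     @result{} (2 1 233)
@end group
@end example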

@defun string-as-unibyte string
If @var{string} is already a unibyte string, this function returns
@var{string} itself. Otherwise, it returns a new string with the same
bytes as @var{string}, but treating each byte as a separate character
(so that the value may have more characters than @var{string}); as an
exception, each eight-bit character representing a raw byte is
converted into a single byte. The newly-created string contains no
text properties.
@end defun

@defun string-as-multibyte string
If @var{string} is a multibyte string, this function returns
@var{string} itself. Otherwise, it returns a new string with the same
bytes as @var{string}, but treating each multibyte sequence as one
character. This means that the value may have fewer characters than
@var{string} has. If a byte sequence in @var{string} is invalid as a
multibyte representation of a single character, each byte in the
sequence is treated as a raw 8-bit byte. The newly-created string
contains no text properties.
@end defun

@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different
character codes. The valid character codes for unibyte representation
range from 0 to @code{#xFF} (255)---the values that can fit in one
byte. The valid character codes for multibyte representation range
from 0 to @code{#x3FFFFF}. In this code space, values 0 through
@code{#x7F} (127) are for @acronym{ASCII} characters, and values
@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
non-@acronym{ASCII} characters.

  Emacs character codes are a superset of the Unicode standard.
Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
characters of the same codepoint; values @code{#x110000} (1114112)
through @code{#x3FFF7F} (4194175) represent characters that are not
unified with Unicode; and values @code{#x3FFF80} (4194176) through
@code{#x3FFFFF} (4194303) represent eight-bit raw bytes.
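
  The ranges just described can be summarized in code. The following
is only a sketch: @code{my-char-code-class} is a hypothetical helper,
not an Emacs function, and the symbols it returns are names invented
for this example.

@example
@group
(defun my-char-code-class (code)
  "Return a symbol naming the range that character code CODE falls in.
Hypothetical helper for illustration; return nil for invalid codes."
  (cond ((not (characterp code)) nil)
        ((<= code #x7F) 'ascii)
        ((<= code #x10FFFF) 'unicode)
        ((<= code #x3FFF7F) 'not-unified)
        (t 'raw-byte)))
@end group

@group
(mapcar #'my-char-code-class
        '(65 #xE9 #x110000 #x3FFF80 #x400000))
     @result{} (ascii unicode not-unified raw-byte nil)
@end group
@end example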

@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@code{nil} otherwise.

@example
@group
(characterp 65)
     @result{} t
@end group
@group
(characterp 4194303)
     @result{} t
@end group
@group
(characterp 4194304)
     @result{} nil
@end group
@end example
@end defun

@cindex maximum value of character codepoint
@cindex codepoint, largest value
@defun max-char
This function returns the largest value that a valid character
codepoint can have.

@example
@group
(characterp (max-char))
     @result{} t
@end group
@group
(characterp (1+ (max-char)))
     @result{} nil
@end group
@end example
@end defun

@defun get-byte &optional pos string
This function returns the byte at character position @var{pos} in the
current buffer. If the current buffer is unibyte, this is literally
the byte at that position. If the buffer is multibyte, byte values of
@acronym{ASCII} characters are the same as character codepoints,
whereas eight-bit raw bytes are converted to their 8-bit codes. The
function signals an error if the character at @var{pos} is a
non-@acronym{ASCII} character other than an eight-bit raw byte.

The optional argument @var{string} means to get a byte value from that
string instead of the current buffer.
@end defun
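
  Here is a brief sketch of @code{get-byte} applied to strings, where
the position argument is a zero-based index. The inputs are arbitrary,
and the symbol @code{not-a-byte} is only a marker chosen for this
example.

@example
@group
(get-byte 0 "A")
     @result{} 65
@end group
@group
;; A raw byte stored in a multibyte string yields its 8-bit code.
(get-byte 0 (string-to-multibyte "\200"))
     @result{} 128
@end group
@group
;; U+20AC is neither ASCII nor a raw byte.
(condition-case nil
    (get-byte 0 "\u20ac")
  (error 'not-a-byte))
     @result{} not-a-byte
@end group
@end example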

@node Character Properties
@section Character Properties
@cindex character properties
A @dfn{character property} is a named attribute of a character that
specifies how the character behaves and how it should be handled
during text processing and display. Thus, character properties are an
important part of specifying the character's semantics.

  On the whole, Emacs follows the Unicode Standard in its implementation
of character properties. In particular, Emacs supports the
@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
Model}, and the Emacs character property database is derived from the
Unicode Character Database (@acronym{UCD}). See the
@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
Properties chapter of the Unicode Standard}, for a detailed
description of Unicode character properties and their meaning. This
section assumes you are already familiar with that chapter of the
Unicode Standard, and want to apply that knowledge to Emacs Lisp
programs.

  In Emacs, each property has a name, which is a symbol, and a set of
possible values, whose types depend on the property; if a character
does not have a certain property, the value is @code{nil}. As a
general rule, the names of character properties in Emacs are produced
from the corresponding Unicode properties by downcasing them and
replacing each @samp{_} character with a dash @samp{-}. For example,
@code{Canonical_Combining_Class} becomes
@code{canonical-combining-class}. However, sometimes we shorten the
names to make their use easier.
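
  The general naming rule can be sketched in a few lines of Lisp.
@code{my-unicode-property-symbol} is a hypothetical helper written for
this illustration, not an Emacs function; because some Emacs property
names are shortened, its result does not always name a property that
Emacs actually defines.

@example
@group
(defun my-unicode-property-symbol (unicode-name)
  "Apply the general naming rule to the string UNICODE-NAME.
Hypothetical helper: downcase the name and turn `_' into `-'."
  (intern (replace-regexp-in-string "_" "-" (downcase unicode-name))))
@end group

@group
(my-unicode-property-symbol "Canonical_Combining_Class")
     @result{} canonical-combining-class
@end group
@end example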

@cindex unassigned character codepoints
  Some codepoints are left @dfn{unassigned} by the
@acronym{UCD}---they don't correspond to any character. The Unicode
Standard defines default values of properties for such codepoints;
they are mentioned below for each property.

  Here is the full list of value types for all the character
properties that Emacs knows about:

@table @code
@item name
Corresponds to the @code{Name} Unicode property. The value is a
string consisting of upper-case Latin letters A to Z, digits, spaces,
and hyphen @samp{-} characters. For unassigned codepoints, the value
is an empty string.

@cindex unicode general category
@item general-category
Corresponds to the @code{General_Category} Unicode property. The
value is a symbol whose name is a 2-letter abbreviation of the
character's classification. For unassigned codepoints, the value
is @code{Cn}.

@item canonical-combining-class
Corresponds to the @code{Canonical_Combining_Class} Unicode property.
The value is an integer number. For unassigned codepoints, the value
is zero.

@cindex bidirectional class of characters
@item bidi-class
Corresponds to the Unicode @code{Bidi_Class} property. The value is a
symbol whose name is the Unicode @dfn{directional type} of the
character. Emacs uses this property when it reorders bidirectional
text for display (@pxref{Bidirectional Display}). For unassigned
codepoints, the value depends on the code blocks to which the
codepoint belongs: most unassigned codepoints get the value of
@code{L} (strong L), but some get values of @code{AL} (Arabic letter)
or @code{R} (strong R).

@item decomposition
Corresponds to the Unicode properties @code{Decomposition_Type} and
@code{Decomposition_Value}. The value is a list, whose first element
may be a symbol representing a compatibility formatting tag, such as
@code{small}@footnote{The Unicode specification writes these tag names
inside @samp{<..>} brackets, but the tag names in Emacs do not include
the brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
@samp{small}. }; the other elements are characters that give the
compatibility decomposition sequence of this character. For
unassigned codepoints, the value is the character itself.

@item decimal-digit-value
Corresponds to the Unicode @code{Numeric_Value} property for
characters whose @code{Numeric_Type} is @samp{Decimal}. The value is an
integer number. For unassigned codepoints, the value is @code{nil},
which means @acronym{NaN}, or ``not-a-number''.

@item digit-value
Corresponds to the Unicode @code{Numeric_Value} property for
characters whose @code{Numeric_Type} is @samp{Digit}. The value is
an integer number. Examples of such characters include compatibility
subscript and superscript digits, for which the value is the
corresponding number. For unassigned codepoints, the value is
@code{nil}, which means @acronym{NaN}.

@item numeric-value
Corresponds to the Unicode @code{Numeric_Value} property for
characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
this property is an integer or a floating-point number. Examples of
characters that have this property include fractions, subscripts,
superscripts, Roman numerals, currency numerators, and encircled
numbers. For example, the value of this property for the character
@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}. For
unassigned codepoints, the value is @code{nil}, which means
@acronym{NaN}.

@cindex mirroring of characters
@item mirrored
Corresponds to the Unicode @code{Bidi_Mirrored} property. The value
of this property is a symbol, either @code{Y} or @code{N}. For
unassigned codepoints, the value is @code{N}.

@item mirroring
Corresponds to the Unicode @code{Bidi_Mirroring_Glyph} property. The
value of this property is a character whose glyph represents the
mirror image of the character's glyph, or @code{nil} if there's no
defined mirroring glyph. All the characters whose @code{mirrored}
property is @code{N} have @code{nil} as their @code{mirroring}
property; however, some characters whose @code{mirrored} property is
@code{Y} also have @code{nil} for @code{mirroring}, because no
appropriate characters exist with mirrored glyphs. Emacs uses this
property to display mirror images of characters when appropriate
(@pxref{Bidirectional Display}). For unassigned codepoints, the value
is @code{nil}.

@item old-name
Corresponds to the Unicode @code{Unicode_1_Name} property. The value
is a string. For unassigned codepoints, the value is an empty string.

@item iso-10646-comment
Corresponds to the Unicode @code{ISO_Comment} property. The value is
a string. For unassigned codepoints, the value is an empty string.

@item uppercase
Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
The value of this property is a single character. For unassigned
codepoints, the value is @code{nil}, which means the character itself.

@item lowercase
Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
The value of this property is a single character. For unassigned
codepoints, the value is @code{nil}, which means the character itself.

@item titlecase
Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
@dfn{Title case} is a special form of a character used when the first
character of a word needs to be capitalized. The value of this
property is a single character. For unassigned codepoints, the value
is @code{nil}, which means the character itself.
@end table

@defun get-char-code-property char propname
This function returns the value of @var{char}'s @var{propname} property.

@example
@group
(get-char-code-property ?\s 'general-category)
     @result{} Zs
@end group
@group
(get-char-code-property ?1 'general-category)
     @result{} Nd
@end group
@group
;; subscript 4
(get-char-code-property ?\u2084 'digit-value)
     @result{} 4
@end group
@group
;; one fifth
(get-char-code-property ?\u2155 'numeric-value)
     @result{} 0.2
@end group
@group
;; Roman IV
(get-char-code-property ?\u2163 'numeric-value)
     @result{} 4
@end group
@end example
@end defun

@defun char-code-property-description prop value
This function returns the description string of property @var{prop}'s
@var{value}, or @code{nil} if @var{value} has no description.

@example
@group
(char-code-property-description 'general-category 'Zs)
     @result{} "Separator, Space"
@end group
@group
(char-code-property-description 'general-category 'Nd)
     @result{} "Number, Decimal Digit"
@end group
@group
(char-code-property-description 'numeric-value '1/5)
     @result{} nil
@end group
@end example
@end defun

@defun put-char-code-property char propname value
This function stores @var{value} as the value of the property
@var{propname} for the character @var{char}.
@end defun

@defvar unicode-category-table
The value of this variable is a char-table (@pxref{Char-Tables}) that
specifies, for each character, its Unicode @code{General_Category}
property as a symbol.
@end defvar

@defvar char-script-table
The value of this variable is a char-table that specifies, for each
character, a symbol whose name is the script to which the character
belongs, according to the Unicode Standard classification of the
Unicode code space into script-specific blocks. This char-table has a
single extra slot whose value is the list of all script symbols.
@end defvar

@defvar char-width-table
The value of this variable is a char-table that specifies the width of
each character in columns that it will occupy on the screen.
@end defvar

@defvar printable-chars
The value of this variable is a char-table that specifies, for each
character, whether it is printable or not. That is, if evaluating
@code{(aref printable-chars char)} results in @code{t}, the character
is printable, and if it results in @code{nil}, it is not.
@end defvar
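
  Since these variables hold char-tables, they can be queried with
@code{aref}. The following sketch looks up the letter @samp{a} and the
control character with code 0; both characters are arbitrary choices
for this example.

@example
@group
(aref unicode-category-table ?a)
     @result{} Ll
@end group
@group
(aref char-script-table ?a)
     @result{} latin
@end group
@group
(list (aref printable-chars ?a) (aref printable-chars 0))
     @result{} (t nil)
@end group
@group
;; The extra slot of `char-script-table' lists all script symbols.
(and (memq 'latin (char-table-extra-slot char-script-table 0)) t)
     @result{} t
@end group
@end example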
626
b8d4c8d0
GM
627@node Character Sets
628@section Character Sets
629@cindex character sets
630
031c41de
EZ
631@cindex charset
632@cindex coded character set
633An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
634in which each character is assigned a numeric code point. (The
434843ec 635Unicode Standard calls this a @dfn{coded character set}.) Each Emacs
031c41de
EZ
636charset has a name which is a symbol. A single character can belong
637to any number of different character sets, but it will generally have
638a different code point in each charset. Examples of character sets
639include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
640@code{windows-1255}. The code point assigned to a character in a
641charset is usually different from its code point used in Emacs buffers
642and strings.
643
644@cindex @code{emacs}, a charset
645@cindex @code{unicode}, a charset
646@cindex @code{eight-bit}, a charset
647 Emacs defines several special character sets. The character set
648@code{unicode} includes all the characters whose Emacs code points are
85eeac93 649in the range @code{0..#x10FFFF}. The character set @code{emacs}
031c41de
EZ
650includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
651Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
652Emacs uses it to represent raw bytes encountered in text.
b8d4c8d0
GM
653
654@defun charsetp object
655Returns @code{t} if @var{object} is a symbol that names a character set,
656@code{nil} otherwise.
657@end defun
658
659@defvar charset-list
660The value is a list of all defined character set names.
661@end defvar
662
031c41de 663@defun charset-priority-list &optional highestp
73e0cbc0 664This function returns a list of all defined character sets ordered by
031c41de
EZ
665their priority. If @var{highestp} is non-@code{nil}, the function
666returns a single character set of the highest priority.
667@end defun
668
669@defun set-charset-priority &rest charsets
670This function makes @var{charsets} the highest priority character sets.
b8d4c8d0
GM
671@end defun
672
106e6894 673@defun char-charset character &optional restriction
031c41de
EZ
674This function returns the name of the character set of highest
675priority that @var{character} belongs to. @acronym{ASCII} characters
676are an exception: for them, this function always returns @code{ascii}.
106e6894
CY
677
678If @var{restriction} is non-@code{nil}, it should be a list of
679charsets to search. Alternatively, it can be a coding system, in
680which case the returned charset must be supported by that coding
681system (@pxref{Coding Systems}).
b8d4c8d0
GM
682@end defun
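
  The following sketch shows how the predicates and lookups above
might be used.  The symbol @code{no-such-charset} is a made-up name
chosen only because no charset is defined under it.

@lisp
(charsetp 'ascii)
     @result{} t
(charsetp 'no-such-charset)
     @result{} nil
;; ASCII characters always report the ascii charset.
(char-charset ?a)
     @result{} ascii
;; With a restriction list, only the listed charsets are searched.
;; #xE9 is LATIN SMALL LETTER E WITH ACUTE, #x3B1 is GREEK SMALL
;; LETTER ALPHA.
(char-charset #xE9 '(iso-8859-1))
     @result{} iso-8859-1
(char-charset #x3B1 '(iso-8859-1))
     @result{} nil
@end lisp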
683
684@defun charset-plist charset
031c41de
EZ
685This function returns the property list of the character set
686@var{charset}. Although @var{charset} is a symbol, this is not the
687same as the property list of that symbol. Charset properties include
688important information about the charset, such as its documentation
689string, short name, etc.
b8d4c8d0
GM
690@end defun
691
031c41de
EZ
692@defun put-charset-property charset propname value
693This function sets the @var{propname} property of @var{charset} to the
694given @var{value}.
b8d4c8d0
GM
695@end defun
696
031c41de
EZ
697@defun get-charset-property charset propname
698This function returns the value of @var{charset}'s property
699@var{propname}.
b8d4c8d0
GM
700@end defun
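
  A minimal sketch of the three property functions working together
follows.  The property name @code{my-demo-note} is hypothetical and
has no meaning to Emacs; it only serves to show that a charset's
property list is separate from the property list of the symbol that
names it.

@lisp
(put-charset-property 'ascii 'my-demo-note "seen")
(get-charset-property 'ascii 'my-demo-note)
     @result{} "seen"
(plist-get (charset-plist 'ascii) 'my-demo-note)
     @result{} "seen"
;; The symbol's own property list is not affected.
(get 'ascii 'my-demo-note)
     @result{} nil
@end lisp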
701
031c41de
EZ
702@deffn Command list-charset-chars charset
703This command displays a list of characters in the character set
704@var{charset}.
705@end deffn
b8d4c8d0 706
8b80cdf5
EZ
707 Emacs can convert between its internal representation of a character
708and the character's codepoint in a specific charset. The following
709two functions support these conversions.
710
711@c FIXME: decode-char and encode-char accept and ignore an additional
712@c argument @var{restriction}. When that argument actually makes a
713@c difference, it should be documented here.
031c41de
EZ
714@defun decode-char charset code-point
715This function decodes a character that is assigned a @var{code-point}
716in @var{charset}, to the corresponding Emacs character, and returns
8b80cdf5
EZ
717it. If @var{charset} doesn't contain a character of that code point,
718the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp
719integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
720specified as a cons cell @code{(@var{high} . @var{low})}, where
031c41de
EZ
721@var{low} are the lower 16 bits of the value and @var{high} are the
722high 16 bits.
b8d4c8d0
GM
723@end defun
724
031c41de
EZ
725@defun encode-char char charset
726This function returns the code point assigned to the character
8b80cdf5
EZ
727@var{char} in @var{charset}. If the result does not fit in a Lisp
728integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
729that fits the second argument of @code{decode-char} above. If
730@var{charset} doesn't have a codepoint for @var{char}, the value is
731@code{nil}.
b3f1f4a5
EZ
732@end defun
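
  For illustration, here is how the two conversions might look for a
few single-byte charsets.  This is a sketch that assumes the charsets
@code{iso-8859-1} and @code{iso-8859-7} as defined in a stock Emacs.

@lisp
;; Code point #xE9 in iso-8859-1 is the Emacs character 233.
(decode-char 'iso-8859-1 #xE9)
     @result{} 233
(encode-char ?A 'ascii)
     @result{} 65
;; GREEK SMALL LETTER ALPHA (#x3B1) has code point #xE1 in iso-8859-7,
(encode-char #x3B1 'iso-8859-7)
     @result{} 225
;; but it has no code point at all in iso-8859-1.
(encode-char #x3B1 'iso-8859-1)
     @result{} nil
@end lisp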
733
734 The following function comes in handy for applying a certain
735function to all or part of the characters in a charset:
736
85eeac93 737@defun map-charset-chars function charset &optional arg from-code to-code
b3f1f4a5
EZ
738Call @var{function} for characters in @var{charset}. @var{function}
739is called with two arguments. The first one is a cons cell
740@code{(@var{from} . @var{to})}, where @var{from} and @var{to}
741indicate a range of characters contained in @var{charset}. The second
85eeac93 742argument passed to @var{function} is @var{arg}.
b3f1f4a5
EZ
743
744By default, the range of codepoints passed to @var{function} includes
8c9d5f9f
KH
745all the characters in @var{charset}, but optional arguments
746@var{from-code} and @var{to-code} limit that to the range of
747characters between these two codepoints of @var{charset}. If either
748of them is @code{nil}, it defaults to the first or last codepoint of
749@var{charset}, respectively.
b8d4c8d0
GM
750@end defun
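
  As a rough sketch, the function below counts the characters of a
charset by summing the sizes of the ranges it is given.  The variable
name @code{count} is illustrative only, and the result assumes the
standard @code{ascii} charset.

@lisp
(let ((count 0))
  (map-charset-chars
   ;; RANGE is a cons (FROM . TO); the second argument is ARG,
   ;; which is not used here.
   (lambda (range _arg)
     (setq count (+ count (1+ (- (cdr range) (car range))))))
   'ascii)
  count)
     @result{} 128
@end lisp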
751
b8d4c8d0
GM
752@node Scanning Charsets
753@section Scanning for Character Sets
754
97d8273f
CY
755 Sometimes it is useful to find out which character set a particular
756character belongs to. One use for this is in determining which coding
757systems (@pxref{Coding Systems}) are capable of representing all of
758the text in question; another is to determine the font(s) for
759displaying that text.
b8d4c8d0
GM
760
761@defun charset-after &optional pos
031c41de 762This function returns the charset of highest priority containing the
97d8273f 763character at position @var{pos} in the current buffer. If @var{pos}
031c41de
EZ
764is omitted or @code{nil}, it defaults to the current value of point.
765If @var{pos} is out of range, the value is @code{nil}.
b8d4c8d0
GM
766@end defun
767
768@defun find-charset-region beg end &optional translation
031c41de 769This function returns a list of the character sets of highest priority
8b80cdf5 770that contain characters in the current buffer between positions
031c41de 771@var{beg} and @var{end}.
b8d4c8d0 772
97d8273f
CY
773The optional argument @var{translation} specifies a translation table
774to use for scanning the text (@pxref{Translation of Characters}). If
775it is non-@code{nil}, then each character in the region is translated
b8d4c8d0
GM
776through this table, and the value returned describes the translated
777characters instead of the characters actually in the buffer.
778@end defun
779
780@defun find-charset-string string &optional translation
97d8273f 781This function returns a list of character sets of highest priority
031c41de
EZ
782that contain characters in @var{string}. It is just like
783@code{find-charset-region}, except that it applies to the contents of
784@var{string} instead of part of the current buffer.
b8d4c8d0
GM
785@end defun
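
  The scanning functions can be tried on plain @acronym{ASCII} text,
where the result does not depend on the charset priority list.  For
text with other characters, the charsets reported depend on the
current priorities, so no result is shown for that case.

@lisp
(find-charset-string "abc")
     @result{} (ascii)
(with-temp-buffer
  (insert "abc")
  (list (charset-after 1)                ; character at position 1
        (charset-after 10)               ; position out of range
        (find-charset-region (point-min) (point-max))))
     @result{} (ascii nil (ascii))
;; The charsets reported here depend on the charset priority list.
(find-charset-string (string ?a #x3B1))
@end lisp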
786
787@node Translation of Characters
788@section Translation of Characters
789@cindex character translation tables
790@cindex translation tables
791
031c41de
EZ
792 A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
793specifies a mapping of characters into characters. These tables are
794used in encoding and decoding, and for other purposes. Some coding
795systems specify their own particular translation tables; there are
796also default translation tables which apply to all other coding
797systems.
b8d4c8d0 798
031c41de
EZ
799 A translation table has two extra slots. The first is either
800@code{nil} or a translation table that performs the reverse
801translation; the second is the maximum number of characters to look up
8b80cdf5
EZ
802for translating sequences of characters (see the description of
803@code{make-translation-table-from-alist} below).
b8d4c8d0
GM
804
805@defun make-translation-table &rest translations
806This function returns a translation table based on the argument
807@var{translations}. Each element of @var{translations} should be a
808list of elements of the form @code{(@var{from} . @var{to})}; this says
809to translate the character @var{from} into @var{to}.
810
811The arguments and the forms in each argument are processed in order,
812and if a previous form already translates @var{to} to some other
813character, say @var{to-alt}, @var{from} is also translated to
814@var{to-alt}.
b8d4c8d0
GM
815@end defun
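
  The chaining rule described above can be seen in a small sketch.
Since a translation table is a char-table, @code{aref} retrieves the
translation of a character.

@lisp
(aref (make-translation-table '((?a . ?b))) ?a)
     @result{} 98                        ; ?b
;; The first argument already maps ?b to ?c, so the later pair
;; (?a . ?b) ends up mapping ?a to ?c as well.
(aref (make-translation-table '((?b . ?c)) '((?a . ?b))) ?a)
     @result{} 99                        ; ?c
@end lisp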
816
031c41de
EZ
817 During decoding, the translation table's translations are applied to
818the characters that result from ordinary decoding. If a coding system
97d8273f 819has the property @code{:decode-translation-table}, that specifies the
031c41de
EZ
820translation table to use, or a list of translation tables to apply in
821sequence. (This is a property of the coding system, as returned by
822@code{coding-system-get}, not a property of the symbol that is the
823coding system's name. @xref{Coding System Basics,, Basic Concepts of
824Coding Systems}.) Finally, if
825@code{standard-translation-table-for-decode} is non-@code{nil}, the
826resulting characters are translated by that table.
827
828 During encoding, the translation table's translations are applied to
829the characters in the buffer, and the result of translation is
830actually encoded. If a coding system has property
831@code{:encode-translation-table}, that specifies the translation table
832to use, or a list of translation tables to apply in sequence. In
833addition, if the variable @code{standard-translation-table-for-encode}
834is non-@code{nil}, it specifies the translation table to use for
835translating the result.
b8d4c8d0
GM
836
837@defvar standard-translation-table-for-decode
031c41de
EZ
838This is the default translation table for decoding. If a coding
839system specifies its own translation tables, the table that is the
840value of this variable, if non-@code{nil}, is applied after them.
b8d4c8d0
GM
841@end defvar
842
843@defvar standard-translation-table-for-encode
031c41de
EZ
844This is the default translation table for encoding. If a coding
845system specifies its own translation tables, the table that is the
846value of this variable, if non-@code{nil}, is applied after them.
b8d4c8d0
GM
847@end defvar
848
5c9c5c4b
EZ
849@defvar translation-table-for-input
850Self-inserting characters are translated through this translation
851table before they are inserted. Search commands also translate their
852input through this table, so they can compare more reliably with
853what's in the buffer.
854
855This variable automatically becomes buffer-local when set.
856@end defvar
857
031c41de
EZ
858@defun make-translation-table-from-vector vec
859This function returns a translation table made from @var{vec} that is
85eeac93 860an array of 256 elements to map bytes (values 0 through #xFF) to
031c41de
EZ
861characters. Elements may be @code{nil} for untranslated bytes. The
862returned table has a translation table for reverse mapping in the
8b80cdf5 863first extra slot, and the value @code{1} in the second extra slot.
031c41de
EZ
864
865This function provides an easy way to make a private coding system
866that maps each byte to a specific character. You can specify the
867returned table and the reverse translation table using the properties
868@code{:decode-translation-table} and @code{:encode-translation-table}
869respectively in the @var{props} argument to
870@code{define-coding-system}.
871@end defun
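
  As a sketch, the following builds a table that maps only the byte
@code{#x80} to the euro sign (@code{#x20AC}) and then inspects the
table and its two extra slots.  The choice of byte and character is
purely illustrative.

@lisp
(let ((vec (make-vector 256 nil)))
  (aset vec #x80 #x20AC)
  (let ((table (make-translation-table-from-vector vec)))
    (list (aref table #x80)
          ;; The reverse table in the first extra slot maps back.
          (aref (char-table-extra-slot table 0) #x20AC)
          (char-table-extra-slot table 1))))
     @result{} (8364 128 1)
@end lisp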
872
873@defun make-translation-table-from-alist alist
874This function is similar to @code{make-translation-table} but returns
875a complex translation table rather than a simple one-to-one mapping.
876Each element of @var{alist} is of the form @code{(@var{from}
97d8273f
CY
877. @var{to})}, where @var{from} and @var{to} are either characters or
878vectors specifying a sequence of characters. If @var{from} is a
1df7defd 879character, that character is translated to @var{to} (i.e., to a
031c41de
EZ
880character or a character sequence). If @var{from} is a vector of
881characters, that sequence is translated to @var{to}. The returned
882table has a translation table for reverse mapping in the first extra
8b80cdf5
EZ
883slot, and the maximum length of all the @var{from} character sequences
884in the second extra slot.
031c41de
EZ
885@end defun
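
  A minimal sketch, using arbitrary example characters: one entry
translates a single character, the other a two-character sequence, so
the second extra slot records a maximum lookup length of 2.

@lisp
(let ((table (make-translation-table-from-alist
              '((?a . ?b) ([?x ?y] . ?z)))))
  (list (aref table ?a)
        (char-table-extra-slot table 1)))
     @result{} (98 2)
@end lisp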
886
b8d4c8d0
GM
887@node Coding Systems
888@section Coding Systems
889
890@cindex coding system
891 When Emacs reads or writes a file, and when Emacs sends text to a
892subprocess or receives text from a subprocess, it normally performs
893character code conversion and end-of-line conversion as specified
894by a particular @dfn{coding system}.
895
896 How to define a coding system is an arcane matter, and is not
897documented here.
898
899@menu
900* Coding System Basics:: Basic concepts.
901* Encoding and I/O:: How file I/O functions handle coding systems.
902* Lisp and Coding Systems:: Functions to operate on coding system names.
903* User-Chosen Coding Systems:: Asking the user to choose a coding system.
904* Default Coding Systems:: Controlling the default choices.
905* Specifying Coding Systems:: Requesting a particular coding system
906 for a single file operation.
907* Explicit Encoding:: Encoding or decoding text without doing I/O.
908* Terminal I/O Encoding:: Use of encoding for terminal I/O.
b8d4c8d0
GM
909@end menu
910
911@node Coding System Basics
912@subsection Basic Concepts of Coding Systems
913
914@cindex character code conversion
80070260
EZ
915 @dfn{Character code conversion} involves conversion between the
916internal representation of characters used inside Emacs and some other
917encoding. Emacs supports many different encodings, in that it can
918convert to and from them. For example, it can convert text to or from
919encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and
920several variants of ISO 2022. In some cases, Emacs supports several
921alternative encodings for the same characters; for example, there are
922three coding systems for the Cyrillic (Russian) alphabet: ISO,
923Alternativnyj, and KOI8.
924
af38459f
EZ
925 Every coding system specifies a particular set of character code
926conversions, but the coding system @code{undecided} is special: it
927leaves the choice unspecified, to be chosen heuristically for each
928file, based on the file's data.
b8d4c8d0
GM
929
930 In general, a coding system doesn't guarantee roundtrip identity:
931decoding a byte sequence using a coding system, then encoding the
932resulting text in the same coding system, can produce a different byte
80070260
EZ
933sequence. But some coding systems do guarantee that the byte sequence
934will be the same as what you originally decoded. Here are a few
935examples:
b8d4c8d0
GM
936
937@quotation
80070260 938iso-8859-1, utf-8, big5, shift_jis, euc-jp
b8d4c8d0
GM
939@end quotation
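
  For a particular byte sequence, the round trip can be checked
directly with the explicit decoding and encoding functions
(@pxref{Explicit Encoding}).  This sketch uses the two bytes that
encode a single accented letter in @code{utf-8}; it demonstrates the
check for one input and is not a proof for the coding system as a
whole.

@lisp
(let* ((bytes (unibyte-string #xC3 #xA9))
       (text (decode-coding-string bytes 'utf-8)))
  ;; Re-encoding the decoded text yields the original bytes.
  (equal (encode-coding-string text 'utf-8) bytes))
     @result{} t
@end lisp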
940
941 Encoding buffer text and then decoding the result can also fail to
80070260
EZ
942reproduce the original text. For instance, if you encode a character
943with a coding system which does not support that character, the result
944is unpredictable, and thus decoding it using the same coding system
945may produce a different text. Currently, Emacs can't report errors
946that result from encoding unsupported characters.
b8d4c8d0
GM
947
948@cindex EOL conversion
949@cindex end-of-line conversion
950@cindex line end conversion
80070260
EZ
951 @dfn{End of line conversion} handles three different conventions
952used on various systems for representing end of line in files. The
953Unix convention, used on GNU and Unix systems, is to use the linefeed
954character (also called newline). The DOS convention, used on
955MS-Windows and MS-DOS systems, is to use a carriage-return and a
956linefeed at the end of a line. The Mac convention is to use just
957carriage-return.
b8d4c8d0
GM
958
959@cindex base coding system
960@cindex variant coding system
961 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
962conversion unspecified, to be chosen based on the data. @dfn{Variant
963coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
964@code{latin-1-mac} specify the end-of-line conversion explicitly as
965well. Most base coding systems have three corresponding variants whose
966names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
967
02eccf6b 968@vindex raw-text@r{ coding system}
b8d4c8d0 969 The coding system @code{raw-text} is special in that it prevents
02eccf6b
EZ
970character code conversion, and causes the buffer visited with this
971coding system to be a unibyte buffer. For historical reasons, you can
972save both unibyte and multibyte text with this coding system. When
973you use @code{raw-text} to encode multibyte text, it does perform one
974character code conversion: it converts eight-bit characters to their
975single-byte external representation. @code{raw-text} does not specify
976the end-of-line conversion, allowing that to be determined as usual by
977the data, and has the usual three variants which specify the
978end-of-line conversion.
979
980@vindex no-conversion@r{ coding system}
981@vindex binary@r{ coding system}
982 @code{no-conversion} (and its alias @code{binary}) is equivalent to
983@code{raw-text-unix}: it specifies no conversion of either character
984codes or end-of-line.
b8d4c8d0 985
80070260 986@vindex emacs-internal@r{ coding system}
97d8273f
CY
987@vindex utf-8-emacs@r{ coding system}
988 The coding system @code{utf-8-emacs} specifies that the data is
989represented in the internal Emacs encoding (@pxref{Text
990Representations}). This is like @code{raw-text} in that no code
991conversion happens, but different in that the result is multibyte
992data. The name @code{emacs-internal} is an alias for
993@code{utf-8-emacs}.
b8d4c8d0
GM
994
995@defun coding-system-get coding-system property
996This function returns the specified property of the coding system
997@var{coding-system}. Most coding system properties exist for internal
80070260 998purposes, but one that you might find useful is @code{:mime-charset}.
b8d4c8d0
GM
999That property's value is the name used in MIME for the character coding
1000which this coding system can read and write. Examples:
1001
1002@example
80070260 1003(coding-system-get 'iso-latin-1 :mime-charset)
b8d4c8d0 1004 @result{} iso-8859-1
80070260 1005(coding-system-get 'iso-2022-cn :mime-charset)
b8d4c8d0 1006 @result{} iso-2022-cn
80070260 1007(coding-system-get 'cyrillic-koi8 :mime-charset)
b8d4c8d0
GM
1008 @result{} koi8-r
1009@end example
1010
80070260 1011The value of the @code{:mime-charset} property is also defined
b8d4c8d0
GM
1012as an alias for the coding system.
1013@end defun
1014
91211f07
EZ
1015@defun coding-system-aliases coding-system
1016This function returns the list of aliases of @var{coding-system}.
1017@end defun
1018
b8d4c8d0
GM
1019@node Encoding and I/O
1020@subsection Encoding and I/O
1021
1022 The principal purpose of coding systems is for use in reading and
97d8273f
CY
1023writing files. The function @code{insert-file-contents} uses a coding
1024system to decode the file data, and @code{write-region} uses one to
1025encode the buffer contents.
b8d4c8d0
GM
1026
1027 You can specify the coding system to use either explicitly
1028(@pxref{Specifying Coding Systems}), or implicitly using a default
1029mechanism (@pxref{Default Coding Systems}). But these methods may not
1030completely specify what to do. For example, they may choose a coding
1031system such as @code{undecided} which leaves the character code
1032conversion to be determined from the data. In these cases, the I/O
1033operation finishes the job of choosing a coding system. Very often
1034you will want to find out afterwards which coding system was chosen.
1035
1036@defvar buffer-file-coding-system
e2e3f1d7
MR
1037This buffer-local variable records the coding system used for saving the
1038buffer and for writing part of the buffer with @code{write-region}. If
1039the text to be written cannot be safely encoded using the coding system
1040specified by this variable, these operations select an alternative
1041encoding by calling the function @code{select-safe-coding-system}
1042(@pxref{User-Chosen Coding Systems}). If selecting a different encoding
1043requires to ask the user to specify a coding system,
1044@code{buffer-file-coding-system} is updated to the newly selected coding
1045system.
b8d4c8d0
GM
1046
1047@code{buffer-file-coding-system} does @emph{not} affect sending text
1048to a subprocess.
1049@end defvar
1050
1051@defvar save-buffer-coding-system
1052This variable specifies the coding system for saving the buffer (by
1053overriding @code{buffer-file-coding-system}). Note that it is not used
1054for @code{write-region}.
1055
1056When a command to save the buffer starts out to use
1057@code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
1058and that coding system cannot handle
1059the actual text in the buffer, the command asks the user to choose
1060another coding system (by calling @code{select-safe-coding-system}).
1061After that happens, the command also updates
1062@code{buffer-file-coding-system} to represent the coding system that
1063the user specified.
1064@end defvar
1065
1066@defvar last-coding-system-used
1067I/O operations for files and subprocesses set this variable to the
1068coding system name that was used. The explicit encoding and decoding
1069functions (@pxref{Explicit Encoding}) set it too.
1070
1071@strong{Warning:} Since receiving subprocess output sets this variable,
1072it can change whenever Emacs waits; therefore, you should copy the
1073value shortly after the function call that stores the value you are
1074interested in.
1075@end defvar
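
  One way to follow the advice above is to copy the value into a local
variable immediately after the I/O call.  The sketch below writes a
temporary file and reads it back; the file name prefix
@code{coding-demo} is arbitrary, and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}) is bound only to make the written
bytes predictable.  The coding system reported depends on the user's
settings, so no result is shown.

@lisp
(let ((file (make-temp-file "coding-demo")))
  (unwind-protect
      (with-temp-buffer
        (let ((coding-system-for-write 'utf-8-unix))
          (write-region "caf\u00e9\n" nil file))
        (insert-file-contents file)
        ;; Copy the value right away, before Emacs has a chance
        ;; to wait for subprocess output and change it.
        (let ((used last-coding-system-used))
          (message "Decoded with %s" used)
          used))
    (delete-file file)))
@end lisp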
1076
1077 The variable @code{selection-coding-system} specifies how to encode
1078selections for the window system. @xref{Window System Selections}.
1079
1080@defvar file-name-coding-system
1081The variable @code{file-name-coding-system} specifies the coding
1082system to use for encoding file names. Emacs encodes file names using
1083that coding system for all file operations. If
1084@code{file-name-coding-system} is @code{nil}, Emacs uses a default
1085coding system determined by the selected language environment. In the
1086default language environment, any non-@acronym{ASCII} characters in
1087file names are not encoded specially; they appear in the file system
1088using the internal Emacs representation.
1089@end defvar
1090
1091 @strong{Warning:} if you change @code{file-name-coding-system} (or
1092the language environment) in the middle of an Emacs session, problems
1093can result if you have already visited files whose names were encoded
1094using the earlier coding system and are handled differently under the
1095new coding system. If you try to save one of these buffers under the
1096visited file name, saving may use the wrong file name, or it may get
1097an error. If such a problem happens, use @kbd{C-x C-w} to specify a
1098new file name for that buffer.
1099
1100@node Lisp and Coding Systems
1101@subsection Coding Systems in Lisp
1102
1103 Here are the Lisp facilities for working with coding systems:
1104
0e90e7be 1105@cindex list all coding systems
b8d4c8d0
GM
1106@defun coding-system-list &optional base-only
1107This function returns a list of all coding system names (symbols). If
1108@var{base-only} is non-@code{nil}, the value includes only the
1109base coding systems. Otherwise, it includes alias and variant coding
1110systems as well.
1111@end defun
1112
1113@defun coding-system-p object
1114This function returns @code{t} if @var{object} is a coding system
1115name or @code{nil}.
1116@end defun
1117
0e90e7be
EZ
1118@cindex validity of coding system
1119@cindex coding system, validity check
b8d4c8d0 1120@defun check-coding-system coding-system
80070260
EZ
1121This function checks the validity of @var{coding-system}. If that is
1122valid, it returns @var{coding-system}. If @var{coding-system} is
1123@code{nil}, the function returns @code{nil}. For any other values, it
1124signals an error whose @code{error-symbol} is @code{coding-system-error}
1125(@pxref{Signaling Errors, signal}).
b8d4c8d0
GM
1126@end defun
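
  The difference between the predicate and the checking function can
be sketched as follows.  The symbol @code{no-such-coding-system} is a
made-up name used only because nothing is defined under it.

@lisp
(coding-system-p 'utf-8)
     @result{} t
(coding-system-p nil)
     @result{} t
(coding-system-p 'no-such-coding-system)
     @result{} nil
(check-coding-system 'utf-8)
     @result{} utf-8
;; An invalid name signals an error instead of returning nil.
(condition-case err
    (check-coding-system 'no-such-coding-system)
  (coding-system-error (car err)))
     @result{} coding-system-error
@end lisp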
1127
0e90e7be 1128@cindex eol type of coding system
b8d4c8d0
GM
1129@defun coding-system-eol-type coding-system
1130This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
1131conversion used by @var{coding-system}. If @var{coding-system}
1132specifies a certain eol conversion, the return value is an integer 0,
11331, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
1134respectively. If @var{coding-system} doesn't specify eol conversion
1135explicitly, the return value is a vector of coding systems, each one
1136with one of the possible eol conversion types, like this:
1137
1138@lisp
1139(coding-system-eol-type 'latin-1)
1140 @result{} [latin-1-unix latin-1-dos latin-1-mac]
1141@end lisp
1142
1143@noindent
1144If this function returns a vector, Emacs will decide, as part of the
1145text encoding or decoding process, what eol conversion to use. For
1146decoding, the end-of-line format of the text is auto-detected, and the
1147eol conversion is set to match it (e.g., DOS-style CRLF format will
1148imply @code{dos} eol conversion). For encoding, the eol conversion is
1149taken from the appropriate default coding system (e.g.,
4e3b4528 1150the default value of @code{buffer-file-coding-system} for
b8d4c8d0
GM
1151@code{buffer-file-coding-system}), or from the default eol conversion
1152appropriate for the underlying platform.
1153@end defun
1154
0e90e7be 1155@cindex eol conversion of coding system
b8d4c8d0
GM
1156@defun coding-system-change-eol-conversion coding-system eol-type
1157This function returns a coding system which is like @var{coding-system}
1158except for its eol conversion, which is specified by @var{eol-type}.
1159@var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
1160@code{nil}. If it is @code{nil}, the returned coding system determines
1161the end-of-line conversion from the data.
1162
1163@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
1164@code{dos} and @code{mac}, respectively.
1165@end defun
1166
0e90e7be 1167@cindex text conversion of coding system
b8d4c8d0
GM
1168@defun coding-system-change-text-conversion eol-coding text-coding
1169This function returns a coding system which uses the end-of-line
1170conversion of @var{eol-coding}, and the text conversion of
1171@var{text-coding}. If @var{text-coding} is @code{nil}, it returns
1172@code{undecided}, or one of its variants according to @var{eol-coding}.
1173@end defun
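
  The following sketch combines the eol-related functions above, using
@code{utf-8} as the example coding system:

@lisp
(coding-system-eol-type 'utf-8-dos)
     @result{} 1
;; The eol type may be given as a symbol or as an integer.
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
(coding-system-change-eol-conversion 'utf-8 1)
     @result{} utf-8-dos
;; Take the eol conversion from the first argument and the text
;; conversion from the second.
(coding-system-change-text-conversion 'undecided-dos 'utf-8)
     @result{} utf-8-dos
@end lisp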
1174
0e90e7be
EZ
1175@cindex safely encode region
1176@cindex coding systems for encoding region
b8d4c8d0
GM
1177@defun find-coding-systems-region from to
1178This function returns a list of coding systems that could be used to
1179encode a text between @var{from} and @var{to}. All coding systems in
1180the list can safely encode any multibyte characters in that portion of
1181the text.
1182
1183If the text contains no multibyte characters, the function returns the
1184list @code{(undecided)}.
1185@end defun
1186
0e90e7be
EZ
1187@cindex safely encode a string
1188@cindex coding systems for encoding a string
b8d4c8d0
GM
1189@defun find-coding-systems-string string
1190This function returns a list of coding systems that could be used to
1191encode the text of @var{string}. All coding systems in the list can
1192safely encode any multibyte characters in @var{string}. If the text
1193contains no multibyte characters, this returns the list
1194@code{(undecided)}.
1195@end defun
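
  As an illustration, the returned list can be tested for a particular
coding system with @code{memq}.  The exact contents and order of the
list for non-@acronym{ASCII} text vary between sessions, so this sketch
only checks membership.

@lisp
(find-coding-systems-string "abc")
     @result{} (undecided)
;; #x3B1 is GREEK SMALL LETTER ALPHA.
(let ((systems (find-coding-systems-string (string #x3B1))))
  (list (and (memq 'utf-8 systems) t)
        (and (memq 'iso-latin-1 systems) t)))
     @result{} (t nil)
@end lisp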
1196
0e90e7be
EZ
1197@cindex charset, coding systems to encode
1198@cindex safely encode characters in a charset
b8d4c8d0
GM
1199@defun find-coding-systems-for-charsets charsets
1200This function returns a list of coding systems that could be used to
1201encode all the character sets in the list @var{charsets}.
1202@end defun
1203
91211f07
EZ
1204@defun check-coding-systems-region start end coding-system-list
1205This function checks whether coding systems in the list
1206@var{coding-system-list} can encode all the characters in the region
1207between @var{start} and @var{end}. If all of the coding systems in
1208the list can encode the specified text, the function returns
1209@code{nil}. If some coding systems cannot encode some of the
1210characters, the value is an alist, each element of which has the form
1211@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning
1212that @var{coding-system1} cannot encode characters at buffer positions
1213@var{pos1}, @var{pos2}, @enddots{}.
1214
1215@var{start} may be a string, in which case @var{end} is ignored and
1216the returned value references string indices instead of buffer
1217positions.
1218@end defun
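
  A small sketch of the string form: the text below consists of the
letter @samp{a} followed by a Greek alpha, which @code{utf-8} can
encode and @code{iso-latin-1} cannot.  The index in the result is
zero-based because the text is a string.

@lisp
(check-coding-systems-region (string ?a #x3B1) nil
                             '(utf-8 iso-latin-1))
     @result{} ((iso-latin-1 1))
@end lisp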
1219
b8d4c8d0
GM
1220@defun detect-coding-region start end &optional highest
1221This function chooses a plausible coding system for decoding the text
80070260 1222from @var{start} to @var{end}. This text should be a byte sequence,
1df7defd 1223i.e., unibyte text or multibyte text with only @acronym{ASCII} and
80070260 1224eight-bit characters (@pxref{Explicit Encoding}).
b8d4c8d0
GM
1225
1226Normally this function returns a list of coding systems that could
1227handle decoding the text that was scanned. They are listed in order of
1228decreasing priority. But if @var{highest} is non-@code{nil}, then the
1229return value is just one coding system, the one that is highest in
1230priority.
1231
1232If the region contains only @acronym{ASCII} characters except for such
1233ISO-2022 control characters as @code{ESC}, the value is
1234@code{undecided} or @code{(undecided)}, or a variant specifying
1235end-of-line conversion, if that can be deduced from the text.
0b4faef3
EZ
1236
1237If the region contains null bytes, the value is @code{no-conversion},
1238even if the region contains text encoded in some coding system.
b8d4c8d0
GM
1239@end defun
1240
1241@defun detect-coding-string string &optional highest
1242This function is like @code{detect-coding-region} except that it
1243operates on the contents of @var{string} instead of bytes in the buffer.
91211f07
EZ
1244@end defun
1245
0e90e7be 1246@cindex null bytes, and decoding text
0b4faef3
EZ
1247@defvar inhibit-null-byte-detection
1248If this variable has a non-@code{nil} value, null bytes are ignored
1249when detecting the encoding of a region or a string. This allows
1250Emacs to correctly detect the encoding of text that contains null bytes, such
1251as Info files with Index nodes.
1252@end defvar
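
  The effect of null bytes on detection might be sketched as follows.
The strings are tiny illustrative inputs; detection on real data
depends on the coding system priorities in effect.

@lisp
(detect-coding-string "abc")
     @result{} (undecided)
(detect-coding-string "abc" t)
     @result{} undecided
;; A null byte makes the text look like binary data.
(detect-coding-string "a\0b" t)
     @result{} no-conversion
;; With null byte detection inhibited, the same bytes are treated
;; as plain ASCII text.
(let ((inhibit-null-byte-detection t))
  (detect-coding-string "a\0b" t))
     @result{} undecided
@end lisp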
1253
1254@defvar inhibit-iso-escape-detection
1255If this variable has a non-@code{nil} value, ISO-2022 escape sequences
1256are ignored when detecting the encoding of a region or a string. The
1257result is that no text is ever detected as encoded in some ISO-2022
1258encoding, and all escape sequences become visible in a buffer.
1259@strong{Warning:} @emph{Use this variable with extreme caution,
1260because many files in the Emacs distribution use ISO-2022 encoding.}
1261@end defvar
1262
0e90e7be 1263@cindex charsets supported by a coding system
91211f07
EZ
1264@defun coding-system-charset-list coding-system
1265This function returns the list of character sets (@pxref{Character
1266Sets}) supported by @var{coding-system}. Some coding systems that
1267support too many character sets to list them all yield special values:
1268@itemize @bullet
1269@item
1270If @var{coding-system} supports all the ISO-2022 charsets, the value
1271is @code{iso-2022}.
1272@item
1273If @var{coding-system} supports all Emacs characters, the value is
1274@code{(emacs)}.
1275@item
1276If @var{coding-system} supports all emacs-mule characters, the value
1277is @code{emacs-mule}.
1278@item
1279If @var{coding-system} supports all Unicode characters, the value is
1280@code{(unicode)}.
1281@end itemize
b8d4c8d0
GM
1282@end defun
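
  For illustration, here are an ordinary value and two of the special
values listed above.  The results assume the coding system
definitions shipped with Emacs.

@lisp
(coding-system-charset-list 'iso-latin-1)
     @result{} (iso-8859-1)
(coding-system-charset-list 'utf-8)
     @result{} (unicode)
(coding-system-charset-list 'iso-2022-7bit)
     @result{} iso-2022
@end lisp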
1283
1284 @xref{Coding systems for a subprocess,, Process Information}, in
1285particular the description of the functions
1286@code{process-coding-system} and @code{set-process-coding-system}, for
1287how to examine or set the coding systems used for I/O to a subprocess.
1288
1289@node User-Chosen Coding Systems
1290@subsection User-Chosen Coding Systems
1291
1292@cindex select safe coding system
1293@defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
1294This function selects a coding system for encoding specified text,
1295asking the user to choose if necessary. Normally the specified text
1296is the text in the current buffer between @var{from} and @var{to}. If
1297@var{from} is a string, the string specifies the text to encode, and
1298@var{to} is ignored.
1299
1300If the specified text includes raw bytes (@pxref{Text
1301Representations}), @code{select-safe-coding-system} suggests
1302@code{raw-text} for its encoding.
1303
1304If @var{default-coding-system} is non-@code{nil}, that is the first
1305coding system to try; if that can handle the text,
1306@code{select-safe-coding-system} returns that coding system. It can
1307also be a list of coding systems; then the function tries each of them
1308one by one. After trying all of them, it next tries the current
1309buffer's value of @code{buffer-file-coding-system} (if it is not
1310@code{undecided}), then the default value of
1311@code{buffer-file-coding-system} and finally the user's most
1312preferred coding system, which the user can set using the command
1313@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
1314Coding Systems, emacs, The GNU Emacs Manual}).
1315
1316If one of those coding systems can safely encode all the specified
1317text, @code{select-safe-coding-system} chooses it and returns it.
1318Otherwise, it asks the user to choose from a list of coding systems
1319which can encode all the text, and returns the user's choice.
1320
@var{default-coding-system} can also be a list whose first element is
@code{t} and whose other elements are coding systems. Then, if no coding
system in the list can handle the text, @code{select-safe-coding-system}
queries the user immediately, without trying any of the three
alternatives described above.
1326
1327The optional argument @var{accept-default-p}, if non-@code{nil},
1328should be a function to determine whether a coding system selected
1329without user interaction is acceptable. @code{select-safe-coding-system}
1330calls this function with one argument, the base coding system of the
1331selected coding system. If @var{accept-default-p} returns @code{nil},
1332@code{select-safe-coding-system} rejects the silently selected coding
1333system, and asks the user to select a coding system from a list of
1334possible candidates.
1335
1336@vindex select-safe-coding-system-accept-default-p
1337If the variable @code{select-safe-coding-system-accept-default-p} is
1338non-@code{nil}, it should be a function taking a single argument.
1339It is used in place of @var{accept-default-p}, overriding any
1340value supplied for this argument.
1341
1342As a final step, before returning the chosen coding system,
1343@code{select-safe-coding-system} checks whether that coding system is
1344consistent with what would be selected if the contents of the region
1345were read from a file. (If not, this could lead to data corruption in
1346a file subsequently re-visited and edited.) Normally,
1347@code{select-safe-coding-system} uses @code{buffer-file-name} as the
1348file for this purpose, but if @var{file} is non-@code{nil}, it uses
1349that file instead (this can be relevant for @code{write-region} and
1350similar functions). If it detects an apparent inconsistency,
1351@code{select-safe-coding-system} queries the user before selecting the
1352coding system.
1353@end defun
1354
1355 Here are two functions you can use to let the user specify a coding
1356system, with completion. @xref{Completion}.
1357
1358@defun read-coding-system prompt &optional default
1359This function reads a coding system using the minibuffer, prompting with
1360string @var{prompt}, and returns the coding system name as a symbol. If
1361the user enters null input, @var{default} specifies which coding system
1362to return. It should be a symbol or a string.
1363@end defun
1364
1365@defun read-non-nil-coding-system prompt
1366This function reads a coding system using the minibuffer, prompting with
1367string @var{prompt}, and returns the coding system name as a symbol. If
1368the user tries to enter null input, it asks the user to try again.
1369@xref{Coding Systems}.
1370@end defun
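
  As an illustration, the following sketch prompts for a coding system
and falls back to @code{utf-8} (an arbitrary choice) if the user
enters null input:

@example
(read-coding-system "Coding system (default utf-8): " 'utf-8)
@end example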
1371
1372@node Default Coding Systems
1373@subsection Default Coding Systems
1374@cindex default coding system
1375@cindex coding system, automatically determined
1376
1377 This section describes variables that specify the default coding
1378system for certain files or when running certain subprograms, and the
1379function that I/O operations use to access them.
1380
1381 The idea of these variables is that you set them once and for all to the
1382defaults you want, and then do not change them again. To specify a
1383particular coding system for a particular operation in a Lisp program,
1384don't change these variables; instead, override them using
1385@code{coding-system-for-read} and @code{coding-system-for-write}
1386(@pxref{Specifying Coding Systems}).
1387
@cindex file contents, and default coding system
@defopt auto-coding-regexp-alist
1390This variable is an alist of text patterns and corresponding coding
1391systems. Each element has the form @code{(@var{regexp}
1392. @var{coding-system})}; a file whose first few kilobytes match
1393@var{regexp} is decoded with @var{coding-system} when its contents are
1394read into a buffer. The settings in this alist take priority over
1395@code{coding:} tags in the files and the contents of
1396@code{file-coding-system-alist} (see below). The default value is set
1397so that Emacs automatically recognizes mail files in Babyl format and
1398reads them with no code conversions.
@end defopt

@cindex file name, and default coding system
@defopt file-coding-system-alist
1403This variable is an alist that specifies the coding systems to use for
1404reading and writing particular files. Each element has the form
1405@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
1406expression that matches certain file names. The element applies to file
1407names that match @var{pattern}.
1408
1409The @sc{cdr} of the element, @var{coding}, should be either a coding
1410system, a cons cell containing two coding systems, or a function name (a
1411symbol with a function definition). If @var{coding} is a coding system,
1412that coding system is used for both reading the file and writing it. If
1413@var{coding} is a cons cell containing two coding systems, its @sc{car}
1414specifies the coding system for decoding, and its @sc{cdr} specifies the
1415coding system for encoding.
1416
1417If @var{coding} is a function name, the function should take one
1418argument, a list of all arguments passed to
1419@code{find-operation-coding-system}. It must return a coding system
1420or a cons cell containing two coding systems. This value has the same
1421meaning as described above.
1422
If @var{coding} (or the value returned by the above function) is
1424@code{undecided}, the normal code-detection is performed.
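
The following sketch shows the first two forms of @var{coding}. The
file name extensions are invented for the example;
@code{japanese-shift-jis}, @code{iso-latin-1} and @code{utf-8} are
standard coding systems.

@example
@group
;; @r{Use one coding system for both reading and writing.}
(add-to-list 'file-coding-system-alist
             '("\\.sjis\\'" . japanese-shift-jis))
;; @r{Decode with @code{iso-latin-1}, but encode with @code{utf-8}.}
(add-to-list 'file-coding-system-alist
             '("\\.legacy\\'" . (iso-latin-1 . utf-8)))
@end group
@end example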
@end defopt

@defopt auto-coding-alist
1428This variable is an alist that specifies the coding systems to use for
1429reading and writing particular files. Its form is like that of
1430@code{file-coding-system-alist}, but, unlike the latter, this variable
1431takes priority over any @code{coding:} tags in the file.
@end defopt
1433
1434@cindex program name, and default coding system
1435@defvar process-coding-system-alist
1436This variable is an alist specifying which coding systems to use for a
1437subprocess, depending on which program is running in the subprocess. It
1438works like @code{file-coding-system-alist}, except that @var{pattern} is
1439matched against the program name used to start the subprocess. The coding
1440system or systems specified in this alist are used to initialize the
1441coding systems used for I/O to the subprocess, but you can specify
1442other coding systems later using @code{set-process-coding-system}.
1443@end defvar
1444
1445 @strong{Warning:} Coding systems such as @code{undecided}, which
1446determine the coding system from the data, do not work entirely reliably
1447with asynchronous subprocess output. This is because Emacs handles
1448asynchronous subprocess output in batches, as it arrives. If the coding
1449system leaves the character code conversion unspecified, or leaves the
1450end-of-line conversion unspecified, Emacs must try to detect the proper
1451conversion from one batch at a time, and this does not always work.
1452
1453 Therefore, with an asynchronous subprocess, if at all possible, use a
1454coding system which determines both the character code conversion and
1455the end of line conversion---that is, one like @code{latin-1-unix},
1456rather than @code{undecided} or @code{latin-1}.
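
  A sketch of such a setting, assuming a hypothetical program named
@file{my-filter} that reads and writes UTF-8 text with Unix line
endings:

@example
@group
(add-to-list 'process-coding-system-alist
             '("my-filter" . (utf-8-unix . utf-8-unix)))
@end group
@end example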
1457
1458@cindex port number, and default coding system
1459@cindex network service name, and default coding system
1460@defvar network-coding-system-alist
1461This variable is an alist that specifies the coding system to use for
1462network streams. It works much like @code{file-coding-system-alist},
1463with the difference that the @var{pattern} in an element may be either a
1464port number or a regular expression. If it is a regular expression, it
1465is matched against the network service name used to open the network
1466stream.
1467@end defvar
1468
1469@defvar default-process-coding-system
1470This variable specifies the coding systems to use for subprocess (and
1471network stream) input and output, when nothing else specifies what to
1472do.
1473
1474The value should be a cons cell of the form @code{(@var{input-coding}
1475. @var{output-coding})}. Here @var{input-coding} applies to input from
1476the subprocess, and @var{output-coding} applies to output to it.
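
For example, assuming that most subprocesses on the system exchange
UTF-8 text with Unix line endings, an init file might contain:

@example
(setq default-process-coding-system '(utf-8-unix . utf-8-unix))
@end example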
1477@end defvar
1478
@cindex default coding system, functions to determine
@defopt auto-coding-functions
1481This variable holds a list of functions that try to determine a
1482coding system for a file based on its undecoded contents.
1483
1484Each function in this list should be written to look at text in the
1485current buffer, but should not modify it in any way. The buffer will
1486contain undecoded text of parts of the file. Each function should
1487take one argument, @var{size}, which tells it how many characters to
1488look at, starting from point. If the function succeeds in determining
1489a coding system for the file, it should return that coding system.
1490Otherwise, it should return @code{nil}.
1491
1492If a file has a @samp{coding:} tag, that takes precedence, so these
1493functions won't be called.
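
Here is a sketch of such a function. The name
@code{my-xml-utf-8-auto-coding} and the regular expression are
illustrative only; the function returns @code{utf-8} when the text at
point starts with an XML declaration that names UTF-8, and @code{nil}
otherwise.

@example
@group
(defun my-xml-utf-8-auto-coding (size)
  "Return `utf-8' if an XML declaration at point names UTF-8."
  (save-excursion
    (save-match-data
      (let ((case-fold-search t)
            (limit (min (point-max) (+ (point) size))))
        ;; @r{Look only at the first @var{size} characters after point.}
        (when (re-search-forward
               "\\=<\\?xml[^>]*encoding=\"utf-8\"" limit t)
          'utf-8)))))

(add-to-list 'auto-coding-functions 'my-xml-utf-8-auto-coding)
@end group
@end example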
@end defopt

@defun find-auto-coding filename size
1497This function tries to determine a suitable coding system for
1498@var{filename}. It examines the buffer visiting the named file, using
1499the variables documented above in sequence, until it finds a match for
1500one of the rules specified by these variables. It then returns a cons
1501cell of the form @code{(@var{coding} . @var{source})}, where
1502@var{coding} is the coding system to use and @var{source} is a symbol,
1503one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist},
1504@code{:coding}, or @code{auto-coding-functions}, indicating which one
1505supplied the matching rule. The value @code{:coding} means the coding
1506system was specified by the @code{coding:} tag in the file
1507(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}).
1508The order of looking for a matching rule is @code{auto-coding-alist}
1509first, then @code{auto-coding-regexp-alist}, then the @code{coding:}
1510tag, and lastly @code{auto-coding-functions}. If no matching rule was
1511found, the function returns @code{nil}.
1512
1513The second argument @var{size} is the size of text, in characters,
1514following point. The function examines text only within @var{size}
1515characters after point. Normally, the buffer should be positioned at
1516the beginning when this function is called, because one of the places
1517for the @code{coding:} tag is the first one or two lines of the file;
1518in that case, @var{size} should be the size of the buffer.
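
The following sketch exercises the @code{coding:} tag rule in a
temporary buffer. The file name @file{example.txt} is made up, and the
result depends on the values of the variables described above, so no
return value is shown.

@example
@group
(with-temp-buffer
  (insert ";; -*- coding: latin-1 -*-\n")
  (goto-char (point-min))
  ;; @r{The @sc{cdr} of a non-@code{nil} result names the matching rule.}
  (find-auto-coding "example.txt" (buffer-size)))
@end group
@end example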
1519@end defun
1520
1521@defun set-auto-coding filename size
1522This function returns a suitable coding system for file
1523@var{filename}. It uses @code{find-auto-coding} to find the coding
1524system. If no coding system could be determined, the function returns
@code{nil}. The argument @var{size} has the same meaning as in
@code{find-auto-coding}.
1527@end defun
1528
1529@defun find-operation-coding-system operation &rest arguments
1530This function returns the coding system to use (by default) for
1531performing @var{operation} with @var{arguments}. The value has this
1532form:
1533
@example
(@var{decoding-system} . @var{encoding-system})
@end example
1537
1538The first element, @var{decoding-system}, is the coding system to use
1539for decoding (in case @var{operation} does decoding), and
1540@var{encoding-system} is the coding system for encoding (in case
1541@var{operation} does encoding).
1542
1543The argument @var{operation} is a symbol; it should be one of
1544@code{write-region}, @code{start-process}, @code{call-process},
1545@code{call-process-region}, @code{insert-file-contents}, or
1546@code{open-network-stream}. These are the names of the Emacs I/O
1547primitives that can do character code and eol conversion.
1548
1549The remaining arguments should be the same arguments that might be given
1550to the corresponding I/O primitive. Depending on the primitive, one
1551of those arguments is selected as the @dfn{target}. For example, if
1552@var{operation} does file I/O, whichever argument specifies the file
1553name is the target. For subprocess primitives, the process name is the
1554target. For @code{open-network-stream}, the target is the service name
1555or port number.
1556
1557Depending on @var{operation}, this function looks up the target in
1558@code{file-coding-system-alist}, @code{process-coding-system-alist},
1559or @code{network-coding-system-alist}. If the target is found in the
1560alist, @code{find-operation-coding-system} returns its association in
1561the alist; otherwise it returns @code{nil}.
1562
1563If @var{operation} is @code{insert-file-contents}, the argument
1564corresponding to the target may be a cons cell of the form
@code{(@var{filename} . @var{buffer})}. In that case, @var{filename}
1566is a file name to look up in @code{file-coding-system-alist}, and
1567@var{buffer} is a buffer that contains the file's contents (not yet
1568decoded). If @code{file-coding-system-alist} specifies a function to
1569call for this file, and that function needs to examine the file's
1570contents (as it usually does), it should examine the contents of
1571@var{buffer} instead of reading the file.
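
As a sketch, the following form binds @code{file-coding-system-alist}
to a single made-up entry and then looks up a file name that matches
it. A successful lookup returns a cons cell of the form shown above.

@example
@group
(let ((file-coding-system-alist
       '(("\\.sjis\\'" . japanese-shift-jis))))
  (find-operation-coding-system 'insert-file-contents "notes.sjis"))
@end group
@end example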
1572@end defun
1573
1574@node Specifying Coding Systems
1575@subsection Specifying a Coding System for One Operation
1576
1577 You can specify the coding system for a specific operation by binding
1578the variables @code{coding-system-for-read} and/or
1579@code{coding-system-for-write}.
1580
1581@defvar coding-system-for-read
1582If this variable is non-@code{nil}, it specifies the coding system to
1583use for reading a file, or for input from a synchronous subprocess.
1584
1585It also applies to any asynchronous subprocess or network stream, but in
1586a different way: the value of @code{coding-system-for-read} when you
1587start the subprocess or open the network stream specifies the input
1588decoding method for that subprocess or network stream. It remains in
1589use for that subprocess or network stream unless and until overridden.
1590
1591The right way to use this variable is to bind it with @code{let} for a
1592specific I/O operation. Its global value is normally @code{nil}, and
1593you should not globally set it to any other value. Here is an example
1594of the right way to use the variable:
1595
@example
;; @r{Read the file with no character code conversion.}
;; @r{Assume @acronym{crlf} represents end-of-line.}
(let ((coding-system-for-read 'emacs-mule-dos))
  (insert-file-contents filename))
@end example
1602
1603When its value is non-@code{nil}, this variable takes precedence over
1604all other methods of specifying a coding system to use for input,
1605including @code{file-coding-system-alist},
1606@code{process-coding-system-alist} and
1607@code{network-coding-system-alist}.
1608@end defvar
1609
1610@defvar coding-system-for-write
1611This works much like @code{coding-system-for-read}, except that it
1612applies to output rather than input. It affects writing to files,
1613as well as sending output to subprocesses and net connections.
1614
1615When a single operation does both input and output, as do
1616@code{call-process-region} and @code{start-process}, both
1617@code{coding-system-for-read} and @code{coding-system-for-write}
1618affect it.
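
By analogy with the example for @code{coding-system-for-read}, here is
a sketch that writes the buffer as UTF-8 with Unix line endings;
@code{filename} is assumed to hold the name of the output file:

@example
@group
(let ((coding-system-for-write 'utf-8-unix))
  (write-region (point-min) (point-max) filename))
@end group
@end example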
1619@end defvar
1620
@defopt inhibit-eol-conversion
1622When this variable is non-@code{nil}, no end-of-line conversion is done,
1623no matter which coding system is specified. This applies to all the
1624Emacs I/O and subprocess primitives, and to the explicit encoding and
1625decoding functions (@pxref{Explicit Encoding}).
@end defopt

@cindex priority order of coding systems
1629@cindex coding systems, priority
1630 Sometimes, you need to prefer several coding systems for some
1631operation, rather than fix a single one. Emacs lets you specify a
1632priority order for using coding systems. This ordering affects the
sorting of lists of coding systems returned by functions such as
1634@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
1635
1636@defun coding-system-priority-list &optional highestp
1637This function returns the list of coding systems in the order of their
1638current priorities. Optional argument @var{highestp}, if
1639non-@code{nil}, means return only the highest priority coding system.
1640@end defun
1641
1642@defun set-coding-system-priority &rest coding-systems
1643This function puts @var{coding-systems} at the beginning of the
1644priority list for coding systems, thus making their priority higher
1645than all the rest.
1646@end defun
1647
1648@defmac with-coding-priority coding-systems &rest body@dots{}
This macro executes @var{body}, like @code{progn} does
1650(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of
1651the priority list for coding systems. @var{coding-systems} should be
1652a list of coding systems to prefer during execution of @var{body}.
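
For example, this sketch temporarily prefers @code{utf-8} and asks
which coding system has the highest priority while @var{body} runs:

@example
@group
(with-coding-priority '(utf-8)
  (coding-system-priority-list t))
@end group
@end example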
1653@end defmac
1654
1655@node Explicit Encoding
1656@subsection Explicit Encoding and Decoding
1657@cindex encoding in coding systems
1658@cindex decoding in coding systems
1659
1660 All the operations that transfer text in and out of Emacs have the
1661ability to use a coding system to encode or decode the text.
1662You can also explicitly encode and decode text using the functions
1663in this section.
1664
1665 The result of encoding, and the input to decoding, are not ordinary
1666text. They logically consist of a series of byte values; that is, a
1667series of @acronym{ASCII} and eight-bit characters. In unibyte
1668buffers and strings, these characters have codes in the range 0
1669through #xFF (255). In a multibyte buffer or string, eight-bit
1670characters have character codes higher than #xFF (@pxref{Text
1671Representations}), but Emacs transparently converts them to their
1672single-byte values when you encode or decode such text.
1673
1674 The usual way to read a file into a buffer as a sequence of bytes, so
1675you can decode the contents explicitly, is with
1676@code{insert-file-contents-literally} (@pxref{Reading from Files});
1677alternatively, specify a non-@code{nil} @var{rawfile} argument when
1678visiting a file with @code{find-file-noselect}. These methods result in
1679a unibyte buffer.
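
  As a sketch of this approach, the following form reads a file as raw
bytes and returns its contents decoded as UTF-8. Here @code{filename}
is assumed to hold the file's name, and @code{utf-8} is an assumption
about how the file is encoded.

@example
@group
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert-file-contents-literally filename)
  ;; @r{A @var{destination} of @code{t} returns the decoded string.}
  (decode-coding-region (point-min) (point-max) 'utf-8 t))
@end group
@end example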
1680
1681 The usual way to use the byte sequence that results from explicitly
1682encoding text is to copy it to a file or process---for example, to write
1683it with @code{write-region} (@pxref{Writing to Files}), and suppress
1684encoding by binding @code{coding-system-for-write} to
1685@code{no-conversion}.
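
  A minimal sketch of that technique, assuming @code{text} holds the
string to save and @code{filename} the name of the output file:

@example
@group
(let ((bytes (encode-coding-string text 'utf-8))
      (coding-system-for-write 'no-conversion))
  ;; @r{When its first argument is a string, @code{write-region}}
  ;; @r{writes that string rather than buffer text.}
  (write-region bytes nil filename))
@end group
@end example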
1686
1687 Here are the functions to perform explicit encoding or decoding. The
1688encoding functions produce sequences of bytes; the decoding functions
1689are meant to operate on sequences of bytes. All of these functions
discard text properties. They also set @code{last-coding-system-used}
to the precise coding system they used.

@deffn Command encode-coding-region start end coding-system &optional destination
This command encodes the text from @var{start} to @var{end} according
1695to coding system @var{coding-system}. Normally, the encoded text
1696replaces the original text in the buffer, but the optional argument
1697@var{destination} can change that. If @var{destination} is a buffer,
1698the encoded text is inserted in that buffer after point (point does
1699not move); if it is @code{t}, the command returns the encoded text as
1700a unibyte string without inserting it.
1701
1702If encoded text is inserted in some buffer, this command returns the
1703length of the encoded text.
1704
1705The result of encoding is logically a sequence of bytes, but the
1706buffer remains multibyte if it was multibyte before, and any 8-bit
1707bytes are converted to their multibyte representation (@pxref{Text
1708Representations}).
1709
1710@cindex @code{undecided} coding-system, when encoding
1711Do @emph{not} use @code{undecided} for @var{coding-system} when
1712encoding text, since that may lead to unexpected results. Instead,
1713use @code{select-safe-coding-system} (@pxref{User-Chosen Coding
1714Systems, select-safe-coding-system}) to suggest a suitable encoding,
1715if there's no obvious pertinent value for @var{coding-system}.
1716@end deffn
1717
@defun encode-coding-string string coding-system &optional nocopy buffer
1719This function encodes the text in @var{string} according to coding
1720system @var{coding-system}. It returns a new string containing the
1721encoded text, except when @var{nocopy} is non-@code{nil}, in which
1722case the function may return @var{string} itself if the encoding
1723operation is trivial. The result of encoding is a unibyte string.
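
For example, encoding a string that contains @samp{@"u} (U+00FC) in
UTF-8 yields the two bytes #xC3 and #xBC, which print as octal escapes
in the unibyte result:

@example
@group
(encode-coding-string "Gr@"uss Gott" 'utf-8)
     @result{} "Gr\303\274ss Gott"
@end group
@end example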
1724@end defun
1725
@deffn Command decode-coding-region start end coding-system &optional destination
This command decodes the text from @var{start} to @var{end} according
1728to coding system @var{coding-system}. To make explicit decoding
1729useful, the text before decoding ought to be a sequence of byte
1730values, but both multibyte and unibyte buffers are acceptable (in the
1731multibyte case, the raw byte values should be represented as eight-bit
1732characters). Normally, the decoded text replaces the original text in
1733the buffer, but the optional argument @var{destination} can change
1734that. If @var{destination} is a buffer, the decoded text is inserted
1735in that buffer after point (point does not move); if it is @code{t},
1736the command returns the decoded text as a multibyte string without
1737inserting it.
1738
1739If decoded text is inserted in some buffer, this command returns the
1740length of the decoded text.
1741
1742This command puts a @code{charset} text property on the decoded text.
1743The value of the property states the character set used to decode the
1744original text.
1745@end deffn
1746
1747@defun decode-coding-string string coding-system &optional nocopy buffer
1748This function decodes the text in @var{string} according to
1749@var{coding-system}. It returns a new string containing the decoded
1750text, except when @var{nocopy} is non-@code{nil}, in which case the
1751function may return @var{string} itself if the decoding operation is
1752trivial. To make explicit decoding useful, the contents of
1753@var{string} ought to be a unibyte string with a sequence of byte
1754values, but a multibyte string is also acceptable (assuming it
1755contains 8-bit bytes in their multibyte form).
1756
1757If optional argument @var{buffer} specifies a buffer, the decoded text
1758is inserted in that buffer after point (point does not move). In this
1759case, the return value is the length of the decoded text.
1760
1761@cindex @code{charset}, text property
1762This function puts a @code{charset} text property on the decoded text.
1763The value of the property states the character set used to decode the
1764original text:
1765
@example
@group
(decode-coding-string "Gr\374ss Gott" 'latin-1)
     @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1))
@end group
@end example
1772@end defun
1773
1774@defun decode-coding-inserted-region from to filename &optional visit beg end replace
This function decodes the text from @var{from} to @var{to} as if
it were being read from file @var{filename} by
@code{insert-file-contents} called with the rest of the arguments
provided.
1778
1779The normal way to use this function is after reading text from a file
1780without decoding, if you decide you would rather have decoded it.
1781Instead of deleting the text and reading it again, this time with
1782decoding, you can call this function.
1783@end defun
1784
1785@node Terminal I/O Encoding
1786@subsection Terminal I/O Encoding
1787
1788 Emacs can decode keyboard input using a coding system, and encode
terminal output. This is useful for terminals that transmit or
display text using a particular encoding such as Latin-1. Emacs does
not set @code{last-coding-system-used} for encoding or decoding of
terminal I/O.

@defun keyboard-coding-system &optional terminal
This function returns the coding system that is in use for decoding
1796keyboard input from @var{terminal}---or @code{nil} if no coding system
1797is to be used for that terminal. If @var{terminal} is omitted or
1798@code{nil}, it means the selected frame's terminal. @xref{Multiple
1799Terminals}.
1800@end defun
1801
1802@deffn Command set-keyboard-coding-system coding-system &optional terminal
1803This command specifies @var{coding-system} as the coding system to use
1804for decoding keyboard input from @var{terminal}. If
1805@var{coding-system} is @code{nil}, that means do not decode keyboard
1806input. If @var{terminal} is a frame, it means that frame's terminal;
1807if it is @code{nil}, that means the currently selected frame's
1808terminal. @xref{Multiple Terminals}.
1809@end deffn
1810
@defun terminal-coding-system &optional terminal
This function returns the coding system that is in use for encoding
1813terminal output from @var{terminal}---or @code{nil} if the output is
1814not encoded. If @var{terminal} is a frame, it means that frame's
1815terminal; if it is @code{nil}, that means the currently selected
1816frame's terminal.
1817@end defun
1818
@deffn Command set-terminal-coding-system coding-system &optional terminal
This command specifies @var{coding-system} as the coding system to use
1821for encoding terminal output from @var{terminal}. If
1822@var{coding-system} is @code{nil}, terminal output is not encoded. If
1823@var{terminal} is a frame, it means that frame's terminal; if it is
1824@code{nil}, that means the currently selected frame's terminal.
1825@end deffn
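
  As a sketch, a program running on a text terminal that is known to
use UTF-8 (an assumption the program must verify by other means) could
set both directions and then inspect the result:

@example
@group
(set-keyboard-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(list (keyboard-coding-system) (terminal-coding-system))
@end group
@end example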
1826
1827@node Input Methods
1828@section Input Methods
1829@cindex input methods
1830
1831 @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
1832characters from the keyboard. Unlike coding systems, which translate
1833non-@acronym{ASCII} characters to and from encodings meant to be read by
1834programs, input methods provide human-friendly commands. (@xref{Input
1835Methods,,, emacs, The GNU Emacs Manual}, for information on how users
1836use input methods to enter text.) How to define input methods is not
1837yet documented in this manual, but here we describe how to use them.
1838
1839 Each input method has a name, which is currently a string;
1840in the future, symbols may also be usable as input method names.
1841
1842@defvar current-input-method
1843This variable holds the name of the input method now active in the
1844current buffer. (It automatically becomes local in each buffer when set
1845in any fashion.) It is @code{nil} if no input method is active in the
1846buffer now.
1847@end defvar
1848
1849@defopt default-input-method
1850This variable holds the default input method for commands that choose an
1851input method. Unlike @code{current-input-method}, this variable is
1852normally global.
1853@end defopt
1854
1855@deffn Command set-input-method input-method
1856This command activates input method @var{input-method} for the current
1857buffer. It also sets @code{default-input-method} to @var{input-method}.
1858If @var{input-method} is @code{nil}, this command deactivates any input
1859method for the current buffer.
1860@end deffn
1861
1862@defun read-input-method-name prompt &optional default inhibit-null
1863This function reads an input method name with the minibuffer, prompting
1864with @var{prompt}. If @var{default} is non-@code{nil}, that is returned
1865by default, if the user enters empty input. However, if
1866@var{inhibit-null} is non-@code{nil}, empty input signals an error.
1867
1868The returned value is a string.
1869@end defun
1870
1871@defvar input-method-alist
1872This variable defines all the supported input methods.
1873Each element defines one input method, and should have the form:
1874
@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example
1879
1880Here @var{input-method} is the input method name, a string;
1881@var{language-env} is another string, the name of the language
1882environment this input method is recommended for. (That serves only for
1883documentation purposes.)
1884
1885@var{activate-func} is a function to call to activate this method. The
1886@var{args}, if any, are passed as arguments to @var{activate-func}. All
1887told, the arguments to @var{activate-func} are @var{input-method} and
1888the @var{args}.
1889
1890@var{title} is a string to display in the mode line while this method is
1891active. @var{description} is a string describing this method and what
1892it is good for.
1893@end defvar
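
  For instance, the following sketch lists the names of the input
methods that are currently defined, then activates one of them if it
is available. The name @code{"latin-1-prefix"} is only an example of
an input method that Emacs installations commonly provide.

@example
@group
(mapcar #'car input-method-alist)

(when (assoc "latin-1-prefix" input-method-alist)
  (set-input-method "latin-1-prefix"))
@end group
@end example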
1894
1895 The fundamental interface to input methods is through the
1896variable @code{input-method-function}. @xref{Reading One Event},
1897and @ref{Invoking the Input Method}.
1898
1899@node Locales
1900@section Locales
1901@cindex locale
1902
  POSIX defines a concept of ``locales'', which control the language
used by language-related features. These Emacs variables control
how Emacs interacts with these features.
1906
1907@defvar locale-coding-system
1908@cindex keyboard input decoding on X
1909This variable specifies the coding system to use for decoding system
1910error messages and---on X Window system only---keyboard input, for
1911encoding the format argument to @code{format-time-string}, and for
1912decoding the return value of @code{format-time-string}.
1913@end defvar
1914
1915@defvar system-messages-locale
1916This variable specifies the locale to use for generating system error
1917messages. Changing the locale can cause messages to come out in a
1918different language or in a different orthography. If the variable is
1919@code{nil}, the locale is specified by environment variables in the
1920usual POSIX fashion.
1921@end defvar
1922
1923@defvar system-time-locale
1924This variable specifies the locale to use for formatting time values.
1925Changing the locale can cause messages to appear according to the
1926conventions of a different language. If the variable is @code{nil}, the
1927locale is specified by environment variables in the usual POSIX fashion.
1928@end defvar
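
  As a sketch, binding @code{system-time-locale} around a call to
@code{format-time-string} requests the day name in the conventions of
the @samp{C} locale, regardless of the user's environment:

@example
@group
(let ((system-time-locale "C"))
  (format-time-string "%A"))
@end group
@end example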
1929
1930@defun locale-info item
1931This function returns locale data @var{item} for the current POSIX
1932locale, if available. @var{item} should be one of these symbols:
1933
1934@table @code
1935@item codeset
1936Return the character set as a string (locale item @code{CODESET}).
1937
1938@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).
1941
1942@item months
1943Return a 12-element vector of month names (locale items @code{MON_1}
1944through @code{MON_12}).
1945
1946@item paper
1947Return a list @code{(@var{width} @var{height})} for the default paper
1948size measured in millimeters (locale items @code{PAPER_WIDTH} and
1949@code{PAPER_HEIGHT}).
1950@end table
1951
1952If the system can't provide the requested information, or if
1953@var{item} is not one of those symbols, the value is @code{nil}. All
1954strings in the return value are decoded using
1955@code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual},
1956for more information about locales and locale items.
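
Because any item can be unavailable, callers should be prepared for a
@code{nil} result, as in this sketch:

@example
@group
(let ((days (locale-info 'days)))
  ;; @r{@code{days} is either @code{nil} or a 7-element vector.}
  (when days
    (aref days 0)))
@end group
@end example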
1957@end defun