@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998-1999, 2001-2013 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@node Non-ASCII Characters
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to characters and
how they are stored in strings and buffers.

@menu
* Text Representations::    How Emacs represents text.
* Disabling Multibyte::     Controlling whether to use multibyte characters.
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::         How unibyte and multibyte relate to
                                codes of individual characters.
* Character Properties::    Character attributes that define their
                                behavior and handling.
* Character Sets::          The space of possible character codes
                                is divided into various character sets.
* Scanning Charsets::       Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::          Coding systems are conversions for saving files.
* Input Methods::           Input methods allow users to enter various
                                non-ASCII characters without special keyboards.
* Locales::                 Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representation

  Emacs buffers and strings support a large repertoire of characters
from many different scripts, allowing users to type and display text
in almost any known written language.

@cindex character codepoint
@cindex codespace
@cindex Unicode
  To support this multitude of characters and scripts, Emacs closely
follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
unique number, called a @dfn{codepoint}, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
inclusive. Emacs extends this range with codepoints in the range
@code{#x110000..#x3FFFFF}, which it uses for representing characters
that are not unified with Unicode and @dfn{raw 8-bit bytes} that
cannot be interpreted as characters. Thus, a character codepoint in
Emacs is a 22-bit integer number.

@cindex internal representation of characters
@cindex characters, representation in buffers and strings
@cindex multibyte text
  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint@footnote{
This internal representation is based on one of the encodings defined
by the Unicode Standard, called @dfn{UTF-8}, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
codepoints it uses for raw 8-bit bytes and characters not unified with
Unicode.}. For example, any @acronym{ASCII} character takes up only 1
byte, a Latin-1 character takes up 2 bytes, etc. We call this
representation of text @dfn{multibyte}.
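
  As a rough illustration of the variable-length representation
described above, the following sketch compares the size in bytes of a
few one-character strings, using @code{string-bytes}. The particular
characters are chosen here only for illustration.

@example
@group
;; An ASCII character occupies a single byte.
(string-bytes "a")
     @result{} 1
@end group
@group
;; U+00E9, a Latin-1 character, occupies two bytes.
(string-bytes "\u00e9")
     @result{} 2
@end group
@group
;; U+4E2D, a CJK ideograph, occupies three bytes.
(string-bytes "\u4e2d")
     @result{} 3
@end group
@group
;; In characters, each of these strings still has length one.
(length "\u4e2d")
     @result{} 1
@end group
@end example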

  Outside Emacs, characters can be represented in many different
encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
between these external encodings and its internal representation, as
appropriate, when it reads text into a buffer or a string, or when it
writes text to a disk file or passes it to some other process.

  Occasionally, Emacs needs to hold and manipulate encoded text or
binary non-text data in its buffers or strings. For example, when
Emacs visits a file, it first reads the file's text verbatim into a
buffer, and only then converts it to the internal representation.
Before the conversion, the buffer holds encoded text.

@cindex unibyte text
  Encoded text is not really text, as far as Emacs is concerned, but
rather a sequence of raw 8-bit bytes. We call buffers and strings
that hold encoded text @dfn{unibyte} buffers and strings, because
Emacs treats them as a sequence of individual bytes. Usually, Emacs
displays unibyte buffers and strings as octal codes such as
@code{\237}. We recommend that you never use unibyte buffers and
strings except for manipulating encoded text or binary non-text data.

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte encoded text or binary non-text data.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defun position-bytes position
Buffer positions are measured in character units. This function
returns the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
Return the buffer position, in character units, corresponding to the
given @var{byte-position} in the current buffer. If @var{byte-position}
is out of range, the value is @code{nil}. In a multibyte buffer, an
arbitrary value of @var{byte-position} may fall not on a character
boundary, but inside a multibyte sequence representing a single
character; in this case, this function returns the buffer position of
the character whose multibyte sequence includes @var{byte-position}.
In other words, the value does not change for all byte positions that
belong to the same character.
@end defun
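
  The relation between @code{position-bytes} and
@code{byte-to-position} can be sketched as follows. This example
assumes a temporary multibyte buffer, which is what
@code{with-temp-buffer} normally creates, and a two-character text
chosen only for illustration.

@example
@group
(with-temp-buffer
  ;; Insert U+00E9 (two bytes) followed by "a" (one byte).
  (insert "\u00e9a")
  (list (position-bytes 1)      ; start of the buffer
        (position-bytes 2)      ; just after the two-byte character
        (byte-to-position 2)    ; byte 2 lies inside U+00E9
        (byte-to-position 3)))  ; byte 3 is the start of "a"
     @result{} (1 3 1 2)
@end group
@end example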

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string, @code{nil}
otherwise. This function also returns @code{nil} if @var{string} is
some object other than a string.
@end defun

@defun string-bytes string
@cindex string, number of bytes
This function returns the number of bytes in @var{string}.
If @var{string} is a multibyte string, this can be greater than
@code{(length @var{string})}.
@end defun

@defun unibyte-string &rest bytes
This function concatenates all its arguments @var{bytes} and makes the
result a unibyte string.
@end defun
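
  As a small sketch of how the string functions above fit together,
the following builds a unibyte string from two byte values and
compares it with a multibyte string literal. The byte values are
arbitrary and used only for illustration.

@example
@group
;; Two raw bytes, given as integers in the range 0..255.
(multibyte-string-p (unibyte-string 195 169))
     @result{} nil
@end group
@group
(string-bytes (unibyte-string 195 169))
     @result{} 2
@end group
@group
;; A string literal with a Unicode escape is multibyte.
(multibyte-string-p "\u00e9")
     @result{} t
@end group
@group
;; A non-string argument yields nil rather than an error.
(multibyte-string-p 42)
     @result{} nil
@end group
@end example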

@node Disabling Multibyte
@section Disabling Multibyte Characters
@cindex disabling multibyte

  By default, Emacs starts in multibyte mode: it stores the contents
of buffers and strings using an internal encoding that represents
non-@acronym{ASCII} characters using multi-byte sequences. Multibyte
mode allows you to use all the supported languages and scripts without
limitations.

@cindex turn multibyte support on or off
  Under very special circumstances, you may want to disable multibyte
character support for a specific buffer.
When multibyte characters are disabled in a buffer, we call
that @dfn{unibyte mode}. In unibyte mode, each character in the
buffer has a character code ranging from 0 through 255 (0377 octal); 0
through 127 (0177 octal) represent @acronym{ASCII} characters, and 128
(0200 octal) through 255 (0377 octal) represent non-@acronym{ASCII}
characters.

  To edit a particular file in unibyte representation, visit it using
@code{find-file-literally}. @xref{Visiting Functions}. You can
convert a multibyte buffer to unibyte by saving it to a file, killing
the buffer, and visiting the file again with
@code{find-file-literally}. Alternatively, you can use @kbd{C-x
@key{RET} c} (@code{universal-coding-system-argument}) and specify
@samp{raw-text} as the coding system with which to visit or save a
file. @xref{Text Coding, , Specifying a Coding System for File Text,
emacs, GNU Emacs Manual}. Unlike @code{find-file-literally}, finding
a file as @samp{raw-text} doesn't disable format conversion,
uncompression, or auto mode selection.

@c See http://debbugs.gnu.org/11226 for lack of unibyte tooltip.
@vindex enable-multibyte-characters
The buffer-local variable @code{enable-multibyte-characters} is
non-@code{nil} in multibyte buffers, and @code{nil} in unibyte ones.
The mode line also indicates whether a buffer is multibyte or not.
With a graphical display, in a multibyte buffer, the portion of the
mode line that indicates the character set has a tooltip that (amongst
other things) says that the buffer is multibyte. In a unibyte buffer,
the character set indicator is absent. Thus, in a unibyte buffer
(when using a graphical display) there is normally nothing before the
indication of the visited file's end-of-line convention (colon,
backslash, etc.), unless you are using an input method.

@findex toggle-enable-multibyte-characters
You can turn off multibyte support in a specific buffer by invoking the
command @code{toggle-enable-multibyte-characters} in that buffer.

@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, provided that the multibyte text contains
only @acronym{ASCII} and 8-bit raw bytes. In general, these
conversions happen when inserting text into a buffer, or when putting
text from several strings together in one string. You can also
explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text from
which it is constructed. The general rule is to convert unibyte text
to multibyte text when combining it with other multibyte text, because
the multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text. The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII}
characters unchanged, and converts bytes with codes 128 through 255 to
the multibyte representation of raw eight-bit bytes.

  Converting multibyte text to unibyte converts all @acronym{ASCII}
and eight-bit characters to their single-byte form, but loses
information for non-@acronym{ASCII} characters by discarding all but
the low 8 bits of each character's codepoint. Converting unibyte text
to multibyte and back to unibyte reproduces the original unibyte text.

The next two functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of characters as @var{string}. If @var{string} is a multibyte string,
it is returned unchanged. The function assumes that @var{string}
includes only @acronym{ASCII} characters and raw 8-bit bytes; the
latter are converted to their multibyte representation corresponding
to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
(@pxref{Text Representations, codepoints}).
@end defun

@defun string-to-unibyte string
This function returns a unibyte string containing the same sequence of
characters as @var{string}. It signals an error if @var{string}
contains a non-@acronym{ASCII} character other than an eight-bit
character. If @var{string} is a unibyte string, it is returned
unchanged. Use this function for @var{string} arguments that contain
only @acronym{ASCII} and eight-bit characters.
@end defun
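
  The round trip described above (unibyte to multibyte and back) can
be sketched with the two functions just described. The variable names
@code{uni} and @code{multi} and the byte values are chosen only for
this illustration.

@example
@group
(let* ((uni (unibyte-string ?A 200))
       (multi (string-to-multibyte uni)))
  (list (multibyte-string-p multi)
        ;; Byte 200 (#xC8) becomes the eight-bit character #x3FFFC8.
        (= (aref multi 1) #x3fffc8)
        ;; Converting back reproduces the original unibyte text.
        (equal (string-to-unibyte multi) uni)))
     @result{} (t t t)
@end group
@end example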

@defun byte-to-string byte
@cindex byte to string
This function returns a unibyte string containing a single byte of
character data, @var{byte}. It signals an error if
@var{byte} is not an integer between 0 and 255.
@end defun

@defun multibyte-char-to-unibyte char
This converts the multibyte character @var{char} to a unibyte
character, and returns that character. If @var{char} is neither
@acronym{ASCII} nor eight-bit, the function returns -1.
@end defun

@defun unibyte-char-to-multibyte char
This converts the unibyte character @var{char} to a multibyte
character, assuming @var{char} is either an @acronym{ASCII} character
or a raw 8-bit byte.
@end defun
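
  The following sketch exercises the single-character conversions
above. The character and byte values are arbitrary examples; the
expected results follow from the codepoint ranges given in this
chapter.

@example
@group
(unibyte-char-to-multibyte ?A)
     @result{} 65
@end group
@group
;; Byte #xE9 maps to the eight-bit character #x3FFFE9.
(= (unibyte-char-to-multibyte #xe9) #x3fffe9)
     @result{} t
@end group
@group
(multibyte-char-to-unibyte #x3fffe9)
     @result{} 233
@end group
@group
;; U+03BB is neither ASCII nor eight-bit.
(multibyte-char-to-unibyte ?\u03bb)
     @result{} -1
@end group
@group
(byte-to-string 65)
     @result{} "A"
@end group
@end example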

@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer. If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents
viewed as characters; for instance, a sequence of three bytes which is
treated as one character in multibyte representation will count as
three characters in unibyte representation. Eight-bit characters
representing raw bytes are an exception. They are represented by one
byte in a unibyte buffer, but when the buffer is set to multibyte,
they are converted to two-byte sequences, and vice versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

This function signals an error if the buffer is narrowed, since the
narrowing might have occurred in the middle of multibyte character
sequences.

This function also signals an error if the buffer is an indirect
buffer. An indirect buffer always inherits the representation of its
base buffer.
@end defun
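
  As a minimal sketch of the effect of @code{set-buffer-multibyte},
the following counts the characters of a temporary buffer before and
after switching it to unibyte. The variable name @code{chars-before}
and the inserted text are used only for illustration.

@example
@group
(with-temp-buffer
  ;; One multibyte character, U+00E9, stored as two bytes.
  (insert "\u00e9")
  (let ((chars-before (buffer-size)))
    (set-buffer-multibyte nil)
    ;; The same two bytes now count as two unibyte characters.
    (list chars-before (buffer-size) enable-multibyte-characters)))
     @result{} (1 2 nil)
@end group
@end example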

@defun string-as-unibyte string
If @var{string} is already a unibyte string, this function returns
@var{string} itself. Otherwise, it returns a new string with the same
bytes as @var{string}, but treating each byte as a separate character
(so that the value may have more characters than @var{string}); as an
exception, each eight-bit character representing a raw byte is
converted into a single byte. The newly-created string contains no
text properties.
@end defun

@defun string-as-multibyte string
If @var{string} is a multibyte string, this function returns
@var{string} itself. Otherwise, it returns a new string with the same
bytes as @var{string}, but treating each multibyte sequence as one
character. This means that the value may have fewer characters than
@var{string} has. If a byte sequence in @var{string} is invalid as a
multibyte representation of a single character, each byte in the
sequence is treated as a raw 8-bit byte. The newly-created string
contains no text properties.
@end defun

@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different
character codes. The valid character codes for unibyte representation
range from 0 to @code{#xFF} (255)---the values that can fit in one
byte. The valid character codes for multibyte representation range
from 0 to @code{#x3FFFFF}. In this code space, values 0 through
@code{#x7F} (127) are for @acronym{ASCII} characters, and values
@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
non-@acronym{ASCII} characters.

  Emacs character codes are a superset of the Unicode standard.
Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
characters of the same codepoint; values @code{#x110000} (1114112)
through @code{#x3FFF7F} (4194175) represent characters that are not
unified with Unicode; and values @code{#x3FFF80} (4194176) through
@code{#x3FFFFF} (4194303) represent eight-bit raw bytes.

@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@code{nil} otherwise.

@example
@group
(characterp 65)
     @result{} t
@end group
@group
(characterp 4194303)
     @result{} t
@end group
@group
(characterp 4194304)
     @result{} nil
@end group
@end example
@end defun

@cindex maximum value of character codepoint
@cindex codepoint, largest value
@defun max-char
This function returns the largest value that a valid character
codepoint can have.

@example
@group
(characterp (max-char))
     @result{} t
@end group
@group
(characterp (1+ (max-char)))
     @result{} nil
@end group
@end example
@end defun

@defun get-byte &optional pos string
This function returns the byte at character position @var{pos} in the
current buffer. If the current buffer is unibyte, this is literally
the byte at that position. If the buffer is multibyte, byte values of
@acronym{ASCII} characters are the same as character codepoints,
whereas eight-bit raw bytes are converted to their 8-bit codes. The
function signals an error if the character at @var{pos} is
non-@acronym{ASCII}.

The optional argument @var{string} means to get a byte value from that
string instead of the current buffer.
@end defun
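
  A short sketch of @code{get-byte} on both a string and a buffer
follows. Note the assumption, based on the usual Emacs conventions,
that string positions count from 0 while buffer positions count
from 1.

@example
@group
(get-byte 0 "A")
     @result{} 65
@end group
@group
;; An eight-bit character in a multibyte string yields its raw byte.
(get-byte 0 (string-to-multibyte (unibyte-string 200)))
     @result{} 200
@end group
@group
(with-temp-buffer
  (insert "A")
  (get-byte 1))
     @result{} 65
@end group
@end example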

@node Character Properties
@section Character Properties
@cindex character properties
A @dfn{character property} is a named attribute of a character that
specifies how the character behaves and how it should be handled
during text processing and display. Thus, character properties are an
important part of specifying the character's semantics.

  On the whole, Emacs follows the Unicode Standard in its implementation
of character properties. In particular, Emacs supports the
@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
Model}, and the Emacs character property database is derived from the
Unicode Character Database (@acronym{UCD}). See the
@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
Properties chapter of the Unicode Standard}, for a detailed
description of Unicode character properties and their meaning. This
section assumes you are already familiar with that chapter of the
Unicode Standard, and want to apply that knowledge to Emacs Lisp
programs.

  In Emacs, each property has a name, which is a symbol, and a set of
possible values, whose types depend on the property; if a character
does not have a certain property, the value is @code{nil}. As a
general rule, the names of character properties in Emacs are produced
from the corresponding Unicode properties by downcasing them and
replacing each @samp{_} character with a dash @samp{-}. For example,
@code{Canonical_Combining_Class} becomes
@code{canonical-combining-class}. However, sometimes we shorten the
names to make their use easier.

@cindex unassigned character codepoints
  Some codepoints are left @dfn{unassigned} by the
@acronym{UCD}---they don't correspond to any character. The Unicode
Standard defines default values of properties for such codepoints;
they are mentioned below for each property.

  Here is the full list of value types for all the character
properties that Emacs knows about:

@table @code
@item name
Corresponds to the @code{Name} Unicode property. The value is a
string consisting of upper-case Latin letters A to Z, digits, spaces,
and hyphen @samp{-} characters. For unassigned codepoints, the value
is an empty string.

@cindex unicode general category
@item general-category
Corresponds to the @code{General_Category} Unicode property. The
value is a symbol whose name is a 2-letter abbreviation of the
character's classification. For unassigned codepoints, the value
is @code{Cn}.

@item canonical-combining-class
Corresponds to the @code{Canonical_Combining_Class} Unicode property.
The value is an integer number. For unassigned codepoints, the value
is zero.

@cindex bidirectional class of characters
@item bidi-class
Corresponds to the Unicode @code{Bidi_Class} property. The value is a
symbol whose name is the Unicode @dfn{directional type} of the
character. Emacs uses this property when it reorders bidirectional
text for display (@pxref{Bidirectional Display}). For unassigned
codepoints, the value depends on the code blocks to which the
codepoint belongs: most unassigned codepoints get the value of
@code{L} (strong L), but some get values of @code{AL} (Arabic letter)
or @code{R} (strong R).

@item decomposition
Corresponds to the Unicode properties @code{Decomposition_Type} and
@code{Decomposition_Value}. The value is a list, whose first element
may be a symbol representing a compatibility formatting tag, such as
@code{small}@footnote{The Unicode specification writes these tag names
inside @samp{<..>} brackets, but the tag names in Emacs do not include
the brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
@samp{small}. }; the other elements are characters that give the
compatibility decomposition sequence of this character. For
unassigned codepoints, the value is the character itself.

@item decimal-digit-value
Corresponds to the Unicode @code{Numeric_Value} property for
characters whose @code{Numeric_Type} is @samp{Decimal}. The value is an
integer number. For unassigned codepoints, the value is @code{nil},
which means @acronym{NaN}, or ``not-a-number''.

@item digit-value
Corresponds to the Unicode @code{Numeric_Value} property for
characters whose @code{Numeric_Type} is @samp{Digit}. The value is
an integer number. Examples of such characters include compatibility
subscript and superscript digits, for which the value is the
corresponding number. For unassigned codepoints, the value is
@code{nil}, which means @acronym{NaN}.

@item numeric-value
Corresponds to the Unicode @code{Numeric_Value} property for
characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
this property is an integer or a floating-point number. Examples of
characters that have this property include fractions, subscripts,
superscripts, Roman numerals, currency numerators, and encircled
numbers. For example, the value of this property for the character
@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}. For
unassigned codepoints, the value is @code{nil}, which means
@acronym{NaN}.

@cindex mirroring of characters
@item mirrored
Corresponds to the Unicode @code{Bidi_Mirrored} property. The value
of this property is a symbol, either @code{Y} or @code{N}. For
unassigned codepoints, the value is @code{N}.

@item mirroring
Corresponds to the Unicode @code{Bidi_Mirroring_Glyph} property. The
value of this property is a character whose glyph represents the
mirror image of the character's glyph, or @code{nil} if there's no
defined mirroring glyph. All the characters whose @code{mirrored}
property is @code{N} have @code{nil} as their @code{mirroring}
property; however, some characters whose @code{mirrored} property is
@code{Y} also have @code{nil} for @code{mirroring}, because no
appropriate characters exist with mirrored glyphs. Emacs uses this
property to display mirror images of characters when appropriate
(@pxref{Bidirectional Display}). For unassigned codepoints, the value
is @code{nil}.

@item old-name
Corresponds to the Unicode @code{Unicode_1_Name} property. The value
is a string. For unassigned codepoints, the value is an empty string.

@item iso-10646-comment
Corresponds to the Unicode @code{ISO_Comment} property. The value is
a string. For unassigned codepoints, the value is an empty string.

@item uppercase
Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
The value of this property is a single character. For unassigned
codepoints, the value is @code{nil}, which means the character itself.

@item lowercase
Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
The value of this property is a single character. For unassigned
codepoints, the value is @code{nil}, which means the character itself.

@item titlecase
Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
@dfn{Title case} is a special form of a character used when the first
character of a word needs to be capitalized. The value of this
property is a single character. For unassigned codepoints, the value
is @code{nil}, which means the character itself.
@end table

@defun get-char-code-property char propname
This function returns the value of @var{char}'s @var{propname} property.

@example
@group
(get-char-code-property ?\s 'general-category)
     @result{} Zs
@end group
@group
(get-char-code-property ?1 'general-category)
     @result{} Nd
@end group
@group
;; subscript 4
(get-char-code-property ?\u2084 'digit-value)
     @result{} 4
@end group
@group
;; one fifth
(get-char-code-property ?\u2155 'numeric-value)
     @result{} 0.2
@end group
@group
;; Roman IV
(get-char-code-property ?\u2163 'numeric-value)
     @result{} 4
@end group
@end example
@end defun
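
  A few further calls to @code{get-char-code-property} illustrate the
other value types listed in the table above (string, symbol, and
character). The expected values shown here assume the Unicode data
normally shipped with Emacs; treat them as a sketch rather than a
guarantee for every Emacs version.

@example
@group
(get-char-code-property ?a 'name)
     @result{} "LATIN SMALL LETTER A"
@end group
@group
(get-char-code-property ?A 'bidi-class)
     @result{} L
@end group
@group
(get-char-code-property ?\( 'mirrored)
     @result{} Y
@end group
@group
;; The mirror image of a left parenthesis is a right parenthesis.
(= (get-char-code-property ?\( 'mirroring) ?\))
     @result{} t
@end group
@group
(= (get-char-code-property ?a 'uppercase) ?A)
     @result{} t
@end group
@end example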

@defun char-code-property-description prop value
This function returns the description string of property @var{prop}'s
@var{value}, or @code{nil} if @var{value} has no description.

@example
@group
(char-code-property-description 'general-category 'Zs)
     @result{} "Separator, Space"
@end group
@group
(char-code-property-description 'general-category 'Nd)
     @result{} "Number, Decimal Digit"
@end group
@group
(char-code-property-description 'numeric-value '1/5)
     @result{} nil
@end group
@end example
@end defun

@defun put-char-code-property char propname value
This function stores @var{value} as the value of the property
@var{propname} for the character @var{char}.
@end defun

@defvar unicode-category-table
The value of this variable is a char-table (@pxref{Char-Tables}) that
specifies, for each character, its Unicode @code{General_Category}
property as a symbol.
@end defvar

@defvar char-script-table
The value of this variable is a char-table that specifies, for each
character, a symbol whose name is the script to which the character
belongs, according to the Unicode Standard classification of the
Unicode code space into script-specific blocks. This char-table has a
single extra slot whose value is the list of all script symbols.
@end defvar

@defvar char-width-table
The value of this variable is a char-table that specifies the width of
each character in columns that it will occupy on the screen.
@end defvar

@defvar printable-chars
The value of this variable is a char-table that specifies, for each
character, whether it is printable or not. That is, if evaluating
@code{(aref printable-chars char)} results in @code{t}, the character
is printable, and if it results in @code{nil}, it is not.
@end defvar
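
  Since the variables above are char-tables, they can be queried
directly with @code{aref}. The following sketch assumes the default
tables set up by Emacs at startup; the characters looked up are
arbitrary examples.

@example
@group
(aref unicode-category-table ?A)
     @result{} Lu
@end group
@group
(aref char-script-table ?a)
     @result{} latin
@end group
@group
(aref printable-chars ?a)
     @result{} t
@end group
@group
;; The extra slot of char-script-table lists all script symbols.
(and (memq 'greek (char-table-extra-slot char-script-table 0)) t)
     @result{} t
@end group
@end example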
626
b8d4c8d0
GM
627@node Character Sets
628@section Character Sets
629@cindex character sets
630
031c41de
EZ
631@cindex charset
632@cindex coded character set
633An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
634in which each character is assigned a numeric code point. (The
434843ec 635Unicode Standard calls this a @dfn{coded character set}.) Each Emacs
031c41de
EZ
636charset has a name which is a symbol. A single character can belong
637to any number of different character sets, but it will generally have
638a different code point in each charset. Examples of character sets
639include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
640@code{windows-1255}. The code point assigned to a character in a
641charset is usually different from its code point used in Emacs buffers
642and strings.
643
644@cindex @code{emacs}, a charset
645@cindex @code{unicode}, a charset
646@cindex @code{eight-bit}, a charset
647 Emacs defines several special character sets. The character set
648@code{unicode} includes all the characters whose Emacs code points are
85eeac93 649in the range @code{0..#x10FFFF}. The character set @code{emacs}
031c41de
EZ
650includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
651Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
652Emacs uses it to represent raw bytes encountered in text.
b8d4c8d0
GM
653
654@defun charsetp object
655Returns @code{t} if @var{object} is a symbol that names a character set,
656@code{nil} otherwise.
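
A quick check, assuming that no charset named @code{no-such-charset}
(a made-up name used only for illustration) has been defined:

@example
(charsetp 'ascii)
     @result{} t
(charsetp 'no-such-charset)
     @result{} nil
@end example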
657@end defun
658
659@defvar charset-list
660The value is a list of all defined character set names.
661@end defvar
662
031c41de 663@defun charset-priority-list &optional highestp
73e0cbc0 664This function returns a list of all defined character sets ordered by
665their priority. If @var{highestp} is non-@code{nil}, the function
666returns a single character set of the highest priority.
667@end defun
668
669@defun set-charset-priority &rest charsets
670This function makes @var{charsets} the highest priority character sets.
671@end defun
672
106e6894 673@defun char-charset character &optional restriction
674This function returns the name of the character set of highest
675priority that @var{character} belongs to. @acronym{ASCII} characters
676are an exception: for them, this function always returns @code{ascii}.
677
678If @var{restriction} is non-@code{nil}, it should be a list of
679charsets to search. Alternatively, it can be a coding system, in
680which case the returned charset must be supported by that coding
681system (@pxref{Coding Systems}).
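
As a brief illustration, here is a sketch of typical calls.  The
results for the non-@acronym{ASCII} character are not shown, because
they depend on the current charset priorities:

@example
(char-charset ?a)
     @result{} ascii
;; #xE9 is LATIN SMALL LETTER E WITH ACUTE; the charset returned
;; depends on the charset priority list.
(char-charset #xE9)
;; Restrict the search to an explicit list of charsets.
(char-charset #xE9 '(iso-8859-1 unicode))
@end example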
682@end defun
683
684@defun charset-plist charset
685This function returns the property list of the character set
686@var{charset}. Although @var{charset} is a symbol, this is not the
687same as the property list of that symbol. Charset properties include
688important information about the charset, such as its documentation
689string, short name, etc.
690@end defun
691
692@defun put-charset-property charset propname value
693This function sets the @var{propname} property of @var{charset} to the
694given @var{value}.
695@end defun
696
697@defun get-charset-property charset propname
This function returns the value of @var{charset}'s property
699@var{propname}.
700@end defun
701
702@deffn Command list-charset-chars charset
703This command displays a list of characters in the character set
704@var{charset}.
705@end deffn
b8d4c8d0 706
707 Emacs can convert between its internal representation of a character
708and the character's codepoint in a specific charset. The following
709two functions support these conversions.
710
711@c FIXME: decode-char and encode-char accept and ignore an additional
712@c argument @var{restriction}. When that argument actually makes a
713@c difference, it should be documented here.
714@defun decode-char charset code-point
715This function decodes a character that is assigned a @var{code-point}
716in @var{charset}, to the corresponding Emacs character, and returns
717it. If @var{charset} doesn't contain a character of that code point,
718the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp
719integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
720specified as a cons cell @code{(@var{high} . @var{low})}, where
721@var{low} are the lower 16 bits of the value and @var{high} are the
722high 16 bits.
723@end defun
724
725@defun encode-char char charset
726This function returns the code point assigned to the character
727@var{char} in @var{charset}. If the result does not fit in a Lisp
728integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
729that fits the second argument of @code{decode-char} above. If
730@var{charset} doesn't have a codepoint for @var{char}, the value is
731@code{nil}.
732@end defun
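
  As a sketch of how @code{decode-char} and @code{encode-char} mirror
each other, the following uses the @code{iso-8859-1} charset, whose
code points coincide with the first 256 Unicode code points:

@example
(decode-char 'iso-8859-1 #xE9)
     @result{} 233
(encode-char 233 'iso-8859-1)
     @result{} 233
;; GREEK SMALL LETTER ALPHA (#x3B1) is not in iso-8859-1.
(encode-char #x3B1 'iso-8859-1)
     @result{} nil
@end example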
733
734 The following function comes in handy for applying a certain
735function to all or part of the characters in a charset:
736
85eeac93 737@defun map-charset-chars function charset &optional arg from-code to-code
738Call @var{function} for characters in @var{charset}. @var{function}
739is called with two arguments. The first one is a cons cell
740@code{(@var{from} . @var{to})}, where @var{from} and @var{to}
741indicate a range of characters contained in charset. The second
85eeac93 742argument passed to @var{function} is @var{arg}.
743
744By default, the range of codepoints passed to @var{function} includes
745all the characters in @var{charset}, but optional arguments
746@var{from-code} and @var{to-code} limit that to the range of
747characters between these two codepoints of @var{charset}. If either
748of them is @code{nil}, it defaults to the first or last codepoint of
749@var{charset}, respectively.
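
For example, this sketch collects the ranges that @var{function}
receives for the @code{ascii} charset.  The variable name
@code{ranges} is only illustrative, and how Emacs splits the
characters into ranges is not specified here, so no result is shown:

@example
(let ((ranges nil))
  (map-charset-chars
   (lambda (range _arg)
     ;; RANGE is a cons cell (FROM . TO).
     (push range ranges))
   'ascii)
  (nreverse ranges))
@end example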
750@end defun
751
752@node Scanning Charsets
753@section Scanning for Character Sets
754
755 Sometimes it is useful to find out which character set a particular
756character belongs to. One use for this is in determining which coding
757systems (@pxref{Coding Systems}) are capable of representing all of
758the text in question; another is to determine the font(s) for
759displaying that text.
760
761@defun charset-after &optional pos
031c41de 762This function returns the charset of highest priority containing the
97d8273f 763character at position @var{pos} in the current buffer. If @var{pos}
764is omitted or @code{nil}, it defaults to the current value of point.
765If @var{pos} is out of range, the value is @code{nil}.
766@end defun
767
768@defun find-charset-region beg end &optional translation
031c41de 769This function returns a list of the character sets of highest priority
8b80cdf5 770that contain characters in the current buffer between positions
031c41de 771@var{beg} and @var{end}.
b8d4c8d0 772
773The optional argument @var{translation} specifies a translation table
774to use for scanning the text (@pxref{Translation of Characters}). If
775it is non-@code{nil}, then each character in the region is translated
776through this table, and the value returned describes the translated
777characters instead of the characters actually in the buffer.
778@end defun
779
780@defun find-charset-string string &optional translation
97d8273f 781This function returns a list of character sets of highest priority
782that contain characters in @var{string}. It is just like
783@code{find-charset-region}, except that it applies to the contents of
784@var{string} instead of part of the current buffer.
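
For example (the second result depends on the charset priorities, so
it is not shown):

@example
(find-charset-string "abc")
     @result{} (ascii)
(find-charset-string "caf\u00e9")
@end example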
785@end defun
786
787@node Translation of Characters
788@section Translation of Characters
789@cindex character translation tables
790@cindex translation tables
791
792 A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
793specifies a mapping of characters into characters. These tables are
794used in encoding and decoding, and for other purposes. Some coding
795systems specify their own particular translation tables; there are
796also default translation tables which apply to all other coding
797systems.
b8d4c8d0 798
799 A translation table has two extra slots. The first is either
800@code{nil} or a translation table that performs the reverse
801translation; the second is the maximum number of characters to look up
802for translating sequences of characters (see the description of
803@code{make-translation-table-from-alist} below).
804
805@defun make-translation-table &rest translations
806This function returns a translation table based on the argument
807@var{translations}. Each element of @var{translations} should be a
808list of elements of the form @code{(@var{from} . @var{to})}; this says
809to translate the character @var{from} into @var{to}.
810
811The arguments and the forms in each argument are processed in order,
812and if a previous form already translates @var{to} to some other
813character, say @var{to-alt}, @var{from} is also translated to
814@var{to-alt}.
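
The following sketch shows both a direct translation and the chaining
rule just described:

@example
(let ((table (make-translation-table '((?a . ?b)))))
  (aref table ?a))
     @result{} 98   ; ?b

;; ?b already translates to ?c, so ?a is translated to ?c as well.
(let ((table (make-translation-table '((?b . ?c)) '((?a . ?b)))))
  (aref table ?a))
     @result{} 99   ; ?c
@end example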
815@end defun
816
817 During decoding, the translation table's translations are applied to
818the characters that result from ordinary decoding. If a coding system
97d8273f 819has the property @code{:decode-translation-table}, that specifies the
820translation table to use, or a list of translation tables to apply in
821sequence. (This is a property of the coding system, as returned by
822@code{coding-system-get}, not a property of the symbol that is the
823coding system's name. @xref{Coding System Basics,, Basic Concepts of
824Coding Systems}.) Finally, if
825@code{standard-translation-table-for-decode} is non-@code{nil}, the
826resulting characters are translated by that table.
827
828 During encoding, the translation table's translations are applied to
829the characters in the buffer, and the result of translation is
830actually encoded. If a coding system has property
831@code{:encode-translation-table}, that specifies the translation table
832to use, or a list of translation tables to apply in sequence. In
833addition, if the variable @code{standard-translation-table-for-encode}
834is non-@code{nil}, it specifies the translation table to use for
835translating the result.
836
837@defvar standard-translation-table-for-decode
838This is the default translation table for decoding. If a coding
system specifies its own translation tables, the table that is the
840value of this variable, if non-@code{nil}, is applied after them.
841@end defvar
842
843@defvar standard-translation-table-for-encode
844This is the default translation table for encoding. If a coding
system specifies its own translation tables, the table that is the
846value of this variable, if non-@code{nil}, is applied after them.
847@end defvar
848
849@defvar translation-table-for-input
850Self-inserting characters are translated through this translation
851table before they are inserted. Search commands also translate their
852input through this table, so they can compare more reliably with
853what's in the buffer.
854
855This variable automatically becomes buffer-local when set.
856@end defvar
857
858@defun make-translation-table-from-vector vec
This function returns a translation table made from @var{vec}, which is
an array of 256 elements that map bytes (values 0 through #xFF) to
861characters. Elements may be @code{nil} for untranslated bytes. The
862returned table has a translation table for reverse mapping in the
8b80cdf5 863first extra slot, and the value @code{1} in the second extra slot.
864
865This function provides an easy way to make a private coding system
866that maps each byte to a specific character. You can specify the
867returned table and the reverse translation table using the properties
868@code{:decode-translation-table} and @code{:encode-translation-table}
869respectively in the @var{props} argument to
870@code{define-coding-system}.
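
As a minimal sketch, the following starts from an identity mapping for
all 256 bytes and then remaps byte @code{#x80} to the EURO SIGN
(U+20AC).  The choice of that byte is only for illustration:

@example
(let ((vec (make-vector 256 nil)))
  (dotimes (i 256)
    (aset vec i i))          ; start with the identity mapping
  (aset vec #x80 #x20AC)     ; byte #x80 maps to EURO SIGN
  (let ((table (make-translation-table-from-vector vec)))
    (list (aref table #x80)
          ;; The reverse table is in the first extra slot.
          (aref (char-table-extra-slot table 0) #x20AC))))
     @result{} (8364 128)
@end example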
871@end defun
872
873@defun make-translation-table-from-alist alist
874This function is similar to @code{make-translation-table} but returns
875a complex translation table rather than a simple one-to-one mapping.
876Each element of @var{alist} is of the form @code{(@var{from}
877. @var{to})}, where @var{from} and @var{to} are either characters or
878vectors specifying a sequence of characters. If @var{from} is a
1df7defd 879character, that character is translated to @var{to} (i.e., to a
880character or a character sequence). If @var{from} is a vector of
881characters, that sequence is translated to @var{to}. The returned
882table has a translation table for reverse mapping in the first extra
883slot, and the maximum length of all the @var{from} character sequences
884in the second extra slot.
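
For example, in this sketch one entry maps a single character and
another maps a two-character sequence, so the longest @var{from}
sequence has length 2:

@example
(let ((table (make-translation-table-from-alist
              '((?a . ?b)
                ([?x ?y] . ?z)))))
  ;; The second extra slot holds the maximum FROM length.
  (char-table-extra-slot table 1))
     @result{} 2
@end example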
885@end defun
886
887@node Coding Systems
888@section Coding Systems
889
890@cindex coding system
891 When Emacs reads or writes a file, and when Emacs sends text to a
892subprocess or receives text from a subprocess, it normally performs
893character code conversion and end-of-line conversion as specified
894by a particular @dfn{coding system}.
895
896 How to define a coding system is an arcane matter, and is not
897documented here.
898
899@menu
900* Coding System Basics:: Basic concepts.
901* Encoding and I/O:: How file I/O functions handle coding systems.
902* Lisp and Coding Systems:: Functions to operate on coding system names.
903* User-Chosen Coding Systems:: Asking the user to choose a coding system.
904* Default Coding Systems:: Controlling the default choices.
905* Specifying Coding Systems:: Requesting a particular coding system
906 for a single file operation.
907* Explicit Encoding:: Encoding or decoding text without doing I/O.
908* Terminal I/O Encoding:: Use of encoding for terminal I/O.
909@end menu
910
911@node Coding System Basics
912@subsection Basic Concepts of Coding Systems
913
914@cindex character code conversion
915 @dfn{Character code conversion} involves conversion between the
916internal representation of characters used inside Emacs and some other
917encoding. Emacs supports many different encodings, in that it can
918convert to and from them. For example, it can convert text to or from
919encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and
920several variants of ISO 2022. In some cases, Emacs supports several
921alternative encodings for the same characters; for example, there are
922three coding systems for the Cyrillic (Russian) alphabet: ISO,
923Alternativnyj, and KOI8.
924
925 Every coding system specifies a particular set of character code
926conversions, but the coding system @code{undecided} is special: it
927leaves the choice unspecified, to be chosen heuristically for each
928file, based on the file's data.
929
930 In general, a coding system doesn't guarantee roundtrip identity:
decoding a byte sequence using a coding system, then encoding the
932resulting text in the same coding system, can produce a different byte
933sequence. But some coding systems do guarantee that the byte sequence
934will be the same as what you originally decoded. Here are a few
935examples:
936
937@quotation
80070260 938iso-8859-1, utf-8, big5, shift_jis, euc-jp
939@end quotation
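
  As an illustration, here is a sketch of such a round trip through
@code{utf-8}, using the explicit encoding functions (@pxref{Explicit
Encoding}).  The byte sequence is the UTF-8 encoding of U+00E9:

@example
(encode-coding-string "\u00e9" 'utf-8)
     @result{} "\303\251"
;; Decoding the bytes and encoding the result gives the bytes back.
(encode-coding-string
 (decode-coding-string "\303\251" 'utf-8)
 'utf-8)
     @result{} "\303\251"
@end example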
940
941 Encoding buffer text and then decoding the result can also fail to
942reproduce the original text. For instance, if you encode a character
943with a coding system which does not support that character, the result
944is unpredictable, and thus decoding it using the same coding system
945may produce a different text. Currently, Emacs can't report errors
946that result from encoding unsupported characters.
b8d4c8d0
GM
947
948@cindex EOL conversion
949@cindex end-of-line conversion
950@cindex line end conversion
951 @dfn{End of line conversion} handles three different conventions
952used on various systems for representing end of line in files. The
953Unix convention, used on GNU and Unix systems, is to use the linefeed
954character (also called newline). The DOS convention, used on
955MS-Windows and MS-DOS systems, is to use a carriage-return and a
956linefeed at the end of a line. The Mac convention is to use just
957carriage-return.
958
959@cindex base coding system
960@cindex variant coding system
961 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
962conversion unspecified, to be chosen based on the data. @dfn{Variant
963coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
964@code{latin-1-mac} specify the end-of-line conversion explicitly as
965well. Most base coding systems have three corresponding variants whose
966names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
967
02eccf6b 968@vindex raw-text@r{ coding system}
b8d4c8d0 969 The coding system @code{raw-text} is special in that it prevents
970character code conversion, and causes the buffer visited with this
971coding system to be a unibyte buffer. For historical reasons, you can
972save both unibyte and multibyte text with this coding system. When
973you use @code{raw-text} to encode multibyte text, it does perform one
974character code conversion: it converts eight-bit characters to their
975single-byte external representation. @code{raw-text} does not specify
976the end-of-line conversion, allowing that to be determined as usual by
977the data, and has the usual three variants which specify the
978end-of-line conversion.
979
980@vindex no-conversion@r{ coding system}
981@vindex binary@r{ coding system}
982 @code{no-conversion} (and its alias @code{binary}) is equivalent to
983@code{raw-text-unix}: it specifies no conversion of either character
984codes or end-of-line.
b8d4c8d0 985
80070260 986@vindex emacs-internal@r{ coding system}
987@vindex utf-8-emacs@r{ coding system}
988 The coding system @code{utf-8-emacs} specifies that the data is
989represented in the internal Emacs encoding (@pxref{Text
990Representations}). This is like @code{raw-text} in that no code
991conversion happens, but different in that the result is multibyte
992data. The name @code{emacs-internal} is an alias for
993@code{utf-8-emacs}.
994
995@defun coding-system-get coding-system property
996This function returns the specified property of the coding system
997@var{coding-system}. Most coding system properties exist for internal
80070260 998purposes, but one that you might find useful is @code{:mime-charset}.
999That property's value is the name used in MIME for the character coding
1000which this coding system can read and write. Examples:
1001
1002@example
80070260 1003(coding-system-get 'iso-latin-1 :mime-charset)
b8d4c8d0 1004 @result{} iso-8859-1
80070260 1005(coding-system-get 'iso-2022-cn :mime-charset)
b8d4c8d0 1006 @result{} iso-2022-cn
80070260 1007(coding-system-get 'cyrillic-koi8 :mime-charset)
1008 @result{} koi8-r
1009@end example
1010
80070260 1011The value of the @code{:mime-charset} property is also defined
1012as an alias for the coding system.
1013@end defun
1014
9097ad86 1015@cindex alias, for coding systems
1016@defun coding-system-aliases coding-system
1017This function returns the list of aliases of @var{coding-system}.
1018@end defun
1019
1020@node Encoding and I/O
1021@subsection Encoding and I/O
1022
1023 The principal purpose of coding systems is for use in reading and
1024writing files. The function @code{insert-file-contents} uses a coding
1025system to decode the file data, and @code{write-region} uses one to
1026encode the buffer contents.
1027
1028 You can specify the coding system to use either explicitly
1029(@pxref{Specifying Coding Systems}), or implicitly using a default
1030mechanism (@pxref{Default Coding Systems}). But these methods may not
1031completely specify what to do. For example, they may choose a coding
system such as @code{undecided} which leaves the character code
1033conversion to be determined from the data. In these cases, the I/O
1034operation finishes the job of choosing a coding system. Very often
1035you will want to find out afterwards which coding system was chosen.
1036
1037@defvar buffer-file-coding-system
1038This buffer-local variable records the coding system used for saving the
1039buffer and for writing part of the buffer with @code{write-region}. If
1040the text to be written cannot be safely encoded using the coding system
1041specified by this variable, these operations select an alternative
1042encoding by calling the function @code{select-safe-coding-system}
1043(@pxref{User-Chosen Coding Systems}). If selecting a different encoding
1044requires to ask the user to specify a coding system,
1045@code{buffer-file-coding-system} is updated to the newly selected coding
1046system.
1047
1048@code{buffer-file-coding-system} does @emph{not} affect sending text
1049to a subprocess.
1050@end defvar
1051
1052@defvar save-buffer-coding-system
1053This variable specifies the coding system for saving the buffer (by
1054overriding @code{buffer-file-coding-system}). Note that it is not used
1055for @code{write-region}.
1056
1057When a command to save the buffer starts out to use
1058@code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
1059and that coding system cannot handle
1060the actual text in the buffer, the command asks the user to choose
1061another coding system (by calling @code{select-safe-coding-system}).
1062After that happens, the command also updates
1063@code{buffer-file-coding-system} to represent the coding system that
1064the user specified.
1065@end defvar
1066
1067@defvar last-coding-system-used
1068I/O operations for files and subprocesses set this variable to the
1069coding system name that was used. The explicit encoding and decoding
1070functions (@pxref{Explicit Encoding}) set it too.
1071
1072@strong{Warning:} Since receiving subprocess output sets this variable,
1073it can change whenever Emacs waits; therefore, you should copy the
1074value shortly after the function call that stores the value you are
1075interested in.
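
A minimal sketch of this advice follows.  The function name
@code{my-file-coding-system} is hypothetical, chosen only for the
example:

@example
(defun my-file-coding-system (file)
  "Return the coding system Emacs used to decode FILE."
  (with-temp-buffer
    (insert-file-contents file)
    ;; Copy the value right away, before Emacs has a chance to wait.
    last-coding-system-used))
@end example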
1076@end defvar
1077
1078 The variable @code{selection-coding-system} specifies how to encode
1079selections for the window system. @xref{Window System Selections}.
1080
1081@defvar file-name-coding-system
1082The variable @code{file-name-coding-system} specifies the coding
1083system to use for encoding file names. Emacs encodes file names using
1084that coding system for all file operations. If
1085@code{file-name-coding-system} is @code{nil}, Emacs uses a default
1086coding system determined by the selected language environment. In the
1087default language environment, any non-@acronym{ASCII} characters in
1088file names are not encoded specially; they appear in the file system
1089using the internal Emacs representation.
1090@end defvar
1091
1092 @strong{Warning:} if you change @code{file-name-coding-system} (or
1093the language environment) in the middle of an Emacs session, problems
1094can result if you have already visited files whose names were encoded
1095using the earlier coding system and are handled differently under the
1096new coding system. If you try to save one of these buffers under the
1097visited file name, saving may use the wrong file name, or it may get
1098an error. If such a problem happens, use @kbd{C-x C-w} to specify a
1099new file name for that buffer.
1100
1101@node Lisp and Coding Systems
1102@subsection Coding Systems in Lisp
1103
1104 Here are the Lisp facilities for working with coding systems:
1105
0e90e7be 1106@cindex list all coding systems
1107@defun coding-system-list &optional base-only
1108This function returns a list of all coding system names (symbols). If
1109@var{base-only} is non-@code{nil}, the value includes only the
1110base coding systems. Otherwise, it includes alias and variant coding
1111systems as well.
1112@end defun
1113
1114@defun coding-system-p object
1115This function returns @code{t} if @var{object} is a coding system
name, or if it is @code{nil}.
1117@end defun
1118
1119@cindex validity of coding system
1120@cindex coding system, validity check
b8d4c8d0 1121@defun check-coding-system coding-system
1122This function checks the validity of @var{coding-system}. If that is
1123valid, it returns @var{coding-system}. If @var{coding-system} is
@code{nil}, the function returns @code{nil}.  For any other values, it
1125signals an error whose @code{error-symbol} is @code{coding-system-error}
1126(@pxref{Signaling Errors, signal}).
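
The three cases can be sketched as follows; @code{no-such-coding} is a
made-up name assumed not to be a defined coding system:

@example
(check-coding-system 'utf-8)
     @result{} utf-8
(check-coding-system nil)
     @result{} nil
(condition-case nil
    (check-coding-system 'no-such-coding)
  (coding-system-error 'invalid))
     @result{} invalid
@end example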
1127@end defun
1128
0e90e7be 1129@cindex eol type of coding system
1130@defun coding-system-eol-type coding-system
1131This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
1132conversion used by @var{coding-system}. If @var{coding-system}
1133specifies a certain eol conversion, the return value is an integer 0,
11341, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
1135respectively. If @var{coding-system} doesn't specify eol conversion
1136explicitly, the return value is a vector of coding systems, each one
1137with one of the possible eol conversion types, like this:
1138
1139@lisp
1140(coding-system-eol-type 'latin-1)
1141 @result{} [latin-1-unix latin-1-dos latin-1-mac]
1142@end lisp
1143
1144@noindent
1145If this function returns a vector, Emacs will decide, as part of the
1146text encoding or decoding process, what eol conversion to use. For
1147decoding, the end-of-line format of the text is auto-detected, and the
1148eol conversion is set to match it (e.g., DOS-style CRLF format will
1149imply @code{dos} eol conversion). For encoding, the eol conversion is
taken from the appropriate default coding system (e.g., the
default value of @code{buffer-file-coding-system} for
@code{buffer-file-coding-system}), or from the default eol conversion
appropriate for the underlying platform.
1154@end defun
1155
0e90e7be 1156@cindex eol conversion of coding system
1157@defun coding-system-change-eol-conversion coding-system eol-type
1158This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @var{eol-type}.
1160@var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
1161@code{nil}. If it is @code{nil}, the returned coding system determines
1162the end-of-line conversion from the data.
1163
1164@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
1165@code{dos} and @code{mac}, respectively.
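
For example, here is a sketch that adds an explicit eol conversion to
@code{utf-8} and then removes it again:

@example
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
(coding-system-change-eol-conversion 'utf-8-dos nil)
     @result{} utf-8
@end example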
1166@end defun
1167
0e90e7be 1168@cindex text conversion of coding system
1169@defun coding-system-change-text-conversion eol-coding text-coding
1170This function returns a coding system which uses the end-of-line
1171conversion of @var{eol-coding}, and the text conversion of
1172@var{text-coding}. If @var{text-coding} is @code{nil}, it returns
1173@code{undecided}, or one of its variants according to @var{eol-coding}.
1174@end defun
1175
1176@cindex safely encode region
1177@cindex coding systems for encoding region
1178@defun find-coding-systems-region from to
1179This function returns a list of coding systems that could be used to
encode the text between @var{from} and @var{to}.  All coding systems in
1181the list can safely encode any multibyte characters in that portion of
1182the text.
1183
1184If the text contains no multibyte characters, the function returns the
1185list @code{(undecided)}.
1186@end defun
1187
1188@cindex safely encode a string
1189@cindex coding systems for encoding a string
1190@defun find-coding-systems-string string
1191This function returns a list of coding systems that could be used to
1192encode the text of @var{string}. All coding systems in the list can
1193safely encode any multibyte characters in @var{string}. If the text
1194contains no multibyte characters, this returns the list
1195@code{(undecided)}.
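
For example (the second form is a sketch of how one might test for a
particular coding system; its result is not shown):

@example
(find-coding-systems-string "abc")
     @result{} (undecided)
;; Test whether utf-8 is among the safe choices for a string.
(memq 'utf-8 (find-coding-systems-string "\u00e9"))
@end example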
1196@end defun
1197
1198@cindex charset, coding systems to encode
1199@cindex safely encode characters in a charset
1200@defun find-coding-systems-for-charsets charsets
1201This function returns a list of coding systems that could be used to
1202encode all the character sets in the list @var{charsets}.
1203@end defun
1204
1205@defun check-coding-systems-region start end coding-system-list
1206This function checks whether coding systems in the list
@var{coding-system-list} can encode all the characters in the region
1208between @var{start} and @var{end}. If all of the coding systems in
1209the list can encode the specified text, the function returns
1210@code{nil}. If some coding systems cannot encode some of the
1211characters, the value is an alist, each element of which has the form
1212@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning
1213that @var{coding-system1} cannot encode characters at buffer positions
1214@var{pos1}, @var{pos2}, @enddots{}.
1215
1216@var{start} may be a string, in which case @var{end} is ignored and
1217the returned value references string indices instead of buffer
1218positions.
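
As a sketch of the string form of the call, the following checks two
coding systems against a string whose only non-@acronym{ASCII}
character is at index 3.  The exact return value is not shown here:

@example
;; Only us-ascii should be reported, with the index of the
;; character it cannot encode; utf-8 can encode the whole string.
(check-coding-systems-region "caf\u00e9" nil '(us-ascii utf-8))
@end example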
1219@end defun
1220
1221@defun detect-coding-region start end &optional highest
1222This function chooses a plausible coding system for decoding the text
80070260 1223from @var{start} to @var{end}. This text should be a byte sequence,
1df7defd 1224i.e., unibyte text or multibyte text with only @acronym{ASCII} and
80070260 1225eight-bit characters (@pxref{Explicit Encoding}).
1226
1227Normally this function returns a list of coding systems that could
1228handle decoding the text that was scanned. They are listed in order of
1229decreasing priority. But if @var{highest} is non-@code{nil}, then the
1230return value is just one coding system, the one that is highest in
1231priority.
1232
1233If the region contains only @acronym{ASCII} characters except for such
ISO-2022 control characters as @code{ESC}, the value is
1235@code{undecided} or @code{(undecided)}, or a variant specifying
1236end-of-line conversion, if that can be deduced from the text.
1237
1238If the region contains null bytes, the value is @code{no-conversion},
1239even if the region contains text encoded in some coding system.
b8d4c8d0
GM
1240@end defun
1241
1242@defun detect-coding-string string &optional highest
1243This function is like @code{detect-coding-region} except that it
1244operates on the contents of @var{string} instead of bytes in the buffer.
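
For example (the guess for the non-@acronym{ASCII} byte sequence
depends on the current coding system priorities, so it is not shown):

@example
(detect-coding-string "abc" t)
     @result{} undecided
(detect-coding-string "\303\251" t)
@end example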
1245@end defun
1246
0e90e7be 1247@cindex null bytes, and decoding text
1248@defvar inhibit-null-byte-detection
1249If this variable has a non-@code{nil} value, null bytes are ignored
when detecting the encoding of a region or a string.  This allows Emacs to
1251correctly detect the encoding of text that contains null bytes, such
1252as Info files with Index nodes.
1253@end defvar
1254
1255@defvar inhibit-iso-escape-detection
1256If this variable has a non-@code{nil} value, ISO-2022 escape sequences
1257are ignored when detecting the encoding of a region or a string. The
1258result is that no text is ever detected as encoded in some ISO-2022
1259encoding, and all escape sequences become visible in a buffer.
1260@strong{Warning:} @emph{Use this variable with extreme caution,
1261because many files in the Emacs distribution use ISO-2022 encoding.}
1262@end defvar
1263
0e90e7be 1264@cindex charsets supported by a coding system
1265@defun coding-system-charset-list coding-system
1266This function returns the list of character sets (@pxref{Character
1267Sets}) supported by @var{coding-system}. Some coding systems that
1268support too many character sets to list them all yield special values:
1269@itemize @bullet
1270@item
1271If @var{coding-system} supports all the ISO-2022 charsets, the value
1272is @code{iso-2022}.
1273@item
1274If @var{coding-system} supports all Emacs characters, the value is
1275@code{(emacs)}.
1276@item
1277If @var{coding-system} supports all emacs-mule characters, the value
1278is @code{emacs-mule}.
1279@item
1280If @var{coding-system} supports all Unicode characters, the value is
1281@code{(unicode)}.
1282@end itemize
1283@end defun
1284
1285 @xref{Coding systems for a subprocess,, Process Information}, in
1286particular the description of the functions
1287@code{process-coding-system} and @code{set-process-coding-system}, for
1288how to examine or set the coding systems used for I/O to a subprocess.
1289
1290@node User-Chosen Coding Systems
1291@subsection User-Chosen Coding Systems
1292
1293@cindex select safe coding system
1294@defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
1295This function selects a coding system for encoding specified text,
1296asking the user to choose if necessary. Normally the specified text
1297is the text in the current buffer between @var{from} and @var{to}. If
1298@var{from} is a string, the string specifies the text to encode, and
1299@var{to} is ignored.
1300
1301If the specified text includes raw bytes (@pxref{Text
1302Representations}), @code{select-safe-coding-system} suggests
1303@code{raw-text} for its encoding.
1304
1305If @var{default-coding-system} is non-@code{nil}, that is the first
1306coding system to try; if that can handle the text,
1307@code{select-safe-coding-system} returns that coding system. It can
1308also be a list of coding systems; then the function tries each of them
1309one by one. After trying all of them, it next tries the current
1310buffer's value of @code{buffer-file-coding-system} (if it is not
1311@code{undecided}), then the default value of
1312@code{buffer-file-coding-system} and finally the user's most
1313preferred coding system, which the user can set using the command
1314@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
1315Coding Systems, emacs, The GNU Emacs Manual}).
1316
1317If one of those coding systems can safely encode all the specified
1318text, @code{select-safe-coding-system} chooses it and returns it.
1319Otherwise, it asks the user to choose from a list of coding systems
1320which can encode all the text, and returns the user's choice.
1321
@var{default-coding-system} can also be a list whose first element is
@code{t} and whose other elements are coding systems. Then, if no coding
1324system in the list can handle the text, @code{select-safe-coding-system}
1325queries the user immediately, without trying any of the three
1326alternatives described above.
1327
1328The optional argument @var{accept-default-p}, if non-@code{nil},
1329should be a function to determine whether a coding system selected
1330without user interaction is acceptable. @code{select-safe-coding-system}
1331calls this function with one argument, the base coding system of the
1332selected coding system. If @var{accept-default-p} returns @code{nil},
1333@code{select-safe-coding-system} rejects the silently selected coding
1334system, and asks the user to select a coding system from a list of
1335possible candidates.
1336
1337@vindex select-safe-coding-system-accept-default-p
1338If the variable @code{select-safe-coding-system-accept-default-p} is
1339non-@code{nil}, it should be a function taking a single argument.
1340It is used in place of @var{accept-default-p}, overriding any
1341value supplied for this argument.
1342
1343As a final step, before returning the chosen coding system,
1344@code{select-safe-coding-system} checks whether that coding system is
1345consistent with what would be selected if the contents of the region
1346were read from a file. (If not, this could lead to data corruption in
1347a file subsequently re-visited and edited.) Normally,
1348@code{select-safe-coding-system} uses @code{buffer-file-name} as the
1349file for this purpose, but if @var{file} is non-@code{nil}, it uses
1350that file instead (this can be relevant for @code{write-region} and
1351similar functions). If it detects an apparent inconsistency,
1352@code{select-safe-coding-system} queries the user before selecting the
1353coding system.
1354@end defun
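
  As a minimal sketch (the choice of @code{utf-8} as the preferred
default is only an assumption for illustration), a program that is
about to encode the whole buffer might ask for a safe coding system
like this:

@example
(let ((coding (select-safe-coding-system
               (point-min) (point-max) 'utf-8)))
  ;; @r{The value is the default if it can encode all the text,}
  ;; @r{otherwise another candidate or the user's choice.}
  (message "Encoding with %s" coding))
@end example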
1355
1356 Here are two functions you can use to let the user specify a coding
1357system, with completion. @xref{Completion}.
1358
1359@defun read-coding-system prompt &optional default
1360This function reads a coding system using the minibuffer, prompting with
1361string @var{prompt}, and returns the coding system name as a symbol. If
1362the user enters null input, @var{default} specifies which coding system
1363to return. It should be a symbol or a string.
1364@end defun
1365
1366@defun read-non-nil-coding-system prompt
1367This function reads a coding system using the minibuffer, prompting with
1368string @var{prompt}, and returns the coding system name as a symbol. If
1369the user tries to enter null input, it asks the user to try again.
1370@xref{Coding Systems}.
1371@end defun
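
  For instance, a command could prompt for a coding system roughly as
follows; the prompt text and the default are illustrative choices:

@example
(read-coding-system "Coding system (default utf-8): " 'utf-8)
@end example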
1372
1373@node Default Coding Systems
1374@subsection Default Coding Systems
1375@cindex default coding system
1376@cindex coding system, automatically determined
1377
1378 This section describes variables that specify the default coding
1379system for certain files or when running certain subprograms, and the
1380function that I/O operations use to access them.
1381
1382 The idea of these variables is that you set them once and for all to the
1383defaults you want, and then do not change them again. To specify a
1384particular coding system for a particular operation in a Lisp program,
1385don't change these variables; instead, override them using
1386@code{coding-system-for-read} and @code{coding-system-for-write}
1387(@pxref{Specifying Coding Systems}).
1388
@cindex file contents, and default coding system
@defopt auto-coding-regexp-alist
1391This variable is an alist of text patterns and corresponding coding
1392systems. Each element has the form @code{(@var{regexp}
1393. @var{coding-system})}; a file whose first few kilobytes match
1394@var{regexp} is decoded with @var{coding-system} when its contents are
1395read into a buffer. The settings in this alist take priority over
1396@code{coding:} tags in the files and the contents of
1397@code{file-coding-system-alist} (see below). The default value is set
1398so that Emacs automatically recognizes mail files in Babyl format and
1399reads them with no code conversions.
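
As an illustrative sketch, the following adds an entry for a
hypothetical file format whose files begin with the text
@samp{%MYFORMAT}; both the signature and the choice of @code{utf-8}
are assumptions made for the example:

@example
(add-to-list 'auto-coding-regexp-alist
             '("\\`%MYFORMAT" . utf-8))
@end example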
@end defopt

@cindex file name, and default coding system
@defopt file-coding-system-alist
1404This variable is an alist that specifies the coding systems to use for
1405reading and writing particular files. Each element has the form
1406@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
1407expression that matches certain file names. The element applies to file
1408names that match @var{pattern}.
1409
1410The @sc{cdr} of the element, @var{coding}, should be either a coding
1411system, a cons cell containing two coding systems, or a function name (a
1412symbol with a function definition). If @var{coding} is a coding system,
1413that coding system is used for both reading the file and writing it. If
1414@var{coding} is a cons cell containing two coding systems, its @sc{car}
1415specifies the coding system for decoding, and its @sc{cdr} specifies the
1416coding system for encoding.
1417
1418If @var{coding} is a function name, the function should take one
1419argument, a list of all arguments passed to
1420@code{find-operation-coding-system}. It must return a coding system
1421or a cons cell containing two coding systems. This value has the same
1422meaning as described above.
1423
If @var{coding} (or what is returned by the above function) is
1425@code{undecided}, the normal code-detection is performed.
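
For example, this sketch registers a hypothetical @file{.mydata}
extension (an assumption for illustration) so that such files are
decoded as @code{utf-8} and encoded as @code{utf-8-unix}:

@example
(add-to-list 'file-coding-system-alist
             '("\\.mydata\\'" . (utf-8 . utf-8-unix)))
@end example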
@end defopt

@defopt auto-coding-alist
1429This variable is an alist that specifies the coding systems to use for
1430reading and writing particular files. Its form is like that of
1431@code{file-coding-system-alist}, but, unlike the latter, this variable
1432takes priority over any @code{coding:} tags in the file.
@end defopt
1434
1435@cindex program name, and default coding system
1436@defvar process-coding-system-alist
1437This variable is an alist specifying which coding systems to use for a
1438subprocess, depending on which program is running in the subprocess. It
1439works like @code{file-coding-system-alist}, except that @var{pattern} is
1440matched against the program name used to start the subprocess. The coding
1441system or systems specified in this alist are used to initialize the
1442coding systems used for I/O to the subprocess, but you can specify
1443other coding systems later using @code{set-process-coding-system}.
1444@end defvar
1445
1446 @strong{Warning:} Coding systems such as @code{undecided}, which
1447determine the coding system from the data, do not work entirely reliably
1448with asynchronous subprocess output. This is because Emacs handles
1449asynchronous subprocess output in batches, as it arrives. If the coding
1450system leaves the character code conversion unspecified, or leaves the
1451end-of-line conversion unspecified, Emacs must try to detect the proper
1452conversion from one batch at a time, and this does not always work.
1453
1454 Therefore, with an asynchronous subprocess, if at all possible, use a
1455coding system which determines both the character code conversion and
1456the end of line conversion---that is, one like @code{latin-1-unix},
1457rather than @code{undecided} or @code{latin-1}.
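
  A sketch of that advice, assuming a hypothetical program named
@command{myfilter} whose input and output are Latin-1 text with
Unix-style line endings:

@example
(add-to-list 'process-coding-system-alist
             '("myfilter\\'" . (latin-1-unix . latin-1-unix)))
@end example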
1458
1459@cindex port number, and default coding system
1460@cindex network service name, and default coding system
1461@defvar network-coding-system-alist
1462This variable is an alist that specifies the coding system to use for
1463network streams. It works much like @code{file-coding-system-alist},
1464with the difference that the @var{pattern} in an element may be either a
1465port number or a regular expression. If it is a regular expression, it
1466is matched against the network service name used to open the network
1467stream.
1468@end defvar
1469
1470@defvar default-process-coding-system
1471This variable specifies the coding systems to use for subprocess (and
1472network stream) input and output, when nothing else specifies what to
1473do.
1474
1475The value should be a cons cell of the form @code{(@var{input-coding}
1476. @var{output-coding})}. Here @var{input-coding} applies to input from
1477the subprocess, and @var{output-coding} applies to output to it.
1478@end defvar
1479
@cindex default coding system, functions to determine
@defopt auto-coding-functions
1482This variable holds a list of functions that try to determine a
1483coding system for a file based on its undecoded contents.
1484
1485Each function in this list should be written to look at text in the
1486current buffer, but should not modify it in any way. The buffer will
1487contain undecoded text of parts of the file. Each function should
1488take one argument, @var{size}, which tells it how many characters to
1489look at, starting from point. If the function succeeds in determining
1490a coding system for the file, it should return that coding system.
1491Otherwise, it should return @code{nil}.
1492
1493If a file has a @samp{coding:} tag, that takes precedence, so these
1494functions won't be called.
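
As a hedged sketch, here is a function that could be added to this
list. The name @code{my-detect-myformat-coding} and the
@samp{%MYFORMAT} signature are hypothetical:

@example
(defun my-detect-myformat-coding (size)
  "Return `utf-8' if the text at point starts with \"%MYFORMAT\"."
  ;; @r{Give up if fewer than nine characters are available.}
  (when (and (>= size 9)
             (looking-at "%MYFORMAT"))
    'utf-8))

(add-to-list 'auto-coding-functions #'my-detect-myformat-coding)
@end example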
@end defopt

@defun find-auto-coding filename size
1498This function tries to determine a suitable coding system for
1499@var{filename}. It examines the buffer visiting the named file, using
1500the variables documented above in sequence, until it finds a match for
1501one of the rules specified by these variables. It then returns a cons
1502cell of the form @code{(@var{coding} . @var{source})}, where
1503@var{coding} is the coding system to use and @var{source} is a symbol,
1504one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist},
1505@code{:coding}, or @code{auto-coding-functions}, indicating which one
1506supplied the matching rule. The value @code{:coding} means the coding
1507system was specified by the @code{coding:} tag in the file
1508(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}).
1509The order of looking for a matching rule is @code{auto-coding-alist}
1510first, then @code{auto-coding-regexp-alist}, then the @code{coding:}
1511tag, and lastly @code{auto-coding-functions}. If no matching rule was
1512found, the function returns @code{nil}.
1513
1514The second argument @var{size} is the size of text, in characters,
1515following point. The function examines text only within @var{size}
1516characters after point. Normally, the buffer should be positioned at
1517the beginning when this function is called, because one of the places
1518for the @code{coding:} tag is the first one or two lines of the file;
1519in that case, @var{size} should be the size of the buffer.
1520@end defun
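
  The following sketch calls @code{find-auto-coding} on a temporary
buffer holding a @code{coding:} tag; the file name @file{example.txt}
is made up and need not exist:

@example
(with-temp-buffer
  (insert ";; -*- coding: latin-1 -*-\n")
  (goto-char (point-min))
  ;; @r{With default settings, the tag is expected to supply the}
  ;; @r{rule, so the @sc{cdr} of the result would be @code{:coding}.}
  (find-auto-coding "example.txt" (buffer-size)))
@end example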
1521
1522@defun set-auto-coding filename size
1523This function returns a suitable coding system for file
1524@var{filename}. It uses @code{find-auto-coding} to find the coding
1525system. If no coding system could be determined, the function returns
@code{nil}. The meaning of the argument @var{size} is the same as in
1527@code{find-auto-coding}.
1528@end defun
1529
1530@defun find-operation-coding-system operation &rest arguments
1531This function returns the coding system to use (by default) for
1532performing @var{operation} with @var{arguments}. The value has this
1533form:
1534
@example
(@var{decoding-system} . @var{encoding-system})
@end example
1538
1539The first element, @var{decoding-system}, is the coding system to use
1540for decoding (in case @var{operation} does decoding), and
1541@var{encoding-system} is the coding system for encoding (in case
1542@var{operation} does encoding).
1543
1544The argument @var{operation} is a symbol; it should be one of
1545@code{write-region}, @code{start-process}, @code{call-process},
1546@code{call-process-region}, @code{insert-file-contents}, or
1547@code{open-network-stream}. These are the names of the Emacs I/O
1548primitives that can do character code and eol conversion.
1549
1550The remaining arguments should be the same arguments that might be given
1551to the corresponding I/O primitive. Depending on the primitive, one
1552of those arguments is selected as the @dfn{target}. For example, if
1553@var{operation} does file I/O, whichever argument specifies the file
1554name is the target. For subprocess primitives, the process name is the
1555target. For @code{open-network-stream}, the target is the service name
1556or port number.
1557
1558Depending on @var{operation}, this function looks up the target in
1559@code{file-coding-system-alist}, @code{process-coding-system-alist},
1560or @code{network-coding-system-alist}. If the target is found in the
1561alist, @code{find-operation-coding-system} returns its association in
1562the alist; otherwise it returns @code{nil}.
1563
1564If @var{operation} is @code{insert-file-contents}, the argument
1565corresponding to the target may be a cons cell of the form
@code{(@var{filename} . @var{buffer})}. In that case, @var{filename}
1567is a file name to look up in @code{file-coding-system-alist}, and
1568@var{buffer} is a buffer that contains the file's contents (not yet
1569decoded). If @code{file-coding-system-alist} specifies a function to
1570call for this file, and that function needs to examine the file's
1571contents (as it usually does), it should examine the contents of
1572@var{buffer} instead of reading the file.
1573@end defun
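
  For illustration, this asks which coding systems would apply by
default when inserting a file; the file name is hypothetical and the
result depends on the current value of
@code{file-coding-system-alist}:

@example
(find-operation-coding-system
 'insert-file-contents "/tmp/example.txt")
@end example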
1574
1575@node Specifying Coding Systems
1576@subsection Specifying a Coding System for One Operation
1577
1578 You can specify the coding system for a specific operation by binding
1579the variables @code{coding-system-for-read} and/or
1580@code{coding-system-for-write}.
1581
1582@defvar coding-system-for-read
1583If this variable is non-@code{nil}, it specifies the coding system to
1584use for reading a file, or for input from a synchronous subprocess.
1585
1586It also applies to any asynchronous subprocess or network stream, but in
1587a different way: the value of @code{coding-system-for-read} when you
1588start the subprocess or open the network stream specifies the input
1589decoding method for that subprocess or network stream. It remains in
1590use for that subprocess or network stream unless and until overridden.
1591
1592The right way to use this variable is to bind it with @code{let} for a
1593specific I/O operation. Its global value is normally @code{nil}, and
1594you should not globally set it to any other value. Here is an example
1595of the right way to use the variable:
1596
@example
;; @r{Read the file with no character code conversion.}
;; @r{Assume @acronym{crlf} represents end-of-line.}
(let ((coding-system-for-read 'emacs-mule-dos))
  (insert-file-contents filename))
@end example
1603
1604When its value is non-@code{nil}, this variable takes precedence over
1605all other methods of specifying a coding system to use for input,
1606including @code{file-coding-system-alist},
1607@code{process-coding-system-alist} and
1608@code{network-coding-system-alist}.
1609@end defvar
1610
1611@defvar coding-system-for-write
1612This works much like @code{coding-system-for-read}, except that it
1613applies to output rather than input. It affects writing to files,
1614as well as sending output to subprocesses and net connections.
1615
1616When a single operation does both input and output, as do
1617@code{call-process-region} and @code{start-process}, both
1618@code{coding-system-for-read} and @code{coding-system-for-write}
1619affect it.
1620@end defvar
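
  By analogy with the example for @code{coding-system-for-read}, here
is a sketch of binding @code{coding-system-for-write} for a single
operation (the variable @code{filename} is assumed to hold a file
name):

@example
;; @r{Write the buffer as UTF-8 with Unix-style line endings.}
(let ((coding-system-for-write 'utf-8-unix))
  (write-region (point-min) (point-max) filename))
@end example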
1621
@defopt inhibit-eol-conversion
1623When this variable is non-@code{nil}, no end-of-line conversion is done,
1624no matter which coding system is specified. This applies to all the
1625Emacs I/O and subprocess primitives, and to the explicit encoding and
1626decoding functions (@pxref{Explicit Encoding}).
@end defopt

@cindex priority order of coding systems
1630@cindex coding systems, priority
1631 Sometimes, you need to prefer several coding systems for some
1632operation, rather than fix a single one. Emacs lets you specify a
1633priority order for using coding systems. This ordering affects the
sorting of lists of coding systems returned by functions such as
1635@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
1636
1637@defun coding-system-priority-list &optional highestp
1638This function returns the list of coding systems in the order of their
1639current priorities. Optional argument @var{highestp}, if
1640non-@code{nil}, means return only the highest priority coding system.
1641@end defun
1642
1643@defun set-coding-system-priority &rest coding-systems
1644This function puts @var{coding-systems} at the beginning of the
1645priority list for coding systems, thus making their priority higher
1646than all the rest.
1647@end defun
1648
1649@defmac with-coding-priority coding-systems &rest body@dots{}
This macro executes @var{body}, like @code{progn} does
1651(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of
1652the priority list for coding systems. @var{coding-systems} should be
1653a list of coding systems to prefer during execution of @var{body}.
1654@end defmac
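
  As a small sketch, the following detects the coding system of the
buffer's text while temporarily preferring @code{utf-8} and then
@code{iso-latin-1}; the choice of coding systems is illustrative:

@example
(with-coding-priority '(utf-8 iso-latin-1)
  (detect-coding-region (point-min) (point-max) t))
@end example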
1655
1656@node Explicit Encoding
1657@subsection Explicit Encoding and Decoding
1658@cindex encoding in coding systems
1659@cindex decoding in coding systems
1660
1661 All the operations that transfer text in and out of Emacs have the
1662ability to use a coding system to encode or decode the text.
1663You can also explicitly encode and decode text using the functions
1664in this section.
1665
1666 The result of encoding, and the input to decoding, are not ordinary
1667text. They logically consist of a series of byte values; that is, a
1668series of @acronym{ASCII} and eight-bit characters. In unibyte
1669buffers and strings, these characters have codes in the range 0
1670through #xFF (255). In a multibyte buffer or string, eight-bit
1671characters have character codes higher than #xFF (@pxref{Text
1672Representations}), but Emacs transparently converts them to their
1673single-byte values when you encode or decode such text.
1674
1675 The usual way to read a file into a buffer as a sequence of bytes, so
1676you can decode the contents explicitly, is with
1677@code{insert-file-contents-literally} (@pxref{Reading from Files});
1678alternatively, specify a non-@code{nil} @var{rawfile} argument when
1679visiting a file with @code{find-file-noselect}. These methods result in
1680a unibyte buffer.
1681
1682 The usual way to use the byte sequence that results from explicitly
1683encoding text is to copy it to a file or process---for example, to write
1684it with @code{write-region} (@pxref{Writing to Files}), and suppress
1685encoding by binding @code{coding-system-for-write} to
1686@code{no-conversion}.
1687
1688 Here are the functions to perform explicit encoding or decoding. The
1689encoding functions produce sequences of bytes; the decoding functions
1690are meant to operate on sequences of bytes. All of these functions
1691discard text properties. They also set @code{last-coding-system-used}
to the precise coding system they used.

@deffn Command encode-coding-region start end coding-system &optional destination
This command encodes the text from @var{start} to @var{end} according
1696to coding system @var{coding-system}. Normally, the encoded text
1697replaces the original text in the buffer, but the optional argument
1698@var{destination} can change that. If @var{destination} is a buffer,
1699the encoded text is inserted in that buffer after point (point does
1700not move); if it is @code{t}, the command returns the encoded text as
1701a unibyte string without inserting it.
1702
1703If encoded text is inserted in some buffer, this command returns the
1704length of the encoded text.
1705
1706The result of encoding is logically a sequence of bytes, but the
1707buffer remains multibyte if it was multibyte before, and any 8-bit
1708bytes are converted to their multibyte representation (@pxref{Text
1709Representations}).
1710
1711@cindex @code{undecided} coding-system, when encoding
1712Do @emph{not} use @code{undecided} for @var{coding-system} when
1713encoding text, since that may lead to unexpected results. Instead,
1714use @code{select-safe-coding-system} (@pxref{User-Chosen Coding
1715Systems, select-safe-coding-system}) to suggest a suitable encoding,
1716if there's no obvious pertinent value for @var{coding-system}.
1717@end deffn
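
  A minimal sketch of using @var{destination} to obtain the encoded
bytes as a string, without replacing the text in the buffer:

@example
(with-temp-buffer
  (insert "Gr@"uss Gott")
  ;; @r{Passing @code{t} returns the encoded text as a unibyte string.}
  (encode-coding-region (point-min) (point-max) 'utf-8 t))
@end example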
1718
@defun encode-coding-string string coding-system &optional nocopy buffer
1720This function encodes the text in @var{string} according to coding
1721system @var{coding-system}. It returns a new string containing the
1722encoded text, except when @var{nocopy} is non-@code{nil}, in which
1723case the function may return @var{string} itself if the encoding
1724operation is trivial. The result of encoding is a unibyte string.
1725@end defun
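
  For example, encoding the same text with two different coding
systems yields different byte sequences, shown here with octal
escapes:

@example
@group
(encode-coding-string "Gr@"uss Gott" 'latin-1)
     @result{} "Gr\374ss Gott"
(encode-coding-string "Gr@"uss Gott" 'utf-8)
     @result{} "Gr\303\274ss Gott"
@end group
@end example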
1726
@deffn Command decode-coding-region start end coding-system &optional destination
This command decodes the text from @var{start} to @var{end} according
1729to coding system @var{coding-system}. To make explicit decoding
1730useful, the text before decoding ought to be a sequence of byte
1731values, but both multibyte and unibyte buffers are acceptable (in the
1732multibyte case, the raw byte values should be represented as eight-bit
1733characters). Normally, the decoded text replaces the original text in
1734the buffer, but the optional argument @var{destination} can change
1735that. If @var{destination} is a buffer, the decoded text is inserted
1736in that buffer after point (point does not move); if it is @code{t},
1737the command returns the decoded text as a multibyte string without
1738inserting it.
1739
1740If decoded text is inserted in some buffer, this command returns the
1741length of the decoded text.
1742
1743This command puts a @code{charset} text property on the decoded text.
1744The value of the property states the character set used to decode the
1745original text.
1746@end deffn
1747
1748@defun decode-coding-string string coding-system &optional nocopy buffer
1749This function decodes the text in @var{string} according to
1750@var{coding-system}. It returns a new string containing the decoded
1751text, except when @var{nocopy} is non-@code{nil}, in which case the
1752function may return @var{string} itself if the decoding operation is
1753trivial. To make explicit decoding useful, the contents of
1754@var{string} ought to be a unibyte string with a sequence of byte
1755values, but a multibyte string is also acceptable (assuming it
1756contains 8-bit bytes in their multibyte form).
1757
1758If optional argument @var{buffer} specifies a buffer, the decoded text
1759is inserted in that buffer after point (point does not move). In this
1760case, the return value is the length of the decoded text.
1761
1762@cindex @code{charset}, text property
1763This function puts a @code{charset} text property on the decoded text.
1764The value of the property states the character set used to decode the
1765original text:
1766
@example
@group
(decode-coding-string "Gr\374ss Gott" 'latin-1)
     @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1))
@end group
@end example
1773@end defun
1774
1775@defun decode-coding-inserted-region from to filename &optional visit beg end replace
This function decodes the text from @var{from} to @var{to} as if
it were being read from file @var{filename} by @code{insert-file-contents}
called with the rest of the arguments provided.
1779
1780The normal way to use this function is after reading text from a file
1781without decoding, if you decide you would rather have decoded it.
1782Instead of deleting the text and reading it again, this time with
1783decoding, you can call this function.
1784@end defun
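
  A sketch of that usage, assuming the variable @code{filename} holds
the name of an existing file:

@example
(let* ((start (point))
       ;; @r{Insert the raw bytes; the value is a list whose second}
       ;; @r{element is the length of the inserted text.}
       (inserted (insert-file-contents-literally filename))
       (end (+ start (cadr inserted))))
  ;; @r{Decode the inserted text as if it had been read normally.}
  (decode-coding-inserted-region start end filename))
@end example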
1785
1786@node Terminal I/O Encoding
1787@subsection Terminal I/O Encoding
1788
1789 Emacs can decode keyboard input using a coding system, and encode
1790terminal output. This is useful for terminals that transmit or
1791display text using a particular encoding such as Latin-1. Emacs does
1792not set @code{last-coding-system-used} for encoding or decoding of
terminal I/O.

@defun keyboard-coding-system &optional terminal
This function returns the coding system that is in use for decoding
1797keyboard input from @var{terminal}---or @code{nil} if no coding system
1798is to be used for that terminal. If @var{terminal} is omitted or
1799@code{nil}, it means the selected frame's terminal. @xref{Multiple
1800Terminals}.
1801@end defun
1802
1803@deffn Command set-keyboard-coding-system coding-system &optional terminal
1804This command specifies @var{coding-system} as the coding system to use
1805for decoding keyboard input from @var{terminal}. If
1806@var{coding-system} is @code{nil}, that means do not decode keyboard
1807input. If @var{terminal} is a frame, it means that frame's terminal;
1808if it is @code{nil}, that means the currently selected frame's
1809terminal. @xref{Multiple Terminals}.
1810@end deffn
1811
@defun terminal-coding-system &optional terminal
This function returns the coding system that is in use for encoding
1814terminal output from @var{terminal}---or @code{nil} if the output is
1815not encoded. If @var{terminal} is a frame, it means that frame's
1816terminal; if it is @code{nil}, that means the currently selected
1817frame's terminal.
1818@end defun
1819
@deffn Command set-terminal-coding-system coding-system &optional terminal
This command specifies @var{coding-system} as the coding system to use
1822for encoding terminal output from @var{terminal}. If
1823@var{coding-system} is @code{nil}, terminal output is not encoded. If
1824@var{terminal} is a frame, it means that frame's terminal; if it is
1825@code{nil}, that means the currently selected frame's terminal.
1826@end deffn
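
  For instance, on a terminal known to use UTF-8 (an assumption about
the environment), a program could set both directions and then
inspect the result:

@example
(set-keyboard-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(list (keyboard-coding-system) (terminal-coding-system))
@end example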
1827
1828@node Input Methods
1829@section Input Methods
1830@cindex input methods
1831
1832 @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
1833characters from the keyboard. Unlike coding systems, which translate
1834non-@acronym{ASCII} characters to and from encodings meant to be read by
1835programs, input methods provide human-friendly commands. (@xref{Input
1836Methods,,, emacs, The GNU Emacs Manual}, for information on how users
1837use input methods to enter text.) How to define input methods is not
1838yet documented in this manual, but here we describe how to use them.
1839
1840 Each input method has a name, which is currently a string;
1841in the future, symbols may also be usable as input method names.
1842
1843@defvar current-input-method
1844This variable holds the name of the input method now active in the
1845current buffer. (It automatically becomes local in each buffer when set
1846in any fashion.) It is @code{nil} if no input method is active in the
1847buffer now.
1848@end defvar
1849
1850@defopt default-input-method
1851This variable holds the default input method for commands that choose an
1852input method. Unlike @code{current-input-method}, this variable is
1853normally global.
1854@end defopt
1855
1856@deffn Command set-input-method input-method
1857This command activates input method @var{input-method} for the current
1858buffer. It also sets @code{default-input-method} to @var{input-method}.
1859If @var{input-method} is @code{nil}, this command deactivates any input
1860method for the current buffer.
1861@end deffn
1862
1863@defun read-input-method-name prompt &optional default inhibit-null
1864This function reads an input method name with the minibuffer, prompting
1865with @var{prompt}. If @var{default} is non-@code{nil}, that is returned
1866by default, if the user enters empty input. However, if
1867@var{inhibit-null} is non-@code{nil}, empty input signals an error.
1868
1869The returned value is a string.
1870@end defun
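
  As an illustrative sketch, @code{read-input-method-name} can be
combined with @code{set-input-method} to let the user pick an input
method for the current buffer. Here @var{inhibit-null} is
non-@code{nil}, so empty input signals an error instead of
deactivating the input method; the prompt text is an arbitrary choice:

@example
(set-input-method
 (read-input-method-name "Input method: " nil t))
@end example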
1871
1872@defvar input-method-alist
1873This variable defines all the supported input methods.
1874Each element defines one input method, and should have the form:
1875
@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example
1880
1881Here @var{input-method} is the input method name, a string;
1882@var{language-env} is another string, the name of the language
1883environment this input method is recommended for. (That serves only for
1884documentation purposes.)
1885
1886@var{activate-func} is a function to call to activate this method. The
1887@var{args}, if any, are passed as arguments to @var{activate-func}. All
1888told, the arguments to @var{activate-func} are @var{input-method} and
1889the @var{args}.
1890
1891@var{title} is a string to display in the mode line while this method is
1892active. @var{description} is a string describing this method and what
1893it is good for.
1894@end defvar
1895
1896 The fundamental interface to input methods is through the
1897variable @code{input-method-function}. @xref{Reading One Event},
1898and @ref{Invoking the Input Method}.
1899
1900@node Locales
1901@section Locales
1902@cindex locale
1903
1904 POSIX defines a concept of ``locales'' which control which language
1905to use in language-related features. These Emacs variables control
1906how Emacs interacts with these features.
1907
1908@defvar locale-coding-system
1909@cindex keyboard input decoding on X
1910This variable specifies the coding system to use for decoding system
1911error messages and---on X Window system only---keyboard input, for
1912encoding the format argument to @code{format-time-string}, and for
1913decoding the return value of @code{format-time-string}.
1914@end defvar
1915
1916@defvar system-messages-locale
1917This variable specifies the locale to use for generating system error
1918messages. Changing the locale can cause messages to come out in a
1919different language or in a different orthography. If the variable is
1920@code{nil}, the locale is specified by environment variables in the
1921usual POSIX fashion.
1922@end defvar
1923
1924@defvar system-time-locale
1925This variable specifies the locale to use for formatting time values.
1926Changing the locale can cause messages to appear according to the
1927conventions of a different language. If the variable is @code{nil}, the
1928locale is specified by environment variables in the usual POSIX fashion.
1929@end defvar
1930
1931@defun locale-info item
1932This function returns locale data @var{item} for the current POSIX
1933locale, if available. @var{item} should be one of these symbols:
1934
1935@table @code
1936@item codeset
1937Return the character set as a string (locale item @code{CODESET}).
1938
1939@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).
1942
1943@item months
1944Return a 12-element vector of month names (locale items @code{MON_1}
1945through @code{MON_12}).
1946
1947@item paper
1948Return a list @code{(@var{width} @var{height})} for the default paper
1949size measured in millimeters (locale items @code{PAPER_WIDTH} and
1950@code{PAPER_HEIGHT}).
1951@end table
1952
1953If the system can't provide the requested information, or if
1954@var{item} is not one of those symbols, the value is @code{nil}. All
1955strings in the return value are decoded using
1956@code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual},
1957for more information about locales and locale items.
1958@end defun
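
  A short sketch of querying the current locale; the results depend
on the system and locale, and each call may return @code{nil}:

@example
(locale-info 'codeset)   ; @r{a string naming the character set}
(locale-info 'days)      ; @r{a vector of seven day names}
(locale-info 'paper)     ; @r{a list of width and height in millimeters}
@end example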