@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2002, 2003, 2004,
@c   2005, 2006 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to non-@acronym{ASCII}
characters and how they are stored in strings and buffers.

@menu
* Text Representations::        Unibyte and multibyte representations
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::             How unibyte and multibyte relate to
                                  codes of individual characters.
* Character Sets::              The space of possible character codes
                                  is divided into various character sets.
* Chars and Bytes::             More information about multibyte encodings.
* Splitting Characters::        Converting a character to its byte sequence.
* Scanning Charsets::           Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::              Coding systems are conversions for saving files.
* Input Methods::               Input methods allow users to enter various
                                  non-ASCII characters without special keyboards.
* Locales::                     Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representations

  Emacs has two @dfn{text representations}---two ways to represent text
in a string or buffer. These are called @dfn{unibyte} and
@dfn{multibyte}. Each string, and each buffer, uses one of these two
representations. For most purposes, you can ignore the issue of
representations, because Emacs converts text between them as
appropriate. Occasionally in Lisp programming you will need to pay
attention to the difference.

@cindex unibyte text
  In unibyte representation, each character occupies one byte and
therefore the possible character codes range from 0 to 255. Codes 0
through 127 are @acronym{ASCII} characters; the codes from 128 through 255
are used for one non-@acronym{ASCII} character set (you can choose which
character set by setting the variable @code{nonascii-insert-offset}).

@cindex leading code
@cindex multibyte text
@cindex trailing codes
  In multibyte representation, a character may occupy more than one
byte, and as a result, the full range of Emacs character codes can be
stored. The first byte of a multibyte character is always in the range
128 through 159 (octal 0200 through 0237). These values are called
@dfn{leading codes}. The second and subsequent bytes of a multibyte
character are always in the range 160 through 255 (octal 0240 through
0377); these values are @dfn{trailing codes}.

  Some sequences of bytes are not valid in multibyte text: for example,
a single isolated byte in the range 128 through 159 is not allowed. But
character codes 128 through 159 can appear in multibyte text,
represented as two-byte sequences. All the character codes 128 through
255 are possible (though slightly abnormal) in multibyte text; they
appear in multibyte buffers and strings when you do explicit encoding
and decoding (@pxref{Explicit Encoding}).

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte text.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defvar default-enable-multibyte-characters
This variable's value is entirely equivalent to @code{(default-value
'enable-multibyte-characters)}, and setting this variable changes that
default value. Setting the local binding of
@code{enable-multibyte-characters} in a specific buffer is not allowed,
but changing the default value is supported, and it is a reasonable
thing to do, because it has no effect on existing buffers.

The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
@end defvar

@defun position-bytes position
Return the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
Return the buffer position corresponding to byte-position
@var{byte-position} in the current buffer. If @var{byte-position} is
out of range, the value is @code{nil}.
@end defun

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string.
@end defun

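  As a rough illustration of the functions described above, the
following sketch inserts three @acronym{ASCII} characters into a
temporary buffer. It assumes that each @acronym{ASCII} character
occupies one byte in either representation, so byte positions and
character positions coincide here, and that the literal @code{"abc"}
is read as a unibyte string:

@example
(with-temp-buffer
  (insert "abc")
  (list (position-bytes (point-min))   ; start of the buffer
        (position-bytes (point-max))   ; just after the third character
        (position-bytes 10)            ; out of range
        (multibyte-string-p "abc")))   ; an ASCII literal is unibyte
     @result{} (1 4 nil nil)
@end example
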
@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, though this conversion loses information. In
general these conversions happen when inserting text into a buffer, or
when putting text from several strings together in one string. You can
also explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text that
it is constructed from. The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text. The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
unchanged, and likewise character codes 128 through 159. It converts
the non-@acronym{ASCII} codes 160 through 255 by adding the value
@code{nonascii-insert-offset} to each character code. By setting this
variable, you specify which character set the unibyte characters
correspond to (@pxref{Character Sets}). For example, if
@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
correspond to Latin 1. If it is 2688, which is @code{(- (make-char
'greek-iso8859-7) 128)}, then they correspond to Greek letters.

  Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code. If @code{nonascii-insert-offset}
has a reasonable value, corresponding to the beginning of some character
set, this conversion is the inverse of the other: converting unibyte
text to multibyte and back to unibyte reproduces the original unibyte
text.

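  The arithmetic for the Latin 1 case can be checked by hand. This
sketch does not call any conversion function; it merely adds the
offset 2048 given above to a sample code, 233, chosen here for
illustration, and then keeps the low 8 bits of the result:

@example
(let* ((offset 2048)                   ; the Latin 1 value given above
       (unibyte-code 233)              ; sample code in the range 160 to 255
       (multibyte-code (+ unibyte-code offset)))
  (list multibyte-code
        (logand multibyte-code 255)))  ; keep only the low 8 bits
     @result{} (2281 233)
@end example
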
@defvar nonascii-insert-offset
This variable specifies the amount to add to a non-@acronym{ASCII} character
when converting unibyte text to multibyte. It also applies when
@code{self-insert-command} inserts a character in the unibyte
non-@acronym{ASCII} range, 128 through 255. However, the functions
@code{insert} and @code{insert-char} do not perform this conversion.

The right value to use to select character set @var{cs} is @code{(-
(make-char @var{cs}) 128)}. If the value of
@code{nonascii-insert-offset} is zero, then conversion actually uses the
value for the Latin 1 character set, rather than zero.
@end defvar

@defvar nonascii-translation-table
This variable provides a more general alternative to
@code{nonascii-insert-offset}. You can use it to specify independently
how to translate each code in the range of 128 through 255 into a
multibyte character. The value should be a char-table, or @code{nil}.
If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
@end defvar

The next three functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-make-unibyte string
This function converts the text of @var{string} to unibyte
representation, if it isn't already, and returns the result. If
@var{string} is a unibyte string, it is returned unchanged. Multibyte
character codes are converted to unibyte according to
@code{nonascii-translation-table} or, if that is @code{nil}, using
@code{nonascii-insert-offset}. If the lookup in the translation table
fails, this function takes just the low 8 bits of each character.
@end defun

@defun string-make-multibyte string
This function converts the text of @var{string} to multibyte
representation, if it isn't already, and returns the result. If
@var{string} is a multibyte string or consists entirely of
@acronym{ASCII} characters, it is returned unchanged. In particular,
if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
string is unibyte. (When the characters are all @acronym{ASCII},
Emacs primitives will treat the string the same way whether it is
unibyte or multibyte.) If @var{string} is unibyte and contains
non-@acronym{ASCII} characters, the function
@code{unibyte-char-to-multibyte} is used to convert each unibyte
character to a multibyte character.
@end defun

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of character codes as @var{string}. Unlike
@code{string-make-multibyte}, this function unconditionally returns a
multibyte string. If @var{string} is a multibyte string, it is
returned unchanged.
@end defun

@defun multibyte-char-to-unibyte char
This converts the multibyte character @var{char} to a unibyte
character, based on @code{nonascii-translation-table} and
@code{nonascii-insert-offset}.
@end defun

@defun unibyte-char-to-multibyte char
This converts the unibyte character @var{char} to a multibyte
character, based on @code{nonascii-translation-table} and
@code{nonascii-insert-offset}.
@end defun

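  The difference between @code{string-make-multibyte} and
@code{string-to-multibyte} shows up for pure @acronym{ASCII} input. A
minimal sketch, assuming the literal @code{"abc"} is read as a unibyte
string:

@example
(multibyte-string-p (string-make-multibyte "abc"))
     @result{} nil
(multibyte-string-p (string-to-multibyte "abc"))
     @result{} t
@end example
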
@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer. If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents viewed
as characters; a sequence of two bytes which is treated as one character
in multibyte representation will count as two characters in unibyte
representation. Character codes 128 through 159 are an exception. They
are represented by one byte in a unibyte buffer, but when the buffer is
set to multibyte, they are converted to two-byte sequences, and vice
versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

You cannot use @code{set-buffer-multibyte} on an indirect buffer,
because indirect buffers always inherit the representation of the
base buffer.
@end defun

@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character. This means that the value may have
more characters than @var{string} has.

If @var{string} is already a unibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties. If @var{string} is multibyte, any characters it
contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
are converted to the corresponding single byte.
@end defun

@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
treating each multibyte sequence as one character. This means that the
value may have fewer characters than @var{string} has.

If @var{string} is already a multibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties. If @var{string} is unibyte and contains any individual
8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
the corresponding multibyte character of charset @code{eight-bit-control}
or @code{eight-bit-graphic}.
@end defun

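  For example, a string holding one non-@acronym{ASCII} character
becomes longer when its bytes are viewed as separate characters. This
sketch reuses the Latin 1 character with code 2248 that appears in
other examples in this chapter. The exact byte count depends on the
internal encoding, so only the comparison is shown:

@example
(let ((s (string 2248)))             ; a one-character multibyte string
  (list (multibyte-string-p s)
        (multibyte-string-p (string-as-unibyte s))
        (> (length (string-as-unibyte s))
           (length s))))
     @result{} (t nil t)
@end example
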
@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different character
codes. The valid character codes for unibyte representation range from
0 to 255---the values that can fit in one byte. The valid character
codes for multibyte representation range from 0 to 524287, but not all
values in that range are valid. The values 128 through 255 are not
entirely proper in multibyte text, but they can occur if you do explicit
encoding and decoding (@pxref{Explicit Encoding}). Some other character
codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes
0 through 127 are completely legitimate in both representations.

@defun char-valid-p charcode &optional genericp
This returns @code{t} if @var{charcode} is valid (either for unibyte
text or for multibyte text).

@example
(char-valid-p 65)
     @result{} t
(char-valid-p 256)
     @result{} nil
(char-valid-p 2248)
     @result{} t
@end example

If the optional argument @var{genericp} is non-@code{nil}, this
function also returns @code{t} if @var{charcode} is a generic
character (@pxref{Splitting Characters}).
@end defun

@node Character Sets
@section Character Sets
@cindex character sets

  Emacs classifies characters into various @dfn{character sets}, each of
which has a name which is a symbol. Each character belongs to one and
only one character set.

  In general, there is one character set for each distinct script. For
example, @code{latin-iso8859-1} is one character set,
@code{greek-iso8859-7} is another, and @code{ascii} is another. An
Emacs character set can hold at most 9025 characters; therefore, in some
cases, characters that would logically be grouped together are split
into several character sets. For example, one set of Chinese
characters, generally known as Big 5, is divided into two Emacs
character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.

  @acronym{ASCII} characters are in character set @code{ascii}. The
non-@acronym{ASCII} characters 128 through 159 are in character set
@code{eight-bit-control}, and codes 160 through 255 are in character set
@code{eight-bit-graphic}.

@defun charsetp object
Returns @code{t} if @var{object} is a symbol that names a character set,
@code{nil} otherwise.
@end defun

@defvar charset-list
The value is a list of all defined character set names.
@end defvar

@defun charset-list
This function returns the value of @code{charset-list}. It is only
provided for backward compatibility.
@end defun

@defun char-charset character
This function returns the name of the character set that @var{character}
belongs to, or the symbol @code{unknown} if @var{character} is not a
valid character.
@end defun

@defun charset-plist charset
This function returns the charset property list of the character set
@var{charset}. Although @var{charset} is a symbol, this is not the same
as the property list of that symbol. Charset properties are used for
special purposes within Emacs.
@end defun

@deffn Command list-charset-chars charset
This command displays a list of characters in the character set
@var{charset}.
@end deffn

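  A few quick checks, as a sketch. The results follow from the
description above; the code 2248 is the Latin 1 character used in the
@code{split-char} example elsewhere in this chapter:

@example
(charsetp 'ascii)
     @result{} t
(char-charset ?a)
     @result{} ascii
(char-charset 2248)
     @result{} latin-iso8859-1
(char-charset 128)
     @result{} eight-bit-control
@end example
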
@node Chars and Bytes
@section Characters and Bytes
@cindex bytes and characters

@cindex introduction sequence
@cindex dimension (of character set)
  In multibyte representation, each character occupies one or more
bytes. Each character set has an @dfn{introduction sequence}, which is
normally one or two bytes long. (Exception: the @code{ascii} character
set and the @code{eight-bit-graphic} character set have a zero-length
introduction sequence.) The introduction sequence is the beginning of
the byte sequence for any character in the character set. The rest of
the character's bytes distinguish it from the other characters in the
same character set. Depending on the character set, there are either
one or two distinguishing bytes; the number of such bytes is called the
@dfn{dimension} of the character set.

@defun charset-dimension charset
This function returns the dimension of @var{charset}; at present, the
dimension is always 1 or 2.
@end defun

@defun charset-bytes charset
This function returns the number of bytes used to represent a character
in character set @var{charset}.
@end defun

  This is the simplest way to determine the byte length of a character
set's introduction sequence:

@example
(- (charset-bytes @var{charset})
   (charset-dimension @var{charset}))
@end example

@node Splitting Characters
@section Splitting Characters

  The functions in this section convert between characters and the byte
values used to represent them. For most purposes, there is no need to
be concerned with the sequence of bytes used to represent a character,
because Emacs translates automatically when necessary.

@defun split-char character
Return a list containing the name of the character set of
@var{character}, followed by one or two byte values (integers) which
identify @var{character} within that character set. The number of byte
values is the character set's dimension.

If @var{character} is invalid as a character code, @code{split-char}
returns a list consisting of the symbol @code{unknown} and @var{character}.

@example
(split-char 2248)
     @result{} (latin-iso8859-1 72)
(split-char 65)
     @result{} (ascii 65)
(split-char 128)
     @result{} (eight-bit-control 128)
@end example
@end defun

@defun make-char charset &optional code1 code2
This function returns the character in character set @var{charset} whose
position codes are @var{code1} and @var{code2}. This is roughly the
inverse of @code{split-char}. Normally, you should specify either one
or both of @var{code1} and @var{code2} according to the dimension of
@var{charset}. For example,

@example
(make-char 'latin-iso8859-1 72)
     @result{} 2248
@end example

Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
before they are used to index @var{charset}. Thus you may use, for
instance, an ISO 8859 character code rather than subtracting 128, as
is necessary to index the corresponding Emacs charset.
@end defun

@cindex generic characters
  If you call @code{make-char} with neither @var{code1} nor
@var{code2}, the result is
a @dfn{generic character} which stands for @var{charset}. A generic
character is an integer, but it is @emph{not} valid for insertion in the
buffer as a character. It can be used in @code{char-table-range} to
refer to the whole character set (@pxref{Char-Tables}).
@code{char-valid-p} returns @code{nil} for generic characters.
For example:

@example
(make-char 'latin-iso8859-1)
     @result{} 2176
(char-valid-p 2176)
     @result{} nil
(char-valid-p 2176 t)
     @result{} t
(split-char 2176)
     @result{} (latin-iso8859-1 0)
@end example

The character sets @code{ascii}, @code{eight-bit-control}, and
@code{eight-bit-graphic} don't have corresponding generic characters. If
@var{charset} is one of them and you don't supply @var{code1},
@code{make-char} returns the character code corresponding to the
smallest code in @var{charset}.

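  Combining the two functions, and assuming the results shown in the
examples above, a character can be taken apart and rebuilt. The second
form relies on the zeroing of the eighth bit described for
@code{make-char}: 200 is 72 plus 128.

@example
(apply 'make-char (split-char 2248))
     @result{} 2248
(make-char 'latin-iso8859-1 200)
     @result{} 2248
@end example
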
@node Scanning Charsets
@section Scanning for Character Sets

  Sometimes it is useful to find out which character sets appear in a
part of a buffer or a string. One use for this is in determining which
coding systems (@pxref{Coding Systems}) are capable of representing all
of the text in question.

@defun charset-after &optional pos
This function returns the charset of a character in the current buffer
at position @var{pos}. If @var{pos} is omitted or @code{nil}, it
defaults to the current value of point. If @var{pos} is out of range,
the value is @code{nil}.
@end defun

@defun find-charset-region beg end &optional translation
This function returns a list of the character sets that appear in the
current buffer between positions @var{beg} and @var{end}.

The optional argument @var{translation} specifies a translation table to
be used in scanning the text (@pxref{Translation of Characters}). If it
is non-@code{nil}, then each character in the region is translated
through this table, and the value returned describes the translated
characters instead of the characters actually in the buffer.
@end defun

@defun find-charset-string string &optional translation
This function returns a list of the character sets that appear in the
string @var{string}. It is just like @code{find-charset-region}, except
that it applies to the contents of @var{string} instead of part of the
current buffer.
@end defun

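  One possible use is a predicate that tests whether a string is pure
@acronym{ASCII}. The name @code{my-ascii-only-p} is hypothetical, and the
sketch assumes that @code{find-charset-string} returns @code{(ascii)}
for nonempty @acronym{ASCII} text and @code{nil} for the empty string:

@example
(defun my-ascii-only-p (string)
  "Return non-nil if STRING contains only ASCII characters."
  (let ((charsets (find-charset-string string)))
    ;; Accept both the empty list and a list naming only `ascii'.
    (or (null charsets)
        (equal charsets '(ascii)))))

(my-ascii-only-p "abc")
     @result{} t
(my-ascii-only-p (string 2248))      ; the Latin 1 character used earlier
     @result{} nil
@end example
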
@node Translation of Characters
@section Translation of Characters
@cindex character translation tables
@cindex translation tables

  A @dfn{translation table} is a char-table that specifies a mapping
of characters into characters. These tables are used in encoding and
decoding, and for other purposes. Some coding systems specify their
own particular translation tables; there are also default translation
tables which apply to all other coding systems.

  For instance, the coding-system @code{utf-8} has a translation table
that maps characters of various charsets (e.g.,
@code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
it can encode Latin-2 characters into UTF-8. Meanwhile,
@code{unify-8859-on-decoding-mode} operates by specifying
@code{standard-translation-table-for-decode} to translate
Latin-@var{x} characters into corresponding Unicode characters.

@defun make-translation-table &rest translations
This function returns a translation table based on the argument
@var{translations}. Each element of @var{translations} should be a
list of elements of the form @code{(@var{from} . @var{to})}; this says
to translate the character @var{from} into @var{to}.

The arguments and the forms in each argument are processed in order,
and if a previous form already translates @var{to} to some other
character, say @var{to-alt}, @var{from} is also translated to
@var{to-alt}.

You can also map one whole character set into another character set with
the same dimension. To do this, you specify a generic character (which
designates a character set) for @var{from} (@pxref{Splitting Characters}).
In this case, if @var{to} is also a generic character, its character
set should have the same dimension as @var{from}'s. Then the
translation table translates each character of @var{from}'s character
set into the corresponding character of @var{to}'s character set. If
@var{from} is a generic character and @var{to} is an ordinary
character, then the translation table translates every character of
@var{from}'s character set into @var{to}.
@end defun

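  A small sketch of the processing order described above, using
ordinary characters only. Since a translation table is a char-table,
@code{aref} is used here to look up individual entries; 99 is the
character code of @samp{c}:

@example
(let ((table (make-translation-table '((?b . ?c))
                                     '((?a . ?b)))))
  (list (aref table ?b)      ; ?b was mapped to ?c directly
        (aref table ?a)))    ; ?b already maps to ?c, so ?a does too
     @result{} (99 99)
@end example
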
  In decoding, the translation table's translations are applied to the
characters that result from ordinary decoding. If a coding system has
property @code{translation-table-for-decode}, that specifies the
translation table to use. (This is a property of the coding system,
as returned by @code{coding-system-get}, not a property of the symbol
that is the coding system's name. @xref{Coding System Basics,, Basic
Concepts of Coding Systems}.) Otherwise, if
@code{standard-translation-table-for-decode} is non-@code{nil},
decoding uses that table.

  In encoding, the translation table's translations are applied to the
characters in the buffer, and the result of translation is actually
encoded. If a coding system has property
@code{translation-table-for-encode}, that specifies the translation
table to use. Otherwise the variable
@code{standard-translation-table-for-encode} specifies the translation
table.

@defvar standard-translation-table-for-decode
This is the default translation table for decoding, for
coding systems that don't specify any other translation table.
@end defvar

@defvar standard-translation-table-for-encode
This is the default translation table for encoding, for
coding systems that don't specify any other translation table.
@end defvar

@defvar translation-table-for-input
Self-inserting characters are translated through this translation
table before they are inserted. Search commands also translate their
input through this table, so they can compare more reliably with
what's in the buffer.

@code{set-buffer-file-coding-system} sets this variable so that your
keyboard input gets translated into the character sets that the buffer
is likely to contain. This variable automatically becomes
buffer-local when set.
@end defvar

@node Coding Systems
@section Coding Systems

@cindex coding system
  When Emacs reads or writes a file, and when Emacs sends text to a
subprocess or receives text from a subprocess, it normally performs
character code conversion and end-of-line conversion as specified
by a particular @dfn{coding system}.

  How to define a coding system is an arcane matter, and is not
documented here.

@menu
* Coding System Basics::        Basic concepts.
* Encoding and I/O::            How file I/O functions handle coding systems.
* Lisp and Coding Systems::     Functions to operate on coding system names.
* User-Chosen Coding Systems::  Asking the user to choose a coding system.
* Default Coding Systems::      Controlling the default choices.
* Specifying Coding Systems::   Requesting a particular coding system
                                  for a single file operation.
* Explicit Encoding::           Encoding or decoding text without doing I/O.
* Terminal I/O Encoding::       Use of encoding for terminal I/O.
* MS-DOS File Types::           How DOS "text" and "binary" files
                                  relate to coding systems.
@end menu

@node Coding System Basics
@subsection Basic Concepts of Coding Systems

@cindex character code conversion
  @dfn{Character code conversion} involves conversion between the encoding
used inside Emacs and some other encoding. Emacs supports many
different encodings, in that it can convert to and from them. For
example, it can convert text to or from encodings such as Latin 1, Latin
2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
cases, Emacs supports several alternative encodings for the same
characters; for example, there are three coding systems for the Cyrillic
(Russian) alphabet: ISO, Alternativnyj, and KOI8.

  Most coding systems specify a particular character code for
conversion, but some of them leave the choice unspecified---to be chosen
heuristically for each file, based on the data.

  In general, a coding system doesn't guarantee roundtrip identity:
decoding a byte sequence using a coding system, then encoding the
resulting text in the same coding system, can produce a different byte
sequence. However, the following coding systems do guarantee that the
byte sequence will be the same as what you originally decoded:

@quotation
chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
@end quotation

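  The round-trip guarantee can be tried out with the explicit
conversion functions (@pxref{Explicit Encoding}). A sketch, using
@code{iso-latin-1} from the list above and a sample byte string chosen
here for illustration:

@example
(let* ((bytes "caf\351")    ; unibyte; octal 351 is e-acute in Latin 1
       (text (decode-coding-string bytes 'iso-latin-1)))
  ;; Encoding the decoded text should give back the original bytes.
  (string= (encode-coding-string text 'iso-latin-1) bytes))
     @result{} t
@end example
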
643 Encoding buffer text and then decoding the result can also fail to
644reproduce the original text. For instance, if you encode Latin-2
8b918214
RS
645characters with @code{utf-8} and decode the result using the same
646coding system, you'll get Unicode characters (of charset
aa945b59
RS
647@code{mule-unicode-0100-24ff}). If you encode Unicode characters with
648@code{iso-latin-2} and decode the result with the same coding system,
649you'll get Latin-2 characters.
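
 As a rough illustration of the roundtrip guarantee, the sketch below
decodes a short byte string with @code{iso-latin-1}, one of the coding
systems listed above, and encodes the result again. The sample bytes
are made up for the example.

@example
;; @r{"caf" followed by byte 233, the Latin-1 code for e-acute.}
(let* ((bytes "caf\351")
       (text (decode-coding-string bytes 'iso-latin-1))
       (again (encode-coding-string text 'iso-latin-1)))
  (string= bytes again))
     @result{} t
@end example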
6fa88620 650
969fe9b5
RS
651@cindex end of line conversion
652 @dfn{End of line conversion} handles three different conventions used
653on various systems for representing end of line in files. The Unix
654convention is to use the linefeed character (also called newline). The
8241495d
RS
655DOS convention is to use a carriage-return and a linefeed at the end of
656a line. The Mac convention is to use just carriage-return.
969fe9b5 657
cc6d0d2c
RS
658@cindex base coding system
659@cindex variant coding system
660 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
661conversion unspecified, to be chosen based on the data. @dfn{Variant
662coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
663@code{latin-1-mac} specify the end-of-line conversion explicitly as
a9f0a989 664well. Most base coding systems have three corresponding variants whose
cc6d0d2c
RS
665names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
666
a9f0a989
RS
667 The coding system @code{raw-text} is special in that it prevents
668character code conversion, and causes the buffer visited with that
669coding system to be a unibyte buffer. It does not specify the
670end-of-line conversion, allowing that to be determined as usual by the
671data, and has the usual three variants which specify the end-of-line
672conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
673it specifies no conversion of either character codes or end-of-line.
674
675 The coding system @code{emacs-mule} specifies that the data is
676represented in the internal Emacs encoding. This is like
677@code{raw-text} in that no code conversion happens, but different in
678that the result is multibyte data.
679
680@defun coding-system-get coding-system property
a9f0a989
RS
681This function returns the specified property of the coding system
682@var{coding-system}. Most coding system properties exist for internal
683purposes, but one that you might find useful is @code{mime-charset}.
684That property's value is the name used in MIME for the character coding
685which this coding system can read and write. Examples:
686
687@example
688(coding-system-get 'iso-latin-1 'mime-charset)
689 @result{} iso-8859-1
690(coding-system-get 'iso-2022-cn 'mime-charset)
691 @result{} iso-2022-cn
692(coding-system-get 'cyrillic-koi8 'mime-charset)
693 @result{} koi8-r
694@end example
695
696The value of the @code{mime-charset} property is also defined
697as an alias for the coding system.
698@end defun
699
700@node Encoding and I/O
701@subsection Encoding and I/O
702
1911e6e5 703 The principal purpose of coding systems is for use in reading and
a9f0a989
RS
704writing files. The function @code{insert-file-contents} uses
705a coding system for decoding the file data, and @code{write-region}
706uses one to encode the buffer contents.
707
708 You can specify the coding system to use either explicitly
5ac343ac 709(@pxref{Specifying Coding Systems}), or implicitly using a default
a9f0a989
RS
710mechanism (@pxref{Default Coding Systems}). But these methods may not
711completely specify what to do. For example, they may choose a coding
712system such as @code{undecided} which leaves the character code
713conversion to be determined from the data. In these cases, the I/O
714operation finishes the job of choosing a coding system. Very often
715you will want to find out afterwards which coding system was chosen.
716
717@defvar buffer-file-coding-system
475aab0d
CY
718This buffer-local variable records the coding system that was used to visit
719the current buffer. It is used for saving the buffer, and for writing part
1b02d12c
EZ
720of the buffer with @code{write-region}. If the text to be written
721cannot be safely encoded using the coding system specified by this
722variable, these operations select an alternative encoding by calling
723the function @code{select-safe-coding-system} (@pxref{User-Chosen
724Coding Systems}). If selecting a different encoding requires asking
725the user to specify a coding system, @code{buffer-file-coding-system}
726is updated to the newly selected coding system.
727
728@code{buffer-file-coding-system} does @emph{not} affect sending text
b6954afd 729to a subprocess.
a9f0a989
RS
730@end defvar
731
732@defvar save-buffer-coding-system
7a063989
KH
733This variable specifies the coding system for saving the buffer (by
734overriding @code{buffer-file-coding-system}). Note that it is not used
735for @code{write-region}.
8241495d
RS
736
737When a command to save the buffer starts out to use
7a063989
KH
738@code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
739and that coding system cannot handle
8241495d 740the actual text in the buffer, the command asks the user to choose
1b02d12c
EZ
741another coding system (by calling @code{select-safe-coding-system}).
742After that happens, the command also updates
743@code{buffer-file-coding-system} to represent the coding system that
744the user specified.
a9f0a989
RS
745@end defvar
746
747@defvar last-coding-system-used
a9f0a989
RS
748I/O operations for files and subprocesses set this variable to the
749coding system name that was used. The explicit encoding and decoding
750functions (@pxref{Explicit Encoding}) set it too.
751
752@strong{Warning:} Since receiving subprocess output sets this variable,
8241495d
RS
753it can change whenever Emacs waits; therefore, you should copy the
754value shortly after the function call that stores the value you are
a9f0a989
RS
755interested in.
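
A minimal sketch of that advice follows; the function name
@code{my-file-coding-system} is hypothetical, chosen only for this
example.

@example
(defun my-file-coding-system (filename)
  "Return the coding system Emacs used to decode FILENAME."
  (with-temp-buffer
    (insert-file-contents filename)
    ;; @r{Copy the value at once, before Emacs has a chance to wait.}
    last-coding-system-used))
@end example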
756@end defvar
757
2eb4136f
RS
758 The variable @code{selection-coding-system} specifies how to encode
759selections for the window system. @xref{Window System Selections}.
760
1ee89891
RS
761@defvar file-name-coding-system
762The variable @code{file-name-coding-system} specifies the coding
763system to use for encoding file names. Emacs encodes file names using
764that coding system for all file operations. If
765@code{file-name-coding-system} is @code{nil}, Emacs uses a default
766coding system determined by the selected language environment. In the
767default language environment, any non-@acronym{ASCII} characters in
768file names are not encoded specially; they appear in the file system
769using the internal Emacs representation.
770@end defvar
771
772 @strong{Warning:} if you change @code{file-name-coding-system} (or
773the language environment) in the middle of an Emacs session, problems
774can result if you have already visited files whose names were encoded
775using the earlier coding system and are handled differently under the
776new coding system. If you try to save one of these buffers under the
777visited file name, saving may use the wrong file name, or it may signal
778an error. If such a problem happens, use @kbd{C-x C-w} to specify a
779new file name for that buffer.
780
969fe9b5
RS
781@node Lisp and Coding Systems
782@subsection Coding Systems in Lisp
783
8241495d 784 Here are the Lisp facilities for working with coding systems:
cc6d0d2c 785
cc6d0d2c
RS
786@defun coding-system-list &optional base-only
787This function returns a list of all coding system names (symbols). If
788@var{base-only} is non-@code{nil}, the value includes only the
7a063989
KH
789base coding systems. Otherwise, it includes alias and variant coding
790systems as well.
cc6d0d2c
RS
791@end defun
792
cc6d0d2c
RS
793@defun coding-system-p object
794This function returns @code{t} if @var{object} is a coding system
35864124 795name or @code{nil}.
cc6d0d2c
RS
796@end defun
797
cc6d0d2c
RS
798@defun check-coding-system coding-system
799This function checks the validity of @var{coding-system}.
800If that is valid, it returns @var{coding-system}.
801Otherwise it signals an error with condition @code{coding-system-error}.
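
For instance, assuming no coding system named
@code{no-such-coding-system} has been defined:

@example
(check-coding-system 'iso-latin-1)
     @result{} iso-latin-1
(condition-case nil
    (check-coding-system 'no-such-coding-system)
  (coding-system-error 'invalid))
     @result{} invalid
@end example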
802@end defun
803
e1166db9
EZ
804@cindex EOL conversion
805@cindex end-of-line conversion
806@cindex line end conversion
807@defun coding-system-eol-type coding-system
808This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
809conversion used by @var{coding-system}. If @var{coding-system}
810specifies a certain eol conversion, the return value is an integer 0,
8111, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
812respectively. If @var{coding-system} doesn't specify eol conversion
813explicitly, the return value is a vector of coding systems, each one
814with one of the possible eol conversion types, like this:
815
816@lisp
817(coding-system-eol-type 'latin-1)
818 @result{} [latin-1-unix latin-1-dos latin-1-mac]
819@end lisp
820
821@noindent
822If this function returns a vector, Emacs will decide, as part of the
823text encoding or decoding process, what eol conversion to use. For
824decoding, the end-of-line format of the text is auto-detected, and the
825eol conversion is set to match it (e.g., DOS-style CRLF format will
826imply @code{dos} eol conversion). For encoding, the eol conversion is
827taken from the appropriate default coding system (e.g.,
828@code{default-buffer-file-coding-system} for
829@code{buffer-file-coding-system}), or from the default eol conversion
830appropriate for the underlying platform.
831@end defun
832
a9f0a989 833@defun coding-system-change-eol-conversion coding-system eol-type
a9f0a989 834This function returns a coding system which is like @var{coding-system}
1911e6e5 835except for its eol conversion, which is specified by @var{eol-type}.
a9f0a989
RS
836@var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
837@code{nil}. If it is @code{nil}, the returned coding system determines
838the end-of-line conversion from the data.
35864124
LT
839
840@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
070b546b 841@code{dos} and @code{mac}, respectively.
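
The following sketch assumes the standard @code{iso-latin-1} coding
system and its usual three variants:

@example
(coding-system-change-eol-conversion 'iso-latin-1 'dos)
     @result{} iso-latin-1-dos
(coding-system-eol-type
 (coding-system-change-eol-conversion 'iso-latin-1 2))
     @result{} 2
@end example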
a9f0a989 842@end defun
969fe9b5 843
a9f0a989 844@defun coding-system-change-text-conversion eol-coding text-coding
a9f0a989
RS
845This function returns a coding system which uses the end-of-line
846conversion of @var{eol-coding}, and the text conversion of
847@var{text-coding}. If @var{text-coding} is @code{nil}, it returns
848@code{undecided}, or one of its variants according to @var{eol-coding}.
969fe9b5
RS
849@end defun
850
a9f0a989 851@defun find-coding-systems-region from to
a9f0a989
RS
852This function returns a list of coding systems that could be used to
853encode the text between @var{from} and @var{to}. All coding systems in
854the list can safely encode any multibyte characters in that portion of
855the text.
856
857If the text contains no multibyte characters, the function returns the
858list @code{(undecided)}.
859@end defun
860
861@defun find-coding-systems-string string
a9f0a989
RS
862This function returns a list of coding systems that could be used to
863encode the text of @var{string}. All coding systems in the list can
864safely encode any multibyte characters in @var{string}. If the text
865contains no multibyte characters, this returns the list
866@code{(undecided)}.
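
As a sketch of one possible use, the hypothetical helper below tests
whether a given coding system can safely encode a string. It assumes
that the returned list holds base coding systems, and so it calls
@code{coding-system-base} to strip any alias or end-of-line variant
before looking in the list.

@example
(defun my-string-encodable-p (string coding-system)
  "Return non-nil if CODING-SYSTEM can safely encode STRING."
  (let ((safe (find-coding-systems-string string)))
    (or (equal safe '(undecided))   ; @r{no multibyte characters}
        (memq (coding-system-base coding-system) safe))))
@end example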
867@end defun
868
869@defun find-coding-systems-for-charsets charsets
a9f0a989
RS
870This function returns a list of coding systems that could be used to
871encode all the character sets in the list @var{charsets}.
872@end defun
873
874@defun detect-coding-region start end &optional highest
cc6d0d2c 875This function chooses a plausible coding system for decoding the text
0ace421a 876from @var{start} to @var{end}. This text should be a byte sequence
969fe9b5 877(@pxref{Explicit Encoding}).
cc6d0d2c 878
a9f0a989 879Normally this function returns a list of coding systems that could
cc6d0d2c 880handle decoding the text that was scanned. They are listed in order of
a9f0a989
RS
881decreasing priority. But if @var{highest} is non-@code{nil}, then the
882return value is just one coding system, the one that is highest in
883priority.
884
6d05494a
KH
885If the region contains only @acronym{ASCII} characters except for such
886ISO-2022 control characters ISO-2022 as @code{ESC}, the value is
887@code{undecided} or @code{(undecided)}, or a variant specifying
35864124 888end-of-line conversion, if that can be deduced from the text.
cc6d0d2c
RS
889@end defun
890
35864124 891@defun detect-coding-string string &optional highest
cc6d0d2c
RS
892This function is like @code{detect-coding-region} except that it
893operates on the contents of @var{string} instead of bytes in the buffer.
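
A common pattern, sketched here with made-up sample bytes, is to guess
the coding system and then decode with the guess:

@example
(let* ((bytes "caf\351\n")
       ;; @r{A non-@code{nil} third argument asks for a single coding system.}
       (guess (detect-coding-string bytes t)))
  (decode-coding-string bytes guess))
@end example

@noindent
The guess depends on the coding system priorities in effect, so the
result can differ from one session to another.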
1911e6e5
RS
894@end defun
895
35864124
LT
896 @xref{Coding systems for a subprocess,, Process Information}, in
897particular the description of the functions
898@code{process-coding-system} and @code{set-process-coding-system}, for
899how to examine or set the coding systems used for I/O to a subprocess.
1911e6e5
RS
900
901@node User-Chosen Coding Systems
902@subsection User-Chosen Coding Systems
903
1b02d12c 904@cindex select safe coding system
35864124 905@defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
bf23b477
EZ
906This function selects a coding system for encoding specified text,
907asking the user to choose if necessary. Normally the specified text
35864124
LT
908is the text in the current buffer between @var{from} and @var{to}. If
909@var{from} is a string, the string specifies the text to encode, and
910@var{to} is ignored.
bf23b477
EZ
911
912If @var{default-coding-system} is non-@code{nil}, that is the first
913coding system to try; if that can handle the text,
914@code{select-safe-coding-system} returns that coding system. It can
915also be a list of coding systems; then the function tries each of them
35864124
LT
916one by one. After trying all of them, it next tries the current
917buffer's value of @code{buffer-file-coding-system} (if it is not
918@code{undecided}), then the value of
919@code{default-buffer-file-coding-system} and finally the user's most
920preferred coding system, which the user can set using the command
921@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
922Coding Systems, emacs, The GNU Emacs Manual}).
bf23b477
EZ
923
924If one of those coding systems can safely encode all the specified
925text, @code{select-safe-coding-system} chooses it and returns it.
926Otherwise, it asks the user to choose from a list of coding systems
927which can encode all the text, and returns the user's choice.
928
35864124
LT
929@var{default-coding-system} can also be a list whose first element is
930@code{t} and whose other elements are coding systems. Then, if no coding
931system in the list can handle the text, @code{select-safe-coding-system}
932queries the user immediately, without trying any of the three
933alternatives described above.
934
bf23b477 935The optional argument @var{accept-default-p}, if non-@code{nil},
35864124
LT
936should be a function to determine whether a coding system selected
937without user interaction is acceptable. @code{select-safe-coding-system}
938calls this function with one argument, the base coding system of the
939selected coding system. If @var{accept-default-p} returns @code{nil},
940@code{select-safe-coding-system} rejects the silently selected coding
941system, and asks the user to select a coding system from a list of
942possible candidates.
bf23b477
EZ
943
944@vindex select-safe-coding-system-accept-default-p
945If the variable @code{select-safe-coding-system-accept-default-p} is
946non-@code{nil}, its value overrides the value of
947@var{accept-default-p}.
35864124
LT
948
949As a final step, before returning the chosen coding system,
950@code{select-safe-coding-system} checks whether that coding system is
951consistent with what would be selected if the contents of the region
952were read from a file. (If not, this could lead to data corruption in
953a file subsequently re-visited and edited.) Normally,
954@code{select-safe-coding-system} uses @code{buffer-file-name} as the
955file for this purpose, but if @var{file} is non-@code{nil}, it uses
956that file instead (this can be relevant for @code{write-region} and
957similar functions). If it detects an apparent inconsistency,
958@code{select-safe-coding-system} queries the user before selecting the
959coding system.
969fe9b5
RS
960@end defun
961
962 Here are two functions you can use to let the user specify a coding
963system, with completion. @xref{Completion}.
964
a9f0a989 965@defun read-coding-system prompt &optional default
969fe9b5
RS
966This function reads a coding system using the minibuffer, prompting with
967string @var{prompt}, and returns the coding system name as a symbol. If
968the user enters null input, @var{default} specifies which coding system
969to return. It should be a symbol or a string.
970@end defun
971
969fe9b5
RS
972@defun read-non-nil-coding-system prompt
973This function reads a coding system using the minibuffer, prompting with
a9f0a989 974string @var{prompt}, and returns the coding system name as a symbol. If
969fe9b5
RS
975the user tries to enter null input, it asks the user to try again.
976@xref{Coding Systems}.
cc6d0d2c
RS
977@end defun
978
979@node Default Coding Systems
a9f0a989 980@subsection Default Coding Systems
cc6d0d2c 981
a9f0a989
RS
982 This section describes variables that specify the default coding
983system for certain files or when running certain subprograms, and the
1911e6e5 984function that I/O operations use to access them.
a9f0a989
RS
985
986 The idea of these variables is that you set them once and for all to the
987defaults you want, and then do not change them again. To specify a
988particular coding system for a particular operation in a Lisp program,
989don't change these variables; instead, override them using
990@code{coding-system-for-read} and @code{coding-system-for-write}
991(@pxref{Specifying Coding Systems}).
cc6d0d2c 992
bf23b477
EZ
993@defvar auto-coding-regexp-alist
994This variable is an alist of text patterns and corresponding coding
995systems. Each element has the form @code{(@var{regexp}
996. @var{coding-system})}; a file whose first few kilobytes match
997@var{regexp} is decoded with @var{coding-system} when its contents are
998read into a buffer. The settings in this alist take priority over
999@code{coding:} tags in the files and the contents of
1000@code{file-coding-system-alist} (see below). The default value is set
1001so that Emacs automatically recognizes mail files in Babyl format and
1002reads them with no code conversions.
1003@end defvar
1004
cc6d0d2c
RS
1005@defvar file-coding-system-alist
1006This variable is an alist that specifies the coding systems to use for
1007reading and writing particular files. Each element has the form
1008@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
1009expression that matches certain file names. The element applies to file
1010names that match @var{pattern}.
1011
35864124 1012The @sc{cdr} of the element, @var{coding}, should be either a coding
8241495d
RS
1013system, a cons cell containing two coding systems, or a function name (a
1014symbol with a function definition). If @var{coding} is a coding system,
1015that coding system is used for both reading the file and writing it. If
35864124
LT
1016@var{coding} is a cons cell containing two coding systems, its @sc{car}
1017specifies the coding system for decoding, and its @sc{cdr} specifies the
8241495d
RS
1018coding system for encoding.
1019
35864124
LT
1020If @var{coding} is a function name, the function should take one
1021argument, a list of all arguments passed to
1022@code{find-operation-coding-system}. It must return a coding system
1023or a cons cell containing two coding systems. This value has the same
1024meaning as described above.
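
Here is a sketch of two entries, as they might appear in an init file.
The file name extensions @samp{.lat2} and @samp{.dosdata} are invented
for the example.

@example
;; @r{One coding system, used for both reading and writing.}
(add-to-list 'file-coding-system-alist
             '("\\.lat2\\'" . iso-latin-2))
;; @r{A cons cell: decode with the @sc{car}, encode with the @sc{cdr}.}
(add-to-list 'file-coding-system-alist
             '("\\.dosdata\\'" . (undecided-dos . iso-latin-1-dos)))
@end example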
cc6d0d2c
RS
1025@end defvar
1026
cc6d0d2c
RS
1027@defvar process-coding-system-alist
1028This variable is an alist specifying which coding systems to use for a
1029subprocess, depending on which program is running in the subprocess. It
1030works like @code{file-coding-system-alist}, except that @var{pattern} is
1031matched against the program name used to start the subprocess. The coding
1032system or systems specified in this alist are used to initialize the
1033coding systems used for I/O to the subprocess, but you can specify
1034other coding systems later using @code{set-process-coding-system}.
1035@end defvar
1036
8241495d
RS
1037 @strong{Warning:} Coding systems such as @code{undecided}, which
1038determine the coding system from the data, do not work entirely reliably
1911e6e5 1039with asynchronous subprocess output. This is because Emacs handles
a9f0a989
RS
1040asynchronous subprocess output in batches, as it arrives. If the coding
1041system leaves the character code conversion unspecified, or leaves the
1042end-of-line conversion unspecified, Emacs must try to detect the proper
1043conversion from one batch at a time, and this does not always work.
1044
1045 Therefore, with an asynchronous subprocess, if at all possible, use a
1046coding system which determines both the character code conversion and
1047the end-of-line conversion---that is, one like @code{latin-1-unix},
1048rather than @code{undecided} or @code{latin-1}.
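
 For example, an entry like the following one (the program name
@samp{myprog} is hypothetical) fully specifies both conversions, in
both directions, for one program:

@example
(add-to-list 'process-coding-system-alist
             '("\\`myprog\\'" . (iso-latin-1-unix . iso-latin-1-unix)))
@end example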
1049
cc6d0d2c
RS
1050@defvar network-coding-system-alist
1051This variable is an alist that specifies the coding system to use for
1052network streams. It works much like @code{file-coding-system-alist},
969fe9b5 1053with the difference that the @var{pattern} in an element may be either a
cc6d0d2c
RS
1054port number or a regular expression. If it is a regular expression, it
1055is matched against the network service name used to open the network
1056stream.
1057@end defvar
1058
cc6d0d2c
RS
1059@defvar default-process-coding-system
1060This variable specifies the coding systems to use for subprocess (and
1061network stream) input and output, when nothing else specifies what to
1062do.
1063
a9f0a989
RS
1064The value should be a cons cell of the form @code{(@var{input-coding}
1065. @var{output-coding})}. Here @var{input-coding} applies to input from
1066the subprocess, and @var{output-coding} applies to output to it.
cc6d0d2c
RS
1067@end defvar
1068
131bf943
RS
1069@defvar auto-coding-functions
1070This variable holds a list of functions that try to determine a
1071coding system for a file based on its undecoded contents.
1072
1073Each function in this list should be written to look at text in the
1074current buffer, but should not modify it in any way. The buffer will
1075contain undecoded text of parts of the file. Each function should
1076take one argument, @var{size}, which tells it how many characters to
1077look at, starting from point. If the function succeeds in determining
1078a coding system for the file, it should return that coding system.
1079Otherwise, it should return @code{nil}.
1080
1081If a file has a @samp{coding:} tag, that takes precedence, so these
1082functions won't be called.
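
As an illustration only, the function below recognizes files that
begin with an invented marker line, @samp{;; latin2-data}. Both the
marker and the function name are assumptions made for this sketch.

@example
(defun my-latin2-auto-coding (size)
  "Return `iso-latin-2' if the text at point starts with the marker."
  (save-match-data
    ;; @r{The marker is 14 characters long, so require at least that many.}
    (when (and (>= size 14)
               (looking-at ";; latin2-data"))
      'iso-latin-2)))

(add-to-list 'auto-coding-functions 'my-latin2-auto-coding)
@end example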
1083@end defvar
1084
a9f0a989 1085@defun find-operation-coding-system operation &rest arguments
a9f0a989
RS
1086This function returns the coding system to use (by default) for
1087performing @var{operation} with @var{arguments}. The value has this
1088form:
1089
1090@example
35864124 1091(@var{decoding-system} . @var{encoding-system})
a9f0a989
RS
1092@end example
1093
1094The first element, @var{decoding-system}, is the coding system to use
1095for decoding (in case @var{operation} does decoding), and
1096@var{encoding-system} is the coding system for encoding (in case
1097@var{operation} does encoding).
1098
342fd6cd
RS
1099The argument @var{operation} should be a symbol, any one of
1100@code{insert-file-contents}, @code{write-region},
1101@code{start-process}, @code{call-process}, @code{call-process-region},
1102or @code{open-network-stream}. These are the names of the Emacs I/O
e1511d87 1103primitives that can do character code and eol conversion.
a9f0a989
RS
1104
1105The remaining arguments should be the same arguments that might be given
e1511d87
EZ
1106to the corresponding I/O primitive. Depending on the primitive, one
1107of those arguments is selected as the @dfn{target}. For example, if
a9f0a989
RS
1108@var{operation} does file I/O, whichever argument specifies the file
1109name is the target. For subprocess primitives, the process name is the
1110target. For @code{open-network-stream}, the target is the service name
1111or port number.
1112
342fd6cd
RS
1113Depending on @var{operation}, this function looks up the target in
1114@code{file-coding-system-alist}, @code{process-coding-system-alist},
e1511d87
EZ
1115or @code{network-coding-system-alist}. If the target is found in the
1116alist, @code{find-operation-coding-system} returns its association in
1117the alist; otherwise it returns @code{nil}.
6d3906d5
KH
1118
1119If @var{operation} is @code{insert-file-contents}, the argument
1120corresponding to the target may be a cons cell of the form
b8909e88 1121@code{(@var{filename} . @var{buffer})}. In that case, @var{filename}
e1511d87 1122is a file name to look up in @code{file-coding-system-alist}, and
749eecf5
RS
1123@var{buffer} is a buffer that contains the file's contents (not yet
1124decoded). If @code{file-coding-system-alist} specifies a function to
1125call for this file, and that function needs to examine the file's
1126contents (as it usually does), it should examine the contents of
1127@var{buffer} instead of reading the file.
a9f0a989
RS
1128@end defun
1129
cc6d0d2c 1130@node Specifying Coding Systems
a9f0a989 1131@subsection Specifying a Coding System for One Operation
cc6d0d2c
RS
1132
1133 You can specify the coding system for a specific operation by binding
1134the variables @code{coding-system-for-read} and/or
1135@code{coding-system-for-write}.
1136
cc6d0d2c
RS
1137@defvar coding-system-for-read
1138If this variable is non-@code{nil}, it specifies the coding system to
1139use for reading a file, or for input from a synchronous subprocess.
1140
1141It also applies to any asynchronous subprocess or network stream, but in
1142a different way: the value of @code{coding-system-for-read} when you
1143start the subprocess or open the network stream specifies the input
1144decoding method for that subprocess or network stream. It remains in
1145use for that subprocess or network stream unless and until overridden.
1146
1147The right way to use this variable is to bind it with @code{let} for a
1148specific I/O operation. Its global value is normally @code{nil}, and
1149you should not globally set it to any other value. Here is an example
1150of the right way to use the variable:
1151
1152@example
1153;; @r{Read the file with no character code conversion.}
ad800164 1154;; @r{Assume @acronym{crlf} represents end-of-line.}
a3d3f60d 1155(let ((coding-system-for-read 'emacs-mule-dos))
cc6d0d2c
RS
1156 (insert-file-contents filename))
1157@end example
1158
1159When its value is non-@code{nil}, @code{coding-system-for-read} takes
a9f0a989 1160precedence over all other methods of specifying a coding system to use for
cc6d0d2c
RS
1161input, including @code{file-coding-system-alist},
1162@code{process-coding-system-alist} and
1163@code{network-coding-system-alist}.
1164@end defvar
1165
cc6d0d2c
RS
1166@defvar coding-system-for-write
1167This works much like @code{coding-system-for-read}, except that it
1168applies to output rather than input. It affects writing to files,
b6954afd 1169as well as sending output to subprocesses and net connections.
cc6d0d2c
RS
1170
1171When a single operation does both input and output, as do
1172@code{call-process-region} and @code{start-process}, both
1173@code{coding-system-for-read} and @code{coding-system-for-write}
1174affect it.
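
Here is a sketch parallel to the example given for
@code{coding-system-for-read}; the function name is hypothetical.

@example
(defun my-write-buffer-as-latin-1 (filename)
  "Write the current buffer to FILENAME as Latin-1 with LF line ends."
  (let ((coding-system-for-write 'iso-latin-1-unix))
    (write-region (point-min) (point-max) filename)))
@end example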
1175@end defvar
1176
cc6d0d2c
RS
1177@defvar inhibit-eol-conversion
1178When this variable is non-@code{nil}, no end-of-line conversion is done,
1179no matter which coding system is specified. This applies to all the
1180Emacs I/O and subprocess primitives, and to the explicit encoding and
1181decoding functions (@pxref{Explicit Encoding}).
1182@end defvar
1183
cc6d0d2c 1184@node Explicit Encoding
a9f0a989 1185@subsection Explicit Encoding and Decoding
cc6d0d2c
RS
1186@cindex encoding text
1187@cindex decoding text
1188
1189 All the operations that transfer text in and out of Emacs have the
1190ability to use a coding system to encode or decode the text.
1191You can also explicitly encode and decode text using the functions
1192in this section.
1193
cc6d0d2c 1194 The result of encoding, and the input to decoding, are not ordinary
0ace421a
GM
1195text. They logically consist of a series of byte values; that is, a
1196series of characters whose codes are in the range 0 through 255. In a
1197multibyte buffer or string, character codes 128 through 159 are
1198represented by multibyte sequences, but this is invisible to Lisp
1199programs.
1200
1201 The usual way to read a file into a buffer as a sequence of bytes, so
1202you can decode the contents explicitly, is with
1203@code{insert-file-contents-literally} (@pxref{Reading from Files});
1204alternatively, specify a non-@code{nil} @var{rawfile} argument when
1205visiting a file with @code{find-file-noselect}. These methods result in
1206a unibyte buffer.
1207
1208 The usual way to use the byte sequence that results from explicitly
1209encoding text is to copy it to a file or process---for example, to write
1210it with @code{write-region} (@pxref{Writing to Files}), and suppress
1211encoding by binding @code{coding-system-for-write} to
1212@code{no-conversion}.
b6954afd
RS
1213
1214 Here are the functions to perform explicit encoding or decoding. The
7f2e71dd 1215encoding functions produce sequences of bytes; the decoding functions
0ace421a
GM
1216are meant to operate on sequences of bytes. All of these functions
1217discard text properties.
1911e6e5 1218
35864124
LT
1219@deffn Command encode-coding-region start end coding-system
1220This command encodes the text from @var{start} to @var{end} according
969fe9b5 1221to coding system @var{coding-system}. The encoded text replaces the
0ace421a
GM
1222original text in the buffer. The result of encoding is logically a
1223sequence of bytes, but the buffer remains multibyte if it was multibyte
1224before.
cc6d0d2c 1225
35864124
LT
1226This command returns the length of the encoded text.
1227@end deffn
1228
1229@defun encode-coding-string string coding-system &optional nocopy
cc6d0d2c
RS
1230This function encodes the text in @var{string} according to coding
1231system @var{coding-system}. It returns a new string containing the
35864124
LT
1232encoded text, except when @var{nocopy} is non-@code{nil}, in which
1233case the function may return @var{string} itself if the encoding
1234operation is trivial. The result of encoding is a unibyte string.
cc6d0d2c
RS
1235@end defun
1236
35864124
LT
1237@deffn Command decode-coding-region start end coding-system
1238This command decodes the text from @var{start} to @var{end} according
cc6d0d2c
RS
1239to coding system @var{coding-system}. The decoded text replaces the
1240original text in the buffer. To make explicit decoding useful, the text
0ace421a
GM
1241before decoding ought to be a sequence of byte values, but both
1242multibyte and unibyte buffers are acceptable.
cc6d0d2c 1243
35864124
LT
1244This command returns the length of the decoded text.
1245@end deffn
1246
1247@defun decode-coding-string string coding-system &optional nocopy
cc6d0d2c
RS
1248This function decodes the text in @var{string} according to coding
1249system @var{coding-system}. It returns a new string containing the
35864124
LT
1250decoded text, except when @var{nocopy} is non-@code{nil}, in which
1251case the function may return @var{string} itself if the decoding
1252operation is trivial. To make explicit decoding useful, the contents
1253of @var{string} ought to be a sequence of byte values, but a multibyte
0ace421a 1254string is acceptable.
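
Combining the two string functions converts bytes from one external
encoding to another. This sketch uses made-up sample bytes:

@example
;; @r{Latin-1 bytes in, UTF-8 bytes out.}
(encode-coding-string
 (decode-coding-string "caf\351" 'iso-latin-1)
 'utf-8)
     @result{} "caf\303\251"
@end example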
cc6d0d2c 1255@end defun

@defun decode-coding-inserted-region from to filename &optional visit beg end replace
This function decodes the text from @var{from} to @var{to} as if
it were being read from file @var{filename} by @code{insert-file-contents},
called with the rest of the arguments provided.

The normal way to use this function is after reading text from a file
without decoding, if you decide you would rather have decoded it.
Instead of deleting the text and reading it again, this time with
decoding, you can call this function.
@end defun
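
The usage described above might look like the following sketch. The
function name @code{my-decode-after-literal-read} and the temporary
file are hypothetical. Binding @code{coding-system-for-read} to
@code{utf-8} is an assumption made here so that the outcome does not
depend on the user's settings.

@example
(defun my-decode-after-literal-read ()
  "Read a file without decoding, then decode the inserted text."
  (let ((file (make-temp-file "decode-demo"))
        (coding-system-for-write 'no-conversion))
    (unwind-protect
        (progn
          ;; Write two raw bytes: one character in UTF-8.
          (write-region "\303\251" nil file nil 'silent)
          (with-temp-buffer
            ;; Insert the bytes as they are, with no decoding.
            (insert-file-contents-literally file)
            (let ((before (buffer-size))
                  (coding-system-for-read 'utf-8))
              ;; Decode in place instead of reading the file again.
              (decode-coding-inserted-region
               (point-min) (point-max) file)
              (list before (buffer-size)))))
      (delete-file file))))
@end example

The two sizes in the result show the byte count before decoding and
the character count after it.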

@node Terminal I/O Encoding
@subsection Terminal I/O Encoding

 Emacs can decode keyboard input using a coding system, and encode
terminal output. This is useful for terminals that transmit or display
text using a particular encoding such as Latin-1. Emacs does not set
@code{last-coding-system-used} for encoding or decoding for the
terminal.

@defun keyboard-coding-system
This function returns the coding system that is in use for decoding
keyboard input---or @code{nil} if no coding system is to be used.
@end defun

@deffn Command set-keyboard-coding-system coding-system
This command specifies @var{coding-system} as the coding system to
use for decoding keyboard input. If @var{coding-system} is @code{nil},
that means do not decode keyboard input.
@end deffn

@defun terminal-coding-system
This function returns the coding system that is in use for encoding
terminal output---or @code{nil} for no encoding.
@end defun

@deffn Command set-terminal-coding-system coding-system
This command specifies @var{coding-system} as the coding system to use
for encoding terminal output. If @var{coding-system} is @code{nil},
that means do not encode terminal output.
@end deffn
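
For example, a program could report the current settings with a helper
along these lines. The name @code{my-terminal-coding-summary} is
hypothetical; the sketch only calls the two query functions described
above.

@example
(defun my-terminal-coding-summary ()
  "Return a string naming the keyboard and terminal coding systems."
  (format "keyboard: %s, terminal: %s"
          ;; Both functions return nil when no conversion is done.
          (or (keyboard-coding-system) "none")
          (or (terminal-coding-system) "none")))
@end example

The result depends on the terminal in use, so no particular value
should be expected.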

@node MS-DOS File Types
@subsection MS-DOS File Types
@cindex DOS file types
@cindex MS-DOS file types
@cindex Windows file types
@cindex file types on MS-DOS and Windows
@cindex text files and binary files
@cindex binary files and text files

 On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
end-of-line conversion for a file by looking at the file's name. This
feature classifies files as @dfn{text files} and @dfn{binary files}. By
``binary file'' we mean a file of literal byte values that are not
necessarily meant to be characters; Emacs does no end-of-line conversion
and no character code conversion for them. On the other hand, the bytes
in a text file are intended to represent characters; when you create a
new file whose name implies that it is a text file, Emacs uses DOS
end-of-line conversion.

@defvar buffer-file-type
This variable, automatically buffer-local in each buffer, records the
file type of the buffer's visited file. When a buffer does not specify
a coding system with @code{buffer-file-coding-system}, this variable is
used to determine which coding system to use when writing the contents
of the buffer. It should be @code{nil} for text, @code{t} for binary.
If it is @code{t}, the coding system is @code{no-conversion}.
Otherwise, @code{undecided-dos} is used.

Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
@end defvar

@defopt file-name-buffer-file-type-alist
This variable holds an alist for recognizing text and binary files.
Each element has the form @code{(@var{regexp} . @var{type})}, where
@var{regexp} is matched against the file name, and @var{type} may be
@code{nil} for text, @code{t} for binary, or a function to call to
compute which. If it is a function, then it is called with a single
argument (the file name) and should return @code{t} or @code{nil}.

When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
which coding system to use when reading a file. For a text file,
@code{undecided-dos} is used. For a binary file, @code{no-conversion}
is used.

If no element in this alist matches a given file name, then
@code{default-buffer-file-type} says how to treat the file.
@end defopt

@defopt default-buffer-file-type
This variable says how to handle files for which
@code{file-name-buffer-file-type-alist} says nothing about the type.

If this variable is non-@code{nil}, then these files are treated as
binary: the coding system @code{no-conversion} is used. Otherwise,
nothing special is done for them---the coding system is deduced solely
from the file contents, in the usual Emacs fashion.
@end defopt
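
The lookup described for these two options can be sketched in Lisp as
follows. This is not the code Emacs uses. The function
@code{my-guess-file-type} is hypothetical and takes the alist and the
default as explicit arguments, so it can be tried on any system.

@example
(defun my-guess-file-type (filename alist default)
  "Return the type for FILENAME according to ALIST, else DEFAULT.
ALIST has elements (REGEXP . TYPE), as described above."
  (let ((entry nil))
    ;; Find the first element whose regexp matches the file name.
    (while (and alist (null entry))
      (if (string-match (car (car alist)) filename)
          (setq entry (car alist)))
      (setq alist (cdr alist)))
    (cond ((null entry) default)        ; nothing matched
          ((functionp (cdr entry))      ; let the function decide
           (funcall (cdr entry) filename))
          (t (cdr entry)))))            ; nil for text, t for binary
@end example

For instance, with an alist of
@code{(("\\.exe\\'" . t) ("\\.txt\\'" . nil))}, the name
@file{prog.exe} is classified as binary and @file{notes.txt} as text.
A name that matches neither element gets the default. Note that a
@code{nil} default does not mean ``text'': as explained above, the
coding system is then deduced from the file contents.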

@node Input Methods
@section Input Methods
@cindex input methods

 @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
characters from the keyboard. Unlike coding systems, which translate
non-@acronym{ASCII} characters to and from encodings meant to be read by
programs, input methods provide human-friendly commands. (@xref{Input
Methods,,, emacs, The GNU Emacs Manual}, for information on how users
use input methods to enter text.) How to define input methods is not
yet documented in this manual, but here we describe how to use them.

 Each input method has a name, which is currently a string;
in the future, symbols may also be usable as input method names.

@defvar current-input-method
This variable holds the name of the input method now active in the
current buffer. (It automatically becomes local in each buffer when set
in any fashion.) It is @code{nil} if no input method is active in the
buffer now.
@end defvar

@defopt default-input-method
This variable holds the default input method for commands that choose an
input method. Unlike @code{current-input-method}, this variable is
normally global.
@end defopt

@deffn Command set-input-method input-method
This command activates input method @var{input-method} for the current
buffer. It also sets @code{default-input-method} to @var{input-method}.
If @var{input-method} is @code{nil}, this command deactivates any input
method for the current buffer.
@end deffn
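
The following sketch shows how these variables and the command fit
together. The function @code{my-try-input-method} is hypothetical, and
the example call assumes that an input method named
@code{"latin-1-prefix"} is installed. Because
@code{set-input-method} also changes @code{default-input-method}, the
sketch rebinds that variable so the user's default is left alone.

@example
(defun my-try-input-method (name)
  "Activate input method NAME in a scratch buffer.
Return the value of `current-input-method' while it was active."
  ;; Rebind the default so the call below cannot change it for good.
  (let ((default-input-method default-input-method))
    (with-temp-buffer
      (set-input-method name)
      (prog1 current-input-method
        ;; Passing nil turns the input method off again.
        (set-input-method nil)))))

(my-try-input-method "latin-1-prefix")
@end example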

@defun read-input-method-name prompt &optional default inhibit-null
This function reads an input method name with the minibuffer, prompting
with @var{prompt}. If @var{default} is non-@code{nil}, that is returned
by default, if the user enters empty input. However, if
@var{inhibit-null} is non-@code{nil}, empty input signals an error.

The returned value is a string.
@end defun

@defvar input-method-alist
This variable defines all the supported input methods.
Each element defines one input method, and should have the form:

@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example

Here @var{input-method} is the input method name, a string;
@var{language-env} is another string, the name of the language
environment this input method is recommended for. (That serves only for
documentation purposes.)

@var{activate-func} is a function to call to activate this method. The
@var{args}, if any, are passed as arguments to @var{activate-func}. All
told, the arguments to @var{activate-func} are @var{input-method} and
the @var{args}.

@var{title} is a string to display in the mode line while this method is
active. @var{description} is a string describing this method and what
it is good for.
@end defvar
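
Since each element starts with the method name followed by the
language environment, a program can pick out the methods recommended
for one environment. The function @code{my-input-methods-for} below is
a hypothetical helper written for this illustration, and
@code{"Latin-1"} is just a sample environment name.

@example
(defun my-input-methods-for (language-env)
  "Return the names of input methods recommended for LANGUAGE-ENV."
  (let ((names nil))
    (dolist (elt input-method-alist)
      ;; The second slot of each element is the language environment.
      (if (equal (nth 1 elt) language-env)
          (setq names (cons (car elt) names))))
    (nreverse names)))

(my-input-methods-for "Latin-1")
@end example

The result is a list of strings, possibly empty if no input method
names that environment.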

 The fundamental interface to input methods is through the
variable @code{input-method-function}. @xref{Reading One Event},
and @ref{Invoking the Input Method}.

@node Locales
@section Locales
@cindex locale

 POSIX defines a concept of ``locales'' which control which language
to use in language-related features. These Emacs variables control
how Emacs interacts with these features.

@defvar locale-coding-system
@cindex keyboard input decoding on X
This variable specifies the coding system to use for decoding system
error messages and---on the X Window System only---keyboard input, for
encoding the format argument to @code{format-time-string}, and for
decoding the return value of @code{format-time-string}.
@end defvar

@defvar system-messages-locale
This variable specifies the locale to use for generating system error
messages. Changing the locale can cause messages to come out in a
different language or in a different orthography. If the variable is
@code{nil}, the locale is specified by environment variables in the
usual POSIX fashion.
@end defvar

@defvar system-time-locale
This variable specifies the locale to use for formatting time values.
Changing the locale can cause messages to appear according to the
conventions of a different language. If the variable is @code{nil}, the
locale is specified by environment variables in the usual POSIX fashion.
@end defvar
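
One way to see the effect of @code{system-time-locale} is to bind it
around a call to @code{format-time-string}. In this sketch the
function name @code{my-epoch-weekday} is hypothetical, and the
@code{"C"} locale is assumed to be available, as it normally is on
POSIX systems.

@example
(defun my-epoch-weekday ()
  "Return the weekday name of the epoch, formatted in the C locale."
  ;; The binding lasts only for the duration of this call.
  (let ((system-time-locale "C"))
    ;; (0 0) is the start of 1970; the last argument asks for
    ;; Universal Time, so the result does not depend on the time zone.
    (format-time-string "%A" '(0 0) t)))
@end example

Binding the variable to another locale name that the system supports
should give the weekday name in that locale's language.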

@defun locale-info item
This function returns locale data @var{item} for the current POSIX
locale, if available. @var{item} should be one of these symbols:

@table @code
@item codeset
Return the character set as a string (locale item @code{CODESET}).

@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).

@item months
Return a 12-element vector of month names (locale items @code{MON_1}
through @code{MON_12}).

@item paper
Return a list @code{(@var{width} @var{height})} for the default paper
size measured in millimeters (locale items @code{PAPER_WIDTH} and
@code{PAPER_HEIGHT}).
@end table

If the system can't provide the requested information, or if
@var{item} is not one of those symbols, the value is @code{nil}. All
strings in the return value are decoded using
@code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual},
for more information about locales and locale items.
@end defun
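
Because @code{locale-info} may return @code{nil}, callers should be
prepared for missing data. The helper @code{my-first-day-name} below
is hypothetical, and so is its fallback string.

@example
(defun my-first-day-name ()
  "Return the locale's name for the first day, or a fallback string."
  (let ((days (locale-info 'days)))
    (if days
        (aref days 0)       ; the name stored as locale item DAY_1
      "unknown")))          ; the system could not provide the data
@end example

The string returned depends on the current locale.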

@ignore
   arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
@end ignore