@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to non-@acronym{ASCII}
characters and how they are stored in strings and buffers.

@menu
* Text Representations::    Unibyte and multibyte representations
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::         How unibyte and multibyte relate to
                                codes of individual characters.
* Character Sets::          The space of possible character codes
                                is divided into various character sets.
* Chars and Bytes::         More information about multibyte encodings.
* Splitting Characters::    Converting a character to its byte sequence.
* Scanning Charsets::       Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::          Coding systems are conversions for saving files.
* Input Methods::           Input methods allow users to enter various
                                non-ASCII characters without special keyboards.
* Locales::                 Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representations

  Emacs has two @dfn{text representations}---two ways to represent text
in a string or buffer. These are called @dfn{unibyte} and
@dfn{multibyte}. Each string, and each buffer, uses one of these two
representations. For most purposes, you can ignore the issue of
representations, because Emacs converts text between them as
appropriate. Occasionally in Lisp programming you will need to pay
attention to the difference.

@cindex unibyte text
  In unibyte representation, each character occupies one byte and
therefore the possible character codes range from 0 to 255. Codes 0
through 127 are @acronym{ASCII} characters; the codes from 128 through 255
are used for one non-@acronym{ASCII} character set (you can choose which
character set by setting the variable @code{nonascii-insert-offset}).

@cindex leading code
@cindex multibyte text
@cindex trailing codes
  In multibyte representation, a character may occupy more than one
byte, and as a result, the full range of Emacs character codes can be
stored. The first byte of a multibyte character is always in the range
128 through 159 (octal 0200 through 0237). These values are called
@dfn{leading codes}. The second and subsequent bytes of a multibyte
character are always in the range 160 through 255 (octal 0240 through
0377); these values are @dfn{trailing codes}.

  Some sequences of bytes are not valid in multibyte text: for example,
a single isolated byte in the range 128 through 159 is not allowed. But
character codes 128 through 159 can appear in multibyte text,
represented as two-byte sequences. All the character codes 128 through
255 are possible (though slightly abnormal) in multibyte text; they
appear in multibyte buffers and strings when you do explicit encoding
and decoding (@pxref{Explicit Encoding}).

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte text.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defvar default-enable-multibyte-characters
This variable's value is entirely equivalent to @code{(default-value
'enable-multibyte-characters)}, and setting this variable changes that
default value. Setting the local binding of
@code{enable-multibyte-characters} in a specific buffer is not allowed,
but changing the default value is supported, and it is a reasonable
thing to do, because it has no effect on existing buffers.

The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
@end defvar

@defun position-bytes position
@tindex position-bytes
Return the byte-position corresponding to buffer position @var{position}
in the current buffer. If @var{position} is out of range, the value
is @code{nil}.
@end defun

@defun byte-to-position byte-position
@tindex byte-to-position
Return the buffer position corresponding to byte-position
@var{byte-position} in the current buffer. If @var{byte-position} is
out of range, the value is @code{nil}.
@end defun

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string.
@end defun
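
  As a minimal sketch of how these pieces fit together, the following
form inspects a temporary buffer. The buffer contents are an arbitrary
sample. Because the text is entirely @acronym{ASCII}, each character
occupies one byte, so the last two elements of the result are both 3;
the first element depends on the default representation in effect.

@example
(with-temp-buffer
  (insert "abc")
  (list enable-multibyte-characters   ; non-nil in a multibyte buffer
        (position-bytes 3)            ; byte position of character 3
        (byte-to-position 3)))        ; character position of byte 3
@end example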

@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, though this conversion loses information. In
general these conversions happen when inserting text into a buffer, or
when putting text from several strings together in one string. You can
also explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text that
it is constructed from. The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text. The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
unchanged, and likewise character codes 128 through 159. It converts
the non-@acronym{ASCII} codes 160 through 255 by adding the value
@code{nonascii-insert-offset} to each character code. By setting this
variable, you specify which character set the unibyte characters
correspond to (@pxref{Character Sets}). For example, if
@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
correspond to Latin 1. If it is 2688, which is @code{(- (make-char
'greek-iso8859-7) 128)}, then they correspond to Greek letters.

  Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code. If @code{nonascii-insert-offset}
has a reasonable value, corresponding to the beginning of some character
set, this conversion is the inverse of the other: converting unibyte
text to multibyte and back to unibyte reproduces the original unibyte
text.

@defvar nonascii-insert-offset
This variable specifies the amount to add to a non-@acronym{ASCII} character
when converting unibyte text to multibyte. It also applies when
@code{self-insert-command} inserts a character in the unibyte
non-@acronym{ASCII} range, 128 through 255. However, the functions
@code{insert} and @code{insert-char} do not perform this conversion.

The right value to use to select character set @var{cs} is @code{(-
(make-char @var{cs}) 128)}. If the value of
@code{nonascii-insert-offset} is zero, then conversion actually uses the
value for the Latin 1 character set, rather than zero.
@end defvar

@defvar nonascii-translation-table
This variable provides a more general alternative to
@code{nonascii-insert-offset}. You can use it to specify independently
how to translate each code in the range of 128 through 255 into a
multibyte character. The value should be a char-table, or @code{nil}.
If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
@end defvar

The next three functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-make-unibyte string
This function converts the text of @var{string} to unibyte
representation, if it isn't already, and returns the result. If
@var{string} is a unibyte string, it is returned unchanged. Multibyte
character codes are converted to unibyte according to
@code{nonascii-translation-table} or, if that is @code{nil}, using
@code{nonascii-insert-offset}. If the lookup in the translation table
fails, this function takes just the low 8 bits of each character.
@end defun

@defun string-make-multibyte string
This function converts the text of @var{string} to multibyte
representation, if it isn't already, and returns the result. If
@var{string} is a multibyte string or consists entirely of
@acronym{ASCII} characters, it is returned unchanged. In particular,
if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
string is unibyte. (When the characters are all @acronym{ASCII},
Emacs primitives will treat the string the same way whether it is
unibyte or multibyte.) If @var{string} is unibyte and contains
non-@acronym{ASCII} characters, the function
@code{unibyte-char-to-multibyte} is used to convert each unibyte
character to a multibyte character.
@end defun

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of character codes as @var{string}. Unlike
@code{string-make-multibyte}, this function unconditionally returns a
multibyte string. If @var{string} is a multibyte string, it is
returned unchanged.
@end defun
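
  The difference between @code{string-make-multibyte} and
@code{string-to-multibyte} shows up on an all-@acronym{ASCII} string.
The following sketch assumes that the literal @code{"abc"} is read as a
unibyte string; the results then follow from the descriptions above.

@example
(multibyte-string-p (string-make-multibyte "abc"))
     @result{} nil
(multibyte-string-p (string-to-multibyte "abc"))
     @result{} t
@end example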

@defun multibyte-char-to-unibyte char
This function converts the multibyte character @var{char} to a unibyte
character, based on @code{nonascii-translation-table} and
@code{nonascii-insert-offset}.
@end defun

@defun unibyte-char-to-multibyte char
This function converts the unibyte character @var{char} to a multibyte
character, based on @code{nonascii-translation-table} and
@code{nonascii-insert-offset}.
@end defun
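
  As a sketch of how these two functions relate, the following form
binds both conversion variables explicitly so that the result does not
depend on the session's settings. The offset 2048 selects Latin 1, as
described earlier in this section, and the code 200 is an arbitrary
sample.

@example
(let ((nonascii-translation-table nil)
      (nonascii-insert-offset 2048))
  ;; Convert a unibyte code to multibyte, then back again.
  (let ((multi (unibyte-char-to-multibyte 200)))
    (list multi (multibyte-char-to-unibyte multi))))
@end example

@noindent
Under the rules given in this section, the first element should be 2248
(200 plus the offset) and the second should be 200 again.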

@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer. If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents viewed
as characters; a sequence of two bytes which is treated as one character
in multibyte representation will count as two characters in unibyte
representation. Character codes 128 through 159 are an exception. They
are represented by one byte in a unibyte buffer, but when the buffer is
set to multibyte, they are converted to two-byte sequences, and vice
versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

You cannot use @code{set-buffer-multibyte} on an indirect buffer,
because indirect buffers always inherit the representation of the
base buffer.
@end defun
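
  For instance, this minimal sketch switches a temporary buffer to
unibyte and then back to multibyte, reading
@code{enable-multibyte-characters} after each call:

@example
(with-temp-buffer
  (set-buffer-multibyte nil)
  (let ((after-unibyte enable-multibyte-characters))
    (set-buffer-multibyte t)
    (list after-unibyte enable-multibyte-characters)))
     @result{} (nil t)
@end example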

@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character. This means that the value may have
more characters than @var{string} has.

If @var{string} is already a unibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties. If @var{string} is multibyte, any characters it
contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
are converted to the corresponding single byte.
@end defun

@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
treating each multibyte sequence as one character. This means that the
value may have fewer characters than @var{string} has.

If @var{string} is already a multibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties. If @var{string} is unibyte and contains any individual
8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
the corresponding multibyte character of charset @code{eight-bit-control}
or @code{eight-bit-graphic}.
@end defun
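
  A sketch of the effect on string length follows. The Latin 1
character built with @code{make-char} is an arbitrary sample. No result
is shown, because the number of bytes depends on how the character is
encoded internally; the point is only that the second length can exceed
the first.

@example
(let* ((multi (string (make-char 'latin-iso8859-1 72)))
       (bytes (string-as-unibyte multi)))
  (list (length multi)                          ; characters in multi
        (length bytes)                          ; one character per byte
        (length (string-as-multibyte bytes))))  ; bytes regrouped
@end example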

@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different character
codes. The valid character codes for unibyte representation range from
0 to 255---the values that can fit in one byte. The valid character
codes for multibyte representation range from 0 to 524287, but not all
values in that range are valid. The values 128 through 255 are not
entirely proper in multibyte text, but they can occur if you do explicit
encoding and decoding (@pxref{Explicit Encoding}). Some other character
codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes
0 through 127 are completely legitimate in both representations.

@defun char-valid-p charcode &optional genericp
This returns @code{t} if @var{charcode} is valid for either one of the two
text representations.

@example
(char-valid-p 65)
     @result{} t
(char-valid-p 256)
     @result{} nil
(char-valid-p 2248)
     @result{} t
@end example

If the optional argument @var{genericp} is non-@code{nil}, this
function also returns @code{t} if @var{charcode} is a generic
character (@pxref{Splitting Characters}).
@end defun

@node Character Sets
@section Character Sets
@cindex character sets

  Emacs classifies characters into various @dfn{character sets}, each of
which has a name which is a symbol. Each character belongs to one and
only one character set.

  In general, there is one character set for each distinct script. For
example, @code{latin-iso8859-1} is one character set,
@code{greek-iso8859-7} is another, and @code{ascii} is another. An
Emacs character set can hold at most 9025 characters; therefore, in some
cases, characters that would logically be grouped together are split
into several character sets. For example, one set of Chinese
characters, generally known as Big 5, is divided into two Emacs
character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.

  @acronym{ASCII} characters are in character set @code{ascii}. The
non-@acronym{ASCII} characters 128 through 159 are in character set
@code{eight-bit-control}, and codes 160 through 255 are in character set
@code{eight-bit-graphic}.

@defun charsetp object
Returns @code{t} if @var{object} is a symbol that names a character set,
@code{nil} otherwise.
@end defun

@defvar charset-list
The value is a list of all defined character set names.
@end defvar

@defun charset-list
This function returns the value of @code{charset-list}. It is only
provided for backward compatibility.
@end defun

@defun char-charset character
This function returns the name of the character set that @var{character}
belongs to, or the symbol @code{unknown} if @var{character} is not a
valid character.
@end defun

@defun charset-plist charset
@tindex charset-plist
This function returns the charset property list of the character set
@var{charset}. Although @var{charset} is a symbol, this is not the same
as the property list of that symbol. Charset properties are used for
special purposes within Emacs.
@end defun
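
  A few sample calls may help. This is a sketch using character codes
that appear elsewhere in this chapter; the results follow from the
character set assignments described in this section.

@example
(charsetp 'latin-iso8859-1)
     @result{} t
(char-charset 65)
     @result{} ascii
(char-charset 2248)
     @result{} latin-iso8859-1
(char-charset 128)
     @result{} eight-bit-control
@end example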

@node Chars and Bytes
@section Characters and Bytes
@cindex bytes and characters

@cindex introduction sequence
@cindex dimension (of character set)
  In multibyte representation, each character occupies one or more
bytes. Each character set has an @dfn{introduction sequence}, which is
normally one or two bytes long. (Exception: the @code{ascii} character
set and the @code{eight-bit-graphic} character set have a zero-length
introduction sequence.) The introduction sequence is the beginning of
the byte sequence for any character in the character set. The rest of
the character's bytes distinguish it from the other characters in the
same character set. Depending on the character set, there are either
one or two distinguishing bytes; the number of such bytes is called the
@dfn{dimension} of the character set.

@defun charset-dimension charset
This function returns the dimension of @var{charset}; at present, the
dimension is always 1 or 2.
@end defun

@defun charset-bytes charset
@tindex charset-bytes
This function returns the number of bytes used to represent a character
in character set @var{charset}.
@end defun

  This is the simplest way to determine the byte length of a character
set's introduction sequence:

@example
(- (charset-bytes @var{charset})
   (charset-dimension @var{charset}))
@end example

@node Splitting Characters
@section Splitting Characters

  The functions in this section convert between characters and the byte
values used to represent them. For most purposes, there is no need to
be concerned with the sequence of bytes used to represent a character,
because Emacs translates automatically when necessary.

@defun split-char character
Return a list containing the name of the character set of
@var{character}, followed by one or two byte values (integers) which
identify @var{character} within that character set. The number of byte
values is the character set's dimension.

If @var{character} is invalid as a character code, @code{split-char}
returns a list consisting of the symbol @code{unknown} and @var{character}.

@example
(split-char 2248)
     @result{} (latin-iso8859-1 72)
(split-char 65)
     @result{} (ascii 65)
(split-char 128)
     @result{} (eight-bit-control 128)
@end example
@end defun

@defun make-char charset &optional code1 code2
This function returns the character in character set @var{charset} whose
position codes are @var{code1} and @var{code2}. This is roughly the
inverse of @code{split-char}. Normally, you should specify either one
or both of @var{code1} and @var{code2} according to the dimension of
@var{charset}. For example,

@example
(make-char 'latin-iso8859-1 72)
     @result{} 2248
@end example

Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
before they are used to index @var{charset}. Thus you may use, for
instance, an ISO 8859 character code rather than subtracting 128, as
is necessary to index the corresponding Emacs charset.
@end defun
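
  Two consequences of the description of @code{make-char} can be
sketched as follows. Code 200 is 72 plus 128, so zeroing the eighth bit
gives the same position code as @code{(make-char 'latin-iso8859-1 72)},
and applying @code{make-char} to the list returned by @code{split-char}
recovers the original character.

@example
(make-char 'latin-iso8859-1 200)
     @result{} 2248
(apply 'make-char (split-char 2248))
     @result{} 2248
@end example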

@cindex generic characters
  If you call @code{make-char} with neither @var{code1} nor @var{code2},
the result is a @dfn{generic character} which stands for @var{charset}. A
generic character is an integer, but it is @emph{not} valid for insertion
in the buffer as a character. It can be used in @code{char-table-range}
to refer to the whole character set (@pxref{Char-Tables}).
@code{char-valid-p} returns @code{nil} for generic characters unless its
optional argument @var{genericp} is non-@code{nil}.
For example:

@example
(make-char 'latin-iso8859-1)
     @result{} 2176
(char-valid-p 2176)
     @result{} nil
(char-valid-p 2176 t)
     @result{} t
(split-char 2176)
     @result{} (latin-iso8859-1 0)
@end example

The character sets @code{ascii}, @code{eight-bit-control}, and
@code{eight-bit-graphic} don't have corresponding generic characters. If
@var{charset} is one of them and you don't supply @var{code1},
@code{make-char} returns the character code corresponding to the
smallest code in @var{charset}.

@node Scanning Charsets
@section Scanning for Character Sets

  Sometimes it is useful to find out which character sets appear in a
part of a buffer or a string. One use for this is in determining which
coding systems (@pxref{Coding Systems}) are capable of representing all
of the text in question.

@defun find-charset-region beg end &optional translation
This function returns a list of the character sets that appear in the
current buffer between positions @var{beg} and @var{end}.

The optional argument @var{translation} specifies a translation table to
be used in scanning the text (@pxref{Translation of Characters}). If it
is non-@code{nil}, then each character in the region is translated
through this table, and the value returned describes the translated
characters instead of the characters actually in the buffer.
@end defun

@defun find-charset-string string &optional translation
This function returns a list of the character sets that appear in the
string @var{string}. It is just like @code{find-charset-region}, except
that it applies to the contents of @var{string} instead of part of the
current buffer.
@end defun
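
  As a sketch of one use, the following helper checks whether a string
contains anything beyond @acronym{ASCII}. The name
@code{my-ascii-only-p} is hypothetical, chosen only for this
illustration.

@example
(defun my-ascii-only-p (string)
  "Return non-nil if STRING uses no character set other than `ascii'."
  (let ((charsets (find-charset-string string))
        (ascii-only t))
    ;; Any charset other than `ascii' disqualifies the string.
    (while charsets
      (unless (eq (car charsets) 'ascii)
        (setq ascii-only nil))
      (setq charsets (cdr charsets)))
    ascii-only))

(my-ascii-only-p "abc")
     @result{} t
(my-ascii-only-p (string (make-char 'latin-iso8859-1 72)))
     @result{} nil
@end example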

@node Translation of Characters
@section Translation of Characters
@cindex character translation tables
@cindex translation tables

  A @dfn{translation table} is a char-table that specifies a mapping
of characters into characters. These tables are used in encoding and
decoding, and for other purposes. Some coding systems specify their
own particular translation tables; there are also default translation
tables which apply to all other coding systems.

  For instance, the coding-system @code{utf-8} has a translation table
that maps characters of various charsets (e.g.,
@code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
it can encode Latin-2 characters into UTF-8. Meanwhile,
@code{unify-8859-on-decoding-mode} operates by specifying
@code{standard-translation-table-for-decode} to translate
Latin-@var{x} characters into corresponding Unicode characters.

@defun make-translation-table &rest translations
This function returns a translation table based on the argument
@var{translations}. Each element of @var{translations} should be a
list of elements of the form @code{(@var{from} . @var{to})}; this says
to translate the character @var{from} into @var{to}.

The arguments and the forms in each argument are processed in order,
and if a previous form already translates @var{to} to some other
character, say @var{to-alt}, @var{from} is also translated to
@var{to-alt}.

You can also map one whole character set into another character set with
the same dimension. To do this, you specify a generic character (which
designates a character set) for @var{from} (@pxref{Splitting Characters}).
In this case, if @var{to} is also a generic character, its character
set should have the same dimension as @var{from}'s. Then the
translation table translates each character of @var{from}'s character
set into the corresponding character of @var{to}'s character set. If
@var{from} is a generic character and @var{to} is an ordinary
character, then the translation table translates every character of
@var{from}'s character set into @var{to}.
@end defun
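
  A minimal sketch of the calling convention: each argument is an
alist, and since the result is a char-table it can be examined with
@code{aref}. The mapping from @samp{a} to @samp{b} is an arbitrary
example.

@example
(let ((table (make-translation-table '((?a . ?b)))))
  ;; Look up the translation recorded for `a'.
  (aref table ?a))
     @result{} 98
@end example

@noindent
Here 98 is the character code of @samp{b}.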

  In decoding, the translation table's translations are applied to the
characters that result from ordinary decoding. If a coding system has
property @code{translation-table-for-decode}, that specifies the
translation table to use. (This is a property of the coding system,
as returned by @code{coding-system-get}, not a property of the symbol
that is the coding system's name. @xref{Coding System Basics,, Basic
Concepts of Coding Systems}.) Otherwise, if
@code{standard-translation-table-for-decode} is non-@code{nil},
decoding uses that table.

  In encoding, the translation table's translations are applied to the
characters in the buffer, and the result of translation is actually
encoded. If a coding system has property
@code{translation-table-for-encode}, that specifies the translation
table to use. Otherwise the variable
@code{standard-translation-table-for-encode} specifies the translation
table.

@defvar standard-translation-table-for-decode
This is the default translation table for decoding, for
coding systems that don't specify any other translation table.
@end defvar

@defvar standard-translation-table-for-encode
This is the default translation table for encoding, for
coding systems that don't specify any other translation table.
@end defvar

@defvar translation-table-for-input
Self-inserting characters are translated through this translation
table before they are inserted. This variable automatically becomes
buffer-local when set.

@code{set-buffer-file-coding-system} sets this variable so that your
keyboard input gets translated into the character sets that the buffer
is likely to contain.
@end defvar

@node Coding Systems
@section Coding Systems

@cindex coding system
  When Emacs reads or writes a file, and when Emacs sends text to a
subprocess or receives text from a subprocess, it normally performs
character code conversion and end-of-line conversion as specified
by a particular @dfn{coding system}.

  How to define a coding system is an arcane matter, and is not
documented here.

@menu
* Coding System Basics::        Basic concepts.
* Encoding and I/O::            How file I/O functions handle coding systems.
* Lisp and Coding Systems::     Functions to operate on coding system names.
* User-Chosen Coding Systems::  Asking the user to choose a coding system.
* Default Coding Systems::      Controlling the default choices.
* Specifying Coding Systems::   Requesting a particular coding system
                                    for a single file operation.
* Explicit Encoding::           Encoding or decoding text without doing I/O.
* Terminal I/O Encoding::       Use of encoding for terminal I/O.
* MS-DOS File Types::           How DOS "text" and "binary" files
                                    relate to coding systems.
@end menu

@node Coding System Basics
@subsection Basic Concepts of Coding Systems

@cindex character code conversion
  @dfn{Character code conversion} involves conversion between the encoding
used inside Emacs and some other encoding. Emacs supports many
different encodings, in that it can convert to and from them. For
example, it can convert text to or from encodings such as Latin 1, Latin
2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
cases, Emacs supports several alternative encodings for the same
characters; for example, there are three coding systems for the Cyrillic
(Russian) alphabet: ISO, Alternativnyj, and KOI8.

  Most coding systems specify a particular character code for
conversion, but some of them leave the choice unspecified---to be chosen
heuristically for each file, based on the data.

@cindex end of line conversion
  @dfn{End of line conversion} handles three different conventions used
on various systems for representing end of line in files. The Unix
convention is to use the linefeed character (also called newline). The
DOS convention is to use a carriage-return and a linefeed at the end of
a line. The Mac convention is to use just carriage-return.

@cindex base coding system
@cindex variant coding system
  @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
conversion unspecified, to be chosen based on the data. @dfn{Variant
coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
@code{latin-1-mac} specify the end-of-line conversion explicitly as
well. Most base coding systems have three corresponding variants whose
names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
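
  One way to see which end-of-line conversion a variant specifies is
the function @code{coding-system-eol-type}, which is not described in
this section, so treat the following as a sketch. For a variant coding
system it returns 0, 1 or 2 for the Unix, DOS and Mac conventions
respectively.

@example
(coding-system-eol-type 'latin-1-unix)
     @result{} 0
(coding-system-eol-type 'latin-1-dos)
     @result{} 1
(coding-system-eol-type 'latin-1-mac)
     @result{} 2
@end example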
633
a9f0a989
RS
634 The coding system @code{raw-text} is special in that it prevents
635character code conversion, and causes the buffer visited with that
636coding system to be a unibyte buffer. It does not specify the
637end-of-line conversion, allowing that to be determined as usual by the
638data, and has the usual three variants which specify the end-of-line
639conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
640it specifies no conversion of either character codes or end-of-line.
641
642 The coding system @code{emacs-mule} specifies that the data is
643represented in the internal Emacs encoding. This is like
644@code{raw-text} in that no code conversion happens, but different in
645that the result is multibyte data.

@defun coding-system-get coding-system property
This function returns the specified property of the coding system
@var{coding-system}. Most coding system properties exist for internal
purposes, but one that you might find useful is @code{mime-charset}.
That property's value is the name used in MIME for the character coding
which this coding system can read and write. Examples:

@example
(coding-system-get 'iso-latin-1 'mime-charset)
     @result{} iso-8859-1
(coding-system-get 'iso-2022-cn 'mime-charset)
     @result{} iso-2022-cn
(coding-system-get 'cyrillic-koi8 'mime-charset)
     @result{} koi8-r
@end example

The value of the @code{mime-charset} property is also defined
as an alias for the coding system.
@end defun

@node Encoding and I/O
@subsection Encoding and I/O

  The principal purpose of coding systems is for use in reading and
writing files. The function @code{insert-file-contents} uses
a coding system for decoding the file data, and @code{write-region}
uses one to encode the buffer contents.

  You can specify the coding system to use either explicitly
(@pxref{Specifying Coding Systems}), or implicitly using the defaulting
mechanism (@pxref{Default Coding Systems}). But these methods may not
completely specify what to do. For example, they may choose a coding
system such as @code{undecided} which leaves the character code
conversion to be determined from the data. In these cases, the I/O
operation finishes the job of choosing a coding system. Very often
you will want to find out afterwards which coding system was chosen.

@defvar buffer-file-coding-system
This variable records the coding system that was used for visiting the
current buffer. It is used for saving the buffer, and for writing part
of the buffer with @code{write-region}. If the text to be written
cannot be safely encoded using the coding system specified by this
variable, these operations select an alternative encoding by calling
the function @code{select-safe-coding-system} (@pxref{User-Chosen
Coding Systems}). If selecting a different encoding requires asking
the user to specify a coding system, @code{buffer-file-coding-system}
is updated to the newly selected coding system.

@code{buffer-file-coding-system} does @emph{not} affect sending text
to a subprocess.
@end defvar

@defvar save-buffer-coding-system
This variable specifies the coding system for saving the buffer (by
overriding @code{buffer-file-coding-system}). Note that it is not used
for @code{write-region}.

When a command to save the buffer starts out to use
@code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
and that coding system cannot handle
the actual text in the buffer, the command asks the user to choose
another coding system (by calling @code{select-safe-coding-system}).
After that happens, the command also updates
@code{buffer-file-coding-system} to represent the coding system that
the user specified.
@end defvar

@defvar last-coding-system-used
I/O operations for files and subprocesses set this variable to the
coding system name that was used. The explicit encoding and decoding
functions (@pxref{Explicit Encoding}) set it too.

@strong{Warning:} Since receiving subprocess output sets this variable,
it can change whenever Emacs waits; therefore, you should copy the
value shortly after the function call that stores the value you are
interested in.
@end defvar
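
  The following sketch, offered only as an illustration, shows one way
to heed the warning about @code{last-coding-system-used}: it copies the
value into a local variable right after the call to
@code{insert-file-contents}. The function name
@code{my-insert-file-noting-coding} is hypothetical and not part of Emacs.

@example
(defun my-insert-file-noting-coding (filename)
  "Insert FILENAME at point; return the coding system used to decode it."
  (insert-file-contents filename)
  ;; @r{Copy the value at once, before Emacs has a chance to wait.}
  (let ((coding last-coding-system-used))
    (message "Decoded %s using %s" filename coding)
    coding))
@end example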

  The variable @code{selection-coding-system} specifies how to encode
selections for the window system. @xref{Window System Selections}.

@defvar file-name-coding-system
The variable @code{file-name-coding-system} specifies the coding
system to use for encoding file names. Emacs encodes file names using
that coding system for all file operations. If
@code{file-name-coding-system} is @code{nil}, Emacs uses a default
coding system determined by the selected language environment. In the
default language environment, any non-@acronym{ASCII} characters in
file names are not encoded specially; they appear in the file system
using the internal Emacs representation.
@end defvar

  @strong{Warning:} if you change @code{file-name-coding-system} (or
the language environment) in the middle of an Emacs session, problems
can result if you have already visited files whose names were encoded
using the earlier coding system and are handled differently under the
new coding system. If you try to save one of these buffers under the
visited file name, saving may use the wrong file name, or it may get
an error. If such a problem happens, use @kbd{C-x C-w} to specify a
new file name for that buffer.

@node Lisp and Coding Systems
@subsection Coding Systems in Lisp

  Here are the Lisp facilities for working with coding systems:

@defun coding-system-list &optional base-only
This function returns a list of all coding system names (symbols). If
@var{base-only} is non-@code{nil}, the value includes only the
base coding systems. Otherwise, it includes alias and variant coding
systems as well.
@end defun

@defun coding-system-p object
This function returns @code{t} if @var{object} is a coding system
name or @code{nil}.
@end defun

@defun check-coding-system coding-system
This function checks the validity of @var{coding-system}.
If that is valid, it returns @var{coding-system}.
Otherwise it signals an error with condition @code{coding-system-error}.
@end defun
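
  As a brief illustration of @code{coding-system-p} and
@code{check-coding-system}, and assuming that @code{iso-latin-1} is
defined (as in a standard Emacs) while nothing is named
@code{no-such-coding} (a made-up name), one would expect:

@example
(coding-system-p 'iso-latin-1)
     @result{} t
(coding-system-p nil)
     @result{} t
(check-coding-system 'iso-latin-1)
     @result{} iso-latin-1
;; @r{The name no-such-coding is made up and assumed undefined.}
(condition-case nil
    (check-coding-system 'no-such-coding)
  (coding-system-error 'invalid))
     @result{} invalid
@end example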

@defun coding-system-change-eol-conversion coding-system eol-type
This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @var{eol-type}.
@var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
@code{nil}. If it is @code{nil}, the returned coding system determines
the end-of-line conversion from the data.

@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
@code{dos} and @code{mac}, respectively.
@end defun
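
  As a sketch of @code{coding-system-change-eol-conversion}, assuming
the standard @code{iso-latin-1} coding system and its three variants
are defined, the symbolic and numeric forms of the end-of-line argument
should give the same result:

@example
(coding-system-change-eol-conversion 'iso-latin-1 'dos)
     @result{} iso-latin-1-dos
(coding-system-change-eol-conversion 'iso-latin-1 1)
     @result{} iso-latin-1-dos
(coding-system-change-eol-conversion 'iso-latin-1-dos 'unix)
     @result{} iso-latin-1-unix
@end example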

@defun coding-system-change-text-conversion eol-coding text-coding
This function returns a coding system which uses the end-of-line
conversion of @var{eol-coding}, and the text conversion of
@var{text-coding}. If @var{text-coding} is @code{nil}, it returns
@code{undecided}, or one of its variants according to @var{eol-coding}.
@end defun
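
  For @code{coding-system-change-text-conversion}, a small example may
help; it assumes that the standard @code{iso-latin-1} and
@code{iso-latin-2} coding systems are available:

@example
;; @r{Keep the DOS end-of-line conversion, switch the text conversion.}
(coding-system-change-text-conversion 'iso-latin-1-dos 'iso-latin-2)
     @result{} iso-latin-2-dos
;; @r{With nil, only the end-of-line conversion is carried over.}
(coding-system-change-text-conversion 'iso-latin-1-dos nil)
     @result{} undecided-dos
@end example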

@defun find-coding-systems-region from to
This function returns a list of coding systems that could be used to
encode the text between @var{from} and @var{to}. All coding systems in
the list can safely encode any multibyte characters in that portion of
the text.

If the text contains no multibyte characters, the function returns the
list @code{(undecided)}.
@end defun

@defun find-coding-systems-string string
This function returns a list of coding systems that could be used to
encode the text of @var{string}. All coding systems in the list can
safely encode any multibyte characters in @var{string}. If the text
contains no multibyte characters, this returns the list
@code{(undecided)}.
@end defun
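
  The special case of text without multibyte characters can be checked
directly. This is only a sketch; the sample text is arbitrary:

@example
(find-coding-systems-string "plain ASCII text")
     @result{} (undecided)

(with-temp-buffer
  (insert "plain ASCII text")
  (find-coding-systems-region (point-min) (point-max)))
     @result{} (undecided)
@end example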

@defun find-coding-systems-for-charsets charsets
This function returns a list of coding systems that could be used to
encode all the character sets in the list @var{charsets}.
@end defun

@defun detect-coding-region start end &optional highest
This function chooses a plausible coding system for decoding the text
from @var{start} to @var{end}. This text should be a byte sequence
(@pxref{Explicit Encoding}).

Normally this function returns a list of coding systems that could
handle decoding the text that was scanned. They are listed in order of
decreasing priority. But if @var{highest} is non-@code{nil}, then the
return value is just one coding system, the one that is highest in
priority.

If the region contains only @acronym{ASCII} characters, the value
is @code{undecided} or @code{(undecided)}, or a variant specifying
end-of-line conversion, if that can be deduced from the text.
@end defun

@defun detect-coding-string string &optional highest
This function is like @code{detect-coding-region} except that it
operates on the contents of @var{string} instead of bytes in the buffer.
@end defun
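
  To illustrate the @acronym{ASCII}-only case described for
@code{detect-coding-region}, here is a sketch using
@code{detect-coding-string} on a string with no line breaks, so that no
end-of-line convention can be deduced. The exact values may depend on
the Emacs version and the user's settings:

@example
(detect-coding-string "hello" t)
     @result{} undecided
(detect-coding-string "hello")
     @result{} (undecided)
@end example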

  @xref{Coding systems for a subprocess,, Process Information}, in
particular the description of the functions
@code{process-coding-system} and @code{set-process-coding-system}, for
how to examine or set the coding systems used for I/O to a subprocess.

@node User-Chosen Coding Systems
@subsection User-Chosen Coding Systems

@cindex select safe coding system
@defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
This function selects a coding system for encoding specified text,
asking the user to choose if necessary. Normally the specified text
is the text in the current buffer between @var{from} and @var{to}. If
@var{from} is a string, the string specifies the text to encode, and
@var{to} is ignored.

If @var{default-coding-system} is non-@code{nil}, that is the first
coding system to try; if that can handle the text,
@code{select-safe-coding-system} returns that coding system. It can
also be a list of coding systems; then the function tries each of them
one by one. After trying all of them, it next tries the current
buffer's value of @code{buffer-file-coding-system} (if it is not
@code{undecided}), then the value of
@code{default-buffer-file-coding-system} and finally the user's most
preferred coding system, which the user can set using the command
@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
Coding Systems, emacs, The GNU Emacs Manual}).

If one of those coding systems can safely encode all the specified
text, @code{select-safe-coding-system} chooses it and returns it.
Otherwise, it asks the user to choose from a list of coding systems
which can encode all the text, and returns the user's choice.

@var{default-coding-system} can also be a list whose first element is
@code{t} and whose other elements are coding systems. Then, if no coding
system in the list can handle the text, @code{select-safe-coding-system}
queries the user immediately, without trying any of the three
alternatives described above.

The optional argument @var{accept-default-p}, if non-@code{nil},
should be a function to determine whether a coding system selected
without user interaction is acceptable. @code{select-safe-coding-system}
calls this function with one argument, the base coding system of the
selected coding system. If @var{accept-default-p} returns @code{nil},
@code{select-safe-coding-system} rejects the silently selected coding
system, and asks the user to select a coding system from a list of
possible candidates.

@vindex select-safe-coding-system-accept-default-p
If the variable @code{select-safe-coding-system-accept-default-p} is
non-@code{nil}, its value overrides the value of
@var{accept-default-p}.

As a final step, before returning the chosen coding system,
@code{select-safe-coding-system} checks whether that coding system is
consistent with what would be selected if the contents of the region
were read from a file. (If not, this could lead to data corruption in
a file subsequently re-visited and edited.) Normally,
@code{select-safe-coding-system} uses @code{buffer-file-name} as the
file for this purpose, but if @var{file} is non-@code{nil}, it uses
that file instead (this can be relevant for @code{write-region} and
similar functions). If it detects an apparent inconsistency,
@code{select-safe-coding-system} queries the user before selecting the
coding system.
@end defun
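
  A minimal sketch of calling @code{select-safe-coding-system} from
Lisp follows. The function name @code{my-choose-coding-for-buffer} and
the particular list of candidates are hypothetical; the call may prompt
the user if none of the candidates, nor the fallbacks described above,
can encode the buffer text.

@example
(defun my-choose-coding-for-buffer ()
  "Return a coding system able to encode the whole current buffer."
  (select-safe-coding-system (point-min) (point-max)
                             '(iso-latin-1 iso-latin-2)))
@end example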

  Here are two functions you can use to let the user specify a coding
system, with completion. @xref{Completion}.

@defun read-coding-system prompt &optional default
This function reads a coding system using the minibuffer, prompting with
string @var{prompt}, and returns the coding system name as a symbol. If
the user enters null input, @var{default} specifies which coding system
to return. It should be a symbol or a string.
@end defun

@defun read-non-nil-coding-system prompt
This function reads a coding system using the minibuffer, prompting with
string @var{prompt}, and returns the coding system name as a symbol. If
the user tries to enter null input, it asks the user to try again.
@xref{Coding Systems}.
@end defun

@node Default Coding Systems
@subsection Default Coding Systems

  This section describes variables that specify the default coding
system for certain files or when running certain subprograms, and the
function that I/O operations use to access them.

  The idea of these variables is that you set them once and for all to the
defaults you want, and then do not change them again. To specify a
particular coding system for a particular operation in a Lisp program,
don't change these variables; instead, override them using
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}).

@defvar auto-coding-regexp-alist
This variable is an alist of text patterns and corresponding coding
systems. Each element has the form @code{(@var{regexp}
. @var{coding-system})}; a file whose first few kilobytes match
@var{regexp} is decoded with @var{coding-system} when its contents are
read into a buffer. The settings in this alist take priority over
@code{coding:} tags in the files and the contents of
@code{file-coding-system-alist} (see below). The default value is set
so that Emacs automatically recognizes mail files in Babyl format and
reads them with no code conversions.
@end defvar

@defvar file-coding-system-alist
This variable is an alist that specifies the coding systems to use for
reading and writing particular files. Each element has the form
@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
expression that matches certain file names. The element applies to file
names that match @var{pattern}.

The @sc{cdr} of the element, @var{coding}, should be either a coding
system, a cons cell containing two coding systems, or a function name (a
symbol with a function definition). If @var{coding} is a coding system,
that coding system is used for both reading the file and writing it. If
@var{coding} is a cons cell containing two coding systems, its @sc{car}
specifies the coding system for decoding, and its @sc{cdr} specifies the
coding system for encoding.

If @var{coding} is a function name, the function should take one
argument, a list of all arguments passed to
@code{find-operation-coding-system}. It must return a coding system
or a cons cell containing two coding systems. This value has the same
meaning as described above.
@end defvar
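
  As an illustration of the two simplest element forms of
@code{file-coding-system-alist}, the following sketch (of the kind that
might go in an init file) prepends two entries. The file-name patterns
@samp{.l1} and @samp{.msg} are hypothetical, chosen only for the
example.

@example
;; @r{Hypothetical entries, for illustration only.}
(setq file-coding-system-alist
      (append '(;; @r{One coding system for reading and writing.}
                ("\\.l1\\'" . iso-latin-1)
                ;; @r{Separate coding systems: (decoding . encoding).}
                ("\\.msg\\'" . (undecided . iso-latin-1-unix)))
              file-coding-system-alist))
@end example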

@defvar process-coding-system-alist
This variable is an alist specifying which coding systems to use for a
subprocess, depending on which program is running in the subprocess. It
works like @code{file-coding-system-alist}, except that @var{pattern} is
matched against the program name used to start the subprocess. The coding
system or systems specified in this alist are used to initialize the
coding systems used for I/O to the subprocess, but you can specify
other coding systems later using @code{set-process-coding-system}.
@end defvar

  @strong{Warning:} Coding systems such as @code{undecided}, which
determine the coding system from the data, do not work entirely reliably
with asynchronous subprocess output. This is because Emacs handles
asynchronous subprocess output in batches, as it arrives. If the coding
system leaves the character code conversion unspecified, or leaves the
end-of-line conversion unspecified, Emacs must try to detect the proper
conversion from one batch at a time, and this does not always work.

  Therefore, with an asynchronous subprocess, if at all possible, use a
coding system which determines both the character code conversion and
the end of line conversion---that is, one like @code{latin-1-unix},
rather than @code{undecided} or @code{latin-1}.
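
  One way to follow this advice, sketched here under the assumption
that a program named @samp{cat} is available, is to bind
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}) to fully specified coding systems
around the call that starts the subprocess. The process name is a
placeholder.

@example
;; @r{The process name "my-cat" and the program "cat" are placeholders.}
(let ((coding-system-for-read 'iso-latin-1-unix)
      (coding-system-for-write 'iso-latin-1-unix))
  (start-process "my-cat" nil "cat"))
@end example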

@defvar network-coding-system-alist
This variable is an alist that specifies the coding system to use for
network streams. It works much like @code{file-coding-system-alist},
with the difference that the @var{pattern} in an element may be either a
port number or a regular expression. If it is a regular expression, it
is matched against the network service name used to open the network
stream.
@end defvar

@defvar default-process-coding-system
This variable specifies the coding systems to use for subprocess (and
network stream) input and output, when nothing else specifies what to
do.

The value should be a cons cell of the form @code{(@var{input-coding}
. @var{output-coding})}. Here @var{input-coding} applies to input from
the subprocess, and @var{output-coding} applies to output to it.
@end defvar

@defvar auto-coding-functions
This variable holds a list of functions that try to determine a
coding system for a file based on its undecoded contents.

Each function in this list should be written to look at text in the
current buffer, but should not modify it in any way. The buffer will
contain undecoded text of parts of the file. Each function should
take one argument, @var{size}, which tells it how many characters to
look at, starting from point. If the function succeeds in determining
a coding system for the file, it should return that coding system.
Otherwise, it should return @code{nil}.

If a file has a @samp{coding:} tag, that takes precedence, so these
functions won't be called.
@end defvar
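
  A function suitable for @code{auto-coding-functions} might look like
the following sketch. Both the function name
@code{my-auto-coding-from-marker} and the marker text it searches for
are made up for the example; the function neither modifies the buffer
nor moves point.

@example
(defun my-auto-coding-from-marker (size)
  "Return `iso-latin-1' if a made-up marker occurs within SIZE characters."
  (save-excursion
    (let ((limit (min (point-max) (+ (point) size))))
      ;; @r{The marker text is hypothetical.}
      (if (search-forward "%%encoding: latin-1" limit t)
          'iso-latin-1))))

(add-to-list 'auto-coding-functions 'my-auto-coding-from-marker)
@end example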

@defun find-operation-coding-system operation &rest arguments
This function returns the coding system to use (by default) for
performing @var{operation} with @var{arguments}. The value has this
form:

@example
(@var{decoding-system} . @var{encoding-system})
@end example

The first element, @var{decoding-system}, is the coding system to use
for decoding (in case @var{operation} does decoding), and
@var{encoding-system} is the coding system for encoding (in case
@var{operation} does encoding).

The argument @var{operation} should be a symbol, one of
@code{insert-file-contents}, @code{write-region}, @code{call-process},
@code{call-process-region}, @code{start-process}, or
@code{open-network-stream}. These are the names of the Emacs I/O primitives
that can do coding system conversion.

The remaining arguments should be the same arguments that might be given
to that I/O primitive. Depending on the primitive, one of those
arguments is selected as the @dfn{target}. For example, if
@var{operation} does file I/O, whichever argument specifies the file
name is the target. For subprocess primitives, the name of the program
to run is the target. For @code{open-network-stream}, the target is the
service name or port number.

This function looks up the target in @code{file-coding-system-alist},
@code{process-coding-system-alist}, or
@code{network-coding-system-alist}, depending on @var{operation}.
@end defun
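
  The lookup done by @code{find-operation-coding-system} can be tried
out with a temporary binding. In this sketch the alist entry and the
file name are hypothetical, and the result shown assumes that
@code{iso-latin-1} is a defined coding system:

@example
(let ((file-coding-system-alist '(("\\.l1\\'" . iso-latin-1))))
  (find-operation-coding-system 'insert-file-contents
                                "/tmp/notes.l1"))
     @result{} (iso-latin-1 . iso-latin-1)
@end example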

@node Specifying Coding Systems
@subsection Specifying a Coding System for One Operation

  You can specify the coding system for a specific operation by binding
the variables @code{coding-system-for-read} and/or
@code{coding-system-for-write}.

@defvar coding-system-for-read
If this variable is non-@code{nil}, it specifies the coding system to
use for reading a file, or for input from a synchronous subprocess.

It also applies to any asynchronous subprocess or network stream, but in
a different way: the value of @code{coding-system-for-read} when you
start the subprocess or open the network stream specifies the input
decoding method for that subprocess or network stream. It remains in
use for that subprocess or network stream unless and until overridden.

The right way to use this variable is to bind it with @code{let} for a
specific I/O operation. Its global value is normally @code{nil}, and
you should not globally set it to any other value. Here is an example
of the right way to use the variable:

@example
;; @r{Read the file with no character code conversion.}
;; @r{Assume @acronym{crlf} represents end-of-line.}
(let ((coding-system-for-read 'emacs-mule-dos))
  (insert-file-contents filename))
@end example

When its value is non-@code{nil}, @code{coding-system-for-read} takes
precedence over all other methods of specifying a coding system to use for
input, including @code{file-coding-system-alist},
@code{process-coding-system-alist} and
@code{network-coding-system-alist}.
@end defvar

@defvar coding-system-for-write
This works much like @code{coding-system-for-read}, except that it
applies to output rather than input. It affects writing to files,
as well as sending output to subprocesses and net connections.

When a single operation does both input and output, as do
@code{call-process-region} and @code{start-process}, both
@code{coding-system-for-read} and @code{coding-system-for-write}
affect it.
@end defvar
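
  By analogy with the example given for @code{coding-system-for-read},
here is a sketch of binding @code{coding-system-for-write} for a single
output operation. The file name is a placeholder.

@example
;; @r{Write the buffer as Latin-1 with Unix-style line endings.}
;; @r{The file name is a placeholder.}
(let ((coding-system-for-write 'iso-latin-1-unix))
  (write-region (point-min) (point-max) "/tmp/example-output.txt"))
@end example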

@defvar inhibit-eol-conversion
When this variable is non-@code{nil}, no end-of-line conversion is done,
no matter which coding system is specified. This applies to all the
Emacs I/O and subprocess primitives, and to the explicit encoding and
decoding functions (@pxref{Explicit Encoding}).
@end defvar

@node Explicit Encoding
@subsection Explicit Encoding and Decoding
@cindex encoding text
@cindex decoding text

  All the operations that transfer text in and out of Emacs have the
ability to use a coding system to encode or decode the text.
You can also explicitly encode and decode text using the functions
in this section.

  The result of encoding, and the input to decoding, are not ordinary
text. They logically consist of a series of byte values; that is, a
series of characters whose codes are in the range 0 through 255. In a
multibyte buffer or string, character codes 128 through 159 are
represented by multibyte sequences, but this is invisible to Lisp
programs.

  The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
@code{insert-file-contents-literally} (@pxref{Reading from Files});
alternatively, specify a non-@code{nil} @var{rawfile} argument when
visiting a file with @code{find-file-noselect}. These methods result in
a unibyte buffer.

  The usual way to use the byte sequence that results from explicitly
encoding text is to copy it to a file or process---for example, to write
it with @code{write-region} (@pxref{Writing to Files}), and suppress
encoding by binding @code{coding-system-for-write} to
@code{no-conversion}.

  Here are the functions to perform explicit encoding or decoding. The
encoding functions produce sequences of bytes; the decoding functions
are meant to operate on sequences of bytes. All of these functions
discard text properties.

@deffn Command encode-coding-region start end coding-system
This command encodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}. The encoded text replaces the
original text in the buffer. The result of encoding is logically a
sequence of bytes, but the buffer remains multibyte if it was multibyte
before.

This command returns the length of the encoded text.
@end deffn

@defun encode-coding-string string coding-system &optional nocopy
This function encodes the text in @var{string} according to coding
system @var{coding-system}. It returns a new string containing the
encoded text, except when @var{nocopy} is non-@code{nil}, in which
case the function may return @var{string} itself if the encoding
operation is trivial. The result of encoding is a unibyte string.
@end defun

@deffn Command decode-coding-region start end coding-system
This command decodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}. The decoded text replaces the
original text in the buffer. To make explicit decoding useful, the text
before decoding ought to be a sequence of byte values, but both
multibyte and unibyte buffers are acceptable.

This command returns the length of the decoded text.
@end deffn

@defun decode-coding-string string coding-system &optional nocopy
This function decodes the text in @var{string} according to coding
system @var{coding-system}. It returns a new string containing the
decoded text, except when @var{nocopy} is non-@code{nil}, in which
case the function may return @var{string} itself if the decoding
operation is trivial. To make explicit decoding useful, the contents
of @var{string} ought to be a sequence of byte values, but a multibyte
string is acceptable.
@end defun
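
  As a small round-trip sketch of @code{encode-coding-string} and
@code{decode-coding-string}, using an arbitrary @acronym{ASCII} sample
string and assuming @code{iso-latin-1} is defined:

@example
(let* ((bytes (encode-coding-string "abc" 'iso-latin-1))
       (text (decode-coding-string bytes 'iso-latin-1)))
  ;; @r{The encoded result is unibyte; decoding restores the text.}
  (list (multibyte-string-p bytes) text))
     @result{} (nil "abc")
@end example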

@defun decode-coding-inserted-region from to filename &optional visit beg end replace
This function decodes the text from @var{from} to @var{to} as if
it were being read from file @var{filename} using @code{insert-file-contents}
with the rest of the arguments provided.

The normal way to use this function is after reading text from a file
without decoding, if you decide you would rather have decoded it.
Instead of deleting the text and reading it again, this time with
decoding, you can call this function.
@end defun

@node Terminal I/O Encoding
@subsection Terminal I/O Encoding

  Emacs can decode keyboard input using a coding system, and encode
terminal output. This is useful for terminals that transmit or display
text using a particular encoding such as Latin-1. Emacs does not set
@code{last-coding-system-used} for encoding or decoding for the
terminal.

@defun keyboard-coding-system
This function returns the coding system that is in use for decoding
keyboard input---or @code{nil} if no coding system is to be used.
@end defun

@deffn Command set-keyboard-coding-system coding-system
This command specifies @var{coding-system} as the coding system to
use for decoding keyboard input. If @var{coding-system} is @code{nil},
that means do not decode keyboard input.
@end deffn

@defun terminal-coding-system
This function returns the coding system that is in use for encoding
terminal output---or @code{nil} for no encoding.
@end defun

@deffn Command set-terminal-coding-system coding-system
This command specifies @var{coding-system} as the coding system to use
for encoding terminal output. If @var{coding-system} is @code{nil},
that means do not encode terminal output.
@end deffn

@node MS-DOS File Types
@subsection MS-DOS File Types
@cindex DOS file types
@cindex MS-DOS file types
@cindex Windows file types
@cindex file types on MS-DOS and Windows
@cindex text files and binary files
@cindex binary files and text files

  On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
end-of-line conversion for a file by looking at the file's name. This
feature classifies files as @dfn{text files} and @dfn{binary files}. By
``binary file'' we mean a file of literal byte values that are not
necessarily meant to be characters; Emacs does no end-of-line conversion
and no character code conversion for them. On the other hand, the bytes
in a text file are intended to represent characters; when you create a
new file whose name implies that it is a text file, Emacs uses DOS
end-of-line conversion.

@defvar buffer-file-type
This variable, automatically buffer-local in each buffer, records the
file type of the buffer's visited file. When a buffer does not specify
a coding system with @code{buffer-file-coding-system}, this variable is
used to determine which coding system to use when writing the contents
of the buffer. It should be @code{nil} for text, @code{t} for binary.
If it is @code{t}, the coding system is @code{no-conversion}.
Otherwise, @code{undecided-dos} is used.

Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
@end defvar

@defopt file-name-buffer-file-type-alist
This variable holds an alist for recognizing text and binary files.
Each element has the form @code{(@var{regexp} . @var{type})}, where
@var{regexp} is matched against the file name, and @var{type} may be
@code{nil} for text, @code{t} for binary, or a function to call to
compute which. If it is a function, then it is called with a single
argument (the file name) and should return @code{t} or @code{nil}.

When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
which coding system to use when reading a file. For a text file,
@code{undecided-dos} is used. For a binary file, @code{no-conversion}
is used.

If no element in this alist matches a given file name, then
@code{default-buffer-file-type} says how to treat the file.
@end defopt

@defopt default-buffer-file-type
This variable says how to handle files for which
@code{file-name-buffer-file-type-alist} says nothing about the type.

If this variable is non-@code{nil}, then these files are treated as
binary: the coding system @code{no-conversion} is used. Otherwise,
nothing special is done for them: the coding system is deduced solely
from the file contents, in the usual Emacs fashion.
@end defopt
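
  The lookup described above can be sketched in Lisp as follows. This
is only an illustration of the documented behavior, not the code Emacs
itself uses; the function name @code{my-file-type-for-name} and its
arguments are hypothetical. It takes the alist and the default as
explicit arguments, so it does not depend on the two variables being
defined.

@example
(defun my-file-type-for-name (name alist default)
  "Return t if NAME should be treated as binary, nil if as text.
ALIST has elements of the form (REGEXP . TYPE), where TYPE is t,
nil, or a function of one argument (the file name).
DEFAULT is returned when no element of ALIST matches NAME."
  (let ((tail alist)
        (found nil)
        (type default))
    ;; Stop at the first element whose regexp matches NAME.
    (while (and tail (not found))
      (when (string-match (car (car tail)) name)
        (setq found t
              type (cdr (car tail))))
      (setq tail (cdr tail)))
    ;; A function in the alist computes the type from the file name.
    (if (and found (functionp type))
        (funcall type name)
      type)))
@end example

With a sample alist (the regexps here are assumptions chosen for the
example), a name ending in @file{.jpg} is classified as binary:

@example
(my-file-type-for-name "photo.jpg"
                       '(("\\.txt\\'" . nil) ("\\.jpg\\'" . t))
                       nil)
     @result{} t
@end example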

@node Input Methods
@section Input Methods
@cindex input methods

  @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
characters from the keyboard. Unlike coding systems, which translate
non-@acronym{ASCII} characters to and from encodings meant to be read by
programs, input methods provide human-friendly commands. (@xref{Input
Methods,,, emacs, The GNU Emacs Manual}, for information on how users
use input methods to enter text.) How to define input methods is not
yet documented in this manual, but here we describe how to use them.

  Each input method has a name, which is currently a string;
in the future, symbols may also be usable as input method names.

@defvar current-input-method
This variable holds the name of the input method now active in the
current buffer. (It automatically becomes local in each buffer when set
in any fashion.) It is @code{nil} if no input method is active in the
buffer now.
@end defvar

@defopt default-input-method
This variable holds the default input method for commands that choose an
input method. Unlike @code{current-input-method}, this variable is
normally global.
@end defopt

@deffn Command set-input-method input-method
This command activates input method @var{input-method} for the current
buffer. It also sets @code{default-input-method} to @var{input-method}.
If @var{input-method} is @code{nil}, this command deactivates any input
method for the current buffer.
@end deffn
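
  As a brief sketch of how the command and the variables fit together,
the following activates an input method in a temporary buffer, reads
back @code{current-input-method}, and then deactivates the method. It
assumes that an input method named @code{latin-1-prefix} is available
in your Emacs; substitute any name listed in @code{input-method-alist}.

@example
(with-temp-buffer
  (set-input-method "latin-1-prefix")
  ;; Return the buffer-local name, then turn the method off again.
  (prog1 current-input-method
    (set-input-method nil)))
     @result{} "latin-1-prefix"
@end example

Note that, as described above, the first call also changes the global
value of @code{default-input-method}.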

@defun read-input-method-name prompt &optional default inhibit-null
This function reads an input method name with the minibuffer, prompting
with @var{prompt}. If @var{default} is non-@code{nil}, that is returned
by default, if the user enters empty input. However, if
@var{inhibit-null} is non-@code{nil}, empty input signals an error.

The returned value is a string.
@end defun

@defvar input-method-alist
This variable defines all the supported input methods.
Each element defines one input method, and should have the form:

@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example

Here @var{input-method} is the input method name, a string;
@var{language-env} is another string, the name of the language
environment this input method is recommended for. (That serves only for
documentation purposes.)

@var{activate-func} is a function to call to activate this method. The
@var{args}, if any, are passed as arguments to @var{activate-func}. All
told, the arguments to @var{activate-func} are @var{input-method} and
the @var{args}.

@var{title} is a string to display in the mode line while this method is
active. @var{description} is a string describing this method and what
it is good for.
@end defvar
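
  Because each element is an ordinary list keyed by the input method
name, its fields can be extracted with @code{assoc} and @code{nth}.
The helper below is a minimal sketch; the name
@code{my-input-method-summary} is hypothetical and is not part of
Emacs.

@example
(defun my-input-method-summary (name)
  "Return (TITLE . LANGUAGE-ENV) for input method NAME.
Return nil if NAME is not listed in `input-method-alist'."
  (let ((slot (assoc name input-method-alist)))
    (and slot
         ;; Element layout: (INPUT-METHOD LANGUAGE-ENV ACTIVATE-FUNC
         ;;                  TITLE DESCRIPTION ARGS...)
         (cons (nth 3 slot) (nth 1 slot)))))
@end example

For a name that is not defined, the result is @code{nil}:

@example
(my-input-method-summary "no-such-input-method")
     @result{} nil
@end example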

  The fundamental interface to input methods is through the
variable @code{input-method-function}. @xref{Reading One Event},
and @ref{Invoking the Input Method}.

@node Locales
@section Locales
@cindex locale

  POSIX defines a concept of ``locales'' that determine which language
to use in language-related features. These Emacs variables control
how Emacs interacts with these features.

@defvar locale-coding-system
@tindex locale-coding-system
@cindex keyboard input decoding on X
This variable specifies the coding system to use for decoding system
error messages and (on the X Window System only) keyboard input, for
encoding the format argument to @code{format-time-string}, and for
decoding the return value of @code{format-time-string}.
@end defvar

@defvar system-messages-locale
@tindex system-messages-locale
This variable specifies the locale to use for generating system error
messages. Changing the locale can cause messages to come out in a
different language or in a different orthography. If the variable is
@code{nil}, the locale is specified by environment variables in the
usual POSIX fashion.
@end defvar

@defvar system-time-locale
@tindex system-time-locale
This variable specifies the locale to use for formatting time values.
Changing the locale can cause formatted times to appear according to the
conventions of a different language. If the variable is @code{nil}, the
locale is specified by environment variables in the usual POSIX fashion.
@end defvar
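
  For instance, binding @code{system-time-locale} around a call to
@code{format-time-string} selects the locale for that call only. This
sketch uses the @samp{C} locale, which every POSIX system provides, and
formats the start of the epoch in Universal Time so that the result
does not depend on the local time zone:

@example
(let ((system-time-locale "C"))
  ;; The time value (0 0) is 1970-01-01 00:00:00 UTC.
  (format-time-string "%A" '(0 0) t))
     @result{} "Thursday"
@end example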

@defun locale-info item
This function returns locale data @var{item} for the current POSIX
locale, if available. @var{item} should be one of these symbols:

@table @code
@item codeset
Return the character set as a string (locale item @code{CODESET}).

@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).

@item months
Return a 12-element vector of month names (locale items @code{MON_1}
through @code{MON_12}).

@item paper
Return a list @code{(@var{width} @var{height})} for the default paper
size measured in millimeters (locale items @code{PAPER_WIDTH} and
@code{PAPER_HEIGHT}).
@end table

If the system can't provide the requested information, or if
@var{item} is not one of those symbols, the value is @code{nil}. All
strings in the return value are decoded using
@code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual},
for more information about locales and locale items.
@end defun
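
  Since @code{locale-info} can return @code{nil}, callers generally
need a fallback. The following sketch shows one way to do that; the
function name @code{my-weekday-names} and the choice of English names
as the fallback are assumptions made for the example.

@example
(defun my-weekday-names ()
  "Return a vector of seven day names, starting with Sunday.
Use the current locale's names if the system provides them,
and English names otherwise."
  (or (locale-info 'days)
      ["Sunday" "Monday" "Tuesday" "Wednesday"
       "Thursday" "Friday" "Saturday"]))
@end example

An unrecognized item simply yields @code{nil}:

@example
(locale-info 'no-such-item)
     @result{} nil
@end example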

@ignore
   arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
@end ignore