@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to non-@acronym{ASCII}
characters and how they are stored in strings and buffers.

@menu
* Text Representations::        Unibyte and multibyte representations
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::             How unibyte and multibyte relate to
                                  codes of individual characters.
* Character Sets::              The space of possible character codes
                                  is divided into various character sets.
* Chars and Bytes::             More information about multibyte encodings.
* Splitting Characters::        Converting a character to its byte sequence.
* Scanning Charsets::           Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::              Coding systems are conversions for saving files.
* Input Methods::               Input methods allow users to enter various
                                  non-ASCII characters without special keyboards.
* Locales::                     Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representations

  Emacs has two @dfn{text representations}---two ways to represent text
in a string or buffer. These are called @dfn{unibyte} and
@dfn{multibyte}. Each string, and each buffer, uses one of these two
representations. For most purposes, you can ignore the issue of
representations, because Emacs converts text between them as
appropriate. Occasionally in Lisp programming you will need to pay
attention to the difference.

@cindex unibyte text
  In unibyte representation, each character occupies one byte and
therefore the possible character codes range from 0 to 255. Codes 0
through 127 are @acronym{ASCII} characters; the codes from 128 through 255
are used for one non-@acronym{ASCII} character set (you can choose which
character set by setting the variable @code{nonascii-insert-offset}).

@cindex leading code
@cindex multibyte text
@cindex trailing codes
  In multibyte representation, a character may occupy more than one
byte, and as a result, the full range of Emacs character codes can be
stored. The first byte of a multibyte character is always in the range
128 through 159 (octal 0200 through 0237). These values are called
@dfn{leading codes}. The second and subsequent bytes of a multibyte
character are always in the range 160 through 255 (octal 0240 through
0377); these values are @dfn{trailing codes}.

  Some sequences of bytes are not valid in multibyte text: for example,
a single isolated byte in the range 128 through 159 is not allowed. But
character codes 128 through 159 can appear in multibyte text,
represented as two-byte sequences. All the character codes 128 through
255 are possible (though slightly abnormal) in multibyte text; they
appear in multibyte buffers and strings when you do explicit encoding
and decoding (@pxref{Explicit Encoding}).

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte text.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar

@defvar default-enable-multibyte-characters
This variable's value is entirely equivalent to @code{(default-value
'enable-multibyte-characters)}, and setting this variable changes that
default value. Setting the local binding of
@code{enable-multibyte-characters} in a specific buffer is not allowed,
but changing the default value is supported, and it is a reasonable
thing to do, because it has no effect on existing buffers.

The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
@end defvar

@defun position-bytes position
@tindex position-bytes
Return the byte-position corresponding to buffer position
@var{position} in the current buffer. This is 1 at the start of the
buffer, and counts upward in bytes. If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
@tindex byte-to-position
Return the buffer position corresponding to byte-position
@var{byte-position} in the current buffer. If @var{byte-position} is
out of range, the value is @code{nil}.
@end defun

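  As a rough illustration of @code{position-bytes} and
@code{byte-to-position}, the following sketch uses a temporary buffer
that holds only @acronym{ASCII} text. Each such character occupies one
byte, so character positions and byte positions coincide. The buffer
contents are an arbitrary choice made for this example.

@example
(with-temp-buffer
  (insert "abc")
  ;; Each ASCII character occupies exactly one byte, so
  ;; position 3 and byte-position 3 refer to the same place.
  (list (position-bytes 3)
        (byte-to-position 3)))
     @result{} (3 3)
@end example
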
@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string.
@end defun

@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, though this conversion loses information. In
general these conversions happen when inserting text into a buffer, or
when putting text from several strings together in one string. You can
also explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text that
it is constructed from. The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer. In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text. The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
unchanged, and likewise character codes 128 through 159. It converts
the non-@acronym{ASCII} codes 160 through 255 by adding the value
@code{nonascii-insert-offset} to each character code. By setting this
variable, you specify which character set the unibyte characters
correspond to (@pxref{Character Sets}). For example, if
@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
correspond to Latin 1. If it is 2688, which is @code{(- (make-char
'greek-iso8859-7) 128)}, then they correspond to Greek letters.

  Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code. If @code{nonascii-insert-offset}
has a reasonable value, corresponding to the beginning of some character
set, this conversion is the inverse of the other: converting unibyte
text to multibyte and back to unibyte reproduces the original unibyte
text.

@defvar nonascii-insert-offset
This variable specifies the amount to add to a non-@acronym{ASCII} character
when converting unibyte text to multibyte. It also applies when
@code{self-insert-command} inserts a character in the unibyte
non-@acronym{ASCII} range, 128 through 255. However, the functions
@code{insert} and @code{insert-char} do not perform this conversion.

The right value to use to select character set @var{cs} is @code{(-
(make-char @var{cs}) 128)}. If the value of
@code{nonascii-insert-offset} is zero, then conversion actually uses the
value for the Latin 1 character set, rather than zero.
@end defvar

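  The offset arithmetic can be sketched as follows. The values shown
assume the character codes given in this chapter, where
@code{(make-char 'latin-iso8859-1)} is 2176; the unibyte code 200 is an
arbitrary choice made for illustration.

@example
;; Offset that selects Latin 1 (assumes the generic character is 2176).
(- (make-char 'latin-iso8859-1) 128)
     @result{} 2048
;; Unibyte code 200 plus the offset gives the multibyte code.
(+ 200 2048)
     @result{} 2248
;; The same character, built from its position code 200 - 128 = 72.
(make-char 'latin-iso8859-1 72)
     @result{} 2248
@end example
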
@defvar nonascii-translation-table
This variable provides a more general alternative to
@code{nonascii-insert-offset}. You can use it to specify independently
how to translate each code in the range of 128 through 255 into a
multibyte character. The value should be a char-table, or @code{nil}.
If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
@end defvar

The next three functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-make-unibyte string
This function converts the text of @var{string} to unibyte
representation, if it isn't already, and returns the result. If
@var{string} is a unibyte string, it is returned unchanged. Multibyte
character codes are converted to unibyte according to
@code{nonascii-translation-table} or, if that is @code{nil}, using
@code{nonascii-insert-offset}. If the lookup in the translation table
fails, this function takes just the low 8 bits of each character.
@end defun

@defun string-make-multibyte string
This function converts the text of @var{string} to multibyte
representation, if it isn't already, and returns the result. If
@var{string} is a multibyte string or consists entirely of
@acronym{ASCII} characters, it is returned unchanged. In particular,
if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
string is unibyte. (When the characters are all @acronym{ASCII},
Emacs primitives will treat the string the same way whether it is
unibyte or multibyte.) If @var{string} is unibyte and contains
non-@acronym{ASCII} characters, the function
@code{unibyte-char-to-multibyte} is used to convert each unibyte
character to a multibyte character.
@end defun

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of character codes as @var{string}. Unlike
@code{string-make-multibyte}, this function unconditionally returns a
multibyte string. If @var{string} is a multibyte string, it is
returned unchanged.
@end defun

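  The difference between @code{string-make-multibyte} and
@code{string-to-multibyte} shows up with an all-@acronym{ASCII} unibyte
string. This sketch assumes that the literal @code{"abc"} is read as a
unibyte string, which is how the Lisp reader normally treats
@acronym{ASCII}-only string constants.

@example
;; Returned unchanged, so the result is still unibyte.
(multibyte-string-p (string-make-multibyte "abc"))
     @result{} nil
;; Unconditionally returns a multibyte string.
(multibyte-string-p (string-to-multibyte "abc"))
     @result{} t
@end example
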
@defun multibyte-char-to-unibyte char
This converts the multibyte character @var{char} to a unibyte
character, based on @code{nonascii-translation-table} and
@code{nonascii-insert-offset}.
@end defun

@defun unibyte-char-to-multibyte char
This converts the unibyte character @var{char} to a multibyte
character, based on @code{nonascii-translation-table} and
@code{nonascii-insert-offset}.
@end defun

@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer. If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents viewed
as characters; a sequence of two bytes which is treated as one character
in multibyte representation will count as two characters in unibyte
representation. Character codes 128 through 159 are an exception. They
are represented by one byte in a unibyte buffer, but when the buffer is
set to multibyte, they are converted to two-byte sequences, and vice
versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

You cannot use @code{set-buffer-multibyte} on an indirect buffer,
because indirect buffers always inherit the representation of the
base buffer.
@end defun

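  The following sketch shows how the character count can change under
@code{set-buffer-multibyte} while the bytes stay the same. It assumes a
character from @code{latin-iso8859-1}, which needs more than one byte in
multibyte representation; the particular character is an arbitrary
choice.

@example
(with-temp-buffer
  (set-buffer-multibyte t)
  (insert (make-char 'latin-iso8859-1 72))
  (let ((multibyte-count (- (point-max) (point-min))))
    ;; Reinterpret the same bytes as unibyte text.
    (set-buffer-multibyte nil)
    ;; First element: characters seen as multibyte (one).
    ;; Second element: characters seen as unibyte, one per byte.
    (list multibyte-count (- (point-max) (point-min)))))
@end example
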
@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character. This means that the value may have
more characters than @var{string} has.

If @var{string} is already a unibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties. If @var{string} is multibyte, any characters it
contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
are converted to the corresponding single byte.
@end defun

@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
treating each multibyte sequence as one character. This means that the
value may have fewer characters than @var{string} has.

If @var{string} is already a multibyte string, then the value is
@var{string} itself. Otherwise it is a newly created string, with no
text properties. If @var{string} is unibyte and contains any individual
8-bit bytes (i.e.@: not part of a multibyte form), they are converted to
the corresponding multibyte character of charset @code{eight-bit-control}
or @code{eight-bit-graphic}.
@end defun

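  As a sketch of how @code{string-as-unibyte} and
@code{string-as-multibyte} reinterpret bytes rather than convert
characters, the following example views a one-character multibyte
string as unibyte and back again. The character is an arbitrary choice,
and the exact unibyte length depends on how many bytes it occupies.

@example
(let* ((multi (string (make-char 'latin-iso8859-1 72)))
       (uni (string-as-unibyte multi)))
  (list (length multi)                       ; one character
        (> (length uni) (length multi))      ; one character per byte
        (length (string-as-multibyte uni)))) ; same bytes, regrouped
@end example
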
@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different character
codes. The valid character codes for unibyte representation range from
0 to 255---the values that can fit in one byte. The valid character
codes for multibyte representation range from 0 to 524287, but not all
values in that range are valid. The values 128 through 255 are not
entirely proper in multibyte text, but they can occur if you do explicit
encoding and decoding (@pxref{Explicit Encoding}). Some other character
codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes
0 through 127 are completely legitimate in both representations.

@defun char-valid-p charcode &optional genericp
This returns @code{t} if @var{charcode} is valid (either for unibyte
text or for multibyte text).

@example
(char-valid-p 65)
     @result{} t
(char-valid-p 256)
     @result{} nil
(char-valid-p 2248)
     @result{} t
@end example

If the optional argument @var{genericp} is non-@code{nil}, this
function also returns @code{t} if @var{charcode} is a generic
character (@pxref{Splitting Characters}).
@end defun

@node Character Sets
@section Character Sets
@cindex character sets

  Emacs classifies characters into various @dfn{character sets}, each of
which has a name which is a symbol. Each character belongs to one and
only one character set.

  In general, there is one character set for each distinct script. For
example, @code{latin-iso8859-1} is one character set,
@code{greek-iso8859-7} is another, and @code{ascii} is another. An
Emacs character set can hold at most 9025 characters; therefore, in some
cases, characters that would logically be grouped together are split
into several character sets. For example, one set of Chinese
characters, generally known as Big 5, is divided into two Emacs
character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.

  @acronym{ASCII} characters are in character set @code{ascii}. The
non-@acronym{ASCII} characters 128 through 159 are in character set
@code{eight-bit-control}, and codes 160 through 255 are in character set
@code{eight-bit-graphic}.

@defun charsetp object
Returns @code{t} if @var{object} is a symbol that names a character set,
@code{nil} otherwise.
@end defun

@defvar charset-list
The value is a list of all defined character set names.
@end defvar

@defun charset-list
This function returns the value of @code{charset-list}. It is only
provided for backward compatibility.
@end defun

@defun char-charset character
This function returns the name of the character set that @var{character}
belongs to, or the symbol @code{unknown} if @var{character} is not a
valid character.
@end defun

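  A few sample calls to @code{charsetp} and @code{char-charset}, using
character codes that appear elsewhere in this chapter. The results
assume the character-set assignments described in this section.

@example
(charsetp 'ascii)
     @result{} t
(char-charset ?A)
     @result{} ascii
(char-charset 2248)
     @result{} latin-iso8859-1
(char-charset 128)
     @result{} eight-bit-control
@end example
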
@defun charset-plist charset
@tindex charset-plist
This function returns the charset property list of the character set
@var{charset}. Although @var{charset} is a symbol, this is not the same
as the property list of that symbol. Charset properties are used for
special purposes within Emacs.
@end defun

@deffn Command list-charset-chars charset
This command displays a list of characters in the character set
@var{charset}.
@end deffn

@node Chars and Bytes
@section Characters and Bytes
@cindex bytes and characters

@cindex introduction sequence
@cindex dimension (of character set)
  In multibyte representation, each character occupies one or more
bytes. Each character set has an @dfn{introduction sequence}, which is
normally one or two bytes long. (Exception: the @code{ascii} character
set and the @code{eight-bit-graphic} character set have a zero-length
introduction sequence.) The introduction sequence is the beginning of
the byte sequence for any character in the character set. The rest of
the character's bytes distinguish it from the other characters in the
same character set. Depending on the character set, there are either
one or two distinguishing bytes; the number of such bytes is called the
@dfn{dimension} of the character set.

@defun charset-dimension charset
This function returns the dimension of @var{charset}; at present, the
dimension is always 1 or 2.
@end defun

@defun charset-bytes charset
@tindex charset-bytes
This function returns the number of bytes used to represent a character
in character set @var{charset}.
@end defun

  This is the simplest way to determine the byte length of a character
set's introduction sequence:

@example
(- (charset-bytes @var{charset})
   (charset-dimension @var{charset}))
@end example

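  The subtraction of @code{charset-dimension} from @code{charset-bytes}
can be wrapped in a small helper. The name
@code{my-charset-introduction-length} is hypothetical, chosen only for
this sketch; it is not part of Emacs.

@example
;; Hypothetical helper, not an Emacs primitive.
(defun my-charset-introduction-length (charset)
  "Return the byte length of CHARSET's introduction sequence."
  (- (charset-bytes charset)
     (charset-dimension charset)))

;; ascii has a zero-length introduction sequence.
(my-charset-introduction-length 'ascii)
     @result{} 0
@end example
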
@node Splitting Characters
@section Splitting Characters

  The functions in this section convert between characters and the byte
values used to represent them. For most purposes, there is no need to
be concerned with the sequence of bytes used to represent a character,
because Emacs translates automatically when necessary.

@defun split-char character
Return a list containing the name of the character set of
@var{character}, followed by one or two byte values (integers) which
identify @var{character} within that character set. The number of byte
values is the character set's dimension.

If @var{character} is invalid as a character code, @code{split-char}
returns a list consisting of the symbol @code{unknown} and @var{character}.

@example
(split-char 2248)
     @result{} (latin-iso8859-1 72)
(split-char 65)
     @result{} (ascii 65)
(split-char 128)
     @result{} (eight-bit-control 128)
@end example
@end defun

@defun make-char charset &optional code1 code2
This function returns the character in character set @var{charset} whose
position codes are @var{code1} and @var{code2}. This is roughly the
inverse of @code{split-char}. Normally, you should specify either one
or both of @var{code1} and @var{code2} according to the dimension of
@var{charset}. For example,

@example
(make-char 'latin-iso8859-1 72)
     @result{} 2248
@end example

Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
before they are used to index @var{charset}. Thus you may use, for
instance, an ISO 8859 character code rather than subtracting 128, as
is necessary to index the corresponding Emacs charset.
@end defun

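  Because @code{make-char} is roughly the inverse of @code{split-char},
the list returned by @code{split-char} can be passed back with
@code{apply}. A minimal sketch, reusing the character code from the
examples above and assuming the same character codes:

@example
;; (split-char 2248) is (latin-iso8859-1 72), which supplies
;; the charset and position code arguments for make-char.
(apply 'make-char (split-char 2248))
     @result{} 2248
@end example
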
@cindex generic characters
  If you call @code{make-char} with neither @var{code1} nor
@var{code2}, the result is a @dfn{generic character} which stands for
@var{charset}. A generic character is an integer, but it is @emph{not}
valid for insertion in the buffer as a character. It can be used in
@code{char-table-range} to refer to the whole character set
(@pxref{Char-Tables}). @code{char-valid-p} returns @code{nil} for
generic characters unless its optional argument @var{genericp} is
non-@code{nil}. For example:

@example
(make-char 'latin-iso8859-1)
     @result{} 2176
(char-valid-p 2176)
     @result{} nil
(char-valid-p 2176 t)
     @result{} t
(split-char 2176)
     @result{} (latin-iso8859-1 0)
@end example

The character sets @code{ascii}, @code{eight-bit-control}, and
@code{eight-bit-graphic} don't have corresponding generic characters. If
@var{charset} is one of them and you don't supply @var{code1},
@code{make-char} returns the character code corresponding to the
smallest code in @var{charset}.

@node Scanning Charsets
@section Scanning for Character Sets

  Sometimes it is useful to find out which character sets appear in a
part of a buffer or a string. One use for this is in determining which
coding systems (@pxref{Coding Systems}) are capable of representing all
of the text in question.

@defun charset-after &optional pos
This function returns the charset of the character in the current buffer
at position @var{pos}. If @var{pos} is omitted or @code{nil}, it
defaults to the current value of point. If @var{pos} is out of range,
the value is @code{nil}.
@end defun

489
a9f0a989 490@defun find-charset-region beg end &optional translation
a9f0a989
RS
491This function returns a list of the character sets that appear in the
492current buffer between positions @var{beg} and @var{end}.
493
494The optional argument @var{translation} specifies a translation table to
495be used in scanning the text (@pxref{Translation of Characters}). If it
496is non-@code{nil}, then each character in the region is translated
497through this table, and the value returned describes the translated
498characters instead of the characters actually in the buffer.
a265079f 499@end defun
a9f0a989
RS
500
@defun find-charset-string string &optional translation
This function returns a list of the character sets that appear in the
string @var{string}. It is just like @code{find-charset-region}, except
that it applies to the contents of @var{string} instead of part of the
current buffer.
@end defun

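  For instance, a string that holds only @acronym{ASCII} characters
uses a single character set. The second form is a sketch of how the
result of @code{find-charset-string} might drive a decision; the
variable @code{my-text} is a hypothetical name standing for any string.

@example
(find-charset-string "abc")
     @result{} (ascii)

;; Non-nil if MY-TEXT (hypothetical) contains Latin-1 characters.
(memq 'latin-iso8859-1 (find-charset-string my-text))
@end example
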
@node Translation of Characters
@section Translation of Characters
@cindex character translation tables
@cindex translation tables

  A @dfn{translation table} is a char-table that specifies a mapping
of characters into characters. These tables are used in encoding and
decoding, and for other purposes. Some coding systems specify their
own particular translation tables; there are also default translation
tables which apply to all other coding systems.

  For instance, the coding-system @code{utf-8} has a translation table
that maps characters of various charsets (e.g.,
@code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
it can encode Latin-2 characters into UTF-8. Meanwhile,
@code{unify-8859-on-decoding-mode} operates by specifying
@code{standard-translation-table-for-decode} to translate
Latin-@var{x} characters into corresponding Unicode characters.

@defun make-translation-table &rest translations
This function returns a translation table based on the argument
@var{translations}. Each element of @var{translations} should be a
list of elements of the form @code{(@var{from} . @var{to})}; this says
to translate the character @var{from} into @var{to}.

The arguments and the forms in each argument are processed in order,
and if a previous form already translates @var{to} to some other
character, say @var{to-alt}, @var{from} is also translated to
@var{to-alt}.

You can also map one whole character set into another character set with
the same dimension. To do this, you specify a generic character (which
designates a character set) for @var{from} (@pxref{Splitting Characters}).
In this case, if @var{to} is also a generic character, its character
set should have the same dimension as @var{from}'s. Then the
translation table translates each character of @var{from}'s character
set into the corresponding character of @var{to}'s character set. If
@var{from} is a generic character and @var{to} is an ordinary
character, then the translation table translates every character of
@var{from}'s character set into @var{to}.
@end defun

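  A minimal sketch of building a table with
@code{make-translation-table} and looking up one entry. Since a
translation table is a char-table, this example assumes that
@code{aref} can be used to inspect it; the characters are arbitrary.

@example
(let ((table (make-translation-table '((?a . ?b)))))
  ;; ?b is character code 98.
  (aref table ?a))
     @result{} 98
@end example
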
  In decoding, the translation table's translations are applied to the
characters that result from ordinary decoding. If a coding system has
property @code{translation-table-for-decode}, that specifies the
translation table to use. (This is a property of the coding system,
as returned by @code{coding-system-get}, not a property of the symbol
that is the coding system's name. @xref{Coding System Basics,, Basic
Concepts of Coding Systems}.) Otherwise, if
@code{standard-translation-table-for-decode} is non-@code{nil},
decoding uses that table.

  In encoding, the translation table's translations are applied to the
characters in the buffer, and the result of translation is actually
encoded. If a coding system has property
@code{translation-table-for-encode}, that specifies the translation
table to use. Otherwise the variable
@code{standard-translation-table-for-encode} specifies the translation
table.

@defvar standard-translation-table-for-decode
This is the default translation table for decoding, for
coding systems that don't specify any other translation table.
@end defvar

@defvar standard-translation-table-for-encode
This is the default translation table for encoding, for
coding systems that don't specify any other translation table.
@end defvar

@defvar translation-table-for-input
Self-inserting characters are translated through this translation
table before they are inserted. This variable automatically becomes
buffer-local when set.

@code{set-buffer-file-coding-system} sets this variable so that your
keyboard input gets translated into the character sets that the buffer
is likely to contain.
@end defvar

@node Coding Systems
@section Coding Systems

@cindex coding system
  When Emacs reads or writes a file, and when Emacs sends text to a
subprocess or receives text from a subprocess, it normally performs
character code conversion and end-of-line conversion as specified
by a particular @dfn{coding system}.

  How to define a coding system is an arcane matter, and is not
documented here.

@menu
* Coding System Basics::        Basic concepts.
* Encoding and I/O::            How file I/O functions handle coding systems.
* Lisp and Coding Systems::     Functions to operate on coding system names.
* User-Chosen Coding Systems::  Asking the user to choose a coding system.
* Default Coding Systems::      Controlling the default choices.
* Specifying Coding Systems::   Requesting a particular coding system
                                  for a single file operation.
* Explicit Encoding::           Encoding or decoding text without doing I/O.
* Terminal I/O Encoding::       Use of encoding for terminal I/O.
* MS-DOS File Types::           How DOS "text" and "binary" files
                                  relate to coding systems.
@end menu

@node Coding System Basics
@subsection Basic Concepts of Coding Systems

@cindex character code conversion
  @dfn{Character code conversion} involves conversion between the encoding
used inside Emacs and some other encoding. Emacs supports many
different encodings, in that it can convert to and from them. For
example, it can convert text to or from encodings such as Latin 1, Latin
2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
cases, Emacs supports several alternative encodings for the same
characters; for example, there are three coding systems for the Cyrillic
(Russian) alphabet: ISO, Alternativnyj, and KOI8.

  Most coding systems specify a particular character code for
conversion, but some of them leave the choice unspecified---to be chosen
heuristically for each file, based on the data.

In general, a coding system doesn't guarantee roundtrip identity:
decoding text then encoding the result in the same coding system can
produce a different byte sequence from the one you originally decoded.
However, the following coding systems do guarantee that the result
will be the same as what you originally decoded:

@quotation
chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
@end quotation

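  The round-trip guarantee for decoding and then encoding can be
checked directly. This sketch uses @code{iso-latin-1}, one of the
coding systems listed above, with an arbitrary three-byte unibyte
string (@pxref{Explicit Encoding}).

@example
(let ((bytes "\344\366\374"))
  ;; Decode the bytes, encode the result, and compare
  ;; with the original byte sequence.
  (string= bytes
           (encode-coding-string
            (decode-coding-string bytes 'iso-latin-1)
            'iso-latin-1)))
     @result{} t
@end example
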
Encoding buffer text and then decoding the result can also fail to
reproduce the original text. For instance, when you encode Latin-2
characters with @code{utf-8} and decode the result using the same
coding system, you'll get Unicode characters (of charset
@code{mule-unicode-0100-24ff}). When you encode Unicode characters
with @code{iso-latin-2} and decode them back with the same coding
system, you'll get Latin-2 characters.

969fe9b5
RS
652@cindex end of line conversion
653 @dfn{End of line conversion} handles three different conventions used
654on various systems for representing end of line in files. The Unix
655convention is to use the linefeed character (also called newline). The
8241495d
RS
656DOS convention is to use a carriage-return and a linefeed at the end of
657a line. The Mac convention is to use just carriage-return.
969fe9b5 658
cc6d0d2c
RS
659@cindex base coding system
660@cindex variant coding system
661 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
662conversion unspecified, to be chosen based on the data. @dfn{Variant
663coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
664@code{latin-1-mac} specify the end-of-line conversion explicitly as
a9f0a989 665well. Most base coding systems have three corresponding variants whose
cc6d0d2c
RS
666names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
667
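A minimal sketch of how the three variants differ, using the explicit
encoding function @code{encode-coding-string} (@pxref{Explicit
Encoding}) with the variants of @code{iso-latin-1}:

@example
(encode-coding-string "a\nb" 'iso-latin-1-unix)
     @result{} "a\nb"
(encode-coding-string "a\nb" 'iso-latin-1-dos)
     @result{} "a\r\nb"
(encode-coding-string "a\nb" 'iso-latin-1-mac)
     @result{} "a\rb"
@end example
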
a9f0a989
RS
668 The coding system @code{raw-text} is special in that it prevents
669character code conversion, and causes the buffer visited with that
670coding system to be a unibyte buffer. It does not specify the
671end-of-line conversion, allowing that to be determined as usual by the
672data, and has the usual three variants which specify the end-of-line
673conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
674it specifies no conversion of either character codes or end-of-line.
675
676 The coding system @code{emacs-mule} specifies that the data is
677represented in the internal Emacs encoding. This is like
678@code{raw-text} in that no code conversion happens, but different in
679that the result is multibyte data.
680
681@defun coding-system-get coding-system property
a9f0a989
RS
682This function returns the specified property of the coding system
683@var{coding-system}. Most coding system properties exist for internal
684purposes, but one that you might find useful is @code{mime-charset}.
685That property's value is the name used in MIME for the character coding
686which this coding system can read and write. Examples:
687
688@example
689(coding-system-get 'iso-latin-1 'mime-charset)
690 @result{} iso-8859-1
691(coding-system-get 'iso-2022-cn 'mime-charset)
692 @result{} iso-2022-cn
693(coding-system-get 'cyrillic-koi8 'mime-charset)
694 @result{} koi8-r
695@end example
696
697The value of the @code{mime-charset} property is also defined
698as an alias for the coding system.
699@end defun
700
701@node Encoding and I/O
702@subsection Encoding and I/O
703
1911e6e5 704 The principal purpose of coding systems is for use in reading and
a9f0a989
RS
705writing files. The function @code{insert-file-contents} uses
706a coding system for decoding the file data, and @code{write-region}
707uses one to encode the buffer contents.
708
709 You can specify the coding system to use either explicitly
5ac343ac 710(@pxref{Specifying Coding Systems}), or implicitly using a default
a9f0a989
RS
711mechanism (@pxref{Default Coding Systems}). But these methods may not
712completely specify what to do. For example, they may choose a coding
713system such as @code{undecided} which leaves the character code
714conversion to be determined from the data. In these cases, the I/O
715operation finishes the job of choosing a coding system. Very often
716you will want to find out afterwards which coding system was chosen.
717
718@defvar buffer-file-coding-system
a9f0a989
RS
719This variable records the coding system that was used for visiting the
720current buffer. It is used for saving the buffer, and for writing part
1b02d12c
EZ
721of the buffer with @code{write-region}. If the text to be written
722cannot be safely encoded using the coding system specified by this
723variable, these operations select an alternative encoding by calling
724the function @code{select-safe-coding-system} (@pxref{User-Chosen
725Coding Systems}). If selecting a different encoding requires asking
726the user to specify a coding system, @code{buffer-file-coding-system}
727is updated to the newly selected coding system.
728
729@code{buffer-file-coding-system} does @emph{not} affect sending text
b6954afd 730to a subprocess.
a9f0a989
RS
731@end defvar
732
733@defvar save-buffer-coding-system
7a063989
KH
734This variable specifies the coding system for saving the buffer (by
735overriding @code{buffer-file-coding-system}). Note that it is not used
736for @code{write-region}.
8241495d
RS
737
738When a command to save the buffer starts out to use
7a063989
KH
739@code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
740and that coding system cannot handle
8241495d 741the actual text in the buffer, the command asks the user to choose
1b02d12c
EZ
742another coding system (by calling @code{select-safe-coding-system}).
743After that happens, the command also updates
744@code{buffer-file-coding-system} to represent the coding system that
745the user specified.
a9f0a989
RS
746@end defvar
747
748@defvar last-coding-system-used
a9f0a989
RS
749I/O operations for files and subprocesses set this variable to the
750coding system name that was used. The explicit encoding and decoding
751functions (@pxref{Explicit Encoding}) set it too.
752
753@strong{Warning:} Since receiving subprocess output sets this variable,
8241495d
RS
754it can change whenever Emacs waits; therefore, you should copy the
755value shortly after the function call that stores the value you are
a9f0a989
RS
756interested in.
757@end defvar
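
For instance, a function along the following lines reads the value
immediately after the call that sets it. This is only a sketch, and
the name @code{my-file-coding-system} is hypothetical.

@example
(defun my-file-coding-system (file)
  "Return the coding system that was used to decode FILE."
  (with-temp-buffer
    (insert-file-contents file)
    ;; @r{Read the value at once, before Emacs has a chance to wait.}
    last-coding-system-used))
@end example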
758
2eb4136f
RS
759 The variable @code{selection-coding-system} specifies how to encode
760selections for the window system. @xref{Window System Selections}.
761
1ee89891
RS
762@defvar file-name-coding-system
763The variable @code{file-name-coding-system} specifies the coding
764system to use for encoding file names. Emacs encodes file names using
765that coding system for all file operations. If
766@code{file-name-coding-system} is @code{nil}, Emacs uses a default
767coding system determined by the selected language environment. In the
768default language environment, any non-@acronym{ASCII} characters in
769file names are not encoded specially; they appear in the file system
770using the internal Emacs representation.
771@end defvar
772
773 @strong{Warning:} if you change @code{file-name-coding-system} (or
774the language environment) in the middle of an Emacs session, problems
775can result if you have already visited files whose names were encoded
776using the earlier coding system and are handled differently under the
777new coding system. If you try to save one of these buffers under the
778visited file name, saving may use the wrong file name, or it may get
779an error. If such a problem happens, use @kbd{C-x C-w} to specify a
780new file name for that buffer.
781
969fe9b5
RS
782@node Lisp and Coding Systems
783@subsection Coding Systems in Lisp
784
8241495d 785 Here are the Lisp facilities for working with coding systems:
cc6d0d2c 786
cc6d0d2c
RS
787@defun coding-system-list &optional base-only
788This function returns a list of all coding system names (symbols). If
789@var{base-only} is non-@code{nil}, the value includes only the
7a063989
KH
790base coding systems. Otherwise, it includes alias and variant coding
791systems as well.
cc6d0d2c
RS
792@end defun
793
cc6d0d2c
RS
794@defun coding-system-p object
795This function returns @code{t} if @var{object} is a coding system
35864124 796name or @code{nil}.
cc6d0d2c
RS
797@end defun
798
cc6d0d2c
RS
799@defun check-coding-system coding-system
800This function checks the validity of @var{coding-system}.
801If that is valid, it returns @var{coding-system}.
802Otherwise it signals an error with condition @code{coding-system-error}.
803@end defun
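
The following sketch contrasts the two functions. It assumes that no
coding system named @code{no-such-coding-system} is defined; that name
is made up for the example.

@example
(coding-system-p 'iso-latin-1)
     @result{} t
(coding-system-p 'no-such-coding-system)
     @result{} nil
(condition-case nil
    (check-coding-system 'no-such-coding-system)
  (coding-system-error 'invalid))
     @result{} invalid
@end example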
804
a9f0a989 805@defun coding-system-change-eol-conversion coding-system eol-type
a9f0a989 806This function returns a coding system which is like @var{coding-system}
1911e6e5 807except for its end-of-line conversion, which is specified by @var{eol-type}.
a9f0a989
RS
808@var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
809@code{nil}. If it is @code{nil}, the returned coding system determines
810the end-of-line conversion from the data.
35864124
LT
811
812@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
070b546b 813@code{dos} and @code{mac}, respectively.
a9f0a989 814@end defun
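
For example, starting from @code{iso-latin-1} and one of its variants
(a short sketch; any base coding system with the usual three variants
should behave the same way):

@example
(coding-system-change-eol-conversion 'iso-latin-1 'dos)
     @result{} iso-latin-1-dos
(coding-system-change-eol-conversion 'iso-latin-1-dos 'unix)
     @result{} iso-latin-1-unix
@end example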
969fe9b5 815
a9f0a989 816@defun coding-system-change-text-conversion eol-coding text-coding
a9f0a989
RS
817This function returns a coding system which uses the end-of-line
818conversion of @var{eol-coding}, and the text conversion of
819@var{text-coding}. If @var{text-coding} is @code{nil}, it returns
820@code{undecided}, or one of its variants according to @var{eol-coding}.
969fe9b5
RS
821@end defun
822
a9f0a989 823@defun find-coding-systems-region from to
a9f0a989
RS
824This function returns a list of coding systems that could be used to
825encode the text between @var{from} and @var{to}. All coding systems in
826the list can safely encode any multibyte characters in that portion of
827the text.
828
829If the text contains no multibyte characters, the function returns the
830list @code{(undecided)}.
831@end defun
832
833@defun find-coding-systems-string string
a9f0a989
RS
834This function returns a list of coding systems that could be used to
835encode the text of @var{string}. All coding systems in the list can
836safely encode any multibyte characters in @var{string}. If the text
837contains no multibyte characters, this returns the list
838@code{(undecided)}.
839@end defun
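
One possible use of this function is to test whether a given coding
system appears in the returned list. The helper below is a sketch
with a hypothetical name; it assumes the list holds base coding
system names, and it treats the value @code{(undecided)} as meaning
that any coding system will do.

@example
(defun my-string-encodable-p (string coding-system)
  "Return non-nil if CODING-SYSTEM appears able to encode STRING."
  (let ((candidates (find-coding-systems-string string)))
    (or (equal candidates '(undecided))
        (memq (coding-system-base coding-system) candidates))))
@end example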
840
841@defun find-coding-systems-for-charsets charsets
a9f0a989
RS
842This function returns a list of coding systems that could be used to
843encode all the character sets in the list @var{charsets}.
844@end defun
845
846@defun detect-coding-region start end &optional highest
cc6d0d2c 847This function chooses a plausible coding system for decoding the text
0ace421a 848from @var{start} to @var{end}. This text should be a byte sequence
969fe9b5 849(@pxref{Explicit Encoding}).
cc6d0d2c 850
a9f0a989 851Normally this function returns a list of coding systems that could
cc6d0d2c 852handle decoding the text that was scanned. They are listed in order of
a9f0a989
RS
853decreasing priority. But if @var{highest} is non-@code{nil}, then the
854return value is just one coding system, the one that is highest in
855priority.
856
ad800164 857If the region contains only @acronym{ASCII} characters, the value
35864124
LT
858is @code{undecided} or @code{(undecided)}, or a variant specifying
859end-of-line conversion, if that can be deduced from the text.
cc6d0d2c
RS
860@end defun
861
35864124 862@defun detect-coding-string string &optional highest
cc6d0d2c
RS
863This function is like @code{detect-coding-region} except that it
864operates on the contents of @var{string} instead of bytes in the buffer.
1911e6e5
RS
865@end defun
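
A typical pattern is to detect a coding system and then decode with
it. The following sketch, whose function name is hypothetical, asks
for the highest-priority candidate only and passes it to
@code{decode-coding-string} (@pxref{Explicit Encoding}).

@example
(defun my-decode-guessing (bytes)
  "Decode the byte sequence BYTES with the most plausible coding system."
  (let ((coding (detect-coding-string bytes t)))
    (decode-coding-string bytes coding)))
@end example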
866
35864124
LT
867 @xref{Coding systems for a subprocess,, Process Information}, in
868particular the description of the functions
869@code{process-coding-system} and @code{set-process-coding-system}, for
870how to examine or set the coding systems used for I/O to a subprocess.
1911e6e5
RS
871
872@node User-Chosen Coding Systems
873@subsection User-Chosen Coding Systems
874
1b02d12c 875@cindex select safe coding system
35864124 876@defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
bf23b477
EZ
877This function selects a coding system for encoding specified text,
878asking the user to choose if necessary. Normally the specified text
35864124
LT
879is the text in the current buffer between @var{from} and @var{to}. If
880@var{from} is a string, the string specifies the text to encode, and
881@var{to} is ignored.
bf23b477
EZ
882
883If @var{default-coding-system} is non-@code{nil}, that is the first
884coding system to try; if that can handle the text,
885@code{select-safe-coding-system} returns that coding system. It can
886also be a list of coding systems; then the function tries each of them
35864124
LT
887one by one. After trying all of them, it next tries the current
888buffer's value of @code{buffer-file-coding-system} (if it is not
889@code{undecided}), then the value of
890@code{default-buffer-file-coding-system} and finally the user's most
891preferred coding system, which the user can set using the command
892@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
893Coding Systems, emacs, The GNU Emacs Manual}).
bf23b477
EZ
894
895If one of those coding systems can safely encode all the specified
896text, @code{select-safe-coding-system} chooses it and returns it.
897Otherwise, it asks the user to choose from a list of coding systems
898which can encode all the text, and returns the user's choice.
899
35864124
LT
900@var{default-coding-system} can also be a list whose first element is
901@code{t} and whose other elements are coding systems. Then, if no coding
902system in the list can handle the text, @code{select-safe-coding-system}
903queries the user immediately, without trying any of the three
904alternatives described above.
905
bf23b477 906The optional argument @var{accept-default-p}, if non-@code{nil},
35864124
LT
907should be a function to determine whether a coding system selected
908without user interaction is acceptable. @code{select-safe-coding-system}
909calls this function with one argument, the base coding system of the
910selected coding system. If @var{accept-default-p} returns @code{nil},
911@code{select-safe-coding-system} rejects the silently selected coding
912system, and asks the user to select a coding system from a list of
913possible candidates.
bf23b477
EZ
914
915@vindex select-safe-coding-system-accept-default-p
916If the variable @code{select-safe-coding-system-accept-default-p} is
917non-@code{nil}, its value overrides the value of
918@var{accept-default-p}.
35864124
LT
919
920As a final step, before returning the chosen coding system,
921@code{select-safe-coding-system} checks whether that coding system is
922consistent with what would be selected if the contents of the region
923were read from a file. (If not, this could lead to data corruption in
924a file subsequently re-visited and edited.) Normally,
925@code{select-safe-coding-system} uses @code{buffer-file-name} as the
926file for this purpose, but if @var{file} is non-@code{nil}, it uses
927that file instead (this can be relevant for @code{write-region} and
928similar functions). If it detects an apparent inconsistency,
929@code{select-safe-coding-system} queries the user before selecting the
930coding system.
969fe9b5
RS
931@end defun
932
933 Here are two functions you can use to let the user specify a coding
934system, with completion. @xref{Completion}.
935
a9f0a989 936@defun read-coding-system prompt &optional default
969fe9b5
RS
937This function reads a coding system using the minibuffer, prompting with
938string @var{prompt}, and returns the coding system name as a symbol. If
939the user enters null input, @var{default} specifies which coding system
940to return. It should be a symbol or a string.
941@end defun
942
969fe9b5
RS
943@defun read-non-nil-coding-system prompt
944This function reads a coding system using the minibuffer, prompting with
a9f0a989 945string @var{prompt}, and returns the coding system name as a symbol. If
969fe9b5
RS
946the user tries to enter null input, it asks the user to try again.
947@xref{Coding Systems}.
cc6d0d2c
RS
948@end defun
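
As an illustration, a command could use
@code{read-non-nil-coding-system} in its @code{interactive} form. The
command name below is hypothetical, and this is a sketch rather than a
recommended interface.

@example
(defun my-encode-region-as (start end coding-system)
  "Encode the text between START and END using CODING-SYSTEM."
  (interactive
   (list (region-beginning) (region-end)
         (read-non-nil-coding-system "Encode region with: ")))
  (encode-coding-region start end coding-system))
@end example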
949
950@node Default Coding Systems
a9f0a989 951@subsection Default Coding Systems
cc6d0d2c 952
a9f0a989
RS
953 This section describes variables that specify the default coding
954system for certain files or when running certain subprograms, and the
1911e6e5 955function that I/O operations use to access them.
a9f0a989
RS
956
957 The idea of these variables is that you set them once and for all to the
958defaults you want, and then do not change them again. To specify a
959particular coding system for a particular operation in a Lisp program,
960don't change these variables; instead, override them using
961@code{coding-system-for-read} and @code{coding-system-for-write}
962(@pxref{Specifying Coding Systems}).
cc6d0d2c 963
bf23b477
EZ
964@defvar auto-coding-regexp-alist
965This variable is an alist of text patterns and corresponding coding
966systems. Each element has the form @code{(@var{regexp}
967. @var{coding-system})}; a file whose first few kilobytes match
968@var{regexp} is decoded with @var{coding-system} when its contents are
969read into a buffer. The settings in this alist take priority over
970@code{coding:} tags in the files and the contents of
971@code{file-coding-system-alist} (see below). The default value is set
972so that Emacs automatically recognizes mail files in Babyl format and
973reads them with no code conversions.
974@end defvar
975
cc6d0d2c
RS
976@defvar file-coding-system-alist
977This variable is an alist that specifies the coding systems to use for
978reading and writing particular files. Each element has the form
979@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
980expression that matches certain file names. The element applies to file
981names that match @var{pattern}.
982
35864124 983The @sc{cdr} of the element, @var{coding}, should be either a coding
8241495d
RS
984system, a cons cell containing two coding systems, or a function name (a
985symbol with a function definition). If @var{coding} is a coding system,
986that coding system is used for both reading the file and writing it. If
35864124
LT
987@var{coding} is a cons cell containing two coding systems, its @sc{car}
988specifies the coding system for decoding, and its @sc{cdr} specifies the
8241495d
RS
989coding system for encoding.
990
35864124
LT
991If @var{coding} is a function name, the function should take one
992argument, a list of all arguments passed to
993@code{find-operation-coding-system}. It must return a coding system
994or a cons cell containing two coding systems. This value has the same
995meaning as described above.
cc6d0d2c
RS
996@end defvar
997
cc6d0d2c
RS
998@defvar process-coding-system-alist
999This variable is an alist specifying which coding systems to use for a
1000subprocess, depending on which program is running in the subprocess. It
1001works like @code{file-coding-system-alist}, except that @var{pattern} is
1002matched against the program name used to start the subprocess. The coding
1003system or systems specified in this alist are used to initialize the
1004coding systems used for I/O to the subprocess, but you can specify
1005other coding systems later using @code{set-process-coding-system}.
1006@end defvar
1007
8241495d
RS
1008 @strong{Warning:} Coding systems such as @code{undecided}, which
1009determine the coding system from the data, do not work entirely reliably
1911e6e5 1010with asynchronous subprocess output. This is because Emacs handles
a9f0a989
RS
1011asynchronous subprocess output in batches, as it arrives. If the coding
1012system leaves the character code conversion unspecified, or leaves the
1013end-of-line conversion unspecified, Emacs must try to detect the proper
1014conversion from one batch at a time, and this does not always work.
1015
1016 Therefore, with an asynchronous subprocess, if at all possible, use a
1017coding system which determines both the character code conversion and
1018the end of line conversion---that is, one like @code{latin-1-unix},
1019rather than @code{undecided} or @code{latin-1}.
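
One way to follow this advice is to set both coding systems right
after starting the subprocess. The function below is a sketch with a
hypothetical name, and the choice of @code{iso-latin-1-unix} is only
an example.

@example
(defun my-start-process-latin-1 (name buffer program)
  "Start PROGRAM asynchronously with fully specified coding systems."
  (let ((proc (start-process name buffer program)))
    ;; @r{Specify both character code and end-of-line conversion}
    ;; @r{for decoding input and for encoding output.}
    (set-process-coding-system proc 'iso-latin-1-unix 'iso-latin-1-unix)
    proc))
@end example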
1020
cc6d0d2c
RS
1021@defvar network-coding-system-alist
1022This variable is an alist that specifies the coding system to use for
1023network streams. It works much like @code{file-coding-system-alist},
969fe9b5 1024with the difference that the @var{pattern} in an element may be either a
cc6d0d2c
RS
1025port number or a regular expression. If it is a regular expression, it
1026is matched against the network service name used to open the network
1027stream.
1028@end defvar
1029
cc6d0d2c
RS
1030@defvar default-process-coding-system
1031This variable specifies the coding systems to use for subprocess (and
1032network stream) input and output, when nothing else specifies what to
1033do.
1034
a9f0a989
RS
1035The value should be a cons cell of the form @code{(@var{input-coding}
1036. @var{output-coding})}. Here @var{input-coding} applies to input from
1037the subprocess, and @var{output-coding} applies to output to it.
cc6d0d2c
RS
1038@end defvar
1039
131bf943
RS
1040@defvar auto-coding-functions
1041This variable holds a list of functions that try to determine a
1042coding system for a file based on its undecoded contents.
1043
1044Each function in this list should be written to look at text in the
1045current buffer, but should not modify it in any way. The buffer will
1046contain undecoded text of parts of the file. Each function should
1047take one argument, @var{size}, which tells it how many characters to
1048look at, starting from point. If the function succeeds in determining
1049a coding system for the file, it should return that coding system.
1050Otherwise, it should return @code{nil}.
1051
1052If a file has a @samp{coding:} tag, that takes precedence, so these
1053functions won't be called.
1054@end defvar
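
Here is a sketch of a function suitable for this list. The function
name and the marker string @samp{MYFMT-LATIN2} are made up for the
example; a real function would look for whatever identifies the file
format in question.

@example
(defun my-auto-coding-marker (size)
  "Return `iso-latin-2' if the marker appears within SIZE characters."
  (let ((limit (min (point-max) (+ (point) size))))
    ;; @r{Search without moving point or changing the buffer.}
    (save-excursion
      (when (search-forward "MYFMT-LATIN2" limit t)
        'iso-latin-2))))

(add-to-list 'auto-coding-functions 'my-auto-coding-marker)
@end example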
1055
a9f0a989 1056@defun find-operation-coding-system operation &rest arguments
a9f0a989
RS
1057This function returns the coding system to use (by default) for
1058performing @var{operation} with @var{arguments}. The value has this
1059form:
1060
1061@example
35864124 1062(@var{decoding-system} . @var{encoding-system})
a9f0a989
RS
1063@end example
1064
1065The first element, @var{decoding-system}, is the coding system to use
1066for decoding (in case @var{operation} does decoding), and
1067@var{encoding-system} is the coding system for encoding (in case
1068@var{operation} does encoding).
1069
8241495d 1070The argument @var{operation} should be a symbol, one of
a9f0a989
RS
1071@code{insert-file-contents}, @code{write-region}, @code{call-process},
1072@code{call-process-region}, @code{start-process}, or
8241495d
RS
1073@code{open-network-stream}. These are the names of the Emacs I/O primitives
1074that can do coding system conversion.
a9f0a989
RS
1075
1076The remaining arguments should be the same arguments that might be given
8241495d 1077to that I/O primitive. Depending on the primitive, one of those
a9f0a989
RS
1078arguments is selected as the @dfn{target}. For example, if
1079@var{operation} does file I/O, whichever argument specifies the file
1080name is the target. For subprocess primitives, the process name is the
1081target. For @code{open-network-stream}, the target is the service name
1082or port number.
1083
1084This function looks up the target in @code{file-coding-system-alist},
1085@code{process-coding-system-alist}, or
1086@code{network-coding-system-alist}, depending on @var{operation}.
a9f0a989
RS
1087@end defun
1088
cc6d0d2c 1089@node Specifying Coding Systems
a9f0a989 1090@subsection Specifying a Coding System for One Operation
cc6d0d2c
RS
1091
1092 You can specify the coding system for a specific operation by binding
1093the variables @code{coding-system-for-read} and/or
1094@code{coding-system-for-write}.
1095
cc6d0d2c
RS
1096@defvar coding-system-for-read
1097If this variable is non-@code{nil}, it specifies the coding system to
1098use for reading a file, or for input from a synchronous subprocess.
1099
1100It also applies to any asynchronous subprocess or network stream, but in
1101a different way: the value of @code{coding-system-for-read} when you
1102start the subprocess or open the network stream specifies the input
1103decoding method for that subprocess or network stream. It remains in
1104use for that subprocess or network stream unless and until overridden.
1105
1106The right way to use this variable is to bind it with @code{let} for a
1107specific I/O operation. Its global value is normally @code{nil}, and
1108you should not globally set it to any other value. Here is an example
1109of the right way to use the variable:
1110
1111@example
1112;; @r{Read the file with no character code conversion.}
ad800164 1113;; @r{Assume @acronym{crlf} represents end-of-line.}
a3d3f60d 1114(let ((coding-system-for-read 'emacs-mule-dos))
cc6d0d2c
RS
1115 (insert-file-contents filename))
1116@end example
1117
1118When its value is non-@code{nil}, @code{coding-system-for-read} takes
a9f0a989 1119precedence over all other methods of specifying a coding system to use for
cc6d0d2c
RS
1120input, including @code{file-coding-system-alist},
1121@code{process-coding-system-alist} and
1122@code{network-coding-system-alist}.
1123@end defvar
1124
cc6d0d2c
RS
1125@defvar coding-system-for-write
1126This works much like @code{coding-system-for-read}, except that it
1127applies to output rather than input. It affects writing to files,
b6954afd 1128as well as sending output to subprocesses and net connections.
cc6d0d2c
RS
1129
1130When a single operation does both input and output, as do
1131@code{call-process-region} and @code{start-process}, both
1132@code{coding-system-for-read} and @code{coding-system-for-write}
1133affect it.
1134@end defvar
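
By analogy with the example for @code{coding-system-for-read}, here is
a sketch of binding @code{coding-system-for-write} around a single
output operation. The function name is hypothetical.

@example
(defun my-write-string-dos (string file)
  "Write STRING to FILE as Latin-1 with DOS end-of-line."
  (let ((coding-system-for-write 'iso-latin-1-dos))
    (write-region string nil file)))
@end example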
1135
cc6d0d2c
RS
1136@defvar inhibit-eol-conversion
1137When this variable is non-@code{nil}, no end-of-line conversion is done,
1138no matter which coding system is specified. This applies to all the
1139Emacs I/O and subprocess primitives, and to the explicit encoding and
1140decoding functions (@pxref{Explicit Encoding}).
1141@end defvar
1142
cc6d0d2c 1143@node Explicit Encoding
a9f0a989 1144@subsection Explicit Encoding and Decoding
cc6d0d2c
RS
1145@cindex encoding text
1146@cindex decoding text
1147
1148 All the operations that transfer text in and out of Emacs have the
1149ability to use a coding system to encode or decode the text.
1150You can also explicitly encode and decode text using the functions
1151in this section.
1152
cc6d0d2c 1153 The result of encoding, and the input to decoding, are not ordinary
0ace421a
GM
1154text. They logically consist of a series of byte values; that is, a
1155series of characters whose codes are in the range 0 through 255. In a
1156multibyte buffer or string, character codes 128 through 159 are
1157represented by multibyte sequences, but this is invisible to Lisp
1158programs.
1159
1160 The usual way to read a file into a buffer as a sequence of bytes, so
1161you can decode the contents explicitly, is with
1162@code{insert-file-contents-literally} (@pxref{Reading from Files});
1163alternatively, specify a non-@code{nil} @var{rawfile} argument when
1164visiting a file with @code{find-file-noselect}. These methods result in
1165a unibyte buffer.
1166
1167 The usual way to use the byte sequence that results from explicitly
1168encoding text is to copy it to a file or process---for example, to write
1169it with @code{write-region} (@pxref{Writing to Files}), and suppress
1170encoding by binding @code{coding-system-for-write} to
1171@code{no-conversion}.
b6954afd
RS
1172
1173 Here are the functions to perform explicit encoding or decoding. The
0ace421a
GM
1174encoding functions produce sequences of bytes; the decoding functions
1175are meant to operate on sequences of bytes. All of these functions
1176discard text properties.
1911e6e5 1177
35864124
LT
1178@deffn Command encode-coding-region start end coding-system
1179This command encodes the text from @var{start} to @var{end} according
969fe9b5 1180to coding system @var{coding-system}. The encoded text replaces the
0ace421a
GM
1181original text in the buffer. The result of encoding is logically a
1182sequence of bytes, but the buffer remains multibyte if it was multibyte
1183before.
cc6d0d2c 1184
35864124
LT
1185This command returns the length of the encoded text.
1186@end deffn
1187
1188@defun encode-coding-string string coding-system &optional nocopy
cc6d0d2c
RS
1189This function encodes the text in @var{string} according to coding
1190system @var{coding-system}. It returns a new string containing the
35864124
LT
1191encoded text, except when @var{nocopy} is non-@code{nil}, in which
1192case the function may return @var{string} itself if the encoding
1193operation is trivial. The result of encoding is a unibyte string.
cc6d0d2c
RS
1194@end defun
1195
35864124
LT
1196@deffn Command decode-coding-region start end coding-system
1197This command decodes the text from @var{start} to @var{end} according
cc6d0d2c
RS
1198to coding system @var{coding-system}. The decoded text replaces the
1199original text in the buffer. To make explicit decoding useful, the text
0ace421a
GM
1200before decoding ought to be a sequence of byte values, but both
1201multibyte and unibyte buffers are acceptable.
cc6d0d2c 1202
35864124
LT
1203This command returns the length of the decoded text.
1204@end deffn
1205
1206@defun decode-coding-string string coding-system &optional nocopy
cc6d0d2c
RS
1207This function decodes the text in @var{string} according to coding
1208system @var{coding-system}. It returns a new string containing the
35864124
LT
1209decoded text, except when @var{nocopy} is non-@code{nil}, in which
1210case the function may return @var{string} itself if the decoding
1211operation is trivial. To make explicit decoding useful, the contents
1212of @var{string} ought to be a sequence of byte values, but a multibyte
0ace421a 1213string is acceptable.
cc6d0d2c 1214@end defun
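
As a brief sketch of the string functions working together, the
following form decodes one Latin-1 byte into a character and then
encodes that character as UTF-8. The result of encoding is a unibyte
string, here two bytes long.

@example
;; @r{Decode one Latin-1 byte, then encode the character as UTF-8.}
(let* ((text (decode-coding-string "\351" 'iso-latin-1))
       (bytes (encode-coding-string text 'utf-8)))
  (list (multibyte-string-p bytes) (length bytes)))
     @result{} (nil 2)
@end example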
969fe9b5 1215
131bf943
RS
1216@defun decode-coding-inserted-region from to filename &optional visit beg end replace
1217This function decodes the text from @var{from} to @var{to} as if
1218it were being read from file @var{filename} by @code{insert-file-contents}
1219called with the rest of the arguments provided.
1220
1221The normal way to use this function is after reading text from a file
1222without decoding, if you decide you would rather have decoded it.
1223Instead of deleting the text and reading it again, this time with
1224decoding, you can call this function.
1225@end defun
1226
a9f0a989
RS
1227@node Terminal I/O Encoding
1228@subsection Terminal I/O Encoding
1229
1230 Emacs can decode keyboard input using a coding system, and encode
2eb4136f
RS
1231terminal output. This is useful for terminals that transmit or display
1232text using a particular encoding such as Latin-1. Emacs does not set
1233@code{last-coding-system-used} for encoding or decoding for the
1234terminal.
a9f0a989
RS
1235
1236@defun keyboard-coding-system
a9f0a989
RS
1237This function returns the coding system that is in use for decoding
1238keyboard input---or @code{nil} if no coding system is to be used.
1239@end defun
1240
35864124
LT
1241@deffn Command set-keyboard-coding-system coding-system
1242This command specifies @var{coding-system} as the coding system to
a9f0a989
RS
1243use for decoding keyboard input. If @var{coding-system} is @code{nil},
1244that means do not decode keyboard input.
35864124 1245@end deffn
a9f0a989
RS
1246
1247@defun terminal-coding-system
a9f0a989
RS
1248This function returns the coding system that is in use for encoding
1249terminal output---or @code{nil} for no encoding.
1250@end defun
1251
35864124
LT
1252@deffn Command set-terminal-coding-system coding-system
1253This command specifies @var{coding-system} as the coding system to use
a9f0a989
RS
1254for encoding terminal output. If @var{coding-system} is @code{nil},
1255that means do not encode terminal output.
35864124 1256@end deffn

@node MS-DOS File Types
@subsection MS-DOS File Types
@cindex DOS file types
@cindex MS-DOS file types
@cindex Windows file types
@cindex file types on MS-DOS and Windows
@cindex text files and binary files
@cindex binary files and text files

  On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
end-of-line conversion for a file by looking at the file's name.  This
feature classifies files as @dfn{text files} and @dfn{binary files}.  By
``binary file'' we mean a file of literal byte values that are not
necessarily meant to be characters; Emacs does no end-of-line conversion
and no character code conversion for them.  On the other hand, the bytes
in a text file are intended to represent characters; when you create a
new file whose name implies that it is a text file, Emacs uses DOS
end-of-line conversion.

@defvar buffer-file-type
This variable, automatically buffer-local in each buffer, records the
file type of the buffer's visited file.  When a buffer does not specify
a coding system with @code{buffer-file-coding-system}, this variable is
used to determine which coding system to use when writing the contents
of the buffer.  It should be @code{nil} for text, @code{t} for binary.
If it is @code{t}, the coding system is @code{no-conversion}.
Otherwise, @code{undecided-dos} is used.

Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
@end defvar

@defopt file-name-buffer-file-type-alist
This variable holds an alist for recognizing text and binary files.
Each element has the form @code{(@var{regexp} . @var{type})}, where
@var{regexp} is matched against the file name, and @var{type} may be
@code{nil} for text, @code{t} for binary, or a function to call to
compute which.  If it is a function, then it is called with a single
argument (the file name) and should return @code{t} or @code{nil}.

When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
which coding system to use when reading a file.  For a text file,
@code{undecided-dos} is used.  For a binary file, @code{no-conversion}
is used.

If no element in this alist matches a given file name, then
@code{default-buffer-file-type} says how to treat the file.
@end defopt

@defopt default-buffer-file-type
This variable says how to handle files for which
@code{file-name-buffer-file-type-alist} says nothing about the type.

If this variable is non-@code{nil}, then these files are treated as
binary: the coding system @code{no-conversion} is used.  Otherwise,
nothing special is done for them---the coding system is deduced solely
from the file contents, in the usual Emacs fashion.
@end defopt
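
  The lookup that @code{file-name-buffer-file-type-alist} and
@code{default-buffer-file-type} describe can be sketched in Lisp.  This
is not the code Emacs itself uses; @code{my-classify-file-name} is a
hypothetical helper that takes the alist and the default as arguments,
so that you can try it on any system.

@example
(defun my-classify-file-name (file-name alist default)
  "Return non-nil if FILE-NAME names a binary file, nil for text.
ALIST has the form of `file-name-buffer-file-type-alist'; DEFAULT
plays the role of `default-buffer-file-type'."
  (let ((tail alist)
        (found nil)
        (type default))
    ;; The first element whose regexp matches decides the type.
    (while (and tail (not found))
      (when (string-match (car (car tail)) file-name)
        (setq found t
              type (cdr (car tail))))
      (setq tail (cdr tail)))
    ;; A function entry computes the type from the file name.
    (if (and found (functionp type))
        (funcall type file-name)
      type)))

(my-classify-file-name "data.exe"
                       '(("\\.exe\\'" . t) ("\\.txt\\'" . nil))
                       nil)
     @result{} t
@end example

With the same alist, @code{"notes.txt"} yields @code{nil}; a name that
matches no element yields whatever was passed as @var{default}.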

@node Input Methods
@section Input Methods
@cindex input methods

  @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
characters from the keyboard.  Unlike coding systems, which translate
non-@acronym{ASCII} characters to and from encodings meant to be read by
programs, input methods provide human-friendly commands.  (@xref{Input
Methods,,, emacs, The GNU Emacs Manual}, for information on how users
use input methods to enter text.)  How to define input methods is not
yet documented in this manual, but here we describe how to use them.

  Each input method has a name, which is currently a string;
in the future, symbols may also be usable as input method names.

@defvar current-input-method
This variable holds the name of the input method now active in the
current buffer.  (It automatically becomes local in each buffer when set
in any fashion.)  It is @code{nil} if no input method is active in the
buffer now.
@end defvar

@defopt default-input-method
This variable holds the default input method for commands that choose an
input method.  Unlike @code{current-input-method}, this variable is
normally global.
@end defopt

@deffn Command set-input-method input-method
This command activates input method @var{input-method} for the current
buffer.  It also sets @code{default-input-method} to @var{input-method}.
If @var{input-method} is @code{nil}, this command deactivates any input
method for the current buffer.
@end deffn
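
  Here is a short sketch of how these variables and
@code{set-input-method} fit together.  It assumes that an input method
named @code{"latin-1-prefix"} is installed; substitute any name that
appears in your @code{input-method-alist}.

@example
(with-temp-buffer
  ;; Activate an input method in this buffer only.  As a side
  ;; effect this also changes `default-input-method'.
  (set-input-method "latin-1-prefix")
  ;; `current-input-method' now holds the method's name.
  (message "Active: %s" current-input-method)
  ;; Passing nil turns the input method off again.
  (set-input-method nil)
  (message "Active: %s" current-input-method))
@end example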

@defun read-input-method-name prompt &optional default inhibit-null
This function reads an input method name with the minibuffer, prompting
with @var{prompt}.  If @var{default} is non-@code{nil}, it is returned
when the user enters empty input.  However, if
@var{inhibit-null} is non-@code{nil}, empty input signals an error.

The returned value is a string.
@end defun
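
  A command that lets the user pick an input method could combine
@code{read-input-method-name} with @code{set-input-method}, as in this
sketch.  The command name @code{my-choose-input-method} is
hypothetical.

@example
(defun my-choose-input-method ()
  "Read an input method name and activate it in this buffer."
  (interactive)
  (set-input-method
   ;; Offer `default-input-method' as the default.  Because
   ;; INHIBIT-NULL is t, empty input is an error when there
   ;; is no default to fall back on.
   (read-input-method-name "Input method: "
                           default-input-method t)))
@end example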

@defvar input-method-alist
This variable defines all the supported input methods.
Each element defines one input method, and should have the form:

@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example

Here @var{input-method} is the input method name, a string;
@var{language-env} is another string, the name of the language
environment this input method is recommended for.  (That serves only for
documentation purposes.)

@var{activate-func} is a function to call to activate this method.  The
@var{args}, if any, are passed as arguments to @var{activate-func}.  All
told, the arguments to @var{activate-func} are @var{input-method} and
the @var{args}.

@var{title} is a string to display in the mode line while this method is
active.  @var{description} is a string describing this method and what
it is good for.
@end defvar
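
  Since each element of @code{input-method-alist} is an ordinary list,
you can examine it with the usual list functions.  The following
sketch collects the names of the input methods recommended for one
language environment; @code{my-input-methods-for} is a hypothetical
name.

@example
(defun my-input-methods-for (language-env)
  "Return the names of input methods recommended for LANGUAGE-ENV.
LANGUAGE-ENV is a string; the value is a list of strings."
  (let (names)
    (dolist (entry input-method-alist)
      ;; The second element of each entry is the language
      ;; environment; the first is the input method name.
      (when (equal (nth 1 entry) language-env)
        (push (car entry) names)))
    (nreverse names)))
@end example

For example, @code{(my-input-methods-for "Latin-1")} returns the
methods registered for the Latin-1 language environment, or
@code{nil} if there are none.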

  The fundamental interface to input methods is through the
variable @code{input-method-function}.  @xref{Reading One Event},
and @ref{Invoking the Input Method}.

@node Locales
@section Locales
@cindex locale

  POSIX defines a concept of ``locales'', which control which language
to use in language-related features.  These Emacs variables control
how Emacs interacts with these features.

@defvar locale-coding-system
@tindex locale-coding-system
@cindex keyboard input decoding on X
This variable specifies the coding system to use for decoding system
error messages and---on the X Window System only---keyboard input, for
encoding the format argument to @code{format-time-string}, and for
decoding the return value of @code{format-time-string}.
@end defvar

@defvar system-messages-locale
@tindex system-messages-locale
This variable specifies the locale to use for generating system error
messages.  Changing the locale can cause messages to come out in a
different language or in a different orthography.  If the variable is
@code{nil}, the locale is specified by environment variables in the
usual POSIX fashion.
@end defvar

@defvar system-time-locale
@tindex system-time-locale
This variable specifies the locale to use for formatting time values.
Changing the locale can cause messages to appear according to the
conventions of a different language.  If the variable is @code{nil}, the
locale is specified by environment variables in the usual POSIX fashion.
@end defvar
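
  One way to use @code{system-time-locale} is to bind it with
@code{let} around a call to @code{format-time-string}, so that the
change does not outlast the call.  In this sketch the function name
@code{my-format-date-in-locale} is hypothetical, and the locale name
you pass must be one that your system provides.

@example
(defun my-format-date-in-locale (locale time)
  "Format TIME as a long date using LOCALE, a string such as \"C\"."
  (let ((system-time-locale locale))
    (format-time-string "%A, %d %B %Y" time)))

(my-format-date-in-locale "C" (encode-time 0 0 12 1 1 2000))
     @result{} "Saturday, 01 January 2000"
@end example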

@defun locale-info item
This function returns locale data @var{item} for the current POSIX
locale, if available.  @var{item} should be one of these symbols:

@table @code
@item codeset
Return the character set as a string (locale item @code{CODESET}).

@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).

@item months
Return a 12-element vector of month names (locale items @code{MON_1}
through @code{MON_12}).

@item paper
Return a list @code{(@var{width} @var{height})} for the default paper
size measured in millimeters (locale items @code{PAPER_WIDTH} and
@code{PAPER_HEIGHT}).
@end table

If the system can't provide the requested information, or if
@var{item} is not one of those symbols, the value is @code{nil}.  All
strings in the return value are decoded using
@code{locale-coding-system}.  @xref{Locales,,, libc, The GNU Libc Manual},
for more information about locales and locale items.
@end defun
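
  Because @code{locale-info} returns @code{nil} when the data is
unavailable, callers should check the value before indexing into it.
The following sketch does so for month names;
@code{my-month-name} is a hypothetical name.

@example
(defun my-month-name (month)
  "Return the current locale's name for MONTH, a number from 1 to 12.
The value is nil if the system cannot supply month names."
  (let ((months (locale-info 'months)))
    ;; The vector is indexed from zero, so January is element 0.
    (and months (aref months (1- month)))))
@end example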

@ignore
   arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
@end ignore