1 @c -*-texinfo-*-
2 @c This is part of the GNU Emacs Lisp Reference Manual.
3 @c Copyright (C) 1998 Free Software Foundation, Inc.
4 @c See the file elisp.texi for copying conditions.
5 @setfilename ../info/characters
6 @node Non-ASCII Characters, Searching and Matching, Text, Top
7 @chapter Non-ASCII Characters
8 @cindex multibyte characters
9 @cindex non-ASCII characters
10
11 This chapter covers the special issues relating to non-@sc{ascii}
12 characters and how they are stored in strings and buffers.
13
14 @menu
15 * Text Representations::
16 * Converting Representations::
17 * Selecting a Representation::
18 * Character Codes::
19 * Character Sets::
20 * Chars and Bytes::
21 * Splitting Characters::
22 * Scanning Charsets::
23 * Translation of Characters::
24 * Coding Systems::
25 * Input Methods::
26 @end menu
27
28 @node Text Representations
29 @section Text Representations
30 @cindex text representations
31
32 Emacs has two @dfn{text representations}---two ways to represent text
33 in a string or buffer. These are called @dfn{unibyte} and
34 @dfn{multibyte}. Each string, and each buffer, uses one of these two
35 representations. For most purposes, you can ignore the issue of
36 representations, because Emacs converts text between them as
37 appropriate. Occasionally in Lisp programming you will need to pay
38 attention to the difference.
39
40 @cindex unibyte text
41 In unibyte representation, each character occupies one byte and
42 therefore the possible character codes range from 0 to 255. Codes 0
43 through 127 are @sc{ascii} characters; the codes from 128 through 255
44 are used for one non-@sc{ascii} character set (you can choose which
45 character set by setting the variable @code{nonascii-insert-offset}).
46
47 @cindex leading code
48 @cindex multibyte text
49 @cindex trailing codes
50 In multibyte representation, a character may occupy more than one
51 byte, and as a result, the full range of Emacs character codes can be
52 stored. The first byte of a multibyte character is always in the range
53 128 through 159 (octal 0200 through 0237). These values are called
54 @dfn{leading codes}. The second and subsequent bytes of a multibyte
55 character are always in the range 160 through 255 (octal 0240 through
56 0377); these values are @dfn{trailing codes}.
57
58 Some sequences of bytes do not form meaningful multibyte characters:
59 for example, a single isolated byte in the range 128 through 255 is
60 never meaningful. Such byte sequences are invalid, and never
61 appear in proper multibyte text (since that consists of a sequence of
62 @emph{characters}); but they can appear as part of ``raw bytes''
63 (@pxref{Explicit Encoding}).
64
65 In a buffer, the buffer-local value of the variable
66 @code{enable-multibyte-characters} specifies the representation used.
67 The representation for a string is determined and recorded in the string
68 when the string is constructed.
69
70 @defvar enable-multibyte-characters
71 @tindex enable-multibyte-characters
72 This variable specifies the current buffer's text representation.
73 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
74 it contains unibyte text.
75
76 You cannot set this variable directly; instead, use the function
77 @code{set-buffer-multibyte} to change a buffer's representation.
78 @end defvar
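
As a rough sketch of how the variable reflects the representation, the
following form switches a temporary buffer to unibyte and then reads the
variable. It assumes @code{with-temp-buffer} is available and otherwise
uses only @code{set-buffer-multibyte}, described later in this chapter:

@example
(with-temp-buffer
  ;; Make this scratch buffer unibyte, then inspect the variable.
  (set-buffer-multibyte nil)
  enable-multibyte-characters)
     @result{} nil
@end example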
79
80 @defvar default-enable-multibyte-characters
81 @tindex default-enable-multibyte-characters
82 This variable's value is entirely equivalent to @code{(default-value
83 'enable-multibyte-characters)}, and setting this variable changes that
84 default value. Setting the local binding of
85 @code{enable-multibyte-characters} in a specific buffer is not allowed,
86 but changing the default value is supported, and it is a reasonable
87 thing to do, because it has no effect on existing buffers.
88
89 The @samp{--unibyte} command line option does its job by setting the
90 default value to @code{nil} early in startup.
91 @end defvar
92
93 @defun position-bytes position
94 @tindex position-bytes
95 Return the byte-position corresponding to buffer position @var{position}
96 in the current buffer.
97 @end defun
98
99 @defun byte-to-position byte-position
100 @tindex byte-to-position
101 Return the buffer position corresponding to byte-position
102 @var{byte-position} in the current buffer.
103 @end defun
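
As an illustration, @code{position-bytes} can be used to measure a
stretch of buffer text in bytes rather than characters. The name
@code{my-region-byte-length} is hypothetical, not part of Emacs:

@example
(defun my-region-byte-length (start end)
  "Return the number of bytes between positions START and END."
  (- (position-bytes end) (position-bytes start)))

;; @r{In a unibyte buffer this should equal @code{(- end start)},}
;; @r{since each character occupies one byte.}
@end example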
104
105 @defun multibyte-string-p string
106 @tindex multibyte-string-p
107 Return @code{t} if @var{string} is a multibyte string.
108 @end defun
109
110 @node Converting Representations
111 @section Converting Text Representations
112
113 Emacs can convert unibyte text to multibyte; it can also convert
114 multibyte text to unibyte, though this conversion loses information. In
115 general these conversions happen when inserting text into a buffer, or
116 when putting text from several strings together in one string. You can
117 also explicitly convert a string's contents to either representation.
118
119 Emacs chooses the representation for a string based on the text that
120 it is constructed from. The general rule is to convert unibyte text to
121 multibyte text when combining it with other multibyte text, because the
122 multibyte representation is more general and can hold whatever
123 characters the unibyte text has.
124
125 When inserting text into a buffer, Emacs converts the text to the
126 buffer's representation, as specified by
127 @code{enable-multibyte-characters} in that buffer. In particular, when
128 you insert multibyte text into a unibyte buffer, Emacs converts the text
129 to unibyte, even though this conversion cannot in general preserve all
130 the characters that might be in the multibyte text. The other natural
131 alternative, to convert the buffer contents to multibyte, is not
132 acceptable because the buffer's representation is a choice made by the
133 user that cannot be overridden automatically.
134
135 Converting unibyte text to multibyte text leaves @sc{ascii} characters
136 unchanged, and likewise the codes 128 through 159. It converts the non-@sc{ascii}
137 codes 160 through 255 by adding the value @code{nonascii-insert-offset}
138 to each character code. By setting this variable, you specify which
139 character set the unibyte characters correspond to (@pxref{Character
140 Sets}). For example, if @code{nonascii-insert-offset} is 2048, which is
141 @code{(- (make-char 'latin-iso8859-1) 128)}, then the unibyte
142 non-@sc{ascii} characters correspond to Latin 1. If it is 2688, which
143 is @code{(- (make-char 'greek-iso8859-7) 128)}, then they correspond to
144 Greek letters.
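
The arithmetic can be checked by hand. This sketch assumes the Latin 1
offset quoted above and binds the variable only locally; the code 200
is an arbitrary sample value:

@example
(let ((nonascii-insert-offset
       (- (make-char 'latin-iso8859-1) 128)))   ; @r{2048, as stated above}
  ;; Unibyte code 200 becomes this multibyte character code.
  (+ 200 nonascii-insert-offset))
     @result{} 2248
@end example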
145
146 Converting multibyte text to unibyte is simpler: it discards all but
147 the low 8 bits of each character code. If @code{nonascii-insert-offset}
148 has a reasonable value, corresponding to the beginning of some character
149 set, this conversion is the inverse of the other: converting unibyte
150 text to multibyte and back to unibyte reproduces the original unibyte
151 text.
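
For instance, taking the low 8 bits of the Latin 1 character code 2248
recovers the unibyte code 200. This shows only the arithmetic of the
rule, written with @code{logand}; it is not a call to the conversion
itself:

@example
(logand 2248 255)
     @result{} 200
@end example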
152
153 @defvar nonascii-insert-offset
154 @tindex nonascii-insert-offset
155 This variable specifies the amount to add to a non-@sc{ascii} character
156 when converting unibyte text to multibyte. It also applies when
157 @code{self-insert-command} inserts a character in the unibyte
158 non-@sc{ascii} range, 128 through 255. However, the function
159 @code{insert-char} does not perform this conversion.
160
161 The right value to use to select character set @var{cs} is @code{(-
162 (make-char @var{cs}) 128)}. If the value of
163 @code{nonascii-insert-offset} is zero, then conversion actually uses the
164 value for the Latin 1 character set, rather than zero.
165 @end defvar
166
167 @defvar nonascii-translation-table
168 @tindex nonascii-translation-table
169 This variable provides a more general alternative to
170 @code{nonascii-insert-offset}. You can use it to specify independently
171 how to translate each code in the range of 128 through 255 into a
172 multibyte character. The value should be a vector, or @code{nil}.
173 If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
174 @end defvar
175
176 @defun string-make-unibyte string
177 @tindex string-make-unibyte
178 This function converts the text of @var{string} to unibyte
179 representation, if it isn't already, and returns the result. If
180 @var{string} is a unibyte string, it is returned unchanged.
181 @end defun
182
183 @defun string-make-multibyte string
184 @tindex string-make-multibyte
185 This function converts the text of @var{string} to multibyte
186 representation, if it isn't already, and returns the result. If
187 @var{string} is a multibyte string, it is returned unchanged.
188 @end defun
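
The following sketch, with the hypothetical name
@code{my-representation-round-trip}, converts a string both ways and
reports which representation each result uses. The exact results may
depend on the string and on the Emacs version, so none are shown:

@example
(defun my-representation-round-trip (string)
  "Return a list (UNIBYTE-P MULTIBYTE-P) for conversions of STRING."
  (let* ((unibyte (string-make-unibyte string))
         (multibyte (string-make-multibyte unibyte)))
    (list (not (multibyte-string-p unibyte))
          (multibyte-string-p multibyte))))
@end example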
189
190 @node Selecting a Representation
191 @section Selecting a Representation
192
193 Sometimes it is useful to examine an existing buffer or string as
194 multibyte when it was unibyte, or vice versa.
195
196 @defun set-buffer-multibyte multibyte
197 @tindex set-buffer-multibyte
198 Set the representation type of the current buffer. If @var{multibyte}
199 is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
200 is @code{nil}, the buffer becomes unibyte.
201
202 This function leaves the buffer contents unchanged when viewed as a
203 sequence of bytes. As a consequence, it can change the contents viewed
204 as characters; a sequence of two bytes which is treated as one character
205 in multibyte representation will count as two characters in unibyte
206 representation.
207
208 This function sets @code{enable-multibyte-characters} to record which
209 representation is in use. It also adjusts various data in the buffer
210 (including overlays, text properties and markers) so that they cover the
211 same text as they did before.
212
213 You cannot use @code{set-buffer-multibyte} on an indirect buffer,
214 because indirect buffers always inherit the representation of the
215 base buffer.
216 @end defun
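
As an illustration of the byte view, this sketch inserts text into a
temporary buffer and counts it first as characters and then, after
switching to unibyte, as bytes. @code{my-chars-and-bytes} is a
hypothetical name, and the sketch assumes that the temporary buffer
starts out multibyte:

@example
(defun my-chars-and-bytes (text)
  "Return a list (CHARS BYTES) describing TEXT."
  (with-temp-buffer
    (insert text)
    (let ((chars (- (point-max) (point-min))))
      ;; Same bytes, now viewed one byte per character.
      (set-buffer-multibyte nil)
      (list chars (- (point-max) (point-min))))))
@end example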
217
218 @defun string-as-unibyte string
219 @tindex string-as-unibyte
220 This function returns a string with the same bytes as @var{string} but
221 treating each byte as a character. This means that the value may have
222 more characters than @var{string} has.
223
224 If @var{string} is already a unibyte string, then the value is
225 @var{string} itself.
226 @end defun
227
228 @defun string-as-multibyte string
229 @tindex string-as-multibyte
230 This function returns a string with the same bytes as @var{string} but
231 treating each multibyte sequence as one character. This means that the
232 value may have fewer characters than @var{string} has.
233
234 If @var{string} is already a multibyte string, then the value is
235 @var{string} itself.
236 @end defun
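
Because @code{string-as-unibyte} treats each byte as one character, the
length of its value can serve as a byte count. A minimal sketch, with a
hypothetical function name:

@example
(defun my-string-byte-length (string)
  "Return the number of bytes used to represent STRING."
  (length (string-as-unibyte string)))
@end example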
237
238 @node Character Codes
239 @section Character Codes
240 @cindex character codes
241
242 The unibyte and multibyte text representations use different character
243 codes. The valid character codes for unibyte representation range from
244 0 to 255---the values that can fit in one byte. The character
245 codes for multibyte representation lie between 0 and 524287, but not all
246 values in that range are valid. In particular, the values 128 through
247 255 are not legitimate in multibyte text (though they can occur in ``raw
248 bytes''; @pxref{Explicit Encoding}). Only the @sc{ascii} codes 0
249 through 127 are fully legitimate in both representations.
250
251 @defun char-valid-p charcode
252 This function returns @code{t} if @var{charcode} is valid for either one of the two
253 text representations.
254
255 @example
256 (char-valid-p 65)
257 @result{} t
258 (char-valid-p 256)
259 @result{} nil
260 (char-valid-p 2248)
261 @result{} t
262 @end example
263 @end defun
264
265 @node Character Sets
266 @section Character Sets
267 @cindex character sets
268
269 Emacs classifies characters into various @dfn{character sets}, each of
270 which has a name which is a symbol. Each character belongs to one and
271 only one character set.
272
273 In general, there is one character set for each distinct script. For
274 example, @code{latin-iso8859-1} is one character set,
275 @code{greek-iso8859-7} is another, and @code{ascii} is another. An
276 Emacs character set can hold at most 9025 characters; therefore, in some
277 cases, characters that would logically be grouped together are split
278 into several character sets. For example, one set of Chinese
279 characters, generally known as Big 5, is divided into two Emacs
280 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
281
282 @defun charsetp object
283 @tindex charsetp
284 This function returns @code{t} if @var{object} is a symbol that names a character set,
285 @code{nil} otherwise.
286 @end defun
287
288 @defun charset-list
289 @tindex charset-list
290 This function returns a list of all defined character set names.
291 @end defun
292
293 @defun char-charset character
294 @tindex char-charset
295 This function returns the name of the character set that @var{character}
296 belongs to.
297 @end defun
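
The values below follow from the @code{split-char} examples later in
this chapter, which report the same character sets for these codes:

@example
(char-charset 65)
     @result{} ascii
(char-charset 2248)
     @result{} latin-iso8859-1
(charsetp (char-charset 65))
     @result{} t
@end example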
298
299 @defun charset-plist charset
300 @tindex charset-plist
301 This function returns the charset property list of the character set
302 @var{charset}. Although @var{charset} is a symbol, this is not the same
303 as the property list of that symbol. Charset properties are used for
304 special purposes within Emacs; for example, @code{x-charset-registry}
305 helps determine which fonts to use (@pxref{Font Selection}).
306 @end defun
307
308 @node Chars and Bytes
309 @section Characters and Bytes
310 @cindex bytes and characters
311
312 @cindex introduction sequence
313 @cindex dimension (of character set)
314 In multibyte representation, each character occupies one or more
315 bytes. Each character set has an @dfn{introduction sequence}, which is
316 normally one or two bytes long. (Exception: the @sc{ascii} character
317 set has a zero-length introduction sequence.) The introduction sequence
318 is the beginning of the byte sequence for any character in the character
319 set. The rest of the character's bytes distinguish it from the other
320 characters in the same character set. Depending on the character set,
321 there are either one or two distinguishing bytes; the number of such
322 bytes is called the @dfn{dimension} of the character set.
323
324 @defun charset-dimension charset
325 @tindex charset-dimension
326 This function returns the dimension of @var{charset}; at present, the
327 dimension is always 1 or 2.
328 @end defun
329
330 @defun charset-bytes charset
331 @tindex charset-bytes
332 This function returns the number of bytes used to represent a character
333 in character set @var{charset}.
334 @end defun
335
336 This is the simplest way to determine the byte length of a character
337 set's introduction sequence:
338
339 @example
340 (- (charset-bytes @var{charset})
341 (charset-dimension @var{charset}))
342 @end example
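
Building on that expression, the following sketch tabulates the
dimension, byte count and introduction-sequence length for every
defined character set. The function name is hypothetical:

@example
(defun my-charset-byte-table ()
  "Return a list of (CHARSET DIMENSION BYTES INTRO-LENGTH) entries."
  (mapcar (lambda (charset)
            (let ((dimension (charset-dimension charset))
                  (bytes (charset-bytes charset)))
              (list charset dimension bytes (- bytes dimension))))
          (charset-list)))
@end example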
343
344 @node Splitting Characters
345 @section Splitting Characters
346
347 The functions in this section convert between characters and the byte
348 values used to represent them. For most purposes, there is no need to
349 be concerned with the sequence of bytes used to represent a character,
350 because Emacs translates automatically when necessary.
351
352 @defun split-char character
353 @tindex split-char
354 Return a list containing the name of the character set of
355 @var{character}, followed by one or two byte values (integers) which
356 identify @var{character} within that character set. The number of byte
357 values is the character set's dimension.
358
359 @example
360 (split-char 2248)
361 @result{} (latin-iso8859-1 72)
362 (split-char 65)
363 @result{} (ascii 65)
364 @end example
365
366 Unibyte non-@sc{ascii} characters are considered as part of
367 the @code{ascii} character set:
368
369 @example
370 (split-char 192)
371 @result{} (ascii 192)
372 @end example
373 @end defun
374
375 @defun make-char charset &rest byte-values
376 @tindex make-char
377 This function returns the character in character set @var{charset}
378 identified by @var{byte-values}. This is roughly the inverse of
379 @code{split-char}. Normally, you should specify either one or two
380 @var{byte-values}, according to the dimension of @var{charset}. For
381 example,
382
383 @example
384 (make-char 'latin-iso8859-1 72)
385 @result{} 2248
386 @end example
387 @end defun
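
Since the two functions are roughly inverses, applying
@code{make-char} to the result of @code{split-char} should give back
the original character. Using the values from the examples above:

@example
(apply 'make-char (split-char 2248))
     @result{} 2248
@end example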
388
389 @cindex generic characters
390 If you call @code{make-char} with no @var{byte-values}, the result is
391 a @dfn{generic character} which stands for @var{charset}. A generic
392 character is an integer, but it is @emph{not} valid for insertion in the
393 buffer as a character. It can be used in @code{char-table-range} to
394 refer to the whole character set (@pxref{Char-Tables}).
395 @code{char-valid-p} returns @code{nil} for generic characters.
396 For example:
397
398 @example
399 (make-char 'latin-iso8859-1)
400 @result{} 2176
401 (char-valid-p 2176)
402 @result{} nil
403 (split-char 2176)
404 @result{} (latin-iso8859-1 0)
405 @end example
406
407 @node Scanning Charsets
408 @section Scanning for Character Sets
409
410 Sometimes it is useful to find out which character sets appear in a
411 part of a buffer or a string. One use for this is in determining which
412 coding systems (@pxref{Coding Systems}) are capable of representing all
413 of the text in question.
414
415 @defun find-charset-region beg end &optional translation
416 @tindex find-charset-region
417 This function returns a list of the character sets that appear in the
418 current buffer between positions @var{beg} and @var{end}.
419
420 The optional argument @var{translation} specifies a translation table to
421 be used in scanning the text (@pxref{Translation of Characters}). If it
422 is non-@code{nil}, then each character in the region is translated
423 through this table, and the value returned describes the translated
424 characters instead of the characters actually in the buffer.
425
426 In two peculiar cases, the value includes the symbol @code{unknown}:
427
428 @itemize @bullet
429 @item
430 When a unibyte buffer contains non-@sc{ascii} characters.
431
432 @item
433 When a multibyte buffer contains invalid byte-sequences (raw bytes).
434 @xref{Explicit Encoding}.
435 @end itemize
436 @end defun
437
438 @defun find-charset-string string &optional translation
439 @tindex find-charset-string
440 This function returns a list of the character sets that appear in the
441 string @var{string}. It is just like @code{find-charset-region}, except
442 that it applies to the contents of @var{string} instead of part of the
443 current buffer.
444 @end defun
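
One possible use is to list whatever character sets a string needs
beyond @sc{ascii}. In this sketch the function name is hypothetical,
and the list is copied first on the cautious assumption that the
returned value should not be modified in place:

@example
(defun my-non-ascii-charsets (string)
  "Return the character sets in STRING other than `ascii'."
  (delq 'ascii (copy-sequence (find-charset-string string))))
@end example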
445
446 @node Translation of Characters
447 @section Translation of Characters
448 @cindex character translation tables
449 @cindex translation tables
450
451 A @dfn{translation table} specifies a mapping of characters
452 into characters. These tables are used in encoding and decoding, and
453 for other purposes. Some coding systems specify their own particular
454 translation tables; there are also default translation tables which
455 apply to all other coding systems.
456
457 @defun make-translation-table &rest translations
458 This function returns a translation table based on the argument
459 @var{translations}. Each element of
460 @var{translations} should be a list of the form @code{(@var{from}
461 . @var{to})}; this says to translate the character @var{from} into
462 @var{to}.
463
464 You can also map one whole character set into another character set with
465 the same dimension. To do this, you specify a generic character (which
466 designates a character set) for @var{from} (@pxref{Splitting Characters}).
467 In this case, @var{to} should also be a generic character, for another
468 character set of the same dimension. Then the translation table
469 translates each character of @var{from}'s character set into the
470 corresponding character of @var{to}'s character set.
471 @end defun
472
473 In decoding, the translation table's translations are applied to the
474 characters that result from ordinary decoding. If a coding system has
475 property @code{character-translation-table-for-decode}, that specifies
476 the translation table to use. Otherwise, if
477 @code{standard-translation-table-for-decode} is non-@code{nil}, decoding
478 uses that table.
479
480 In encoding, the translation table's translations are applied to the
481 characters in the buffer, and the result of translation is actually
482 encoded. If a coding system has property
483 @code{character-translation-table-for-encode}, that specifies the
484 translation table to use. Otherwise the variable
485 @code{standard-translation-table-for-encode} specifies the translation
486 table.
487
488 @defvar standard-translation-table-for-decode
489 This is the default translation table for decoding, for
490 coding systems that don't specify any other translation table.
491 @end defvar
492
493 @defvar standard-translation-table-for-encode
494 This is the default translation table for encoding, for
495 coding systems that don't specify any other translation table.
496 @end defvar
497
498 @node Coding Systems
499 @section Coding Systems
500
501 @cindex coding system
502 When Emacs reads or writes a file, and when Emacs sends text to a
503 subprocess or receives text from a subprocess, it normally performs
504 character code conversion and end-of-line conversion as specified
505 by a particular @dfn{coding system}.
506
507 How to define a coding system is an arcane matter, and is not
508 documented here.
509
510 @menu
511 * Coding System Basics::
512 * Encoding and I/O::
513 * Lisp and Coding Systems::
514 * User-Chosen Coding Systems::
515 * Default Coding Systems::
516 * Specifying Coding Systems::
517 * Explicit Encoding::
518 * Terminal I/O Encoding::
519 * MS-DOS File Types::
520 @end menu
521
522 @node Coding System Basics
523 @subsection Basic Concepts of Coding Systems
524
525 @cindex character code conversion
526 @dfn{Character code conversion} involves conversion between the encoding
527 used inside Emacs and some other encoding. Emacs supports many
528 different encodings, in that it can convert to and from them. For
529 example, it can convert text to or from encodings such as Latin 1, Latin
530 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some
531 cases, Emacs supports several alternative encodings for the same
532 characters; for example, there are three coding systems for the Cyrillic
533 (Russian) alphabet: ISO, Alternativnyj, and KOI8.
534
535 Most coding systems specify a particular character code for
536 conversion, but some of them leave the choice unspecified---to be chosen
537 heuristically for each file, based on the data.
538
539 @cindex end of line conversion
540 @dfn{End of line conversion} handles three different conventions used
541 on various systems for representing end of line in files. The Unix
542 convention is to use the linefeed character (also called newline). The
543 DOS convention is to use a carriage-return and a linefeed at the end of
544 a line. The Mac convention is to use just carriage-return.
545
546 @cindex base coding system
547 @cindex variant coding system
548 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
549 conversion unspecified, to be chosen based on the data. @dfn{Variant
550 coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
551 @code{latin-1-mac} specify the end-of-line conversion explicitly as
552 well. Most base coding systems have three corresponding variants whose
553 names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
554
555 The coding system @code{raw-text} is special in that it prevents
556 character code conversion, and causes the buffer visited with that
557 coding system to be a unibyte buffer. It does not specify the
558 end-of-line conversion, allowing that to be determined as usual by the
559 data, and has the usual three variants which specify the end-of-line
560 conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
561 it specifies no conversion of either character codes or end-of-line.
562
563 The coding system @code{emacs-mule} specifies that the data is
564 represented in the internal Emacs encoding. This is like
565 @code{raw-text} in that no code conversion happens, but different in
566 that the result is multibyte data.
567
568 @defun coding-system-get coding-system property
569 @tindex coding-system-get
570 This function returns the specified property of the coding system
571 @var{coding-system}. Most coding system properties exist for internal
572 purposes, but one that you might find useful is @code{mime-charset}.
573 That property's value is the name used in MIME for the character coding
574 which this coding system can read and write. Examples:
575
576 @example
577 (coding-system-get 'iso-latin-1 'mime-charset)
578 @result{} iso-8859-1
579 (coding-system-get 'iso-2022-cn 'mime-charset)
580 @result{} iso-2022-cn
581 (coding-system-get 'cyrillic-koi8 'mime-charset)
582 @result{} koi8-r
583 @end example
584
585 The value of the @code{mime-charset} property is also defined
586 as an alias for the coding system.
587 @end defun
588
589 @node Encoding and I/O
590 @subsection Encoding and I/O
591
592 The principal purpose of coding systems is for use in reading and
593 writing files. The function @code{insert-file-contents} uses
594 a coding system for decoding the file data, and @code{write-region}
595 uses one to encode the buffer contents.
596
597 You can specify the coding system to use either explicitly
598 (@pxref{Specifying Coding Systems}), or implicitly using the defaulting
599 mechanism (@pxref{Default Coding Systems}). But these methods may not
600 completely specify what to do. For example, they may choose a coding
601 system such as @code{undecided} which leaves the character code
602 conversion to be determined from the data. In these cases, the I/O
603 operation finishes the job of choosing a coding system. Very often
604 you will want to find out afterwards which coding system was chosen.
605
606 @defvar buffer-file-coding-system
607 @tindex buffer-file-coding-system
608 This variable records the coding system that was used for visiting the
609 current buffer. It is used for saving the buffer, and for writing part
610 of the buffer with @code{write-region}. When those operations ask the
611 user to specify a different coding system,
612 @code{buffer-file-coding-system} is updated to the coding system
613 specified.
614
615 However, @code{buffer-file-coding-system} does not affect sending text
616 to a subprocess.
617 @end defvar
618
619 @defvar save-buffer-coding-system
620 @tindex save-buffer-coding-system
621 This variable specifies the coding system for saving the buffer---but it
622 is not used for @code{write-region}.
623
624 When a command to save the buffer starts out using
625 @code{save-buffer-coding-system}, and that coding system cannot handle
626 the actual text in the buffer, the command asks the user to choose
627 another coding system. After that happens, the command also updates
628 @code{save-buffer-coding-system} to represent the coding system that the
629 user specified.
630 @end defvar
631
632 @defvar last-coding-system-used
633 @tindex last-coding-system-used
634 I/O operations for files and subprocesses set this variable to the
635 coding system name that was used. The explicit encoding and decoding
636 functions (@pxref{Explicit Encoding}) set it too.
637
638 @strong{Warning:} Since receiving subprocess output sets this variable,
639 it can change whenever Emacs waits; therefore, you should copy the
640 value shortly after the function call that stores the value you are
641 interested in.
642 @end defvar
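
Following that advice, a function that needs to know which coding
system decoded a file might copy the value immediately after the call.
This is a sketch only; @code{my-insert-file-noting-coding} is a
hypothetical name:

@example
(defun my-insert-file-noting-coding (file)
  "Insert FILE and return the coding system used to decode it."
  (insert-file-contents file)
  ;; Copy the value at once, before Emacs has a chance to wait.
  (let ((coding last-coding-system-used))
    (message "Decoded %s using %s" file coding)
    coding))
@end example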
643
644 The variable @code{selection-coding-system} specifies how to encode
645 selections for the window system. @xref{Window System Selections}.
646
647 @node Lisp and Coding Systems
648 @subsection Coding Systems in Lisp
649
650 Here are the Lisp facilities for working with coding systems:
651
652 @defun coding-system-list &optional base-only
653 @tindex coding-system-list
654 This function returns a list of all coding system names (symbols). If
655 @var{base-only} is non-@code{nil}, the value includes only the
656 base coding systems. Otherwise, it includes variant coding systems as well.
657 @end defun
658
659 @defun coding-system-p object
660 @tindex coding-system-p
661 This function returns @code{t} if @var{object} is a coding system
662 name.
663 @end defun
664
665 @defun check-coding-system coding-system
666 @tindex check-coding-system
667 This function checks the validity of @var{coding-system}.
668 If that is valid, it returns @var{coding-system}.
669 Otherwise it signals an error with condition @code{coding-system-error}.
670 @end defun
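
If you would rather test a name than signal an error, the error can be
caught with @code{condition-case}. A minimal sketch, with a
hypothetical function name:

@example
(defun my-valid-coding-system-p (coding-system)
  "Return CODING-SYSTEM if it is valid, or nil otherwise."
  (condition-case nil
      (check-coding-system coding-system)
    (coding-system-error nil)))
@end example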
671
672 @defun coding-system-change-eol-conversion coding-system eol-type
673 @tindex coding-system-change-eol-conversion
674 This function returns a coding system which is like @var{coding-system}
675 except for its eol conversion, which is specified by @var{eol-type}.
676 @var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
677 @code{nil}. If it is @code{nil}, the returned coding system determines
678 the end-of-line conversion from the data.
679 @end defun
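
For example, a program that always writes Unix-style line endings
could derive the coding system to use as follows. The function name is
hypothetical, and the exact symbol returned is not shown because it may
depend on aliases:

@example
(defun my-unix-variant (coding-system)
  "Return a variant of CODING-SYSTEM that uses Unix end-of-line."
  (coding-system-change-eol-conversion coding-system 'unix))
@end example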
680
681 @defun coding-system-change-text-conversion eol-coding text-coding
682 @tindex coding-system-change-text-conversion
683 This function returns a coding system which uses the end-of-line
684 conversion of @var{eol-coding}, and the text conversion of
685 @var{text-coding}. If @var{text-coding} is @code{nil}, it returns
686 @code{undecided}, or one of its variants according to @var{eol-coding}.
687 @end defun
688
689 @defun find-coding-systems-region from to
690 @tindex find-coding-systems-region
691 This function returns a list of coding systems that could be used to
692 encode the text between @var{from} and @var{to}. All coding systems in
693 the list can safely encode any multibyte characters in that portion of
694 the text.
695
696 If the text contains no multibyte characters, the function returns the
697 list @code{(undecided)}.
698 @end defun
699
700 @defun find-coding-systems-string string
701 @tindex find-coding-systems-string
702 This function returns a list of coding systems that could be used to
703 encode the text of @var{string}. All coding systems in the list can
704 safely encode any multibyte characters in @var{string}. If the text
705 contains no multibyte characters, this returns the list
706 @code{(undecided)}.
707 @end defun
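
As a sketch of how the result might be used, the following hypothetical
predicate checks whether a given coding system appears among those that
can encode a region. It compares names with @code{memq}, so it assumes
the caller passes a name in the same form as the list uses; aliases and
end-of-line variants may not match:

@example
(defun my-coding-system-usable-p (coding-system from to)
  "Return non-nil if CODING-SYSTEM is listed as able to encode FROM..TO."
  (let ((codings (find-coding-systems-region from to)))
    (or (equal codings '(undecided))   ; @r{no multibyte characters}
        (memq coding-system codings))))
@end example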
708
709 @defun find-coding-systems-for-charsets charsets
710 @tindex find-coding-systems-for-charsets
711 This function returns a list of coding systems that could be used to
712 encode all the character sets in the list @var{charsets}.
713 @end defun
714
715 @defun detect-coding-region start end &optional highest
716 @tindex detect-coding-region
717 This function chooses a plausible coding system for decoding the text
718 from @var{start} to @var{end}. This text should be ``raw bytes''
719 (@pxref{Explicit Encoding}).
720
721 Normally this function returns a list of coding systems that could
722 handle decoding the text that was scanned. They are listed in order of
723 decreasing priority. But if @var{highest} is non-@code{nil}, then the
724 return value is just one coding system, the one that is highest in
725 priority.
726
727 If the region contains only @sc{ascii} characters, the value
728 is @code{undecided} or @code{(undecided)}.
729 @end defun
730
731 @defun detect-coding-string string &optional highest
732 @tindex detect-coding-string
733 This function is like @code{detect-coding-region} except that it
734 operates on the contents of @var{string} instead of bytes in the buffer.
735 @end defun
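
A typical call passes a non-@code{nil} @var{highest} argument to get a
single answer. The wrapper below is a sketch with a hypothetical name,
and it assumes the buffer holds raw bytes:

@example
(defun my-guess-buffer-coding ()
  "Return the most plausible coding system for the buffer's bytes."
  (detect-coding-region (point-min) (point-max) t))
@end example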
736
737 @xref{Process Information}, for how to examine or set the coding
738 systems used for I/O to a subprocess.
739
740 @node User-Chosen Coding Systems
741 @subsection User-Chosen Coding Systems
742
743 @tindex select-safe-coding-system
744 @defun select-safe-coding-system from to &optional preferred-coding-system
745 This function selects a coding system for encoding the text between
746 @var{from} and @var{to}, asking the user to choose if necessary.
747
748 The optional argument @var{preferred-coding-system} specifies a coding
749 system to try first. If that one can handle the text in the specified
750 region, then it is used. If this argument is omitted, the current
751 buffer's value of @code{buffer-file-coding-system} is tried first.
752
753 If the region contains some multibyte characters that the preferred
754 coding system cannot encode, this function asks the user to choose from
755 a list of coding systems which can encode the text, and returns the
756 user's choice.
757
758 As a special case, if @var{from} is a string, that string is the
759 target text, and @var{to} is ignored.
760 @end defun
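
A minimal sketch of a typical call follows.  The function name
@code{my-coding-for-buffer} is hypothetical; it simply applies
@code{select-safe-coding-system} to the whole accessible portion of the
current buffer.

@example
(defun my-coding-for-buffer ()
  "Return a coding system that can encode the accessible buffer text.
The user is asked to choose only if the preferred coding system
cannot encode the text."
  (select-safe-coding-system (point-min) (point-max)))
@end example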
761
762 Here are two functions you can use to let the user specify a coding
763 system, with completion. @xref{Completion}.
764
765 @defun read-coding-system prompt &optional default
766 @tindex read-coding-system
767 This function reads a coding system using the minibuffer, prompting with
768 string @var{prompt}, and returns the coding system name as a symbol. If
769 the user enters null input, @var{default} specifies which coding system
770 to return. It should be a symbol or a string.
771 @end defun
772
773 @defun read-non-nil-coding-system prompt
774 @tindex read-non-nil-coding-system
775 This function reads a coding system using the minibuffer, prompting with
776 string @var{prompt}, and returns the coding system name as a symbol. If
777 the user tries to enter null input, it asks the user to try again.
778 @xref{Coding Systems}.
779 @end defun
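
Here is a sketch of a command built on @code{read-coding-system}.  The
command name @code{my-ask-coding-system} and the choice of
@code{latin-1} as the default are assumptions made for illustration.

@example
(defun my-ask-coding-system ()
  "Read a coding system in the minibuffer and report the choice.
Null input selects `latin-1'."
  (interactive)
  (let ((coding (read-coding-system "Coding system: " 'latin-1)))
    (message "Selected coding system: %s" coding)
    coding))
@end example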
780
781 @node Default Coding Systems
782 @subsection Default Coding Systems
783
784 This section describes variables that specify the default coding
785 system for certain files or when running certain subprograms, and the
786 function that I/O operations use to access them.
787
788 The idea of these variables is that you set them once and for all to the
789 defaults you want, and then do not change them again. To specify a
790 particular coding system for a particular operation in a Lisp program,
791 don't change these variables; instead, override them using
792 @code{coding-system-for-read} and @code{coding-system-for-write}
793 (@pxref{Specifying Coding Systems}).
794
795 @defvar file-coding-system-alist
796 @tindex file-coding-system-alist
797 This variable is an alist that specifies the coding systems to use for
798 reading and writing particular files. Each element has the form
799 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
800 expression that matches certain file names. The element applies to file
801 names that match @var{pattern}.
802
803 The @sc{cdr} of the element, @var{coding}, should be either a coding
804 system, a cons cell containing two coding systems, or a function name (a
805 symbol with a function definition). If @var{coding} is a coding system,
806 that coding system is used for both reading the file and writing it. If
807 @var{coding} is a cons cell containing two coding systems, its @sc{car}
808 specifies the coding system for decoding, and its @sc{cdr} specifies the
809 coding system for encoding.
810
811 If @var{coding} is a function name, the function must return a coding
812 system or a cons cell containing two coding systems. This value is used
813 as described above.
814 @end defvar
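
The two simplest element forms might be added as follows, for instance
in an init file.  The file name patterns are hypothetical, chosen only
to show the shape of each element.

@example
;; @r{Hypothetical rule: one coding system for reading and writing.}
(setq file-coding-system-alist
      (cons '("\\.l1\\'" . latin-1)
            file-coding-system-alist))

;; @r{Hypothetical rule: decode with the @sc{car}, encode with the @sc{cdr}.}
(setq file-coding-system-alist
      (cons '("\\.l1u\\'" . (latin-1 . latin-1-unix))
            file-coding-system-alist))
@end example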
815
816 @defvar process-coding-system-alist
817 @tindex process-coding-system-alist
818 This variable is an alist specifying which coding systems to use for a
819 subprocess, depending on which program is running in the subprocess. It
820 works like @code{file-coding-system-alist}, except that @var{pattern} is
821 matched against the program name used to start the subprocess. The coding
822 system or systems specified in this alist are used to initialize the
823 coding systems used for I/O to the subprocess, but you can specify
824 other coding systems later using @code{set-process-coding-system}.
825 @end defvar
826
827 @strong{Warning:} Coding systems such as @code{undecided}, which
828 determine the coding system from the data, do not work entirely reliably
829 with asynchronous subprocess output. This is because Emacs handles
830 asynchronous subprocess output in batches, as it arrives. If the coding
831 system leaves the character code conversion unspecified, or leaves the
832 end-of-line conversion unspecified, Emacs must try to detect the proper
833 conversion from one batch at a time, and this does not always work.
834
835 Therefore, with an asynchronous subprocess, if at all possible, use a
836 coding system which determines both the character code conversion and
837 the end of line conversion---that is, one like @code{latin-1-unix},
838 rather than @code{undecided} or @code{latin-1}.
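
One way to follow this advice is to give
@code{process-coding-system-alist} an element whose coding systems fix
both conversions.  In this sketch the program name @samp{my-filter} is
hypothetical.

@example
;; @r{Both directions specify the character code and the eol format.}
(setq process-coding-system-alist
      (cons '("my-filter" . (latin-1-unix . latin-1-unix))
            process-coding-system-alist))
@end example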
839
840 @defvar network-coding-system-alist
841 @tindex network-coding-system-alist
842 This variable is an alist that specifies the coding system to use for
843 network streams. It works much like @code{file-coding-system-alist},
844 with the difference that the @var{pattern} in an element may be either a
845 port number or a regular expression. If it is a regular expression, it
846 is matched against the network service name used to open the network
847 stream.
848 @end defvar
849
850 @defvar default-process-coding-system
851 @tindex default-process-coding-system
852 This variable specifies the coding systems to use for subprocess (and
853 network stream) input and output, when nothing else specifies what to
854 do.
855
856 The value should be a cons cell of the form @code{(@var{input-coding}
857 . @var{output-coding})}. Here @var{input-coding} applies to input from
858 the subprocess, and @var{output-coding} applies to output to it.
859 @end defvar
860
861 @defun find-operation-coding-system operation &rest arguments
862 @tindex find-operation-coding-system
863 This function returns the coding system to use (by default) for
864 performing @var{operation} with @var{arguments}. The value has this
865 form:
866
867 @example
868 (@var{decoding-system} . @var{encoding-system})
869 @end example
870
871 The @sc{car}, @var{decoding-system}, is the coding system to use
872 for decoding (in case @var{operation} does decoding), and
873 @var{encoding-system} is the coding system for encoding (in case
874 @var{operation} does encoding).
875
876 The argument @var{operation} should be a symbol, one of
877 @code{insert-file-contents}, @code{write-region}, @code{call-process},
878 @code{call-process-region}, @code{start-process}, or
879 @code{open-network-stream}. These are the names of the Emacs I/O primitives
880 that can do coding system conversion.
881
882 The remaining arguments should be the same arguments that might be given
883 to that I/O primitive. Depending on the primitive, one of those
884 arguments is selected as the @dfn{target}. For example, if
885 @var{operation} does file I/O, whichever argument specifies the file
886 name is the target. For subprocess primitives, the program name is the
887 target. For @code{open-network-stream}, the target is the service name
888 or port number.
889
890 This function looks up the target in @code{file-coding-system-alist},
891 @code{process-coding-system-alist}, or
892 @code{network-coding-system-alist}, depending on @var{operation}.
893 @xref{Default Coding Systems}.
894 @end defun
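
For example, a program could ask which coding system would be used by
default to decode a given file.  This is a sketch; the function name
@code{my-default-decoding-for-file} is hypothetical.

@example
(defun my-default-decoding-for-file (filename)
  "Return the coding system used by default to decode FILENAME.
Return nil if nothing specifies one."
  (car (find-operation-coding-system 'insert-file-contents
                                     filename)))
@end example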
895
896 @node Specifying Coding Systems
897 @subsection Specifying a Coding System for One Operation
898
899 You can specify the coding system for a specific operation by binding
900 the variables @code{coding-system-for-read} and/or
901 @code{coding-system-for-write}.
902
903 @defvar coding-system-for-read
904 @tindex coding-system-for-read
905 If this variable is non-@code{nil}, it specifies the coding system to
906 use for reading a file, or for input from a synchronous subprocess.
907
908 It also applies to any asynchronous subprocess or network stream, but in
909 a different way: the value of @code{coding-system-for-read} when you
910 start the subprocess or open the network stream specifies the input
911 decoding method for that subprocess or network stream. It remains in
912 use for that subprocess or network stream unless and until overridden.
913
914 The right way to use this variable is to bind it with @code{let} for a
915 specific I/O operation. Its global value is normally @code{nil}, and
916 you should not globally set it to any other value. Here is an example
917 of the right way to use the variable:
918
919 @example
920 ;; @r{Read the file with no character code conversion.}
921 ;; @r{Assume @sc{crlf} represents end-of-line.}
922 (let ((coding-system-for-read 'emacs-mule-dos))
923   (insert-file-contents filename))
924 @end example
925
926 When its value is non-@code{nil}, @code{coding-system-for-read} takes
927 precedence over all other methods of specifying a coding system to use for
928 input, including @code{file-coding-system-alist},
929 @code{process-coding-system-alist} and
930 @code{network-coding-system-alist}.
931 @end defvar
932
933 @defvar coding-system-for-write
934 @tindex coding-system-for-write
935 This works much like @code{coding-system-for-read}, except that it
936 applies to output rather than input. It affects writing to files,
937 as well as sending output to subprocesses and net connections.
938
939 When a single operation does both input and output, as do
940 @code{call-process-region} and @code{start-process}, both
941 @code{coding-system-for-read} and @code{coding-system-for-write}
942 affect it.
943 @end defvar
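
As with @code{coding-system-for-read}, the usual approach is to bind
the variable with @code{let} around one output operation.  The function
name in this sketch is hypothetical.

@example
(defun my-write-region-as-latin-1 (start end filename)
  "Write the text from START to END to FILENAME, encoded as Latin-1.
Each line is terminated by a newline only."
  (let ((coding-system-for-write 'latin-1-unix))
    (write-region start end filename)))
@end example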
944
945 @defvar inhibit-eol-conversion
946 @tindex inhibit-eol-conversion
947 When this variable is non-@code{nil}, no end-of-line conversion is done,
948 no matter which coding system is specified. This applies to all the
949 Emacs I/O and subprocess primitives, and to the explicit encoding and
950 decoding functions (@pxref{Explicit Encoding}).
951 @end defvar
952
953 @node Explicit Encoding
954 @subsection Explicit Encoding and Decoding
955 @cindex encoding text
956 @cindex decoding text
957
958 All the operations that transfer text in and out of Emacs have the
959 ability to use a coding system to encode or decode the text.
960 You can also explicitly encode and decode text using the functions
961 in this section.
962
963 @cindex raw bytes
964 The result of encoding, and the input to decoding, are not ordinary
965 text. They are ``raw bytes''---bytes that represent text in the same
966 way that an external file would. When a buffer contains raw bytes, it
967 is most natural to mark that buffer as using unibyte representation,
968 using @code{set-buffer-multibyte} (@pxref{Selecting a Representation}),
969 but this is not required. If the buffer's contents are only temporarily
970 raw, leave the buffer multibyte, which will be correct after you decode
971 them.
972
973 The usual way to get raw bytes in a buffer, for explicit decoding, is
974 to read them from a file with @code{insert-file-contents-literally}
975 (@pxref{Reading from Files}) or specify a non-@code{nil} @var{rawfile}
976 argument when visiting a file with @code{find-file-noselect}.
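
The first of these methods, followed by explicit decoding, might look
like this.  It is only a sketch: the function name
@code{my-insert-file-decoded} is hypothetical, and the code relies on
the inserted text starting at point.

@example
(defun my-insert-file-decoded (filename coding-system)
  "Insert the raw bytes of FILENAME at point, then decode them.
CODING-SYSTEM is the coding system to decode with."
  (let* ((start (point))
         ;; @r{The value is a list: the file name and the length inserted.}
         (inserted (insert-file-contents-literally filename))
         (end (+ start (nth 1 inserted))))
    (decode-coding-region start end coding-system)))
@end example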
977
978 The usual way to use the raw bytes that result from explicitly
979 encoding text is to copy them to a file or process---for example, to
980 write them with @code{write-region} (@pxref{Writing to Files}), and
981 suppress encoding for that @code{write-region} call by binding
982 @code{coding-system-for-write} to @code{no-conversion}.
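
That technique can be sketched as follows.  The function name
@code{my-write-encoded} is hypothetical, and the sketch assumes that
@code{write-region} accepts a string as its first argument.

@example
(defun my-write-encoded (string coding-system filename)
  "Encode STRING with CODING-SYSTEM; write the raw bytes to FILENAME."
  (let ((bytes (encode-coding-string string coding-system))
        ;; @r{The text is already encoded, so suppress further encoding.}
        (coding-system-for-write 'no-conversion))
    (write-region bytes nil filename)))
@end example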
983
984 Raw bytes typically contain stray individual bytes with values in the
985 range 128 through 255, that are legitimate only as part of multibyte
986 sequences. Even if the buffer is multibyte, Emacs treats each such
987 individual byte as a character and uses the byte value as its character
988 code. In this way, character codes 128 through 255 can be found in a
989 multibyte buffer, even though they are not legitimate multibyte
990 character codes.
991
992 Raw bytes sometimes contain overlong byte-sequences that look like a
993 proper multibyte character plus extra superfluous trailing codes. For
994 most purposes, Emacs treats such a sequence in a buffer or string as a
995 single character, and if you look at its character code, you get the
996 value that corresponds to the multibyte character
997 sequence---disregarding the extra trailing codes. This is not quite
998 clean, but raw bytes are used only in limited ways, so as a practical
999 matter it is not worth the trouble to treat this case differently.
1000
1001 When a multibyte buffer contains illegitimate byte sequences,
1002 sometimes insertion or deletion can cause them to coalesce into a
1003 legitimate multibyte character. For example, suppose the buffer
1004 contains the sequence 129 68 192, 68 being the character @samp{D}. If
1005 you delete the @samp{D}, the bytes 129 and 192 become adjacent, and thus
1006 become one multibyte character (Latin-1 A with grave accent). Point
1007 moves to one side or the other of the character, since it cannot be
1008 within a character. Don't be alarmed by this.
1009
1010 Some really peculiar situations prevent proper coalescence. For
1011 example, if you narrow the buffer so that the accessible portion begins
1012 just before the @samp{D}, then delete the @samp{D}, the two surrounding
1013 bytes cannot coalesce because one of them is outside the accessible
1014 portion of the buffer. In this case, the deletion cannot be done, so
1015 @code{delete-region} signals an error.
1016
1017 Here are the functions to perform explicit encoding or decoding. The
1018 encoding functions produce ``raw bytes''; the decoding functions are
1019 meant to operate on ``raw bytes''. All of these functions discard text
1020 properties.
1021
1022 @defun encode-coding-region start end coding-system
1023 @tindex encode-coding-region
1024 This function encodes the text from @var{start} to @var{end} according
1025 to coding system @var{coding-system}. The encoded text replaces the
1026 original text in the buffer. The result of encoding is ``raw bytes,''
1027 but the buffer remains multibyte if it was multibyte before.
1028 @end defun
1029
1030 @defun encode-coding-string string coding-system
1031 @tindex encode-coding-string
1032 This function encodes the text in @var{string} according to coding
1033 system @var{coding-system}. It returns a new string containing the
1034 encoded text. The result of encoding is a unibyte string of ``raw bytes.''
1035 @end defun
1036
1037 @defun decode-coding-region start end coding-system
1038 @tindex decode-coding-region
1039 This function decodes the text from @var{start} to @var{end} according
1040 to coding system @var{coding-system}. The decoded text replaces the
1041 original text in the buffer. To make explicit decoding useful, the text
1042 before decoding ought to be ``raw bytes.''
1043 @end defun
1044
1045 @defun decode-coding-string string coding-system
1046 @tindex decode-coding-string
1047 This function decodes the text in @var{string} according to coding
1048 system @var{coding-system}. It returns a new string containing the
1049 decoded text. To make explicit decoding useful, the contents of
1050 @var{string} ought to be ``raw bytes.''
1051 @end defun
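
The string functions can be combined into a simple round trip:
encoding yields raw bytes, and decoding those bytes with the same
coding system should recover the text.  The function name
@code{my-round-trip} is hypothetical.

@example
(defun my-round-trip (string coding-system)
  "Encode STRING with CODING-SYSTEM, then decode the result again."
  (decode-coding-string
   (encode-coding-string string coding-system)
   coding-system))

(my-round-trip "plain text" 'latin-1-unix)
     @result{} "plain text"
@end example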
1052
1053 @node Terminal I/O Encoding
1054 @subsection Terminal I/O Encoding
1055
1056 Emacs can decode keyboard input using a coding system, and encode
1057 terminal output. This is useful for terminals that transmit or display
1058 text using a particular encoding such as Latin-1. Emacs does not set
1059 @code{last-coding-system-used} for encoding or decoding for the
1060 terminal.
1061
1062 @defun keyboard-coding-system
1063 @tindex keyboard-coding-system
1064 This function returns the coding system that is in use for decoding
1065 keyboard input---or @code{nil} if no coding system is to be used.
1066 @end defun
1067
1068 @defun set-keyboard-coding-system coding-system
1069 @tindex set-keyboard-coding-system
1070 This function specifies @var{coding-system} as the coding system to
1071 use for decoding keyboard input. If @var{coding-system} is @code{nil},
1072 that means do not decode keyboard input.
1073 @end defun
1074
1075 @defun terminal-coding-system
1076 @tindex terminal-coding-system
1077 This function returns the coding system that is in use for encoding
1078 terminal output---or @code{nil} for no encoding.
1079 @end defun
1080
1081 @defun set-terminal-coding-system coding-system
1082 @tindex set-terminal-coding-system
1083 This function specifies @var{coding-system} as the coding system to use
1084 for encoding terminal output. If @var{coding-system} is @code{nil},
1085 that means do not encode terminal output.
1086 @end defun
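
For a terminal that both sends and displays Latin-1, the two setting
functions might be used together.  This is a sketch; the function name
@code{my-use-latin-1-terminal} is hypothetical.

@example
(defun my-use-latin-1-terminal ()
  "Decode keyboard input and encode terminal output as Latin-1."
  (set-keyboard-coding-system 'latin-1)
  (set-terminal-coding-system 'latin-1))
@end example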
1087
1088 @node MS-DOS File Types
1089 @subsection MS-DOS File Types
1090 @cindex DOS file types
1091 @cindex MS-DOS file types
1092 @cindex Windows file types
1093 @cindex file types on MS-DOS and Windows
1094 @cindex text files and binary files
1095 @cindex binary files and text files
1096
1097 On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
1098 end-of-line conversion for a file by looking at the file's name. This
1099 feature classifies files as @dfn{text files} and @dfn{binary files}. By
1100 ``binary file'' we mean a file of literal byte values that are not
1101 necessarily meant to be characters; Emacs does no end-of-line conversion
1102 and no character code conversion for them. On the other hand, the bytes
1103 in a text file are intended to represent characters; when you create a
1104 new file whose name implies that it is a text file, Emacs uses DOS
1105 end-of-line conversion.
1106
1107 @defvar buffer-file-type
1108 This variable, automatically buffer-local in each buffer, records the
1109 file type of the buffer's visited file. When a buffer does not specify
1110 a coding system with @code{buffer-file-coding-system}, this variable is
1111 used to determine which coding system to use when writing the contents
1112 of the buffer. It should be @code{nil} for text, @code{t} for binary.
1113 If it is @code{t}, the coding system is @code{no-conversion}.
1114 Otherwise, @code{undecided-dos} is used.
1115
1116 Normally this variable is set by visiting a file; it is set to
1117 @code{nil} if the file was visited without any actual conversion.
1118 @end defvar
1119
1120 @defopt file-name-buffer-file-type-alist
1121 This variable holds an alist for recognizing text and binary files.
1122 Each element has the form @code{(@var{regexp} . @var{type})}, where
1123 @var{regexp} is matched against the file name, and @var{type} may be
1124 @code{nil} for text, @code{t} for binary, or a function to call to
1125 compute which. If it is a function, then it is called with a single
1126 argument (the file name) and should return @code{t} or @code{nil}.
1127
1128 When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
1129 which coding system to use when reading a file. For a text file,
1130 @code{undecided-dos} is used. For a binary file, @code{no-conversion}
1131 is used.
1132
1133 If no element in this alist matches a given file name, then
1134 @code{default-buffer-file-type} says how to treat the file.
1135 @end defopt
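
An element could be added like this.  The pattern is hypothetical, and
the test with @code{boundp} is an assumption: it guards against
running the code where this variable is not defined.

@example
;; @r{Hypothetical rule: treat files ending in .dat as binary.}
(if (boundp 'file-name-buffer-file-type-alist)
    (setq file-name-buffer-file-type-alist
          (cons '("\\.dat\\'" . t)
                file-name-buffer-file-type-alist)))
@end example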
1136
1137 @defopt default-buffer-file-type
1138 This variable says how to handle files for which
1139 @code{file-name-buffer-file-type-alist} says nothing about the type.
1140
1141 If this variable is non-@code{nil}, then these files are treated as
1142 binary: the coding system @code{no-conversion} is used. Otherwise,
1143 nothing special is done for them---the coding system is deduced solely
1144 from the file contents, in the usual Emacs fashion.
1145 @end defopt
1146
1147 @node Input Methods
1148 @section Input Methods
1149 @cindex input methods
1150
1151 @dfn{Input methods} provide convenient ways of entering non-@sc{ascii}
1152 characters from the keyboard. Unlike coding systems, which translate
1153 non-@sc{ascii} characters to and from encodings meant to be read by
1154 programs, input methods provide human-friendly commands. (@xref{Input
1155 Methods,,, emacs, The GNU Emacs Manual}, for information on how users
1156 use input methods to enter text.) How to define input methods is not
1157 yet documented in this manual, but here we describe how to use them.
1158
1159 Each input method has a name, which is currently a string;
1160 in the future, symbols may also be usable as input method names.
1161
1162 @tindex current-input-method
1163 @defvar current-input-method
1164 This variable holds the name of the input method now active in the
1165 current buffer. (It automatically becomes local in each buffer when set
1166 in any fashion.) It is @code{nil} if no input method is active in the
1167 buffer now.
1168 @end defvar
1169
1170 @tindex default-input-method
1171 @defvar default-input-method
1172 This variable holds the default input method for commands that choose an
1173 input method. Unlike @code{current-input-method}, this variable is
1174 normally global.
1175 @end defvar
1176
1177 @tindex set-input-method
1178 @defun set-input-method input-method
1179 This function activates input method @var{input-method} for the current
1180 buffer. It also sets @code{default-input-method} to @var{input-method}.
1181 If @var{input-method} is @code{nil}, this function deactivates any input
1182 method for the current buffer.
1183 @end defun
1184
1185 @tindex read-input-method-name
1186 @defun read-input-method-name prompt &optional default inhibit-null
1187 This function reads an input method name with the minibuffer, prompting
1188 with @var{prompt}. If @var{default} is non-@code{nil}, that is returned
1189 by default, if the user enters empty input. However, if
1190 @var{inhibit-null} is non-@code{nil}, empty input signals an error.
1191
1192 The returned value is a string.
1193 @end defun
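
The two functions above can be combined into a simple command.  This
is a sketch; the command name @code{my-choose-input-method} is
hypothetical.

@example
(defun my-choose-input-method ()
  "Read an input method name and activate it in the current buffer.
Empty input selects the value of `default-input-method'."
  (interactive)
  (set-input-method
   (read-input-method-name "Input method: "
                           default-input-method)))
@end example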
1194
1195 @tindex input-method-alist
1196 @defvar input-method-alist
1197 This variable defines all the supported input methods.
1198 Each element defines one input method, and should have the form:
1199
1200 @example
1201 (@var{input-method} @var{language-env} @var{activate-func}
1202 @var{title} @var{description} @var{args}...)
1203 @end example
1204
1205 Here @var{input-method} is the input method name, a string;
1206 @var{language-env} is another string, the name of the language
1207 environment this input method is recommended for. (That serves only for
1208 documentation purposes.)
1209
1210 @var{title} is a string to display in the mode line while this method is
1211 active. @var{description} is a string describing this method and what
1212 it is good for.
1213
1214 @var{activate-func} is a function to call to activate this method. The
1215 @var{args}, if any, are passed as arguments to @var{activate-func}. All
1216 told, the arguments to @var{activate-func} are @var{input-method} and
1217 the @var{args}.
1218 @end defvar
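
Given the element layout above, Lisp code can extract individual
fields with ordinary list functions.  Both function names in this
sketch are hypothetical.

@example
(defun my-input-method-names ()
  "Return a list of the names of all supported input methods."
  (mapcar 'car input-method-alist))

(defun my-input-method-title (name)
  "Return the mode line title of input method NAME, or nil."
  ;; @r{The title is the fourth item of the element.}
  (nth 3 (assoc name input-method-alist)))
@end example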
1219
1220 The fundamental interface to input methods is through the
1221 variable @code{input-method-function}. @xref{Reading One Event}.