lispref/nonascii.texi

   1 @c -*-texinfo-*-
   2 @c This is part of the GNU Emacs Lisp Reference Manual.
   3 @c Copyright (C) 1998 Free Software Foundation, Inc.
   4 @c See the file elisp.texi for copying conditions.
   5 @setfilename ../info/characters
   6 @node Non-ASCII Characters, Searching and Matching, Text, Top
   7 @chapter Non-ASCII Characters
   8 @cindex multibyte characters
   9 @cindex non-ASCII characters
  10
  11   This chapter covers the special issues relating to non-@sc{ASCII}
  12 characters and how they are stored in strings and buffers.
  13
  14 @menu
  15 * Text Representations::
  16 * Converting Representations::
  17 * Selecting a Representation::
  18 * Character Codes::
  19 * Character Sets::
  20 * Scanning Charsets::
  21 * Chars and Bytes::
  22 * Coding Systems::
  23 * Lisp and Coding Systems::
  24 * Default Coding Systems::
  25 * Specifying Coding Systems::
  26 * Explicit Encoding::
  27 * MS-DOS File Types::
  28 * MS-DOS Subprocesses::
  29 @end menu
  30
  31 @node Text Representations
  32 @section Text Representations
  33 @cindex text representations
  34
  35   Emacs has two @dfn{text representations}---two ways to represent text
  36 in a string or buffer.  These are called @dfn{unibyte} and
  37 @dfn{multibyte}.  Each string, and each buffer, uses one of these two
  38 representations.  For most purposes, you can ignore the issue of
  39 representations, because Emacs converts text between them as
  40 appropriate.  Occasionally in Lisp programming you will need to pay
  41 attention to the difference.
  42
  43 @cindex unibyte text
  44   In unibyte representation, each character occupies one byte and
  45 therefore the possible character codes range from 0 to 255.  Codes 0
  46 through 127 are @sc{ASCII} characters; the codes from 128 through 255
  47 are used for one non-@sc{ASCII} character set (you can choose which
  48 character set by setting the variable @code{nonascii-insert-offset}).
  49
  50 @cindex leading code
  51 @cindex multibyte text
  52   In multibyte representation, a character may occupy more than one
  53 byte, and as a result, the full range of Emacs character codes can be
  54 stored.  The first byte of a multibyte character is always in the range
  55 128 through 159 (octal 0200 through 0237).  These values are called
  56 @dfn{leading codes}.  The first byte determines which character set the
  57 character belongs to (@pxref{Character Sets}); in particular, it
  58 determines how many bytes long the sequence is.  The second and
  59 subsequent bytes of a multibyte character are always in the range 160
  60 through 255 (octal 0240 through 0377).
  61
  62   In a buffer, the buffer-local value of the variable
  63 @code{enable-multibyte-characters} specifies the representation used.
  64 The representation for a string is determined based on the string
  65 contents when the string is constructed.
  66
  67 @tindex enable-multibyte-characters
  68 @defvar enable-multibyte-characters
  69 This variable specifies the current buffer's text representation.
  70 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
  71 it contains unibyte text.
  72
  73 You cannot set this variable directly; instead, use the function
  74 @code{set-buffer-multibyte} to change a buffer's representation.
  75 @end defvar
  76
  77 @tindex default-enable-multibyte-characters
  78 @defvar default-enable-multibyte-characters
  79 This variable's value is entirely equivalent to @code{(default-value
  80 'enable-multibyte-characters)}, and setting this variable changes that
  81 default value.  Although setting the local binding of
  82 @code{enable-multibyte-characters} in a specific buffer is dangerous,
  83 changing the default value is safe, and it is a reasonable thing to do.
  84
  85 The @samp{--unibyte} command line option does its job by setting the
  86 default value to @code{nil} early in startup.
  87 @end defvar
  88
  89 @tindex multibyte-string-p
  90 @defun multibyte-string-p string
  91 Return @code{t} if @var{string} contains multibyte characters.
  92 @end defun
  93
  94 @node Converting Representations
  95 @section Converting Text Representations
  96
  97   Emacs can convert unibyte text to multibyte; it can also convert
  98 multibyte text to unibyte, though this conversion loses information.  In
  99 general these conversions happen when inserting text into a buffer, or
 100 when putting text from several strings together in one string.  You can
 101 also explicitly convert a string's contents to either representation.
 102
 103   Emacs chooses the representation for a string based on the text that
 104 it is constructed from.  The general rule is to convert unibyte text to
 105 multibyte text when combining it with other multibyte text, because the
 106 multibyte representation is more general and can hold whatever
 107 characters the unibyte text has.
 108
 109   When inserting text into a buffer, Emacs converts the text to the
 110 buffer's representation, as specified by
 111 @code{enable-multibyte-characters} in that buffer.  In particular, when
 112 you insert multibyte text into a unibyte buffer, Emacs converts the text
 113 to unibyte, even though this conversion cannot in general preserve all
 114 the characters that might be in the multibyte text.  The other natural
 115 alternative, to convert the buffer contents to multibyte, is not
 116 acceptable because the buffer's representation is a choice made by the
 117 user that cannot be overridden automatically.
 118
 119   Converting unibyte text to multibyte text leaves @sc{ASCII} characters
 120 unchanged, and likewise 128 through 159.  It converts the non-@sc{ASCII}
 121 codes 160 through 255 by adding the value @code{nonascii-insert-offset}
 122 to each character code.  By setting this variable, you specify which
 123 character set the unibyte characters correspond to.  For example, if
 124 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
 125 'latin-iso8859-1 0) 128)}, then the unibyte non-@sc{ASCII} characters
 126 correspond to Latin 1.  If it is 2688, which is @code{(- (make-char
 127 'greek-iso8859-7 0) 128)}, then they correspond to Greek letters.
 128
 129   Converting multibyte text to unibyte is simpler: it performs
 130 logical-and of each character code with 255.  If
 131 @code{nonascii-insert-offset} has a reasonable value, corresponding to
 132 the beginning of some character set, this conversion is the inverse of
 133 the other: converting unibyte text to multibyte and back to unibyte
 134 reproduces the original unibyte text.
 135
 136 @tindex nonascii-insert-offset
 137 @defvar nonascii-insert-offset
 138 This variable specifies the amount to add to a non-@sc{ASCII} character
 139 when converting unibyte text to multibyte.  It also applies when
 140 @code{insert-char} or @code{self-insert-command} inserts a character in
 141 the unibyte non-@sc{ASCII} range, 128 through 255.
 142
 143 The right value to use to select character set @var{cs} is @code{(-
 144 (make-char @var{cs} 0) 128)}.  If the value of
 145 @code{nonascii-insert-offset} is zero, then conversion actually uses the
 146 value for the Latin 1 character set, rather than zero.
 147 @end defvar
 148
 149 @tindex nonascii-translate-table
 150 @defvar nonascii-translate-table
 151 This variable provides a more general alternative to
 152 @code{nonascii-insert-offset}.  You can use it to specify independently
 153 how to translate each code in the range of 128 through 255 into a
 154 multibyte character.  The value should be a vector, or @code{nil}.
 155 If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
 156 @end defvar
 157
 158 @tindex string-make-unibyte
 159 @defun string-make-unibyte string
 160 This function converts the text of @var{string} to unibyte
 161 representation, if it isn't already, and returns the result.  If
 162 @var{string} is a unibyte string, it is returned unchanged.
 163 @end defun
 164
 165 @tindex string-make-multibyte
 166 @defun string-make-multibyte string
 167 This function converts the text of @var{string} to multibyte
 168 representation, if it isn't already, and returns the result.  If
 169 @var{string} is a multibyte string, it is returned unchanged.
 170 @end defun
 171
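       As a sketch of how the two conversion functions interact with
     @code{nonascii-insert-offset}, the form below binds the offset to the
     Greek value given above for the duration of one conversion.  The
     variable @code{unibyte-string} is hypothetical and stands for any
     unibyte string; the sketch also assumes that
     @code{nonascii-translate-table} is @code{nil}, since otherwise it
     would override the offset.

     @example
     ;; @r{Treat codes 160 through 255 as Greek while converting.}
     (let ((nonascii-insert-offset
            (- (make-char 'greek-iso8859-7 0) 128)))
       (string-make-multibyte unibyte-string))
     @end example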
 172 @node Selecting a Representation
 173 @section Selecting a Representation
 174
 175   Sometimes it is useful to examine an existing buffer or string as
 176 multibyte when it was unibyte, or vice versa.
 177
 178 @tindex set-buffer-multibyte
 179 @defun set-buffer-multibyte multibyte
 180 Set the representation type of the current buffer.  If @var{multibyte}
 181 is non-@code{nil}, the buffer becomes multibyte.  If @var{multibyte}
 182 is @code{nil}, the buffer becomes unibyte.
 183
 184 This function leaves the buffer contents unchanged when viewed as a
 185 sequence of bytes.  As a consequence, it can change the contents viewed
 186 as characters; a sequence of two bytes which is treated as one character
 187 in multibyte representation will count as two characters in unibyte
 188 representation.
 189
 190 This function sets @code{enable-multibyte-characters} to record which
 191 representation is in use.  It also adjusts various data in the buffer
 192 (including overlays, text properties and markers) so that they cover the
 193 same text as they did before.
 194 @end defun
 195
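       For instance, a minimal sketch that makes a temporary buffer
     unibyte and then inspects the variable that records the choice:

     @example
     (with-temp-buffer
       (set-buffer-multibyte nil)
       enable-multibyte-characters)
          @result{} nil
     @end example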
 196 @tindex string-as-unibyte
 197 @defun string-as-unibyte string
 198 This function returns a string with the same bytes as @var{string} but
 199 treating each byte as a character.  This means that the value may have
 200 more characters than @var{string} has.
 201
 202 If @var{string} is unibyte already, then the value is @var{string}
 203 itself.
 204 @end defun
 205
 206 @tindex string-as-multibyte
 207 @defun string-as-multibyte string
 208 This function returns a string with the same bytes as @var{string} but
 209 treating each multibyte sequence as one character.  This means that the
 210 value may have fewer characters than @var{string} has.
 211
 212 If @var{string} is multibyte already, then the value is @var{string}
 213 itself.
 214 @end defun
 215
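       The following sketch illustrates the change in character count.  It
     assumes the multibyte character code 2248 used in examples elsewhere
     in this chapter, and compares the length of the reinterpreted string
     with the byte count reported by @code{char-bytes} (@pxref{Chars and
     Bytes}); under the description above the two should agree.

     @example
     (let ((s (char-to-string 2248)))   ; @r{a one-character multibyte string}
       (= (length (string-as-unibyte s))
          (char-bytes 2248)))
     @end example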
 216 @node Character Codes
 217 @section Character Codes
 218 @cindex character codes
 219
 220   The unibyte and multibyte text representations use different character
 221 codes.  The valid character codes for unibyte representation range from
 222 0 to 255---the values that can fit in one byte.  The valid character
 223 codes for multibyte representation range from 0 to 524287, but not all
 224 values in that range are valid.  In particular, the values 128 through
 225 255 are not legitimate in multibyte text (though they can occur in ``raw
 226 bytes''; @pxref{Explicit Encoding}).  Only the @sc{ASCII} codes 0
 227 through 127 are fully legitimate in both representations.
 228
 229 @defun char-valid-p charcode
 230 This returns @code{t} if @var{charcode} is valid for either one of the two
 231 text representations.
 232
 233 @example
 234 (char-valid-p 65)
 235      @result{} t
 236 (char-valid-p 256)
 237      @result{} nil
 238 (char-valid-p 2248)
 239      @result{} t
 240 @end example
 241 @end defun
 242
 243 @node Character Sets
 244 @section Character Sets
 245 @cindex character sets
 246
 247   Emacs classifies characters into various @dfn{character sets}, each of
 248 which has a name which is a symbol.  Each character belongs to one and
 249 only one character set.
 250
 251   In general, there is one character set for each distinct script.  For
 252 example, @code{latin-iso8859-1} is one character set,
 253 @code{greek-iso8859-7} is another, and @code{ascii} is another.  An
 254 Emacs character set can hold at most 9025 characters; therefore, in some
 255 cases, characters that would logically be grouped together are split
 256 into several character sets.  For example, one set of Chinese characters
 257 is divided into seven Emacs character sets, @code{chinese-cns11643-1}
 258 through @code{chinese-cns11643-7}.
 259
 260 @tindex charsetp
 261 @defun charsetp object
 262 Return @code{t} if @var{object} is a character set name symbol,
 263 @code{nil} otherwise.
 264 @end defun
 265
 266 @tindex charset-list
 267 @defun charset-list
 268 This function returns a list of all defined character set names.
 269 @end defun
 270
 271 @tindex char-charset
 272 @defun char-charset character
 273 This function returns the name of the character
 274 set that @var{character} belongs to.
 275 @end defun
 276
 277 @node Scanning Charsets
 278 @section Scanning for Character Sets
 279
 280   Sometimes it is useful to find out which character sets appear in a
 281 part of a buffer or a string.  One use for this is in determining which
 282 coding systems (@pxref{Coding Systems}) are capable of representing all
 283 of the text in question.
 284
 285 @tindex find-charset-region
 286 @defun find-charset-region beg end &optional unification
 287 This function returns a list of the character sets
 288 that appear in the current buffer between positions @var{beg}
 289 and @var{end}.
 290 @end defun
 291
 292 @tindex find-charset-string
 293 @defun find-charset-string string &optional unification
 294 This function returns a list of the character sets
 295 that appear in the string @var{string}.
 296 @end defun
 297
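       As an illustration, here is a hypothetical helper (the name
     @code{my-string-ascii-only-p} is not part of Emacs) that tests whether
     a string uses any character set other than @code{ascii}.  It is only
     a sketch and ignores the optional @var{unification} argument.

     @example
     (defun my-string-ascii-only-p (string)
       "Return non-nil if STRING contains no charset other than `ascii'."
       (null (delq 'ascii (find-charset-string string))))
     @end example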
 298 @node Chars and Bytes
 299 @section Characters and Bytes
 300 @cindex bytes and characters
 301
 302   In multibyte representation, each character occupies one or more
 303 bytes.  The functions in this section convert between characters and the
 304 byte values used to represent them.  For most purposes, there is no need
 305 to be concerned with the number of bytes used to represent a character
 306 because Emacs translates automatically when necessary.
 307
 308 @tindex char-bytes
 309 @defun char-bytes character
 310 This function returns the number of bytes used to represent the
 311 character @var{character}.  In most cases, this is the same as
 312 @code{(length (split-char @var{character}))}; the only exception is for
 313 @sc{ASCII} characters and the codes used in unibyte text, which use just one
 314 byte.
 315
 316 @example
 317 (char-bytes 2248)
 318      @result{} 2
 319 (char-bytes 65)
 320      @result{} 1
 321 @end example
 322
 323 This function's values are correct for both multibyte and unibyte
 324 representations, because the non-@sc{ASCII} character codes used in
 325 those two representations do not overlap.
 326
 327 @example
 328 (char-bytes 192)
 329      @result{} 1
 330 @end example
 331 @end defun
 332
 333 @tindex split-char
 334 @defun split-char character
 335 Return a list containing the name of the character set of
 336 @var{character}, followed by one or two byte-values which identify
 337 @var{character} within that character set.
 338
 339 @example
 340 (split-char 2248)
 341      @result{} (latin-iso8859-1 72)
 342 (split-char 65)
 343      @result{} (ascii 65)
 344 @end example
 345
 346 Unibyte non-@sc{ASCII} characters are considered as part of
 347 the @code{ascii} character set:
 348
 349 @example
 350 (split-char 192)
 351      @result{} (ascii 192)
 352 @end example
 353 @end defun
 354
 355 @tindex make-char
 356 @defun make-char charset &rest byte-values
 357 This function returns the character in character set @var{charset}
 358 identified by @var{byte-values}.  This is roughly the opposite of
 359 @code{split-char}.
 360
 361 @example
 362 (make-char 'latin-iso8859-1 72)
 363      @result{} 2248
 364 @end example
 365 @end defun
 366
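       Using the values from the examples above, a round trip through
     @code{split-char} and @code{make-char} can be sketched like this:

     @example
     (apply 'make-char (split-char 2248))
          @result{} 2248
     @end example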
 367 @node Coding Systems
 368 @section Coding Systems
 369
 370 @cindex coding system
 371   When Emacs reads or writes a file, and when Emacs sends text to a
 372 subprocess or receives text from a subprocess, it normally performs
 373 character code conversion and end-of-line conversion as specified
 374 by a particular @dfn{coding system}.
 375
 376 @cindex character code conversion
 377   @dfn{Character code conversion} involves conversion between the encoding
 378 used inside Emacs and some other encoding.  Emacs supports many
 379 different encodings, in that it can convert to and from them.  For
 380 example, it can convert text to or from encodings such as Latin 1, Latin
 381 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022.  In some
 382 cases, Emacs supports several alternative encodings for the same
 383 characters; for example, there are three coding systems for the Cyrillic
 384 (Russian) alphabet: ISO, Alternativnyj, and KOI8.
 385
 386   Most coding systems specify a particular character code for
 387 conversion, but some of them leave this unspecified---to be chosen
 388 heuristically based on the data.
 389
 390 @cindex end of line conversion
 391   @dfn{End of line conversion} handles three different conventions used
 392 on various systems for representing end of line in files.  The Unix
 393 convention is to use the linefeed character (also called newline).  The
 394 DOS convention is to use the two character sequence, carriage-return
 395 linefeed, at the end of a line.  The Mac convention is to use just
 396 carriage-return.
 397
 398 @cindex base coding system
 399 @cindex variant coding system
 400   @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
 401 conversion unspecified, to be chosen based on the data.  @dfn{Variant
 402 coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
 403 @code{latin-1-mac} specify the end-of-line conversion explicitly as
 404 well.  Each base coding system has three corresponding variants whose
 405 names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
 406
 407 @node Lisp and Coding Systems
 408 @subsection Coding Systems in Lisp
 409
 410   Here are Lisp facilities for working with coding systems:
 411
 412 @tindex coding-system-list
 413 @defun coding-system-list &optional base-only
 414 This function returns a list of all coding system names (symbols).  If
 415 @var{base-only} is non-@code{nil}, the value includes only the
 416 base coding systems.  Otherwise, it includes variant coding systems as well.
 417 @end defun
 418
 419 @tindex coding-system-p
 420 @defun coding-system-p object
 421 This function returns @code{t} if @var{object} is a coding system
 422 name.
 423 @end defun
 424
 425 @tindex check-coding-system
 426 @defun check-coding-system coding-system
 427 This function checks the validity of @var{coding-system}.
 428 If that is valid, it returns @var{coding-system}.
 429 Otherwise it signals an error with condition @code{coding-system-error}.
 430 @end defun
 431
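       Because @code{check-coding-system} signals an error, a caller that
     prefers a @code{nil} result can trap the condition.  In this sketch
     the symbol @code{no-such-coding-system} is hypothetical and is assumed
     not to name a coding system.

     @example
     (condition-case nil
         (check-coding-system 'no-such-coding-system)
       (coding-system-error nil))
          @result{} nil
     @end example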
 432 @tindex find-safe-coding-system
 433 @defun find-safe-coding-system from to
 434 This function returns a list of coding systems suitable for encoding the
 435 text between @var{from} and @var{to}.  Every coding system in the list
 436 can safely encode all the multibyte characters in that text.
 437
 438 If the text contains no multibyte characters, the value is a list whose
 439 only element is @code{undecided}.
 440 @end defun
 441
 442 @tindex detect-coding-region
 443 @defun detect-coding-region start end highest
 444 This function chooses a plausible coding system for decoding the text
 445 from @var{start} to @var{end}.  This text should be ``raw bytes''
 446 (@pxref{Explicit Encoding}).
 447
 448 Normally this function returns a list of coding systems that could
 449 handle decoding the text that was scanned.  They are listed in order of
 450 decreasing priority, based on the priority specified by the user with
 451 @code{prefer-coding-system}.  But if @var{highest} is non-@code{nil},
 452 then the return value is just one coding system, the one that is highest
 453 in priority.
 454 @end defun
 455
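       A typical use, sketched here under the assumption that the current
     buffer holds raw bytes, is to pick the highest-priority candidate and
     then decode with it (@pxref{Explicit Encoding}):

     @example
     (let ((coding (detect-coding-region (point-min) (point-max) t)))
       (decode-coding-region (point-min) (point-max) coding))
     @end example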
 456 @tindex detect-coding-string
 457 @defun detect-coding-string string highest
 458 This function is like @code{detect-coding-region} except that it
 459 operates on the contents of @var{string} instead of bytes in the buffer.
 460 @end defun
 461
 462 @defun find-operation-coding-system operation &rest arguments
 463 This function returns the coding system to use (by default) for
 464 performing @var{operation} with @var{arguments}.  The value has this
 465 form:
 466
 467 @example
 468 (@var{decoding-system} @var{encoding-system})
 469 @end example
 470
 471 The first element, @var{decoding-system}, is the coding system to use
 472 for decoding (in case @var{operation} does decoding), and
 473 @var{encoding-system} is the coding system for encoding (in case
 474 @var{operation} does encoding).
 475
 476 The argument @var{operation} should be an Emacs I/O primitive:
 477 @code{insert-file-contents}, @code{write-region}, @code{call-process},
 478 @code{call-process-region}, @code{start-process}, or
 479 @code{open-network-stream}.
 480
 481 The remaining arguments should be the same arguments that might be given
 482 to that I/O primitive.  Depending on which primitive, one of those
 483 arguments is selected as the @dfn{target}.  For example, if
 484 @var{operation} does file I/O, whichever argument specifies the file
 485 name is the target.  For subprocess primitives, the process name is the
 486 target.  For @code{open-network-stream}, the target is the service name
 487 or port number.
 488
 489 This function looks up the target in @code{file-coding-system-alist},
 490 @code{process-coding-system-alist}, or
 491 @code{network-coding-system-alist}, depending on @var{operation}.
 492 @xref{Default Coding Systems}.
 493 @end defun
 494
 495   Here are two functions you can use to let the user specify a coding
 496 system, with completion.  @xref{Completion}.
 497
 498 @tindex read-coding-system
 499 @defun read-coding-system prompt default
 500 This function reads a coding system using the minibuffer, prompting with
 501 string @var{prompt}, and returns the coding system name as a symbol.  If
 502 the user enters null input, @var{default} specifies which coding system
 503 to return.  It should be a symbol or a string.
 504 @end defun
 505
 506 @tindex read-non-nil-coding-system
 507 @defun read-non-nil-coding-system prompt
 508 This function reads a coding system using the minibuffer, prompting with
 509 string @var{prompt}, and returns the coding system name as a symbol.  If
 510 the user tries to enter null input, it asks the user to try again.
 511 @xref{Coding Systems}.
 512 @end defun
 513
 514 @node Default Coding Systems
 515 @section Default Coding Systems
 516
 517   These variables specify which coding system to use by default for
 518 certain files or when running certain subprograms.  The idea of these
 519 variables is that you set them once and for all to the defaults you
 520 want, and then do not change them again.  To specify a particular coding
 521 system for a particular operation in a Lisp program, don't change these
 522 variables; instead, override them using @code{coding-system-for-read}
 523 and @code{coding-system-for-write} (@pxref{Specifying Coding Systems}).
 524
 525 @tindex file-coding-system-alist
 526 @defvar file-coding-system-alist
 527 This variable is an alist that specifies the coding systems to use for
 528 reading and writing particular files.  Each element has the form
 529 @code{(@var{pattern} . @var{val})}, where @var{pattern} is a regular
 530 expression that matches certain file names.  The element applies to file
 531 names that match @var{pattern}.
 532
 533 The @sc{cdr} of the element, @var{val}, should be either a coding
 534 system, a cons cell containing two coding systems, or a function symbol.
 535 If @var{val} is a coding system, that coding system is used for both
 536 reading the file and writing it.  If @var{val} is a cons cell containing
 537 two coding systems, its @sc{car} specifies the coding system for
 538 decoding, and its @sc{cdr} specifies the coding system for encoding.
 539
 540 If @var{val} is a function symbol, the function must return a coding
 541 system or a cons cell containing two coding systems.  This value is used
 542 as described above.
 543 @end defvar
 544
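       For example, an init file might add an element like the following.
     The file-name pattern is purely illustrative; @code{latin-1-unix} is
     one of the variant coding systems named earlier in this chapter.

     @example
     ;; @r{Use Latin 1 with Unix line ends for file names ending in .l1.}
     (setq file-coding-system-alist
           (cons '("\\.l1\\'" . latin-1-unix)
                 file-coding-system-alist))
     @end example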
 545 @tindex process-coding-system-alist
 546 @defvar process-coding-system-alist
 547 This variable is an alist specifying which coding systems to use for a
 548 subprocess, depending on which program is running in the subprocess.  It
 549 works like @code{file-coding-system-alist}, except that @var{pattern} is
 550 matched against the program name used to start the subprocess.  The coding
 551 system or systems specified in this alist are used to initialize the
 552 coding systems used for I/O to the subprocess, but you can specify
 553 other coding systems later using @code{set-process-coding-system}.
 554 @end defvar
 555
 556 @tindex network-coding-system-alist
 557 @defvar network-coding-system-alist
 558 This variable is an alist that specifies the coding system to use for
 559 network streams.  It works much like @code{file-coding-system-alist},
 560 with the difference that the @var{pattern} in an element may be either a
 561 port number or a regular expression.  If it is a regular expression, it
 562 is matched against the network service name used to open the network
 563 stream.
 564 @end defvar
 565
 566 @tindex default-process-coding-system
 567 @defvar default-process-coding-system
 568 This variable specifies the coding systems to use for subprocess (and
 569 network stream) input and output, when nothing else specifies what to
 570 do.
 571
 572 The value should be a cons cell of the form @code{(@var{input-coding}
 573 . @var{output-coding})}.  Here @var{input-coding} applies to input from
 574 the subprocess, and @var{output-coding} applies to output to it.
 575 @end defvar
 576
 577 @node Specifying Coding Systems
 578 @section Specifying a Coding System for One Operation
 579
 580   You can specify the coding system for a specific operation by binding
 581 the variables @code{coding-system-for-read} and/or
 582 @code{coding-system-for-write}.
 583
 584 @tindex coding-system-for-read
 585 @defvar coding-system-for-read
 586 If this variable is non-@code{nil}, it specifies the coding system to
 587 use for reading a file, or for input from a synchronous subprocess.
 588
 589 It also applies to any asynchronous subprocess or network stream, but in
 590 a different way: the value of @code{coding-system-for-read} when you
 591 start the subprocess or open the network stream specifies the input
 592 decoding method for that subprocess or network stream.  It remains in
 593 use for that subprocess or network stream unless and until overridden.
 594
 595 The right way to use this variable is to bind it with @code{let} for a
 596 specific I/O operation.  Its global value is normally @code{nil}, and
 597 you should not globally set it to any other value.  Here is an example
 598 of the right way to use the variable:
 599
 600 @example
 601 ;; @r{Read the file with no character code conversion.}
 602 ;; @r{Assume @sc{crlf} represents end-of-line.}
 603 (let ((coding-system-for-read 'emacs-mule-dos))
 604   (insert-file-contents filename))
 605 @end example
 606
 607 When its value is non-@code{nil}, @code{coding-system-for-read} takes
 608 precedence over all other methods of specifying a coding system to use for
 609 input, including @code{file-coding-system-alist},
 610 @code{process-coding-system-alist} and
 611 @code{network-coding-system-alist}.
 612 @end defvar
 613
 614 @tindex coding-system-for-write
 615 @defvar coding-system-for-write
 616 This works much like @code{coding-system-for-read}, except that it
 617 applies to output rather than input.  It affects writing to files,
 618 subprocesses, and net connections.
 619
 620 When a single operation does both input and output, as do
 621 @code{call-process-region} and @code{start-process}, both
 622 @code{coding-system-for-read} and @code{coding-system-for-write}
 623 affect it.
 624 @end defvar
 625
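       A sketch of the corresponding output case, assuming a variable
     @code{filename} (hypothetical here) that holds the name of the file
     to write:

     @example
     ;; @r{Write the buffer as Latin 1 with @sc{crlf} line ends.}
     (let ((coding-system-for-write 'latin-1-dos))
       (write-region (point-min) (point-max) filename))
     @end example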
 626 @tindex last-coding-system-used
 627 @defvar last-coding-system-used
 628 All I/O operations that use a coding system set this variable
 629 to the coding system name that was used.
 630 @end defvar
 631
 632 @tindex inhibit-eol-conversion
 633 @defvar inhibit-eol-conversion
 634 When this variable is non-@code{nil}, no end-of-line conversion is done,
 635 no matter which coding system is specified.  This applies to all the
 636 Emacs I/O and subprocess primitives, and to the explicit encoding and
 637 decoding functions (@pxref{Explicit Encoding}).
 638 @end defvar
 639
 640 @tindex keyboard-coding-system
 641 @defun keyboard-coding-system
 642 This function returns the coding system that is in use for decoding
 643 keyboard input---or @code{nil} if no coding system is to be used.
 644 @end defun
 645
 646 @tindex set-keyboard-coding-system
 647 @defun set-keyboard-coding-system coding-system
 648 This function specifies @var{coding-system} as the coding system to
 649 use for decoding keyboard input.  If @var{coding-system} is @code{nil},
 650 that means do not decode keyboard input.
 651 @end defun
 652
 653 @tindex terminal-coding-system
 654 @defun terminal-coding-system
 655 This function returns the coding system that is in use for encoding
 656 terminal output---or @code{nil} for no encoding.
 657 @end defun
 658
 659 @tindex set-terminal-coding-system
 660 @defun set-terminal-coding-system coding-system
 661 This function specifies @var{coding-system} as the coding system to use
 662 for encoding terminal output.  If @var{coding-system} is @code{nil},
 663 that means do not encode terminal output.
 664 @end defun
 665
 666   See also the functions @code{process-coding-system} and
 667 @code{set-process-coding-system}.  @xref{Process Information}.
 668
 669   See also @code{read-coding-system} in @ref{High-Level Completion}.
 670
 671 @node Explicit Encoding
 672 @section Explicit Encoding and Decoding
 673 @cindex encoding text
 674 @cindex decoding text
 675
 676   All the operations that transfer text in and out of Emacs have the
 677 ability to use a coding system to encode or decode the text.
 678 You can also explicitly encode and decode text using the functions
 679 in this section.
 680
 681 @cindex raw bytes
 682   The result of encoding, and the input to decoding, are not ordinary
 683 text.  They are ``raw bytes''---bytes that represent text in the same
 684 way that an external file would.  When a buffer contains raw bytes, it
 685 is most natural to mark that buffer as using unibyte representation,
 686 using @code{set-buffer-multibyte} (@pxref{Selecting a Representation}),
 687 but this is not required.  If the buffer's contents are only temporarily
 688 raw, leave the buffer multibyte, which will be correct after you decode
 689 them.
 690
 691   The usual way to get raw bytes in a buffer, for explicit decoding, is
 692 to read them from a file with @code{insert-file-contents-literally}
 693 (@pxref{Reading from Files}) or specify a non-@code{nil} @var{rawfile}
 694 argument when visiting a file with @code{find-file-noselect}.
 695
 696   The usual way to use the raw bytes that result from explicitly
 697 encoding text is to copy them to a file or process---for example, to
 698 write them with @code{write-region} (@pxref{Writing to Files}), and
 699 suppress encoding for that @code{write-region} call by binding
 700 @code{coding-system-for-write} to @code{no-conversion}.
 701
 702 @tindex encode-coding-region
 703 @defun encode-coding-region start end coding-system
 704 This function encodes the text from @var{start} to @var{end} according
 705 to coding system @var{coding-system}.  The encoded text replaces the
 706 original text in the buffer.  The result of encoding is ``raw bytes,''
 707 but the buffer remains multibyte if it was multibyte before.
 708 @end defun
 709
 710 @tindex encode-coding-string
 711 @defun encode-coding-string string coding-system
 712 This function encodes the text in @var{string} according to coding
 713 system @var{coding-system}.  It returns a new string containing the
 714 encoded text.  The result of encoding is a unibyte string of ``raw bytes.''
 715 @end defun
 716
 717 @tindex decode-coding-region
 718 @defun decode-coding-region start end coding-system
 719 This function decodes the text from @var{start} to @var{end} according
 720 to coding system @var{coding-system}.  The decoded text replaces the
 721 original text in the buffer.  To make explicit decoding useful, the text
 722 before decoding ought to be ``raw bytes.''
 723 @end defun
 724
 725 @tindex decode-coding-string
 726 @defun decode-coding-string string coding-system
 727 This function decodes the text in @var{string} according to coding
 728 system @var{coding-system}.  It returns a new string containing the
 729 decoded text.  To make explicit decoding useful, the contents of
 730 @var{string} ought to be ``raw bytes.''
 731 @end defun
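
  To illustrate how the string functions fit together, here is a small
sketch that encodes a string and then decodes the resulting raw bytes
with the same coding system.  The function name
@code{my-latin-1-round-trip-p} is hypothetical, and @code{iso-latin-1}
is an assumed coding system chosen for illustration.

@example
(defun my-latin-1-round-trip-p (string)
  "Return non-nil if STRING survives encoding and decoding as Latin-1."
  (let ((raw (encode-coding-string string 'iso-latin-1)))
    ;; RAW holds raw bytes, not ordinary text; decode it
    ;; before comparing it with the original string.
    (string= string (decode-coding-string raw 'iso-latin-1))))

(my-latin-1-round-trip-p "abc")
     @result{} t
@end example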
 732
 733 @node MS-DOS File Types
 734 @section MS-DOS File Types
 735 @cindex DOS file types
 736 @cindex MS-DOS file types
 737 @cindex Windows file types
 738 @cindex file types on MS-DOS and Windows
 739 @cindex text files and binary files
 740 @cindex binary files and text files
 741
 742   Emacs on MS-DOS and on MS-Windows recognizes certain file names as
 743 text files or binary files.  For a text file, Emacs always uses DOS
 744 end-of-line conversion.  For a binary file, Emacs does no end-of-line
 745 conversion and no character code conversion.
 746
 747 @defvar buffer-file-type
 748 This variable, automatically buffer-local in each buffer, records the
 749 file type of the buffer's visited file.  The value is @code{nil} for
 750 text, @code{t} for binary.  When a buffer does not specify a coding
 751 system with @code{buffer-file-coding-system}, this variable is used by
 752 the function @code{find-buffer-file-type-coding-system} to determine
 753 which coding system to use when writing the contents of the buffer.
 754 @end defvar
 755
 756 @defopt file-name-buffer-file-type-alist
 757 This variable holds an alist for recognizing text and binary files.
 758 Each element has the form @code{(@var{regexp} . @var{type})}, where
 759 @var{regexp} is matched against the file name, and @var{type} may be
 760 @code{nil} for text, @code{t} for binary, or a function to call to
 761 compute which.  If it is a function, then it is called with a single
 762 argument (the file name) and should return @code{t} or @code{nil}.
 763
 764 When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
 765 which coding system to use when reading a file.  For a text file, it
 766 uses @code{undecided-dos}; for a binary file, it uses
 767 @code{no-conversion}.
 768
 769 If no element in this alist matches a given file name, then
 770 @code{default-buffer-file-type} says how to treat the file.
 771 @end defopt
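
  As an illustration, the following sketch adds an element to the alist
so that files whose names end in @file{.dat} are treated as binary.
The @file{.dat} extension is an arbitrary assumption made for this
example.  The @code{boundp} test guards against systems where the
variable is not defined.

@example
(if (boundp 'file-name-buffer-file-type-alist)
    (setq file-name-buffer-file-type-alist
          ;; A type of t means binary.
          (cons '("\\.dat\\'" . t)
                file-name-buffer-file-type-alist)))
@end example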
 772
 773 @defopt default-buffer-file-type
 774 This variable says how to handle files for which
 775 @code{file-name-buffer-file-type-alist} says nothing about the type.
 776
 777 If this variable is non-@code{nil}, then these files are treated as
 778 binary.  Otherwise, nothing special is done for them---the coding system
 779 is deduced solely from the file contents, in the usual Emacs fashion.
 780 @end defopt
 781
 782 @node MS-DOS Subprocesses
 783 @section MS-DOS Subprocesses
 784
 785   On Microsoft operating systems, the two variables described below
 786 provide an alternative way to specify the end-of-line conversion to use
 787 for subprocess input and output.  @code{binary-process-input} applies to
 788 input sent to the subprocess, and @code{binary-process-output} applies to
 789 output received from it.  A non-@code{nil} value means the data is
 790 ``binary,'' and @code{nil} means the data is text.
 791
 792 @defvar binary-process-input
 793 If this variable is @code{nil}, convert newlines to @sc{crlf} sequences in
 794 the input to a synchronous subprocess.
 795 @end defvar
 796
 797 @defvar binary-process-output
 798 If this variable is @code{nil}, convert @sc{crlf} sequences to newlines in
 799 the output from a synchronous subprocess.
 800 @end defvar
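
  As a sketch of how these variables might be used, the following form
binds both of them to @code{t} around a synchronous subprocess call, so
that no end-of-line conversion is done in either direction.  The
program name @code{"prog"} is hypothetical, and the example assumes an
Emacs running on MS-DOS or MS-Windows, where these variables are
defined.

@example
(let ((binary-process-input t)
      (binary-process-output t))
  ;; Send the accessible portion of the buffer to the program
  ;; as binary data, and insert its output in the current
  ;; buffer, also as binary data.
  (call-process-region (point-min) (point-max) "prog" nil t))
@end example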