@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
-@c Copyright (C) 1998, 1999 Free Software Foundation, Inc.
+@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
+@c 2005, 2006, 2007 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
+@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters
This chapter covers the special issues relating to non-@acronym{ASCII}
* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
* Character Codes:: How unibyte and multibyte relate to
codes of individual characters.
-* Character Sets:: The space of possible characters codes
+* Character Sets:: The space of possible character codes
is divided into various character sets.
* Chars and Bytes:: More information about multibyte encodings.
* Splitting Characters:: Converting a character to its byte sequence.
@end defvar
@defun position-bytes position
-@tindex position-bytes
-Return the byte-position corresponding to buffer position @var{position}
-in the current buffer. If @var{position} is out of range, the value
-is @code{nil}.
+Return the byte-position corresponding to buffer position
+@var{position} in the current buffer. This is 1 at the start of the
+buffer, and counts upward in bytes. If @var{position} is out of
+range, the value is @code{nil}.
@end defun
@defun byte-to-position byte-position
-@tindex byte-to-position
Return the buffer position corresponding to byte-position
@var{byte-position} in the current buffer. If @var{byte-position} is
out of range, the value is @code{nil}.
Return @code{t} if @var{string} is a multibyte string.
@end defun
+@defun string-bytes string
+@cindex string, number of bytes
+This function returns the number of bytes in @var{string}.
+If @var{string} is a multibyte string, this can be greater than
+@code{(length @var{string})}.
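+
+As a minimal illustration, assuming a string that contains only
+@acronym{ASCII} characters, each character occupies one byte, so the
+two values agree:
+
+@example
+(string-bytes "abc")
+     @result{} 3
+(length "abc")
+     @result{} 3
+@end example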
+@end defun
+
@node Converting Representations
@section Converting Text Representations
returned unchanged.
@end defun
+@defun multibyte-char-to-unibyte char
+This converts the multibyte character @var{char} to a unibyte
+character, based on @code{nonascii-translation-table} and
+@code{nonascii-insert-offset}.
+@end defun
+
+@defun unibyte-char-to-multibyte char
+This converts the unibyte character @var{char} to a multibyte
+character, based on @code{nonascii-translation-table} and
+@code{nonascii-insert-offset}.
+@end defun
+
@node Selecting a Representation
@section Selecting a Representation
0 through 127 are completely legitimate in both representations.
@defun char-valid-p charcode &optional genericp
-This returns @code{t} if @var{charcode} is valid for either one of the two
-text representations.
+This returns @code{t} if @var{charcode} is valid (either for unibyte
+text or for multibyte text).
@example
(char-valid-p 65)
@end defun
@defun charset-plist charset
-@tindex charset-plist
This function returns the charset property list of the character set
@var{charset}. Although @var{charset} is a symbol, this is not the same
as the property list of that symbol. Charset properties are used for
special purposes within Emacs.
@end defun
+@deffn Command list-charset-chars charset
+This command displays a list of characters in the character set
+@var{charset}.
+@end deffn
+
@node Chars and Bytes
@section Characters and Bytes
@cindex bytes and characters
-@cindex introduction sequence
+@cindex introduction sequence (of character)
@cindex dimension (of character set)
In multibyte representation, each character occupies one or more
bytes. Each character set has an @dfn{introduction sequence}, which is
@end defun
@defun charset-bytes charset
-@tindex charset-bytes
This function returns the number of bytes used to represent a character
in character set @var{charset}.
@end defun
@node Splitting Characters
@section Splitting Characters
+@cindex character as bytes
The functions in this section convert between characters and the byte
values used to represent them. For most purposes, there is no need to
@end example
@end defun
+@cindex generate characters in charsets
@defun make-char charset &optional code1 code2
This function returns the character in character set @var{charset} whose
position codes are @var{code1} and @var{code2}. This is roughly the
coding systems (@pxref{Coding Systems}) are capable of representing all
of the text in question.
+@defun charset-after &optional pos
+This function returns the charset of a character in the current buffer
+at position @var{pos}. If @var{pos} is omitted or @code{nil}, it
+defaults to the current value of point. If @var{pos} is out of range,
+the value is @code{nil}.
+@end defun
+
@defun find-charset-region beg end &optional translation
This function returns a list of the character sets that appear in the
current buffer between positions @var{beg} and @var{end}.
own particular translation tables; there are also default translation
tables which apply to all other coding systems.
+ For instance, the coding system @code{utf-8} has a translation table
+that maps characters of various charsets (e.g.,
+@code{latin-iso8859-@var{x}}) into Unicode character sets. This way,
+it can encode Latin-2 characters into UTF-8. Meanwhile,
+@code{unify-8859-on-decoding-mode} operates by specifying
+@code{standard-translation-table-for-decode} to translate
+Latin-@var{x} characters into corresponding Unicode characters.
+
@defun make-translation-table &rest translations
This function returns a translation table based on the argument
@var{translations}. Each element of @var{translations} should be a
@defvar translation-table-for-input
Self-inserting characters are translated through this translation
-table before they are inserted. This variable automatically becomes
+table before they are inserted. Search commands also translate their
+input through this table, so they can compare more reliably with
+what's in the buffer.
+
+@code{set-buffer-file-coding-system} sets this variable so that your
+keyboard input gets translated into the character sets that the buffer
+is likely to contain. This variable automatically becomes
buffer-local when set.
@end defvar
conversion, but some of them leave the choice unspecified---to be chosen
heuristically for each file, based on the data.
-@cindex end of line conversion
+ In general, a coding system doesn't guarantee roundtrip identity:
+decoding a byte sequence using a coding system, then encoding the
+resulting text in the same coding system, can produce a different byte
+sequence. However, the following coding systems do guarantee that the
+byte sequence will be the same as what you originally decoded:
+
+@quotation
+chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule
+greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3
+iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe
+japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text
+@end quotation
+
+ Encoding buffer text and then decoding the result can also fail to
+reproduce the original text. For instance, if you encode Latin-2
+characters with @code{utf-8} and decode the result using the same
+coding system, you'll get Unicode characters (of charset
+@code{mule-unicode-0100-24ff}). If you encode Unicode characters with
+@code{iso-latin-2} and decode the result with the same coding system,
+you'll get Latin-2 characters.
+
+@cindex EOL conversion
+@cindex end-of-line conversion
+@cindex line end conversion
@dfn{End of line conversion} handles three different conventions used
on various systems for representing end of line in files. The Unix
convention is to use the linefeed character (also called newline). The
uses one to encode the buffer contents.
You can specify the coding system to use either explicitly
-(@pxref{Specifying Coding Systems}), or implicitly using the defaulting
+(@pxref{Specifying Coding Systems}), or implicitly using a default
mechanism (@pxref{Default Coding Systems}). But these methods may not
completely specify what to do. For example, they may choose a coding
system such as @code{undefined} which leaves the character code
you will want to find out afterwards which coding system was chosen.
@defvar buffer-file-coding-system
-This variable records the coding system that was used for visiting the
-current buffer. It is used for saving the buffer, and for writing part
+This buffer-local variable records the coding system that was used to visit
+the current buffer. It is used for saving the buffer, and for writing part
of the buffer with @code{write-region}. If the text to be written
cannot be safely encoded using the coding system specified by this
variable, these operations select an alternative encoding by calling
The variable @code{selection-coding-system} specifies how to encode
selections for the window system. @xref{Window System Selections}.
+@defvar file-name-coding-system
+The variable @code{file-name-coding-system} specifies the coding
+system to use for encoding file names. Emacs encodes file names using
+that coding system for all file operations. If
+@code{file-name-coding-system} is @code{nil}, Emacs uses a default
+coding system determined by the selected language environment. In the
+default language environment, any non-@acronym{ASCII} characters in
+file names are not encoded specially; they appear in the file system
+using the internal Emacs representation.
+@end defvar
+
+ @strong{Warning:} if you change @code{file-name-coding-system} (or
+the language environment) in the middle of an Emacs session, problems
+can result if you have already visited files whose names were encoded
+using the earlier coding system and are handled differently under the
+new coding system. If you try to save one of these buffers under the
+visited file name, saving may use the wrong file name, or it may signal
+an error. If such a problem happens, use @kbd{C-x C-w} to specify a
+new file name for that buffer.
+
@node Lisp and Coding Systems
@subsection Coding Systems in Lisp
Otherwise it signals an error with condition @code{coding-system-error}.
@end defun
+@defun coding-system-eol-type coding-system
+This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
+conversion used by @var{coding-system}. If @var{coding-system}
+specifies a certain eol conversion, the return value is an integer 0,
+1, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
+respectively. If @var{coding-system} doesn't specify eol conversion
+explicitly, the return value is a vector of coding systems, each one
+with one of the possible eol conversion types, like this:
+
+@lisp
+(coding-system-eol-type 'latin-1)
+ @result{} [latin-1-unix latin-1-dos latin-1-mac]
+@end lisp
+
+@noindent
+If this function returns a vector, Emacs will decide, as part of the
+text encoding or decoding process, what eol conversion to use. For
+decoding, the end-of-line format of the text is auto-detected, and the
+eol conversion is set to match it (e.g., DOS-style CRLF format will
+imply @code{dos} eol conversion). For encoding, the eol conversion is
+taken from the appropriate default coding system (e.g.,
+@code{default-buffer-file-coding-system} for
+@code{buffer-file-coding-system}), or from the default eol conversion
+appropriate for the underlying platform.
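+
+The following sketch shows one way a caller might handle both kinds
+of return value. The helper name @code{my-eol-type-name} is
+hypothetical and is not part of Emacs.
+
+@example
+;; @r{Hypothetical helper, for illustration only.}
+(defun my-eol-type-name (coding-system)
+  "Return `unix', `dos' or `mac', or nil if eol type is unspecified."
+  (let ((eol (coding-system-eol-type coding-system)))
+    ;; @r{An integer names a specific conversion; any other value}
+    ;; @r{(such as a vector) means it is not fixed yet.}
+    (and (integerp eol)
+         (nth eol '(unix dos mac)))))
+@end example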
+@end defun
+
@defun coding-system-change-eol-conversion coding-system eol-type
This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @code{eol-type}.
the end-of-line conversion from the data.
@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
-@code{dos} and @code{mac}, respectively.
+@code{dos} and @code{mac}, respectively.
@end defun
@defun coding-system-change-text-conversion eol-coding text-coding
return value is just one coding system, the one that is highest in
priority.
-If the region contains only @acronym{ASCII} characters, the value
-is @code{undecided} or @code{(undecided)}, or a variant specifying
+If the region contains only @acronym{ASCII} characters except for such
+ISO-2022 control characters as @code{ESC}, the value is
+@code{undecided} or @code{(undecided)}, or a variant specifying
end-of-line conversion, if that can be deduced from the text.
@end defun
@var{encoding-system} is the coding system for encoding (in case
@var{operation} does encoding).
-The argument @var{operation} should be a symbol, one of
-@code{insert-file-contents}, @code{write-region}, @code{call-process},
-@code{call-process-region}, @code{start-process}, or
-@code{open-network-stream}. These are the names of the Emacs I/O primitives
-that can do coding system conversion.
+The argument @var{operation} is a symbol, one of @code{write-region},
+@code{start-process}, @code{call-process}, @code{call-process-region},
+@code{insert-file-contents}, or @code{open-network-stream}. These are
+the names of the Emacs I/O primitives that can do character code and
+eol conversion.
The remaining arguments should be the same arguments that might be given
-to that I/O primitive. Depending on the primitive, one of those
-arguments is selected as the @dfn{target}. For example, if
+to the corresponding I/O primitive. Depending on the primitive, one
+of those arguments is selected as the @dfn{target}. For example, if
@var{operation} does file I/O, whichever argument specifies the file
name is the target. For subprocess primitives, the process name is the
target. For @code{open-network-stream}, the target is the service name
or port number.
-This function looks up the target in @code{file-coding-system-alist},
-@code{process-coding-system-alist}, or
-@code{network-coding-system-alist}, depending on @var{operation}.
+Depending on @var{operation}, this function looks up the target in
+@code{file-coding-system-alist}, @code{process-coding-system-alist},
+or @code{network-coding-system-alist}. If the target is found in the
+alist, @code{find-operation-coding-system} returns its association in
+the alist; otherwise it returns @code{nil}.
+
+If @var{operation} is @code{insert-file-contents}, the argument
+corresponding to the target may be a cons cell of the form
+@code{(@var{filename} . @var{buffer})}. In that case, @var{filename}
+is a file name to look up in @code{file-coding-system-alist}, and
+@var{buffer} is a buffer that contains the file's contents (not yet
+decoded). If @code{file-coding-system-alist} specifies a function to
+call for this file, and that function needs to examine the file's
+contents (as it usually does), it should examine the contents of
+@var{buffer} instead of reading the file.
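+
+For instance, a caller that has already read the raw bytes of a file
+into a buffer might pass the cons cell form like this. This is only a
+sketch: the file name is made up, and the result depends on the
+current value of @code{file-coding-system-alist}.
+
+@example
+(with-temp-buffer
+  ;; @r{The file name "/tmp/foo.txt" is hypothetical.}
+  ;; @r{Insert the file's bytes without any conversion.}
+  (insert-file-contents-literally "/tmp/foo.txt")
+  (find-operation-coding-system
+   'insert-file-contents
+   (cons "/tmp/foo.txt" (current-buffer))))
+@end example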
@end defun
@node Specifying Coding Systems
@example
;; @r{Read the file with no character code conversion.}
;; @r{Assume @acronym{crlf} represents end-of-line.}
-(let ((coding-system-for-write 'emacs-mule-dos))
+(let ((coding-system-for-read 'emacs-mule-dos))
(insert-file-contents filename))
@end example
-When its value is non-@code{nil}, @code{coding-system-for-read} takes
-precedence over all other methods of specifying a coding system to use for
-input, including @code{file-coding-system-alist},
+When its value is non-@code{nil}, this variable takes precedence over
+all other methods of specifying a coding system to use for input,
+including @code{file-coding-system-alist},
@code{process-coding-system-alist} and
@code{network-coding-system-alist}.
@end defvar
@node Explicit Encoding
@subsection Explicit Encoding and Decoding
-@cindex encoding text
-@cindex decoding text
+@cindex encoding in coding systems
+@cindex decoding in coding systems
All the operations that transfer text in and out of Emacs have the
ability to use a coding system to encode or decode the text.
@code{no-conversion}.
Here are the functions to perform explicit encoding or decoding. The
-decoding functions produce sequences of bytes; the encoding functions
+encoding functions produce sequences of bytes; the decoding functions
are meant to operate on sequences of bytes. All of these functions
discard text properties.
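+
+As a small sketch of the two directions, using only @acronym{ASCII}
+text so that the result is the same in both cases:
+
+@example
+(encode-coding-string "abc" 'utf-8)
+     @result{} "abc"
+(decode-coding-string "abc" 'utf-8)
+     @result{} "abc"
+@end example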
how Emacs interacts with these features.
@defvar locale-coding-system
-@tindex locale-coding-system
@cindex keyboard input decoding on X
This variable specifies the coding system to use for decoding system
error messages and---on X Window system only---keyboard input, for
@end defvar
@defvar system-messages-locale
-@tindex system-messages-locale
This variable specifies the locale to use for generating system error
messages. Changing the locale can cause messages to come out in a
different language or in a different orthography. If the variable is
@end defvar
@defvar system-time-locale
-@tindex system-time-locale
This variable specifies the locale to use for formatting time values.
Changing the locale can cause messages to appear according to the
conventions of a different language. If the variable is @code{nil}, the