@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
-@c Copyright (C) 1998-1999, 2001-2012 Free Software Foundation, Inc.
+@c Copyright (C) 1998-1999, 2001-2013 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@node Non-ASCII Characters
@chapter Non-@acronym{ASCII} Characters
@menu
* Text Representations:: How Emacs represents text.
+* Disabling Multibyte:: Controlling whether to use multibyte characters.
* Converting Representations:: Converting unibyte to multibyte and vice versa.
* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
* Character Codes:: How unibyte and multibyte relate to
@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string, @code{nil}
-otherwise.
+otherwise. This function also returns @code{nil} if @var{string} is
+some object other than a string.
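+
+As an illustrative sketch (these examples are an addition to the
+text, not taken from the Emacs sources), a string literal containing
+only @acronym{ASCII} characters is unibyte, and a non-string argument
+yields @code{nil} rather than signaling an error:
+
+@example
+@group
+(multibyte-string-p "abc")
+     @result{} nil
+(multibyte-string-p 'not-a-string)
+     @result{} nil
+(multibyte-string-p (string #x3b1))   ; @r{Greek small letter alpha}
+     @result{} t
+@end group
+@end example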
@end defun
@defun string-bytes string
result a unibyte string.
@end defun
+@node Disabling Multibyte
+@section Disabling Multibyte Characters
+@cindex disabling multibyte
+
+ By default, Emacs starts in multibyte mode: it stores the contents
+of buffers and strings using an internal encoding that represents
+non-@acronym{ASCII} characters using multi-byte sequences. Multibyte
+mode allows you to use all the supported languages and scripts without
+limitations.
+
+@cindex turn multibyte support on or off
+ Under very special circumstances, you may want to disable multibyte
+character support for a specific buffer.
+When multibyte characters are disabled in a buffer, we call
+that @dfn{unibyte mode}. In unibyte mode, each character in the
+buffer has a character code ranging from 0 through 255 (0377 octal); 0
+through 127 (0177 octal) represent @acronym{ASCII} characters, and 128
+(0200 octal) through 255 (0377 octal) represent non-@acronym{ASCII}
+characters.
+
+ To edit a particular file in unibyte representation, visit it using
+@code{find-file-literally}. @xref{Visiting Functions}. You can
+convert a multibyte buffer to unibyte by saving it to a file, killing
+the buffer, and visiting the file again with
+@code{find-file-literally}. Alternatively, you can use @kbd{C-x
+@key{RET} c} (@code{universal-coding-system-argument}) and specify
+@samp{raw-text} as the coding system with which to visit or save a
+file. @xref{Text Coding, , Specifying a Coding System for File Text,
+emacs, GNU Emacs Manual}. Unlike @code{find-file-literally}, finding
+a file as @samp{raw-text} doesn't disable format conversion,
+uncompression, or auto mode selection.
+
+@c See http://debbugs.gnu.org/11226 for lack of unibyte tooltip.
+@vindex enable-multibyte-characters
+The buffer-local variable @code{enable-multibyte-characters} is
+non-@code{nil} in multibyte buffers, and @code{nil} in unibyte ones.
+The mode line also indicates whether a buffer is multibyte or not.
+With a graphical display, in a multibyte buffer, the portion of the
+mode line that indicates the character set has a tooltip that (amongst
+other things) says that the buffer is multibyte. In a unibyte buffer,
+the character set indicator is absent. Thus, in a unibyte buffer
+(when using a graphical display) there is normally nothing before the
+indication of the visited file's end-of-line convention (colon,
+backslash, etc.), unless you are using an input method.
+
+@findex toggle-enable-multibyte-characters
+You can turn off multibyte support in a specific buffer by invoking the
+command @code{toggle-enable-multibyte-characters} in that buffer.
+
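+ The following sketch (an added illustration, not taken from the
+Emacs sources) shows how Lisp code might test the representation of a
+buffer. It assumes the function @code{set-buffer-multibyte}
+(@pxref{Selecting a Representation}) to make a temporary buffer
+unibyte before examining the variable:
+
+@example
+@group
+(with-temp-buffer
+  (set-buffer-multibyte nil)      ; @r{switch this buffer to unibyte}
+  enable-multibyte-characters)
+     @result{} nil
+@end group
+@end example
+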
@node Converting Representations
@section Converting Text Representations
characters.
@end defun
+@c FIXME: Should `@var{character}' be `@var{byte}'?
@defun byte-to-string byte
@cindex byte to string
This function returns a unibyte string containing a single byte of
during text processing and display. Thus, character properties are an
important part of specifying the character's semantics.
+@c FIXME: Use the latest URI of this chapter?
+@c http://www.unicode.org/versions/latest/ch04.pdf
On the whole, Emacs follows the Unicode Standard in its implementation
of character properties. In particular, Emacs supports the
@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
Model}, and the Emacs character property database is derived from the
Unicode Character Database (@acronym{UCD}). See the
-@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
+@uref{http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf, Character
Properties chapter of the Unicode Standard}, for a detailed
description of Unicode character properties and their meaning. This
section assumes you are already familiar with that chapter of the
Corresponds to the @code{Name} Unicode property. The value is a
string consisting of upper-case Latin letters A to Z, digits, spaces,
and hyphen @samp{-} characters. For unassigned codepoints, the value
-is an empty string.
+is @code{nil}.
@cindex unicode general category
@item general-category
may be a symbol representing a compatibility formatting tag, such as
@code{small}@footnote{The Unicode specification writes these tag names
inside @samp{<..>} brackets, but the tag names in Emacs do not include
-the brackets; e.g.@: Unicode specifies @samp{<small>} where Emacs uses
+the brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
@samp{small}. }; the other elements are characters that give the
compatibility decomposition sequence of this character. For
unassigned codepoints, the value is the character itself.
@item decimal-digit-value
Corresponds to the Unicode @code{Numeric_Value} property for
-characters whose @code{Numeric_Type} is @samp{Digit}. The value is an
-integer number. For unassigned codepoints, the value is @code{nil},
-which means @acronym{NaN}, or ``not-a-number''.
+characters whose @code{Numeric_Type} is @samp{Decimal}. The value is
+an integer number. For unassigned codepoints, the value is
+@code{nil}, which means @acronym{NaN}, or ``not-a-number''.
@item digit-value
Corresponds to the Unicode @code{Numeric_Value} property for
-characters whose @code{Numeric_Type} is @samp{Decimal}. The value is
-an integer number. Examples of such characters include compatibility
+characters whose @code{Numeric_Type} is @samp{Digit}. The value is an
+integer number. Examples of such characters include compatibility
subscript and superscript digits, for which the value is the
corresponding number. For unassigned codepoints, the value is
@code{nil}, which means @acronym{NaN}.
@item old-name
Corresponds to the Unicode @code{Unicode_1_Name} property. The value
-is a string. For unassigned codepoints, the value is an empty string.
+is a string. For unassigned codepoints, and characters that have no
+value for this property, the value is @code{nil}.
@item iso-10646-comment
Corresponds to the Unicode @code{ISO_Comment} property. The value is
@defun get-char-code-property char propname
This function returns the value of @var{char}'s @var{propname} property.
+@c FIXME: Use ‘?\s’ instead of ‘? ’ for the space character in the
+@c first example? --xfq
@example
@group
(get-char-code-property ? 'general-category)
@end defvar
@defvar char-script-table
+@cindex script symbols
The value of this variable is a char-table that specifies, for each
character, a symbol whose name is the script to which the character
belongs, according to the Unicode Standard classification of the
system (@pxref{Coding Systems}).
@end defun
+@c TODO: Explain the properties here and add indexes such as ‘charset property’.
@defun charset-plist charset
This function returns the property list of the character set
@var{charset}. Although @var{charset} is a symbol, this is not the
value of this variable, if non-@code{nil}, is applied after them.
@end defvar
+@c FIXME: This variable is obsolete since 23.1. We should mention
+@c that here or simply remove this defvar. --xfq
@defvar translation-table-for-input
Self-inserting characters are translated through this translation
table before they are inserted. Search commands also translate their
Each element of @var{alist} is of the form @code{(@var{from}
. @var{to})}, where @var{from} and @var{to} are either characters or
vectors specifying a sequence of characters. If @var{from} is a
-character, that character is translated to @var{to} (i.e.@: to a
+character, that character is translated to @var{to} (i.e., to a
character or a character sequence). If @var{from} is a vector of
characters, that sequence is translated to @var{to}. The returned
table has a translation table for reverse mapping in the first extra
for a single file operation.
* Explicit Encoding:: Encoding or decoding text without doing I/O.
* Terminal I/O Encoding:: Use of encoding for terminal I/O.
-* MS-DOS File Types:: How DOS "text" and "binary" files
- relate to coding systems.
@end menu
@node Coding System Basics
character (also called newline). The DOS convention, used on
MS-Windows and MS-DOS systems, is to use a carriage-return and a
linefeed at the end of a line. The Mac convention is to use just
-carriage-return.
+carriage-return. (This was the convention used on the Macintosh
+system prior to OS X.)
@cindex base coding system
@cindex variant coding system
as an alias for the coding system.
@end defun
+@cindex alias, for coding systems
@defun coding-system-aliases coding-system
This function returns the list of aliases of @var{coding-system}.
@end defun
an error. If such a problem happens, use @kbd{C-x C-w} to specify a
new file name for that buffer.
+@cindex file-name encoding, MS-Windows
+ On Windows 2000 and later, Emacs by default uses Unicode APIs to
+pass file names to the OS, so the value of
+@code{file-name-coding-system} is largely ignored. Lisp applications
+that need to encode or decode file names on the Lisp level should use
+the @code{utf-8} coding system when @code{system-type} is
+@code{windows-nt}; the conversion of UTF-8 encoded file names to the
+encoding appropriate for communicating with the OS is performed
+internally by Emacs.
+
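+ A minimal sketch of that advice follows. The function name
+@code{my-encode-file-name} is hypothetical, and the fallback to
+@code{default-file-name-coding-system} on other systems is an
+assumption made for illustration:
+
+@example
+@group
+(defun my-encode-file-name (name)
+  "Return NAME encoded as a byte sequence for Lisp-level use."
+  (encode-coding-string
+   name
+   (if (eq system-type 'windows-nt)
+       'utf-8                      ; @r{Emacs converts for the OS itself}
+     (or file-name-coding-system
+         default-file-name-coding-system))))
+@end group
+@end example
+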
@node Lisp and Coding Systems
@subsection Coding Systems in Lisp
@defun detect-coding-region start end &optional highest
This function chooses a plausible coding system for decoding the text
from @var{start} to @var{end}. This text should be a byte sequence,
-i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and
+i.e., unibyte text or multibyte text with only @acronym{ASCII} and
eight-bit characters (@pxref{Explicit Encoding}).
Normally this function returns a list of coding systems that could
support too many character sets to list them all yield special values:
@itemize @bullet
@item
-If @var{coding-system} supports all the ISO-2022 charsets, the value
-is @code{iso-2022}.
-@item
If @var{coding-system} supports all Emacs characters, the value is
@code{(emacs)}.
@item
-If @var{coding-system} supports all emacs-mule characters, the value
-is @code{emacs-mule}.
-@item
If @var{coding-system} supports all Unicode characters, the value is
@code{(unicode)}.
+@item
+If @var{coding-system} supports all ISO-2022 charsets, the value is
+@code{iso-2022}.
+@item
+If @var{coding-system} supports all the characters in the internal
+coding system used by Emacs version 21 (prior to the implementation of
+internal Unicode support), the value is @code{emacs-mule}.
@end itemize
@end defun
If @var{operation} is @code{insert-file-contents}, the argument
corresponding to the target may be a cons cell of the form
-@code{(@var{filename} . @var{buffer})}). In that case, @var{filename}
+@code{(@var{filename} . @var{buffer})}. In that case, @var{filename}
is a file name to look up in @code{file-coding-system-alist}, and
@var{buffer} is a buffer that contains the file's contents (not yet
decoded). If @code{file-coding-system-alist} specifies a function to
@example
;; @r{Read the file with no character code conversion.}
-;; @r{Assume @acronym{crlf} represents end-of-line.}
-(let ((coding-system-for-read 'emacs-mule-dos))
+(let ((coding-system-for-read 'no-conversion))
(insert-file-contents filename))
@end example
@code{nil}, that means the currently selected frame's terminal.
@end deffn
-@node MS-DOS File Types
-@subsection MS-DOS File Types
-@cindex DOS file types
-@cindex MS-DOS file types
-@cindex Windows file types
-@cindex file types on MS-DOS and Windows
-@cindex text files and binary files
-@cindex binary files and text files
-
- On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
-end-of-line conversion for a file by looking at the file's name. This
-feature classifies files as @dfn{text files} and @dfn{binary files}. By
-``binary file'' we mean a file of literal byte values that are not
-necessarily meant to be characters; Emacs does no end-of-line conversion
-and no character code conversion for them. On the other hand, the bytes
-in a text file are intended to represent characters; when you create a
-new file whose name implies that it is a text file, Emacs uses DOS
-end-of-line conversion.
-
-@defvar buffer-file-type
-This variable, automatically buffer-local in each buffer, records the
-file type of the buffer's visited file. When a buffer does not specify
-a coding system with @code{buffer-file-coding-system}, this variable is
-used to determine which coding system to use when writing the contents
-of the buffer. It should be @code{nil} for text, @code{t} for binary.
-If it is @code{t}, the coding system is @code{no-conversion}.
-Otherwise, @code{undecided-dos} is used.
-
-Normally this variable is set by visiting a file; it is set to
-@code{nil} if the file was visited without any actual conversion.
-
-Its default value is used to decide how to handle files for which
-@code{file-name-buffer-file-type-alist} says nothing about the type:
-If the default value is non-@code{nil}, then these files are treated as
-binary: the coding system @code{no-conversion} is used. Otherwise,
-nothing special is done for them---the coding system is deduced solely
-from the file contents, in the usual Emacs fashion.
-@end defvar
-
-@defopt file-name-buffer-file-type-alist
-This variable holds an alist for recognizing text and binary files.
-Each element has the form (@var{regexp} . @var{type}), where
-@var{regexp} is matched against the file name, and @var{type} may be
-@code{nil} for text, @code{t} for binary, or a function to call to
-compute which. If it is a function, then it is called with a single
-argument (the file name) and should return @code{t} or @code{nil}.
-
-When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
-which coding system to use when reading a file. For a text file,
-@code{undecided-dos} is used. For a binary file, @code{no-conversion}
-is used.
-
-If no element in this alist matches a given file name, then
-the default value of @code{buffer-file-type} says how to treat the file.
-@end defopt
-
@node Input Methods
@section Input Methods
@cindex input methods