@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
-@c 2005, 2006, 2007, 2008 Free Software Foundation, Inc.
+@c 2005, 2006, 2007, 2008, 2009, 2010 Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
* Character Codes:: How unibyte and multibyte relate to
codes of individual characters.
+* Character Properties:: Character attributes that define their
+ behavior and handling.
* Character Sets:: The space of possible character codes
is divided into various character sets.
* Scanning Charsets:: Which character sets are used in a buffer?
@cindex text representation
Emacs buffers and strings support a large repertoire of characters
-from many different scripts. This is so users could type and display
-text in most any known written language.
+from many different scripts, allowing users to type and display text
+in almost any known written language.
@cindex character codepoint
@cindex codespace
follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
unique number, called a @dfn{codepoint}, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
-@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
-extends this range with codepoints in the range @code{110000..3FFFFF},
-which it uses for representing characters that are not unified with
-Unicode and raw 8-bit bytes that cannot be interpreted as characters
-(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a
-character codepoint in Emacs is a 22-bit integer number.
+@dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
+inclusive. Emacs extends this range with codepoints in the range
+@code{#x110000..#x3FFFFF}, which it uses for representing characters
+that are not unified with Unicode and @dfn{raw 8-bit bytes} that
+cannot be interpreted as characters. Thus, a character codepoint in
+Emacs is a 22-bit integer number.
@cindex internal representation of characters
@cindex characters, representation in buffers and strings
by the Unicode Standard, called @dfn{UTF-8}, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
codepoints it uses for raw 8-bit bytes and characters not unified with
-Unicode.}.
-For example, any @acronym{ASCII} character takes up only 1 byte, a
-Latin-1 character takes up 2 bytes, etc. We call this representation
-of text @dfn{multibyte}, because it uses several bytes for each
-character.
+Unicode.}. For example, any @acronym{ASCII} character takes up only 1
+byte, a Latin-1 character takes up 2 bytes, etc. We call this
+representation of text @dfn{multibyte}.
Outside Emacs, characters can be represented in many different
encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
-between these external encodings and the internal representation, as
+between these external encodings and its internal representation, as
appropriate, when it reads text into a buffer or a string, or when it
writes text to a disk file or passes it to some other process.
Encoded text is not really text, as far as Emacs is concerned, but
rather a sequence of raw 8-bit bytes. We call buffers and strings
that hold encoded text @dfn{unibyte} buffers and strings, because
-Emacs treats them as a sequence of individual bytes. In particular,
-Emacs usually displays unibyte buffers and strings as octal codes such
-as @code{\237}. We recommend that you never use unibyte buffers and
+Emacs treats them as a sequence of individual bytes. Usually, Emacs
+displays unibyte buffers and strings as octal codes such as
+@code{\237}. We recommend that you never use unibyte buffers and
strings except for manipulating encoded text or binary non-text data.
In a buffer, the buffer-local value of the variable
You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
-@end defvar
-
-@defvar default-enable-multibyte-characters
-This variable's value is entirely equivalent to @code{(default-value
-'enable-multibyte-characters)}, and setting this variable changes that
-default value. Setting the local binding of
-@code{enable-multibyte-characters} in a specific buffer is not allowed,
-but changing the default value is supported, and it is a reasonable
-thing to do, because it has no effect on existing buffers.
The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
text from several strings together in one string. You can also
explicitly convert a string's contents to either representation.
- Emacs chooses the representation for a string based on the text that
-it is constructed from. The general rule is to convert unibyte text to
-multibyte text when combining it with other multibyte text, because the
-multibyte representation is more general and can hold whatever
+ Emacs chooses the representation for a string based on the text from
+which it is constructed. The general rule is to convert unibyte text
+to multibyte text when combining it with other multibyte text, because
+the multibyte representation is more general and can hold whatever
characters the unibyte text has.
When inserting text into a buffer, Emacs converts the text to the
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.
- Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and converts bytes with codes 128 through 159 to the
-multibyte representation of raw eight-bit bytes.
+ Converting unibyte text to multibyte text leaves @acronym{ASCII}
+characters unchanged, and converts bytes with codes 128 through 255 to
+the multibyte representation of raw eight-bit bytes.
Converting multibyte text to unibyte converts all @acronym{ASCII}
and eight-bit characters to their single-byte form, but loses
it is returned unchanged. The function assumes that @var{string}
includes only @acronym{ASCII} characters and raw 8-bit bytes; the
latter are converted to their multibyte representation corresponding
-to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
-Representations, codepoints}).
+to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
+(@pxref{Text Representations, codepoints}).
@end defun
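+
+ As a minimal sketch of what @code{string-to-multibyte} does
+(assuming the raw-byte codepoint range given above), a unibyte string
+holding the single byte 128 becomes a multibyte string whose only
+character has the codepoint @code{#x3FFF80}:
+
+@example
+@group
+(multibyte-string-p (string-to-multibyte "\200"))
+     @result{} t
+@end group
+@group
+(= (aref (string-to-multibyte "\200") 0) #x3FFF80)
+     @result{} t
+@end group
+@end example
+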
@defun string-to-unibyte string
@end defun
@defun multibyte-char-to-unibyte char
-This convert the multibyte character @var{char} to a unibyte
-character. If @var{char} is a character that is neither
-@acronym{ASCII} nor eight-bit, the value is -1.
+This converts the multibyte character @var{char} to a unibyte
+character, and returns that character. If @var{char} is neither
+@acronym{ASCII} nor eight-bit, the function returns -1.
@end defun
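+
+ For illustration, the following sketch shows what we would expect
+@code{multibyte-char-to-unibyte} to return for an @acronym{ASCII}
+character, for the eight-bit character @code{#x3FFF80}, and for a
+character that is neither (the euro sign, chosen arbitrarily):
+
+@example
+@group
+(multibyte-char-to-unibyte ?a)
+     @result{} 97
+@end group
+@group
+(multibyte-char-to-unibyte #x3FFF80)
+     @result{} 128
+@end group
+@group
+(multibyte-char-to-unibyte ?\u20AC)
+     @result{} -1
+@end group
+@end example
+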
@defun unibyte-char-to-multibyte char
This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents
-viewed as characters; a sequence of three bytes which is treated as
-one character in multibyte representation will count as three
-characters in unibyte representation. Eight-bit characters
+viewed as characters; for instance, a sequence of three bytes which is
+treated as one character in multibyte representation will count as
+three characters in unibyte representation. Eight-bit characters
representing raw bytes are an exception. They are represented by one
byte in a unibyte buffer, but when the buffer is set to multibyte,
they are converted to two-byte sequences, and vice versa.
@end defun
@defun string-as-unibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each byte as a character. This means that the value may have
-more characters than @var{string} has. Eight-bit characters
-representing raw bytes are an exception: each one of them is converted
-to a single byte.
-
-If @var{string} is already a unibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
+If @var{string} is already a unibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each byte as a separate character
+(so that the value may have more characters than @var{string}); as an
+exception, each eight-bit character representing a raw byte is
+converted into a single byte. The newly created string contains no
text properties.
@end defun
@defun string-as-multibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character. This means that
-the value may have fewer characters than @var{string} has. If a byte
-sequence in @var{string} is invalid as a multibyte representation of a
-single character, each byte in the sequence is treated as raw 8-bit
-byte.
-
-If @var{string} is already a multibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
-text properties.
+If @var{string} is a multibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each multibyte sequence as one
+character. This means that the value may have fewer characters than
+@var{string} has. If a byte sequence in @var{string} is invalid as a
+multibyte representation of a single character, each byte in the
+sequence is treated as a raw 8-bit byte. The newly created string
+contains no text properties.
@end defun
@node Character Codes
The unibyte and multibyte text representations use different
character codes. The valid character codes for unibyte representation
-range from 0 to 255---the values that can fit in one byte. The valid
-character codes for multibyte representation range from 0 to 4194303
-(#x3FFFFF). In this code space, values 0 through 127 are for
-@acronym{ASCII} charcters, and values 129 through 4194175 (#x3FFF7F)
-are for non-@acronym{ASCII} characters. Values 0 through 1114111
-(#10FFFF) corresponds to Unicode characters of the same codepoint,
-while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
-representing eight-bit raw bytes.
+range from 0 to @code{#xFF} (255)---the values that can fit in one
+byte. The valid character codes for multibyte representation range
+from 0 to @code{#x3FFFFF}. In this code space, values 0 through
+@code{#x7F} (127) are for @acronym{ASCII} characters, and values
+@code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
+non-@acronym{ASCII} characters.
+
+ Emacs character codes are a superset of the Unicode codespace.
+Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
+characters of the same codepoint; values @code{#x110000} (1114112)
+through @code{#x3FFF7F} (4194175) represent characters that are not
+unified with Unicode; and values @code{#x3FFF80} (4194176) through
+@code{#x3FFFFF} (4194303) represent eight-bit raw bytes.
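+
+ As a rough illustration of these boundaries (a sketch that assumes
+the standard functions @code{characterp} and @code{max-char}), the
+last codepoint of each range is a valid character, while the first
+value past @code{#x3FFFFF} is not:
+
+@example
+@group
+(characterp #x10FFFF)    ; last Unicode codepoint
+     @result{} t
+(characterp #x3FFFFF)    ; last raw 8-bit byte
+     @result{} t
+(characterp #x400000)    ; beyond the Emacs codespace
+     @result{} nil
+(= (max-char) #x3FFFFF)
+     @result{} t
+@end group
+@end example
+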
@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@end example
@end defun
-@defun get-byte pos &optional string
-This function returns the byte at current buffer's character position
-@var{pos}. If the current buffer is unibyte, this is literally the
-byte at that position. If the buffer is multibyte, byte values of
+@defun get-byte &optional pos string
+This function returns the byte at character position @var{pos} in the
+current buffer. If the current buffer is unibyte, this is literally
+the byte at that position. If the buffer is multibyte, byte values of
@acronym{ASCII} characters are the same as character codepoints,
whereas eight-bit raw bytes are converted to their 8-bit codes. The
function signals an error if the character at @var{pos} is
string instead of the current buffer.
@end defun
+@node Character Properties
+@section Character Properties
+@cindex character properties
+A @dfn{character property} is a named attribute of a character that
+specifies how the character behaves and how it should be handled
+during text processing and display. Thus, character properties are an
+important part of specifying the character's semantics.
+
+ On the whole, Emacs follows the Unicode Standard in its implementation
+of character properties. In particular, Emacs supports the
+@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
+Model}, and the Emacs character property database is derived from the
+Unicode Character Database (@acronym{UCD}). See the
+@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
+Properties chapter of the Unicode Standard}, for a detailed
+description of Unicode character properties and their meaning. This
+section assumes you are already familiar with that chapter of the
+Unicode Standard, and want to apply that knowledge to Emacs Lisp
+programs.
+
+ In Emacs, each property has a name, which is a symbol, and a set of
+possible values, whose types depend on the property; if a character
+does not have a certain property, the value is @code{nil}. As a
+general rule, the names of character properties in Emacs are produced
+from the corresponding Unicode properties by downcasing them and
+replacing each @samp{_} character with a dash @samp{-}. For example,
+@code{Canonical_Combining_Class} becomes
+@code{canonical-combining-class}. However, sometimes we shorten the
+names to make their use easier.
+
+ Here is the full list of value types for all the character
+properties that Emacs knows about:
+
+@table @code
+@item name
+This property corresponds to the Unicode @code{Name} property. The
+value is a string consisting of upper-case Latin letters A to Z,
+digits, spaces, and hyphen @samp{-} characters.
+
+@cindex unicode general category
+@item general-category
+This property corresponds to the Unicode @code{General_Category}
+property. The value is a symbol whose name is a 2-letter abbreviation
+of the character's classification.
+
+@item canonical-combining-class
+Corresponds to the Unicode @code{Canonical_Combining_Class} property.
+The value is an integer number.
+
+@item bidi-class
+Corresponds to the Unicode @code{Bidi_Class} property. The value is a
+symbol whose name is the Unicode @dfn{directional type} of the
+character.
+
+@item decomposition
+Corresponds to the Unicode @code{Decomposition_Type} and
+@code{Decomposition_Value} properties. The value is a list, whose
+first element may be a symbol representing a compatibility formatting
+tag, such as @code{small}@footnote{
+Note that the Unicode spec writes these tag names inside
+@samp{<..>} brackets. The tag names in Emacs do not include the
+brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses
+@samp{small}.
+}; the other elements are characters that give the compatibility
+decomposition sequence of this character.
+
+@item decimal-digit-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Decimal}. The value is
+an integer number.
+
+@item digit-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Digit}. The value is an
+integer number. Examples of such characters include compatibility
+subscript and superscript digits, for which the value is the
+corresponding number.
+
+@item numeric-value
+Corresponds to the Unicode @code{Numeric_Value} property for
+characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
+this property is an integer or a floating-point number. Examples of
+characters that have this property include fractions, subscripts,
+superscripts, Roman numerals, currency numerators, and encircled
+numbers. For example, the value of this property for the character
+@code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}.
+
+@item mirrored
+Corresponds to the Unicode @code{Bidi_Mirrored} property. The value
+of this property is a symbol, either @code{Y} or @code{N}.
+
+@item old-name
+Corresponds to the Unicode @code{Unicode_1_Name} property. The value
+is a string.
+
+@item iso-10646-comment
+Corresponds to the Unicode @code{ISO_Comment} property. The value is
+a string.
+
+@item uppercase
+Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
+The value of this property is a single character.
+
+@item lowercase
+Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
+The value of this property is a single character.
+
+@item titlecase
+Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
+@dfn{Title case} is a special form of a character used when the first
+character of a word needs to be capitalized. The value of this
+property is a single character.
+@end table
+
+@defun get-char-code-property char propname
+This function returns the value of @var{char}'s @var{propname} property.
+
+@example
+@group
+(get-char-code-property ?\s 'general-category)
+ @result{} Zs
+@end group
+@group
+(get-char-code-property ?1 'general-category)
+ @result{} Nd
+@end group
+@group
+(get-char-code-property ?\u2084 'digit-value) ; subscript 4
+ @result{} 4
+@end group
+@group
+(get-char-code-property ?\u2155 'numeric-value) ; one fifth
+ @result{} 0.2
+@end group
+@group
+(get-char-code-property ?\u2163 'numeric-value) ; Roman IV
+ @result{} 4
+@end group
+@end example
+@end defun
+
+@defun char-code-property-description prop value
+This function returns the description string of property @var{prop}'s
+@var{value}, or @code{nil} if @var{value} has no description.
+
+@example
+@group
+(char-code-property-description 'general-category 'Zs)
+ @result{} "Separator, Space"
+@end group
+@group
+(char-code-property-description 'general-category 'Nd)
+ @result{} "Number, Decimal Digit"
+@end group
+@group
+(char-code-property-description 'numeric-value '1/5)
+ @result{} nil
+@end group
+@end example
+@end defun
+
+@defun put-char-code-property char propname value
+This function stores @var{value} as the value of the property
+@var{propname} for the character @var{char}.
+@end defun
+
+@defvar unicode-category-table
+The value of this variable is a char-table (@pxref{Char-Tables}) that
+specifies, for each character, its Unicode @code{General_Category}
+property as a symbol.
+@end defvar
+
+@defvar char-script-table
+The value of this variable is a char-table that specifies, for each
+character, a symbol whose name is the script to which the character
+belongs, according to the Unicode Standard classification of the
+Unicode code space into script-specific blocks. This char-table has a
+single extra slot whose value is the list of all script symbols.
+@end defvar
+
+@defvar char-width-table
+The value of this variable is a char-table that specifies the width of
+each character in columns that it will occupy on the screen.
+@end defvar
+
+@defvar printable-chars
+The value of this variable is a char-table that specifies, for each
+character, whether it is printable or not. That is, if evaluating
+@code{(aref printable-chars char)} results in @code{t}, the character
+is printable, and if it results in @code{nil}, it is not.
+@end defvar
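+
+ Since these variables hold ordinary char-tables, they can be queried
+with @code{aref}. The following sketch shows the values we would
+expect for the character @samp{a}; the choice of character is
+arbitrary:
+
+@example
+@group
+(aref unicode-category-table ?a)
+     @result{} Ll
+(aref char-script-table ?a)
+     @result{} latin
+(aref char-width-table ?a)
+     @result{} 1
+(aref printable-chars ?a)
+     @result{} t
+@end group
+@end example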
+
@node Character Sets
@section Character Sets
@cindex character sets
@cindex coded character set
An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
in which each character is assigned a numeric code point. (The
-Unicode standard calls this a @dfn{coded character set}.) Each Emacs
+Unicode Standard calls this a @dfn{coded character set}.) Each Emacs
charset has a name which is a symbol. A single character can belong
to any number of different character sets, but it will generally have
a different code point in each charset. Examples of character sets
@cindex @code{eight-bit}, a charset
Emacs defines several special character sets. The character set
@code{unicode} includes all the characters whose Emacs code points are
-in the range @code{0..10FFFF}. The character set @code{emacs}
+in the range @code{0..#x10FFFF}. The character set @code{emacs}
includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
Emacs uses it to represent raw bytes encountered in text.
This function makes @var{charsets} the highest priority character sets.
@end defun
-@defun char-charset character
+@defun char-charset character &optional restriction
This function returns the name of the character set of highest
priority that @var{character} belongs to. @acronym{ASCII} characters
are an exception: for them, this function always returns @code{ascii}.
+
+If @var{restriction} is non-@code{nil}, it should be a list of
+charsets to search. Alternatively, it can be a coding system, in
+which case the returned charset must be supported by that coding
+system (@pxref{Coding Systems}).
@end defun
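+
+ As a small sketch of the @var{restriction} argument (the characters
+and the @code{iso-8859-1} charset are chosen only for illustration):
+
+@example
+@group
+(char-charset ?a)
+     @result{} ascii
+@end group
+@group
+;; U+00E9 (e with acute accent) has a codepoint in iso-8859-1.
+(char-charset ?\u00E9 '(iso-8859-1))
+     @result{} iso-8859-1
+@end group
+@group
+;; U+20AC (the euro sign) does not.
+(char-charset ?\u20AC '(iso-8859-1))
+     @result{} nil
+@end group
+@end example
+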
@defun charset-plist charset
that fits the second argument of @code{decode-char} above. If
@var{charset} doesn't have a codepoint for @var{char}, the value is
@code{nil}.
+@end defun
+
+ The following function comes in handy for applying a certain
+function to all or part of the characters in a charset:
+
+@defun map-charset-chars function charset &optional arg from-code to-code
+Call @var{function} for characters in @var{charset}. @var{function}
+is called with two arguments. The first one is a cons cell
+@code{(@var{from} . @var{to})}, where @var{from} and @var{to}
+indicate a range of characters contained in @var{charset}. The second
+argument passed to @var{function} is @var{arg}.
+
+By default, the range of codepoints passed to @var{function} includes
+all the characters in @var{charset}, but optional arguments
+@var{from-code} and @var{to-code} limit that to the range of
+characters between these two codepoints of @var{charset}. If either
+of them is @code{nil}, it defaults to the first or last codepoint of
+@var{charset}, respectively.
@end defun
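+
+ For example, the following sketch collects the ranges passed to the
+function for the @code{ascii} charset. The variable name
+@code{ranges} is only illustrative, and how the characters are split
+into ranges is up to Emacs:
+
+@example
+@group
+(let ((ranges nil))
+  ;; The function receives a (FROM . TO) cons and ARG.
+  (map-charset-chars (lambda (range _arg)
+                       (push range ranges))
+                     'ascii)
+  (nreverse ranges))
+@end group
+@end example
+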
@node Scanning Charsets
@section Scanning for Character Sets
- Sometimes it is useful to find out, for characters that appear in a
-certain part of a buffer or a string, to which character sets they
-belong. One use for this is in determining which coding systems
-(@pxref{Coding Systems}) are capable of representing all of the text
-in question; another is to determine the font(s) for displaying that
-text.
+ Sometimes it is useful to find out which character sets the
+characters in a piece of text belong to. One use for this is in
+determining which coding systems (@pxref{Coding Systems}) are capable
+of representing all of the text in question; another is to determine
+the font(s) for displaying that text.
@defun charset-after &optional pos
This function returns the charset of highest priority containing the
-character in the current buffer at position @var{pos}. If @var{pos}
+character at position @var{pos} in the current buffer. If @var{pos}
is omitted or @code{nil}, it defaults to the current value of point.
If @var{pos} is out of range, the value is @code{nil}.
@end defun
that contain characters in the current buffer between positions
@var{beg} and @var{end}.
-The optional argument @var{translation} specifies a translation table to
-be used in scanning the text (@pxref{Translation of Characters}). If it
-is non-@code{nil}, then each character in the region is translated
+The optional argument @var{translation} specifies a translation table
+to use for scanning the text (@pxref{Translation of Characters}). If
+it is non-@code{nil}, then each character in the region is translated
through this table, and the value returned describes the translated
characters instead of the characters actually in the buffer.
@end defun
@defun find-charset-string string &optional translation
-This function returns a list of the character sets of highest priority
+This function returns a list of character sets of highest priority
that contain characters in @var{string}. It is just like
@code{find-charset-region}, except that it applies to the contents of
@var{string} instead of part of the current buffer.
During decoding, the translation table's translations are applied to
the characters that result from ordinary decoding. If a coding system
-has property @code{:decode-translation-table}, that specifies the
+has the property @code{:decode-translation-table}, that specifies the
translation table to use, or a list of translation tables to apply in
sequence. (This is a property of the coding system, as returned by
@code{coding-system-get}, not a property of the symbol that is the
value of this variable, if non-@code{nil}, is applied after them.
@end defvar
+@defvar translation-table-for-input
+Self-inserting characters are translated through this translation
+table before they are inserted. Search commands also translate their
+input through this table, so they can compare more reliably with
+what's in the buffer.
+
+This variable automatically becomes buffer-local when set.
+@end defvar
+
@defun make-translation-table-from-vector vec
This function returns a translation table made from @var{vec} that is
-an array of 256 elements to map byte values 0 through 255 to
+an array of 256 elements to map bytes (values 0 through @code{#xFF}) to
characters. Elements may be @code{nil} for untranslated bytes. The
returned table has a translation table for reverse mapping in the
first extra slot, and the value @code{1} in the second extra slot.
This function is similar to @code{make-translation-table} but returns
a complex translation table rather than a simple one-to-one mapping.
Each element of @var{alist} is of the form @code{(@var{from}
-. @var{to})}, where @var{from} and @var{to} are either a character or
-a vector specifying a sequence of characters. If @var{from} is a
+. @var{to})}, where @var{from} and @var{to} are either characters or
+vectors specifying a sequence of characters. If @var{from} is a
character, that character is translated to @var{to} (i.e.@: to a
character or a character sequence). If @var{from} is a vector of
characters, that sequence is translated to @var{to}. The returned
three coding systems for the Cyrillic (Russian) alphabet: ISO,
Alternativnyj, and KOI8.
-@c I think this paragraph is no longer correct.
-@ignore
- Most coding systems specify a particular character code for
-conversion, but some of them leave the choice unspecified---to be chosen
-heuristically for each file, based on the data.
-@end ignore
+ Every coding system specifies a particular set of character code
+conversions, but the coding system @code{undecided} is special: it
+leaves the choice unspecified, to be chosen heuristically for each
+file, based on the file's data.
In general, a coding system doesn't guarantee roundtrip identity:
decoding a byte sequence using coding system, then encoding the
well. Most base coding systems have three corresponding variants whose
names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.
+@vindex raw-text@r{ coding system}
The coding system @code{raw-text} is special in that it prevents
-character code conversion, and causes the buffer visited with that
-coding system to be a unibyte buffer. It does not specify the
-end-of-line conversion, allowing that to be determined as usual by the
-data, and has the usual three variants which specify the end-of-line
-conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}:
-it specifies no conversion of either character codes or end-of-line.
+character code conversion, and causes the buffer visited with this
+coding system to be a unibyte buffer. For historical reasons, you can
+save both unibyte and multibyte text with this coding system. When
+you use @code{raw-text} to encode multibyte text, it does perform one
+character code conversion: it converts eight-bit characters to their
+single-byte external representation. @code{raw-text} does not specify
+the end-of-line conversion, allowing that to be determined as usual by
+the data, and has the usual three variants which specify the
+end-of-line conversion.
+
+@vindex no-conversion@r{ coding system}
+@vindex binary@r{ coding system}
+ @code{no-conversion} (and its alias @code{binary}) is equivalent to
+@code{raw-text-unix}: it specifies no conversion of either character
+codes or end-of-line.
@vindex emacs-internal@r{ coding system}
- The coding system @code{emacs-internal} specifies that the data is
-represented in the internal Emacs encoding. This is like
-@code{raw-text} in that no code conversion happens, but different in
-that the result is multibyte data.
+@vindex utf-8-emacs@r{ coding system}
+ The coding system @code{utf-8-emacs} specifies that the data is
+represented in the internal Emacs encoding (@pxref{Text
+Representations}). This is like @code{raw-text} in that no code
+conversion happens, but different in that the result is multibyte
+data. The name @code{emacs-internal} is an alias for
+@code{utf-8-emacs}.
@defun coding-system-get coding-system property
This function returns the specified property of the coding system
as an alias for the coding system.
@end defun
+@defun coding-system-aliases coding-system
+This function returns the list of aliases of @var{coding-system}.
+@end defun
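+
+ For instance, assuming that @code{emacs-internal} is an alias for
+@code{utf-8-emacs} as described above, a check along these lines
+should return a non-@code{nil} value:
+
+@example
+@group
+(memq 'emacs-internal (coding-system-aliases 'utf-8-emacs))
+@end group
+@end example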
+
@node Encoding and I/O
@subsection Encoding and I/O
The principal purpose of coding systems is for use in reading and
-writing files. The function @code{insert-file-contents} uses
-a coding system for decoding the file data, and @code{write-region}
-uses one to encode the buffer contents.
+writing files. The function @code{insert-file-contents} uses a coding
+system to decode the file data, and @code{write-region} uses one to
+encode the buffer contents.
You can specify the coding system to use either explicitly
(@pxref{Specifying Coding Systems}), or implicitly using a default
Here are the Lisp facilities for working with coding systems:
+@cindex list all coding systems
@defun coding-system-list &optional base-only
This function returns a list of all coding system names (symbols). If
@var{base-only} is non-@code{nil}, the value includes only the
name or @code{nil}.
@end defun
+@cindex validity of coding system
+@cindex coding system, validity check
@defun check-coding-system coding-system
This function checks the validity of @var{coding-system}. If that is
valid, it returns @var{coding-system}. If @var{coding-system} is
(@pxref{Signaling Errors, signal}).
@end defun
+@cindex eol type of coding system
@defun coding-system-eol-type coding-system
This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
conversion used by @var{coding-system}. If @var{coding-system}
eol conversion is set to match it (e.g., DOS-style CRLF format will
imply @code{dos} eol conversion). For encoding, the eol conversion is
taken from the appropriate default coding system (e.g.,
-@code{default-buffer-file-coding-system} for
+the default value of @code{buffer-file-coding-system} for
@code{buffer-file-coding-system}), or from the default eol conversion
appropriate for the underlying platform.
@end defun
+@cindex eol conversion of coding system
@defun coding-system-change-eol-conversion coding-system eol-type
This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @code{eol-type}.
@code{dos} and @code{mac}, respectively.
@end defun
+@cindex text conversion of coding system
@defun coding-system-change-text-conversion eol-coding text-coding
This function returns a coding system which uses the end-of-line
conversion of @var{eol-coding}, and the text conversion of
@code{undecided}, or one of its variants according to @var{eol-coding}.
@end defun
+@cindex safely encode region
+@cindex coding systems for encoding region
@defun find-coding-systems-region from to
This function returns a list of coding systems that could be used to
encode a text between @var{from} and @var{to}. All coding systems in
list @code{(undecided)}.
@end defun
+@cindex safely encode a string
+@cindex coding systems for encoding a string
@defun find-coding-systems-string string
This function returns a list of coding systems that could be used to
encode the text of @var{string}. All coding systems in the list can
@code{(undecided)}.
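+
+For instance, a string made up only of @acronym{ASCII} characters is
+expected to produce that special list (the sample string is
+arbitrary):
+
+@example
+@group
+(find-coding-systems-string "plain ASCII text")
+     @result{} (undecided)
+@end group
+@end example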
@end defun
+@cindex charset, coding systems to encode
+@cindex safely encode characters in a charset
@defun find-coding-systems-for-charsets charsets
This function returns a list of coding systems that could be used to
encode all the character sets in the list @var{charsets}.
@end defun
+@defun check-coding-systems-region start end coding-system-list
+This function checks whether coding systems in the list
+@var{coding-system-list} can encode all the characters in the region
+between @var{start} and @var{end}. If all of the coding systems in
+the list can encode the specified text, the function returns
+@code{nil}. If some coding systems cannot encode some of the
+characters, the value is an alist, each element of which has the form
+@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning
+that @var{coding-system1} cannot encode characters at buffer positions
+@var{pos1}, @var{pos2}, @enddots{}.
+
+@var{start} may be a string, in which case @var{end} is ignored and
+the returned value references string indices instead of buffer
+positions.
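+
+A small sketch of the string form follows; the sample string and the
+choice of coding systems are only illustrative. Since
+@code{us-ascii} cannot encode the last character of the string while
+@code{utf-8} can, only @code{us-ascii} should be reported:
+
+@example
+@group
+(let ((problems (check-coding-systems-region
+                 "caf\u00e9" nil '(us-ascii utf-8))))
+  ;; Each element has the form (CODING-SYSTEM INDEX...);
+  ;; collect just the coding systems that failed.
+  (mapcar 'car problems))
+@end group
+@end example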
+@end defun
+
@defun detect-coding-region start end &optional highest
This function chooses a plausible coding system for decoding the text
from @var{start} to @var{end}. This text should be a byte sequence,
ISO-2022 control characters as @code{ESC}, the value is
@code{undecided} or @code{(undecided)}, or a variant specifying
end-of-line conversion, if that can be deduced from the text.
+
+If the region contains null bytes, the value is @code{no-conversion},
+even if the region contains text encoded in some coding system.
@end defun
@defun detect-coding-string string &optional highest
This function is like @code{detect-coding-region} except that it
operates on the contents of @var{string} instead of bytes in the buffer.
+@end defun
+
+@cindex null bytes, and decoding text
+@defvar inhibit-null-byte-detection
+If this variable has a non-@code{nil} value, null bytes are ignored
+when detecting the encoding of a region or a string. This makes it
+possible to correctly detect the encoding of text that contains null
+bytes, such as Info files with Index nodes.
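+
+One way to use this variable, sketched here with a made-up sample
+string held in the local variable @code{bytes}, is to bind it only
+around the detection call:
+
+@example
+@group
+(let ((bytes "abc\0def")
+      (inhibit-null-byte-detection t))
+  ;; The null byte in BYTES is ignored during detection.
+  (detect-coding-string bytes t))
+@end group
+@end example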
+@end defvar
+
+@defvar inhibit-iso-escape-detection
+If this variable has a non-@code{nil} value, ISO-2022 escape sequences
+are ignored when detecting the encoding of a region or a string. The
+result is that no text is ever detected as encoded in some ISO-2022
+encoding, and all escape sequences become visible in a buffer.
+@strong{Warning:} @emph{Use this variable with extreme caution,
+because many files in the Emacs distribution use ISO-2022 encoding.}
+@end defvar
+
+@cindex charsets supported by a coding system
+@defun coding-system-charset-list coding-system
+This function returns the list of character sets (@pxref{Character
+Sets}) supported by @var{coding-system}. Some coding systems that
+support too many character sets to list them all yield special values:
+@itemize @bullet
+@item
+If @var{coding-system} supports all the ISO-2022 charsets, the value
+is @code{iso-2022}.
+@item
+If @var{coding-system} supports all Emacs characters, the value is
+@code{(emacs)}.
+@item
+If @var{coding-system} supports all emacs-mule characters, the value
+is @code{emacs-mule}.
+@item
+If @var{coding-system} supports all Unicode characters, the value is
+@code{(unicode)}.
+@end itemize
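+
+As an illustration, assuming the standard definitions of the
+@code{iso-latin-1} and @code{utf-8} coding systems:
+
+@example
+@group
+(coding-system-charset-list 'iso-latin-1)
+     @result{} (iso-8859-1)
+(coding-system-charset-list 'utf-8)
+     @result{} (unicode)
+@end group
+@end example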
@end defun
@xref{Coding systems for a subprocess,, Process Information}, in
@var{from} is a string, the string specifies the text to encode, and
@var{to} is ignored.
+If the specified text includes raw bytes (@pxref{Text
+Representations}), @code{select-safe-coding-system} suggests
+@code{raw-text} for its encoding.
+
If @var{default-coding-system} is non-@code{nil}, that is the first
coding system to try; if that can handle the text,
@code{select-safe-coding-system} returns that coding system. It can
also be a list of coding systems; then the function tries each of them
one by one. After trying all of them, it next tries the current
buffer's value of @code{buffer-file-coding-system} (if it is not
-@code{undecided}), then the value of
-@code{default-buffer-file-coding-system} and finally the user's most
+@code{undecided}), then the default value of
+@code{buffer-file-coding-system} and finally the user's most
preferred coding system, which the user can set using the command
@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
Coding Systems, emacs, The GNU Emacs Manual}).
@vindex select-safe-coding-system-accept-default-p
If the variable @code{select-safe-coding-system-accept-default-p} is
-non-@code{nil}, its value overrides the value of
-@var{accept-default-p}.
+non-@code{nil}, it should be a function taking a single argument.
+It is used in place of @var{accept-default-p}, overriding any
+value supplied for this argument.
As a final step, before returning the chosen coding system,
@code{select-safe-coding-system} checks whether that coding system is
@node Default Coding Systems
@subsection Default Coding Systems
+@cindex default coding system
+@cindex coding system, automatically determined
This section describes variables that specify the default coding
system for certain files or when running certain subprograms, and the
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}).
-@defvar auto-coding-regexp-alist
+@cindex file contents, and default coding system
+@defopt auto-coding-regexp-alist
This variable is an alist of text patterns and corresponding coding
systems. Each element has the form @code{(@var{regexp}
. @var{coding-system})}; a file whose first few kilobytes match
@code{file-coding-system-alist} (see below). The default value is set
so that Emacs automatically recognizes mail files in Babyl format and
reads them with no code conversions.
-@end defvar
+@end defopt
-@defvar file-coding-system-alist
+@cindex file name, and default coding system
+@defopt file-coding-system-alist
This variable is an alist that specifies the coding systems to use for
reading and writing particular files. Each element has the form
@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
If @var{coding} (or what is returned by the above function) is
@code{undecided}, the normal code-detection is performed.
-@end defvar
+@end defopt
+
+@defopt auto-coding-alist
+This variable is an alist that specifies the coding systems to use for
+reading and writing particular files. Its form is like that of
+@code{file-coding-system-alist}, but, unlike the latter, this variable
+takes priority over any @code{coding:} tags in the file.
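+
+For example, the following sketch would make files with a
+hypothetical @file{.blob} extension always be read without
+conversion, regardless of any @code{coding:} tag they happen to
+contain:
+
+@example
+@group
+(add-to-list 'auto-coding-alist
+             '("\\.blob\\'" . no-conversion))
+@end group
+@end example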
+@end defopt
+@cindex program name, and default coding system
@defvar process-coding-system-alist
This variable is an alist specifying which coding systems to use for a
subprocess, depending on which program is running in the subprocess. It
the end of line conversion---that is, one like @code{latin-1-unix},
rather than @code{undecided} or @code{latin-1}.
+@cindex port number, and default coding system
+@cindex network service name, and default coding system
@defvar network-coding-system-alist
This variable is an alist that specifies the coding system to use for
network streams. It works much like @code{file-coding-system-alist},
the subprocess, and @var{output-coding} applies to output to it.
@end defvar
-@defvar auto-coding-functions
+@cindex default coding system, functions to determine
+@defopt auto-coding-functions
This variable holds a list of functions that try to determine a
coding system for a file based on its undecoded contents.
If a file has a @samp{coding:} tag, that takes precedence, so these
functions won't be called.
-@end defvar
+@end defopt
+
+@defun find-auto-coding filename size
+This function tries to determine a suitable coding system for
+@var{filename}. It examines the buffer visiting the named file, using
+the variables documented above in sequence, until it finds a match for
+one of the rules specified by these variables. It then returns a cons
+cell of the form @code{(@var{coding} . @var{source})}, where
+@var{coding} is the coding system to use and @var{source} is a symbol,
+one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist},
+@code{:coding}, or @code{auto-coding-functions}, indicating which one
+supplied the matching rule. The value @code{:coding} means the coding
+system was specified by the @code{coding:} tag in the file
+(@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}).
+The order of looking for a matching rule is @code{auto-coding-alist}
+first, then @code{auto-coding-regexp-alist}, then the @code{coding:}
+tag, and lastly @code{auto-coding-functions}. If no matching rule was
+found, the function returns @code{nil}.
+
+The second argument @var{size} is the size of text, in characters,
+following point. The function examines text only within @var{size}
+characters after point. Normally, the buffer should be positioned at
+the beginning when this function is called, because one of the places
+for the @code{coding:} tag is the first one or two lines of the file;
+in that case, @var{size} should be the size of the buffer.
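+
+The following sketch shows one way to call the function on text that
+carries a @code{coding:} tag. The file name @file{example.txt} and
+the tag are made up for illustration:
+
+@example
+@group
+(with-temp-buffer
+  (insert ";; -*- coding: latin-1 -*-\n")
+  (goto-char (point-min))
+  (let ((found (find-auto-coding "example.txt" (buffer-size))))
+    ;; FOUND is nil or a cons cell (CODING . SOURCE).
+    (when found
+      (message "%s selected by %s" (car found) (cdr found)))))
+@end group
+@end example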
+@end defun
+
+@defun set-auto-coding filename size
+This function returns a suitable coding system for file
+@var{filename}. It uses @code{find-auto-coding} to find the coding
+system. If no coding system could be determined, the function returns
+@code{nil}. The meaning of the argument @var{size} is the same as in
+@code{find-auto-coding}.
+@end defun
@defun find-operation-coding-system operation &rest arguments
This function returns the coding system to use (by default) for
affect it.
@end defvar
-@defvar inhibit-eol-conversion
+@defopt inhibit-eol-conversion
When this variable is non-@code{nil}, no end-of-line conversion is done,
no matter which coding system is specified. This applies to all the
Emacs I/O and subprocess primitives, and to the explicit encoding and
decoding functions (@pxref{Explicit Encoding}).
-@end defvar
+@end defopt
+
+@cindex priority order of coding systems
+@cindex coding systems, priority
+ Sometimes, you need to prefer several coding systems for some
+operation, rather than fix a single one. Emacs lets you specify a
+priority order for using coding systems. This ordering affects the
+sorting of lists of coding systems returned by functions such as
+@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
+
+@defun coding-system-priority-list &optional highestp
+This function returns the list of coding systems in the order of their
+current priorities. Optional argument @var{highestp}, if
+non-@code{nil}, means return only the highest priority coding system.
+@end defun
+
+@defun set-coding-system-priority &rest coding-systems
+This function puts @var{coding-systems} at the beginning of the
+priority list for coding systems, thus making their priority higher
+than all the rest.
+@end defun
+
+@defmac with-coding-priority coding-systems &rest body@dots{}
+This macro executes @var{body}, like @code{progn} does
+(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of
+the priority list for coding systems. @var{coding-systems} should be
+a list of coding systems to prefer during execution of @var{body}.
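+
+As a sketch, the form below prefers @code{utf-8} while detecting the
+encoding of a unibyte string (the sample bytes are arbitrary):
+
+@example
+@group
+(with-coding-priority '(utf-8)
+  (detect-coding-string "caf\303\251" t))
+@end group
+@end example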
+@end defmac
@node Explicit Encoding
@subsection Explicit Encoding and Decoding
text. They logically consist of a series of byte values; that is, a
series of @acronym{ASCII} and eight-bit characters. In unibyte
buffers and strings, these characters have codes in the range 0
-through 255. In a multibyte buffer or string, eight-bit characters
-have character codes higher than 255 (@pxref{Text Representations}),
-but Emacs transparently converts them to their single-byte values when
-you encode or decode such text.
+through #xFF (255). In a multibyte buffer or string, eight-bit
+characters have character codes higher than #xFF (@pxref{Text
+Representations}), but Emacs transparently converts them to their
+single-byte values when you encode or decode such text.
The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
buffer remains multibyte if it was multibyte before, and any 8-bit
bytes are converted to their multibyte representation (@pxref{Text
Representations}).
+
+@cindex @code{undecided} coding-system, when encoding
+Do @emph{not} use @code{undecided} for @var{coding-system} when
+encoding text, since that may lead to unexpected results. Instead,
+use @code{select-safe-coding-system} (@pxref{User-Chosen Coding
+Systems, select-safe-coding-system}) to suggest a suitable encoding,
+if there's no obvious pertinent value for @var{coding-system}.
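+
+A minimal sketch of that advice, applied to the whole of the current
+buffer (note that @code{select-safe-coding-system} may prompt the
+user):
+
+@example
+@group
+(let ((coding (select-safe-coding-system (point-min) (point-max))))
+  (encode-coding-region (point-min) (point-max) coding))
+@end group
+@end example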
@end deffn
@defun encode-coding-string string coding-system &optional nocopy buffer
operation is trivial. The result of encoding is a unibyte string.
@end defun
-@deffn Command decode-coding-region start end coding-system destination
+@deffn Command decode-coding-region start end coding-system &optional destination
This command decodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}. To make explicit decoding
useful, the text before decoding ought to be a sequence of byte
If decoded text is inserted in some buffer, this command returns the
length of the decoded text.
+
+This command puts a @code{charset} text property on the decoded text.
+The value of the property states the character set used to decode the
+original text.
@end deffn
@defun decode-coding-string string coding-system &optional nocopy buffer
If optional argument @var{buffer} specifies a buffer, the decoded text
is inserted in that buffer after point (point does not move). In this
case, the return value is the length of the decoded text.
+
+@cindex @code{charset}, text property
+This function puts a @code{charset} text property on the decoded text.
+The value of the property states the character set used to decode the
+original text:
+
+@example
+@group
+(decode-coding-string "Gr\374ss Gott" 'latin-1)
+ @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1))
+@end group
+@end example
@end defun
@defun decode-coding-inserted-region from to filename &optional visit beg end replace
not set @code{last-coding-system-used} for encoding or decoding of
terminal I/O.
-@defun keyboard-coding-system
+@defun keyboard-coding-system &optional terminal
This function returns the coding system that is in use for decoding
-keyboard input---or @code{nil} if no coding system is to be used.
+keyboard input from @var{terminal}---or @code{nil} if no coding system
+is to be used for that terminal. If @var{terminal} is omitted or
+@code{nil}, it means the selected frame's terminal. @xref{Multiple
+Terminals}.
@end defun
-@deffn Command set-keyboard-coding-system coding-system
-This command specifies @var{coding-system} as the coding system to
-use for decoding keyboard input. If @var{coding-system} is @code{nil},
-that means do not decode keyboard input.
+@deffn Command set-keyboard-coding-system coding-system &optional terminal
+This command specifies @var{coding-system} as the coding system to use
+for decoding keyboard input from @var{terminal}. If
+@var{coding-system} is @code{nil}, that means do not decode keyboard
+input. If @var{terminal} is a frame, it means that frame's terminal;
+if it is @code{nil}, that means the currently selected frame's
+terminal. @xref{Multiple Terminals}.
@end deffn
-@defun terminal-coding-system
+@defun terminal-coding-system &optional terminal
This function returns the coding system that is in use for encoding
-terminal output---or @code{nil} for no encoding.
+terminal output from @var{terminal}---or @code{nil} if the output is
+not encoded. If @var{terminal} is a frame, it means that frame's
+terminal; if it is @code{nil}, that means the currently selected
+frame's terminal.
@end defun
-@deffn Command set-terminal-coding-system coding-system
+@deffn Command set-terminal-coding-system coding-system &optional terminal
This command specifies @var{coding-system} as the coding system to use
-for encoding terminal output. If @var{coding-system} is @code{nil},
-that means do not encode terminal output.
+for encoding terminal output from @var{terminal}. If
+@var{coding-system} is @code{nil}, terminal output is not encoded. If
+@var{terminal} is a frame, it means that frame's terminal; if it is
+@code{nil}, that means the currently selected frame's terminal.
@end deffn
@node MS-DOS File Types
Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
+
+Its default value is used to decide how to handle files for which
+@code{file-name-buffer-file-type-alist} says nothing about the type:
+If the default value is non-@code{nil}, then these files are treated as
+binary: the coding system @code{no-conversion} is used. Otherwise,
+nothing special is done for them---the coding system is deduced solely
+from the file contents, in the usual Emacs fashion.
@end defvar
@defopt file-name-buffer-file-type-alist
is used.
If no element in this alist matches a given file name, then
-@code{default-buffer-file-type} says how to treat the file.
-@end defopt
-
-@defopt default-buffer-file-type
-This variable says how to handle files for which
-@code{file-name-buffer-file-type-alist} says nothing about the type.
-
-If this variable is non-@code{nil}, then these files are treated as
-binary: the coding system @code{no-conversion} is used. Otherwise,
-nothing special is done for them---the coding system is deduced solely
-from the file contents, in the usual Emacs fashion.
+the default value of @code{buffer-file-type} says how to treat the file.
@end defopt
@node Input Methods