| 1 | @c -*-texinfo-*- |
| 2 | @c This is part of the GNU Emacs Lisp Reference Manual. |
| 3 | @c Copyright (C) 1998-1999, 2001-2013 Free Software Foundation, Inc. |
| 4 | @c See the file elisp.texi for copying conditions. |
| 5 | @node Non-ASCII Characters |
| 6 | @chapter Non-@acronym{ASCII} Characters |
| 7 | @cindex multibyte characters |
| 8 | @cindex characters, multi-byte |
| 9 | @cindex non-@acronym{ASCII} characters |
| 10 | |
| 11 | This chapter covers the special issues relating to characters and |
| 12 | how they are stored in strings and buffers. |
| 13 | |
| 14 | @menu |
| 15 | * Text Representations:: How Emacs represents text. |
| 16 | * Disabling Multibyte:: Controlling whether to use multibyte characters. |
| 17 | * Converting Representations:: Converting unibyte to multibyte and vice versa. |
| 18 | * Selecting a Representation:: Treating a byte sequence as unibyte or multi. |
| 19 | * Character Codes:: How unibyte and multibyte relate to |
| 20 | codes of individual characters. |
| 21 | * Character Properties:: Character attributes that define their |
| 22 | behavior and handling. |
| 23 | * Character Sets:: The space of possible character codes |
| 24 | is divided into various character sets. |
| 25 | * Scanning Charsets:: Which character sets are used in a buffer? |
| 26 | * Translation of Characters:: Translation tables are used for conversion. |
| 27 | * Coding Systems:: Coding systems are conversions for saving files. |
| 28 | * Input Methods:: Input methods allow users to enter various |
| 29 | non-ASCII characters without special keyboards. |
| 30 | * Locales:: Interacting with the POSIX locale. |
| 31 | @end menu |
| 32 | |
| 33 | @node Text Representations |
| 34 | @section Text Representations |
| 35 | @cindex text representation |
| 36 | |
| 37 | Emacs buffers and strings support a large repertoire of characters |
| 38 | from many different scripts, allowing users to type and display text |
| 39 | in almost any known written language. |
| 40 | |
| 41 | @cindex character codepoint |
| 42 | @cindex codespace |
| 43 | @cindex Unicode |
| 44 | To support this multitude of characters and scripts, Emacs closely |
| 45 | follows the @dfn{Unicode Standard}. The Unicode Standard assigns a |
| 46 | unique number, called a @dfn{codepoint}, to each and every character. |
| 47 | The range of codepoints defined by Unicode, or the Unicode |
| 48 | @dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation), |
| 49 | inclusive. Emacs extends this range with codepoints in the range |
| 50 | @code{#x110000..#x3FFFFF}, which it uses for representing characters |
| 51 | that are not unified with Unicode and @dfn{raw 8-bit bytes} that |
| 52 | cannot be interpreted as characters. Thus, a character codepoint in |
| 53 | Emacs is a 22-bit integer number. |
| 54 | |
| 55 | @cindex internal representation of characters |
| 56 | @cindex characters, representation in buffers and strings |
| 57 | @cindex multibyte text |
| 58 | To conserve memory, Emacs does not hold fixed-length 22-bit numbers |
| 59 | that are codepoints of text characters within buffers and strings. |
| 60 | Rather, Emacs uses a variable-length internal representation of |
| 61 | characters that stores each character as a sequence of 1 to 5 8-bit |
| 62 | bytes, depending on the magnitude of its codepoint@footnote{ |
| 63 | This internal representation is based on one of the encodings defined |
| 64 | by the Unicode Standard, called @dfn{UTF-8}, for representing any |
| 65 | Unicode codepoint, but Emacs extends UTF-8 to represent the additional |
| 66 | codepoints it uses for raw 8-bit bytes and characters not unified with |
| 67 | Unicode.}. For example, any @acronym{ASCII} character takes up only 1 |
| 68 | byte, a Latin-1 character takes up 2 bytes, etc. We call this |
| 69 | representation of text @dfn{multibyte}. |
| 70 | |
| 71 | Outside Emacs, characters can be represented in many different |
| 72 | encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts |
| 73 | between these external encodings and its internal representation, as |
| 74 | appropriate, when it reads text into a buffer or a string, or when it |
| 75 | writes text to a disk file or passes it to some other process. |
| 76 | |
| 77 | Occasionally, Emacs needs to hold and manipulate encoded text or |
| 78 | binary non-text data in its buffers or strings. For example, when |
| 79 | Emacs visits a file, it first reads the file's text verbatim into a |
| 80 | buffer, and only then converts it to the internal representation. |
| 81 | Before the conversion, the buffer holds encoded text. |
| 82 | |
| 83 | @cindex unibyte text |
| 84 | Encoded text is not really text, as far as Emacs is concerned, but |
| 85 | rather a sequence of raw 8-bit bytes. We call buffers and strings |
| 86 | that hold encoded text @dfn{unibyte} buffers and strings, because |
| 87 | Emacs treats them as a sequence of individual bytes. Usually, Emacs |
| 88 | displays unibyte buffers and strings as octal codes such as |
| 89 | @code{\237}. We recommend that you never use unibyte buffers and |
| 90 | strings except for manipulating encoded text or binary non-text data. |
| 91 | |
| 92 | In a buffer, the buffer-local value of the variable |
| 93 | @code{enable-multibyte-characters} specifies the representation used. |
| 94 | The representation for a string is determined and recorded in the string |
| 95 | when the string is constructed. |
| 96 | |
| 97 | @defvar enable-multibyte-characters |
| 98 | This variable specifies the current buffer's text representation. |
| 99 | If it is non-@code{nil}, the buffer contains multibyte text; otherwise, |
| 100 | it contains unibyte encoded text or binary non-text data. |
| 101 | |
| 102 | You cannot set this variable directly; instead, use the function |
| 103 | @code{set-buffer-multibyte} to change a buffer's representation. |
| 104 | @end defvar |
| 105 | |
| 106 | @defun position-bytes position |
| 107 | Buffer positions are measured in character units. This function |
| 108 | returns the byte-position corresponding to buffer position |
| 109 | @var{position} in the current buffer. This is 1 at the start of the |
| 110 | buffer, and counts upward in bytes. If @var{position} is out of |
| 111 | range, the value is @code{nil}. |
| 112 | @end defun |
| 113 | |
| 114 | @defun byte-to-position byte-position |
| 115 | Return the buffer position, in character units, corresponding to given |
| 116 | @var{byte-position} in the current buffer. If @var{byte-position} is |
| 117 | out of range, the value is @code{nil}. In a multibyte buffer, an |
| 118 | arbitrary value of @var{byte-position} might not fall on a character |
| 119 | boundary, but inside a multibyte sequence representing a single |
| 120 | character; in this case, this function returns the buffer position of |
| 121 | the character whose multibyte sequence includes @var{byte-position}. |
| 122 | In other words, the value is the same for all byte positions that |
| 123 | belong to the same character. |
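|  | |
|  | As a rough sketch of how the two mappings relate, assuming a multibyte |
|  | temporary buffer in which the character @code{U+00E9} occupies two |
|  | bytes, one would expect: |
|  | |
|  | @example |
|  | @group |
|  | ;; Bytes 1 and 2 hold U+00E9; byte 3 holds the letter a. |
|  | (with-temp-buffer |
|  |   (insert "\u00e9a") |
|  |   (list (position-bytes 2) |
|  |         (byte-to-position 2) |
|  |         (byte-to-position 3))) |
|  | @result{} (3 1 2) |
|  | @end group |
|  | @end example |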
| 124 | @end defun |
| 125 | |
| 126 | @defun multibyte-string-p string |
| 127 | Return @code{t} if @var{string} is a multibyte string, @code{nil} |
| 128 | otherwise. This function also returns @code{nil} if @var{string} is |
| 129 | some object other than a string. |
| 130 | @end defun |
| 131 | |
| 132 | @defun string-bytes string |
| 133 | @cindex string, number of bytes |
| 134 | This function returns the number of bytes in @var{string}. |
| 135 | If @var{string} is a multibyte string, this can be greater than |
| 136 | @code{(length @var{string})}. |
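|  | |
|  | For instance, assuming the string literal below is read as multibyte |
|  | text (the @samp{\u} escapes should ensure that), each accented letter |
|  | counts as one character but two bytes: |
|  | |
|  | @example |
|  | @group |
|  | (let ((s "\u00e9t\u00e9")) |
|  |   (list (multibyte-string-p s) (length s) (string-bytes s))) |
|  | @result{} (t 3 5) |
|  | @end group |
|  | @end example |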
| 137 | @end defun |
| 138 | |
| 139 | @defun unibyte-string &rest bytes |
| 140 | This function concatenates all its argument @var{bytes} and makes the |
| 141 | result a unibyte string. |
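|  | |
|  | A minimal illustration, with byte values chosen arbitrarily; in a |
|  | unibyte string the character count and the byte count should agree: |
|  | |
|  | @example |
|  | @group |
|  | (let ((s (unibyte-string 65 66 255))) |
|  |   (list (multibyte-string-p s) (length s) (string-bytes s))) |
|  | @result{} (nil 3 3) |
|  | @end group |
|  | @end example |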
| 142 | @end defun |
| 143 | |
| 144 | @node Disabling Multibyte |
| 145 | @section Disabling Multibyte Characters |
| 146 | @cindex disabling multibyte |
| 147 | |
| 148 | By default, Emacs starts in multibyte mode: it stores the contents |
| 149 | of buffers and strings using an internal encoding that represents |
| 150 | non-@acronym{ASCII} characters using multi-byte sequences. Multibyte |
| 151 | mode allows you to use all the supported languages and scripts without |
| 152 | limitations. |
| 153 | |
| 154 | @cindex turn multibyte support on or off |
| 155 | Under very special circumstances, you may want to disable multibyte |
| 156 | character support for a specific buffer. |
| 157 | When multibyte characters are disabled in a buffer, we call |
| 158 | that @dfn{unibyte mode}. In unibyte mode, each character in the |
| 159 | buffer has a character code ranging from 0 through 255 (0377 octal); 0 |
| 160 | through 127 (0177 octal) represent @acronym{ASCII} characters, and 128 |
| 161 | (0200 octal) through 255 (0377 octal) represent non-@acronym{ASCII} |
| 162 | characters. |
| 163 | |
| 164 | To edit a particular file in unibyte representation, visit it using |
| 165 | @code{find-file-literally}. @xref{Visiting Functions}. You can |
| 166 | convert a multibyte buffer to unibyte by saving it to a file, killing |
| 167 | the buffer, and visiting the file again with |
| 168 | @code{find-file-literally}. Alternatively, you can use @kbd{C-x |
| 169 | @key{RET} c} (@code{universal-coding-system-argument}) and specify |
| 170 | @samp{raw-text} as the coding system with which to visit or save a |
| 171 | file. @xref{Text Coding, , Specifying a Coding System for File Text, |
| 172 | emacs, GNU Emacs Manual}. Unlike @code{find-file-literally}, finding |
| 173 | a file as @samp{raw-text} doesn't disable format conversion, |
| 174 | uncompression, or auto mode selection. |
| 175 | |
| 176 | @c See http://debbugs.gnu.org/11226 for lack of unibyte tooltip. |
| 177 | @vindex enable-multibyte-characters |
| 178 | The buffer-local variable @code{enable-multibyte-characters} is |
| 179 | non-@code{nil} in multibyte buffers, and @code{nil} in unibyte ones. |
| 180 | The mode line also indicates whether a buffer is multibyte or not. |
| 181 | With a graphical display, in a multibyte buffer, the portion of the |
| 182 | mode line that indicates the character set has a tooltip that (amongst |
| 183 | other things) says that the buffer is multibyte. In a unibyte buffer, |
| 184 | the character set indicator is absent. Thus, in a unibyte buffer |
| 185 | (when using a graphical display) there is normally nothing before the |
| 186 | indication of the visited file's end-of-line convention (colon, |
| 187 | backslash, etc.), unless you are using an input method. |
| 188 | |
| 189 | @findex toggle-enable-multibyte-characters |
| 190 | You can turn off multibyte support in a specific buffer by invoking the |
| 191 | command @code{toggle-enable-multibyte-characters} in that buffer. |
| 192 | |
| 193 | @node Converting Representations |
| 194 | @section Converting Text Representations |
| 195 | |
| 196 | Emacs can convert unibyte text to multibyte; it can also convert |
| 197 | multibyte text to unibyte, provided that the multibyte text contains |
| 198 | only @acronym{ASCII} and 8-bit raw bytes. In general, these |
| 199 | conversions happen when inserting text into a buffer, or when putting |
| 200 | text from several strings together in one string. You can also |
| 201 | explicitly convert a string's contents to either representation. |
| 202 | |
| 203 | Emacs chooses the representation for a string based on the text from |
| 204 | which it is constructed. The general rule is to convert unibyte text |
| 205 | to multibyte text when combining it with other multibyte text, because |
| 206 | the multibyte representation is more general and can hold whatever |
| 207 | characters the unibyte text has. |
| 208 | |
| 209 | When inserting text into a buffer, Emacs converts the text to the |
| 210 | buffer's representation, as specified by |
| 211 | @code{enable-multibyte-characters} in that buffer. In particular, when |
| 212 | you insert multibyte text into a unibyte buffer, Emacs converts the text |
| 213 | to unibyte, even though this conversion cannot in general preserve all |
| 214 | the characters that might be in the multibyte text. The other natural |
| 215 | alternative, to convert the buffer contents to multibyte, is not |
| 216 | acceptable because the buffer's representation is a choice made by the |
| 217 | user that cannot be overridden automatically. |
| 218 | |
| 219 | Converting unibyte text to multibyte text leaves @acronym{ASCII} |
| 220 | characters unchanged, and converts bytes with codes 128 through 255 to |
| 221 | the multibyte representation of raw eight-bit bytes. |
| 222 | |
| 223 | Converting multibyte text to unibyte converts all @acronym{ASCII} |
| 224 | and eight-bit characters to their single-byte form, but loses |
| 225 | information for non-@acronym{ASCII} characters by discarding all but |
| 226 | the low 8 bits of each character's codepoint. Converting unibyte text |
| 227 | to multibyte and back to unibyte reproduces the original unibyte text. |
| 228 | |
| 229 | The next two functions either return the argument @var{string}, or a |
| 230 | newly created string with no text properties. |
| 231 | |
| 232 | @defun string-to-multibyte string |
| 233 | This function returns a multibyte string containing the same sequence |
| 234 | of characters as @var{string}. If @var{string} is a multibyte string, |
| 235 | it is returned unchanged. The function assumes that @var{string} |
| 236 | includes only @acronym{ASCII} characters and raw 8-bit bytes; the |
| 237 | latter are converted to their multibyte representation corresponding |
| 238 | to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive |
| 239 | (@pxref{Text Representations, codepoints}). |
| 240 | @end defun |
| 241 | |
| 242 | @defun string-to-unibyte string |
| 243 | This function returns a unibyte string containing the same sequence of |
| 244 | characters as @var{string}. It signals an error if @var{string} |
| 245 | contains a character that is neither @acronym{ASCII} nor an |
| 246 | eight-bit raw byte. If @var{string} is a unibyte string, it is |
| 247 | returned unchanged. Use this function for @var{string} arguments |
| 248 | that contain only @acronym{ASCII} and eight-bit characters. |
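|  | |
|  | As a sketch of the round trip through @code{string-to-multibyte} and |
|  | back (the byte values are arbitrary), the raw byte 200 should map to |
|  | the codepoint @code{#x3FFFC8} (4194248) and then back to the original |
|  | byte: |
|  | |
|  | @example |
|  | @group |
|  | (let* ((raw (unibyte-string 65 200)) |
|  |        (multi (string-to-multibyte raw))) |
|  |   (list (multibyte-string-p multi) |
|  |         (aref multi 1) |
|  |         (equal (string-to-unibyte multi) raw))) |
|  | @result{} (t 4194248 t) |
|  | @end group |
|  | @end example |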
| 249 | @end defun |
| 250 | |
| 252 | @defun byte-to-string byte |
| 253 | @cindex byte to string |
| 254 | This function returns a unibyte string containing a single byte of |
| 255 | character data, @var{byte}. It signals an error if |
| 256 | @var{byte} is not an integer between 0 and 255. |
| 257 | @end defun |
| 258 | |
| 259 | @defun multibyte-char-to-unibyte char |
| 260 | This converts the multibyte character @var{char} to a unibyte |
| 261 | character, and returns that character. If @var{char} is neither |
| 262 | @acronym{ASCII} nor eight-bit, the function returns -1. |
| 263 | @end defun |
| 264 | |
| 265 | @defun unibyte-char-to-multibyte char |
| 266 | This converts the unibyte character @var{char} to a multibyte |
| 267 | character, assuming @var{char} is either @acronym{ASCII} or raw 8-bit |
| 268 | byte. |
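|  | |
|  | The following sketch shows this function together with |
|  | @code{multibyte-char-to-unibyte}; the Greek letter @code{U+03B1} is |
|  | used merely as an example of a character that is neither |
|  | @acronym{ASCII} nor eight-bit: |
|  | |
|  | @example |
|  | @group |
|  | (unibyte-char-to-multibyte 200) |
|  | @result{} 4194248 |
|  | @end group |
|  | @group |
|  | (multibyte-char-to-unibyte 4194248) |
|  | @result{} 200 |
|  | @end group |
|  | @group |
|  | (multibyte-char-to-unibyte ?\u03b1) |
|  | @result{} -1 |
|  | @end group |
|  | @end example |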
| 269 | @end defun |
| 270 | |
| 271 | @node Selecting a Representation |
| 272 | @section Selecting a Representation |
| 273 | |
| 274 | Sometimes it is useful to examine an existing buffer or string as |
| 275 | multibyte when it was unibyte, or vice versa. |
| 276 | |
| 277 | @defun set-buffer-multibyte multibyte |
| 278 | Set the representation type of the current buffer. If @var{multibyte} |
| 279 | is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte} |
| 280 | is @code{nil}, the buffer becomes unibyte. |
| 281 | |
| 282 | This function leaves the buffer contents unchanged when viewed as a |
| 283 | sequence of bytes. As a consequence, it can change the contents |
| 284 | viewed as characters; for instance, a sequence of three bytes which is |
| 285 | treated as one character in multibyte representation will count as |
| 286 | three characters in unibyte representation. Eight-bit characters |
| 287 | representing raw bytes are an exception. They are represented by one |
| 288 | byte in a unibyte buffer, but when the buffer is set to multibyte, |
| 289 | they are converted to two-byte sequences, and vice versa. |
| 290 | |
| 291 | This function sets @code{enable-multibyte-characters} to record which |
| 292 | representation is in use. It also adjusts various data in the buffer |
| 293 | (including overlays, text properties and markers) so that they cover the |
| 294 | same text as they did before. |
| 295 | |
| 296 | This function signals an error if the buffer is narrowed, since the |
| 297 | narrowing might have occurred in the middle of multibyte character |
| 298 | sequences. |
| 299 | |
| 300 | This function also signals an error if the buffer is an indirect |
| 301 | buffer. An indirect buffer always inherits the representation of its |
| 302 | base buffer. |
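|  | |
|  | For example, assuming a temporary buffer that starts out multibyte, |
|  | the same two bytes should count as one character before the switch |
|  | and as two characters after it: |
|  | |
|  | @example |
|  | @group |
|  | (with-temp-buffer |
|  |   (insert "\u00e9") |
|  |   (list (buffer-size) |
|  |         (progn (set-buffer-multibyte nil) |
|  |                (buffer-size)))) |
|  | @result{} (1 2) |
|  | @end group |
|  | @end example |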
| 303 | @end defun |
| 304 | |
| 305 | @defun string-as-unibyte string |
| 306 | If @var{string} is already a unibyte string, this function returns |
| 307 | @var{string} itself. Otherwise, it returns a new string with the same |
| 308 | bytes as @var{string}, but treating each byte as a separate character |
| 309 | (so that the value may have more characters than @var{string}); as an |
| 310 | exception, each eight-bit character representing a raw byte is |
| 311 | converted into a single byte. The newly-created string contains no |
| 312 | text properties. |
| 313 | @end defun |
| 314 | |
| 315 | @defun string-as-multibyte string |
| 316 | If @var{string} is a multibyte string, this function returns |
| 317 | @var{string} itself. Otherwise, it returns a new string with the same |
| 318 | bytes as @var{string}, but treating each multibyte sequence as one |
| 319 | character. This means that the value may have fewer characters than |
| 320 | @var{string} has. If a byte sequence in @var{string} is invalid as a |
| 321 | multibyte representation of a single character, each byte in the |
| 322 | sequence is treated as a raw 8-bit byte. The newly-created string |
| 323 | contains no text properties. |
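|  | |
|  | A small sketch of how this function and @code{string-as-unibyte} |
|  | reinterpret the same bytes, using a one-character multibyte string as |
|  | the starting point: |
|  | |
|  | @example |
|  | @group |
|  | (let ((s "\u00e9")) |
|  |   (list (length s) |
|  |         (length (string-as-unibyte s)) |
|  |         (length (string-as-multibyte (string-as-unibyte s))))) |
|  | @result{} (1 2 1) |
|  | @end group |
|  | @end example |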
| 324 | @end defun |
| 325 | |
| 326 | @node Character Codes |
| 327 | @section Character Codes |
| 328 | @cindex character codes |
| 329 | |
| 330 | The unibyte and multibyte text representations use different |
| 331 | character codes. The valid character codes for unibyte representation |
| 332 | range from 0 to @code{#xFF} (255)---the values that can fit in one |
| 333 | byte. The valid character codes for multibyte representation range |
| 334 | from 0 to @code{#x3FFFFF}. In this code space, values 0 through |
| 335 | @code{#x7F} (127) are for @acronym{ASCII} characters, and values |
| 336 | @code{#x80} (128) through @code{#x3FFF7F} (4194175) are for |
| 337 | non-@acronym{ASCII} characters. |
| 338 | |
| 339 | Emacs character codes are a superset of the Unicode standard. |
| 340 | Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode |
| 341 | characters of the same codepoint; values @code{#x110000} (1114112) |
| 342 | through @code{#x3FFF7F} (4194175) represent characters that are not |
| 343 | unified with Unicode; and values @code{#x3FFF80} (4194176) through |
| 344 | @code{#x3FFFFF} (4194303) represent eight-bit raw bytes. |
| 345 | |
| 346 | @defun characterp charcode |
| 347 | This returns @code{t} if @var{charcode} is a valid character, and |
| 348 | @code{nil} otherwise. |
| 349 | |
| 350 | @example |
| 351 | @group |
| 352 | (characterp 65) |
| 353 | @result{} t |
| 354 | @end group |
| 355 | @group |
| 356 | (characterp 4194303) |
| 357 | @result{} t |
| 358 | @end group |
| 359 | @group |
| 360 | (characterp 4194304) |
| 361 | @result{} nil |
| 362 | @end group |
| 363 | @end example |
| 364 | @end defun |
| 365 | |
| 366 | @cindex maximum value of character codepoint |
| 367 | @cindex codepoint, largest value |
| 368 | @defun max-char |
| 369 | This function returns the largest value that a valid character |
| 370 | codepoint can have. |
| 371 | |
| 372 | @example |
| 373 | @group |
| 374 | (characterp (max-char)) |
| 375 | @result{} t |
| 376 | @end group |
| 377 | @group |
| 378 | (characterp (1+ (max-char))) |
| 379 | @result{} nil |
| 380 | @end group |
| 381 | @end example |
| 382 | @end defun |
| 383 | |
| 384 | @defun get-byte &optional pos string |
| 385 | This function returns the byte at character position @var{pos} in the |
| 386 | current buffer. If the current buffer is unibyte, this is literally |
| 387 | the byte at that position. If the buffer is multibyte, byte values of |
| 388 | @acronym{ASCII} characters are the same as character codepoints, |
| 389 | whereas eight-bit raw bytes are converted to their 8-bit codes. The |
| 390 | function signals an error if the character at @var{pos} is |
| 391 | non-@acronym{ASCII}. |
| 392 | |
| 393 | The optional argument @var{string} means to get a byte value from that |
| 394 | string instead of the current buffer. |
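|  | |
|  | As an illustration, note that buffer positions start at 1 while |
|  | string positions start at 0: |
|  | |
|  | @example |
|  | @group |
|  | (with-temp-buffer |
|  |   (insert "A") |
|  |   (get-byte 1)) |
|  | @result{} 65 |
|  | @end group |
|  | @group |
|  | (get-byte 0 (unibyte-string 200)) |
|  | @result{} 200 |
|  | @end group |
|  | @end example |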
| 395 | @end defun |
| 396 | |
| 397 | @node Character Properties |
| 398 | @section Character Properties |
| 399 | @cindex character properties |
| 400 | A @dfn{character property} is a named attribute of a character that |
| 401 | specifies how the character behaves and how it should be handled |
| 402 | during text processing and display. Thus, character properties are an |
| 403 | important part of specifying the character's semantics. |
| 404 | |
| 405 | @c FIXME: Use the latest URI of this chapter? |
| 406 | @c http://www.unicode.org/versions/latest/ch04.pdf |
| 407 | On the whole, Emacs follows the Unicode Standard in its implementation |
| 408 | of character properties. In particular, Emacs supports the |
| 409 | @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property |
| 410 | Model}, and the Emacs character property database is derived from the |
| 411 | Unicode Character Database (@acronym{UCD}). See the |
| 412 | @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character |
| 413 | Properties chapter of the Unicode Standard}, for a detailed |
| 414 | description of Unicode character properties and their meaning. This |
| 415 | section assumes you are already familiar with that chapter of the |
| 416 | Unicode Standard, and want to apply that knowledge to Emacs Lisp |
| 417 | programs. |
| 418 | |
| 419 | In Emacs, each property has a name, which is a symbol, and a set of |
| 420 | possible values, whose types depend on the property; if a character |
| 421 | does not have a certain property, the value is @code{nil}. As a |
| 422 | general rule, the names of character properties in Emacs are produced |
| 423 | from the corresponding Unicode properties by downcasing them and |
| 424 | replacing each @samp{_} character with a dash @samp{-}. For example, |
| 425 | @code{Canonical_Combining_Class} becomes |
| 426 | @code{canonical-combining-class}. However, sometimes we shorten the |
| 427 | names to make their use easier. |
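|  | |
|  | As a sketch of the naming rule, the Unicode property |
|  | @code{Canonical_Combining_Class} can be queried with |
|  | @code{get-char-code-property} under its Emacs name; the characters |
|  | below (a combining acute accent and a Latin capital letter) are |
|  | chosen only for illustration: |
|  | |
|  | @example |
|  | @group |
|  | ;; combining acute accent |
|  | (get-char-code-property ?\u0301 'canonical-combining-class) |
|  | @result{} 230 |
|  | @end group |
|  | @group |
|  | (get-char-code-property ?A 'canonical-combining-class) |
|  | @result{} 0 |
|  | @end group |
|  | @end example |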
| 428 | |
| 429 | @cindex unassigned character codepoints |
| 430 | Some codepoints are left @dfn{unassigned} by the |
| 431 | @acronym{UCD}---they don't correspond to any character. The Unicode |
| 432 | Standard defines default values of properties for such codepoints; |
| 433 | they are mentioned below for each property. |
| 434 | |
| 435 | Here is the full list of value types for all the character |
| 436 | properties that Emacs knows about: |
| 437 | |
| 438 | @table @code |
| 439 | @item name |
| 440 | Corresponds to the @code{Name} Unicode property. The value is a |
| 441 | string consisting of upper-case Latin letters A to Z, digits, spaces, |
| 442 | and hyphen @samp{-} characters. For unassigned codepoints, the value |
| 443 | is an empty string. |
| 444 | |
| 445 | @cindex unicode general category |
| 446 | @item general-category |
| 447 | Corresponds to the @code{General_Category} Unicode property. The |
| 448 | value is a symbol whose name is a 2-letter abbreviation of the |
| 449 | character's classification. For unassigned codepoints, the value |
| 450 | is @code{Cn}. |
| 451 | |
| 452 | @item canonical-combining-class |
| 453 | Corresponds to the @code{Canonical_Combining_Class} Unicode property. |
| 454 | The value is an integer number. For unassigned codepoints, the value |
| 455 | is zero. |
| 456 | |
| 457 | @cindex bidirectional class of characters |
| 458 | @item bidi-class |
| 459 | Corresponds to the Unicode @code{Bidi_Class} property. The value is a |
| 460 | symbol whose name is the Unicode @dfn{directional type} of the |
| 461 | character. Emacs uses this property when it reorders bidirectional |
| 462 | text for display (@pxref{Bidirectional Display}). For unassigned |
| 463 | codepoints, the value depends on the code block to which the |
| 464 | codepoint belongs: most unassigned codepoints get the value of |
| 465 | @code{L} (strong L), but some get values of @code{AL} (Arabic letter) |
| 466 | or @code{R} (strong R). |
| 467 | |
| 468 | @item decomposition |
| 469 | Corresponds to the Unicode properties @code{Decomposition_Type} and |
| 470 | @code{Decomposition_Value}. The value is a list, whose first element |
| 471 | may be a symbol representing a compatibility formatting tag, such as |
| 472 | @code{small}@footnote{The Unicode specification writes these tag names |
| 473 | inside @samp{<..>} brackets, but the tag names in Emacs do not include |
| 474 | the brackets; e.g., Unicode specifies @samp{<small>} where Emacs uses |
| 475 | @samp{small}. }; the other elements are characters that give the |
| 476 | compatibility decomposition sequence of this character. For |
| 477 | unassigned codepoints, the value is the character itself. |
| 478 | |
| 479 | @item decimal-digit-value |
| 480 | Corresponds to the Unicode @code{Numeric_Value} property for |
| 481 | characters whose @code{Numeric_Type} is @samp{Decimal}. The value is |
| 482 | an integer number. For unassigned codepoints, the value is |
| 483 | @code{nil}, which means @acronym{NaN}, or ``not-a-number''. |
| 484 | |
| 485 | @item digit-value |
| 486 | Corresponds to the Unicode @code{Numeric_Value} property for |
| 487 | characters whose @code{Numeric_Type} is @samp{Digit}. The value is an |
| 488 | integer number. Examples of such characters include compatibility |
| 489 | subscript and superscript digits, for which the value is the |
| 490 | corresponding number. For unassigned codepoints, the value is |
| 491 | @code{nil}, which means @acronym{NaN}. |
| 492 | |
| 493 | @item numeric-value |
| 494 | Corresponds to the Unicode @code{Numeric_Value} property for |
| 495 | characters whose @code{Numeric_Type} is @samp{Numeric}. The value of |
| 496 | this property is an integer or a floating-point number. Examples of |
| 497 | characters that have this property include fractions, subscripts, |
| 498 | superscripts, Roman numerals, currency numerators, and encircled |
| 499 | numbers. For example, the value of this property for the character |
| 500 | @code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}. For |
| 501 | unassigned codepoints, the value is @code{nil}, which means |
| 502 | @acronym{NaN}. |
| 503 | |
| 504 | @cindex mirroring of characters |
| 505 | @item mirrored |
| 506 | Corresponds to the Unicode @code{Bidi_Mirrored} property. The value |
| 507 | of this property is a symbol, either @code{Y} or @code{N}. For |
| 508 | unassigned codepoints, the value is @code{N}. |
| 509 | |
| 510 | @item mirroring |
| 511 | Corresponds to the Unicode @code{Bidi_Mirroring_Glyph} property. The |
| 512 | value of this property is a character whose glyph represents the |
| 513 | mirror image of the character's glyph, or @code{nil} if there's no |
| 514 | defined mirroring glyph. All the characters whose @code{mirrored} |
| 515 | property is @code{N} have @code{nil} as their @code{mirroring} |
| 516 | property; however, some characters whose @code{mirrored} property is |
| 517 | @code{Y} also have @code{nil} for @code{mirroring}, because no |
| 518 | appropriate characters exist with mirrored glyphs. Emacs uses this |
| 519 | property to display mirror images of characters when appropriate |
| 520 | (@pxref{Bidirectional Display}). For unassigned codepoints, the value |
| 521 | is @code{nil}. |
| 522 | |
| 523 | @item old-name |
| 524 | Corresponds to the Unicode @code{Unicode_1_Name} property. The value |
| 525 | is a string. For unassigned codepoints, the value is an empty string. |
| 526 | |
| 527 | @item iso-10646-comment |
| 528 | Corresponds to the Unicode @code{ISO_Comment} property. The value is |
| 529 | a string. For unassigned codepoints, the value is an empty string. |
| 530 | |
| 531 | @item uppercase |
| 532 | Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property. |
| 533 | The value of this property is a single character. For unassigned |
| 534 | codepoints, the value is @code{nil}, which means the character itself. |
| 535 | |
| 536 | @item lowercase |
| 537 | Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property. |
| 538 | The value of this property is a single character. For unassigned |
| 539 | codepoints, the value is @code{nil}, which means the character itself. |
| 540 | |
| 541 | @item titlecase |
| 542 | Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property. |
| 543 | @dfn{Title case} is a special form of a character used when the first |
| 544 | character of a word needs to be capitalized. The value of this |
| 545 | property is a single character. For unassigned codepoints, the value |
| 546 | is @code{nil}, which means the character itself. |
| 547 | @end table |
| 548 | |
| 549 | @defun get-char-code-property char propname |
| 550 | This function returns the value of @var{char}'s @var{propname} property. |
| 551 | |
| 554 | @example |
| 555 | @group |
| 556 | (get-char-code-property ?\s 'general-category) |
| 557 | @result{} Zs |
| 558 | @end group |
| 559 | @group |
| 560 | (get-char-code-property ?1 'general-category) |
| 561 | @result{} Nd |
| 562 | @end group |
| 563 | @group |
| 564 | ;; subscript 4 |
| 565 | (get-char-code-property ?\u2084 'digit-value) |
| 566 | @result{} 4 |
| 567 | @end group |
| 568 | @group |
| 569 | ;; one fifth |
| 570 | (get-char-code-property ?\u2155 'numeric-value) |
| 571 | @result{} 0.2 |
| 572 | @end group |
| 573 | @group |
| 574 | ;; Roman IV |
| 575 | (get-char-code-property ?\u2163 'numeric-value) |
| 576 | @result{} 4 |
| 577 | @end group |
| 578 | @end example |
| 579 | @end defun |
| 580 | |
| 581 | @defun char-code-property-description prop value |
| 582 | This function returns the description string of property @var{prop}'s |
| 583 | @var{value}, or @code{nil} if @var{value} has no description. |
| 584 | |
| 585 | @example |
| 586 | @group |
| 587 | (char-code-property-description 'general-category 'Zs) |
| 588 | @result{} "Separator, Space" |
| 589 | @end group |
| 590 | @group |
| 591 | (char-code-property-description 'general-category 'Nd) |
| 592 | @result{} "Number, Decimal Digit" |
| 593 | @end group |
| 594 | @group |
| 595 | (char-code-property-description 'numeric-value '1/5) |
| 596 | @result{} nil |
| 597 | @end group |
| 598 | @end example |
| 599 | @end defun |
| 600 | |
| 601 | @defun put-char-code-property char propname value |
| 602 | This function stores @var{value} as the value of the property |
| 603 | @var{propname} for the character @var{char}. |
| 604 | @end defun |
| 605 | |
| 606 | @defvar unicode-category-table |
| 607 | The value of this variable is a char-table (@pxref{Char-Tables}) that |
| 608 | specifies, for each character, its Unicode @code{General_Category} |
| 609 | property as a symbol. |
| 610 | @end defvar |
| 611 | |
| 612 | @defvar char-script-table |
| 613 | The value of this variable is a char-table that specifies, for each |
| 614 | character, a symbol whose name is the script to which the character |
| 615 | belongs, according to the Unicode Standard classification of the |
| 616 | Unicode code space into script-specific blocks. This char-table has a |
| 617 | single extra slot whose value is the list of all script symbols. |
| 618 | @end defvar |
| 619 | |
| 620 | @defvar char-width-table |
| 621 | The value of this variable is a char-table that specifies the width of |
| 622 | each character in columns that it will occupy on the screen. |
| 623 | @end defvar |
| 624 | |
| 625 | @defvar printable-chars |
| 626 | The value of this variable is a char-table that specifies, for each |
| 627 | character, whether it is printable or not. That is, if evaluating |
| 628 | @code{(aref printable-chars char)} results in @code{t}, the character |
| 629 | is printable, and if it results in @code{nil}, it is not. |
| 630 | @end defvar |
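
  The char-tables described above can be queried directly with
@code{aref} (@pxref{Char-Tables}).  The following is a minimal sketch;
the characters are chosen arbitrarily, and the results shown assume
that the tables still have their default contents.

@example
@group
(aref unicode-category-table ?a)
     @result{} Ll
@end group
@group
(aref char-script-table ?a)
     @result{} latin
@end group
@group
;; Greek small letter alpha
(aref char-script-table ?\u03b1)
     @result{} greek
@end group
@group
;; A CJK ideograph occupies two columns.
(aref char-width-table ?\u4e2d)
     @result{} 2
@end group
@group
(aref printable-chars ?a)
     @result{} t
@end group
@end example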
| 631 | |
| 632 | @node Character Sets |
| 633 | @section Character Sets |
| 634 | @cindex character sets |
| 635 | |
| 636 | @cindex charset |
| 637 | @cindex coded character set |
| 638 | An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters |
| 639 | in which each character is assigned a numeric code point. (The |
| 640 | Unicode Standard calls this a @dfn{coded character set}.) Each Emacs |
| 641 | charset has a name which is a symbol. A single character can belong |
| 642 | to any number of different character sets, but it will generally have |
| 643 | a different code point in each charset. Examples of character sets |
| 644 | include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and |
| 645 | @code{windows-1255}. The code point assigned to a character in a |
| 646 | charset is usually different from its code point used in Emacs buffers |
| 647 | and strings. |
| 648 | |
| 649 | @cindex @code{emacs}, a charset |
| 650 | @cindex @code{unicode}, a charset |
| 651 | @cindex @code{eight-bit}, a charset |
| 652 | Emacs defines several special character sets. The character set |
| 653 | @code{unicode} includes all the characters whose Emacs code points are |
| 654 | in the range @code{0..#x10FFFF}. The character set @code{emacs} |
| 655 | includes all @acronym{ASCII} and non-@acronym{ASCII} characters. |
| 656 | Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; |
| 657 | Emacs uses it to represent raw bytes encountered in text. |
| 658 | |
| 659 | @defun charsetp object |
| 660 | Returns @code{t} if @var{object} is a symbol that names a character set, |
| 661 | @code{nil} otherwise. |
| 662 | @end defun |
| 663 | |
| 664 | @defvar charset-list |
| 665 | The value is a list of all defined character set names. |
| 666 | @end defvar |
| 667 | |
| 668 | @defun charset-priority-list &optional highestp |
| 669 | This function returns a list of all defined character sets ordered by |
| 670 | their priority. If @var{highestp} is non-@code{nil}, the function |
| 671 | returns a single character set of the highest priority. |
| 672 | @end defun |
| 673 | |
| 674 | @defun set-charset-priority &rest charsets |
| 675 | This function makes @var{charsets} the highest priority character sets. |
| 676 | @end defun |
| 677 | |
| 678 | @defun char-charset character &optional restriction |
| 679 | This function returns the name of the character set of highest |
| 680 | priority that @var{character} belongs to. @acronym{ASCII} characters |
| 681 | are an exception: for them, this function always returns @code{ascii}. |
| 682 | |
| 683 | If @var{restriction} is non-@code{nil}, it should be a list of |
| 684 | charsets to search. Alternatively, it can be a coding system, in |
| 685 | which case the returned charset must be supported by that coding |
| 686 | system (@pxref{Coding Systems}). |
| 687 | @end defun |
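
  As a brief sketch of the functions above, the following forms query
the priority list and look up the charset of a character.  The
highest-priority charset depends on the language environment, so the
first form only checks that the value names a charset.

@example
@group
(charsetp (charset-priority-list t))
     @result{} t
@end group
@group
(char-charset ?a)
     @result{} ascii
@end group
@group
;; Latin small letter e with acute; search only this charset.
(char-charset ?\u00e9 '(iso-8859-1))
     @result{} iso-8859-1
@end group
@end example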
| 688 | |
| 689 | @c TODO: Explain the properties here and add indexes such as ‘charset property’. |
| 690 | @defun charset-plist charset |
| 691 | This function returns the property list of the character set |
| 692 | @var{charset}. Although @var{charset} is a symbol, this is not the |
| 693 | same as the property list of that symbol. Charset properties include |
| 694 | important information about the charset, such as its documentation |
| 695 | string, short name, etc. |
| 696 | @end defun |
| 697 | |
| 698 | @defun put-charset-property charset propname value |
| 699 | This function sets the @var{propname} property of @var{charset} to the |
| 700 | given @var{value}. |
| 701 | @end defun |
| 702 | |
| 703 | @defun get-charset-property charset propname |
| 704 | This function returns the value of @var{charset}'s property |
| 705 | @var{propname}. |
| 706 | @end defun |
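
  The following sketch stores and retrieves a charset property, and
shows that the charset's property list is separate from the property
list of the symbol that names it.  The property name @code{my-note}
and its value are hypothetical, chosen only for illustration.

@example
@group
(put-charset-property 'ascii 'my-note "7-bit")
(get-charset-property 'ascii 'my-note)
     @result{} "7-bit"
@end group
@group
;; The symbol's own property list is not affected.
(get 'ascii 'my-note)
     @result{} nil
@end group
@end example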
| 707 | |
| 708 | @deffn Command list-charset-chars charset |
| 709 | This command displays a list of characters in the character set |
| 710 | @var{charset}. |
| 711 | @end deffn |
| 712 | |
| 713 | Emacs can convert between its internal representation of a character |
| 714 | and the character's codepoint in a specific charset. The following |
| 715 | two functions support these conversions. |
| 716 | |
| 717 | @c FIXME: decode-char and encode-char accept and ignore an additional |
| 718 | @c argument @var{restriction}. When that argument actually makes a |
| 719 | @c difference, it should be documented here. |
| 720 | @defun decode-char charset code-point |
| 721 | This function decodes a character that is assigned a @var{code-point} |
| 722 | in @var{charset}, to the corresponding Emacs character, and returns |
| 723 | it. If @var{charset} doesn't contain a character of that code point, |
| 724 | the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp |
| 725 | integer (@pxref{Integer Basics, most-positive-fixnum}), it can be |
| 726 | specified as a cons cell @code{(@var{high} . @var{low})}, where |
| 727 | @var{low} are the lower 16 bits of the value and @var{high} are the |
| 728 | high 16 bits. |
| 729 | @end defun |
| 730 | |
| 731 | @defun encode-char char charset |
| 732 | This function returns the code point assigned to the character |
| 733 | @var{char} in @var{charset}. If the result does not fit in a Lisp |
| 734 | integer, it is returned as a cons cell @code{(@var{high} . @var{low})} |
| 735 | that fits the second argument of @code{decode-char} above. If |
| 736 | @var{charset} doesn't have a codepoint for @var{char}, the value is |
| 737 | @code{nil}. |
| 738 | @end defun |
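
  For illustration, here is a short sketch of both conversions.  It
assumes the standard charsets @code{iso-8859-7} and @code{iso-8859-1}
are defined, as they are in a default Emacs session.

@example
@group
;; Code point #xE1 is Greek small letter alpha in ISO 8859-7.
(decode-char 'iso-8859-7 #xE1)
     @result{} 945        ; #x3B1
@end group
@group
(encode-char ?\u03b1 'iso-8859-7)
     @result{} 225        ; #xE1
@end group
@group
;; Latin-1 contains no Greek letters.
(encode-char ?\u03b1 'iso-8859-1)
     @result{} nil
@end group
@end example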
| 739 | |
| 740 | The following function comes in handy for applying a certain |
| 741 | function to all or part of the characters in a charset: |
| 742 | |
| 743 | @defun map-charset-chars function charset &optional arg from-code to-code |
| 744 | Call @var{function} for characters in @var{charset}. @var{function} |
| 745 | is called with two arguments. The first one is a cons cell |
| 746 | @code{(@var{from} . @var{to})}, where @var{from} and @var{to} |
| 747 | indicate a range of characters contained in @var{charset}. The second |
| 748 | argument passed to @var{function} is @var{arg}. |
| 749 | |
| 750 | By default, the range of codepoints passed to @var{function} includes |
| 751 | all the characters in @var{charset}, but optional arguments |
| 752 | @var{from-code} and @var{to-code} limit that to the range of |
| 753 | characters between these two codepoints of @var{charset}. If either |
| 754 | of them is @code{nil}, it defaults to the first or last codepoint of |
| 755 | @var{charset}, respectively. |
| 756 | @end defun |
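
  As a minimal sketch, the following form collects every range that
@code{map-charset-chars} reports for the @code{ascii} charset.  The
variable name @code{ranges} is arbitrary, and @var{arg} is left
unused here.

@example
@group
(let (ranges)
  (map-charset-chars (lambda (range _arg)
                       ;; RANGE is a cons cell (FROM . TO).
                       (push range ranges))
                     'ascii)
  (nreverse ranges))
@end group
@end example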
| 757 | |
| 758 | @node Scanning Charsets |
| 759 | @section Scanning for Character Sets |
| 760 | |
| 761 | Sometimes it is useful to find out which character set a particular |
| 762 | character belongs to. One use for this is in determining which coding |
| 763 | systems (@pxref{Coding Systems}) are capable of representing all of |
| 764 | the text in question; another is to determine the font(s) for |
| 765 | displaying that text. |
| 766 | |
| 767 | @defun charset-after &optional pos |
| 768 | This function returns the charset of highest priority containing the |
| 769 | character at position @var{pos} in the current buffer. If @var{pos} |
| 770 | is omitted or @code{nil}, it defaults to the current value of point. |
| 771 | If @var{pos} is out of range, the value is @code{nil}. |
| 772 | @end defun |
| 773 | |
| 774 | @defun find-charset-region beg end &optional translation |
| 775 | This function returns a list of the character sets of highest priority |
| 776 | that contain characters in the current buffer between positions |
| 777 | @var{beg} and @var{end}. |
| 778 | |
| 779 | The optional argument @var{translation} specifies a translation table |
| 780 | to use for scanning the text (@pxref{Translation of Characters}). If |
| 781 | it is non-@code{nil}, then each character in the region is translated |
| 782 | through this table, and the value returned describes the translated |
| 783 | characters instead of the characters actually in the buffer. |
| 784 | @end defun |
| 785 | |
| 786 | @defun find-charset-string string &optional translation |
| 787 | This function returns a list of character sets of highest priority |
| 788 | that contain characters in @var{string}. It is just like |
| 789 | @code{find-charset-region}, except that it applies to the contents of |
| 790 | @var{string} instead of part of the current buffer. |
| 791 | @end defun |
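
  The following sketch applies these functions to plain
@acronym{ASCII} text, for which the results do not depend on the
charset priorities.

@example
@group
(find-charset-string "abc")
     @result{} (ascii)
@end group
@group
(with-temp-buffer
  (insert "abc")
  (list (charset-after 1)
        (charset-after 100)))    ; position out of range
     @result{} (ascii nil)
@end group
@group
(with-temp-buffer
  (insert "abc")
  (find-charset-region (point-min) (point-max)))
     @result{} (ascii)
@end group
@end example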
| 792 | |
| 793 | @node Translation of Characters |
| 794 | @section Translation of Characters |
| 795 | @cindex character translation tables |
| 796 | @cindex translation tables |
| 797 | |
| 798 | A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that |
| 799 | specifies a mapping of characters into characters. These tables are |
| 800 | used in encoding and decoding, and for other purposes. Some coding |
| 801 | systems specify their own particular translation tables; there are |
| 802 | also default translation tables which apply to all other coding |
| 803 | systems. |
| 804 | |
| 805 | A translation table has two extra slots. The first is either |
| 806 | @code{nil} or a translation table that performs the reverse |
| 807 | translation; the second is the maximum number of characters to look up |
| 808 | for translating sequences of characters (see the description of |
| 809 | @code{make-translation-table-from-alist} below). |
| 810 | |
| 811 | @defun make-translation-table &rest translations |
| 812 | This function returns a translation table based on the argument |
| 813 | @var{translations}. Each element of @var{translations} should be a |
| 814 | list of elements of the form @code{(@var{from} . @var{to})}; this says |
| 815 | to translate the character @var{from} into @var{to}. |
| 816 | |
| 817 | The arguments and the forms in each argument are processed in order, |
| 818 | and if a previous form already translates @var{to} to some other |
| 819 | character, say @var{to-alt}, @var{from} is also translated to |
| 820 | @var{to-alt}. |
| 821 | @end defun |
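
  Because a translation table is a char-table, the mapping it holds
can be inspected with @code{aref}.  The sketch below uses arbitrary
characters; the second form illustrates the chaining rule just
described.

@example
@group
(let ((table (make-translation-table '((?a . ?b)))))
  (list (aref table ?a)      ; translated to ?b
        (aref table ?z)))    ; no translation
     @result{} (98 nil)
@end group
@group
;; ?b already maps to ?c, so ?a is mapped to ?c as well.
(let ((table (make-translation-table '((?b . ?c))
                                     '((?a . ?b)))))
  (aref table ?a))
     @result{} 99
@end group
@end example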
| 822 | |
| 823 | During decoding, the translation table's translations are applied to |
| 824 | the characters that result from ordinary decoding. If a coding system |
| 825 | has the property @code{:decode-translation-table}, that specifies the |
| 826 | translation table to use, or a list of translation tables to apply in |
| 827 | sequence. (This is a property of the coding system, as returned by |
| 828 | @code{coding-system-get}, not a property of the symbol that is the |
| 829 | coding system's name. @xref{Coding System Basics,, Basic Concepts of |
| 830 | Coding Systems}.) Finally, if |
| 831 | @code{standard-translation-table-for-decode} is non-@code{nil}, the |
| 832 | resulting characters are translated by that table. |
| 833 | |
| 834 | During encoding, the translation table's translations are applied to |
| 835 | the characters in the buffer, and the result of translation is |
| 836 | actually encoded. If a coding system has property |
| 837 | @code{:encode-translation-table}, that specifies the translation table |
| 838 | to use, or a list of translation tables to apply in sequence. In |
| 839 | addition, if the variable @code{standard-translation-table-for-encode} |
| 840 | is non-@code{nil}, it specifies the translation table to use for |
| 841 | translating the result. |
| 842 | |
| 843 | @defvar standard-translation-table-for-decode |
| 844 | This is the default translation table for decoding. If a coding |
| 845 | system specifies its own translation tables, the table that is the |
| 846 | value of this variable, if non-@code{nil}, is applied after them. |
| 847 | @end defvar |
| 848 | |
| 849 | @defvar standard-translation-table-for-encode |
| 850 | This is the default translation table for encoding. If a coding |
| 851 | system specifies its own translation tables, the table that is the |
| 852 | value of this variable, if non-@code{nil}, is applied after them. |
| 853 | @end defvar |
| 854 | |
| 855 | @c FIXME: This variable is obsolete since 23.1. We should mention |
| 856 | @c that here or simply remove this defvar. --xfq |
| 857 | @defvar translation-table-for-input |
| 858 | Self-inserting characters are translated through this translation |
| 859 | table before they are inserted. Search commands also translate their |
| 860 | input through this table, so they can compare more reliably with |
| 861 | what's in the buffer. |
| 862 | |
| 863 | This variable automatically becomes buffer-local when set. |
| 864 | @end defvar |
| 865 | |
| 866 | @defun make-translation-table-from-vector vec |
| 867 | This function returns a translation table made from @var{vec} that is |
| 868 | an array of 256 elements to map bytes (values 0 through #xFF) to |
| 869 | characters. Elements may be @code{nil} for untranslated bytes. The |
| 870 | returned table has a translation table for reverse mapping in the |
| 871 | first extra slot, and the value @code{1} in the second extra slot. |
| 872 | |
| 873 | This function provides an easy way to make a private coding system |
| 874 | that maps each byte to a specific character. You can specify the |
| 875 | returned table and the reverse translation table using the properties |
| 876 | @code{:decode-translation-table} and @code{:encode-translation-table} |
| 877 | respectively in the @var{props} argument to |
| 878 | @code{define-coding-system}. |
| 879 | @end defun |
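
  Here is a minimal sketch of building such a table.  The mapping of
byte @code{#x80} to the euro sign is hypothetical, chosen only for
illustration; all other bytes are left untranslated.

@example
@group
(let ((vec (make-vector 256 nil)))
  (aset vec #x80 ?\u20ac)
  (let* ((table (make-translation-table-from-vector vec))
         (rev (char-table-extra-slot table 0)))
    (list (aref table #x80)       ; byte to character
          (aref rev ?\u20ac)      ; character back to byte
          (char-table-extra-slot table 1))))
     @result{} (8364 128 1)
@end group
@end example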
| 880 | |
| 881 | @defun make-translation-table-from-alist alist |
| 882 | This function is similar to @code{make-translation-table} but returns |
| 883 | a complex translation table rather than a simple one-to-one mapping. |
| 884 | Each element of @var{alist} is of the form @code{(@var{from} |
| 885 | . @var{to})}, where @var{from} and @var{to} are either characters or |
| 886 | vectors specifying a sequence of characters. If @var{from} is a |
| 887 | character, that character is translated to @var{to} (i.e., to a |
| 888 | character or a character sequence). If @var{from} is a vector of |
| 889 | characters, that sequence is translated to @var{to}. The returned |
| 890 | table has a translation table for reverse mapping in the first extra |
| 891 | slot, and the maximum length of all the @var{from} character sequences |
| 892 | in the second extra slot. |
| 893 | @end defun |
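
  The sketch below builds a table with one single-character mapping
and one two-character sequence, using arbitrary characters.  It
inspects only the single-character entry and the second extra slot;
how sequence entries are stored in the table should be treated as an
internal detail.

@example
@group
(let ((table (make-translation-table-from-alist
              '((?a . ?b)
                ([?x ?y] . ?z)))))
  (list (aref table ?a)
        ;; Length of the longest FROM sequence.
        (char-table-extra-slot table 1)))
     @result{} (98 2)
@end group
@end example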
| 894 | |
| 895 | @node Coding Systems |
| 896 | @section Coding Systems |
| 897 | |
| 898 | @cindex coding system |
| 899 | When Emacs reads or writes a file, and when Emacs sends text to a |
| 900 | subprocess or receives text from a subprocess, it normally performs |
| 901 | character code conversion and end-of-line conversion as specified |
| 902 | by a particular @dfn{coding system}. |
| 903 | |
| 904 | How to define a coding system is an arcane matter, and is not |
| 905 | documented here. |
| 906 | |
| 907 | @menu |
| 908 | * Coding System Basics:: Basic concepts. |
| 909 | * Encoding and I/O:: How file I/O functions handle coding systems. |
| 910 | * Lisp and Coding Systems:: Functions to operate on coding system names. |
| 911 | * User-Chosen Coding Systems:: Asking the user to choose a coding system. |
| 912 | * Default Coding Systems:: Controlling the default choices. |
| 913 | * Specifying Coding Systems:: Requesting a particular coding system |
| 914 | for a single file operation. |
| 915 | * Explicit Encoding:: Encoding or decoding text without doing I/O. |
| 916 | * Terminal I/O Encoding:: Use of encoding for terminal I/O. |
| 917 | @end menu |
| 918 | |
| 919 | @node Coding System Basics |
| 920 | @subsection Basic Concepts of Coding Systems |
| 921 | |
| 922 | @cindex character code conversion |
| 923 | @dfn{Character code conversion} involves conversion between the |
| 924 | internal representation of characters used inside Emacs and some other |
| 925 | encoding. Emacs supports many different encodings, in that it can |
| 926 | convert to and from them. For example, it can convert text to or from |
| 927 | encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and |
| 928 | several variants of ISO 2022. In some cases, Emacs supports several |
| 929 | alternative encodings for the same characters; for example, there are |
| 930 | three coding systems for the Cyrillic (Russian) alphabet: ISO, |
| 931 | Alternativnyj, and KOI8. |
| 932 | |
| 933 | Every coding system specifies a particular set of character code |
| 934 | conversions, but the coding system @code{undecided} is special: it |
| 935 | leaves the choice unspecified, to be chosen heuristically for each |
| 936 | file, based on the file's data. |
| 937 | |
| 938 | In general, a coding system doesn't guarantee roundtrip identity: |
| 939 | decoding a byte sequence using a coding system, then encoding the |
| 940 | resulting text in the same coding system, can produce a different byte |
| 941 | sequence. But some coding systems do guarantee that the byte sequence |
| 942 | will be the same as what you originally decoded. Here are a few |
| 943 | examples: |
| 944 | |
| 945 | @quotation |
| 946 | iso-8859-1, utf-8, big5, shift_jis, euc-jp |
| 947 | @end quotation |
| 948 | |
| 949 | Encoding buffer text and then decoding the result can also fail to |
| 950 | reproduce the original text. For instance, if you encode a character |
| 951 | with a coding system which does not support that character, the result |
| 952 | is unpredictable, and thus decoding it using the same coding system |
| 953 | may produce a different text. Currently, Emacs can't report errors |
| 954 | that result from encoding unsupported characters. |
| 955 | |
| 956 | @cindex EOL conversion |
| 957 | @cindex end-of-line conversion |
| 958 | @cindex line end conversion |
| 959 | @dfn{End of line conversion} handles three different conventions |
| 960 | used on various systems for representing end of line in files. The |
| 961 | Unix convention, used on GNU and Unix systems, is to use the linefeed |
| 962 | character (also called newline). The DOS convention, used on |
| 963 | MS-Windows and MS-DOS systems, is to use a carriage-return and a |
| 964 | linefeed at the end of a line. The Mac convention is to use just |
| 965 | carriage-return. (This was the convention used on the Macintosh |
| 966 | system prior to OS X.) |
| 967 | |
| 968 | @cindex base coding system |
| 969 | @cindex variant coding system |
| 970 | @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line |
| 971 | conversion unspecified, to be chosen based on the data. @dfn{Variant |
| 972 | coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and |
| 973 | @code{latin-1-mac} specify the end-of-line conversion explicitly as |
| 974 | well. Most base coding systems have three corresponding variants whose |
| 975 | names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}. |
| 976 | |
| 977 | @vindex raw-text@r{ coding system} |
| 978 | The coding system @code{raw-text} is special in that it prevents |
| 979 | character code conversion, and causes the buffer visited with this |
| 980 | coding system to be a unibyte buffer. For historical reasons, you can |
| 981 | save both unibyte and multibyte text with this coding system. When |
| 982 | you use @code{raw-text} to encode multibyte text, it does perform one |
| 983 | character code conversion: it converts eight-bit characters to their |
| 984 | single-byte external representation. @code{raw-text} does not specify |
| 985 | the end-of-line conversion, allowing that to be determined as usual by |
| 986 | the data, and has the usual three variants which specify the |
| 987 | end-of-line conversion. |
| 988 | |
| 989 | @vindex no-conversion@r{ coding system} |
| 990 | @vindex binary@r{ coding system} |
| 991 | @code{no-conversion} (and its alias @code{binary}) is equivalent to |
| 992 | @code{raw-text-unix}: it specifies no conversion of either character |
| 993 | codes or end-of-line. |
| 994 | |
| 995 | @vindex emacs-internal@r{ coding system} |
| 996 | @vindex utf-8-emacs@r{ coding system} |
| 997 | The coding system @code{utf-8-emacs} specifies that the data is |
| 998 | represented in the internal Emacs encoding (@pxref{Text |
| 999 | Representations}). This is like @code{raw-text} in that no code |
| 1000 | conversion happens, but different in that the result is multibyte |
| 1001 | data. The name @code{emacs-internal} is an alias for |
| 1002 | @code{utf-8-emacs}. |
| 1003 | |
| 1004 | @defun coding-system-get coding-system property |
| 1005 | This function returns the specified property of the coding system |
| 1006 | @var{coding-system}. Most coding system properties exist for internal |
| 1007 | purposes, but one that you might find useful is @code{:mime-charset}. |
| 1008 | That property's value is the name used in MIME for the character coding |
| 1009 | which this coding system can read and write. Examples: |
| 1010 | |
| 1011 | @example |
| 1012 | (coding-system-get 'iso-latin-1 :mime-charset) |
| 1013 | @result{} iso-8859-1 |
| 1014 | (coding-system-get 'iso-2022-cn :mime-charset) |
| 1015 | @result{} iso-2022-cn |
| 1016 | (coding-system-get 'cyrillic-koi8 :mime-charset) |
| 1017 | @result{} koi8-r |
| 1018 | @end example |
| 1019 | |
| 1020 | The value of the @code{:mime-charset} property is also defined |
| 1021 | as an alias for the coding system. |
| 1022 | @end defun |
| 1023 | |
| 1024 | @cindex alias, for coding systems |
| 1025 | @defun coding-system-aliases coding-system |
| 1026 | This function returns the list of aliases of @var{coding-system}. |
| 1027 | @end defun |
| 1028 | |
| 1029 | @node Encoding and I/O |
| 1030 | @subsection Encoding and I/O |
| 1031 | |
| 1032 | The principal purpose of coding systems is for use in reading and |
| 1033 | writing files. The function @code{insert-file-contents} uses a coding |
| 1034 | system to decode the file data, and @code{write-region} uses one to |
| 1035 | encode the buffer contents. |
| 1036 | |
| 1037 | You can specify the coding system to use either explicitly |
| 1038 | (@pxref{Specifying Coding Systems}), or implicitly using a default |
| 1039 | mechanism (@pxref{Default Coding Systems}). But these methods may not |
| 1040 | completely specify what to do. For example, they may choose a coding |
| 1041 | system such as @code{undecided} which leaves the character code |
| 1042 | conversion to be determined from the data. In these cases, the I/O |
| 1043 | operation finishes the job of choosing a coding system. Very often |
| 1044 | you will want to find out afterwards which coding system was chosen. |
| 1045 | |
| 1046 | @defvar buffer-file-coding-system |
| 1047 | This buffer-local variable records the coding system used for saving the |
| 1048 | buffer and for writing part of the buffer with @code{write-region}. If |
| 1049 | the text to be written cannot be safely encoded using the coding system |
| 1050 | specified by this variable, these operations select an alternative |
| 1051 | encoding by calling the function @code{select-safe-coding-system} |
| 1052 | (@pxref{User-Chosen Coding Systems}). If selecting a different encoding |
| 1053 | requires asking the user to specify a coding system, |
| 1054 | @code{buffer-file-coding-system} is updated to the newly selected coding |
| 1055 | system. |
| 1056 | |
| 1057 | @code{buffer-file-coding-system} does @emph{not} affect sending text |
| 1058 | to a subprocess. |
| 1059 | @end defvar |
| 1060 | |
| 1061 | @defvar save-buffer-coding-system |
| 1062 | This variable specifies the coding system for saving the buffer (by |
| 1063 | overriding @code{buffer-file-coding-system}). Note that it is not used |
| 1064 | for @code{write-region}. |
| 1065 | |
| 1066 | When a command to save the buffer starts out to use |
| 1067 | @code{buffer-file-coding-system} (or @code{save-buffer-coding-system}), |
| 1068 | and that coding system cannot handle |
| 1069 | the actual text in the buffer, the command asks the user to choose |
| 1070 | another coding system (by calling @code{select-safe-coding-system}). |
| 1071 | After that happens, the command also updates |
| 1072 | @code{buffer-file-coding-system} to represent the coding system that |
| 1073 | the user specified. |
| 1074 | @end defvar |
| 1075 | |
| 1076 | @defvar last-coding-system-used |
| 1077 | I/O operations for files and subprocesses set this variable to the |
| 1078 | coding system name that was used. The explicit encoding and decoding |
| 1079 | functions (@pxref{Explicit Encoding}) set it too. |
| 1080 | |
| 1081 | @strong{Warning:} Since receiving subprocess output sets this variable, |
| 1082 | it can change whenever Emacs waits; therefore, you should copy the |
| 1083 | value shortly after the function call that stores the value you are |
| 1084 | interested in. |
| 1085 | @end defvar |
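
  In line with the warning above, the following sketch copies
@code{last-coding-system-used} immediately after the call that sets
it.  The temporary file prefix @samp{coding-demo} and the variable
name @code{used} are arbitrary; the coding system actually recorded
depends on your defaults, so no result is shown.

@example
@group
(let ((file (make-temp-file "coding-demo"))
      used)
  (unwind-protect
      (progn
        (with-temp-file file
          (insert "hello\n"))
        (with-temp-buffer
          (insert-file-contents file)
          ;; Copy the value before Emacs gets a chance to wait.
          (setq used last-coding-system-used)))
    (delete-file file))
  used)
@end group
@end example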
| 1086 | |
| 1087 | The variable @code{selection-coding-system} specifies how to encode |
| 1088 | selections for the window system. @xref{Window System Selections}. |
| 1089 | |
| 1090 | @defvar file-name-coding-system |
| 1091 | The variable @code{file-name-coding-system} specifies the coding |
| 1092 | system to use for encoding file names. Emacs encodes file names using |
| 1093 | that coding system for all file operations. If |
| 1094 | @code{file-name-coding-system} is @code{nil}, Emacs uses a default |
| 1095 | coding system determined by the selected language environment. In the |
| 1096 | default language environment, any non-@acronym{ASCII} characters in |
| 1097 | file names are not encoded specially; they appear in the file system |
| 1098 | using the internal Emacs representation. |
| 1099 | @end defvar |
| 1100 | |
| 1101 | @strong{Warning:} if you change @code{file-name-coding-system} (or |
| 1102 | the language environment) in the middle of an Emacs session, problems |
| 1103 | can result if you have already visited files whose names were encoded |
| 1104 | using the earlier coding system and are handled differently under the |
| 1105 | new coding system. If you try to save one of these buffers under the |
| 1106 | visited file name, saving may use the wrong file name, or it may get |
| 1107 | an error. If such a problem happens, use @kbd{C-x C-w} to specify a |
| 1108 | new file name for that buffer. |
| 1109 | |
| 1110 | @node Lisp and Coding Systems |
| 1111 | @subsection Coding Systems in Lisp |
| 1112 | |
| 1113 | Here are the Lisp facilities for working with coding systems: |
| 1114 | |
| 1115 | @cindex list all coding systems |
| 1116 | @defun coding-system-list &optional base-only |
| 1117 | This function returns a list of all coding system names (symbols). If |
| 1118 | @var{base-only} is non-@code{nil}, the value includes only the |
| 1119 | base coding systems. Otherwise, it includes alias and variant coding |
| 1120 | systems as well. |
| 1121 | @end defun |
| 1122 | |
| 1123 | @defun coding-system-p object |
| 1124 | This function returns @code{t} if @var{object} is a coding system |
| 1125 | name or is itself @code{nil}, and @code{nil} otherwise. |
| 1126 | @end defun |
| 1127 | |
| 1128 | @cindex validity of coding system |
| 1129 | @cindex coding system, validity check |
| 1130 | @defun check-coding-system coding-system |
| 1131 | This function checks the validity of @var{coding-system}. If that is |
| 1132 | valid, it returns @var{coding-system}. If @var{coding-system} is |
| 1133 | @code{nil}, the function returns @code{nil}. For any other values, it |
| 1134 | signals an error whose @code{error-symbol} is @code{coding-system-error} |
| 1135 | (@pxref{Signaling Errors, signal}). |
| 1136 | @end defun |
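
  The following sketch contrasts the two validity checks.  The name
@code{no-such-coding} is made up for this example and is assumed not
to be a defined coding system.

@example
@group
(coding-system-p 'utf-8)
     @result{} t
(coding-system-p nil)
     @result{} t
(coding-system-p 'no-such-coding)
     @result{} nil
@end group
@group
(check-coding-system 'utf-8)
     @result{} utf-8
@end group
@group
(condition-case err
    (check-coding-system 'no-such-coding)
  (coding-system-error (car err)))
     @result{} coding-system-error
@end group
@end example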
| 1137 | |
| 1138 | @cindex eol type of coding system |
| 1139 | @defun coding-system-eol-type coding-system |
| 1140 | This function returns the type of end-of-line (a.k.a.@: @dfn{eol}) |
| 1141 | conversion used by @var{coding-system}. If @var{coding-system} |
| 1142 | specifies a certain eol conversion, the return value is an integer 0, |
| 1143 | 1, or 2, standing for @code{unix}, @code{dos}, and @code{mac}, |
| 1144 | respectively. If @var{coding-system} doesn't specify eol conversion |
| 1145 | explicitly, the return value is a vector of coding systems, each one |
| 1146 | with one of the possible eol conversion types, like this: |
| 1147 | |
| 1148 | @lisp |
| 1149 | (coding-system-eol-type 'latin-1) |
| 1150 | @result{} [latin-1-unix latin-1-dos latin-1-mac] |
| 1151 | @end lisp |
| 1152 | |
| 1153 | @noindent |
| 1154 | If this function returns a vector, Emacs will decide, as part of the |
| 1155 | text encoding or decoding process, what eol conversion to use. For |
| 1156 | decoding, the end-of-line format of the text is auto-detected, and the |
| 1157 | eol conversion is set to match it (e.g., DOS-style CRLF format will |
| 1158 | imply @code{dos} eol conversion). For encoding, the eol conversion is |
| 1159 | taken from the appropriate default coding system (e.g., the |
| 1160 | default value of @code{buffer-file-coding-system} when encoding with |
| 1161 | @code{buffer-file-coding-system}), or from the default eol conversion |
| 1162 | appropriate for the underlying platform. |
| 1163 | @end defun |
| 1164 | |
| 1165 | @cindex eol conversion of coding system |
| 1166 | @defun coding-system-change-eol-conversion coding-system eol-type |
| 1167 | This function returns a coding system which is like @var{coding-system} |
| 1168 | except for its eol conversion, which is specified by @var{eol-type}. |
| 1169 | @var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or |
| 1170 | @code{nil}. If it is @code{nil}, the returned coding system determines |
| 1171 | the end-of-line conversion from the data. |
| 1172 | |
| 1173 | @var{eol-type} may also be 0, 1 or 2, standing for @code{unix}, |
| 1174 | @code{dos} and @code{mac}, respectively. |
| 1175 | @end defun |
| 1176 | |
| 1177 | @cindex text conversion of coding system |
| 1178 | @defun coding-system-change-text-conversion eol-coding text-coding |
| 1179 | This function returns a coding system which uses the end-of-line |
| 1180 | conversion of @var{eol-coding}, and the text conversion of |
| 1181 | @var{text-coding}. If @var{text-coding} is @code{nil}, it returns |
| 1182 | @code{undecided}, or one of its variants according to @var{eol-coding}. |
| 1183 | @end defun |
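
  As a short sketch, the forms below derive variants of existing
coding systems.  The last form checks only the eol type of the
result, since the exact name returned for an alias such as
@code{latin-1} is not assumed here.

@example
@group
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
@end group
@group
;; The integer 1 also stands for dos.
(coding-system-eol-type
 (coding-system-change-eol-conversion 'utf-8 1))
     @result{} 1
@end group
@group
;; Text conversion of latin-1, eol conversion of utf-8-unix.
(coding-system-eol-type
 (coding-system-change-text-conversion 'utf-8-unix 'latin-1))
     @result{} 0
@end group
@end example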
| 1184 | |
| 1185 | @cindex safely encode region |
| 1186 | @cindex coding systems for encoding region |
| 1187 | @defun find-coding-systems-region from to |
| 1188 | This function returns a list of coding systems that could be used to |
| 1189 | encode the text between @var{from} and @var{to}. All coding systems in |
| 1190 | the list can safely encode any multibyte characters in that portion of |
| 1191 | the text. |
| 1192 | |
| 1193 | If the text contains no multibyte characters, the function returns the |
| 1194 | list @code{(undecided)}. |
| 1195 | @end defun |
| 1196 | |
| 1197 | @cindex safely encode a string |
| 1198 | @cindex coding systems for encoding a string |
| 1199 | @defun find-coding-systems-string string |
| 1200 | This function returns a list of coding systems that could be used to |
| 1201 | encode the text of @var{string}. All coding systems in the list can |
| 1202 | safely encode any multibyte characters in @var{string}. If the text |
| 1203 | contains no multibyte characters, this returns the list |
| 1204 | @code{(undecided)}. |
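|  | |
|  | A small sketch of both cases follows. The sample strings are |
|  | arbitrary; for text with non-@acronym{ASCII} characters the full |
|  | list depends on the Emacs configuration, so the second form only |
|  | checks that @code{utf-8} is among the candidates: |
|  | |
|  | @example |
|  | @group |
|  | (find-coding-systems-string "plain ASCII text") |
|  |      @result{} (undecided) |
|  | (and (memq 'utf-8 (find-coding-systems-string "caf\u00e9")) t) |
|  |      @result{} t |
|  | @end group |
|  | @end example |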
| 1205 | @end defun |
| 1206 | |
| 1207 | @cindex charset, coding systems to encode |
| 1208 | @cindex safely encode characters in a charset |
| 1209 | @defun find-coding-systems-for-charsets charsets |
| 1210 | This function returns a list of coding systems that could be used to |
| 1211 | encode all the character sets in the list @var{charsets}. |
| 1212 | @end defun |
| 1213 | |
| 1214 | @defun check-coding-systems-region start end coding-system-list |
| 1215 | This function checks whether coding systems in the list |
| 1216 | @var{coding-system-list} can encode all the characters in the region |
| 1217 | between @var{start} and @var{end}. If all of the coding systems in |
| 1218 | the list can encode the specified text, the function returns |
| 1219 | @code{nil}. If some coding systems cannot encode some of the |
| 1220 | characters, the value is an alist, each element of which has the form |
| 1221 | @code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning |
| 1222 | that @var{coding-system1} cannot encode characters at buffer positions |
| 1223 | @var{pos1}, @var{pos2}, @enddots{}. |
| 1224 | |
| 1225 | @var{start} may be a string, in which case @var{end} is ignored and |
| 1226 | the returned value references string indices instead of buffer |
| 1227 | positions. |
| 1228 | @end defun |
| 1229 | |
| 1230 | @defun detect-coding-region start end &optional highest |
| 1231 | This function chooses a plausible coding system for decoding the text |
| 1232 | from @var{start} to @var{end}. This text should be a byte sequence, |
| 1233 | i.e., unibyte text or multibyte text with only @acronym{ASCII} and |
| 1234 | eight-bit characters (@pxref{Explicit Encoding}). |
| 1235 | |
| 1236 | Normally this function returns a list of coding systems that could |
| 1237 | handle decoding the text that was scanned. They are listed in order of |
| 1238 | decreasing priority. But if @var{highest} is non-@code{nil}, then the |
| 1239 | return value is just one coding system, the one that is highest in |
| 1240 | priority. |
| 1241 | |
| 1242 | If the region contains only @acronym{ASCII} characters, except for |
| 1243 | ISO-2022 control characters such as @code{ESC}, the value is |
| 1244 | @code{undecided} or @code{(undecided)}, or a variant specifying |
| 1245 | end-of-line conversion, if that can be deduced from the text. |
| 1246 | |
| 1247 | If the region contains null bytes, the value is @code{no-conversion}, |
| 1248 | even if the region contains text encoded in some coding system. |
| 1249 | @end defun |
| 1250 | |
| 1251 | @defun detect-coding-string string &optional highest |
| 1252 | This function is like @code{detect-coding-region} except that it |
| 1253 | operates on the contents of @var{string} instead of bytes in the buffer. |
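|  | |
|  | For example, with a short pure-@acronym{ASCII} string that contains |
|  | no line breaks (so no end-of-line convention can be deduced), the |
|  | two calling conventions should give: |
|  | |
|  | @example |
|  | @group |
|  | (detect-coding-string "hello") |
|  |      @result{} (undecided) |
|  | (detect-coding-string "hello" t) |
|  |      @result{} undecided |
|  | @end group |
|  | @end example |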
| 1254 | @end defun |
| 1255 | |
| 1256 | @cindex null bytes, and decoding text |
| 1257 | @defvar inhibit-null-byte-detection |
| 1258 | If this variable has a non-@code{nil} value, null bytes are ignored |
| 1259 | when detecting the encoding of a region or a string. This makes it |
| 1260 | possible to correctly detect the encoding of text that contains null |
| 1261 | bytes, such as Info files with Index nodes. |
| 1262 | @end defvar |
| 1263 | |
| 1264 | @defvar inhibit-iso-escape-detection |
| 1265 | If this variable has a non-@code{nil} value, ISO-2022 escape sequences |
| 1266 | are ignored when detecting the encoding of a region or a string. The |
| 1267 | result is that no text is ever detected as encoded in some ISO-2022 |
| 1268 | encoding, and all escape sequences become visible in a buffer. |
| 1269 | @strong{Warning:} @emph{Use this variable with extreme caution, |
| 1270 | because many files in the Emacs distribution use ISO-2022 encoding.} |
| 1271 | @end defvar |
| 1272 | |
| 1273 | @cindex charsets supported by a coding system |
| 1274 | @defun coding-system-charset-list coding-system |
| 1275 | This function returns the list of character sets (@pxref{Character |
| 1276 | Sets}) supported by @var{coding-system}. Some coding systems that |
| 1277 | support too many character sets to list them all yield special values: |
| 1278 | @itemize @bullet |
| 1279 | @item |
| 1280 | If @var{coding-system} supports all the ISO-2022 charsets, the value |
| 1281 | is @code{iso-2022}. |
| 1282 | @item |
| 1283 | If @var{coding-system} supports all Emacs characters, the value is |
| 1284 | @code{(emacs)}. |
| 1285 | @item |
| 1286 | If @var{coding-system} supports all emacs-mule characters, the value |
| 1287 | is @code{emacs-mule}. |
| 1288 | @item |
| 1289 | If @var{coding-system} supports all Unicode characters, the value is |
| 1290 | @code{(unicode)}. |
| 1291 | @end itemize |
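|  | |
|  | As an illustration, using two standard coding systems picked only as |
|  | examples, one with a special value and one with an ordinary list: |
|  | |
|  | @example |
|  | @group |
|  | (coding-system-charset-list 'utf-8) |
|  |      @result{} (unicode) |
|  | (coding-system-charset-list 'iso-latin-1) |
|  |      @result{} (iso-8859-1) |
|  | @end group |
|  | @end example |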
| 1292 | @end defun |
| 1293 | |
| 1294 | @xref{Coding systems for a subprocess,, Process Information}, in |
| 1295 | particular the description of the functions |
| 1296 | @code{process-coding-system} and @code{set-process-coding-system}, for |
| 1297 | how to examine or set the coding systems used for I/O to a subprocess. |
| 1298 | |
| 1299 | @node User-Chosen Coding Systems |
| 1300 | @subsection User-Chosen Coding Systems |
| 1301 | |
| 1302 | @cindex select safe coding system |
| 1303 | @defun select-safe-coding-system from to &optional default-coding-system accept-default-p file |
| 1304 | This function selects a coding system for encoding specified text, |
| 1305 | asking the user to choose if necessary. Normally the specified text |
| 1306 | is the text in the current buffer between @var{from} and @var{to}. If |
| 1307 | @var{from} is a string, the string specifies the text to encode, and |
| 1308 | @var{to} is ignored. |
| 1309 | |
| 1310 | If the specified text includes raw bytes (@pxref{Text |
| 1311 | Representations}), @code{select-safe-coding-system} suggests |
| 1312 | @code{raw-text} for its encoding. |
| 1313 | |
| 1314 | If @var{default-coding-system} is non-@code{nil}, that is the first |
| 1315 | coding system to try; if that can handle the text, |
| 1316 | @code{select-safe-coding-system} returns that coding system. It can |
| 1317 | also be a list of coding systems; then the function tries each of them |
| 1318 | one by one. After trying all of them, it next tries the current |
| 1319 | buffer's value of @code{buffer-file-coding-system} (if it is not |
| 1320 | @code{undecided}), then the default value of |
| 1321 | @code{buffer-file-coding-system} and finally the user's most |
| 1322 | preferred coding system, which the user can set using the command |
| 1323 | @code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing |
| 1324 | Coding Systems, emacs, The GNU Emacs Manual}). |
| 1325 | |
| 1326 | If one of those coding systems can safely encode all the specified |
| 1327 | text, @code{select-safe-coding-system} chooses it and returns it. |
| 1328 | Otherwise, it asks the user to choose from a list of coding systems |
| 1329 | which can encode all the text, and returns the user's choice. |
| 1330 | |
| 1331 | @var{default-coding-system} can also be a list whose first element is |
| 1332 | @code{t} and whose other elements are coding systems. Then, if no coding |
| 1333 | system in the list can handle the text, @code{select-safe-coding-system} |
| 1334 | queries the user immediately, without trying any of the three |
| 1335 | alternatives described above. |
| 1336 | |
| 1337 | The optional argument @var{accept-default-p}, if non-@code{nil}, |
| 1338 | should be a function to determine whether a coding system selected |
| 1339 | without user interaction is acceptable. @code{select-safe-coding-system} |
| 1340 | calls this function with one argument, the base coding system of the |
| 1341 | selected coding system. If @var{accept-default-p} returns @code{nil}, |
| 1342 | @code{select-safe-coding-system} rejects the silently selected coding |
| 1343 | system, and asks the user to select a coding system from a list of |
| 1344 | possible candidates. |
| 1345 | |
| 1346 | @vindex select-safe-coding-system-accept-default-p |
| 1347 | If the variable @code{select-safe-coding-system-accept-default-p} is |
| 1348 | non-@code{nil}, it should be a function taking a single argument. |
| 1349 | It is used in place of @var{accept-default-p}, overriding any |
| 1350 | value supplied for this argument. |
| 1351 | |
| 1352 | As a final step, before returning the chosen coding system, |
| 1353 | @code{select-safe-coding-system} checks whether that coding system is |
| 1354 | consistent with what would be selected if the contents of the region |
| 1355 | were read from a file. (If not, this could lead to data corruption in |
| 1356 | a file subsequently re-visited and edited.) Normally, |
| 1357 | @code{select-safe-coding-system} uses @code{buffer-file-name} as the |
| 1358 | file for this purpose, but if @var{file} is non-@code{nil}, it uses |
| 1359 | that file instead (this can be relevant for @code{write-region} and |
| 1360 | similar functions). If it detects an apparent inconsistency, |
| 1361 | @code{select-safe-coding-system} queries the user before selecting the |
| 1362 | coding system. |
| 1363 | @end defun |
| 1364 | |
| 1365 | Here are two functions you can use to let the user specify a coding |
| 1366 | system, with completion. @xref{Completion}. |
| 1367 | |
| 1368 | @defun read-coding-system prompt &optional default |
| 1369 | This function reads a coding system using the minibuffer, prompting with |
| 1370 | string @var{prompt}, and returns the coding system name as a symbol. If |
| 1371 | the user enters null input, @var{default} specifies which coding system |
| 1372 | to return. It should be a symbol or a string. |
| 1373 | @end defun |
| 1374 | |
| 1375 | @defun read-non-nil-coding-system prompt |
| 1376 | This function reads a coding system using the minibuffer, prompting with |
| 1377 | string @var{prompt}, and returns the coding system name as a symbol. If |
| 1378 | the user tries to enter null input, it asks the user to try again. |
| 1379 | @xref{Coding Systems}. |
| 1380 | @end defun |
| 1381 | |
| 1382 | @node Default Coding Systems |
| 1383 | @subsection Default Coding Systems |
| 1384 | @cindex default coding system |
| 1385 | @cindex coding system, automatically determined |
| 1386 | |
| 1387 | This section describes variables that specify the default coding |
| 1388 | system for certain files or when running certain subprograms, and the |
| 1389 | function that I/O operations use to access them. |
| 1390 | |
| 1391 | The idea of these variables is that you set them once and for all to the |
| 1392 | defaults you want, and then do not change them again. To specify a |
| 1393 | particular coding system for a particular operation in a Lisp program, |
| 1394 | don't change these variables; instead, override them using |
| 1395 | @code{coding-system-for-read} and @code{coding-system-for-write} |
| 1396 | (@pxref{Specifying Coding Systems}). |
| 1397 | |
| 1398 | @cindex file contents, and default coding system |
| 1399 | @defopt auto-coding-regexp-alist |
| 1400 | This variable is an alist of text patterns and corresponding coding |
| 1401 | systems. Each element has the form @code{(@var{regexp} |
| 1402 | . @var{coding-system})}; a file whose first few kilobytes match |
| 1403 | @var{regexp} is decoded with @var{coding-system} when its contents are |
| 1404 | read into a buffer. The settings in this alist take priority over |
| 1405 | @code{coding:} tags in the files and the contents of |
| 1406 | @code{file-coding-system-alist} (see below). The default value is set |
| 1407 | so that Emacs automatically recognizes mail files in Babyl format and |
| 1408 | reads them with no code conversions. |
| 1409 | @end defopt |
| 1410 | |
| 1411 | @cindex file name, and default coding system |
| 1412 | @defopt file-coding-system-alist |
| 1413 | This variable is an alist that specifies the coding systems to use for |
| 1414 | reading and writing particular files. Each element has the form |
| 1415 | @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular |
| 1416 | expression that matches certain file names. The element applies to file |
| 1417 | names that match @var{pattern}. |
| 1418 | |
| 1419 | The @sc{cdr} of the element, @var{coding}, should be either a coding |
| 1420 | system, a cons cell containing two coding systems, or a function name (a |
| 1421 | symbol with a function definition). If @var{coding} is a coding system, |
| 1422 | that coding system is used for both reading the file and writing it. If |
| 1423 | @var{coding} is a cons cell containing two coding systems, its @sc{car} |
| 1424 | specifies the coding system for decoding, and its @sc{cdr} specifies the |
| 1425 | coding system for encoding. |
| 1426 | |
| 1427 | If @var{coding} is a function name, the function should take one |
| 1428 | argument, a list of all arguments passed to |
| 1429 | @code{find-operation-coding-system}. It must return a coding system |
| 1430 | or a cons cell containing two coding systems. This value has the same |
| 1431 | meaning as described above. |
| 1432 | |
| 1433 | If @var{coding} (or what is returned by the above function) is |
| 1434 | @code{undecided}, the normal code-detection is performed. |
| 1435 | @end defopt |
| 1436 | |
| 1437 | @defopt auto-coding-alist |
| 1438 | This variable is an alist that specifies the coding systems to use for |
| 1439 | reading and writing particular files. Its form is like that of |
| 1440 | @code{file-coding-system-alist}, but, unlike the latter, this variable |
| 1441 | takes priority over any @code{coding:} tags in the file. |
| 1442 | @end defopt |
| 1443 | |
| 1444 | @cindex program name, and default coding system |
| 1445 | @defvar process-coding-system-alist |
| 1446 | This variable is an alist specifying which coding systems to use for a |
| 1447 | subprocess, depending on which program is running in the subprocess. It |
| 1448 | works like @code{file-coding-system-alist}, except that @var{pattern} is |
| 1449 | matched against the program name used to start the subprocess. The coding |
| 1450 | system or systems specified in this alist are used to initialize the |
| 1451 | coding systems used for I/O to the subprocess, but you can specify |
| 1452 | other coding systems later using @code{set-process-coding-system}. |
| 1453 | @end defvar |
| 1454 | |
| 1455 | @strong{Warning:} Coding systems such as @code{undecided}, which |
| 1456 | determine the coding system from the data, do not work entirely reliably |
| 1457 | with asynchronous subprocess output. This is because Emacs handles |
| 1458 | asynchronous subprocess output in batches, as it arrives. If the coding |
| 1459 | system leaves the character code conversion unspecified, or leaves the |
| 1460 | end-of-line conversion unspecified, Emacs must try to detect the proper |
| 1461 | conversion from one batch at a time, and this does not always work. |
| 1462 | |
| 1463 | Therefore, with an asynchronous subprocess, if at all possible, use a |
| 1464 | coding system which determines both the character code conversion and |
| 1465 | the end-of-line conversion---that is, one like @code{latin-1-unix}, |
| 1466 | rather than @code{undecided} or @code{latin-1}. |
| 1467 | |
| 1468 | @cindex port number, and default coding system |
| 1469 | @cindex network service name, and default coding system |
| 1470 | @defvar network-coding-system-alist |
| 1471 | This variable is an alist that specifies the coding system to use for |
| 1472 | network streams. It works much like @code{file-coding-system-alist}, |
| 1473 | with the difference that the @var{pattern} in an element may be either a |
| 1474 | port number or a regular expression. If it is a regular expression, it |
| 1475 | is matched against the network service name used to open the network |
| 1476 | stream. |
| 1477 | @end defvar |
| 1478 | |
| 1479 | @defvar default-process-coding-system |
| 1480 | This variable specifies the coding systems to use for subprocess (and |
| 1481 | network stream) input and output, when nothing else specifies what to |
| 1482 | do. |
| 1483 | |
| 1484 | The value should be a cons cell of the form @code{(@var{input-coding} |
| 1485 | . @var{output-coding})}. Here @var{input-coding} applies to input from |
| 1486 | the subprocess, and @var{output-coding} applies to output to it. |
| 1487 | @end defvar |
| 1488 | |
| 1489 | @cindex default coding system, functions to determine |
| 1490 | @defopt auto-coding-functions |
| 1491 | This variable holds a list of functions that try to determine a |
| 1492 | coding system for a file based on its undecoded contents. |
| 1493 | |
| 1494 | Each function in this list should be written to look at text in the |
| 1495 | current buffer, but should not modify it in any way. The buffer will |
| 1496 | contain undecoded text of parts of the file. Each function should |
| 1497 | take one argument, @var{size}, which tells it how many characters to |
| 1498 | look at, starting from point. If the function succeeds in determining |
| 1499 | a coding system for the file, it should return that coding system. |
| 1500 | Otherwise, it should return @code{nil}. |
| 1501 | |
| 1502 | If a file has a @samp{coding:} tag, that takes precedence, so these |
| 1503 | functions won't be called. |
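|  | |
|  | The following is a minimal sketch of such a function. The function |
|  | name @code{my-demo-auto-coding} and the marker text |
|  | @samp{#!demo-utf8} are hypothetical, invented for this illustration; |
|  | the function only looks at the text after point and does not modify |
|  | the buffer: |
|  | |
|  | @example |
|  | @group |
|  | ;; @r{Hypothetical detector: files starting with the marker are UTF-8.} |
|  | (defun my-demo-auto-coding (size) |
|  |   "Return `utf-8' if the text after point starts with the demo marker." |
|  |   (and (>= size 11)               ; @r{the marker is 11 characters long} |
|  |        (looking-at "#!demo-utf8") |
|  |        'utf-8)) |
|  | |
|  | (add-to-list 'auto-coding-functions #'my-demo-auto-coding) |
|  | @end group |
|  | @end example |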
| 1504 | @end defopt |
| 1505 | |
| 1506 | @defun find-auto-coding filename size |
| 1507 | This function tries to determine a suitable coding system for |
| 1508 | @var{filename}. It examines the buffer visiting the named file, using |
| 1509 | the variables documented above in sequence, until it finds a match for |
| 1510 | one of the rules specified by these variables. It then returns a cons |
| 1511 | cell of the form @code{(@var{coding} . @var{source})}, where |
| 1512 | @var{coding} is the coding system to use and @var{source} is a symbol, |
| 1513 | one of @code{auto-coding-alist}, @code{auto-coding-regexp-alist}, |
| 1514 | @code{:coding}, or @code{auto-coding-functions}, indicating which one |
| 1515 | supplied the matching rule. The value @code{:coding} means the coding |
| 1516 | system was specified by the @code{coding:} tag in the file |
| 1517 | (@pxref{Specify Coding,, coding tag, emacs, The GNU Emacs Manual}). |
| 1518 | The order of looking for a matching rule is @code{auto-coding-alist} |
| 1519 | first, then @code{auto-coding-regexp-alist}, then the @code{coding:} |
| 1520 | tag, and lastly @code{auto-coding-functions}. If no matching rule was |
| 1521 | found, the function returns @code{nil}. |
| 1522 | |
| 1523 | The second argument @var{size} is the size of text, in characters, |
| 1524 | following point. The function examines text only within @var{size} |
| 1525 | characters after point. Normally, the buffer should be positioned at |
| 1526 | the beginning when this function is called, because one of the places |
| 1527 | for the @code{coding:} tag is the first one or two lines of the file; |
| 1528 | in that case, @var{size} should be the size of the buffer. |
| 1529 | @end defun |
| 1530 | |
| 1531 | @defun set-auto-coding filename size |
| 1532 | This function returns a suitable coding system for file |
| 1533 | @var{filename}. It uses @code{find-auto-coding} to find the coding |
| 1534 | system. If no coding system could be determined, the function returns |
| 1535 | @code{nil}. The meaning of the argument @var{size} is the same as in |
| 1536 | @code{find-auto-coding}. |
| 1537 | @end defun |
| 1538 | |
| 1539 | @defun find-operation-coding-system operation &rest arguments |
| 1540 | This function returns the coding system to use (by default) for |
| 1541 | performing @var{operation} with @var{arguments}. The value has this |
| 1542 | form: |
| 1543 | |
| 1544 | @example |
| 1545 | (@var{decoding-system} . @var{encoding-system}) |
| 1546 | @end example |
| 1547 | |
| 1548 | The first element, @var{decoding-system}, is the coding system to use |
| 1549 | for decoding (in case @var{operation} does decoding), and |
| 1550 | @var{encoding-system} is the coding system for encoding (in case |
| 1551 | @var{operation} does encoding). |
| 1552 | |
| 1553 | The argument @var{operation} is a symbol; it should be one of |
| 1554 | @code{write-region}, @code{start-process}, @code{call-process}, |
| 1555 | @code{call-process-region}, @code{insert-file-contents}, or |
| 1556 | @code{open-network-stream}. These are the names of the Emacs I/O |
| 1557 | primitives that can do character code and eol conversion. |
| 1558 | |
| 1559 | The remaining arguments should be the same arguments that might be given |
| 1560 | to the corresponding I/O primitive. Depending on the primitive, one |
| 1561 | of those arguments is selected as the @dfn{target}. For example, if |
| 1562 | @var{operation} does file I/O, whichever argument specifies the file |
| 1563 | name is the target. For subprocess primitives, the program name is the |
| 1564 | target. For @code{open-network-stream}, the target is the service name |
| 1565 | or port number. |
| 1566 | |
| 1567 | Depending on @var{operation}, this function looks up the target in |
| 1568 | @code{file-coding-system-alist}, @code{process-coding-system-alist}, |
| 1569 | or @code{network-coding-system-alist}. If the target is found in the |
| 1570 | alist, @code{find-operation-coding-system} returns its association in |
| 1571 | the alist; otherwise it returns @code{nil}. |
| 1572 | |
| 1573 | If @var{operation} is @code{insert-file-contents}, the argument |
| 1574 | corresponding to the target may be a cons cell of the form |
| 1575 | @code{(@var{filename} . @var{buffer})}. In that case, @var{filename} |
| 1576 | is a file name to look up in @code{file-coding-system-alist}, and |
| 1577 | @var{buffer} is a buffer that contains the file's contents (not yet |
| 1578 | decoded). If @code{file-coding-system-alist} specifies a function to |
| 1579 | call for this file, and that function needs to examine the file's |
| 1580 | contents (as it usually does), it should examine the contents of |
| 1581 | @var{buffer} instead of reading the file. |
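|  | |
|  | Here is a sketch of a lookup in @code{file-coding-system-alist}. |
|  | The @file{.demo} extension and the file name are hypothetical; the |
|  | alist is bound with @code{let} so that the global value is left |
|  | untouched: |
|  | |
|  | @example |
|  | @group |
|  | (let ((file-coding-system-alist |
|  |        (cons '("\\.demo\\'" . (utf-8 . utf-8-unix)) |
|  |              file-coding-system-alist))) |
|  |   (find-operation-coding-system 'insert-file-contents "notes.demo")) |
|  |      @result{} (utf-8 . utf-8-unix) |
|  | @end group |
|  | @end example |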
| 1582 | @end defun |
| 1583 | |
| 1584 | @node Specifying Coding Systems |
| 1585 | @subsection Specifying a Coding System for One Operation |
| 1586 | |
| 1587 | You can specify the coding system for a specific operation by binding |
| 1588 | the variables @code{coding-system-for-read} and/or |
| 1589 | @code{coding-system-for-write}. |
| 1590 | |
| 1591 | @defvar coding-system-for-read |
| 1592 | If this variable is non-@code{nil}, it specifies the coding system to |
| 1593 | use for reading a file, or for input from a synchronous subprocess. |
| 1594 | |
| 1595 | It also applies to any asynchronous subprocess or network stream, but in |
| 1596 | a different way: the value of @code{coding-system-for-read} when you |
| 1597 | start the subprocess or open the network stream specifies the input |
| 1598 | decoding method for that subprocess or network stream. It remains in |
| 1599 | use for that subprocess or network stream unless and until overridden. |
| 1600 | |
| 1601 | The right way to use this variable is to bind it with @code{let} for a |
| 1602 | specific I/O operation. Its global value is normally @code{nil}, and |
| 1603 | you should not globally set it to any other value. Here is an example |
| 1604 | of the right way to use the variable: |
| 1605 | |
| 1606 | @example |
| 1607 | ;; @r{Read the file with no character code conversion.} |
| 1608 | ;; @r{Assume @acronym{CRLF} represents end-of-line.} |
| 1609 | (let ((coding-system-for-read 'raw-text-dos)) |
| 1610 |   (insert-file-contents filename)) |
| 1611 | @end example |
| 1612 | |
| 1613 | When its value is non-@code{nil}, this variable takes precedence over |
| 1614 | all other methods of specifying a coding system to use for input, |
| 1615 | including @code{file-coding-system-alist}, |
| 1616 | @code{process-coding-system-alist} and |
| 1617 | @code{network-coding-system-alist}. |
| 1618 | @end defvar |
| 1619 | |
| 1620 | @defvar coding-system-for-write |
| 1621 | This works much like @code{coding-system-for-read}, except that it |
| 1622 | applies to output rather than input. It affects writing to files, |
| 1623 | as well as sending output to subprocesses and network connections. |
| 1624 | |
| 1625 | When a single operation does both input and output, as do |
| 1626 | @code{call-process-region} and @code{start-process}, both |
| 1627 | @code{coding-system-for-read} and @code{coding-system-for-write} |
| 1628 | affect it. |
| 1629 | @end defvar |
| 1630 | |
| 1631 | @defopt inhibit-eol-conversion |
| 1632 | When this variable is non-@code{nil}, no end-of-line conversion is done, |
| 1633 | no matter which coding system is specified. This applies to all the |
| 1634 | Emacs I/O and subprocess primitives, and to the explicit encoding and |
| 1635 | decoding functions (@pxref{Explicit Encoding}). |
| 1636 | @end defopt |
| 1637 | |
| 1638 | @cindex priority order of coding systems |
| 1639 | @cindex coding systems, priority |
| 1640 | Sometimes, you need to prefer several coding systems for some |
| 1641 | operation, rather than fix a single one. Emacs lets you specify a |
| 1642 | priority order for using coding systems. This ordering affects the |
| 1643 | sorting of lists of coding systems returned by functions such as |
| 1644 | @code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}). |
| 1645 | |
| 1646 | @defun coding-system-priority-list &optional highestp |
| 1647 | This function returns the list of coding systems in the order of their |
| 1648 | current priorities. Optional argument @var{highestp}, if |
| 1649 | non-@code{nil}, means return only the highest priority coding system. |
| 1650 | @end defun |
| 1651 | |
| 1652 | @defun set-coding-system-priority &rest coding-systems |
| 1653 | This function puts @var{coding-systems} at the beginning of the |
| 1654 | priority list for coding systems, thus making their priority higher |
| 1655 | than all the rest. |
| 1656 | @end defun |
| 1657 | |
| 1658 | @defmac with-coding-priority coding-systems &rest body@dots{} |
| 1659 | This macro executes @var{body}, like @code{progn} does |
| 1660 | (@pxref{Sequencing, progn}), with @var{coding-systems} at the front of |
| 1661 | the priority list for coding systems. @var{coding-systems} should be |
| 1662 | a list of coding systems to prefer during execution of @var{body}. |
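|  | |
|  | For example, in the sketch below (which picks @code{utf-8} only for |
|  | illustration), the highest-priority coding system seen inside the |
|  | body should be the one given to the macro; the previous priority |
|  | order is restored when the body exits: |
|  | |
|  | @example |
|  | @group |
|  | (with-coding-priority '(utf-8) |
|  |   (coding-system-priority-list t)) |
|  |      @result{} utf-8 |
|  | @end group |
|  | @end example |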
| 1663 | @end defmac |
| 1664 | |
| 1665 | @node Explicit Encoding |
| 1666 | @subsection Explicit Encoding and Decoding |
| 1667 | @cindex encoding in coding systems |
| 1668 | @cindex decoding in coding systems |
| 1669 | |
| 1670 | All the operations that transfer text in and out of Emacs have the |
| 1671 | ability to use a coding system to encode or decode the text. |
| 1672 | You can also explicitly encode and decode text using the functions |
| 1673 | in this section. |
| 1674 | |
| 1675 | The result of encoding, and the input to decoding, are not ordinary |
| 1676 | text. They logically consist of a series of byte values; that is, a |
| 1677 | series of @acronym{ASCII} and eight-bit characters. In unibyte |
| 1678 | buffers and strings, these characters have codes in the range 0 |
| 1679 | through #xFF (255). In a multibyte buffer or string, eight-bit |
| 1680 | characters have character codes higher than #xFF (@pxref{Text |
| 1681 | Representations}), but Emacs transparently converts them to their |
| 1682 | single-byte values when you encode or decode such text. |
| 1683 | |
| 1684 | The usual way to read a file into a buffer as a sequence of bytes, so |
| 1685 | you can decode the contents explicitly, is with |
| 1686 | @code{insert-file-contents-literally} (@pxref{Reading from Files}); |
| 1687 | alternatively, specify a non-@code{nil} @var{rawfile} argument when |
| 1688 | visiting a file with @code{find-file-noselect}. These methods result in |
| 1689 | a unibyte buffer. |
| 1690 | |
| 1691 | The usual way to use the byte sequence that results from explicitly |
| 1692 | encoding text is to copy it to a file or process---for example, to write |
| 1693 | it with @code{write-region} (@pxref{Writing to Files}), and suppress |
| 1694 | encoding by binding @code{coding-system-for-write} to |
| 1695 | @code{no-conversion}. |
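|  | |
|  | A minimal sketch of that pattern follows. The sample text and the |
|  | file name @file{/tmp/demo-bytes.txt} are placeholders chosen for |
|  | this illustration: |
|  | |
|  | @example |
|  | @group |
|  | ;; @r{Encode explicitly, then write the resulting bytes unchanged.} |
|  | (let ((bytes (encode-coding-string "caf\u00e9" 'utf-8)) |
|  |       (coding-system-for-write 'no-conversion)) |
|  |   (write-region bytes nil "/tmp/demo-bytes.txt")) |
|  | @end group |
|  | @end example |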
| 1696 | |
| 1697 | Here are the functions to perform explicit encoding or decoding. The |
| 1698 | encoding functions produce sequences of bytes; the decoding functions |
| 1699 | are meant to operate on sequences of bytes. All of these functions |
| 1700 | discard text properties. They also set @code{last-coding-system-used} |
| 1701 | to the precise coding system they used. |
| 1702 | |
| 1703 | @deffn Command encode-coding-region start end coding-system &optional destination |
| 1704 | This command encodes the text from @var{start} to @var{end} according |
| 1705 | to coding system @var{coding-system}. Normally, the encoded text |
| 1706 | replaces the original text in the buffer, but the optional argument |
| 1707 | @var{destination} can change that. If @var{destination} is a buffer, |
| 1708 | the encoded text is inserted in that buffer after point (point does |
| 1709 | not move); if it is @code{t}, the command returns the encoded text as |
| 1710 | a unibyte string without inserting it. |
| 1711 | |
| 1712 | If encoded text is inserted in some buffer, this command returns the |
| 1713 | length of the encoded text. |
| 1714 | |
| 1715 | The result of encoding is logically a sequence of bytes, but the |
| 1716 | buffer remains multibyte if it was multibyte before, and any 8-bit |
| 1717 | bytes are converted to their multibyte representation (@pxref{Text |
| 1718 | Representations}). |
| 1719 | |
| 1720 | @cindex @code{undecided} coding-system, when encoding |
| 1721 | Do @emph{not} use @code{undecided} for @var{coding-system} when |
| 1722 | encoding text, since that may lead to unexpected results. Instead, |
| 1723 | use @code{select-safe-coding-system} (@pxref{User-Chosen Coding |
| 1724 | Systems, select-safe-coding-system}) to suggest a suitable encoding, |
| 1725 | if there's no obvious pertinent value for @var{coding-system}. |
| 1726 | @end deffn |
| 1727 | |
| 1728 | @defun encode-coding-string string coding-system &optional nocopy buffer |
| 1729 | This function encodes the text in @var{string} according to coding |
| 1730 | system @var{coding-system}. It returns a new string containing the |
| 1731 | encoded text, except when @var{nocopy} is non-@code{nil}, in which |
| 1732 | case the function may return @var{string} itself if the encoding |
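| 1733 | operation is trivial. The result of encoding is a unibyte string. |
|  | |
|  | If optional argument @var{buffer} specifies a buffer, the encoded text |
|  | is inserted in that buffer after point (point does not move). In this |
|  | case, the return value is the length of the encoded text. |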
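|  | |
|  | As an illustration, encoding a short sample string (arbitrary text |
|  | containing one non-@acronym{ASCII} character, U+00E9) should yield |
|  | the following unibyte strings: |
|  | |
|  | @example |
|  | @group |
|  | (encode-coding-string "caf\u00e9" 'utf-8) |
|  |      @result{} "caf\303\251" |
|  | (encode-coding-string "caf\u00e9" 'iso-latin-1) |
|  |      @result{} "caf\351" |
|  | (multibyte-string-p (encode-coding-string "caf\u00e9" 'utf-8)) |
|  |      @result{} nil |
|  | @end group |
|  | @end example |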
| 1734 | @end defun |
| 1735 | |
| 1736 | @deffn Command decode-coding-region start end coding-system &optional destination |
| 1737 | This command decodes the text from @var{start} to @var{end} according |
| 1738 | to coding system @var{coding-system}. To make explicit decoding |
| 1739 | useful, the text before decoding ought to be a sequence of byte |
| 1740 | values, but both multibyte and unibyte buffers are acceptable (in the |
| 1741 | multibyte case, the raw byte values should be represented as eight-bit |
| 1742 | characters). Normally, the decoded text replaces the original text in |
| 1743 | the buffer, but the optional argument @var{destination} can change |
| 1744 | that. If @var{destination} is a buffer, the decoded text is inserted |
| 1745 | in that buffer after point (point does not move); if it is @code{t}, |
| 1746 | the command returns the decoded text as a multibyte string without |
| 1747 | inserting it. |
| 1748 | |
| 1749 | If decoded text is inserted in some buffer, this command returns the |
| 1750 | length of the decoded text. |
| 1751 | |
| 1752 | This command puts a @code{charset} text property on the decoded text. |
| 1753 | The value of the property states the character set used to decode the |
| 1754 | original text. |
| 1755 | @end deffn |
| 1756 | |
| 1757 | @defun decode-coding-string string coding-system &optional nocopy buffer |
| 1758 | This function decodes the text in @var{string} according to |
| 1759 | @var{coding-system}. It returns a new string containing the decoded |
| 1760 | text, except when @var{nocopy} is non-@code{nil}, in which case the |
| 1761 | function may return @var{string} itself if the decoding operation is |
| 1762 | trivial. To make explicit decoding useful, the contents of |
| 1763 | @var{string} ought to be a unibyte string with a sequence of byte |
| 1764 | values, but a multibyte string is also acceptable (assuming it |
| 1765 | contains 8-bit bytes in their multibyte form). |
| 1766 | |
| 1767 | If optional argument @var{buffer} specifies a buffer, the decoded text |
| 1768 | is inserted in that buffer after point (point does not move). In this |
| 1769 | case, the return value is the length of the decoded text. |
| 1770 | |
| 1771 | @cindex @code{charset}, text property |
| 1772 | This function puts a @code{charset} text property on the decoded text. |
| 1773 | The value of the property states the character set used to decode the |
| 1774 | original text: |
| 1775 | |
| 1776 | @example |
| 1777 | @group |
| 1778 | (decode-coding-string "Gr\374ss Gott" 'latin-1) |
| 1779 | @result{} #("Gr@"uss Gott" 0 9 (charset iso-8859-1)) |
| 1780 | @end group |
| 1781 | @end example |
| 1782 | @end defun |
| 1783 | |
| 1784 | @defun decode-coding-inserted-region from to filename &optional visit beg end replace |
| 1785 | This function decodes the text from @var{from} to @var{to} as if |
| 1786 | it were being read from file @var{filename} by @code{insert-file-contents}, |
| 1787 | called with the remaining arguments. |
| 1788 | |
| 1789 | The normal way to use this function is after reading text from a file |
| 1790 | without decoding, if you decide you would rather have decoded it. |
| 1791 | Instead of deleting the text and reading it again, this time with |
| 1792 | decoding, you can call this function. |
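| | |
| | The usage described above might look like the following sketch. The |
| | file name is hypothetical, and the example assumes the temporary |
| | buffer is multibyte: |
| | |
| | @example |
| | @group |
| | (with-temp-buffer |
| |   ;; Insert the raw bytes of the file, without decoding. |
| |   ;; "/tmp/example.txt" is a placeholder file name. |
| |   (insert-file-contents-literally "/tmp/example.txt") |
| |   ;; Decode them in place instead of reading the file again. |
| |   (decode-coding-inserted-region (point-min) (point-max) |
| |                                  "/tmp/example.txt") |
| |   (buffer-string)) |
| | @end group |
| | @end example |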
| 1793 | @end defun |
| 1794 | |
| 1795 | @node Terminal I/O Encoding |
| 1796 | @subsection Terminal I/O Encoding |
| 1797 | |
| 1798 | Emacs can decode keyboard input using a coding system, and encode |
| 1799 | terminal output. This is useful for terminals that transmit or |
| 1800 | display text using a particular encoding such as Latin-1. Emacs does |
| 1801 | not set @code{last-coding-system-used} for encoding or decoding of |
| 1802 | terminal I/O. |
| 1803 | |
| 1804 | @defun keyboard-coding-system &optional terminal |
| 1805 | This function returns the coding system that is in use for decoding |
| 1806 | keyboard input from @var{terminal}---or @code{nil} if no coding system |
| 1807 | is to be used for that terminal. If @var{terminal} is omitted or |
| 1808 | @code{nil}, it means the selected frame's terminal. @xref{Multiple |
| 1809 | Terminals}. |
| 1810 | @end defun |
| 1811 | |
| 1812 | @deffn Command set-keyboard-coding-system coding-system &optional terminal |
| 1813 | This command specifies @var{coding-system} as the coding system to use |
| 1814 | for decoding keyboard input from @var{terminal}. If |
| 1815 | @var{coding-system} is @code{nil}, that means do not decode keyboard |
| 1816 | input. If @var{terminal} is a frame, it means that frame's terminal; |
| 1817 | if it is @code{nil}, that means the currently selected frame's |
| 1818 | terminal. @xref{Multiple Terminals}. |
| 1819 | @end deffn |
| 1820 | |
| 1821 | @defun terminal-coding-system &optional terminal |
| 1822 | This function returns the coding system that is in use for encoding |
| 1823 | terminal output sent to @var{terminal}, or @code{nil} if the output is |
| 1824 | not encoded. If @var{terminal} is a frame, it means that frame's |
| 1825 | terminal; if it is @code{nil}, that means the currently selected |
| 1826 | frame's terminal. |
| 1827 | @end defun |
| 1828 | |
| 1829 | @deffn Command set-terminal-coding-system coding-system &optional terminal |
| 1830 | This command specifies @var{coding-system} as the coding system to use |
| 1831 | for encoding terminal output sent to @var{terminal}. If |
| 1832 | @var{coding-system} is @code{nil}, terminal output is not encoded. If |
| 1833 | @var{terminal} is a frame, it means that frame's terminal; if it is |
| 1834 | @code{nil}, that means the currently selected frame's terminal. |
| 1835 | @end deffn |
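| | |
| | As an illustration, the following sketch inspects and then sets both |
| | coding systems for the selected frame's terminal. It assumes a text |
| | terminal that transmits and displays UTF-8; substitute the coding |
| | system your terminal actually uses: |
| | |
| | @example |
| | @group |
| | ;; Inspect the coding systems currently in use. |
| | (keyboard-coding-system) |
| | (terminal-coding-system) |
| | |
| | ;; Decode keyboard input and encode terminal output as UTF-8. |
| | (set-keyboard-coding-system 'utf-8) |
| | (set-terminal-coding-system 'utf-8) |
| | @end group |
| | @end example |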
| 1836 | |
| 1837 | @node Input Methods |
| 1838 | @section Input Methods |
| 1839 | @cindex input methods |
| 1840 | |
| 1841 | @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII} |
| 1842 | characters from the keyboard. Unlike coding systems, which translate |
| 1843 | non-@acronym{ASCII} characters to and from encodings meant to be read by |
| 1844 | programs, input methods provide human-friendly commands. (@xref{Input |
| 1845 | Methods,,, emacs, The GNU Emacs Manual}, for information on how users |
| 1846 | use input methods to enter text.) How to define input methods is not |
| 1847 | yet documented in this manual, but here we describe how to use them. |
| 1848 | |
| 1849 | Each input method has a name, which is currently a string; |
| 1850 | in the future, symbols may also be usable as input method names. |
| 1851 | |
| 1852 | @defvar current-input-method |
| 1853 | This variable holds the name of the input method now active in the |
| 1854 | current buffer. (It automatically becomes local in each buffer when set |
| 1855 | in any fashion.) It is @code{nil} if no input method is active in the |
| 1856 | buffer now. |
| 1857 | @end defvar |
| 1858 | |
| 1859 | @defopt default-input-method |
| 1860 | This variable holds the default input method for commands that choose an |
| 1861 | input method. Unlike @code{current-input-method}, this variable is |
| 1862 | normally global. |
| 1863 | @end defopt |
| 1864 | |
| 1865 | @deffn Command set-input-method input-method |
| 1866 | This command activates input method @var{input-method} for the current |
| 1867 | buffer. It also sets @code{default-input-method} to @var{input-method}. |
| 1868 | If @var{input-method} is @code{nil}, this command deactivates any input |
| 1869 | method for the current buffer. |
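| | |
| | For instance, assuming the input method named @code{"latin-1-prefix"} |
| | is available in your Emacs, the following sketch activates it and |
| | then deactivates it again: |
| | |
| | @example |
| | @group |
| | (set-input-method "latin-1-prefix") |
| | current-input-method |
| |      @result{} "latin-1-prefix" |
| | @end group |
| | |
| | @group |
| | (set-input-method nil) |
| | current-input-method |
| |      @result{} nil |
| | @end group |
| | @end example |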
| 1870 | @end deffn |
| 1871 | |
| 1872 | @defun read-input-method-name prompt &optional default inhibit-null |
| 1873 | This function reads an input method name with the minibuffer, prompting |
| 1874 | with @var{prompt}. If the user enters empty input and @var{default} is |
| 1875 | non-@code{nil}, the function returns @var{default}. However, if |
| 1876 | @var{inhibit-null} is non-@code{nil}, empty input signals an error. |
| 1877 | |
| 1878 | The returned value is a string. |
| 1879 | @end defun |
| 1880 | |
| 1881 | @defvar input-method-alist |
| 1882 | This variable defines all the supported input methods. |
| 1883 | Each element defines one input method, and should have the form: |
| 1884 | |
| 1885 | @example |
| 1886 | (@var{input-method} @var{language-env} @var{activate-func} |
| 1887 | @var{title} @var{description} @var{args}...) |
| 1888 | @end example |
| 1889 | |
| 1890 | Here @var{input-method} is the input method name, a string; |
| 1891 | @var{language-env} is another string, the name of the language |
| 1892 | environment this input method is recommended for. (That serves only for |
| 1893 | documentation purposes.) |
| 1894 | |
| 1895 | @var{activate-func} is a function to call to activate this method. The |
| 1896 | @var{args}, if any, are passed as arguments to @var{activate-func}. All |
| 1897 | told, the arguments to @var{activate-func} are @var{input-method} and |
| 1898 | the @var{args}. |
| 1899 | |
| 1900 | @var{title} is a string to display in the mode line while this method is |
| 1901 | active. @var{description} is a string describing this method and what |
| 1902 | it is good for. |
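| | |
| | As a sketch of how an element can be examined, the following code |
| | looks up one input method by name and extracts some of its fields. |
| | The name @code{"latin-1-prefix"} is only an assumption about what is |
| | defined in your Emacs; the value is @code{nil} if there is no such |
| | element: |
| | |
| | @example |
| | @group |
| | (let ((entry (assoc "latin-1-prefix" input-method-alist))) |
| |   (when entry |
| |     (list (nth 1 entry)      ; language environment |
| |           (nth 3 entry)      ; mode line title |
| |           (nth 4 entry))))   ; description |
| | @end group |
| | @end example |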
| 1903 | @end defvar |
| 1904 | |
| 1905 | The fundamental interface to input methods is through the |
| 1906 | variable @code{input-method-function}. @xref{Reading One Event}, |
| 1907 | and @ref{Invoking the Input Method}. |
| 1908 | |
| 1909 | @node Locales |
| 1910 | @section Locales |
| 1911 | @cindex locale |
| 1912 | |
| 1913 | POSIX defines the concept of ``locales'', which determine the language |
| 1914 | used by language-related features. The following Emacs variables control |
| 1915 | how Emacs interacts with these features. |
| 1916 | |
| 1917 | @defvar locale-coding-system |
| 1918 | @cindex keyboard input decoding on X |
| 1919 | This variable specifies the coding system to use for decoding system |
| 1920 | error messages and (on the X Window System only) keyboard input, for |
| 1921 | encoding the format argument to @code{format-time-string}, and for |
| 1922 | decoding the return value of @code{format-time-string}. |
| 1923 | @end defvar |
| 1924 | |
| 1925 | @defvar system-messages-locale |
| 1926 | This variable specifies the locale to use for generating system error |
| 1927 | messages. Changing the locale can cause messages to come out in a |
| 1928 | different language or in a different orthography. If the variable is |
| 1929 | @code{nil}, the locale is specified by environment variables in the |
| 1930 | usual POSIX fashion. |
| 1931 | @end defvar |
| 1932 | |
| 1933 | @defvar system-time-locale |
| 1934 | This variable specifies the locale to use for formatting time values. |
| 1935 | Changing the locale can cause time strings to appear according to the |
| 1936 | conventions of a different language. If the variable is @code{nil}, the |
| 1937 | locale is specified by environment variables in the usual POSIX fashion. |
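| | |
| | For example, the following sketch binds the variable temporarily |
| | around a call to @code{format-time-string}. The locale name |
| | @code{"de_DE.UTF-8"} is an assumption; it has an effect only if that |
| | locale is installed on the system: |
| | |
| | @example |
| | @group |
| | ;; Format the current date with day and month names taken |
| | ;; from the (assumed) German locale. |
| | (let ((system-time-locale "de_DE.UTF-8")) |
| |   (format-time-string "%A, %d. %B %Y")) |
| | @end group |
| | @end example |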
| 1938 | @end defvar |
| 1939 | |
| 1940 | @defun locale-info item |
| 1941 | This function returns locale data @var{item} for the current POSIX |
| 1942 | locale, if available. @var{item} should be one of these symbols: |
| 1943 | |
| 1944 | @table @code |
| 1945 | @item codeset |
| 1946 | Return the character set as a string (locale item @code{CODESET}). |
| 1947 | |
| 1948 | @item days |
| 1949 | Return a 7-element vector of day names (locale items |
| 1950 | @code{DAY_1} through @code{DAY_7}). |
| 1951 | |
| 1952 | @item months |
| 1953 | Return a 12-element vector of month names (locale items @code{MON_1} |
| 1954 | through @code{MON_12}). |
| 1955 | |
| 1956 | @item paper |
| 1957 | Return a list @code{(@var{width} @var{height})} for the default paper |
| 1958 | size measured in millimeters (locale items @code{PAPER_WIDTH} and |
| 1959 | @code{PAPER_HEIGHT}). |
| 1960 | @end table |
| 1961 | |
| 1962 | If the system can't provide the requested information, or if |
| 1963 | @var{item} is not one of those symbols, the value is @code{nil}. All |
| 1964 | strings in the return value are decoded using |
| 1965 | @code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual}, |
| 1966 | for more information about locales and locale items. |
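| | |
| | The following calls sketch typical usage. The values mentioned in |
| | the comments are only illustrations, since the actual results depend |
| | on the system and the current locale: |
| | |
| | @example |
| | @group |
| | (locale-info 'codeset)   ; for instance, "UTF-8" in a UTF-8 locale |
| | (locale-info 'days)      ; a 7-element vector of day names |
| | (locale-info 'paper)     ; for instance, (210 297) where A4 is usual |
| | |
| | ;; A symbol that is not listed above yields nil. |
| | (locale-info 'unknown) |
| |      @result{} nil |
| | @end group |
| | @end example |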
| 1967 | @end defun |