HCoop Git - bpt/emacs.git/blame - doc/emacs/mule.texi

Commit	Line	Data
8cf51b2c	1	@c This is part of the Emacs manual.
ba318903	2	@c Copyright (C) 1997, 1999-2014 Free Software Foundation, Inc.
8cf51b2c	3	@c See file emacs.texi for copying conditions.
abb9615e	4	@node International
8cf51b2c	5	@chapter International Character Set Support
59eda47f RS	6	@c This node is referenced in the tutorial. When renaming or deleting
59eda47f RS	7	@c it, the tutorial needs to be adjusted. (TUTORIAL.de)
8cf51b2c GM	8	@cindex international scripts
	9	@cindex multibyte characters
	10	@cindex encoding of characters
	11
acc112c7 PE	12	@cindex Arabic
acc112c7 PE	13	@cindex Bengali
8cf51b2c GM	14	@cindex Chinese
8cf51b2c GM	15	@cindex Cyrillic
acc112c7	16	@cindex Han
8cf51b2c	18	@cindex Ethiopic
acc112c7	19	@cindex Georgian
8cf51b2c	20	@cindex Greek
acc112c7	21	@cindex Hangul
8cf51b2c	22	@cindex Hebrew
acc112c7	23	@cindex Hindi
8cf51b2c GM	24	@cindex IPA
	25	@cindex Japanese
	26	@cindex Korean
8cf51b2c	27	@cindex Latin
8cf51b2c	28	@cindex Thai
8cf51b2c	29	@cindex Vietnamese
8cf51b2c GM	30	Emacs supports a wide variety of international character sets,
8cf51b2c GM	31	including European and Vietnamese variants of the Latin alphabet, as
acc112c7 PE	32	well as Arabic scripts, Brahmic scripts (for languages such as
	33	Bengali, Hindi, and Thai), Cyrillic, Ethiopic, Georgian, Greek, Han
	34	(for Chinese and Japanese), Hangul (for Korean), Hebrew and IPA@.
8edb942b	35	Emacs also supports various encodings of these characters that are used by
8cf51b2c GM	36	other internationalized software, such as word processors and mailers.
	37
	38	Emacs allows editing text with international characters by supporting
	39	all the related activities:
	40
	41	@itemize @bullet
	42	@item
	43	You can visit files with non-@acronym{ASCII} characters, save non-@acronym{ASCII} text, and
	44	pass non-@acronym{ASCII} text between Emacs and programs it invokes (such as
	45	compilers, spell-checkers, and mailers). Setting your language
	46	environment (@pxref{Language Environments}) takes care of setting up the
	47	coding systems and other options for a specific language or culture.
	48	Alternatively, you can specify how Emacs should encode or decode text
	49	for each command; see @ref{Text Coding}.
	50
	51	@item
	52	You can display non-@acronym{ASCII} characters belonging to the various
	53	scripts. This works by using appropriate fonts on graphics displays
0be641c0	54	(@pxref{Defining Fontsets}), and by sending special codes to text
8cf51b2c GM	55	displays (@pxref{Terminal Coding}). If some characters are displayed
	56	incorrectly, refer to @ref{Undisplayable Characters}, which describes
	57	possible problems and explains how to solve them.
	58
f4b6ba46 EZ	59	@item
	60	Characters from scripts whose natural ordering of text is from right
	61	to left are reordered for display (@pxref{Bidirectional Editing}).
	62	These scripts include Arabic, Hebrew, Syriac, Thaana, and a few
	63	others.
	64
8cf51b2c GM	65	@item
	66	You can insert non-@acronym{ASCII} characters or search for them. To do that,
	67	you can specify an input method (@pxref{Select Input Method}) suitable
8edb942b	68	for your language, or use the default input method set up when you chose
8cf51b2c GM	69	your language environment. If
	70	your keyboard can produce non-@acronym{ASCII} characters, you can select an
	71	appropriate keyboard coding system (@pxref{Terminal Coding}), and Emacs
	72	will accept those characters. Latin-1 characters can also be input by
	73	using the @kbd{C-x 8} prefix; see @ref{Unibyte Mode}.
	74
8edb942b	75	With the X Window System, your locale should be set to an appropriate
50b063c3	76	value to make sure Emacs interprets keyboard input correctly; see
8cf51b2c GM	77	@ref{Language Environments, locales}.
	78	@end itemize
	79
	80	The rest of this chapter describes these issues in detail.
	81
	82	@menu
	83	* International Chars:: Basic concepts of multibyte characters.
8cf51b2c GM	84	* Language Environments:: Setting things up for the language you use.
	85	* Input Methods:: Entering text characters not on your keyboard.
	86	* Select Input Method:: Specifying your choice of input methods.
8cf51b2c GM	87	* Coding Systems:: Character set conversion when you read and
	88	write files, and so on.
	89	* Recognize Coding:: How Emacs figures out which conversion to use.
	90	* Specify Coding:: Specifying a file's coding system explicitly.
	91	* Output Coding:: Choosing coding systems for output.
	92	* Text Coding:: Choosing conversion to use for file text.
	93	* Communication Coding:: Coding systems for interprocess communication.
	94	* File Name Coding:: Coding systems for file @emph{names}.
	95	* Terminal Coding:: Specifying coding systems for converting
	96	terminal input and output.
	97	* Fontsets:: Fontsets are collections of fonts
	98	that cover the whole spectrum of characters.
	99	* Defining Fontsets:: Defining a new fontset.
70bb6cac	100	* Modifying Fontsets:: Modifying an existing fontset.
8cf51b2c GM	101	* Undisplayable Characters:: When characters don't display.
	102	* Unibyte Mode:: You can pick one European character set
	103	to use without multibyte characters.
	104	* Charsets:: How Emacs groups its internal character codes.
f4b6ba46	105	* Bidirectional Editing:: Support for right-to-left scripts.
8cf51b2c GM	106	@end menu
	107
	108	@node International Chars
	109	@section Introduction to International Character Sets
	110
	111	The users of international character sets and scripts have
	112	established many more-or-less standard coding systems for storing
ad36c422 CY	113	files. These coding systems are typically @dfn{multibyte}, meaning
	114	that sequences of two or more bytes are used to represent individual
	115	non-@acronym{ASCII} characters.
	116
	117	@cindex Unicode
	118	Internally, Emacs uses its own multibyte character encoding, which
	119	is a superset of the @dfn{Unicode} standard. This internal encoding
	120	allows characters from almost every known script to be intermixed in a
	121	single buffer or string. Emacs translates between the multibyte
	122	character encoding and various other coding systems when reading and
	123	writing files, and when exchanging data with subprocesses.
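
		As a rough illustration of this translation, the following sketch
		encodes the Greek letter alpha (U+03B1) with two coding systems and
		decodes it again. The choice of character and of the coding systems
		@code{utf-8} and @code{iso-8859-7} is ours, made only for the example:

		@lisp
		;; Encode one character and list the resulting bytes.
		(append (encode-coding-string (string #x3B1) 'utf-8) nil)
		     @result{} (206 177)          ; #xCE #xB1
		(append (encode-coding-string (string #x3B1) 'iso-8859-7) nil)
		     @result{} (225)              ; #xE1
		;; Decoding the UTF-8 bytes yields the same character code again.
		(aref (decode-coding-string "\316\261" 'utf-8) 0)
		     @result{} 945                ; #x3B1
		@end lisp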
8cf51b2c GM	124
	125	@kindex C-h h
	126	@findex view-hello-file
	127	@cindex undisplayable characters
	128	@cindex @samp{?} in display
	129	The command @kbd{C-h h} (@code{view-hello-file}) displays the file
66ecdc9e GM	130	@file{etc/HELLO}, which illustrates various scripts by showing
66ecdc9e GM	131	how to say ``hello'' in many languages. If some characters can't be
8cf51b2c GM	132	displayed on your terminal, they appear as @samp{?} or as hollow boxes
	133	(@pxref{Undisplayable Characters}).
	134
ad36c422 CY	135	Keyboards, even in the countries where these character sets are
	136	used, generally don't have keys for all the characters in them. You
	137	can insert characters that your keyboard does not support, using
	138	@kbd{C-q} (@code{quoted-insert}) or @kbd{C-x 8 @key{RET}}
9ea10cc3	139	(@code{insert-char}). @xref{Inserting Text}. Emacs also supports
ad36c422 CY	140	various @dfn{input methods}, typically one for each script or
	141	language, which make it easier to type characters in the script.
	142	@xref{Input Methods}.
8cf51b2c GM	143
	144	@kindex C-x RET
	145	The prefix key @kbd{C-x @key{RET}} is used for commands that pertain
	146	to multibyte characters, coding systems, and input methods.
	147
8087d399 CY	148	@kindex C-x =
	149	@findex what-cursor-position
	150	The command @kbd{C-x =} (@code{what-cursor-position}) shows
	151	information about the character at point. In addition to the
	152	character position, which was described in @ref{Position Info}, this
	153	command displays how the character is encoded. For instance, it
	154	displays the following line in the echo area for the character
	155	@samp{c}:
	156
	157	@smallexample
	158	Char: c (99, #o143, #x63) point=28062 of 36168 (78%) column=53
	159	@end smallexample
	160
	161	The four values after @samp{Char:} describe the character that
	162	follows point, first by showing it and then by giving its character
	163	code in decimal, octal and hex. For a non-@acronym{ASCII} multibyte
	164	character, these are followed by @samp{file} and the character's
	165	representation, in hex, in the buffer's coding system, if that coding
	166	system encodes the character safely and with a single byte
	167	(@pxref{Coding Systems}). If the character's encoding is longer than
	168	one byte, Emacs shows @samp{file ...}.
	169
ad36c422 CY	170	As a special case, if the character lies in the range 128 (0200
	171	octal) through 159 (0237 octal), it stands for a ``raw'' byte that
	172	does not correspond to any specific displayable character. Such a
	173	``character'' lies within the @code{eight-bit-control} character set,
	174	and is displayed as an escaped octal character code. In this case,
	175	@kbd{C-x =} shows @samp{part of display ...} instead of @samp{file}.
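
		The three codes shown by @kbd{C-x =} can also be computed in Lisp. In
		this sketch, the use of @code{format} and @code{encode-coding-char},
		and the choice of coding systems, are our own illustration rather than
		a description of what the command does internally:

		@lisp
		;; Decimal, octal and hex codes of the character `c'.
		(format "%d, #o%o, #x%x" ?c ?c ?c)
		     @result{} "99, #o143, #x63"
		;; U+00EA needs one byte in Latin-1 but two bytes in UTF-8.
		(append (encode-coding-char #xEA 'iso-8859-1) nil)
		     @result{} (234)
		(append (encode-coding-char #xEA 'utf-8) nil)
		     @result{} (195 170)          ; #xC3 #xAA
		@end lisp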
8087d399 CY	176
	177	@cindex character set of character at point
	178	@cindex font of character at point
	179	@cindex text properties at point
	180	@cindex face at point
	181	With a prefix argument (@kbd{C-u C-x =}), this command displays a
	182	detailed description of the character in a window:
	183
	184	@itemize @bullet
	185	@item
	186	The character set name, and the codes that identify the character
	187	within that character set; @acronym{ASCII} characters are identified
	188	as belonging to the @code{ascii} character set.
	189
	190	@item
7195b841	191	The character's script, syntax and categories.
8087d399 CY	192
	193	@item
	194	What keys to type to input the character in the current input method
	195	(if it supports the character).
	196
7195b841 PE	197	@item
	198	The character's encodings, both internally in the buffer, and externally
	199	if you were to save the file.
	200
8087d399 CY	201	@item
8087d399 CY	202	If you are running Emacs on a graphical display, the font name and
0be641c0	203	glyph code for the character. If you are running Emacs on a text
8087d399 CY	204	terminal, the code(s) sent to the terminal.
	205
	206	@item
	207	The character's text properties (@pxref{Text Properties,,,
	208	elisp, the Emacs Lisp Reference Manual}), including any non-default
	209	faces used to display the character, and any overlays containing it
	210	(@pxref{Overlays,,, elisp, the same manual}).
	211	@end itemize
	212
7195b841	213	Here's an example, with some lines folded to fit into this manual:
8087d399 CY	214
8087d399 CY	215	@smallexample
8edb942b	216	position: 1 of 1 (0%), column: 0
7195b841	217	character: @^e (displayed as @^e) (codepoint 234, #o352, #xea)
8edb942b	218	preferred charset: unicode (Unicode (ISO10646))
7195b841 PE	219	code point in charset: 0xEA
	220	script: latin
	221	syntax: w which means: word
	222	category: .:Base, L:Left-to-right (strong), c:Chinese,
8edb942b	223	j:Japanese, l:Latin, v:Viet
7195b841 PE	224	to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
	225	buffer code: #xC3 #xAA
	226	file code: #xC3 #xAA (encoded by coding system utf-8-unix)
8edb942b	227	display: by this font (glyph code)
ae742cb5	228	xft:-unknown-DejaVu Sans Mono-normal-normal-
7195b841	229	normal--15----m-0-iso10646-1 (#xAC)
8087d399 CY	230
8087d399 CY	231	Character code properties: customize what to show
7195b841 PE	232	name: LATIN SMALL LETTER E WITH CIRCUMFLEX
	233	old-name: LATIN SMALL LETTER E CIRCUMFLEX
	234	general-category: Ll (Letter, Lowercase)
	235	decomposition: (101 770) ('e' '^')
8087d399 CY	236	@end smallexample
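
		Several of the character code properties listed in such a description
		can also be queried from Lisp with @code{get-char-code-property}. The
		sketch below is only an illustration; the values shown are the ones
		that appear in the sample output above:

		@lisp
		(get-char-code-property #xEA 'name)
		     @result{} "LATIN SMALL LETTER E WITH CIRCUMFLEX"
		(get-char-code-property #xEA 'general-category)
		     @result{} Ll
		(get-char-code-property #xEA 'decomposition)
		     @result{} (101 770)
		@end lisp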
8087d399 CY	237
8cf51b2c GM	238	@node Language Environments
	239	@section Language Environments
	240	@cindex language environments
	241
	242	Emacs buffers can hold characters from every supported character set
	243	whenever multibyte characters are enabled; there is no need to select a
e0550cae GM	244	particular language in order to display its characters.
e0550cae GM	245	However, it is important to select a @dfn{language
ad36c422 CY	246	environment} in order to set various defaults. Roughly speaking, the
	247	language environment represents a choice of preferred script rather
	248	than a choice of language.
8cf51b2c GM	249
	250	The language environment controls which coding systems to recognize
	251	when reading text (@pxref{Recognize Coding}). This applies to files,
ad36c422 CY	252	incoming mail, and any other text you read into Emacs. It may also
	253	specify the default coding system to use when you create a file. Each
	254	language environment also specifies a default input method.
8cf51b2c GM	255
	256	@findex set-language-environment
	257	@vindex current-language-environment
ae742cb5	258	To select a language environment, customize
8cf51b2c GM	259	@code{current-language-environment} or use the command @kbd{M-x
8cf51b2c GM	260	set-language-environment}. It makes no difference which buffer is
ad36c422	261	current when you use this command, because the effects apply globally
acc112c7 PE	262	to the Emacs session. See the variable @code{language-info-alist} for
	263	the list of supported language environments, and use the command
	264	@kbd{C-h L @var{lang-env} @key{RET}} (@code{describe-language-environment})
	265	for more information about the language environment @var{lang-env}.
	266	Supported language environments include:
8cf51b2c	267
8cf51b2c	268	@quotation
acc112c7 PE	269	@cindex ASCII
	270	ASCII,
	271	@cindex Arabic
	272	Arabic,
	273	@cindex Belarusian
	274	Belarusian,
	275	@cindex Bengali
	276	Bengali,
	277	@cindex Brazilian Portuguese
	278	Brazilian Portuguese,
	279	@cindex Bulgarian
	280	Bulgarian,
	281	@cindex Burmese
	282	Burmese,
	283	@cindex Cham
	284	Cham,
	285	@cindex Chinese
	286	Chinese-BIG5, Chinese-CNS, Chinese-EUC-TW, Chinese-GB,
	287	Chinese-GB18030, Chinese-GBK,
	288	@cindex Croatian
	289	Croatian,
	290	@cindex Cyrillic
	291	Cyrillic-ALT, Cyrillic-ISO, Cyrillic-KOI8,
	292	@cindex Czech
	293	Czech,
	294	@cindex Devanagari
	295	Devanagari,
	296	@cindex Dutch
	297	Dutch,
	298	@cindex English
	299	English,
	300	@cindex Esperanto
	301	Esperanto,
	302	@cindex Ethiopic
	303	Ethiopic,
	304	@cindex French
	305	French,
	306	@cindex Georgian
	307	Georgian,
	308	@cindex German
	309	German,
	310	@cindex Greek
	311	Greek,
	312	@cindex Gujarati
	313	Gujarati,
	314	@cindex Hebrew
	315	Hebrew,
	316	@cindex IPA
	317	IPA,
	318	@cindex Italian
	319	Italian,
	320	@cindex Japanese
	321	Japanese,
	322	@cindex Kannada
	323	Kannada,
	324	@cindex Khmer
	325	Khmer,
	326	@cindex Korean
	327	Korean,
	328	@cindex Lao
	329	Lao,
	330	@cindex Latin
	331	Latin-1, Latin-2, Latin-3, Latin-4, Latin-5, Latin-6, Latin-7,
	332	Latin-8, Latin-9,
	333	@cindex Latvian
	334	Latvian,
	335	@cindex Lithuanian
	336	Lithuanian,
	337	@cindex Malayalam
	338	Malayalam,
	339	@cindex Oriya
	340	Oriya,
	341	@cindex Persian
	342	Persian,
	343	@cindex Polish
	344	Polish,
	345	@cindex Punjabi
	346	Punjabi,
	347	@cindex Romanian
	348	Romanian,
	349	@cindex Russian
	350	Russian,
	351	@cindex Sinhala
	352	Sinhala,
	353	@cindex Slovak
	354	Slovak,
	355	@cindex Slovenian
	356	Slovenian,
	357	@cindex Spanish
	358	Spanish,
	359	@cindex Swedish
	360	Swedish,
	361	@cindex TaiViet
	362	TaiViet,
	363	@cindex Tajik
	364	Tajik,
	365	@cindex Tamil
	366	Tamil,
	367	@cindex Telugu
	368	Telugu,
	369	@cindex Thai
	370	Thai,
	371	@cindex Tibetan
	372	Tibetan,
	373	@cindex Turkish
	374	Turkish,
	375	@cindex UTF-8
	376	UTF-8,
	377	@cindex Ukrainian
	378	Ukrainian,
	379	@cindex Vietnamese
	380	Vietnamese,
	381	@cindex Welsh
	382	Welsh, and
	383	@cindex Windows-1255
	384	Windows-1255.
8cf51b2c GM	385	@end quotation
8cf51b2c GM	386
8cf51b2c	387	To display the script(s) used by your language environment on a
05806f43	388	graphical display, you need to have suitable fonts.
8cf51b2c GM	389	@xref{Fontsets}, for more details about setting up your fonts.
	390
	391	@findex set-locale-environment
	392	@vindex locale-language-names
	393	@vindex locale-charset-language-names
	394	@cindex locales
	395	Some operating systems let you specify the character-set locale you
	396	are using by setting the locale environment variables @env{LC_ALL},
e0550cae	397	@env{LC_CTYPE}, or @env{LANG}. (If more than one of these is
8cf51b2c	398	set, the first one that is nonempty specifies your locale for this
e0550cae	399	purpose.) During startup, Emacs looks up your character-set locale's
8cf51b2c GM	400	name in the system locale alias table, matches its canonical name
8cf51b2c GM	401	against entries in the value of the variables
e0550cae GM	402	@code{locale-charset-language-names} and @code{locale-language-names}
e0550cae GM	403	(the former overrides the latter),
8cf51b2c	404	and selects the corresponding language environment if a match is found.
e0550cae	405	It also adjusts the display
8cf51b2c GM	406	table and terminal coding system, the locale coding system, the
	407	preferred coding system as needed for the locale, and---last but not
	408	least---the way Emacs decodes non-@acronym{ASCII} characters sent by your keyboard.
	409
e0550cae	410	@c This seems unlikely, doesn't it?
8cf51b2c	411	If you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG}
e0550cae GM	412	environment variables while running Emacs (by using @kbd{M-x setenv}),
	413	you may want to invoke the @code{set-locale-environment}
	414	function afterwards to readjust the language environment from the new
	415	locale.
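
		In Lisp, that sequence might look like the following sketch. The
		locale name @samp{ja_JP.UTF-8} is only an example, and the call relies
		on @code{set-locale-environment} reading the environment variables
		when its argument is @code{nil}:

		@lisp
		;; Change the locale seen by Emacs, then recompute the language
		;; environment from LC_ALL, LC_CTYPE, or LANG.
		(setenv "LANG" "ja_JP.UTF-8")
		(set-locale-environment nil)
		@end lisp

		@noindent
		If @env{LC_ALL} or @env{LC_CTYPE} is also set, it takes precedence
		over @env{LANG}, so changing @env{LANG} alone may have no effect.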
8cf51b2c GM	416
	417	@vindex locale-preferred-coding-systems
	418	The @code{set-locale-environment} function normally uses the preferred
	419	coding system established by the language environment to decode system
	420	messages. But if your locale matches an entry in the variable
	421	@code{locale-preferred-coding-systems}, Emacs uses the corresponding
	422	coding system instead. For example, if the locale @samp{ja_JP.PCK}
	423	matches @code{japanese-shift-jis} in
	424	@code{locale-preferred-coding-systems}, Emacs uses that encoding even
	425	though it might normally use @code{japanese-iso-8bit}.
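
		As a hedged sketch, assuming that
		@code{locale-preferred-coding-systems} is an alist of
		@code{(@var{regexp} . @var{coding-system})} pairs, as its default
		value suggests, you could add an entry of your own. The regular
		expression and the coding system below are hypothetical choices made
		for illustration:

		@lisp
		;; Prefer Big5 for locales whose names look like zh_TW.big5.
		(add-to-list 'locale-preferred-coding-systems
		             '("zh_TW.*[._]big5" . chinese-big5))
		@end lisp

		@noindent
		An entry added this way is consulted only when
		@code{set-locale-environment} runs afterwards.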
	426
	427	You can override the language environment chosen at startup with
	428	explicit use of the command @code{set-language-environment}, or with
	429	customization of @code{current-language-environment} in your init
	430	file.
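
		For instance, a minimal init-file sketch might read as follows; the
		environment name @samp{Japanese} is just an example taken from the
		list of supported language environments:

		@lisp
		;; Select the language environment explicitly, overriding
		;; whatever was derived from the locale at startup.
		(set-language-environment "Japanese")
		current-language-environment
		     @result{} "Japanese"
		@end lisp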
	431
	432	@kindex C-h L
	433	@findex describe-language-environment
	434	To display information about the effects of a certain language
	435	environment @var{lang-env}, use the command @kbd{C-h L @var{lang-env}
	436	@key{RET}} (@code{describe-language-environment}). This tells you
	437	which languages this language environment is useful for, and lists the
	438	character sets, coding systems, and input methods that go with it. It
	439	also shows some sample text to illustrate scripts used in this
	440	language environment. If you give an empty input for @var{lang-env},
	441	this command describes the chosen language environment.
	442
	443	@vindex set-language-environment-hook
	444	You can customize any language environment with the normal hook
	445	@code{set-language-environment-hook}. The command
	446	@code{set-language-environment} runs that hook after setting up the new
	447	language environment. The hook functions can test for a specific
	448	language environment by checking the variable
	449	@code{current-language-environment}. This hook is where you should
e0550cae	450	put non-default settings for specific language environments, such as
8cf51b2c GM	451	coding systems for keyboard input and terminal output, the default
	452	input method, etc.
	453
	454	@vindex exit-language-environment-hook
	455	Before it starts to set up the new language environment,
	456	@code{set-language-environment} first runs the hook
	457	@code{exit-language-environment-hook}. This hook is useful for undoing
	458	customizations that were made with @code{set-language-environment-hook}.
	459	For instance, if you set up a special key binding in a specific language
	460	environment using @code{set-language-environment-hook}, you should set
	461	up @code{exit-language-environment-hook} to restore the normal binding
	462	for that key.
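
		The two hooks could be paired as in the sketch below. All of the
		@samp{my-} names and the choice of the @key{F9} key are hypothetical.
		The sketch also assumes that @code{current-language-environment}
		still names the old environment while
		@code{exit-language-environment-hook} runs:

		@lisp
		(defvar my-saved-f9-binding nil
		  "Global binding of F9 before the German setup changed it.")

		(defun my-insert-sharp-s ()
		  "Insert a German sharp s (U+00DF)."
		  (interactive)
		  (insert #xDF))

		(defun my-german-setup ()
		  "Bind F9 while the German language environment is in use."
		  (when (equal current-language-environment "German")
		    (setq my-saved-f9-binding (global-key-binding (kbd "<f9>")))
		    (global-set-key (kbd "<f9>") 'my-insert-sharp-s)))

		(defun my-german-exit ()
		  "Restore the previous binding of F9 when leaving German."
		  (when (equal current-language-environment "German")
		    (global-set-key (kbd "<f9>") my-saved-f9-binding)))

		(add-hook 'set-language-environment-hook 'my-german-setup)
		(add-hook 'exit-language-environment-hook 'my-german-exit)
		@end lisp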
	463
	464	@node Input Methods
	465	@section Input Methods
	466
	467	@cindex input methods
	468	An @dfn{input method} is a kind of character conversion designed
	469	specifically for interactive input. In Emacs, typically each language
893585f4	470	has its own input method; sometimes several languages that use the same
8cf51b2c GM	471	characters can share one input method. A few languages support several
	472	input methods.
	473
	474	The simplest kind of input method works by mapping @acronym{ASCII} letters
	475	into another alphabet; this allows you to use one other alphabet
	476	instead of @acronym{ASCII}. The Greek and Russian input methods
	477	work this way.
	478
	479	A more powerful technique is composition: converting sequences of
	480	characters into one letter. Many European input methods use composition
	481	to produce a single non-@acronym{ASCII} letter from a sequence that consists of a
	482	letter followed by accent characters (or vice versa). For example, some
893585f4	483	methods convert the sequence @kbd{o ^} into a single accented letter.
8cf51b2c GM	484	These input methods have no special commands of their own; all they do
	485	is compose sequences of printing characters.
	486
	487	The input methods for syllabic scripts typically use mapping followed
	488	by composition. The input methods for Thai and Korean work this way.
	489	First, letters are mapped into symbols for particular sounds or tone
893585f4	490	marks; then, sequences of these that make up a whole syllable are
8cf51b2c GM	491	mapped into one syllable sign.
	492
	493	Chinese and Japanese require more complex methods. In Chinese input
	494	methods, first you enter the phonetic spelling of a Chinese word (in
	495	input method @code{chinese-py}, among others), or a sequence of
	496	portions of the character (input methods @code{chinese-4corner} and
	497	@code{chinese-sw}, and others). One input sequence typically
	498	corresponds to many possible Chinese characters. You select the one
	499	you mean using keys such as @kbd{C-f}, @kbd{C-b}, @kbd{C-n},
893585f4 GM	500	@kbd{C-p} (or the arrow keys), and digits, which have special meanings
893585f4 GM	501	in this situation.
8cf51b2c GM	502
	503	The possible characters are conceptually arranged in several rows,
	504	with each row holding up to 10 alternatives. Normally, Emacs displays
	505	just one row at a time, in the echo area; @code{(@var{i}/@var{j})}
	506	appears at the beginning, to indicate that this is the @var{i}th row
	507	out of a total of @var{j} rows. Type @kbd{C-n} or @kbd{C-p} to
	508	display the next row or the previous row.
	509
	510	Type @kbd{C-f} and @kbd{C-b} to move forward and backward among
	511	the alternatives in the current row. As you do this, Emacs highlights
	512	the current alternative with a special color; type @kbd{C-@key{SPC}}
	513	to select the current alternative and use it as input. The
	514	alternatives in the row are also numbered; the number appears before
893585f4 GM	515	the alternative. Typing a number selects the associated alternative
893585f4 GM	516	of the current row and uses it as input.
8cf51b2c GM	517
	518	@key{TAB} in these Chinese input methods displays a buffer showing
	519	all the possible characters at once; then clicking @kbd{Mouse-2} on
	520	one of them selects that alternative. The keys @kbd{C-f}, @kbd{C-b},
	521	@kbd{C-n}, @kbd{C-p}, and digits continue to work as usual, but they
	522	do the highlighting in the buffer showing the possible characters,
	523	rather than in the echo area.
	524
	525	In Japanese input methods, first you input a whole word using
	526	phonetic spelling; then, after the word is in the buffer, Emacs
	527	converts it into one or more characters using a large dictionary. One
	528	phonetic spelling corresponds to a number of different Japanese words;
	529	to select one of them, use @kbd{C-n} and @kbd{C-p} to cycle through
	530	the alternatives.
	531
	532	Sometimes it is useful to cut off input method processing so that the
	533	characters you have just entered will not combine with subsequent
	534	characters. For example, in input method @code{latin-1-postfix}, the
893585f4	535	sequence @kbd{o ^} combines to form an @samp{o} with an accent. What if
8cf51b2c GM	536	you want to enter them as separate characters?
	537
	538	One way is to type the accent twice; this is a special feature for
893585f4 GM	539	entering the separate letter and accent. For example, @kbd{o ^ ^} gives
	540	you the two characters @samp{o^}. Another way is to type another letter
	541	after the @kbd{o}---something that won't combine with that---and
	542	immediately delete it. For example, you could type @kbd{o o @key{DEL}
	543	^} to get separate @samp{o} and @samp{^}.
8cf51b2c GM	544
	545	Another method, more general but not quite as easy to type, is to use
	546	@kbd{C-\ C-\} between two characters to stop them from combining. This
	547	is the command @kbd{C-\} (@code{toggle-input-method}) used twice.
	548	@ifnottex
	549	@xref{Select Input Method}.
	550	@end ifnottex
	551
	552	@cindex incremental search, input method interference
	553	@kbd{C-\ C-\} is especially useful inside an incremental search,
	554	because it stops waiting for more characters to combine, and starts
	555	searching for what you have already entered.
	556
	557	To find out how to input the character after point using the current
	558	input method, type @kbd{C-u C-x =}. @xref{Position Info}.
	559
	560	@vindex input-method-verbose-flag
	561	@vindex input-method-highlight-flag
	562	The variables @code{input-method-highlight-flag} and
	563	@code{input-method-verbose-flag} control how input methods explain
	564	what is happening. If @code{input-method-highlight-flag} is
	565	non-@code{nil}, the partial sequence is highlighted in the buffer (for
	566	most input methods---some disable this feature). If
	567	@code{input-method-verbose-flag} is non-@code{nil}, the list of
	568	possible characters to type next is displayed in the echo area (but
	569	not when you are in the minibuffer).
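
		For example, to make input methods as quiet as possible, you might
		put something like this sketch in your init file (turning both
		flags off is simply one possible preference):

		@lisp
		;; Do not highlight the partial input sequence in the buffer.
		(setq input-method-highlight-flag nil)
		;; Do not list the possible next characters in the echo area.
		(setq input-method-verbose-flag nil)
		@end lisp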
	570
ce79424f	571	Another way to type characters that are not on your keyboard is to
9ea10cc3	572	use @kbd{C-x 8 @key{RET}} (@code{insert-char}) to insert a single
ce79424f EZ	573	character based on its Unicode name or code-point; see @ref{Inserting
	574	Text}.
	575
8cf51b2c GM	576	@node Select Input Method
	577	@section Selecting an Input Method
	578
	579	@table @kbd
	580	@item C-\
71cd7772	581	Enable or disable use of the selected input method (@code{toggle-input-method}).
8cf51b2c GM	582
8cf51b2c GM	583	@item C-x @key{RET} C-\ @var{method} @key{RET}
71cd7772	584	Select a new input method for the current buffer (@code{set-input-method}).
8cf51b2c GM	585
	586	@item C-h I @var{method} @key{RET}
	587	@itemx C-h C-\ @var{method} @key{RET}
	588	@findex describe-input-method
	589	@kindex C-h I
	590	@kindex C-h C-\
	591	Describe the input method @var{method} (@code{describe-input-method}).
	592	By default, it describes the current input method (if any). This
	593	description should give you the full details of how to use any
	594	particular input method.
	595
	596	@item M-x list-input-methods
	597	Display a list of all the supported input methods.
	598	@end table
	599
	600	@findex set-input-method
	601	@vindex current-input-method
	602	@kindex C-x RET C-\
	603	To choose an input method for the current buffer, use @kbd{C-x
	604	@key{RET} C-\} (@code{set-input-method}). This command reads the
	605	input method name from the minibuffer; the name normally starts with the
	606	language environment that it is meant to be used with. The variable
	607	@code{current-input-method} records which input method is selected.
	608
	609	@findex toggle-input-method
	610	@kindex C-\
	611	Input methods use various sequences of @acronym{ASCII} characters to
	612	stand for non-@acronym{ASCII} characters. Sometimes it is useful to
	613	turn off the input method temporarily. To do this, type @kbd{C-\}
	614	(@code{toggle-input-method}). To reenable the input method, type
	615	@kbd{C-\} again.
	616
	617	If you type @kbd{C-\} and you have not yet selected an input method,
05f7d0d3	618	it prompts you to specify one. This has the same effect as using
8cf51b2c GM	619	@kbd{C-x @key{RET} C-\} to specify an input method.
	620
	621	When invoked with a numeric argument, as in @kbd{C-u C-\},
	622	@code{toggle-input-method} always prompts you for an input method,
	623	suggesting the most recently selected one as the default.
	624
	625	@vindex default-input-method
	626	Selecting a language environment specifies a default input method for
	627	use in various buffers. When you have a default input method, you can
	628	select it in the current buffer by typing @kbd{C-\}. The variable
	629	@code{default-input-method} specifies the default input method
	630	(@code{nil} means there is none).
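
		You can also set this variable yourself. In the following sketch the
		method name @code{latin-1-postfix} is just an example; with this
		setting, @kbd{C-\} in a buffer where no input method has been
		selected yet should turn on that method:

		@lisp
		(setq default-input-method "latin-1-postfix")
		@end lisp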
	631
	632	In some language environments, which support several different input
	633	methods, you might want to use an input method different from the
	634	default chosen by @code{set-language-environment}. You can instruct
	635	Emacs to select a different default input method for a certain
	636	language environment, if you wish, by using
	637	@code{set-language-environment-hook} (@pxref{Language Environments,
	638	set-language-environment-hook}). For example:
	639
	640	@lisp
	641	(defun my-chinese-setup ()
	642	  "Set up my private Chinese environment."
	643	  (if (equal current-language-environment "Chinese-GB")
	644	      (setq default-input-method "chinese-tonepy")))
	645	(add-hook 'set-language-environment-hook 'my-chinese-setup)
	646	@end lisp
	647
	648	@noindent
	649	This sets the default input method to be @code{chinese-tonepy}
	650	whenever you choose a Chinese-GB language environment.
	651
0cf8a906 KH	652	You can instruct Emacs to activate a certain input method
	653	automatically. For example:
	654
	655	@lisp
	656	(add-hook 'text-mode-hook
	657	  (lambda () (set-input-method "german-prefix")))
	658	@end lisp
	659
	660	@noindent
05f7d0d3	661	This automatically activates the input method ``german-prefix'' in
0cf8a906 KH	662	Text mode.
0cf8a906 KH	663
8cf51b2c GM	664	@findex quail-set-keyboard-layout
	665	Some input methods for alphabetic scripts work by (in effect)
	666	remapping the keyboard to emulate various keyboard layouts commonly used
	667	for those scripts. How to do this remapping properly depends on your
	668	actual keyboard layout. To specify which layout your keyboard has, use
	669	the command @kbd{M-x quail-set-keyboard-layout}.
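
		The same command can be called from your init file. This sketch
		assumes that a layout named @samp{jp106} is among those listed in
		the variable @code{quail-keyboard-layout-alist}; substitute the name
		that matches your own keyboard:

		@lisp
		;; Tell Quail that the physical keyboard has the jp106 layout.
		(quail-set-keyboard-layout "jp106")
		@end lisp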
	670
	671	@findex quail-show-key
	672	You can use the command @kbd{M-x quail-show-key} to show what key (or
	673	key sequence) to type in order to input the character following point,
	674	using the selected keyboard layout. The command @kbd{C-u C-x =} also
05f7d0d3	675	shows that information, in addition to other information about the
8cf51b2c GM	676	character.
	677
	678	@findex list-input-methods
ae742cb5 CY	679	@kbd{M-x list-input-methods} displays a list of all the supported
	680	input methods. The list gives information about each input method,
	681	including the string that stands for it in the mode line.
8cf51b2c	682
8cf51b2c GM	683	@node Coding Systems
	684	@section Coding Systems
	685	@cindex coding systems
	686
	687	Users of various languages have established many more-or-less standard
	688	coding systems for representing them. Emacs does not use these coding
	689	systems internally; instead, it converts from various coding systems to
	690	its own system when reading data, and converts the internal coding
	691	system to other coding systems when writing data. Conversion is
	692	possible in reading or writing files, in sending or receiving from the
	693	terminal, and in exchanging data with subprocesses.
	694
	695	Emacs assigns a name to each coding system. Most coding systems are
ad36c422 CY	696	used for one language, and the name of the coding system starts with
	697	the language name. Some coding systems are used for several
	698	languages; their names usually start with @samp{iso}. There are also
	699	special coding systems, such as @code{no-conversion}, @code{raw-text},
	700	and @code{emacs-internal}.
8cf51b2c GM	701
	702	@cindex international files from DOS/Windows systems
	703	A special class of coding systems, collectively known as
	704	@dfn{codepages}, is designed to support text encoded by MS-Windows and
	705	MS-DOS software. The names of these coding systems are
	706	@code{cp@var{nnnn}}, where @var{nnnn} is a 3- or 4-digit number of the
	707	codepage. You can use these encodings just like any other coding
	708	system; for example, to visit a file encoded in codepage 850, type
	709	@kbd{C-x @key{RET} c cp850 @key{RET} C-x C-f @var{filename}
f68eb991	710	@key{RET}}.
8cf51b2c GM	711
	712	In addition to converting various representations of non-@acronym{ASCII}
	713	characters, a coding system can perform end-of-line conversion. Emacs
	714	handles three different conventions for how to separate lines in a file:
05f7d0d3 GM	715	newline (``unix''), carriage-return linefeed (``dos''), and just
05f7d0d3 GM	716	carriage-return (``mac'').
8cf51b2c GM	717
	718	@table @kbd
	719	@item C-h C @var{coding} @key{RET}
71cd7772	720	Describe coding system @var{coding} (@code{describe-coding-system}).
8cf51b2c GM	721
	722	@item C-h C @key{RET}
	723	Describe the coding systems currently in use.
	724
	725	@item M-x list-coding-systems
	726	Display a list of all the supported coding systems.
	727	@end table
	728
	729	@kindex C-h C
	730	@findex describe-coding-system
	731	The command @kbd{C-h C} (@code{describe-coding-system}) displays
	732	information about particular coding systems, including the end-of-line
	733	conversion specified by those coding systems. You can specify a coding
	734	system name as the argument; alternatively, with an empty argument, it
	735	describes the coding systems currently selected for various purposes,
	736	both in the current buffer and as the defaults, and the priority list
	737	for recognizing coding systems (@pxref{Recognize Coding}).
	738
	739	@findex list-coding-systems
	740	To display a list of all the supported coding systems, type @kbd{M-x
	741	list-coding-systems}. The list gives information about each coding
	742	system, including the letter that stands for it in the mode line
	743	(@pxref{Mode Line}).
	744
	745	@cindex end-of-line conversion
	746	@cindex line endings
	747	@cindex MS-DOS end-of-line conversion
	748	@cindex Macintosh end-of-line conversion
	749	Each of the coding systems that appear in this list---except for
	750	@code{no-conversion}, which means no conversion of any kind---specifies
	751	how and whether to convert printing characters, but leaves the choice of
	752	end-of-line conversion to be decided based on the contents of each file.
	753	For example, if the file appears to use the sequence carriage-return
	754	linefeed to separate lines, DOS end-of-line conversion will be used.
	755
05f7d0d3	756	Each of the listed coding systems has three variants, which specify
8cf51b2c GM	757	exactly what to do for end-of-line conversion:
	758
	759	@table @code
	760	@item @dots{}-unix
	761	Don't do any end-of-line conversion; assume the file uses
	762	newline to separate lines. (This is the convention normally used
05f7d0d3	763	on Unix and GNU systems, and Mac OS X.)
8cf51b2c GM	764
	765	@item @dots{}-dos
	766	Assume the file uses carriage-return linefeed to separate lines, and do
	767	the appropriate conversion. (This is the convention normally used on
	768	Microsoft systems.@footnote{It is also specified for MIME @samp{text/*}
	769	bodies and in other network transport contexts. It is different
05f7d0d3	770	from the SGML reference syntax record-start/record-end format, which
8cf51b2c GM	771	Emacs doesn't support directly.})
	772
	773	@item @dots{}-mac
	774	Assume the file uses carriage-return to separate lines, and do the
05f7d0d3 GM	775	appropriate conversion. (This was the convention used on the
05f7d0d3 GM	776	Macintosh system prior to OS X.)
8cf51b2c GM	777	@end table
	778
	779	These variant coding systems are omitted from the
	780	@code{list-coding-systems} display for brevity, since they are entirely
	781	predictable. For example, the coding system @code{iso-latin-1} has
	782	variants @code{iso-latin-1-unix}, @code{iso-latin-1-dos} and
	783	@code{iso-latin-1-mac}.
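
		As a rough illustration (the choice of @code{iso-latin-1} is only an
		example), you can inspect which end-of-line convention a variant
		specifies from Lisp with @code{coding-system-eol-type}, which returns
		0, 1, or 2 for the unix, dos, and mac conventions respectively:

		@lisp
		;; Evaluate each form with M-: or in the *scratch* buffer.
		(coding-system-eol-type 'iso-latin-1-unix)  ; 0: newline
		(coding-system-eol-type 'iso-latin-1-dos)   ; 1: carriage-return linefeed
		(coding-system-eol-type 'iso-latin-1-mac)   ; 2: carriage-return
		@end lisp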
	784
	785	@cindex @code{undecided}, coding system
	786	The coding systems @code{unix}, @code{dos}, and @code{mac} are
	787	aliases for @code{undecided-unix}, @code{undecided-dos}, and
	788	@code{undecided-mac}, respectively. These coding systems specify only
	789	the end-of-line conversion, and leave the character code conversion to
	790	be deduced from the text itself.
	791
978ff6c5	792	@cindex @code{raw-text}, coding system
8cf51b2c	793	The coding system @code{raw-text} is good for a file which is mainly
05f7d0d3	794	@acronym{ASCII} text, but may contain byte values above 127 that are
8cf51b2c GM	795	not meant to encode non-@acronym{ASCII} characters. With
	796	@code{raw-text}, Emacs copies those byte values unchanged, and sets
	797	@code{enable-multibyte-characters} to @code{nil} in the current buffer
	798	so that they will be interpreted properly. @code{raw-text} handles
	799	end-of-line conversion in the usual way, based on the data
	800	encountered, and has the usual three variants to specify the kind of
	801	end-of-line conversion to use.
	802
978ff6c5	803	@cindex @code{no-conversion}, coding system
8cf51b2c GM	804	In contrast, the coding system @code{no-conversion} specifies no
	805	character code conversion at all---none for non-@acronym{ASCII} byte values and
	806	none for end of line. This is useful for reading or writing binary
	807	files, tar files, and other files that must be examined verbatim. It,
	808	too, sets @code{enable-multibyte-characters} to @code{nil}.
	809
	810	The easiest way to edit a file with no conversion of any kind is with
	811	the @kbd{M-x find-file-literally} command. This uses
	812	@code{no-conversion}, and also suppresses other Emacs features that
	813	might convert the file contents before you see them. @xref{Visiting}.
	814
978ff6c5	815	@cindex @code{emacs-internal}, coding system
ad36c422 CY	816	The coding system @code{emacs-internal} (or @code{utf-8-emacs},
	817	which is equivalent) means that the file contains non-@acronym{ASCII}
	818	characters stored with the internal Emacs encoding. This coding
	819	system handles end-of-line conversion based on the data encountered,
	820	and has the usual three variants to specify the kind of end-of-line
	821	conversion.
8cf51b2c GM	822
	823	@node Recognize Coding
	824	@section Recognizing Coding Systems
	825
ad36c422 CY	826	Whenever Emacs reads a given piece of text, it tries to recognize
	827	which coding system to use. This applies to files being read, output
	828	from subprocesses, text from X selections, etc. Emacs can select the
	829	right coding system automatically most of the time---once you have
	830	specified your preferences.
8cf51b2c GM	831
	832	Some coding systems can be recognized or distinguished by which byte
	833	sequences appear in the data. However, there are coding systems that
	834	cannot be distinguished, not even potentially. For example, there is no
	835	way to distinguish between Latin-1 and Latin-2; they use the same byte
	836	values with different meanings.
	837
	838	Emacs handles this situation by means of a priority list of coding
	839	systems. Whenever Emacs reads a file, if you do not specify the coding
	840	system to use, Emacs checks the data against each coding system,
	841	starting with the first in priority and working down the list, until it
	842	finds a coding system that fits the data. Then it converts the file
	843	contents assuming that they are represented in this coding system.
	844
	845	The priority list of coding systems depends on the selected language
	846	environment (@pxref{Language Environments}). For example, if you use
	847	French, you probably want Emacs to prefer Latin-1 to Latin-2; if you use
	848	Czech, you probably want Latin-2 to be preferred. This is one of the
	849	reasons to specify a language environment.
	850
	851	@findex prefer-coding-system
	852	However, you can alter the coding system priority list in detail
	853	with the command @kbd{M-x prefer-coding-system}. This command reads
	854	the name of a coding system from the minibuffer, and adds it to the
	855	front of the priority list, so that it is preferred to all others. If
	856	you use this command several times, each use adds one element to the
	857	front of the priority list.
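
		The same function can be called from your init file.  The following
		is a minimal sketch, assuming you want UTF-8 to be tried first; the
		choice of @code{utf-8} is only illustrative:

		@lisp
		;; Put utf-8 at the front of the coding system priority list.
		(prefer-coding-system 'utf-8)
		;; Return the current priority list, most-preferred first.
		(coding-system-priority-list)
		@end lisp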
	858
	859	If you use a coding system that specifies the end-of-line conversion
	860	type, such as @code{iso-8859-1-dos}, what this means is that Emacs
	861	should attempt to recognize @code{iso-8859-1} with priority, and should
	862	use DOS end-of-line conversion when it does recognize @code{iso-8859-1}.
	863
	864	@vindex file-coding-system-alist
	865	Sometimes a file name indicates which coding system to use for the
	866	file. The variable @code{file-coding-system-alist} specifies this
	867	correspondence. There is a special function
	868	@code{modify-coding-system-alist} for adding elements to this list. For
	869	example, to read and write all @samp{.txt} files using the coding system
	870	@code{chinese-iso-8bit}, you can execute this Lisp expression:
	871
	872	@smallexample
	873	(modify-coding-system-alist 'file "\\.txt\\'" 'chinese-iso-8bit)
	874	@end smallexample
	875
	876	@noindent
	877	The first argument should be @code{file}, the second argument should be
	878	a regular expression that determines which files this applies to, and
	879	the third argument says which coding system to use for these files.
	880
	881	@vindex inhibit-eol-conversion
	882	@cindex DOS-style end-of-line display
	883	Emacs recognizes which kind of end-of-line conversion to use based on
	884	the contents of the file: if it sees only carriage-returns, or only
	885	carriage-return linefeed sequences, then it chooses the end-of-line
	886	conversion accordingly. You can inhibit the automatic use of
	887	end-of-line conversion by setting the variable @code{inhibit-eol-conversion}
	888	to non-@code{nil}. If you do that, DOS-style files will be displayed
	889	with the @samp{^M} characters visible in the buffer; some people
	890	prefer this to the more subtle @samp{(DOS)} end-of-line type
	891	indication near the left edge of the mode line (@pxref{Mode Line,
	892	eol-mnemonic}).
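
		For example, a sketch of an init-file setting that makes the
		@samp{^M} characters of DOS-style files visible, assuming you want
		this for every file you visit:

		@lisp
		;; Do not convert end-of-line sequences when reading files.
		(setq inhibit-eol-conversion t)
		@end lisp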
	893
	894	@vindex inhibit-iso-escape-detection
895	@cindex escape sequences in files
896	By default, the automatic detection of coding system is sensitive to
897	escape sequences. If Emacs sees a sequence of characters that begins
898	with an escape character, and the sequence is valid as an ISO-2022
899	code, that tells Emacs to use one of the ISO-2022 encodings to decode
900	the file.
901
902	However, there may be cases in which you want to read escape sequences
903	in a file as is. In such a case, you can set the variable
904	@code{inhibit-iso-escape-detection} to non-@code{nil}. Then the code
905	detection ignores any escape sequences, and never uses an ISO-2022
906	encoding. The result is that all escape sequences become visible in
907	the buffer.
908
909	The default value of @code{inhibit-iso-escape-detection} is
910	@code{nil}. We recommend that you not change it permanently, only for
05f7d0d3	911	one specific operation. That's because some Emacs Lisp source files
8cf51b2c GM	912	in the Emacs distribution contain non-@acronym{ASCII} characters encoded in the
	913	coding system @code{iso-2022-7bit}, and they won't be
	914	decoded correctly when you visit those files if you suppress the
	915	escape sequence detection.
05f7d0d3	916	@c I count a grand total of 3 such files, so is the above really true?
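
		One way to limit the change to a single operation is to bind the
		variable temporarily with @code{let}.  This is only a sketch; the file
		name @file{example.txt} is hypothetical:

		@lisp
		;; The binding is in effect only while the file is being read.
		(let ((inhibit-iso-escape-detection t))
		  (find-file "example.txt"))
		@end lisp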
8cf51b2c GM	917
	918	@vindex auto-coding-alist
	919	@vindex auto-coding-regexp-alist
05f7d0d3 GM	920	The variables @code{auto-coding-alist} and
05f7d0d3 GM	921	@code{auto-coding-regexp-alist} are
8cf51b2c	922	the strongest way to specify the coding system for certain patterns of
05f7d0d3 GM	923	file names, or for files containing certain patterns, respectively.
05f7d0d3 GM	924	These variables even override @samp{-*-coding:-*-} tags in the file
71cd7772	925	itself (@pxref{Specify Coding}). For example, Emacs
8cf51b2c GM	926	uses @code{auto-coding-alist} for tar and archive files, to prevent it
	927	from being confused by a @samp{-*-coding:-*-} tag in a member of the
	928	archive and thinking it applies to the archive file as a whole.
05f7d0d3 GM	929	@ignore
05f7d0d3 GM	930	@c This describes old-style BABYL files, which are no longer relevant.
8cf51b2c GM	931	Likewise, Emacs uses @code{auto-coding-regexp-alist} to ensure that
8cf51b2c GM	932	RMAIL files, whose names in general don't match any particular
05f7d0d3 GM	933	pattern, are decoded correctly.
	934	@end ignore
	935
	936	@vindex auto-coding-functions
	937	Another way to specify a coding system is with the variable
	938	@code{auto-coding-functions}. For example, one of the built-in
8cf51b2c	939	@code{auto-coding-functions} detects the encoding for XML files.
05f7d0d3 GM	940	Unlike the previous two, this variable does not override any
05f7d0d3 GM	941	@samp{-*-coding:-*-} tag.
8cf51b2c	942
8cf51b2c GM	943	@node Specify Coding
	944	@section Specifying a File's Coding System
	945
	946	If Emacs recognizes the encoding of a file incorrectly, you can
313f790e CY	947	reread the file using the correct coding system with @kbd{C-x
	948	@key{RET} r} (@code{revert-buffer-with-coding-system}). This command
	949	prompts for the coding system to use. To see what coding system Emacs
	950	actually used to decode the file, look at the coding system mnemonic
	951	letter near the left edge of the mode line (@pxref{Mode Line}), or
	952	type @kbd{C-h C} (@code{describe-coding-system}).
8cf51b2c GM	953
	954	@vindex coding
	955	You can specify the coding system for a particular file in the file
	956	itself, using the @w{@samp{-*-@dots{}-*-}} construct at the beginning,
	957	or a local variables list at the end (@pxref{File Variables}). You do
	958	this by defining a value for the ``variable'' named @code{coding}.
	959	Emacs does not really have a variable @code{coding}; instead of
	960	setting a variable, this uses the specified coding system for the
	961	file. For example, @samp{-*-mode: C; coding: latin-1;-*-} specifies
	962	use of the Latin-1 coding system, as well as C mode. When you specify
	963	the coding explicitly in the file, that overrides
	964	@code{file-coding-system-alist}.
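
		As an illustration, assuming an Emacs Lisp file that should be read
		as UTF-8 (the coding system chosen here is only an example), the first
		line of the file could be:

		@lisp
		;; -*- coding: utf-8 -*-
		@end lisp

		@noindent
		or, equivalently, the file could end with a local variables list:

		@lisp
		;; Local Variables:
		;; coding: utf-8
		;; End:
		@end lisp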
	965
8cf51b2c GM	966	@node Output Coding
	967	@section Choosing Coding Systems for Output
	968
	969	@vindex buffer-file-coding-system
	970	Once Emacs has chosen a coding system for a buffer, it stores that
	971	coding system in @code{buffer-file-coding-system}. That makes it the
	972	default for operations that write from this buffer into a file, such
	973	as @code{save-buffer} and @code{write-region}. You can specify a
	974	different coding system for further file output from the buffer using
	975	@code{set-buffer-file-coding-system} (@pxref{Text Coding}).
	976
	977	You can insert any character Emacs supports into any Emacs buffer,
	978	but most coding systems can only handle a subset of these characters.
ad36c422 CY	979	Therefore, it's possible that the characters you insert cannot be
	980	encoded with the coding system that will be used to save the buffer.
	981	For example, you could visit a text file in Polish, encoded in
	982	@code{iso-8859-2}, and add some Russian words to it. When you save
8cf51b2c GM	983	that buffer, Emacs cannot use the current value of
	984	@code{buffer-file-coding-system}, because the characters you added
	985	cannot be encoded by that coding system.
	986
	987	When that happens, Emacs tries the most-preferred coding system (set
	988	by @kbd{M-x prefer-coding-system} or @kbd{M-x
ad36c422 CY	989	set-language-environment}). If that coding system can safely encode
	990	all of the characters in the buffer, Emacs uses it, and stores its
	991	value in @code{buffer-file-coding-system}. Otherwise, Emacs displays
	992	a list of coding systems suitable for encoding the buffer's contents,
	993	and asks you to choose one of those coding systems.
8cf51b2c GM	994
	995	If you insert the unsuitable characters in a mail message, Emacs
	996	behaves a bit differently. It additionally checks whether the
71cd7772	997	@c What determines this?
8cf51b2c	998	most-preferred coding system is recommended for use in MIME messages;
eceeb5fc CY	999	if not, it informs you of this fact and prompts you for another coding
	1000	system. This is so you won't inadvertently send a message encoded in
	1001	a way that your recipient's mail software will have difficulty
	1002	decoding. (You can still use an unsuitable coding system if you enter
	1003	its name at the prompt.)
8cf51b2c	1004
71cd7772	1005	@c It seems that select-message-coding-system does this.
1df7defd	1006	@c Both sendmail.el and smtpmail.el call it; i.e., smtpmail.el still
71cd7772	1007	@c obeys sendmail-coding-system.
8cf51b2c	1008	@vindex sendmail-coding-system
71cd7772	1009	When you send a mail message (@pxref{Sending Mail}),
e73c2434 CY	1010	Emacs has four different ways to determine the coding system to use
	1011	for encoding the message text. First, it tries the buffer's own value of
	1012	@code{buffer-file-coding-system}, if that is non-@code{nil}.
	1013	Otherwise, it uses the value of @code{sendmail-coding-system}, if that
	1014	is non-@code{nil}. The third way is to use the default coding system
	1015	for new files, which is controlled by your choice of language
71cd7772	1016	@c i.e., default-sendmail-coding-system
e73c2434 CY	1017	environment, if that is non-@code{nil}. If all three of these values
	1018	are @code{nil}, Emacs encodes outgoing mail using the Latin-1 coding
	1019	system.
71cd7772	1020	@c FIXME? Where does the Latin-1 default come in?
8cf51b2c GM	1021
	1022	@node Text Coding
	1023	@section Specifying a Coding System for File Text
	1024
	1025	In cases where Emacs does not automatically choose the right coding
	1026	system for a file's contents, you can use these commands to specify
	1027	one:
	1028
	1029	@table @kbd
	1030	@item C-x @key{RET} f @var{coding} @key{RET}
71cd7772 GM	1031	Use coding system @var{coding} to save or revisit the file in
71cd7772 GM	1032	the current buffer (@code{set-buffer-file-coding-system}).
8cf51b2c GM	1033
	1034	@item C-x @key{RET} c @var{coding} @key{RET}
	1035	Specify coding system @var{coding} for the immediately following
313f790e	1036	command (@code{universal-coding-system-argument}).
8cf51b2c GM	1037
8cf51b2c GM	1038	@item C-x @key{RET} r @var{coding} @key{RET}
313f790e CY	1039	Revisit the current file using the coding system @var{coding}
313f790e CY	1040	(@code{revert-buffer-with-coding-system}).
8cf51b2c GM	1041
	1042	@item M-x recode-region @key{RET} @var{right} @key{RET} @var{wrong} @key{RET}
	1043	Convert a region that was decoded using coding system @var{wrong},
	1044	decoding it using coding system @var{right} instead.
	1045	@end table
	1046
	1047	@kindex C-x RET f
	1048	@findex set-buffer-file-coding-system
	1049	The command @kbd{C-x @key{RET} f}
	1050	(@code{set-buffer-file-coding-system}) sets the file coding system for
1df7defd	1051	the current buffer (i.e., the coding system to use when saving or
cd996018 CY	1052	reverting the file). You specify which coding system using the
	1053	minibuffer. You can also invoke this command by clicking with
	1054	@kbd{Mouse-3} on the coding system indicator in the mode line
	1055	(@pxref{Mode Line}).
	1056
	1057	If you specify a coding system that cannot handle all the characters
	1058	in the buffer, Emacs will warn you about the troublesome characters,
	1059	and ask you to choose another coding system, when you try to save the
	1060	buffer (@pxref{Output Coding}).
8cf51b2c GM	1061
	1062	@cindex specify end-of-line conversion
	1063	You can also use this command to specify the end-of-line conversion
	1064	(@pxref{Coding Systems, end-of-line conversion}) for encoding the
	1065	current buffer. For example, @kbd{C-x @key{RET} f dos @key{RET}} will
71cd7772 GM	1066	cause Emacs to save the current buffer's text with DOS-style
71cd7772 GM	1067	carriage-return linefeed line endings.
8cf51b2c GM	1068
	1069	@kindex C-x RET c
	1070	@findex universal-coding-system-argument
	1071	Another way to specify the coding system for a file is when you visit
	1072	the file. First use the command @kbd{C-x @key{RET} c}
	1073	(@code{universal-coding-system-argument}); this command uses the
	1074	minibuffer to read a coding system name. After you exit the minibuffer,
	1075	the specified coding system is used for @emph{the immediately following
	1076	command}.
	1077
	1078	So if the immediately following command is @kbd{C-x C-f}, for example,
	1079	it reads the file using that coding system (and records the coding
	1080	system for when you later save the file). Or if the immediately following
	1081	command is @kbd{C-x C-w}, it writes the file using that coding system.
	1082	When you specify the coding system for saving in this way, instead
	1083	of with @kbd{C-x @key{RET} f}, there is no warning if the buffer
	1084	contains characters that the coding system cannot handle.
	1085
	1086	Other file commands affected by a specified coding system include
	1087	@kbd{C-x i} and @kbd{C-x C-v}, as well as the other-window variants
	1088	of @kbd{C-x C-f}. @kbd{C-x @key{RET} c} also affects commands that
	1089	start subprocesses, including @kbd{M-x shell} (@pxref{Shell}). If the
	1090	immediately following command does not use the coding system, then
	1091	@kbd{C-x @key{RET} c} ultimately has no effect.
	1092
	1093	An easy way to visit a file with no conversion is with the @kbd{M-x
	1094	find-file-literally} command. @xref{Visiting}.
	1095
4e3b4528 SM	1096	The default value of the variable @code{buffer-file-coding-system}
	1097	specifies the choice of coding system to use when you create a new file.
	1098	It applies when you find a new file, and when you create a buffer and
	1099	then save it in a file. Selecting a language environment typically sets
	1100	this variable to a good choice of default coding system for that language
8cf51b2c GM	1101	environment.
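
		If the default chosen by the language environment does not suit you,
		you can set it yourself.  A minimal sketch for an init file, assuming
		you want new files saved as UTF-8 with Unix line endings:

		@lisp
		;; Default coding system for buffers that do not yet have one.
		(setq-default buffer-file-coding-system 'utf-8-unix)
		@end lisp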
	1102
	1103	@kindex C-x RET r
	1104	@findex revert-buffer-with-coding-system
	1105	If you visit a file with a wrong coding system, you can correct this
	1106	with @kbd{C-x @key{RET} r} (@code{revert-buffer-with-coding-system}).
	1107	This visits the current file again, using a coding system you specify.
	1108
	1109	@findex recode-region
	1110	If a piece of text has already been inserted into a buffer using the
	1111	wrong coding system, you can redo the decoding of it using @kbd{M-x
	1112	recode-region}. This prompts you for the proper coding system, then
	1113	for the wrong coding system that was actually used, and does the
	1114	conversion. It first encodes the region using the wrong coding system,
	1115	then decodes it again using the proper coding system.
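
		The command can also be called from Lisp, with the arguments in the
		same order as the prompts.  The following sketch assumes the region
		holds UTF-8 text that was mistakenly decoded as Latin-1; both coding
		systems are only examples:

		@lisp
		;; Arguments: START END RIGHT-CODING WRONG-CODING.
		(recode-region (region-beginning) (region-end) 'utf-8 'latin-1)
		@end lisp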
	1116
	1117	@node Communication Coding
	1118	@section Coding Systems for Interprocess Communication
	1119
	1120	This section explains how to specify coding systems for use
	1121	in communication with other processes.
	1122
	1123	@table @kbd
	1124	@item C-x @key{RET} x @var{coding} @key{RET}
	1125	Use coding system @var{coding} for transferring selections to and from
166bc0c8	1126	other graphical applications (@code{set-selection-coding-system}).
8cf51b2c GM	1127
	1128	@item C-x @key{RET} X @var{coding} @key{RET}
	1129	Use coding system @var{coding} for transferring @emph{one}
166bc0c8	1130	selection---the next one---to or from another graphical application
313f790e	1131	(@code{set-next-selection-coding-system}).
8cf51b2c GM	1132
	1133	@item C-x @key{RET} p @var{input-coding} @key{RET} @var{output-coding} @key{RET}
	1134	Use coding systems @var{input-coding} and @var{output-coding} for
313f790e CY	1135	subprocess input and output in the current buffer
313f790e CY	1136	(@code{set-buffer-process-coding-system}).
8cf51b2c GM	1137	@end table
	1138
	1139	@kindex C-x RET x
	1140	@kindex C-x RET X
	1141	@findex set-selection-coding-system
	1142	@findex set-next-selection-coding-system
	1143	The command @kbd{C-x @key{RET} x} (@code{set-selection-coding-system})
	1144	specifies the coding system for sending selected text to other windowing
	1145	applications, and for receiving the text of selections made in other
	1146	applications. This command applies to all subsequent selections, until
	1147	you override it by using the command again. The command @kbd{C-x
	1148	@key{RET} X} (@code{set-next-selection-coding-system}) specifies the
	1149	coding system for the next selection made in Emacs or read by Emacs.
	1150
53b7759e	1151	@vindex x-select-request-type
221bb7f6 EZ	1152	The variable @code{x-select-request-type} specifies the data type to
	1153	request from the X Window System for receiving text selections from
	1154	other applications. If the value is @code{nil} (the default), Emacs
71cd7772	1155	tries @code{UTF8_STRING} and @code{COMPOUND_TEXT}, in this order, and
221bb7f6 EZ	1156	uses various heuristics to choose the more appropriate of the two
	1157	results; if neither succeeds, Emacs falls back on @code{STRING}.
	1158	If the value of @code{x-select-request-type} is one of the symbols
	1159	@code{COMPOUND_TEXT}, @code{UTF8_STRING}, @code{STRING}, or
	1160	@code{TEXT}, Emacs uses only that request type. If the value is a
	1161	list of some of these symbols, Emacs tries only the request types in
	1162	the list, in order, until one of them succeeds, or until the list is
	1163	exhausted.
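
		For instance, a sketch of a setting that restricts Emacs to two
		request types, tried in the order given (the particular selection is
		only an example):

		@lisp
		;; Try UTF8_STRING first, then STRING; skip COMPOUND_TEXT entirely.
		(setq x-select-request-type '(UTF8_STRING STRING))
		@end lisp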
53b7759e	1164
8cf51b2c GM	1165	@kindex C-x RET p
	1166	@findex set-buffer-process-coding-system
	1167	The command @kbd{C-x @key{RET} p} (@code{set-buffer-process-coding-system})
	1168	specifies the coding system for input and output to a subprocess. This
	1169	command applies to the current buffer; normally, each subprocess has its
	1170	own buffer, and thus you can use this command to specify translation to
	1171	and from a particular subprocess by giving the command in the
	1172	corresponding buffer.
	1173
313f790e CY	1174	You can also use @kbd{C-x @key{RET} c}
	1175	(@code{universal-coding-system-argument}) just before the command that
	1176	runs or starts a subprocess, to specify the coding system for
	1177	communicating with that subprocess. @xref{Text Coding}.
8cf51b2c GM	1178
	1179	The default for translation of process input and output depends on the
	1180	current language environment.
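
		This default is held in the variable
		@code{default-process-coding-system}, a cons cell whose car is used
		for decoding process output and whose cdr is used for encoding text
		sent to the process.  A sketch of overriding it in your init file,
		assuming your subprocesses exchange UTF-8:

		@lisp
		;; (DECODING . ENCODING) for subprocess output and input.
		(setq default-process-coding-system '(utf-8 . utf-8))
		@end lisp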
	1181
	1182	@vindex locale-coding-system
	1183	@cindex decoding non-@acronym{ASCII} keyboard input on X
	1184	The variable @code{locale-coding-system} specifies a coding system
	1185	to use when encoding and decoding system strings such as system error
	1186	messages and @code{format-time-string} formats and time stamps. That
71cd7772 GM	1187	coding system is also used for decoding non-@acronym{ASCII} keyboard
71cd7772 GM	1188	input on the X Window System. You should choose a coding system that is compatible
8cf51b2c GM	1189	with the underlying system's text representation, which is normally
	1190	specified by one of the environment variables @env{LC_ALL},
	1191	@env{LC_CTYPE}, and @env{LANG}. (The first one, in the order
	1192	specified above, whose value is nonempty is the one that determines
	1193	the text representation.)
	1194
	1195	@node File Name Coding
	1196	@section Coding Systems for File Names
	1197
	1198	@table @kbd
	1199	@item C-x @key{RET} F @var{coding} @key{RET}
	1200	Use coding system @var{coding} for encoding and decoding file
71cd7772	1201	names (@code{set-file-name-coding-system}).
8cf51b2c GM	1202	@end table
8cf51b2c GM	1203
8cf51b2c GM	1204	@findex set-file-name-coding-system
8cf51b2c GM	1205	@kindex C-x RET F
71cd7772 GM	1206	@cindex file names with non-@acronym{ASCII} characters
	1207	The command @kbd{C-x @key{RET} F} (@code{set-file-name-coding-system})
	1208	specifies a coding system to use for encoding file @emph{names}. It
	1209	has no effect on reading and writing the @emph{contents} of files.
	1210
	1211	@vindex file-name-coding-system
	1212	In fact, all this command does is set the value of the variable
	1213	@code{file-name-coding-system}. If you set the variable to a coding
	1214	system name (as a Lisp symbol or a string), Emacs encodes file names
	1215	using that coding system for all file operations. This makes it
	1216	possible to use non-@acronym{ASCII} characters in file names---or, at
	1217	least, those non-@acronym{ASCII} characters that the specified coding
	1218	system can encode.
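
		For example, a sketch of setting the variable directly in your init
		file, assuming your file system stores names in UTF-8:

		@lisp
		;; Encode and decode file names as UTF-8 in all file operations.
		(setq file-name-coding-system 'utf-8)
		@end lisp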
8cf51b2c GM	1219
8cf51b2c GM	1220	If @code{file-name-coding-system} is @code{nil}, Emacs uses a
71cd7772 GM	1221	default coding system determined by the selected language environment,
	1222	and stored in the @code{default-file-name-coding-system} variable.
	1223	@c FIXME? Is this correct? What is the "default language environment"?
ad36c422 CY	1224	In the default language environment, non-@acronym{ASCII} characters in
	1225	file names are not encoded specially; they appear in the file system
	1226	using the internal Emacs representation.
8cf51b2c	1227
7df14908 EZ	1228	@cindex file-name encoding, MS-Windows
	1229	@vindex w32-unicode-filenames
	1230	When Emacs runs on MS-Windows versions that are descendants of the
	1231	NT family (Windows 2000, XP, Vista, Windows 7, and Windows 8), the
	1232	value of @code{file-name-coding-system} is largely ignored, as Emacs
	1233	by default uses APIs that allow passing Unicode file names directly.
	1234	By contrast, on Windows 9X, file names are encoded using
	1235	@code{file-name-coding-system}, which should be set to the codepage
	1236	(@pxref{Coding Systems, codepage}) pertinent for the current system
	1237	locale. The value of the variable @code{w32-unicode-filenames}
	1238	controls whether Emacs uses the Unicode APIs when it calls OS
	1239	functions that accept file names. This variable is set by the startup
	1240	code to @code{nil} on Windows 9X, and to @code{t} on newer versions of
	1241	MS-Windows.
	1242
8cf51b2c GM	1243	@strong{Warning:} if you change @code{file-name-coding-system} (or the
	1244	language environment) in the middle of an Emacs session, problems can
	1245	result if you have already visited files whose names were encoded using
	1246	the earlier coding system and cannot be encoded (or are encoded
	1247	differently) under the new coding system. If you try to save one of
	1248	these buffers under the visited file name, saving may use the wrong file
71cd7772	1249	name, or it may encounter an error. If such a problem happens, use @kbd{C-x
8cf51b2c GM	1250	C-w} to specify a new file name for that buffer.
	1251
	1252	@findex recode-file-name
	1253	If a mistake occurs when encoding a file name, use the command
	1254	@kbd{M-x recode-file-name} to change the file name's coding
	1255	system. This prompts for an existing file name, its old coding
	1256	system, and the coding system to which you wish to convert.
	1257
	1258	@node Terminal Coding
	1259	@section Coding Systems for Terminal I/O
	1260
	1261	@table @kbd
8cf51b2c	1262	@item C-x @key{RET} t @var{coding} @key{RET}
313f790e CY	1263	Use coding system @var{coding} for terminal output
313f790e CY	1264	(@code{set-terminal-coding-system}).
71cd7772 GM	1265
	1266	@item C-x @key{RET} k @var{coding} @key{RET}
	1267	Use coding system @var{coding} for keyboard input
	1268	(@code{set-keyboard-coding-system}).
8cf51b2c GM	1269	@end table
	1270
	1271	@kindex C-x RET t
	1272	@findex set-terminal-coding-system
	1273	The command @kbd{C-x @key{RET} t} (@code{set-terminal-coding-system})
	1274	specifies the coding system for terminal output. If you specify a
	1275	coding system for terminal output, all characters output to the
	1276	terminal are translated into that coding system.
	1277
	1278	This feature is useful for certain character-only terminals built to
	1279	support specific languages or character sets---for example, European
	1280	terminals that support one of the ISO Latin character sets. You need to
	1281	specify the terminal coding system when using multibyte text, so that
	1282	Emacs knows which characters the terminal can actually handle.
	1283
	1284	By default, output to the terminal is not translated at all, unless
	1285	Emacs can deduce the proper coding system from your terminal type or
	1286	your locale specification (@pxref{Language Environments}).
	1287
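For instance, assuming a text terminal that handles only ISO Latin-1
(an assumption made for this sketch), the init file could contain:

@lisp
;; Translate everything sent to the terminal into Latin-1.
(set-terminal-coding-system 'iso-latin-1)

;; Inspect the coding system currently used for terminal output.
(terminal-coding-system)
@end lisp
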
	1288	@kindex C-x RET k
	1289	@findex set-keyboard-coding-system
	1290	@vindex keyboard-coding-system
71cd7772 GM	1291	The command @kbd{C-x @key{RET} k} (@code{set-keyboard-coding-system}),
71cd7772 GM	1292	or the variable @code{keyboard-coding-system}, specifies the coding
8cf51b2c GM	1293	system for keyboard input. Character-code translation of keyboard
	1294	input is useful for terminals with keys that send non-@acronym{ASCII}
	1295	graphic characters---for example, some terminals designed for ISO
	1296	Latin-1 or subsets of it.
	1297
	1298	By default, keyboard input is translated based on your system locale
	1299	setting. If your terminal does not really support the encoding
	1300	implied by your locale (for example, if you find it inserts a
	1301	non-@acronym{ASCII} character when you type @kbd{M-i}), you will need to set
	1302	@code{keyboard-coding-system} to @code{nil} to turn off encoding.
	1303	You can do this by putting
	1304
	1305	@lisp
	1306	(set-keyboard-coding-system nil)
	1307	@end lisp
	1308
	1309	@noindent
ad36c422	1310	in your init file.
8cf51b2c GM	1311
	1312	There is a similarity between using a coding system translation for
	1313	keyboard input, and using an input method: both define sequences of
	1314	keyboard input that translate into single characters. However, input
	1315	methods are designed to be convenient for interactive use by humans, and
	1316	the sequences that are translated are typically sequences of @acronym{ASCII}
	1317	printing characters. Coding systems typically translate sequences of
	1318	non-graphic characters.
	1319
	1320	@node Fontsets
	1321	@section Fontsets
	1322	@cindex fontsets
	1323
	1324	A font typically defines shapes for a single alphabet or script.
	1325	Therefore, displaying the entire range of scripts that Emacs supports
	1326	requires a collection of many fonts. In Emacs, such a collection is
05806f43	1327	called a @dfn{fontset}. A fontset is defined by a list of font specifications,
b545ff9c	1328	each assigned to handle a range of character codes, and may fall back
05806f43	1329	on another fontset for characters that are not covered by the fonts
b545ff9c	1330	it specifies.
8cf51b2c	1331
05806f43 GM	1332	@cindex fonts for various scripts
05806f43 GM	1333	@cindex Intlfonts package, installation
8cf51b2c GM	1334	Each fontset has a name, like a font. However, while fonts are
	1335	stored in the system and the available font names are defined by the
	1336	system, fontsets are defined within Emacs itself. Once you have
	1337	defined a fontset, you can use it within Emacs by specifying its name,
	1338	anywhere that you could use a single font. Of course, Emacs fontsets
05806f43 GM	1339	can use only the fonts that the system supports. If some characters
	1340	appear on the screen as empty boxes or hex codes, this means that the
	1341	fontset in use for them has no font for those characters. In this
	1342	case, or if the characters are shown, but not as well as you would
	1343	like, you may need to install extra fonts. Your operating system may
	1344	have optional fonts that you can install; or you can install the GNU
	1345	Intlfonts package, which includes fonts for most supported
	1346	scripts.@footnote{If you run Emacs on X, you may need to inform the X
	1347	server about the location of the newly installed fonts with commands
	1348	such as:
	1349	@c FIXME? I feel like this may be out of date.
1df7defd	1350	@c E.g., the intlfonts tarfile is ~ 10 years old.
05806f43 GM	1351
	1352	@example
	1353	xset fp+ /usr/local/share/emacs/fonts
	1354	xset fp rehash
	1355	@end example
	1356	}
8cf51b2c	1357
b545ff9c JR	1358	Emacs creates three fontsets automatically: the @dfn{standard
b545ff9c JR	1359	fontset}, the @dfn{startup fontset} and the @dfn{default fontset}.
05806f43 GM	1360	@c FIXME? The doc of standard-fontset-spec says:
	1361	@c "You have the biggest chance to display international characters
	1362	@c with correct glyphs by using the standard fontset." (my emphasis)
de649682	1363	@c See http://lists.gnu.org/archive/html/emacs-devel/2012-04/msg00430.html
b545ff9c	1364	The default fontset is the one most likely to have fonts for a wide variety of
05806f43	1365	non-@acronym{ASCII} characters. It is the default fallback for the
b545ff9c	1366	other two fontsets, and also when you set a default font rather than a fontset.
05806f43	1367	However, it does not specify font family names, so results can be
a4bead12	1368	somewhat random if you use it directly. You can specify use of a
05806f43 GM	1369	particular fontset by starting Emacs with the @samp{-fn} option.
05806f43 GM	1370	For example,
8cf51b2c GM	1371
	1372	@example
	1373	emacs -fn fontset-standard
	1374	@end example
	1375
	1376	@noindent
	1377	You can also specify a fontset with the @samp{Font} resource (@pxref{X
	1378	Resources}).
	1379
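A fontset can also be requested from the init file through the
@code{font} frame parameter. The following is only a sketch, assuming
a graphical display on which @samp{fontset-standard} is defined:

@lisp
;; Use the standard fontset for the initial frame and later frames.
(add-to-list 'default-frame-alist '(font . "fontset-standard"))
@end lisp
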
a4bead12 JR	1380	If no fontset is specified for use, then Emacs uses an
	1381	@acronym{ASCII} font, with @samp{fontset-default} as a fallback for
	1382	characters the font does not cover. The standard fontset is only used if
	1383	explicitly requested, despite its name.
	1384
8cf51b2c	1385	A fontset does not necessarily specify a font for every character
0eb025fb EZ	1386	code. If a fontset specifies no font for a certain character, or if
	1387	it specifies a font that does not exist on your system, then it cannot
	1388	display that character properly. It will display that character as a
0088729a	1389	hex code or thin space or an empty box instead. (@xref{Text Display, ,
0eb025fb	1390	glyphless characters}, for details.)
8cf51b2c GM	1391
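To check which fontsets exist in the running session, something like
the following can be evaluated. This sketch assumes an Emacs built
with window-system support, since the fontset functions are defined
only there.

@lisp
;; Names of all fontsets currently defined.
(fontset-list)

;; Expand a short fontset name into its long name.
(query-fontset "fontset-default")
@end lisp

@noindent
Interactively, @kbd{M-x list-fontsets} and @kbd{M-x describe-fontset}
present similar information in a help buffer.
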
	1392	@node Defining Fontsets
	1393	@section Defining Fontsets
	1394
	1395	@vindex standard-fontset-spec
b545ff9c JR	1396	@vindex w32-standard-fontset-spec
b545ff9c JR	1397	@vindex ns-standard-fontset-spec
8cf51b2c	1398	@cindex standard fontset
b545ff9c	1399	When running on X, Emacs creates a standard fontset automatically according to the value
8cf51b2c GM	1400	of @code{standard-fontset-spec}. This fontset's name is
	1401
	1402	@example
	1403	-*-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard
	1404	@end example
	1405
	1406	@noindent
	1407	or just @samp{fontset-standard} for short.
	1408
05806f43 GM	1409	On GNUstep and Mac OS X, the standard fontset is created using the value of
05806f43 GM	1410	@code{ns-standard-fontset-spec}, and on MS Windows it is
b545ff9c JR	1411	created using the value of @code{w32-standard-fontset-spec}.
b545ff9c JR	1412
05806f43 GM	1413	@c FIXME? How does one access these, or do anything with them?
05806f43 GM	1414	@c Does it matter?
8cf51b2c GM	1415	Bold, italic, and bold-italic variants of the standard fontset are
	1416	created automatically. Their names have @samp{bold} instead of
	1417	@samp{medium}, or @samp{i} instead of @samp{r}, or both.
	1418
	1419	@cindex startup fontset
b545ff9c JR	1420	Emacs generates a fontset automatically, based on any default
	1421	@acronym{ASCII} font that you specify with the @samp{Font} resource or
	1422	the @samp{-fn} argument, or the default font that Emacs found when it
	1423	started. This is the @dfn{startup fontset} and its name is
	1424	@code{fontset-startup}. Emacs builds it by replacing the
	1425	@var{charset_registry} field of the font name with @samp{fontset}, and replacing the
	1426	@var{charset_encoding} field with @samp{startup}, then using the
	1427	resulting string to specify a fontset.
8cf51b2c	1428
05806f43	1429	For instance, if you start Emacs with a font of this form,
8cf51b2c	1430
05806f43 GM	1431	@c FIXME? I think this is a little misleading, because you cannot (?)
	1432	@c actually specify a font with wildcards, it has to be a complete spec.
	1433	@c Also, an X font specification of this form hasn't (?) been
	1434	@c mentioned before now, and is somewhat obsolete these days.
	1435	@c People are more likely to use a form like
	1436	@c emacs -fn "DejaVu Sans Mono-12"
	1437	@c How does any of this apply in that case?
8cf51b2c GM	1438	@example
	1439	emacs -fn "*courier-medium-r-normal--14-140-*-iso8859-1"
	1440	@end example
	1441
	1442	@noindent
	1443	Emacs generates the following fontset and uses it for the initial X
	1444	window frame:
	1445
	1446	@example
b545ff9c	1447	-*-courier-medium-r-normal-*-14-140-*-*-*-*-fontset-startup
8cf51b2c GM	1448	@end example
8cf51b2c GM	1449
05806f43 GM	1450	The startup fontset will use the font that you specify, or a variant
05806f43 GM	1451	with a different registry and encoding, for all the characters that
b545ff9c JR	1452	are supported by that font, and fall back on @samp{fontset-default} for
	1453	other characters.
	1454
8cf51b2c GM	1455	With the X resource @samp{Emacs.Font}, you can specify a fontset name
	1456	just like an actual font name. But be careful not to specify a fontset
	1457	name in a wildcard resource like @samp{Emacs*Font}---that wildcard
	1458	specification matches various other resources, such as for menus, and
05806f43 GM	1459	@c FIXME is this still true?
05806f43 GM	1460	menus cannot handle fontsets. @xref{X Resources}.
8cf51b2c GM	1461
	1462	You can specify additional fontsets using X resources named
	1463	@samp{Fontset-@var{n}}, where @var{n} is an integer starting from 0.
	1464	The resource value should have this form:
	1465
	1466	@smallexample
	1467	@var{fontpattern}, @r{[}@var{charset}:@var{font}@r{]@dots{}}
	1468	@end smallexample
	1469
	1470	@noindent
05806f43 GM	1471	@var{fontpattern} should have the form of a standard X font name (see
05806f43 GM	1472	the previous fontset-startup example), except
8cf51b2c GM	1473	for the last two fields. They should have the form
	1474	@samp{fontset-@var{alias}}.
	1475
	1476	The fontset has two names, one long and one short. The long name is
	1477	@var{fontpattern}. The short name is @samp{fontset-@var{alias}}. You
	1478	can refer to the fontset by either name.
	1479
	1480	The construct @samp{@var{charset}:@var{font}} specifies which font to
	1481	use (in this fontset) for one particular character set. Here,
	1482	@var{charset} is the name of a character set, and @var{font} is the
	1483	font to use for that character set. You can use this construct any
	1484	number of times in defining one fontset.
	1485
	1486	For the other character sets, Emacs chooses a font based on
	1487	@var{fontpattern}. It replaces @samp{fontset-@var{alias}} with values
	1488	that describe the character set. For the @acronym{ASCII} character font,
	1489	@samp{fontset-@var{alias}} is replaced with @samp{ISO8859-1}.
	1490
	1491	In addition, when several consecutive fields are wildcards, Emacs
	1492	collapses them into a single wildcard. This is to prevent use of
	1493	auto-scaled fonts. Fonts made by scaling larger fonts are not usable
05806f43	1494	for editing, and scaling a smaller font is not useful either, because it is
8cf51b2c GM	1495	better to use the smaller font in its own size, which is what Emacs
	1496	does.
	1497
	1498	Thus if @var{fontpattern} is this,
	1499
	1500	@example
	1501	-*-fixed-medium-r-normal-*-24-*-*-*-*-*-fontset-24
	1502	@end example
	1503
	1504	@noindent
	1505	the font specification for @acronym{ASCII} characters would be this:
	1506
	1507	@example
	1508	-*-fixed-medium-r-normal-*-24-*-ISO8859-1
	1509	@end example
	1510
	1511	@noindent
	1512	and the font specification for Chinese GB2312 characters would be this:
	1513
	1514	@example
	1515	-*-fixed-medium-r-normal-*-24-*-gb2312*-*
	1516	@end example
	1517
	1518	You may not have any Chinese font matching the above font
	1519	specification. Most X distributions include only Chinese fonts that
05806f43 GM	1520	have @samp{song ti} or @samp{fangsong ti} in the @var{family} field. In
05806f43 GM	1521	such a case, @samp{Fontset-@var{n}} can be specified as:
8cf51b2c GM	1522
	1523	@smallexample
	1524	Emacs.Fontset-0: -*-fixed-medium-r-normal-*-24-*-*-*-*-*-fontset-24,\
	1525	chinese-gb2312:-*-*-medium-r-normal-*-24-*-gb2312*-*
	1526	@end smallexample
	1527
	1528	@noindent
	1529	Then, the font specifications for all but Chinese GB2312 characters have
	1530	@samp{fixed} in the @var{family} field, and the font specification for
	1531	Chinese GB2312 characters has a wild card @samp{*} in the @var{family}
	1532	field.
	1533
	1534	@findex create-fontset-from-fontset-spec
	1535	The function that processes the fontset resource value to create the
	1536	fontset is called @code{create-fontset-from-fontset-spec}. You can also
	1537	call this function explicitly to create a fontset.
	1538
d68eb23c	1539	@xref{Fonts}, for more information about font naming.
8cf51b2c	1540
b545ff9c JR	1541	@node Modifying Fontsets
	1542	@section Modifying Fontsets
	1543	@cindex fontsets, modifying
	1544	@findex set-fontset-font
	1545
	1546	Fontsets do not always have to be created from scratch. If only
	1547	minor changes are required it may be easier to modify an existing
	1548	fontset. Modifying @samp{fontset-default} will also affect other
	1549	fontsets that use it as a fallback, so can be an effective way of
	1550	fixing problems with the fonts that Emacs chooses for a particular
	1551	script.
	1552
	1553	Fontsets can be modified using the function @code{set-fontset-font},
	1554	specifying a character, a charset, a script, or a range of characters
05806f43 GM	1555	to modify the font for, and a font specification for the font to be
05806f43 GM	1556	used. Some examples are:
b545ff9c JR	1557
	1558	@example
	1559	;; Use Liberation Mono for latin-3 charset.
ae742cb5 CY	1560	(set-fontset-font "fontset-default" 'iso-8859-3
ae742cb5 CY	1561	"Liberation Mono")
b545ff9c JR	1562
b545ff9c JR	1563	;; Prefer a big5 font for han characters
ae742cb5 CY	1564	(set-fontset-font "fontset-default"
ae742cb5 CY	1565	'han (font-spec :registry "big5")
b545ff9c JR	1566	nil 'prepend)
b545ff9c JR	1567
ae742cb5 CY	1568	;; Use DejaVu Sans Mono as a fallback in fontset-startup
	1569	;; before resorting to fontset-default.
	1570	(set-fontset-font "fontset-startup" nil "DejaVu Sans Mono"
	1571	nil 'append)
b545ff9c JR	1572
b545ff9c JR	1573	;; Use MyPrivateFont for the Unicode private use area.
ae742cb5 CY	1574	(set-fontset-font "fontset-default" '(#xe000 . #xf8ff)
ae742cb5 CY	1575	"MyPrivateFont")
b545ff9c JR	1576
	1577	@end example
	1578
	1579
8cf51b2c GM	1580	@node Undisplayable Characters
	1581	@section Undisplayable Characters
	1582
05806f43	1583	There may be some non-@acronym{ASCII} characters that your
0be641c0 CY	1584	terminal cannot display. Most text terminals support just a single
0be641c0 CY	1585	character set (use the variable @code{default-terminal-coding-system}
05806f43	1586	to tell Emacs which one; @pxref{Terminal Coding}); characters that
8cf51b2c GM	1587	can't be encoded in that coding system are displayed as @samp{?} by
	1588	default.
	1589
	1590	Graphical displays can display a broader range of characters, but
	1591	you may not have fonts installed for all of them; characters that have
	1592	no font appear as a hollow box.
	1593
	1594	If you use Latin-1 characters but your terminal can't display
	1595	Latin-1, you can arrange to display mnemonic @acronym{ASCII} sequences
1df7defd	1596	instead, e.g., @samp{"o} for o-umlaut. Load the library
8cf51b2c GM	1597	@file{iso-ascii} to do this.
	1598
	1599	@vindex latin1-display
	1600	If your terminal can display Latin-1, you can display characters
	1601	from other European character sets using a mixture of equivalent
	1602	Latin-1 characters and @acronym{ASCII} mnemonics. Customize the variable
	1603	@code{latin1-display} to enable this. The mnemonic @acronym{ASCII}
	1604	sequences mostly correspond to those of the prefix input methods.
	1605
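One way to decide whether the @file{iso-ascii} fallback is needed is
to ask Emacs whether a given character can be shown. This is only a
sketch; it assumes the form is evaluated once the terminal or frame
in question is already in use.

@lisp
;; #xf6 is o-umlaut.  If the current terminal or fonts cannot
;; display it, load the ASCII mnemonic display instead.
(unless (char-displayable-p #xf6)
  (require 'iso-ascii))
@end lisp
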
	1606	@node Unibyte Mode
	1607	@section Unibyte Editing Mode
	1608
	1609	@cindex European character sets
	1610	@cindex accented characters
	1611	@cindex ISO Latin character sets
	1612	@cindex Unibyte operation
	1613	The ISO 8859 Latin-@var{n} character sets define character codes in
	1614	the range 0240 to 0377 octal (160 to 255 decimal) to handle the
	1615	accented letters and punctuation needed by various European languages
43b3b4d1 EZ	1616	(and some non-European ones). Note that Emacs considers bytes with
43b3b4d1 EZ	1617	codes in this range as raw bytes, not as characters, even in a unibyte
64a695bd XF	1618	buffer, i.e., if you disable multibyte characters. However, Emacs can
	1619	still handle these character codes as if they belonged to @emph{one}
	1620	of the single-byte character sets at a time. To specify @emph{which}
	1621	of these codes to use, invoke @kbd{M-x set-language-environment} and
	1622	specify a suitable language environment such as @samp{Latin-@var{n}}.
	1623	@xref{Disabling Multibyte, , Disabling Multibyte Characters, elisp,
	1624	GNU Emacs Lisp Reference Manual}.
8cf51b2c GM	1625
8cf51b2c GM	1626	@vindex unibyte-display-via-language-environment
43b3b4d1 EZ	1627	Emacs can also display bytes in the range 160 to 255 as readable
	1628	characters, provided the terminal or font in use supports them. This
	1629	works automatically. On a graphical display, Emacs can also display
	1630	single-byte characters through fontsets, in effect by displaying the
	1631	equivalent multibyte characters according to the current language
	1632	environment. To request this, set the variable
	1633	@code{unibyte-display-via-language-environment} to a non-@code{nil}
	1634	value. Note that setting this only affects how these bytes are
	1635	displayed, but does not change the fundamental fact that Emacs treats
	1636	them as raw bytes, not as characters.
8cf51b2c GM	1637
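For example, the two settings described above might appear together
in an init file. The choice of Latin-1 here is an assumption made
for illustration.

@lisp
;; Treat single-byte codes 160..255 as Latin-1.
(set-language-environment "Latin-1")

;; On a graphical display, show those bytes through fontsets.
;; This changes display only; the bytes remain raw bytes.
(setq unibyte-display-via-language-environment t)
@end lisp
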
	1638	@cindex @code{iso-ascii} library
	1639	If your terminal does not support display of the Latin-1 character
	1640	set, Emacs can display these characters as @acronym{ASCII} sequences which at
	1641	least give you a clear idea of what the characters are. To do this,
	1642	load the library @code{iso-ascii}. Similar libraries for other
05806f43 GM	1643	Latin-@var{n} character sets could be implemented, but have not been
05806f43 GM	1644	so far.
8cf51b2c GM	1645
	1646	@findex standard-display-8bit
	1647	@cindex 8-bit display
	1648	Normally non-ISO-8859 characters (decimal codes between 128 and 159
	1649	inclusive) are displayed as octal escapes. You can change this for
	1650	non-standard ``extended'' versions of ISO-8859 character sets by using the
	1651	function @code{standard-display-8bit} in the @code{disp-table} library.
	1652
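A minimal sketch of such a call, assuming you want the whole range
128 to 159 displayed as is:

@lisp
(require 'disp-table)
;; Display codes 128..159 as themselves, not as octal escapes.
(standard-display-8bit 128 159)
@end lisp
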
	1653	There are three ways to input single-byte non-@acronym{ASCII}
	1654	characters:
	1655
	1656	@itemize @bullet
	1657	@cindex 8-bit input
	1658	@item
	1659	You can use an input method for the selected language environment.
	1660	@xref{Input Methods}. When you use an input method in a unibyte buffer,
	1661	the non-@acronym{ASCII} character you specify with it is converted to unibyte.
	1662
	1663	@item
	1664	If your keyboard can generate character codes 128 (decimal) and up,
	1665	representing non-@acronym{ASCII} characters, you can type those character codes
	1666	directly.
	1667
0be641c0 CY	1668	On a graphical display, you should not need to do anything special to
0be641c0 CY	1669	use these keys; they should simply work. On a text terminal, you
05806f43	1670	should use the command @kbd{M-x set-keyboard-coding-system} or customize the
8cf51b2c GM	1671	variable @code{keyboard-coding-system} to specify which coding system
8cf51b2c GM	1672	your keyboard uses (@pxref{Terminal Coding}). Enabling this feature
d7e9a7f8	1673	will probably require you to use @key{ESC} to type Meta characters;
8cf51b2c	1674	however, on a console terminal or in @code{xterm}, you can arrange for
d7e9a7f8 EZ	1675	Meta to be converted to @key{ESC} and still be able to type 8-bit
	1676	characters present directly on the keyboard or using @key{Compose} or
	1677	@key{AltGr} keys. @xref{User Input}.
8cf51b2c GM	1678
	1679	@kindex C-x 8
	1680	@cindex @code{iso-transl} library
	1681	@cindex compose character
	1682	@cindex dead character
	1683	@item
	1684	For Latin-1 only, you can use the key @kbd{C-x 8} as a ``compose
	1685	character'' prefix for entry of non-@acronym{ASCII} Latin-1 printing
	1686	characters. @kbd{C-x 8} is good for insertion (in the minibuffer as
	1687	well as other buffers), for searching, and in any other context where
	1688	a key sequence is allowed.
	1689
	1690	@kbd{C-x 8} works by loading the @code{iso-transl} library. Once that
d7e9a7f8 EZ	1691	library is loaded, the @key{Alt} modifier key, if the keyboard has
d7e9a7f8 EZ	1692	one, serves the same purpose as @kbd{C-x 8}: use @key{Alt} together
8cf51b2c	1693	with an accent character to modify the following letter. In addition,
8edb942b	1694	if the keyboard has keys for the Latin-1 ``dead accent characters'',
8cf51b2c GM	1695	they too are defined to compose with the following character, once
	1696	@code{iso-transl} is loaded.
	1697
	1698	Use @kbd{C-x 8 C-h} to list all the available @kbd{C-x 8} translations.
	1699	@end itemize
	1700
	1701	@node Charsets
	1702	@section Charsets
	1703	@cindex charsets
	1704
18430066 CY	1705	In Emacs, @dfn{charset} is short for ``character set''. Emacs
	1706	supports most popular charsets (such as @code{ascii},
	1707	@code{iso-8859-1}, @code{cp1250}, @code{big5}, and @code{unicode}), in
	1708	addition to some charsets of its own (such as @code{emacs},
	1709	@code{unicode-bmp}, and @code{eight-bit}). All supported characters
	1710	belong to one or more charsets.
	1711
	1712	Emacs normally ``does the right thing'' with respect to charsets, so
	1713	that you don't have to worry about them. However, it is sometimes
	1714	helpful to know some of the underlying details about charsets.
	1715
d68eb23c	1716	One example is font selection (@pxref{Fonts}). Each language
18430066 CY	1717	environment (@pxref{Language Environments}) defines a ``priority
	1718	list'' for the various charsets. When searching for a font, Emacs
	1719	initially attempts to find one that can display the highest-priority
	1720	charsets. For instance, in the Japanese language environment, the
	1721	charset @code{japanese-jisx0208} has the highest priority, so Emacs
	1722	tries to use a font whose @code{registry} property is
	1723	@samp{JISX0208.1983-0}.
8cf51b2c GM	1724
	1725	@findex list-charset-chars
	1726	@cindex characters in a certain charset
	1727	@findex describe-character-set
18430066	1728	There are two commands that can be used to obtain information about
3af970a0 KH	1729	charsets. The command @kbd{M-x list-charset-chars} prompts for a
	1730	charset name, and displays all the characters in that character set.
	1731	The command @kbd{M-x describe-character-set} prompts for a charset
18430066	1732	name, and displays information about that charset, including its
3af970a0 KH	1733	internal representation within Emacs.
	1734
	1735	@findex list-character-sets
ae742cb5 CY	1736	@kbd{M-x list-character-sets} displays a list of all supported
ae742cb5 CY	1737	charsets. The list gives the names of charsets and additional
05806f43 GM	1738	information to identify each charset; see the
	1739	@url{http://www.itscj.ipsj.or.jp/ISO-IR/, International Register of
	1740	Coded Character Sets} for more details. In this list,
18430066 CY	1741	charsets are divided into two categories: @dfn{normal charsets} are
	1742	listed first, followed by @dfn{supplementary charsets}. A
	1743	supplementary charset is one that is used to define another charset
	1744	(as a parent or a subset), or to provide backward-compatibility for
	1745	older Emacs versions.
	1746
	1747	To find out which charset a character in the buffer belongs to, put
	1748	point before it and type @kbd{C-u C-x =} (@pxref{International
	1749	Chars}).
8cf51b2c	1750
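The same kind of information is available from Lisp. The following
sketch uses a few standard charset functions; the characters and
charsets queried are arbitrary examples.

@lisp
;; ASCII characters always report the `ascii' charset.
(char-charset ?a)
     @result{} ascii

;; Code point of U+00E9 (e with acute accent) in Latin-1.
(encode-char #xe9 'iso-8859-1)
     @result{} 233

;; Is this symbol the name of a charset?
(charsetp 'japanese-jisx0208)
     @result{} t

;; Charsets in priority order for the current language environment.
(charset-priority-list)
@end lisp
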
f4b6ba46 EZ	1751	@node Bidirectional Editing
	1752	@section Bidirectional Editing
	1753	@cindex bidirectional editing
	1754	@cindex right-to-left text
	1755
	1756	Emacs supports editing text written in scripts, such as Arabic and
	1757	Hebrew, whose natural ordering of horizontal text for display is from
	1758	right to left. However, digits and Latin text embedded in these
	1759	scripts are still displayed left to right. It is also not uncommon to
05806f43 GM	1760	have small portions of text in Arabic or Hebrew embedded in an otherwise
05806f43 GM	1761	Latin document; e.g., as comments and strings in a program source
f4b6ba46 EZ	1762	file. For these reasons, text that uses these scripts is actually
	1763	@dfn{bidirectional}: a mixture of runs of left-to-right and
	1764	right-to-left characters.
	1765
	1766	This section describes the facilities and options provided by Emacs
	1767	for editing bidirectional text.
	1768
	1769	@cindex logical order
	1770	@cindex visual order
	1771	Emacs stores right-to-left and bidirectional text in the so-called
	1772	@dfn{logical} (or @dfn{reading}) order: the buffer or string position
	1773	of the first character you read precedes that of the next character.
	1774	Reordering of bidirectional text into the @dfn{visual} order happens
	1775	at display time. As a result, character positions no longer increase
	1776	monotonically with their positions on display. Emacs implements the
	1777	Unicode Bidirectional Algorithm described in the Unicode Standard
	1778	Annex #9, for reordering of bidirectional text for display.
	1779
	1780	@vindex bidi-display-reordering
	1781	The buffer-local variable @code{bidi-display-reordering} controls
	1782	whether text in the buffer is reordered for display. If its value is
	1783	non-@code{nil}, Emacs reorders characters that have right-to-left
	1784	directionality when they are displayed. The default value is
4cc60b9b	1785	@code{t}.
f4b6ba46	1786
84412f2c EZ	1787	@cindex base direction of paragraphs
84412f2c EZ	1788	@cindex paragraph, base direction
f4b6ba46 EZ	1789	Each paragraph of bidirectional text can have its own @dfn{base
f4b6ba46 EZ	1790	direction}, either right-to-left or left-to-right. (Paragraph
05806f43	1791	@c paragraph-separate etc have no influence on this?
1df7defd	1792	boundaries are empty lines, i.e., lines consisting entirely of
84412f2c EZ	1793	whitespace characters.) Text in left-to-right paragraphs begins on
	1794	the screen at the left margin of the window and is truncated or
	1795	continued when it reaches the right margin. By contrast, text in
	1796	right-to-left paragraphs is displayed starting at the right margin and
	1797	is continued or truncated at the left margin.
f4b6ba46 EZ	1798
	1799	@vindex bidi-paragraph-direction
	1800	Emacs determines the base direction of each paragraph dynamically,
	1801	based on the text at the beginning of the paragraph. However,
	1802	sometimes a buffer may need to force a certain base direction for its
	1803	paragraphs. The variable @code{bidi-paragraph-direction}, if
	1804	non-@code{nil}, disables the dynamic determination of the base
	1805	direction, and instead forces all paragraphs in the buffer to have the
	1806	direction specified by its buffer-local value. The value can be either
	1807	@code{right-to-left} or @code{left-to-right}. Any other value is
	1808	interpreted as @code{nil}.
	1809
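For example, a mode hook can force the base direction for every
buffer in that mode. The choice of @code{text-mode-hook} and of
@code{left-to-right} is an assumption made for this sketch.

@lisp
;; bidi-paragraph-direction becomes buffer-local when set, so this
;; affects only buffers that run the hook.
(add-hook 'text-mode-hook
          (lambda ()
            (setq bidi-paragraph-direction 'left-to-right)))
@end lisp
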
	1810	@cindex LRM
	1811	@cindex RLM
	1812	Alternatively, you can control the base direction of a paragraph by
	1813	inserting special formatting characters in front of the paragraph.
	1814	The special character @code{RIGHT-TO-LEFT MARK}, or @sc{rlm}, forces
	1815	the right-to-left direction on the following paragraph, while
	1816	@code{LEFT-TO-RIGHT MARK}, or @sc{lrm}, forces the left-to-right
d7e9a7f8	1817	direction. (You can use @kbd{C-x 8 @key{RET}} to insert these characters.)
2d3fe5d7 EZ	1818	In a GUI session, the @sc{lrm} and @sc{rlm} characters display as very
2d3fe5d7 EZ	1819	thin blank characters; on text terminals they display as blanks.
f4b6ba46 EZ	1820
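The effect of these marks can be examined from Lisp. This sketch
inserts @sc{rlm} (U+200F) in front of Latin text in a temporary
buffer and asks Emacs for the base direction it computed for that
paragraph.

@lisp
(with-temp-buffer
  ;; #x200f is RIGHT-TO-LEFT MARK, placed before the paragraph text.
  (insert #x200f "abc def")
  (goto-char (point-min))
  ;; Returns the symbol `left-to-right' or `right-to-left'.
  (current-bidi-paragraph-direction))
@end lisp
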
	1821	Because characters are reordered for display, Emacs commands that
	1822	operate in the logical order or on stretches of buffer positions may
	1823	produce unusual effects. For example, @kbd{C-f} and @kbd{C-b}
	1824	commands move point in the logical order, so the cursor will sometimes
	1825	jump when point traverses reordered bidirectional text. Similarly, a
	1826	highlighted region covering a contiguous range of character positions
	1827	may look discontinuous if the region spans reordered text. This is
05806f43	1828	normal and similar to the behavior of other programs that support
4c672a0f EZ	1829	bidirectional text. If you set @code{visual-order-cursor-movement} to
	1830	a non-@code{nil} value, cursor motion by the arrow keys follows the
	1831	visual order on screen (@pxref{Moving Point, visual-order movement}).
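
As a small sketch, this option can be enabled from the init file:

@lisp
;; Make the arrow keys move in visual order in bidirectional text.
(setq visual-order-cursor-movement t)
@end lisp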