Implement mouse highlight for bidi-reordered lines.

diff --git a/doc/emacs/mule.texi b/doc/emacs/mule.texi

index c630ba4..9fdef17 100644
--- a/doc/emacs/mule.texi
+++ b/doc/emacs/mule.texi
@@ -1,6 +1,6 @@
  @c This is part of the Emacs manual.
  @c Copyright (C) 1997, 1999, 2000, 2001, 2002, 2003, 2004,
-@c   2005, 2006, 2007, 2008, 2009 Free Software Foundation, Inc.
+@c   2005, 2006, 2007, 2008, 2009, 2010 Free Software Foundation, Inc.
  @c See file emacs.texi for copying conditions.
  @node International, Major Modes, Frames, Top
  @chapter International Character Set Support
@@ -66,6 +66,12 @@ displays (@pxref{Terminal Coding}).  If some characters are displayed
  incorrectly, refer to @ref{Undisplayable Characters}, which describes
  possible problems and explains how to solve them.
  
+@item
+Characters from scripts whose natural ordering of text is from right
+to left are reordered for display (@pxref{Bidirectional Editing}).
+These scripts include Arabic, Hebrew, Syriac, Thaana, and a few
+others.
+
  @item
  You can insert non-@acronym{ASCII} characters or search for them.  To do that,
  you can specify an input method (@pxref{Select Input Method}) suitable
@@ -107,6 +113,7 @@ to make sure Emacs interprets keyboard input correctly; see
  * Unibyte Mode::            You can pick one European character set
                                to use without multibyte characters.
  * Charsets::                How Emacs groups its internal character codes.
+* Bidirectional Editing::   Support for right-to-left scripts.
  @end menu
  
  @node International Chars
@@ -221,7 +228,7 @@ in a buffer whose coding system is @code{utf-8-unix}:
          character: @`A (192, #o300, #xc0)
  preferred charset: unicode (Unicode (ISO10646))
         code point: 0xC0
-           syntax: w   which means: word
+           syntax: w    which means: word
           category: j:Japanese l:Latin v:Vietnamese
        buffer code: #xC3 #x80
          file code: not encodable by coding system undecided-unix
@@ -533,6 +540,11 @@ most input methods---some disable this feature).  If
  possible characters to type next is displayed in the echo area (but
  not when you are in the minibuffer).
  
+  Another way to type characters that are not on your keyboard is to
+use @kbd{C-x 8 @key{RET}} (@code{ucs-insert}), which inserts a single
+character based on its Unicode name or code point; see @ref{Inserting
+Text}.
+
  @node Select Input Method
  @section Selecting an Input Method
  
@@ -961,15 +973,16 @@ still use an unsuitable coding system if you type its name in response
  to the question.)
  
  @vindex sendmail-coding-system
-  When you send a message with Mail mode (@pxref{Sending Mail}), Emacs has
-four different ways to determine the coding system to use for encoding
-the message text.  It tries the buffer's own value of
-@code{buffer-file-coding-system}, if that is non-@code{nil}.  Otherwise,
-it uses the value of @code{sendmail-coding-system}, if that is
-non-@code{nil}.  The third way is to use the default coding system for
-new files, which is controlled by your choice of language environment,
-if that is non-@code{nil}.  If all of these three values are @code{nil},
-Emacs encodes outgoing mail using the Latin-1 coding system.
+  When you send a message with Message mode (@pxref{Sending Mail}),
+Emacs has four different ways to determine the coding system to use
+for encoding the message text.  It tries the buffer's own value of
+@code{buffer-file-coding-system}, if that is non-@code{nil}.
+Otherwise, it uses the value of @code{sendmail-coding-system}, if that
+is non-@code{nil}.  The third way is to use the default coding system
+for new files, which is controlled by your choice of language
+environment, if that is non-@code{nil}.  If all of these three values
+are @code{nil}, Emacs encodes outgoing mail using the Latin-1 coding
+system.
  
  @node Text Coding
  @section Specifying a Coding System for File Text
@@ -1442,7 +1455,7 @@ field.
  fontset is called @code{create-fontset-from-fontset-spec}.  You can also
  call this function explicitly to create a fontset.
  
-  @xref{Font X}, for more information about font naming in X.
+  @xref{Fonts}, for more information about font naming.
  
  @node Modifying Fontsets
  @section Modifying Fontsets
@@ -1515,9 +1528,12 @@ sequences mostly correspond to those of the prefix input methods.
    The ISO 8859 Latin-@var{n} character sets define character codes in
  the range 0240 to 0377 octal (160 to 255 decimal) to handle the
  accented letters and punctuation needed by various European languages
-(and some non-European ones).  If you disable multibyte characters,
-Emacs can still handle @emph{one} of these character codes at a time.
-To specify @emph{which} of these codes to use, invoke @kbd{M-x
+(and some non-European ones).  Note that Emacs treats bytes with
+codes in this range as raw bytes, not as characters, even in a unibyte
+session, i.e.@: if you disable multibyte characters.  However, Emacs
+can still handle these character codes as if they belonged to
+@emph{one} of the single-byte character sets at a time.  To specify
+@emph{which} of these codes to use, invoke @kbd{M-x
  set-language-environment} and specify a suitable language environment
  such as @samp{Latin-@var{n}}.
  
@@ -1527,13 +1543,16 @@ your initialization files are read as unibyte if they contain
  non-@acronym{ASCII} characters.
  
  @vindex unibyte-display-via-language-environment
-  Emacs can also display those characters, provided the terminal or font
-in use supports them.  This works automatically.  Alternatively, on a
-graphical display, Emacs can also display single-byte characters
-through fontsets, in effect by displaying the equivalent multibyte
-characters according to the current language environment.  To request
-this, set the variable @code{unibyte-display-via-language-environment}
-to a non-@code{nil} value.
+  Emacs can also display bytes in the range 160 to 255 as readable
+characters, provided the terminal or font in use supports them.  This
+works automatically.  On a graphical display, Emacs can also display
+single-byte characters through fontsets, in effect by displaying the
+equivalent multibyte characters according to the current language
+environment.  To request this, set the variable
+@code{unibyte-display-via-language-environment} to a non-@code{nil}
+value.  Note that setting this only affects how these bytes are
+displayed, but does not change the fundamental fact that Emacs treats
+them as raw bytes, not as characters.
  
  @cindex @code{iso-ascii} library
    If your terminal does not support display of the Latin-1 character
@@ -1602,51 +1621,128 @@ Use @kbd{C-x 8 C-h} to list all the available @kbd{C-x 8} translations.
  @section Charsets
  @cindex charsets
  
-  Emacs defines most of popular character sets (e.g. ascii,
-iso-8859-1, cp1250, big5, unicode) as @dfn{charsets} and a few of its
-own charsets (e.g. emacs, unicode-bmp, eight-bit).  All supported
-characters belong to one or more charsets.  Usually you don't have to
-take care of ``charset'', but knowing about it may help understanding
-the behavior of Emacs in some cases.
-
-  One example is a font selection.  In each language environment,
-charsets have different priorities.  Emacs, at first, tries to use a
-font that matches with charsets of higher priority.  For instance, in
-Japanese language environment, the charset @code{japanese-jisx0208}
-has the highest priority (@pxref{Describe Language Environment}).  So,
-Emacs tries to use a font whose @code{registry} property is
-``JISX0208.1983-0'' for characters belonging to that charset.
-
-  Another example is a use of @code{charset} text property.  When
-Emacs reads a file encoded in a coding systems that uses escape
-sequences to switch charsets (e.g. iso-2022-int-1), the buffer text
-keep the information of the original charset by @code{charset} text
-property.  By using this information, Emacs can write the file with
-the same byte sequence as the original.
+  In Emacs, @dfn{charset} is short for ``character set''.  Emacs
+supports most popular charsets (such as @code{ascii},
+@code{iso-8859-1}, @code{cp1250}, @code{big5}, and @code{unicode}), in
+addition to some charsets of its own (such as @code{emacs},
+@code{unicode-bmp}, and @code{eight-bit}).  All supported characters
+belong to one or more charsets.
+
+  Emacs normally ``does the right thing'' with respect to charsets, so
+that you don't have to worry about them.  However, it is sometimes
+helpful to know some of the underlying details about charsets.
+
+  One example is font selection (@pxref{Fonts}).  Each language
+environment (@pxref{Language Environments}) defines a ``priority
+list'' for the various charsets.  When searching for a font, Emacs
+initially attempts to find one that can display the highest-priority
+charsets.  For instance, in the Japanese language environment, the
+charset @code{japanese-jisx0208} has the highest priority, so Emacs
+tries to use a font whose @code{registry} property is
+@samp{JISX0208.1983-0}.
  
  @findex list-charset-chars
  @cindex characters in a certain charset
  @findex describe-character-set
-  There are two commands for obtaining information about Emacs
+  There are two commands that can be used to obtain information about
  charsets.  The command @kbd{M-x list-charset-chars} prompts for a
  charset name, and displays all the characters in that character set.
  The command @kbd{M-x describe-character-set} prompts for a charset
-name and displays information about that charset, including its
+name, and displays information about that charset, including its
  internal representation within Emacs.
  
  @findex list-character-sets
-  To display a list of all the supported charsets, type @kbd{M-x
+  To display a list of all supported charsets, type @kbd{M-x
  list-character-sets}.  The list gives the names of charsets and
-additional information to identity each charset (see ISO/IEC's this
-page <http://www.itscj.ipsj.or.jp/ISO-IR/> for the detail).  In the
-list, charsets are categorized into two; the normal charsets are
-listed first, and the supplementary charsets are listed last.  A
-charset in the latter category is used for defining another charset
-(as a parent or a subset), or was used only in Emacs of the older
-versions.
-
-  To find out which charset a character in the buffer belongs to,
-put point before it and type @kbd{C-u C-x =}.
+additional information to identify each charset (see
+@url{http://www.itscj.ipsj.or.jp/ISO-IR/} for details).  In this list,
+charsets are divided into two categories: @dfn{normal charsets} are
+listed first, followed by @dfn{supplementary charsets}.  A
+supplementary charset is one that is used to define another charset
+(as a parent or a subset), or to provide backward-compatibility for
+older Emacs versions.
+
+  To find out which charset a character in the buffer belongs to, put
+point before it and type @kbd{C-u C-x =} (@pxref{International
+Chars}).
+
+@node Bidirectional Editing
+@section Bidirectional Editing
+@cindex bidirectional editing
+@cindex right-to-left text
+
+  Emacs supports editing text written in scripts, such as Arabic and
+Hebrew, whose natural ordering of horizontal text for display is from
+right to left.  However, digits and Latin text embedded in these
+scripts are still displayed left to right.  It is also not uncommon to
+have small portions of text in Arabic or Hebrew embedded in an otherwise
+Latin document, e.g., as comments and strings in a program source
+file.  For these reasons, text that uses these scripts is actually
+@dfn{bidirectional}: a mixture of runs of left-to-right and
+right-to-left characters.
+
+  This section describes the facilities and options provided by Emacs
+for editing bidirectional text.
+
+@cindex logical order
+@cindex visual order
+  Emacs stores right-to-left and bidirectional text in the so-called
+@dfn{logical} (or @dfn{reading}) order: the buffer or string position
+of the first character you read precedes that of the next character.
+Reordering of bidirectional text into the @dfn{visual} order happens
+at display time.  As a result, character positions no longer increase
+monotonically with their positions on display.  Emacs implements the
+Unicode Bidirectional Algorithm, described in Unicode Standard Annex
+#9, to reorder bidirectional text for display.
+
+@vindex bidi-display-reordering
+  The buffer-local variable @code{bidi-display-reordering} controls
+whether text in the buffer is reordered for display.  If its value is
+non-@code{nil}, Emacs reorders characters that have right-to-left
+directionality when they are displayed.  The default value is
+@code{nil}.
+
+  Each paragraph of bidirectional text can have its own @dfn{base
+direction}, either right-to-left or left-to-right.  (Paragraph
+boundaries are defined by the regular expressions
+@code{paragraph-start} and @code{paragraph-separate}; see
+@ref{Paragraphs}.)  Text in left-to-right paragraphs begins at the
+left margin of the window and is truncated or continued when it
+reaches the right margin.  By contrast, text in right-to-left
+paragraphs begins at the right margin and is continued or truncated at
+the left margin.
+
+@vindex bidi-paragraph-direction
+  Emacs determines the base direction of each paragraph dynamically,
+based on the text at the beginning of the paragraph.  However,
+sometimes a buffer may need to force a certain base direction for its
+paragraphs.  The variable @code{bidi-paragraph-direction}, if
+non-@code{nil}, disables the dynamic determination of the base
+direction, and instead forces all paragraphs in the buffer to have the
+direction specified by its buffer-local value.  The value can be either
+@code{right-to-left} or @code{left-to-right}.  Any other value is
+interpreted as @code{nil}.
+
+@cindex LRM
+@cindex RLM
+  Alternatively, you can control the base direction of a paragraph by
+inserting special formatting characters in front of the paragraph.
+The special character @code{RIGHT-TO-LEFT MARK}, or @sc{rlm}, forces
+the right-to-left direction on the following paragraph, while
+@code{LEFT-TO-RIGHT MARK}, or @sc{lrm}, forces the left-to-right
+direction.  (You can use @kbd{C-x 8 @key{RET}} to insert these characters.)
+In a GUI session, the @sc{lrm} and @sc{rlm} characters display as
+blanks.
+
+  Because characters are reordered for display, Emacs commands that
+operate in the logical order or on stretches of buffer positions may
+produce unusual effects.  For example, the @kbd{C-f} and @kbd{C-b}
+commands move point in the logical order, so the cursor will sometimes
+jump when point traverses reordered bidirectional text.  Similarly, a
+highlighted region covering a contiguous range of character positions
+may look discontinuous if the region spans reordered text.  This is
+normal and similar to the behavior of other programs that support
+bidirectional text.
  
  @ignore
     arch-tag: 310ba60d-31ef-4ce7-91f1-f282dd57b6b3