- -*-mode: text; coding: latin-1;-*-
+ -*-mode: text; coding: utf-8;-*-
-Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008
- Free Software Foundation, Inc.
+Copyright (C) 2002-2014 Free Software Foundation, Inc.
See the end of the file for license conditions.
Problems, fixmes and other unicode-related issues
* SINGLE_BYTE_CHAR_P returns true for Latin-1 characters, which has
undesirable effects. E.g.:
- (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil
- (multibyte-string-p (concat [?£])) => nil
- (text-char-description ?£) => "M-#"
+ (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil
+ (multibyte-string-p (concat [?£])) => nil
+ (text-char-description ?£) => "M-#"
These examples are all fixed by the change of 2002-10-14, but
there still exist questionable SINGLE_BYTE_CHAR_P in the
dumped emacs. But, those maps (char tables) generated while
temacs is running can't be removed from the dumped emacs.
- * Translation tables for {en,de}code currently aren't supported.
-
- This should be fixed by the changes of 2002-10-14.
-
- * Defining CCL coding systems currently doesn't work.
-
- This should be fixed by the changes of 2003-01-30.
-
* iso-2022 charsets get unified on i/o.
With the change on 2003-01-06, decoding routines put `charset'
spelling and calendar, but that's not a Unicode issue.)
* Handle Unicode combining characters usefully, e.g. diacritics, and
- handle more scripts specifically (à la Devanagari). There are
+ handle more scripts specifically (à la Devanagari). There are
issues with canonicalization.
- * Bidi is a separate issue with no support currently.
-
* We need tabular input methods, e.g. for maths symbols. (Not
specific to Unicode.)
worry about what happens when double-width charsets covering
non-CJK characters are unified.
- * Emacs 20/21 .elc files are currently not loadable. It may or may
- not be possible to do this properly.
+ * There are type errors lurking, e.g. in
+ Fcheck_coding_systems_region. Define ENABLE_CHECKING to find them.
- With the change on 2002-07-24, elc files generated by Emacs
- 20.3 and later are correctly loaded (including those
- containing multibyte characters and compressed). But, elc
- files generated by 20.2 and the primer are still not loadable.
- Is it really worth working on it?
+ * Old auto-save files, and similar files, such as Gnus drafts,
+ containing non-ASCII characters probably won't be re-read correctly.
- * Rmail won't work with non-ASCII text. Encoding issues for Babyl
- files need sorting out, but rms says Babyl will go before this is
- released.
- * Gnus still needs some attention, and we need to get changes
- accepted by Gnus maintainers...
+Source file encoding
+--------------------
- * There are type errors lurking, e.g. in
- Fcheck_coding_systems_region. Define ENABLE_CHECKING to find them.
+Most Emacs source files are encoded in UTF-8 (or in ASCII, which is a
+subset), but there are a few exceptions, listed below. Perhaps
+someday many of these files will be converted to UTF-8, for
+convenience when using tools like 'grep -r', but this might need
+nontrivial changes to the build process.
- * You can grep the code for lots of fixmes.
+ * chinese-big5
- * Old auto-save files, and similar files, such as Gnus drafts,
- containing non-ASCII characters probably won't be re-read correctly.
+ These are verbatim copies of files taken from external sources.
+ They haven't been converted to UTF-8.
+
+ leim/CXTERM-DIC/4Corner.tit
+ leim/CXTERM-DIC/ARRAY30.tit
+ leim/CXTERM-DIC/ECDICT.tit
+ leim/CXTERM-DIC/ETZY.tit
+ leim/CXTERM-DIC/PY-b5.tit
+ leim/CXTERM-DIC/Punct-b5.tit
+ leim/CXTERM-DIC/QJ-b5.tit
+ leim/CXTERM-DIC/ZOZY.tit
+ leim/MISC-DIC/CTLau-b5.html
+ leim/MISC-DIC/cangjie-table.b5
+
+ * chinese-iso-8bit
+
+ These are verbatim copies of files taken from external sources.
+ They haven't been converted to UTF-8.
+
+ leim/CXTERM-DIC/CCDOSPY.tit
+ leim/CXTERM-DIC/Punct.tit
+ leim/CXTERM-DIC/QJ.tit
+ leim/CXTERM-DIC/SW.tit
+ leim/CXTERM-DIC/TONEPY.tit
+ leim/MISC-DIC/pinyin.map
+ leim/MISC-DIC/CTLau.html
+ leim/MISC-DIC/ziranma.cin
+
+ * cp850
+
+ This file contains non-ASCII characters in unibyte strings. When
+ editing a keyboard layout it's more convenient to see 'é' than
+ '\202', and the MS-DOS compiler requires the single byte if a
+ backslash escape is not being used.
+
+ src/msdos.c
+
+ * iso-2022-cn-ext
+
+ This file is externally generated from leim/MISC-DIC/cangjie-table.b5
+ by a Big5->CNS converter. It hasn't been converted to UTF-8.
+
+ leim/MISC-DIC/cangjie-table.cns
+
+ * iso-latin-2
+
+ These files are processed by csplain, a program that requires
+ Latin-2 input. In 2012 the csplain maintainers started
+ recommending UTF-8, but these files haven't been converted yet.
+
+ etc/refcards/cs-dired-ref.tex
+ etc/refcards/cs-refcard.tex
+ etc/refcards/cs-survival.tex
+ etc/refcards/sk-dired-ref.tex
+ etc/refcards/sk-refcard.tex
+ etc/refcards/sk-survival.tex
+
+ * japanese-iso-8bit
+
+ SKK-JISYO.L is a verbatim copy of a file taken from an external source.
+ It hasn't been converted to UTF-8.
+
+ leim/SKK-DIC/SKK-JISYO.L
+
+ * japanese-shift-jis
+
+ This is a verbatim copy of a file taken from an external source.
+ It hasn't been converted to UTF-8.
+
+ admin/charsets/mapfiles/cns2ucsdkw.txt
+
+ * no-conversion
+
+ This file purposely contains arbitrary bytes interspersed within text,
+ to test whether the Emacs distribution is corrupted.
+
+ lib-src/testfile
+
+ * iso-2022-7bit
+
+ This file switches between CJK charsets, and that switching cannot be represented in UTF-8.
+
+ etc/HELLO
+
+ Each of these files contains just one CJK charset, but Emacs
+ currently has no easy way to specify set-charset-priority on a
+ per-file basis, so converting any of these files to UTF-8 might
+ change the file's appearance when viewed by an Emacs that is
+ operating in some other language environment.
+
+ etc/tutorials/TUTORIAL.ja
+ leim/quail/cyril-jis.el
+ leim/quail/hanja-jis.el
+ leim/quail/japanese.el
+ leim/quail/py-punct.el
+ leim/quail/pypunct-b5.el
+ lisp/international/ja-dic-cnv.el
+ lisp/international/ja-dic-utl.el
+ lisp/international/kinsoku.el
+ lisp/international/kkc.el
+ lisp/international/titdic-cnv.el
+ lisp/language/japan-util.el
+ lisp/language/japanese.el
+ lisp/term/x-win.el
+
+ * utf-8-emacs
+
+ These files contain characters that cannot be encoded in UTF-8.
+
+ leim/quail/tibetan.el
+ leim/quail/ethiopic.el
+ lisp/international/titdic-cnv.el
+ lisp/language/tibetan.el
+ lisp/language/tibet-util.el
+ lisp/language/ind-util.el
\f
This file is part of GNU Emacs.