-*-mode: text; coding: utf-8;-*-

Copyright (C) 2002-2014 Free Software Foundation, Inc.
See the end of the file for license conditions.

Problems, fixmes and other Unicode-related issues
-------------------------------------------------------------

Notes by fx to record various things of variable importance. handa
needs to check them -- don't take too seriously, especially with
regard to completeness.

* SINGLE_BYTE_CHAR_P returns true for Latin-1 characters, which has
  undesirable effects. E.g.:
    (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil
    (multibyte-string-p (concat [?£])) => nil
    (text-char-description ?£) => "M-#"

    These examples are all fixed by the change of 2002-10-14, but
    questionable uses of SINGLE_BYTE_CHAR_P remain in the
    code (keymap.c and print.c).

* Rationalize character syntax and its relationship to the Unicode
  database. (Applies mainly to symbol and punctuation syntax.)

* Fontset handling and customization need work. We want to relate
  fonts to scripts, probably based on the Unicode blocks. The
  presence of small-repertoire 10646-encoded fonts in XFree 4 is a
  pain, not currently worked round.

    With the change on 2002-07-26, multiple fonts can be
    specified in a fontset for a specific range of characters.
    Each range can also be specified by script. Before using
    ISO10646 fonts, Emacs checks their repertoires to avoid
    fonts that lack a glyph for a specific character.

    fx has worked on fontset customization, but was stymied by
    basic problems with the way the default face is dealt with
    (and something else, I think). This needs revisiting.

* Work is also needed on charset and coding system priorities.

* The relevant bits of latin1-disp.el need porting (and probably
  re-naming/updating). See also cyril-util.el.

* Quail files need more work now that the encoding is largely irrelevant.

* What to do with the old coding categories stuff?

* The preferred-coding-system property of charsets should probably be
  junked unless it can be made more useful now.

* find-multibyte-characters needs looking at.

* Implement Korean cp949/UHC, BIG5-HKSCS and any other important missing
  charsets.

* Lazy-load tables for unify-charset somehow?

    Actually, Emacs clears out all charset maps and unify-map just
    before dumping, and they are loaded again on demand by the
    dumped Emacs. However, the maps (char tables) generated while
    temacs is running can't be removed from the dumped Emacs.

* iso-2022 charsets get unified on i/o.

    With the change on 2003-01-06, decoding routines put a `charset'
    property on decoded text, and the iso-2022 encoder pays attention
    to it. Thus, for instance, reading and writing with
    iso-2022-7bit preserves the original designation sequences.
    Perhaps the property name `preferred-charset' would be better?

    We may have to use this property to choose a font.

* Revisit locale processing: look at treating the language and
  charset parts separately. (Language should affect things like
  spelling and calendar, but that's not a Unicode issue.)

* Handle Unicode combining characters usefully, e.g. diacritics, and
  handle more scripts specifically (à la Devanagari). There are
  issues with canonicalization.

* We need tabular input methods, e.g. for maths symbols. (Not
  specific to Unicode.)

* Need multibyte text in menus, e.g. for the above. (Not specific to
  Unicode -- see Emacs etc/TODO, but now mostly works with gtk.)

* There's currently no support for Unicode normalization.

* Populate char-width-table correctly for Unicode characters and
  worry about what happens when double-width charsets covering
  non-CJK characters are unified.

* There are type errors lurking, e.g. in
  Fcheck_coding_systems_region. Define ENABLE_CHECKING to find them.

* Old auto-save files, and similar files, such as Gnus drafts,
  containing non-ASCII characters probably won't be re-read correctly.


Source file encoding
--------------------

Most Emacs source files are encoded in UTF-8 (or in ASCII, which is a
subset), but there are a few exceptions, listed below. Perhaps
someday many of these files will be converted to UTF-8, for
convenience when using tools like 'grep -r', but this might need
nontrivial changes to the build process.

* chinese-big5

  These are verbatim copies of files taken from external sources.
  They haven't been converted to UTF-8.

    leim/CXTERM-DIC/4Corner.tit
    leim/CXTERM-DIC/ARRAY30.tit
    leim/CXTERM-DIC/ECDICT.tit
    leim/CXTERM-DIC/ETZY.tit
    leim/CXTERM-DIC/PY-b5.tit
    leim/CXTERM-DIC/Punct-b5.tit
    leim/CXTERM-DIC/QJ-b5.tit
    leim/CXTERM-DIC/ZOZY.tit
    leim/MISC-DIC/CTLau-b5.html
    leim/MISC-DIC/cangjie-table.b5

* chinese-iso-8bit

  These are verbatim copies of files taken from external sources.
  They haven't been converted to UTF-8.

    leim/CXTERM-DIC/CCDOSPY.tit
    leim/CXTERM-DIC/Punct.tit
    leim/CXTERM-DIC/QJ.tit
    leim/CXTERM-DIC/SW.tit
    leim/CXTERM-DIC/TONEPY.tit
    leim/MISC-DIC/pinyin.map
    leim/MISC-DIC/CTLau.html
    leim/MISC-DIC/ziranma.cin

* cp850

  This file contains non-ASCII characters in unibyte strings. When
  editing a keyboard layout it's more convenient to see 'é' than
  '\202', and the MS-DOS compiler requires the single byte if a
  backslash escape is not being used.

    src/msdos.c

* iso-2022-cn-ext

  This file is externally generated from leim/MISC-DIC/cangjie-table.b5
  by a Big5->CNS converter. It hasn't been converted to UTF-8.

    leim/MISC-DIC/cangjie-table.cns

* iso-latin-2

  These files are processed by csplain, a program that requires
  Latin-2 input. In 2012 the csplain maintainers started
  recommending UTF-8, but these files haven't been converted yet.

    etc/refcards/cs-dired-ref.tex
    etc/refcards/cs-refcard.tex
    etc/refcards/cs-survival.tex
    etc/refcards/sk-dired-ref.tex
    etc/refcards/sk-refcard.tex
    etc/refcards/sk-survival.tex

* japanese-iso-8bit

  SKK-JISYO.L is a verbatim copy of a file taken from an external source.
  It hasn't been converted to UTF-8.

    leim/SKK-DIC/SKK-JISYO.L

* japanese-shift-jis

  This is a verbatim copy of a file taken from an external source.
  It hasn't been converted to UTF-8.

    admin/charsets/mapfiles/cns2ucsdkw.txt

* no-conversion

  This file purposely contains arbitrary bytes interspersed within text,
  to test whether the Emacs distribution is corrupted.

    lib-src/testfile

* iso-2022-7bit

  This file switches between CJK charsets, a distinction that UTF-8
  cannot encode.

    etc/HELLO

  Each of these files contains just one CJK charset, but Emacs
  currently has no easy way to specify set-charset-priority on a
  per-file basis, so converting any of these files to UTF-8 might
  change the file's appearance when viewed by an Emacs that is
  operating in some other language environment.

    etc/tutorials/TUTORIAL.ja
    leim/quail/cyril-jis.el
    leim/quail/hanja-jis.el
    leim/quail/japanese.el
    leim/quail/py-punct.el
    leim/quail/pypunct-b5.el
    lisp/international/ja-dic-cnv.el
    lisp/international/ja-dic-utl.el
    lisp/international/kinsoku.el
    lisp/international/kkc.el
    lisp/international/titdic-cnv.el
    lisp/language/japan-util.el
    lisp/language/japanese.el
    lisp/term/x-win.el

* utf-8-emacs

  These files contain characters that cannot be encoded in UTF-8.

    leim/quail/tibetan.el
    leim/quail/ethiopic.el
    lisp/international/titdic-cnv.el
    lisp/language/tibetan.el
    lisp/language/tibet-util.el
    lisp/language/ind-util.el


This file is part of GNU Emacs.

GNU Emacs is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

GNU Emacs is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>.