Commit | Line | Data |
---|---|---|
c38e0c97 | 1 | -*-mode: text; coding: utf-8;-*- |
e88a2ed3 | 2 | |
ba318903 | 3 | Copyright (C) 2002-2014 Free Software Foundation, Inc. |
e88a2ed3 GM | 4 | See the end of the file for license conditions. |
5 | ||
2394ef28 EZ | 6 | Importing a new Unicode Standard version into Emacs |
7 | ------------------------------------------------------------- | |
8 | ||
9 | Emacs uses the following files from the Unicode Character Database | |
10 | (a.k.a. "UCD"): | |
11 | ||
12 | . UnicodeData.txt | |
13 | . BidiMirroring.txt | |
14 | . IVD_Sequences.txt | |
15 | ||
16 | First, these files need to be copied into admin/unidata/, and then | |
17 | Emacs should be rebuilt for them to take effect. Rebuilding Emacs | |
18 | updates several derived files elsewhere in the Emacs source tree, | |
19 | mainly in lisp/international/. | |
20 | ||
21 | When Emacs is rebuilt for the first time after importing the new | |
22 | files, pay attention to any warning or error messages. In particular, | |
23 | admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines | |
24 | new bidirectional attributes of characters, because unidata-gen.el, | |
25 | bidi.c and dispextern.h need to be updated in that case; failure to do | |
26 | so will cause aborts in redisplay. | |
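One way to spot new bidirectional attributes before rebuilding is to compare the bidi class field of the old and new UnicodeData.txt. The Python sketch below is not part of Emacs, and its function names are hypothetical. It assumes the usual semicolon-separated layout, in which the fifth field (index 4) holds the bidi class. The sample lines are only there to make the sketch runnable.

```python
def bidi_classes(lines):
    """Collect the bidi class values (field index 4) from UnicodeData.txt lines."""
    classes = set()
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) > 4 and fields[4]:
            classes.add(fields[4])
    return classes


def new_bidi_classes(old_lines, new_lines):
    """Return bidi classes that appear in the new file but not in the old one."""
    return sorted(bidi_classes(new_lines) - bidi_classes(old_lines))


# Sample lines in UnicodeData.txt format, for illustration only.
OLD = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;",
]
NEW = OLD + ["2066;LEFT-TO-RIGHT ISOLATE;Cf;0;LRI;;;;;N;;;;;"]

print(new_bidi_classes(OLD, NEW))
```

Any class this reports is a candidate for the updates to unidata-gen.el, bidi.c and dispextern.h mentioned above. The sketch does not replace the check done by unidata-gen.el itself.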
27 | ||
28 | Next, review the changes in UnicodeData.txt vs the previous version | |
29 | used by Emacs. Any changes, be it introduction of new scripts or | |
30 | addition of codepoints to existing scripts, need corresponding changes | |
31 | in the data used for filling char-script-table, see characters.el | |
32 | around line 1300. Other databases and settings in characters.el, such | |
33 | as the data for char-width-table, might also need changes. | |
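The review of UnicodeData.txt against the previous version can be started by listing the entries that are new. The Python sketch below is not part of Emacs, and its function names are hypothetical. It assumes the first field is the code point in hexadecimal and the second is the character name. Ranges that the file gives only as "First" and "Last" entries appear here as their two endpoints.

```python
def parse(lines):
    """Map code point -> character name for each non-empty UnicodeData.txt line."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2 or not fields[0]:
            continue
        table[int(fields[0], 16)] = fields[1]
    return table


def added_entries(old_lines, new_lines):
    """Return (code point, name) pairs present only in the new file, in order."""
    old, new = parse(old_lines), parse(new_lines)
    return [(cp, new[cp]) for cp in sorted(new.keys() - old.keys())]


# Sample lines in UnicodeData.txt format, for illustration only.
OLD = ["0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"]
NEW = OLD + ["20BA;TURKISH LIRA SIGN;Sc;0;ET;;;;;N;;;;;"]

for cp, name in added_entries(OLD, NEW):
    print(f"U+{cp:04X} {name}")
```

The resulting list shows only which code points to look at. Which script each one belongs to, and whether char-script-table or char-width-table needs a new range, still has to be decided by hand.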
34 | ||
35 | Any new scripts added by UnicodeData.txt will also need updates to | |
36 | script-representative-chars defined in fontset.el. Other databases in | |
37 | fontset.el might also need to be updated. | |
38 | ||
e88a2ed3 GM | 39 | Problems, fixmes and other unicode-related issues |
40 | ------------------------------------------------------------- | |
41 | ||
42 | Notes by fx to record various things of variable importance. handa | |
43 | needs to check them -- don't take too seriously, especially with | |
44 | regard to completeness. | |
45 | ||
46 | * SINGLE_BYTE_CHAR_P returns true for Latin-1 characters, which has | |
47 | undesirable effects. E.g.: | |
c38e0c97 PE | 48 | (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil |
49 | (multibyte-string-p (concat [?£])) => nil | |
50 | (text-char-description ?£) => "M-#" | |
e88a2ed3 GM | 51 | |
52 | These examples are all fixed by the change of 2002-10-14, but | |
53 | there still exist questionable SINGLE_BYTE_CHAR_P in the | |
54 | code (keymap.c and print.c). | |
55 | ||
56 | * Rationalize character syntax and its relationship to the Unicode | |
57 | database. (Applies mainly to symbol and punctuation syntax.) | |
58 | ||
59 | * Fontset handling and customization needs work. We want to relate | |
60 | fonts to scripts, probably based on the Unicode blocks. The | |
61 | presence of small-repertoire 10646-encoded fonts in XFree 4 is a | |
62 | pain, not currently worked round. | |
63 | ||
64 | With the change on 2002-07-26, multiple fonts can be | |
65 | specified in a fontset for a specific range of characters. | |
66 | Each range can also be specified by script. Before using | |
67 | ISO10646 fonts, Emacs checks their repertoires to avoid | |
68 | fonts that don't have a glyph for a specific character. | |
69 | ||
70 | fx has worked on fontset customization, but was stymied by | |
71 | basic problems with the way the default face is dealt with | |
72 | (and something else, I think). This needs revisiting. | |
73 | ||
74 | * Work is also needed on charset and coding system priorities. | |
75 | ||
76 | * The relevant bits of latin1-disp.el need porting (and probably | |
77 | re-naming/updating). See also cyril-util.el. | |
78 | ||
79 | * Quail files need more work now the encoding is largely irrelevant. | |
80 | ||
81 | * What to do with the old coding categories stuff? | |
82 | ||
83 | * The preferred-coding-system property of charsets should probably be | |
84 | junked unless it can be made more useful now. | |
85 | ||
86 | * find-multibyte-characters needs looking at. | |
87 | ||
88 | * Implement Korean cp949/UHC, BIG5-HKSCS and any other important missing | |
89 | charsets. | |
90 | ||
91 | * Lazy-load tables for unify-charset somehow? | |
92 | ||
93 | Actually, Emacs clears out all charset maps and unify-map just | |
94 | before dumping, and they are loaded again on demand by the | |
95 | dumped emacs. But, those maps (char tables) generated while | |
96 | temacs is running can't be removed from the dumped emacs. | |
97 | ||
e88a2ed3 GM | 98 | * iso-2022 charsets get unified on i/o. |
99 | ||
100 | With the change on 2003-01-06, decoding routines put a `charset' | |
101 | property on decoded text, and the iso-2022 encoder pays attention | |
102 | to it. Thus, for instance, reading and writing by | |
103 | iso-2022-7bit preserve the original designation sequences. | |
104 | The property name `preferred-charset' may be better? | |
105 | ||
106 | We may have to utilize this property to decide a font. | |
107 | ||
108 | * Revisit locale processing: look at treating the language and | |
109 | charset parts separately. (Language should affect things like | |
110 | spelling and calendar, but that's not a Unicode issue.) | |
111 | ||
112 | * Handle Unicode combining characters usefully, e.g. diacritics, and | |
c38e0c97 | 113 | handle more scripts specifically (à la Devanagari). There are |
e88a2ed3 GM | 114 | issues with canonicalization. |
115 | ||
e88a2ed3 GM | 116 | * We need tabular input methods, e.g. for maths symbols. (Not |
117 | specific to Unicode.) | |
118 | ||
119 | * Need multibyte text in menus, e.g. for the above. (Not specific to | |
120 | Unicode -- see Emacs etc/TODO, but now mostly works with gtk.) | |
121 | ||
122 | * There's currently no support for Unicode normalization. | |
123 | ||
124 | * Populate char-width-table correctly for Unicode characters and | |
125 | worry about what happens when double-width charsets covering | |
126 | non-CJK characters are unified. | |
127 | ||
e88a2ed3 GM | 128 | * There are type errors lurking, e.g. in |
129 | Fcheck_coding_systems_region. Define ENABLE_CHECKING to find them. | |
130 | ||
e88a2ed3 GM | 131 | * Old auto-save files, and similar files, such as Gnus drafts, |
132 | containing non-ASCII characters probably won't be re-read correctly. | |
133 | ||
d37e4893 PE | 134 | |
135 | Source file encoding | |
136 | -------------------- | |
137 | ||
138 | Most Emacs source files are encoded in UTF-8 (or in ASCII, which is a | |
139 | subset), but there are a few exceptions, listed below. Perhaps | |
2b0fae5e | 140 | someday many of these files will be converted to UTF-8, for |
1b610f51 PE | 141 | convenience when using tools like 'grep -r', but this might need |
142 | nontrivial changes to the build process. | |
d37e4893 PE | 143 | |
144 | * chinese-big5 | |
145 | ||
1b610f51 PE | 146 | These are verbatim copies of files taken from external sources. |
147 | They haven't been converted to UTF-8. | |
148 | ||
d37e4893 PE | 149 | leim/CXTERM-DIC/4Corner.tit |
150 | leim/CXTERM-DIC/ARRAY30.tit | |
151 | leim/CXTERM-DIC/ECDICT.tit | |
152 | leim/CXTERM-DIC/ETZY.tit | |
153 | leim/CXTERM-DIC/PY-b5.tit | |
154 | leim/CXTERM-DIC/Punct-b5.tit | |
155 | leim/CXTERM-DIC/QJ-b5.tit | |
156 | leim/CXTERM-DIC/ZOZY.tit | |
157 | leim/MISC-DIC/CTLau-b5.html | |
158 | leim/MISC-DIC/cangjie-table.b5 | |
159 | ||
160 | * chinese-iso-8bit | |
161 | ||
1b610f51 PE | 162 | These are verbatim copies of files taken from external sources. |
163 | They haven't been converted to UTF-8. | |
164 | ||
d37e4893 PE | 165 | leim/CXTERM-DIC/CCDOSPY.tit |
166 | leim/CXTERM-DIC/Punct.tit | |
167 | leim/CXTERM-DIC/QJ.tit | |
168 | leim/CXTERM-DIC/SW.tit | |
169 | leim/CXTERM-DIC/TONEPY.tit | |
170 | leim/MISC-DIC/pinyin.map | |
171 | leim/MISC-DIC/CTLau.html | |
172 | leim/MISC-DIC/ziranma.cin | |
173 | ||
1b610f51 PE | 174 | * cp850 |
175 | ||
176 | This file contains non-ASCII characters in unibyte strings. When | |
177 | editing a keyboard layout it's more convenient to see 'é' than | |
178 | '\202', and the MS-DOS compiler requires the single byte if a | |
179 | backslash escape is not being used. | |
180 | ||
181 | src/msdos.c | |
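The relation between 'é' and '\202' can be checked outside Emacs. The Python sketch below is only an illustration and is not taken from src/msdos.c. It relies on Python's built-in cp850 codec.

```python
# Encode 'é' in code page 850 and show the resulting single byte.
byte = "é".encode("cp850")

print(byte)          # the one byte stored in a unibyte string
print(oct(byte[0]))  # the same byte written in octal, as in a C escape
```

The octal value is the number that appears in the '\202' escape.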
182 | ||
183 | * iso-2022-cn-ext | |
184 | ||
185 | This file is externally generated from leim/MISC-DIC/cangjie-table.b5 | |
186 | by a Big5->CNS converter. It hasn't been converted to UTF-8. | |
187 | ||
188 | leim/MISC-DIC/cangjie-table.cns | |
189 | ||
d37e4893 PE | 190 | * iso-latin-2 |
191 | ||
1b610f51 PE | 192 | These files are processed by csplain, a program that requires |
193 | Latin-2 input. In 2012 the csplain maintainers started | |
194 | recommending UTF-8, but these files haven't been converted yet. | |
195 | ||
196 | etc/refcards/cs-dired-ref.tex | |
d37e4893 | 197 | etc/refcards/cs-refcard.tex |
d37e4893 | 198 | etc/refcards/cs-survival.tex |
d37e4893 PE | 199 | etc/refcards/sk-dired-ref.tex |
200 | etc/refcards/sk-refcard.tex | |
1b610f51 | 201 | etc/refcards/sk-survival.tex |
d37e4893 PE | 202 | |
203 | * japanese-iso-8bit | |
204 | ||
1b610f51 | 205 | SKK-JISYO.L is a verbatim copy of a file taken from an external source. |
6b8504ba | 206 | It hasn't been converted to UTF-8. |
1b610f51 | 207 | |
d37e4893 | 208 | leim/SKK-DIC/SKK-JISYO.L |
d37e4893 PE | 209 | |
210 | * japanese-shift-jis | |
211 | ||
1b610f51 PE | 212 | This is a verbatim copy of a file taken from an external source. |
213 | It hasn't been converted to UTF-8. | |
214 | ||
d37e4893 PE | 215 | admin/charsets/mapfiles/cns2ucsdkw.txt |
216 | ||
1b610f51 PE | 217 | * iso-2022-7bit |
218 | ||
2aa2157b | 219 | This file switches between CJK charsets, and that switching cannot be expressed in UTF-8. |
84c3ab68 PE | 220 | |
221 | etc/HELLO | |
222 | ||
2aa2157b PE | 223 | Each of these files contains just one CJK charset, but Emacs |
224 | currently has no easy way to specify set-charset-priority on a | |
225 | per-file basis, so converting any of these files to UTF-8 might | |
226 | change the file's appearance when viewed by an Emacs that is | |
227 | operating in some other language environment. | |
228 | ||
229 | etc/tutorials/TUTORIAL.ja | |
2aa2157b PE | 230 | leim/quail/cyril-jis.el |
231 | leim/quail/hanja-jis.el | |
2aa2157b PE | 232 | leim/quail/japanese.el |
233 | leim/quail/py-punct.el | |
234 | leim/quail/pypunct-b5.el | |
2aa2157b PE | 235 | lisp/international/ja-dic-cnv.el |
236 | lisp/international/ja-dic-utl.el | |
237 | lisp/international/kinsoku.el | |
238 | lisp/international/kkc.el | |
239 | lisp/international/titdic-cnv.el | |
240 | lisp/language/japan-util.el | |
241 | lisp/language/japanese.el | |
242 | lisp/term/x-win.el | |
243 | ||
4b725a70 PE | 244 | * utf-8-emacs |
245 | ||
1b610f51 PE | 246 | These files contain characters that cannot be encoded in UTF-8. |
247 | ||
248 | leim/quail/tibetan.el | |
249 | leim/quail/ethiopic.el | |
250 | lisp/international/titdic-cnv.el | |
251 | lisp/language/tibetan.el | |
252 | lisp/language/tibet-util.el | |
253 | lisp/language/ind-util.el | |
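A rough first pass at finding files that belong in the lists above can be made from the raw bytes alone. The Python sketch below is not part of the Emacs build, and the function name and result labels are hypothetical. It can only separate pure ASCII, valid UTF-8, data containing ESC bytes (as iso-2022 files do), and everything else. It cannot tell the various 8-bit and double-byte encodings apart.

```python
def classify(data: bytes) -> str:
    """Coarsely classify file contents by byte pattern (heuristic only)."""
    if b"\x1b" in data:
        # ESC introduces ISO-2022 designation sequences; other uses are possible.
        return "has-escape"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


print(classify(b"plain text"))
print(classify("é".encode("utf-8")))
print(classify("é".encode("latin-1")))
print(classify(b"\x1b$B"))
```

To apply it to a real file, read the file in binary mode and pass the bytes to classify. Files in utf-8-emacs would presumably be reported as "other" wherever they contain characters outside the Unicode range, since Python's strict UTF-8 decoder rejects those sequences. That is an assumption and has not been checked against the files listed above.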
254 | ||
e88a2ed3 GM | 255 | \f |
256 | This file is part of GNU Emacs. | |
257 | ||
9ad5de0c | 258 | GNU Emacs is free software: you can redistribute it and/or modify |
e88a2ed3 | 259 | it under the terms of the GNU General Public License as published by |
9ad5de0c GM | 260 | the Free Software Foundation, either version 3 of the License, or |
261 | (at your option) any later version. | |
e88a2ed3 GM | 262 | |
263 | GNU Emacs is distributed in the hope that it will be useful, | |
264 | but WITHOUT ANY WARRANTY; without even the implied warranty of | |
265 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
266 | GNU General Public License for more details. | |
267 | ||
268 | You should have received a copy of the GNU General Public License | |
9ad5de0c | 269 | along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. |