1
  
2
  
3
  
4
  
5
  
6
  
7
  
8
  
9
  
10
  
11
  
12
  
13
  
14
  
15
  
16
  
17
  
18
  
19
  
20
  
21
  
22
  
23
  
24
  
25
  
26
  
27
  
28
  
29
  
30
  
31
  
32
  
33
  
34
  
35
  
36
  
37
  
38
  
39
  
40
  
41
  
42
  
43
  
44
  
45
  
46
  
47
  
48
  
49
  
50
  
51
  
52
  
53
  
54
  
55
  
56
  
57
  
58
  
59
  
60
  
61
  
62
  
63
  
64
  
65
  
66
  
67
  
68
  
69
  
70
  
71
  
72
  
73
  
74
  
75
  
76
  
77
  
78
  
79
  
80
  
81
  
82
  
83
  
84
  
85
  
86
  
87
  
88
  
89
  
90
  
91
  
92
  
93
  
94
  
95
  
96
  
97
  
98
  
99
  
100
  
101
  
102
  
103
  
104
  
105
  
106
  
107
  
108
  
109
  
110
  
111
  
112
  
113
  
114
  
115
  
116
  
117
  
118
  
119
  
120
  
121
  
122
  
123
  
124
  
125
  
126
  
127
  
128
  
129
  
130
  
131
  
132
  
133
  
134
  
135
  
136
  
137
  
138
  
139
  
140
  
141
  
142
  
143
  
144
  
145
  
146
  
147
  
148
  
149
  
150
  
151
  
152
  
153
  
154
  
155
  
156
  
157
  
158
  
159
  
160
  
161
  
162
  
163
  
164
  
165
  
166
  
167
  
168
  
169
  
170
  
171
  
172
  
173
  
174
  
175
  
176
  
177
  
178
  
179
  
180
  
181
  
182
  
183
  
184
  
185
  
186
  
187
  
188
  
189
  
190
  
191
  
192
  
193
  
194
  
195
  
196
  
197
  
198
  
199
  
200
  
201
  
202
  
203
  
204
  
205
  
206
  
207
  
208
  
209
  
210
  
211
  
212
  
213
  
214
  
215
  
216
  
217
  
218
  
219
  
220
  
221
  
222
  
223
  
224
  
225
  
226
  
227
  
228
  
229
  
230
  
231
  
232
  
233
  
234
  
235
  
236
  
237
  
238
  
239
  
240
  
241
  
242
  
243
  
244
  
245
  
246
  
247
  
248
  
249
  
250
  
251
  
252
  
253
  
254
  
255
  
256
  
257
  
258
  
259
  
260
  
261
  
262
  
263
  
264
  
265
  
266
  
267
  
268
  
269
  
270
  
271
  
272
  
273
  
274
  
275
  
276
  
277
  
278
  
279
  
280
  
281
  
282
  
283
  
284
  
285
  
286
  
287
  
288
  
289
  
290
  
291
  
292
  
293
  
294
  
295
  
296
  
297
  
298
  
299
  
300
  
301
  
302
  
303
  
304
  
305
  
306
  
307
  
308
  
309
  
310
  
311
  
312
  
313
  
314
  
315
  
316
  
317
  
318
  
319
  
320
  
321
  
322
  
323
  
324
  
325
  
326
  
327
  
328
  
329
  
330
  
331
  
332
  
333
  
334
  
335
  
336
  
337
  
338
  
339
  
340
  
341
  
342
  
343
  
344
  
345
  
346
  
347
  
348
  
349
  
350
  
351
  
352
  
353
  
354
  
355
  
356
  
357
  
358
  
359
  
360
  
361
  
362
  
363
  
364
  
365
  
366
  
367
  
368
  
369
  
370
  
371
  
372
  
373
  
374
  
375
  
376
  
377
  
378
  
379
  
380
  
381
  
382
  
383
  
384
  
385
  
386
  
387
  
388
  
389
  
390
  
391
  
392
  
393
  
394
  
395
  
396
  
397
  
398
  
399
  
400
  
401
  
402
  
403
  
404
  
405
  
406
  
407
  
408
  
409
  
410
  
411
  
412
  
413
  
414
  
415
  
416
  
417
  
418
  
419
  
420
  
421
  
422
  
423
  
424
  
425
  
426
  
427
  
428
  
429
  
430
  
431
  
432
  
433
  
434
  
435
  
436
  
437
  
438
  
439
  
440
  
441
  
442
  
443
  
444
  
445
  
446
  
447
  
448
  
449
  
450
  
451
  
452
  
453
  
454
  
455
  
456
  
457
  
458
  
459
  
460
  
461
  
462
  
463
  
464
  
465
  
466
  
467
  
468
  
469
  
470
  
471
  
472
  
473
  
474
  
475
  
476
  
 
UNICODE 2.1 CHARACTER DATABASE (update 2.1.8) 
 
Copyright (c) 1991-1998 Unicode, Inc. 
All Rights reserved. 
 
DISCLAIMER 
 
The Unicode Character Database "UnicodeData-Latest.txt" is provided as-is by 
Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any 
particular purpose. No warranties of any kind are expressed or implied. The 
recipient agrees to determine applicability of information provided. If this 
file has been purchased on magnetic or optical media from Unicode, Inc., 
the sole remedy for any claim will be exchange of defective media within 
90 days of receipt. 
 
This disclaimer is applicable for all other data files accompanying the 
Unicode Character Database, some of which have been compiled by the 
Unicode Consortium, and some of which have been supplied by other vendors. 
 
LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA 
 
Recipient is granted the right to make copies in any form for internal 
distribution and to freely use the information supplied in the creation of 
products supporting the Unicode (TM) Standard. This file can be redistributed 
to third parties or other organizations (whether for profit or not) as long 
as this notice and the disclaimer notice are retained. 
 
EXPLANATORY INFORMATION 
 
The Unicode Character Database defines the default Unicode character 
properties, and internal mappings. Particular implementations may choose to 
override the properties and mappings that are not normative. If that is done, 
it is up to the implementer to establish a protocol to convey that 
information. For more information about character properties and mappings, 
see "The Unicode Standard, Worldwide Character Encoding, Version 2.0", 
published by Addison-Wesley. For information about other data files 
accompanying the Unicode Character Database, see the section of the 
Unicode Standard they were extracted from, or the explanatory readme 
files and/or header sections with those files. 
 
The Unicode Character Database has been updated to reflect Version 2.1 
of the Unicode Standard, with two additional characters added to those 
published in Version 2.0: 
 
   U+20AC EURO SIGN 
   U+FFFC OBJECT REPLACEMENT CHARACTER 
 
A number of corrections have also been made to case mappings or other 
errors in the database noted since the publication of Version 2.0. And 
normative bidirectional properties have been modified to reflect 
decisions of the Unicode Technical Committee. 
 
The Unicode Character Database is a plain ASCII text file consisting of lines 
containing fields terminated by semicolons. Each line represents the data for 
one encoded character in the Unicode Standard, Version 2.1. Every encoded 
character has a data entry, with the exception of certain special ranges, as 
detailed below. 
 
There are four special ranges of characters that are represented only by 
their start and end characters, since the properties in the file are uniform, 
except for code values (which are all sequential and assigned). The names of CJK 
ideograph characters and Hangul syllable characters are algorithmically 
derivable. (See the Unicode Standard for more information). Surrogate 
characters and private use characters have no names. 
 
The exact ranges represented by start and end characters are: 
 
   The CJK Ideographs Area (U+4E00 - U+9FFF) 
   The Hangul Syllables Area (U+AC00 - U+D7A3) 
   The Surrogates Area (U+D800 - U+DFFF) 
   The Private Use Area (U+E000 - U+F8FF) 
 
The following table describes the format and meaning of each field in a 
data entry in the Unicode Character Database. Fields which contain 
normative information are so indicated. 
 
Note that the term "normative" when applied to a property or field of 
the Unicode Character Database, does not mean that the value of that 
field will *never* change. Corrections and extensions to the standard 
in the future may require minor changes to normative values, even 
though the Unicode Technical Committee strives to minimize such changes. 
"Normative" means that implementations that claim conformance to the 
Unicode Standard (at a particular version) and which make use of 
a particular property or field must follow the specifications of the 
standard for that property or field in order to be conformant. If a 
property or field is only "informative", a conformant implementation 
is free to use or change such values as it may require, while still 
being conformant to the standard. 
 
Field   Explanation 
-----   ----------- 
 
  0     Code value in 4-digit hexadecimal format. 
        This field is normative. 
 
  1     Unicode 2.1 Character Name. These names match exactly the 
        names published in Chapter 7 of the Unicode Standard, Version 
        2.0, except for the two additional characters. 
        This field is normative. 
 
  2     General Category. This is a useful breakdown into various "character 
        types" which can be used as a default categorization in implementations. 
        Some of the values are normative, and some are informative. 
        See below for a brief explanation. 
 
  3     Canonical Combining Classes. The classes used for the 
        Canonical Ordering Algorithm in the Unicode Standard. These 
        classes are also printed in Chapter 4 of the Unicode Standard. 
        This field is normative. See below for a brief explanation. 
 
  4     Bidirectional Category. See the list below for an explanation of the 
        abbreviations used in this field. These are the categories required 
        by the Bidirectional Behavior Algorithm in the Unicode Standard. 
        These categories are summarized in Chapter 4 of the Unicode Standard. 
        This field is normative. 
 
  5     Character Decomposition. In the Unicode Standard, not all of 
        the decompositions are full decompositions. Recursive 
        application of look-up for decompositions will, in all cases, lead to 
        a maximal decomposition. The decompositions match exactly the 
        decompositions published with the character names in Chapter 7 
        of the Unicode Standard. This field is normative. 
 
  6     Decimal digit value. This is a numeric field. If the character 
        has the decimal digit property, as specified in Chapter 4 of 
        the Unicode Standard, the value of that digit is represented 
        with an integer value in this field. This field is normative. 
 
  7     Digit value. This is a numeric field. If the character represents a 
        digit, not necessarily a decimal digit, the value is here. This 
        covers digits which do not form decimal radix forms, such as the 
        compatibility superscript digits. This field is informative. 
 
  8     Numeric value. This is a numeric field. If the character has the 
        numeric property, as specified in Chapter 4 of the Unicode 
        Standard, the value of that character is represented with an 
        integer or rational number in this field. This includes fractions as, 
        e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. 
        Also included are numerical values for compatibility characters 
        such as circled numbers. This field is normative. 
 
  9     If the characters has been identified as a "mirrored" character in 
        bidirectional text, this field has the value "Y"; otherwise "N". 
        The list of mirrored characters is also printed in Chapter 4 of 
        the Unicode Standard. This field is normative. 
 
 10     Unicode 1.0 Name. This is the old name as published in Unicode 1.0. 
        This name is only provided when it is significantly different from 
        the Unicode 2.1 name for the character. This field is informative. 
 
 11     10646 Comment field. This field is informative. 
 
 12     Upper case equivalent mapping. If a character is part of an 
        alphabet with case distinctions, and has an upper case equivalent, 
        then the upper case equivalent is in this field. See the explanation 
        below on case distinctions. These mappings are always one-to-one, 
        not one-to-many or many-to-one. This field is informative. 
 
 13     Lower case equivalent mapping. Similar to 12. This field is informative. 
 
 14     Title case equivalent mapping. Similar to 12. This field is informative. 
 
GENERAL CATEGORY 
 
The values in this field are abbreviations for the following. Some of the 
values are normative, and some are informative. For more information, see 
the Unicode Standard. Note: the standard does not assign information to 
control characters (except for TAB in the Bidirectonal Algorithm). 
Implementations will generally also assign categories to certain control 
characters, notably CR and LF, according to platform conventions. 
 
Normative 
    Mn = Mark, Non-Spacing 
    Mc = Mark, Spacing Combining 
    Me = Mark, Enclosing 
 
    Nd = Number, Decimal Digit 
    Nl = Number, Letter 
    No = Number, Other 
 
    Zs = Separator, Space 
    Zl = Separator, Line 
    Zp = Separator, Paragraph 
 
    Cc = Other, Control 
    Cf = Other, Format 
    Cs = Other, Surrogate 
    Co = Other, Private Use 
    Cn = Other, Not Assigned 
 
Informative 
    Lu = Letter, Uppercase 
    Ll = Letter, Lowercase 
    Lt = Letter, Titlecase 
    Lm = Letter, Modifier 
    Lo = Letter, Other 
 
    Pc = Punctuation, Connector 
    Pd = Punctuation, Dash 
    Ps = Punctuation, Open 
    Pe = Punctuation, Close 
    Pi = Punctuation, Initial quote (may behave like Ps or Pe depending 
                              on usage) 
    Pf = Punctuation, Final quote (may behave like Ps or Pe depending 
                              on usage) 
    Po = Punctuation, Other 
 
    Sm = Symbol, Math 
    Sc = Symbol, Currency 
    Sk = Symbol, Modifier 
    So = Symbol, Other 
 
BIDIRECTIONAL PROPERTIES 
 
Please refer to the Unicode Standard for an explanation of the algorithm for 
Bidirectional Behavior and an explanation of the sigificance of these categories. 
These values are normative. 
 
Strong types: 
        L       Left-Right; Most alphabetic, syllabic, and logographic 
                        characters (e.g., CJK ideographs) 
        R       Right-Left; Arabic, Hebrew, and 
                        punctuation specific to those scripts 
Weak types: 
        EN      European Number 
        ES      European Number Separator 
        ET      European Number Terminator 
        AN      Arabic Number 
        CS      Common Number Separator 
 
Separators: 
        B       Block Separator 
        S       Segment Separator 
 
Neutrals: 
        WS      Whitespace 
        ON      Other Neutrals ; All other characters: punctuation, symbols 
 
CHARACTER DECOMPOSITION TAGS 
 
The decomposition is a normative property of a character. The tags supplied 
with certain decompositions generally indicate formatting information. 
Where no such tag is given, the decomposition is designated as canonical. 
Conversely, the presence of a formatting tag also indicates 
that the decomposition is a compatibility decomposition and not a canonical 
decomposition. In the absence of other formatting information in a 
compatibility decomposition, the tag <compat> is used to distinguish it from 
canonical decompositions. 
 
In some instances a canonical decomposition or a compatibility decomposition 
may consist of a single character. For a canonical decomposition, this 
indicates that the character is a canonical equivalent of another single 
character. For a compatibility decomposition, this indicates that the 
character is a compatibility equivalent of another single character. 
 
The compatibility formatting tags used are: 
 
        <font>          A font variant (e.g. a blackletter form). 
        <noBreak>       A no-break version of a space or hyphen. 
        <initial>       An initial presentation form (Arabic). 
        <medial>        A medial presentation form (Arabic). 
        <final>         A final presentation form (Arabic). 
        <isolated>      An isolated presentation form (Arabic). 
        <circle>        An encircled form. 
        <super>         A superscript form. 
        <sub>           A subscript form. 
        <vertical>      A vertical layout presentation form. 
        <wide>          A wide (or zenkaku) compatibility character. 
        <narrow>        A narrow (or hankaku) compatibility character. 
        <small>         A small variant form (CNS compatibility). 
        <square>        A CJK squared font variant. 
        <fraction>      A vulgar fraction form. 
        <compat>        Otherwise unspecified compatibility character. 
 
CANONICAL COMBINING CLASSES 
 
  0: Spacing, enclosing, reordrant, and surrounding 
  1: Overlays and interior 
  6: Tibetan subjoined Letters 
  7: Nuktas 
  8: Hiragana/Katakana voiced marks 
  9: Viramas 
 10: Start of fixed position classes 
199: End of fixed position classes 
200: Below left attached 
202: Below attached 
204: Below right attached 
208: Left attached (reordrant around single base character) 
210: Right attached 
212: Above left attached 
214: Above attached 
216: Above right attached 
218: Below left 
220: Below 
222: Below right 
224: Left (reordrant around single base character) 
226: Right 
228: Above left 
230: Above 
232: Above right 
234: Double above 
240: Below (iota subscript) 
 
Note: some of the combining classes in this list do not currently have 
members but are specified here for completeness. 
 
DECOMPOSITIONS AND NORMALIZATION 
 
The Unicode Technical Report #15 Normalization Forms is found on 
http://www.unicode.org/unicode/reports/tr15/. 
 
That report specifies how the decompositions defined in the Unicode 
Character Database are used to derive normalized forms of Unicode 
text. 
 
Note that as of the 2.1.8 update of the Unicode Character Database, 
the decompositions in the UnicodeData.txt file can be used to recursively 
derive the full decomposition in canonical order, without the need 
to separately apply canonical reordering. However, canonical reordering 
of combining character sequences must still be applied in decomposition 
when normalizing source text which contains any combining marks. 
 
CASE MAPPINGS 
 
In addition to uppercase and lowercase, because of the inclusion of certain 
composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER 
DZ", there is a third case, called titlecase, which is used where the first 
character of a word is to be capitalized (e.g. UPPERCASE, Titlecase, 
lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D 
WITH SMALL LETTER Z". 
 
The uppercase, titlecase and lowercase fields are only included for characters 
that have a single corresponding character of that type. Composite characters 
(such as "339D;SQUARE CM") that do not have a single corresponding character 
of that type can be cased by decomposition. 
 
The case mapping is an informative, default mapping. Case itself, on 
the other hand, has normative status. Thus, for example, 0041 LATIN 
CAPITAL LETTER A is normatively uppercase, but its lowercase mapping 
the 0061 LATIN SMALL LETTER A is informative. The reason for this is 
that case can be considered to be an inherent property of a particular 
character (and is usually, but not always, derivable from the presence 
of the terms "CAPITAL" or "SMALL" in the character name), but case 
mappings between characters are occasionally influenced by local 
conventions. For example, certain languages, such 
as Turkish, German, French, or Greek may have small deviations from the 
default mappings listed in the Unicode Character Database. 
 
For compatibility with existing parsers, the Unicode Character Database 
only contains case mappings for characters where they are one-to-one 
mappings; it also omits information about locale-specific 
case mappings. Information about these special cases can be found in 
a separate data file, SpecialCasing.txt, which has been added with 
the 2.1.8 update to the Unicode data files. SpecialCasing.txt contains 
additional informative case mappings that are either not one-to-one or 
which are context-sensitive. 
 
MODIFICATION HISTORY 
 
Modifications made for Version 2.1.8 of the Unicode Character Database: 
* Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI 
        so that decompositions involving iota subscript are derivable 
        directly in canonically reordered form; this also has a bearing 
        on simplification of casing of polytonic Greek. 
* Changes in decompositions related to Greek tonos. These result from 
        the clarification that monotonic Greek "tonos" should be 
        equated with U+0301 COMBINING ACUTE, rather than with 
        U+030D COMBINING VERTICAL LINE ABOVE. (All Greek characters 
        in the Greek block involving "tonos"; some Greek characters 
        in the polytonic Greek in the 1FXX block.) 
* Changed decompositions involving dialytika tonos. (U+0390, U+03B0) 
* Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) 
        These changes simplify normalization. 
* Removed canonical decomposition for Latin Candrabindu. (U+0310) 
* Correct error in canonical decomposition for U+1FF4. 
* Added compatibility decompositions to clarify collation tables. 
        (U+2100, U+2101, U+2105, U+2106, U+1E9A) 
* A series of general category changes to assist the convergence of 
        of Unicode definition of identifier with ISO TR 10176: 
        So > Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B 
        Po > Lo: U+0E2F, U+0EAF, U+3006 
        Lm > Sk: U+309B, U+309C 
        Po > Pc: U+30FB, U+FF65 
        Ps/Pe > Mn: U+0F3E, U+0F3F 
* A series of bidi property changes for consistency. 
        L > ET: U+09F2, U+09F3 
        ON > L: U+3007 
        L > ON: U+0F3A..U+0F3D, U+037E, U+0387 
* Add case mapping: U+01A6 <-> U+0280 
* Update symmetric swapping value for guillemets: U+00AB, U+00BB, 
        U+2039, U+203A. 
* Changes to combining class values. Most Indic fixed position class 
        non-spacing marks were changed to combining class 0. This fixes 
        some inconsistencies in how canonical reordering would apply 
        to Indic script, including Tibetan. Indic interacting top/bottom 
        fixed position classes were merged into single (non-zero) 
        classes as part of this change. Tibetan subjoined consonants 
        are changed from combining class 6 to combining class 0. 
        Thai pinthu (U+0E3A) moved to combining class 9. Move two 
        Devanagari stress marks into generic above and below 
        combining classes (U+0951, U+0952). 
* Correct placement of semicolon near symmetric swapping field. 
        (U+FA0E, etc., scattered positions to U+FA29) 
 
[Note that Versions 2.1.6 and 2.1.7 of the Unicode Character 
Database were for internal change tracking only, and were never 
finally approved for public release.] 
 
Modifications made for Version 2.1.5 of the Unicode Character Database: 
* Changed decomposition for U+FF9E and U+FF9F so that correct collation 
        weighting will automatically result from the canonical 
        equivalences. 
* Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9, 
        U+04E0, U+04E1, U+04E8, U+04E9 (the implication being that 
        no canonical equivalence is claimed between these 8 characters 
        and similar Latin letters), and updated 4 canonical decompositions 
        for U+04DB, U+04DC, U+04EA, U+04EB to reflect the implied 
        difference in the base character. 
* Added Pi, and Pf categories and assigned the relevant quotation 
        marks to those categories, based on the Unicode Technical 
        Corrigendem on Quotation Characters. 
* Updating of many bidi properties, following the advice of the ad hoc 
        committee on bidi, and to make the bidi properties of compatibility 
        characters more consistent. 
* Changed category of several Tibetan characters: U+0F3E, U+0F3F, 
        U+0F88..U+0F8B to make them non-combining, reflecting the 
        combined opinion of Tibetan experts. 
* Added case mapping for U+03F2. 
* Corrected case mapping for U+0275. 
* Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0.. 
        U+03F2. 
* Corrected compatibility label for U+2121. 
* Add specific entries for all the CJK compatibility ideographs, 
        U+F900..U+FA2D, so the canonical decomposition for each 
        (the URO character it is equivalent to) can be carried 
        in the database. 
 
[Note that Versions 2.1.3 and 2.1.4 of the Unicode Character 
Database were for internal change tracking only, and were never 
finally approved for public release.] 
 
Modifications made in updating the Unicode Character Database to 
Version 2.1.2 for the Unicode Standard, Version 2.1 (from Version 2.0) are: 
* Added two characters (U+20AC and U+FFFC). 
* Amended bidi properties for U+0026, U+002E, U+0040, U+2007. 
* Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275, 
        U+03C2, U+1E9B. 
* Changed combining order class for U+0F71. 
* Corrected canonical decompositions for U+0F73, U+1FBE. 
* Changed decomposition for U+FB1F from compatibility to canonical. 
* Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB. 
* Corrected compatibility decompositions for U+2469, U+246A, U+3358. 
 
 
Some of the modifications made in updating the Unicode Character Database 
to Version 2.0.14 for the Unicode Standard, Version 2.0 are: 
* Fixed decompositions with TONOS to use correct NSM: 030D. 
* Removed old Hangul Syllables; mapping to new characters are 
        in a separate table. 
* Marked compability decompositions with additional tags. 
* Changed old tag names for clarity. 
* Revision of decompositions to use first-level decomposition, instead 
        of maximal decomposition. 
* Correction of all known errors in decompositions from earlier versions. 
* Added control code names (as old Unicode names). 
* Added Hangul Jamo decompositions. 
* Added Number category to match properties list in book. 
* Fixed categories of Koranic Arabic marks. 
* Fixed categories of precomposed characters to match decomposition where possible. 
* Added Hebrew cantillation marks and the Tibetan script. 
* Added place holders for ranges such as CJK Ideographic Area and the 
        Private Use Area. 
* Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of mistakes in 
        the database.