GBK Chinese Internal Code Specification
GBK is another Hanzi encoding standard. Its full name is 《汉字内码扩展规范》 (GBK), and its English name is Chinese Internal Code Specification. It was formulated by the National Information Technology Standardization Technical Committee of the People's Republic of China on December 1, 1995, and defined as a technical specification. It was published and put into effect on December 15, 1995, jointly by the Standardization Department of the State Bureau of Quality and Technical Supervision and the Technical and Quality Supervision Department of the Ministry of Electronics Industry, as Ji Jian Biao Han [1995] Document 229. This version of the specification is GBK version 1.0. "GB" stands for GuoBiao, meaning national standard, and "K" is the first letter of the pinyin Kuozhan, meaning extension.
GBK is backward compatible with GB 2312 and supports the international standard ISO 10646.1. It serves as a bridge between GB 2312 and ISO 10646.1 during the transition from the former to the latter.
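The backward compatibility described above can be observed with Python's built-in "gb2312" and "gbk" codecs. This is only a small sketch: the sample strings are chosen for illustration, and Python's "gbk" codec is assumed to be a faithful stand-in for the GBK specification on these characters.

```python
# Encode the same GB 2312 text with both codecs and compare the bytes.
text = "汉字"  # both characters belong to GB 2312

gb2312_bytes = text.encode("gb2312")
gbk_bytes = text.encode("gbk")
print(gb2312_bytes.hex(), gbk_bytes.hex())  # the two byte sequences match

# A traditional-form character that GB 2312 lacks but GBK includes.
extra = "漢"
try:
    extra.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

extra_gbk = extra.encode("gbk")  # two bytes under GBK
print(in_gb2312, len(extra_gbk))
```

Text that fits in GB 2312 keeps the same bytes under GBK, while characters outside GB 2312 become encodable only with the larger code.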
ISO 10646, an encoding standard issued by ISO, is called the Universal Multiple-Octet Coded Character Set (abbreviated as UCS), and is translated as 《通用多八位编码字符集》 in mainland China and 《广用多八位元编码字元集》 in Taiwan. It is fully compatible with the Unicode encoding of the Unicode Consortium. ISO 10646.1 is the first part of this standard: Architecture and Basic Multilingual Plane. China adopted the standard in 1993 as the national standard GB 13000.1 (i.e., GB 13000.1 is equivalent to ISO 10646.1).
ISO 10646 is an encoding system that incorporates the written forms and additional symbols of multiple languages. The Hanzi portion is called "CJK Unified Hanzi" (C refers to China, J to Japan, and K to Korea). Within it, the Chinese part includes the Hanzi and symbols defined in official mainland standards such as GB 2312, GB 12345, and the "Table of Commonly Used Modern Chinese Characters", as well as the Hanzi and symbols of Planes 1 and 2 (basically the same as BIG-5 encoding) and Plane 14 of CNS 11643 from Taiwan.
I. Repertoire
The GBK specification incorporates all the CJK Hanzi and symbols in ISO 10646.1, with additions. Specifically, it includes:
1. All Hanzi and non-Hanzi symbols in GB 2312.
2. Other CJK Hanzi in GB 13000.1. Items 1 and 2 together total 20902 GB Hanzi.
3. 52 Hanzi in "General Table for Simplified Chinese Characters" that are not integrated into GB 13000.1.
4. 28 radicals and important components in "Kangxi Dictionary" and "Ci Hai" (Great Dictionary) that are not integrated into GB 13000.1.
5. 13 Hanzi structure symbols.
6. 139 graphical symbols in BIG-5 that are not included in GB 2312 but exist in GB 13000.1.
7. 6 pinyin symbols added into GB 12345.
8. The Hanzi zero "○".
9. 19 vertical punctuation symbols added in GB 12345 (GB 12345 adds 29 such symbols beyond GB 2312, of which 10 are not included in GB 13000.1, so GBK does not incorporate them either).
10. 21 Hanzi selected from the CJK compatibility area in GB 13000.1.
11. 31 symbols specific to IBM OS/2 that are contained in GB 13000.1.
II. Code Point Allocation and Sequence
GBK also uses two bytes per character. The overall encoding range is 8140-FEFE: the leading byte lies in 81-FE and the trailing byte in 40-FE, excluding xx7F. This gives 23940 code points in total. Of these, 21886 are assigned to Hanzi and graphical characters: 21003 Hanzi (including radicals and components) and 883 graphical symbols.
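The figures in this paragraph can be checked with a little arithmetic. The sketch below simply counts the byte combinations and adds up the numbers quoted in the text. The breakdown of the 21003 Hanzi into repertoire items is an assumption (one reading of the list in Section I that happens to reproduce the total), not something the specification text above states explicitly.

```python
# Leading bytes 81-FE and trailing bytes 40-FE, skipping 7F.
lead_bytes = range(0x81, 0xFF)
trail_bytes = [b for b in range(0x40, 0xFF) if b != 0x7F]

total_code_points = len(lead_bytes) * len(trail_bytes)

# Character counts quoted in the text.
hanzi = 21003
symbols = 883
assigned = hanzi + symbols
spare = total_code_points - assigned  # user-defined or otherwise unassigned

# Assumption: repertoire items 1-2, 3, 4 and 10 are the ones counted as Hanzi.
hanzi_from_repertoire = 20902 + 52 + 28 + 21

print(len(lead_bytes), len(trail_bytes), total_code_points)  # 126 190 23940
print(assigned, spare, hanzi_from_repertoire)                # 21886 2054 21003
```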
The entire encoding is divided into three parts:
1. Hanzi Area, including:
a. GB 2312 Hanzi Area, i.e., GBK/2: B0A1-F7FE. It contains 6763 Hanzi in GB 2312, arranged in the original sequence.
b. GB 13000.1 Hanzi Extension Area, including:
(1) GBK/3: 8140-A0FE. It contains 6080 CJK Hanzi from GB 13000.1.
(2) GBK/4: AA40-FEA0. It contains 8160 CJK Hanzi and additional Hanzi. The CJK Hanzi come first, ordered by UCS code value; the additional Hanzi (including radicals and components) follow, ordered by page and position in the "Kangxi Dictionary".
2. Graphical Symbol Area, including:
a. GB 2312 Non-Hanzi Symbol Area, i.e., GBK/1: A1A1-A9FE. In addition to the GB 2312 symbols, it contains 10 lowercase Roman numerals and the symbols added in GB 12345, for a total of 717 symbols.
b. GB 13000.1 Non-Hanzi Extension Area, i.e., GBK/5: A840-A9A0. There are 166 symbols in this area, including the BIG-5 non-Hanzi symbols, the structure symbols, and "○".
3. User Defined Area, divided into three sub-areas (1), (2) and (3):
(1) AAA1-AFFE: 564 code points.
(2) F8A1-FEFE: 658 code points.
(3) A140-A7A0: 672 code points.
Although open to users, sub-area (3) should be used with restraint, since new characters may still be added to it in the future.
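The allocation above can be sketched as a small classifier that maps a two-byte code to its area. The function name `gbk_area` and the labels it returns are hypothetical, chosen only for this illustration; the ranges are taken directly from the list above. Counting every valid byte pair also shows that the eight areas cover the whole 8140-FEFE space without overlap.

```python
from collections import Counter


def gbk_area(code: int) -> str:
    """Return the area label for a two-byte code such as 0xB0A1 (hypothetical helper)."""
    lead, trail = code >> 8, code & 0xFF
    if not (0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F):
        return "invalid"
    if lead <= 0xA0:
        return "GBK/3"                      # 8140-A0FE
    low = trail <= 0xA0                     # trailing byte in 40-A0
    if lead <= 0xA7:
        return "User (3)" if low else "GBK/1"
    if lead <= 0xA9:
        return "GBK/5" if low else "GBK/1"
    if low:
        return "GBK/4"                      # AA40-FEA0
    if lead <= 0xAF:
        return "User (1)"                   # AAA1-AFFE
    if lead <= 0xF7:
        return "GBK/2"                      # B0A1-F7FE
    return "User (2)"                       # F8A1-FEFE


# Count the code points that fall into each area.
sizes = Counter(
    gbk_area((hi << 8) | lo)
    for hi in range(0x81, 0xFF)
    for lo in range(0x40, 0xFF)
    if lo != 0x7F
)
for name in sorted(sizes):
    print(name, sizes[name])
print(sum(sizes.values()))  # 23940
```

Note that the counts are code positions, not assigned characters: GBK/2, for example, has more positions than the 6763 Hanzi it holds.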
III. Font
GBK makes the following prescriptions on font:
1. In principle, the font and stroke forms follow the G row of GB 13000.1 (i.e., Hanzi originating from the official standards of mainland China).
2. Within the overall framework of the CJK Hanzi identification rules, the principle of "no duplication and standard font" applies to all GBK coded Hanzi (moving toward the GB standard). This means using the new fonts as far as possible, provided that no duplicate codes arise.
3. For Hanzi that fall outside the CJK Hanzi identification rules, or are not covered by them, the old fonts are kept at their GBK positions for the time being. As a result, GBK in many cases contains both the old and the new font of the same Hanzi.
4. Fonts for non-Hanzi symbols follow GB 2312 where the symbol is already in GB 2312; symbols outside GB 2312 follow GB 13000.1.
5. Half-width forms are used for phonetic pinyin letters.
GBK Code Table (Ordered by the Classification Sequence)
Leading-byte ranges (hexadecimal) of each area:
GBK/1: GB 2312 non-Hanzi symbols: A1-A9
GBK/2: GB 2312 Hanzi: B0-B7, B8-BF, C0-C7, C8-CF, D0-D7, D8-DF, E0-E7, E8-EF, F0-F7
GBK/3: Hanzi Extension: 81-83, 84-87, 88-8B, 8C-8F, 90-93, 94-97, 98-9B, 9C-A0
GBK/4: Hanzi Extension: AA-AF, B0-B7, B8-BF, C0-C7, C8-CF, D0-D7, D8-DF, E0-E7, E8-EF, F0-F7, F8-FE
GBK/5: Non-Hanzi Extension: A8-A9
User Defined Area: (1) AA-AF, (2) F8-FE, (3) A1-A7