Unicode™

Introduction

This page covers only the basics of Unicode. More detail is available at http://www.unicode.org/standard/WhatIsUnicode.html.

Unicode provides a consistent way of encoding multilingual text and greatly simplifies the international exchange of text files. Its codespace runs from U+0000 to U+10FFFF, which is more than a single 16-bit value can hold, and is designed to cover the characters of the written languages of the world.

Players

Highlights and Benefits of Unicode

Unicode has become one of the more significant recent global software technology trends. Here are some of the advantages:

Unicode Encoding Model

The four levels of the Unicode Character Encoding Model, per Unicode Technical Report #17 (http://www.unicode.org/unicode/reports/tr17/), can be summarized as:

ACR: Abstract Character Repertoire:
The set of characters to be encoded, for example, some alphabet or symbol set.
CCS: Coded Character Set: (describes code points)
A mapping from an ACR to a set of nonnegative integers (code points).
CEF: Character Encoding Form: (describes code units)
A mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers.
CES: Character Encoding Scheme:
A reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes.

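Verbal Example: The repertoires of ASCII and Latin-1 (ACRs) are mapped to sets of integers; for Latin-1 the mapping is specified in ISO/IEC 8859-1 (CCS). Unicode assigns these characters the same integers as code points, and an encoding form such as UTF-16 (CEF) specifies the code unit width (16 bits, that is, 2 bytes) used to represent them. An encoding scheme such as UTF-16BE or UTF-16LE (CES) then fixes the byte order in which those code units are serialized.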
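The same walk through the levels can be sketched in Python. This is only an illustration under stated assumptions: the helper names utf16_code_units and serialize_utf16 are hypothetical, written for this example, and the results are checked against Python's built-in "utf-16-be" and "utf-16-le" codecs.

```python
# ACR: the character belongs to some repertoire (here, Latin-1).
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# CCS: the character maps to a nonnegative integer, its code point.
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+00E9


def utf16_code_units(cp):
    """CEF (hypothetical helper): map one code point to 16-bit code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]  # one code unit is enough
    cp -= 0x10000    # otherwise split into a surrogate pair
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def serialize_utf16(units, byteorder="big"):
    """CES (hypothetical helper): serialize code units to bytes."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


units = utf16_code_units(code_point)
print([f"{u:04X}" for u in units])            # ['00E9']
print(serialize_utf16(units, "big").hex())     # 00e9
print(serialize_utf16(units, "little").hex())  # e900
```

The split between the two helpers mirrors the model: the encoding form decides how many code units a code point needs, while the encoding scheme only decides the byte order of each unit.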

How Does Unicode Work

  1. Unicode and ISO/IEC 10646 Relationship: Unicode is conformant to ISO/IEC 10646. For example, Unicode 3.0 contains all the same characters and code points as ISO/IEC 10646-1:2000.
  2. Universal Repertoire: The character sets of most existing international and national standards are incorporated into the Unicode Standard. Unicode covers the scripts of the world's written languages as well as mathematical and other symbols. For example:
    • ASCII: The first 128 characters of the Unicode Standard, U+0000 to U+007F, are the 7-bit ASCII character set. ASCII has also been referred to as Basic Latin and US-ASCII.
    • Latin-1: The first 256 characters of the Unicode Standard, U+0000 to U+00FF, are the Latin-1 character set, which is the same as ISO-8859-1. Latin-1 has also been referred to as "extended ASCII". Note that the first 128 Latin-1 codes are the US-ASCII codes.
    • ISO-8859: The repertoires of the other parts of the ISO-8859 family of 8-bit character sets are covered as well.
  3. Unique Numerical Value: The Unicode Standard assigns each character a unique numeric value and name. These values are called code points. The Unicode codespace is the range of code points from U+0000 to U+10FFFF. A code point is usually written in hexadecimal as U+xxxx.
  4. Unique Character Name: Each defined Unicode character also has a unique official long English character name that is sometimes quite logical (such as "LATIN CAPITAL LETTER A" for U+0041).
  5. CCS Assignments: Here are the first 8K code points and their assignments:
    • Basic Latin (US-ASCII): {U+0000..U+007F}
    • Latin-1 (ISO-8859-1): {U+0080..U+00FF}
    • Latin Extended: {U+0100..U+024F}
    • IPA Extensions: {U+0250..U+02AF}
    • Spacing Modifier Letters: {U+02B0..U+02FF}
    • Combining Diacritical Marks: {U+0300..U+036F}
    • Greek: {U+0370..U+03FF}
    • Cyrillic: {U+0400..U+04FF}
    • Armenian: {U+0530..U+058F}
    • Hebrew: {U+0590..U+05FF}
    • Arabic: {U+0600..U+06FF}
    • Syriac: {U+0700..U+074D}
    • Thaana: {U+0780..U+07B1}
    • ISCII Indic Scripts: {U+0900..U+0DFF}
      • Devanagari: {U+0900..U+097F}
      • Bengali: {U+0980..U+09FF}
      • Gurmukhi: {U+0A00..U+0A7F}
      • Gujarati: {U+0A80..U+0AFF}
      • Oriya: {U+0B00..U+0B7F}
      • Tamil: {U+0B80..U+0BFF}
      • Telugu: {U+0C00..U+0C7F}
      • Kannada: {U+0C80..U+0CFF}
      • Malayalam: {U+0D00..U+0D7F}
      • Sinhalese: {U+0D80..U+0DFF}
    • Thai: {U+0E00..U+0E7F}
    • Lao: {U+0E80..U+0EFF}
    • Tibetan: {U+0F00..U+0FBF}
    • Burmese (Myanmar): {U+1000..U+109F}
    • Georgian: {U+10A0..U+10FF}
    • Hangul Jamo: {U+1100..U+11FF}
    • Ethiopic: {U+1200..U+137F}
    • Cherokee: {U+13A0..U+13FF}
    • Canadian Syllabics: {U+1400..U+167F}
    • Ogham: {U+1680..U+169F}
    • Runic: {U+16A0..U+16FF}
    • Khmer: {U+1780..U+17E9}
    • Mongolian: {U+1800..U+18AF}
    • Latin Extended Additional: {U+1E00..U+1EFF}
    • Greek Extended: {U+1F00..U+1FFF}
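
The points in this list can be checked with a short Python sketch. It is an illustration only: BLOCKS is a small hand-copied subset of the ranges listed above, block_of and describe are hypothetical helper names, and character names come from the standard unicodedata module, which reflects whatever Unicode version ships with the Python interpreter in use.

```python
import unicodedata

# A few (first, last, name) ranges copied from the list above; not exhaustive.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin (US-ASCII)"),
    (0x0080, 0x00FF, "Latin-1 (ISO-8859-1)"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0E00, 0x0E7F, "Thai"),
]


def block_of(ch):
    """Return the name of the listed range containing ch, or None."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def describe(ch):
    """Format one character as 'U+xxxx NAME [range]'."""
    name = unicodedata.name(ch, "<unnamed>")  # control codes have no name
    return f"U+{ord(ch):04X} {name} [{block_of(ch)}]"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A [Basic Latin (US-ASCII)]
print(describe("\u03a9"))  # U+03A9 GREEK CAPITAL LETTER OMEGA [Greek]

# Every Latin-1 byte value decodes to the code point with the same number,
# and the first 128 of those agree with ASCII.
latin1 = bytes(range(256)).decode("latin-1")
assert [ord(c) for c in latin1] == list(range(256))
assert latin1[:128] == bytes(range(128)).decode("ascii")
```

A linear scan is enough for a table this small; a full table of assignments would more likely be searched with bisect on the sorted range starts.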

Rx4AJAX        2008 This Site Built By PPThompson