Converting Gurmukhi to Unicode

From SikhiWiki
Revision as of 17:59, 4 September 2020 by Hari singh (talk | contribs)

Before the Unicode system for representing characters was formalised, Gurmukhi characters were commonly represented through custom fonts that reused positions in the existing 8-bit (extended ASCII) character set, which has 256 code positions. The personal computer revolution started around 1980, and computers have since become more powerful and sophisticated. Use of the Internet by the general public began around 1999, and these two developments changed the world dramatically. Early personal computers mostly used English/Latin fonts with 256 character slots: 32 of these were control characters, about 70 were upper- and lower-case Latin/English/Western characters, and about 20 were mathematical characters. This left little room to cater for other languages.

By using different font character sets, it became possible to use other languages on the Macintosh and on personal computers, but the characters had to occupy the same 256 slots as the English/Latin ones. With the introduction of Windows 3.1 it also became possible to use different languages on the MS Windows platform. For many languages the 256-character limit was a hindrance, so other standards arose. Many different, mutually incompatible standards became prevalent, and it soon became obvious that a single unified standard was needed for all the languages of the world. This led to the formation of the Unicode Consortium.

Below are examples of these ASCII Punjabi fonts used in the non-Unicode days:

  • Gurbani Akhar
  • Anmol Lipi
  • Amar Lipi
  • Bulara
  • Prabhki
  • Raaj
  • Gurbani Web Thick
  • Web Akhar Slim
  • Web Lipi Heavy
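
Converting text typed in one of these fonts to Unicode is, at its core, a table lookup plus some reordering. The sketch below illustrates the idea only. The names `LEGACY_TO_UNICODE` and `legacy_to_unicode` are hypothetical, and the key assignments in the table are assumptions made for illustration; they have not been checked against any of the fonts listed above, each of which needs its own verified table. The reordering step assumes the common legacy convention of typing the sihari (vowel sign I) before its consonant, where it is drawn, whereas Unicode stores it after the consonant.

```python
# Hypothetical, partial mapping from legacy-font key codes to Unicode.
# Every entry is an illustrative assumption, not a verified font table.
LEGACY_TO_UNICODE = {
    "s": "\u0a38",  # GURMUKHI LETTER SA
    "k": "\u0a15",  # GURMUKHI LETTER KA
    "K": "\u0a16",  # GURMUKHI LETTER KHA
    "r": "\u0a30",  # GURMUKHI LETTER RA
    "w": "\u0a3e",  # GURMUKHI VOWEL SIGN AA (kanna)
    "i": "\u0a3f",  # GURMUKHI VOWEL SIGN I (sihari)
}

SIHARI = "\u0a3f"


def legacy_to_unicode(text: str) -> str:
    """Map legacy key codes to Unicode and move sihari after its consonant."""
    out = []
    pending_sihari = False
    for ch in text:
        mapped = LEGACY_TO_UNICODE.get(ch, ch)  # unknown characters pass through
        if mapped == SIHARI:
            # Hold the vowel sign until the following character has been emitted.
            pending_sihari = True
            continue
        out.append(mapped)
        if pending_sihari:
            out.append(SIHARI)
            pending_sihari = False
    if pending_sihari:  # a trailing sihari with nothing after it
        out.append(SIHARI)
    return "".join(out)


print(legacy_to_unicode("isK"))
```

A real converter would also have to handle conjuncts, nukta forms, and the other vowel signs, and would refuse to attach a held sihari to a space or punctuation mark; this sketch does none of that.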

Unicode Gurmukhi

What is Unicode? The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. Unicode is the accepted international standard; it supports all major scripts of the world and has been adopted by all current major computer operating systems. Its encoding forms use 8-, 16- and 32-bit code units (UTF-8, UTF-16 and UTF-32); a single 16-bit unit alone can distinguish more than 65,000 characters, far more than an 8-bit font can hold. It supports the major Indic (Indian) scripts, including Devanagari (Hindi, Marathi, Sanskrit), Bengali (Bengali, Assamese), Gurmukhi (Punjabi), Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam. Microsoft Windows XP and Vista have full support for Indic scripts, including Gurmukhi. Future development work on scripts is expected to be based on Unicode.

Advantages of Unicode relating to Gurmukhi script: This write-up does not aim to cover the topic in detail, but some points are worth mentioning. Migration to Unicode may not be painless, as one has to adapt to new ways of working (although this is not a major obstacle), and editing requires software that supports Unicode. For example, to edit Unicode text on computers running Windows XP or later, MS Word 2003 or later is needed.

However, there are many advantages in using Unicode text. Documents and web pages written in Unicode text, when viewed in a suitable web browser on a computer with Unicode support, are always displayed in the right script, even if the font in which they were composed is not installed on the system (just as English text remains English even when its original font is missing). One can name files and folders in Gurmukhi, search web pages in Gurmukhi, sort text with ease, exchange Gurmukhi data without worrying about fonts, and avoid the upper-case/lower-case and spacing problems that arise with many of the available non-Unicode Gurmukhi fonts. Unicode's implementation of the Indic scripts follows the recommendations of the Indian government, and because the code points for corresponding characters are well defined, phonetic transliteration between Indic scripts is easier than it is with non-standard fonts.
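
The point about transliteration can be made concrete. The Gurmukhi and Devanagari blocks are laid out in parallel, so many corresponding characters sit exactly 0x100 apart. The following is a minimal sketch under that assumption, not a complete transliterator; the function name `gurmukhi_to_devanagari` is hypothetical. To stay conservative it converts a character only when the character at the shifted position has the same Unicode name apart from the script prefix, and leaves everything else (for example the Gurmukhi-only tippi) untouched.

```python
import unicodedata

# The Devanagari block starts at U+0900 and the Gurmukhi block at U+0A00.
GURMUKHI_TO_DEVANAGARI_OFFSET = 0x0900 - 0x0A00


def gurmukhi_to_devanagari(text: str) -> str:
    """Shift Gurmukhi characters to their same-named Devanagari counterparts."""
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("GURMUKHI "):
            candidate = chr(ord(ch) + GURMUKHI_TO_DEVANAGARI_OFFSET)
            expected = name.replace("GURMUKHI", "DEVANAGARI", 1)
            # Convert only when the shifted slot holds the matching character.
            if unicodedata.name(candidate, "") == expected:
                out.append(candidate)
                continue
        out.append(ch)
    return "".join(out)


print(gurmukhi_to_devanagari("\u0a15\u0a3e\u0a30"))  # KA + VOWEL SIGN AA + RA
```

The name check means that characters whose names differ between the two scripts are passed through unchanged, so a practical tool would need an explicit table for those cases.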

Gurmukhi is a Unicode block (U+0A00..U+0A7F) containing characters for the Punjabi language as it is written in Punjab, India. In its original incarnation, the code points U+0A02..U+0A4C were a direct copy of the Gurmukhi characters A2-EC from the 1988 Indian Script Code for Information Interchange ("ISCII") standard. The Devanagari (Hindi, Marathi, Sanskrit, Konkani), Bengali, Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam blocks were similarly all based on their ISCII encodings.
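
Taken at face value, the correspondence described above (A2 paired with U+0A02, EC with U+0A4C) amounts to a constant offset. The sketch below only illustrates that arithmetic; the function name `iscii88_byte_to_code_point` is hypothetical, and the code is not a converter for real ISCII data, which may follow a later revision of the standard and should not be assumed to obey this offset.

```python
# Endpoints quoted in the text: ISCII-1988 bytes A2..EC, code points U+0A02..U+0A4C.
ISCII88_FIRST, ISCII88_LAST = 0xA2, 0xEC
UNICODE_FIRST = 0x0A02

# Constant distance between the two ranges (0x0960).
OFFSET = UNICODE_FIRST - ISCII88_FIRST


def iscii88_byte_to_code_point(byte: int) -> int:
    """Apply the fixed offset implied by the text to a byte in A2..EC."""
    if not ISCII88_FIRST <= byte <= ISCII88_LAST:
        raise ValueError(f"byte 0x{byte:02X} is outside the range A2..EC")
    return byte + OFFSET


for b in (ISCII88_FIRST, ISCII88_LAST):
    print(f"0x{b:02X} -> U+{iscii88_byte_to_code_point(b):04X}")
```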

Unicode

Unicode is an information technology (IT) standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium. As of March 2020 (Unicode 13.0), its repertoire comprises 143,859 characters (143,696 graphic characters and 163 format characters) covering 154 modern and historic scripts, as well as multiple symbol sets and emoji. The character repertoire of the Unicode Standard is synchronized with ISO/IEC 10646, and both are code-for-code identical.

The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional text display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).[1]

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework.

Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and UCS-2 (a precursor of UTF-16 without full support for Unicode); GB18030 is standardized in China and implements Unicode fully, although it is not an official Unicode standard.
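
As a small illustration of how the same code point is stored differently by each encoding, the snippet below encodes one Gurmukhi letter three ways. The choice of U+0A15 (GURMUKHI LETTER KA) and the big-endian variants is arbitrary and made only for the example.

```python
ka = "\u0a15"  # GURMUKHI LETTER KA

# Encode the same single code point with three Unicode encodings.
encoded = {name: ka.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes  {data.hex(' ')}")

# utf-8      3 bytes  e0 a8 95
# utf-16-be  2 bytes  0a 15
# utf-32-be  4 bytes  00 00 0a 15
```

Every Gurmukhi character takes three bytes in UTF-8 and two in UTF-16, which is worth keeping in mind when estimating storage for Punjabi text.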


Unicode Block

A Unicode block is one of several contiguous ranges of numeric character codes (code points) of the Unicode character set that are defined by the Unicode Consortium for administrative and documentation purposes. Typically, proposals such as the addition of new characters are discussed and evaluated by considering the relevant block or blocks as a whole.

Each block is generally, but not always, meant to include all the characters used by one or more specific languages, or in some general application area such as mathematics, surveying, decorative typesetting, social forums, etc.
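
Because a block is simply a contiguous range of code points, testing whether a character belongs to it is a range check. A minimal sketch for the Gurmukhi block (U+0A00..U+0A7F) follows; the names `GURMUKHI_BLOCK` and `in_gurmukhi_block` are hypothetical.

```python
# The Gurmukhi block spans U+0A00..U+0A7F inclusive.
GURMUKHI_BLOCK = range(0x0A00, 0x0A80)


def in_gurmukhi_block(ch: str) -> bool:
    """Return True if the single character ch lies in the Gurmukhi block."""
    return ord(ch) in GURMUKHI_BLOCK


print(in_gurmukhi_block("\u0a15"))  # GURMUKHI LETTER KA
print(in_gurmukhi_block("\u0915"))  # DEVANAGARI LETTER KA
```

Note that block membership says nothing about whether a code point is assigned: U+0A00 lies inside the block but has no character allocated to it.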

Code pages for ISCII conversion

To convert between Unicode text (for example UTF-8) and an ISCII encoding, the following Windows code pages may be used:

  • 57002: Devanagari (Hindi, Marathi, Sanskrit, Konkani)
  • 57003: Bengali
  • 57004: Tamil
  • 57005: Telugu
  • 57006: Assamese
  • 57007: Odia
  • 57008: Kannada
  • 57009: Malayalam
  • 57010: Gujarati
  • 57011: Punjabi (Gurmukhi)

Block

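Gurmukhi[1][2]
Unicode.org chart (PDF)
        0 1 2 3 4 5 6 7 8 9 A B C D E F
U+0A0x  · ਁ ਂ ਃ · ਅ ਆ ਇ ਈ ਉ ਊ · · · · ਏ
U+0A1x  ਐ · · ਓ ਔ ਕ ਖ ਗ ਘ ਙ ਚ ਛ ਜ ਝ ਞ ਟ
U+0A2x  ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ · ਪ ਫ ਬ ਭ ਮ ਯ
U+0A3x  ਰ · ਲ ਲ਼ · ਵ ਸ਼ · ਸ ਹ · · ਼ · ਾ ਿ
U+0A4x  ੀ ੁ ੂ · · · · ੇ ੈ · · ੋ ੌ ੍ · ·
U+0A5x  · ੑ · · · · · · · ਖ਼ ਗ਼ ਜ਼ ੜ · ਫ਼ ·
U+0A6x  · · · · · · ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯
U+0A7x  ੰ ੱ ੲ ੳ ੴ ੵ · · · · · · · · · ·
Notes
1.^ As of Unicode version 6.0
2.^ A dot (·) marks an unassigned code point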
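
The chart above can also be regenerated from the character database that ships with a programming language, which is a convenient way to check it. The sketch below uses Python's `unicodedata` module; the function names `gurmukhi_chart` and `format_chart` are hypothetical. Its output reflects the Unicode version bundled with the interpreter, so a recent Python may show characters added after version 6.0.

```python
import unicodedata


def gurmukhi_chart(placeholder: str = "·") -> list[list[str]]:
    """Return the Gurmukhi block as 8 rows of 16 cells."""
    rows = []
    for base in range(0x0A00, 0x0A80, 0x10):
        row = []
        for cp in range(base, base + 0x10):
            ch = chr(cp)
            # Unassigned code points have no name in the database.
            row.append(ch if unicodedata.name(ch, "") else placeholder)
        rows.append(row)
    return rows


def format_chart(rows: list[list[str]]) -> str:
    """Lay the rows out under a 0..F column header."""
    lines = ["        " + " ".join("0123456789ABCDEF")]
    for i, row in enumerate(rows):
        lines.append(f"U+0A{i:X}x  " + " ".join(row))
    return "\n".join(lines)


print(format_chart(gurmukhi_chart()))
```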

