1 Character Encoding
Internally, a computer handles information as numbers. So a word like "word" is stored and handled in a numeric representation. Characters are mapped to numbers with the help of a method called 'Character Encoding'.
In the simplest case, for English characters we can use:
a = 00; b = 01; c = 02; d = 03; e = 04; f = 05; g = 06; h = 07; i = 08; j = 09; k = 10; l = 11; m = 12; n = 13; o = 14; p = 15; q = 16; r = 17; s = 18; t = 19; u = 20; v = 21; w = 22; x = 23; y = 24; z = 25.
So if we want to encode "word" using the above encoding, it will look like this inside a computer's memory: 22 14 17 03.
Now if we want to represent "a word", we realise we cannot do so, as we do not have a character encoding for 'blank'. We can write "aword" but not "a word". So we add a new character to the above list.
a = 00; b = 01; c = 02; d = 03; e = 04; f = 05; g = 06; h = 07; i = 08; j = 09; k = 10; l = 11; m = 12; n = 13; o = 14; p = 15; q = 16; r = 17; s = 18; t = 19; u = 20; v = 21; w = 22; x = 23; y = 24; z = 25; 'blank' = 26.
So "a word" using the above encoding will look like this inside a computer's memory: 00 26 22 14 17 03.
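The toy encoding above can be sketched in a few lines of Python. The function name `encode` is illustrative, not part of any standard library.

```python
# A toy version of the encoding above: a = 0 ... z = 25, blank = 26.
def encode(text):
    table = {chr(ord('a') + i): i for i in range(26)}
    table[' '] = 26  # the 'blank' character added above
    return [table[ch] for ch in text]

print(encode("word"))    # [22, 14, 17, 3]
print(encode("a word"))  # [0, 26, 22, 14, 17, 3]
```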
Now if we want to represent "A word" using the above encoding, we face another problem. Our encoding has numeric codes only for lower case letters, not upper case. So we need to add those too. And how about punctuation marks? We need to add those. How about often-used symbols like Rs to represent rupees, @ used in email addresses, or %, +, – and others used in mathematics? Or characters that represent digits like 0, 1, 2, 3?
ASCII stands for American Standard Code for Information Interchange and is pronounced "ask-key".
In the early days of computing, people realised that even though the problem of encoding characters (assigning them numeric values) had a simple solution, without a standard it could lead to a very confusing situation.
To explain, let us look at the above example. Say computer manufacturer A builds a machine using the same encoding as above, but computer manufacturer B decides, for some reason, to use:
a = 10; b = 11; c = 12; d = 13; e = 14; f = 15; g = 16; h = 17; i = 18; j = 19; k = 20; l = 21; m = 22; n = 23; o = 24; p = 25; q = 26; r = 27; s = 28; t = 29; u = 30; v = 31; w = 32; x = 33; y = 34; z = 35.
B's representation of "none" is 23 24 23 14; if the same codes are carried to A's machine, they decode as "xyxo".
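The mix-up can be demonstrated with the two toy tables. The variable names are illustrative only.

```python
# "none" encoded with B's table (a=10 ... z=35), then wrongly
# decoded with A's table (a=0 ... z=25).
table_a = {i: chr(ord('a') + i) for i in range(26)}       # A: code -> letter
table_b = {chr(ord('a') + i): i + 10 for i in range(26)}  # B: letter -> code

codes = [table_b[ch] for ch in "none"]
print(codes)                                # [23, 24, 23, 14]
print(''.join(table_a[c] for c in codes))   # xyxo
```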
Realising this could lead to incompatibility and confusion, a standard for character encoding was decided upon.
Much of the credit for this standard goes to Robert W. Bemer's work in the early 1960s. The work started in 1963, and by 1968 a standard seven-bit code was finalised by ANSI (as the ANSI X3.4 standard). It was called the American Standard Code for Information Interchange, or ASCII.
This standard included non-printing characters (like blank, etc), typographic symbols, punctuation marks, English lower case and upper case characters, numbers and other symbols.
As it was a 7-bit code, the maximum number of possible characters it could encode was 128. Starting from 0, the numbers went up to 127. It was around this time that another standard was decided upon: that a byte would be 8 bits. The unused 8th bit was used for parity checks and, in some systems, to mark the end of a string.
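The parity-bit idea can be sketched as follows. This is an illustration of one common scheme (even parity); the helper name is hypothetical.

```python
# Sketch: a 7-bit ASCII code with the 8th bit used as an even-parity bit,
# as some early systems did.
def with_even_parity(code7):
    ones = bin(code7).count('1')
    parity = ones % 2              # 1 if the 7-bit code has an odd number of ones
    return (parity << 7) | code7   # the parity bit occupies the 8th (top) bit

print(f"{with_even_parity(ord('A')):08b}")  # 'A' = 1000001 (two ones)   -> 01000001
print(f"{with_even_parity(ord('C')):08b}")  # 'C' = 1000011 (three ones) -> 11000011
```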
As usage of computers spread to other countries, this standard was referred to as US-ASCII. ASCII and its national variants were declared international standard ISO 646 in 1972. So, the names were of the form ISO-646-xx, where xx was a two-character country code (CA – Canadian, CN – Chinese, CU, DE, DK, ES, FR, HU, IT, JP, KR, NO, PT, SE, UK, YU and so on). No, there was no IN for India.
A general symbol for currency was chosen as “¤” as many socialist countries did not want to use “$”.
As computing progressed, people realised that 128 characters were not enough, especially as more countries began to participate in computing efforts. Also, the practice of having a parity bit was no longer needed or considered a good idea. It was time for an upgrade.
EBCDIC stands for Extended Binary Coded Decimal Interchange Code and is pronounced "eb-sih-dik".
It is an extension of the 4-bit Binary Coded Decimal encoding. It was devised by IBM in around the 1963-1964 timeframe and thus predates ASCII, which was finalised in 1968. EBCDIC is an 8-bit encoding, against the 7-bit encoding of ASCII.
Being an 8-bit code, the maximum number of possible characters it could encode was 256. Starting from 0, the numbers went up to 255.
It is a very IBM-specific code and is not used much outside their mainframe family.
Extended ASCII is also referred to as 8-bit ASCII.
Realising the shortcomings of the 7-bit code, an 8-bit version was standardised. This included the US-ASCII encoding as the first 128 characters (0-127), and another 128 (128-255) were added. It was first used by IBM for their PCs. Eventually, ISO released a standard, ISO 8859, describing 8-bit ASCII extensions. The US-ASCII based one was ISO 8859-1 (popularly called ISO Latin-1). The one for Eastern European languages was standardised as ISO 8859-2, for Cyrillic languages as ISO 8859-5, for Thai as ISO 8859-11, and so on. No, none was standardised for any Indian language. ISO 8859-15, or Latin-9, was a revision of 8859-1 to include the euro symbol.
As usage of computers spread to more countries and more information was being shared using character encoding, limitations of this encoding began to surface.
One was that 256 codes were still not sufficient to encode the characters of many languages. Also, since the different ISO 8859-x standards assigned overlapping codes, strange characters showed up whenever text was interpreted using a standard other than the one it was written in. It was time for an upgrade.
In 1991, a new standard was proposed. It was called Unicode and aimed at providing one big character encoding table that has all characters of almost all languages being used in computing or otherwise and internationally used symbols.
Unicode is a 16-bit code. As compared to the 256 of extended ASCII, it can encode (provide numeric codes for) up to 65536 characters (0-65535). The extended ASCII codes were also preserved: the first 256 characters encoded were the same as extended ASCII. Thus, to convert ASCII to Unicode, take all one-byte ASCII codes and add an all-zero byte in front to extend them to 16 bits.
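The conversion described above can be checked directly. This sketch builds the 16-bit codes by hand and compares them with what Python's big-endian UTF-16 codec produces for the same ASCII text.

```python
# Extend each one-byte ASCII code with a leading zero byte to get
# its 16-bit Unicode code unit.
text = "word"
utf16be = b''.join(bytes([0, ord(ch)]) for ch in text)
print(utf16be.hex())                       # 0077006f00720064

# Python's big-endian UTF-16 codec yields the same bytes for ASCII text:
assert utf16be == text.encode('utf-16-be')
```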
While on one hand it meant that simple text files were now double the size, on the other hand a standard was achieved that could serve as a cross-language character encoding method.
There are 18 Indian languages (on the last count) listed in the Eighth Schedule of Indian Constitution.
They are: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu and Urdu.
Out of these, characters from the scripts of the following languages are currently part of the Unicode standard (4.0): Urdu, Hindi, Bengali, Punjabi, Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam.
The Unicode standard is a work in progress. There are many proposals that are being considered for inclusion. One of them being investigated is for encoding Vedic characters.
Language    Script        Code (hex)
                          Start   End
-------------------------------------
Urdu        Arabic        0600    067f
Hindi       Devanagari    0900    097f
Bengali     Bengali       0980    09ff
Punjabi     Gurmukhi      0a00    0a7f
Gujarati    Gujarati      0a80    0aff
Oriya       Oriya         0b00    0b7f
Tamil       Tamil         0b80    0bff
Telugu      Telugu        0c00    0c7f
Kannada     Kannada       0c80    0cff
Malayalam   Malayalam     0d00    0d7f
-------------------------------------
The code for Rs symbol is 20a8.
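A character's script can be located from these ranges with a simple lookup. This is a sketch; the function name and dictionary are illustrative, and the ranges are the ones tabulated above.

```python
# Locate a code point's script block from the table above.
blocks = {
    "Arabic":     (0x0600, 0x067f),
    "Devanagari": (0x0900, 0x097f),
    "Bengali":    (0x0980, 0x09ff),
    "Gurmukhi":   (0x0a00, 0x0a7f),
    "Gujarati":   (0x0a80, 0x0aff),
    "Oriya":      (0x0b00, 0x0b7f),
    "Tamil":      (0x0b80, 0x0bff),
    "Telugu":     (0x0c00, 0x0c7f),
    "Kannada":    (0x0c80, 0x0cff),
    "Malayalam":  (0x0d00, 0x0d7f),
}

def script_of(ch):
    cp = ord(ch)
    for name, (start, end) in blocks.items():
        if start <= cp <= end:
            return name
    return "other"

print(script_of('\u0915'))  # Devanagari (the letter KA)
print(script_of('\u20a8'))  # other (the Rs sign, U+20A8, lies outside these blocks)
```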
ISCII stands for Indian Script Code for Information Interchange. It is commonly read as an abbreviation, but at times also pronounced "is-kii" ("is" as in "this"), making it sound like the two Hindi words "is" and "kii". In Hindi, it is referred to as "Soochna Antrvinimay kay liye Bhartiye Lipi Sahita".
It was established as a standard by the Bureau of Indian Standards (IS 13194:1991) in 1991 and is based on an earlier Indian Standard, IS 10401:1982. ISCII is an 8-bit standard in which the lower 128 characters (0-127) conform to the ASCII standard. The higher 128 characters (128-255) are used to encode characters from an Indian script. Unicode has largely preserved the ISCII encoding strategy. Though it allocates different codes, Unicode is based on the ISCII-1988 revision and is a superset of the ISCII-1991 character encoding. Thus, texts encoded in ISCII-1991 may be automatically converted to Unicode values and back to their original encoding without loss of information.
The Indian languages Urdu, Sindhi and Kashmiri are primarily written in Perso-Arabic scripts, but they can be (and sometimes are) written in Devanagari too. Sindhi is also written in the Gujarati script. Apart from the Perso-Arabic scripts, all the other scripts used for Indian languages have evolved from the ancient Brahmi script and share a common phonetic structure, making a common character set possible. Because of this common structure, switching between scripts yields an automatic transliteration.
As per the standard, the following mnemonics are used for Indian scripts:
DEV: Devanagari, PNJ: Punjabi, GJR: Gujarati, ORI: Oriya, BNG: Bengali, ASM: Assamese, TLG: Telugu, KND: Kannada, MLM: Malayalam, TML: Tamil, RMN: Roman.
Thus, ISCII-91 DEV is the character encoding for characters from Devanagari, which is used to write Hindi, among other languages. ISCII-91 ORI is the character encoding for Oriya, and so on.
Character encodings for the following scripts have been standardised in ISCII, and they can be used to write the Indian languages. The scripts are:
Devanagari, Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam, Tamil and Roman.
ISFOC stands for Indian Script Font Codes.
Originally designed along with ISCII, ISFOC has not yet been standardised by BIS. A new draft, however, was written in 2003.
TTF stands for TrueType Font, a font standard developed in the late 1980s. The biggest advantage of TrueType was that a user could increase or decrease the font size without losing quality. With earlier fonts, any non-standard size made the characters jagged, losing clarity and visual quality.
OTF stands for OpenType Font, a font standard announced in 1996. OpenType fonts are Unicode based and hence can support any language Unicode supports. Being based on Unicode, they can support up to 65,536 glyphs (as characters/symbols are referred to in fonts).
USP stands for Unicode Script Processor. This is a technology developed for complex scripts like Arabic, Hebrew, Thai, Indic and others.
UTF stands for Unicode Transformation Format. There are several mechanisms to physically implement Unicode based on storage space consideration, source code compatibility, and interoperability with other systems. These mapping methods are referred to as UTF.
One of the most popular UTF mapping methods is UTF-8. It was created by Rob Pike and Ken Thompson (yes, those guys). They even titled the standard paper "Hello World" in several languages.
It is a variable-length encoding which uses groups of bytes to represent the Unicode standard for the alphabets of many of the world’s languages. Thus, it may use 1 to 4 bytes per character, depending on the Unicode symbol.
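The 1-to-4-byte behaviour is easy to observe. This sketch prints the UTF-8 byte length of characters from different code-point ranges.

```python
# The same Unicode character takes 1 to 4 bytes in UTF-8,
# depending on its code point.
samples = [
    'A',            # U+0041, ASCII            -> 1 byte
    '\u00e9',       # U+00E9, e with acute     -> 2 bytes
    '\u0915',       # U+0915, Devanagari KA    -> 3 bytes
    '\U0001f600',   # U+1F600, beyond 16 bits  -> 4 bytes
]
for ch in samples:
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
```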
All Internet protocols have to identify the encoding used for character data. As per IETF requirements, UTF-8 is at least one of the encodings supported by all Internet protocols.
ASCII, ISCII, Unicode, etc. come into the picture at the back-end, for storing and processing text/data. For front-end visual rendering, fonts come into the picture too.
In the case of Indic scripts, displays usually need an on-the-fly conversion from ISCII to font glyphs and back. This is needed, for example, for conjunct characters and matras. Absent such rendering, ISCII characters are displayed in the order in which they are stored, instead of how they are expected to be seen visually.
For example, in Hindi the small "ee" matra comes after a character in storage but is visually placed before it. A display with on-the-fly ISCII-to-font conversion will show it before the character; one without will show it after the character, for the same ISCII file. That is not how one expects the small "ee" matra to be placed.
Another example from Hindi: the "halant" symbol is placed under a character. A display with on-the-fly ISCII-to-font conversion will show it under the character; one without will not.
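The stored-versus-displayed ordering can be seen in the code points themselves. This sketch uses Unicode rather than ISCII, but Unicode follows the same logical storage order, so the point carries over.

```python
# In memory, the short "i" matra (U+093F) is stored AFTER the consonant
# it visually precedes; the renderer must reorder it for display.
ki = '\u0915\u093f'   # KA + vowel sign I; renders as one syllable with
                      # the matra drawn to the LEFT of KA
print([f"U+{ord(c):04X}" for c in ki])   # ['U+0915', 'U+093F']

# The halant (virama, U+094D) similarly sits between consonants in storage
# and is drawn under/joined with them by the renderer:
kta = '\u0915\u094d\u0924'  # KA + virama + TA -> a conjunct form
print([f"U+{ord(c):04X}" for c in kta])  # ['U+0915', 'U+094D', 'U+0924']
```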
Because of this complexity in rendering and editing, Indic scripts are referred to as non-linear in nature. Some of the challenges in Indic script rendering are:
- Glyphs have variable widths and have positional attributes.
- Vowel signs can be attached to the top, bottom, left and right sides of the base consonant.
- Vowel signs may also combine with consonants to form independent glyphs.
- Consonants frequently combine with each other to form complex conjunct glyphs.
As Indic fonts have not been standardised, and ISFOC or any other font-encoding standard has not been fully developed, most Indic-language websites and software vendors have had to develop their own fonts. These fonts are totally incompatible with each other.
This incompatibility leads to these issues:
- Text composed in an editor using one Indic font cannot be opened/edited in another editor using some other font.
- As vendors would have to pay to support other vendors' fonts, doing so would increase the cost of their products. That discourages compatibility.
- Indian-language processing remains limited to word processing, DTP and printing.
- Web pages and email messages composed using one Indic font can be viewed only if that font is also attached/downloaded and installed at the receiver's end. This forces a user to install and manage an unnecessarily large number of fonts.
This has led to hundreds of web pages in Indic languages all over the net, none compatible with another. This severely handicaps web search engines, which are unable to provide meaningful search results from those pages, virtually making the information inaccessible.
With inputs from Hariram Pansari – hrpansari at yahoo dot com