plain text

ASCII was very carefully designed; essentially no character has its code by accident. Everything had a reason, although some of those reasons are long obsolete.
kps

As digital natives we tend to take for granted the significant intellectual effort that enabled the global tech revolution. Here is one small detail about ASCII control characters, which are still used to some extent, for example in terminals (ssh).

What we can learn here is compactness and elegance. But let’s not romanticize the past. Maybe it’s a lesson about the temptations of universality.


Interestingly, knowledge hides behind established opaque conventions. The usual representation isolates the non-printable characters from the printable ones. While this grouping makes sense for simple introductions (“The first 32 characters are control characters”), it obfuscates relations that can be of practical and intellectual value. Grouping the first 2 bits of the 7-bit code as columns and the last 5 bits as rows makes it clear how a non-printable “character” can be produced by pressing two keys on the keyboard: Ctrl together with the printable key in the same row.
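The column/row relation can be checked in a few lines of Python. This is a minimal sketch, assuming the conventional terminal behaviour in which holding Ctrl clears the two high bits of the 7-bit code; the helper names ctrl and column_and_row are made up for this illustration.

```python
def ctrl(key: str) -> int:
    """Return the code conventionally sent for Ctrl+key.

    Assumption: Ctrl keeps the low 5 bits (the row) and clears the
    high 2 bits (the column) of the 7-bit ASCII code.
    """
    return ord(key) & 0b0011111


def column_and_row(code: int) -> tuple[int, int]:
    """Split a 7-bit code into its 2-bit column and 5-bit row."""
    return code >> 5, code & 0b0011111


for key in "GIM[":
    col, row = column_and_row(ord(key))
    print(f"{key!r}: column {col:02b}, row {row:05b} -> Ctrl+{key} = {ctrl(key)}")
```

G, I, M and [ sit in the same rows as BEL (7), TAB (9), CR (13) and ESC (27), which is why Ctrl+[ acts as Escape in many terminals.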

Generally speaking: only after challenging established conventions can we estimate what we lose if we drop the ideas and practices that created them.


And all was good, assuming you were an English speaker.
Joel Spolsky

The articles linked above leave out one important aspect of ASCII: it became chaotic when scaled globally. The initial, elegant design of ASCII was not meant to be universal. When the American Standard Code for Information Interchange (ASCII) was established in 1960-1963, the intention was to have a standard for American devices (teleprinters and telegraphy), perhaps with some hints on how others could extend it to support more characters. It was well designed for this purpose: the English alphabet, some frequently used signs, and control characters.

What about other scripts? The standard does not even mention them (nowadays we expect a character encoding to be universal). So once computers were produced and sold in other countries, vendors deviated from ASCII to fulfil the needs of the local language (e.g. ß). Multiple countries were overwriting ASCII in parallel to express special characters.
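The effect of such parallel deviations can be reproduced with codecs that ship with Python. This is only an illustration: it uses two later 8-bit code pages (cp437 and latin-1) rather than the earlier 7-bit national variants, and the helper name decode_under is made up.

```python
def decode_under(byte: bytes, codecs: list[str]) -> dict[str, str]:
    """Decode the same byte under several regional encodings."""
    return {codec: byte.decode(codec) for codec in codecs}


# One byte value, two different characters depending on the code page.
readings = decode_under(b"\xe1", ["cp437", "latin-1"])
for codec, char in readings.items():
    print(f"0xE1 under {codec}: {char}")
```

Under cp437 the byte is read as ß, under latin-1 as á, so text exchanged between the two systems is silently garbled.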

As long as the devices were not interconnected, the misalignment was acceptable. With the emergence of websites, it became clear that the local deviations needed to be consolidated. After several attempts to find a viable character encoding for all the world’s languages (based on Unicode), UTF-8 emerged in the last decade of the 20th century and went on to prevail.

UTF-8 (Universal Character Set Transformation Format, 8-bit) has the advantage that it is backward compatible with ASCII: every ASCII text is also valid UTF-8. Nothing changed for the English-language-dominated industry.
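The compatibility is easy to verify with Python's built-in codecs. A minimal sketch; the variable names are arbitrary.

```python
ascii_text = "plain text"

# Pure ASCII text produces identical bytes under both encodings.
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
print(utf8_bytes == ascii_bytes)

# A non-ASCII character becomes a multi-byte sequence in which every
# byte has its high bit set, so it can never be mistaken for ASCII.
non_ascii = "ß".encode("utf-8")
print(non_ascii.hex(" "), all(b >= 0x80 for b in non_ascii))
```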

All the other countries, which had developed their own local variants of ASCII (equally ignorant of their neighbouring languages), had the disadvantage of having to encode their characters in more complex ways for the sake of universalism and interoperability (this is not fully accurate, because, for example, parts of ISCII were also re-used in Unicode).


For example:

  • Devanagari, an Indian script (“alphabet”), is frequently used to write Hindi and related languages (like Marathi).
  • Expressing the character अ (pronounced like “a”) in the ASCII deviation ISCII (Indian Script Code for Information Interchange) requires 8 bits: 1010 0100 (hexadecimal: A4).
  • Expressing the character अ in Unicode requires a code point with 12 significant bits, written here as 16 bits: 0000 1001 0000 0101 (hexadecimal: 0905). UTF-8 stores this code point in three bytes, i.e. 24 bits (hexadecimal: E0 A4 85).
  • However, the situation is more complex. In Devanagari the basic written unit is a syllable (consisting of consonant + vowel), and a consonant by default has the “a” included. For example, this character is pronounced “ka”: क. If one wants to say “ku”, the symbol gets a modifier: कु. To encode this, one needs two characters:
    0000 1001 0001 0101 (hexadecimal: 0915) for the consonant क (“k(a)”)
    and
    0000 1001 0100 0001 (hexadecimal: 0941) for the modifying matra that represents “u”.
  • In Unicode, the two code points take 32 bits in total when written as 16-bit values, and 48 bits (six bytes) when stored as UTF-8.
  • For ISCII, encoding कु requires 16 bits: 1011 0011 1101 1101 (hexadecimal: B3 DD).
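The sizes can be counted directly in Python. This is a sketch under one loud assumption: Python ships no ISCII codec, so ISCII_EXCERPT is a hand-built, hypothetical table holding only the three byte values quoted in the list above. The helper name sizes_in_bits is likewise made up.

```python
# Hypothetical excerpt of ISCII: only the three values quoted above.
ISCII_EXCERPT = {
    "\u0905": 0xA4,  # DEVANAGARI LETTER A
    "\u0915": 0xB3,  # DEVANAGARI LETTER KA
    "\u0941": 0xDD,  # DEVANAGARI VOWEL SIGN U (matra)
}


def sizes_in_bits(text: str) -> dict[str, int]:
    """Count the bits needed to store text in three encodings."""
    iscii = bytes(ISCII_EXCERPT[ch] for ch in text)
    return {
        "ISCII": 8 * len(iscii),
        "UTF-16": 8 * len(text.encode("utf-16-be")),
        "UTF-8": 8 * len(text.encode("utf-8")),
    }


for text in ("\u0905", "\u0915\u0941"):
    print(text, text.encode("utf-8").hex(" "), sizes_in_bits(text))
```

For the two-character syllable this gives 16 bits in ISCII, 32 bits in UTF-16 and 48 bits in UTF-8, so the price of universality depends on which Unicode encoding is used.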

 


Looking at the recent developments in world politics, and according to some mainstream voices (e.g. “Globalization is running out of steam”), we are heading towards a multipolar world, in the optimistic scenario.

ASCII and Unicode embody a local and a global approach to encoding characters as binary numbers. The global one is now the common ground in most browsers and for the automatic translation of pages.

It’s hard to imagine that a population used to UTF-8 will be interested in using only ASCII. It’s easier to imagine that UTF-8 will be taken as a basis on top of which custom, local extensions and palimpsests will be added, sacrificing some but not all interoperability.

Take cloud services: for them to operate, it is invaluable to have a functioning Internet infrastructure with all its interoperable protocols and standards across the world. On top of this infrastructure, they build locally optimized, complex code that is exposed through narrow interfaces. This allows carefree usage by hiding complexity. At the same time, it conceals an unbelievable amount of knowledge and data.

Again, the isolation between hidden and visible entities obfuscates relations that can be of practical and intellectual value. You can say: “Not everything can and should be fully visible and interoperable. We should leave room for discovery.” Such new secrecy might promote curiosity, but also abuse of power.
