Saturday, August 1, 2020

Character Encoding - ASCII, Unicode, Extended-ASCII etc



ASCII and Unicode

ASCII

https://www.ascii-code.com/ - ASCII uses 7 bits (128 characters); extended ASCII uses 8 bits (256 characters)

ASCII control characters (character code 0-31)

The first 32 characters in the ASCII table are unprintable control codes, used to control peripherals such as printers.

ASCII printable characters (character code 32-127)

Codes 32-127 are common to all the different variations of the ASCII table. Codes 32-126 are the printable characters: letters, digits, punctuation marks, and a few miscellaneous symbols. You will find almost every one of them on your keyboard. Code 127 is the exception: it represents the DEL command, which is a control character and not a printable one.
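The ranges above can be checked with a few lines of Python. This is only a sketch; the variable names (control, printable) are made up for illustration and are not part of any standard.

```python
# Split the 128 ASCII code points into the ranges described above.
# 'control' and 'printable' are illustrative names, not standard terms.
control = [c for c in range(128) if c < 32 or c == 127]   # 0-31 plus DEL
printable = [c for c in range(128) if 32 <= c <= 126]     # space through '~'

print(len(control), len(printable))

# The largest ASCII code, 127, fits in 7 bits.
print((127).bit_length())

# Show the printable characters themselves.
print("".join(chr(c) for c in printable))

# chr() and ord() convert between a code and its character.
print(ord("A"), chr(65))
```

Code 32 (the space) counts as printable here even though it leaves no visible mark.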

The extended ASCII codes (character code 128-255)

There are several different variations of the 8-bit ASCII table. The table on ascii-code.com follows Windows-1252 (CP-1252), which is a superset of ISO 8859-1 (also called ISO Latin-1) in terms of printable characters. It differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 128 to 159 range. On that page, characters that differ from ISO-8859-1 are marked in light blue.
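A small Python sketch of that difference, using the "cp1252" and "latin-1" codecs from the standard library. The three sample bytes are my own choice: two from the 128-159 range and one from above it.

```python
# Decode the same bytes under Windows-1252 and ISO-8859-1.
# 0x80 and 0x99 fall in the 128-159 range; 0xE9 lies above it.
raw = bytes([0x80, 0x99, 0xE9])

as_cp1252 = raw.decode("cp1252")    # displayable characters in 128-159
as_latin1 = raw.decode("latin-1")   # C1 control characters in 128-159

for b, w, l in zip(raw, as_cp1252, as_latin1):
    print(f"byte 0x{b:02X}: cp1252 U+{ord(w):04X}  latin-1 U+{ord(l):04X}")
```

Under Windows-1252 the first two bytes decode to the euro sign and the trademark sign, while ISO-8859-1 maps them to control characters. The last byte is "é" in both.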

Unicode https://home.unicode.org/

Characters before Unicode

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before the Unicode standard was developed, there were many different systems, called character encodings, for assigning these numbers. These earlier character encodings were limited and did not cover characters for all the world's languages. Even for a single language like English, no single encoding covered all the letters, punctuation, and technical symbols in common use. Pictographic languages, such as Japanese, were a challenge to support with these earlier encoding standards.

Early character encodings also conflicted with one another. That is, two encodings could use the same number for two different characters, or use different numbers for the same character. Any given computer might have to support many different encodings, and whenever data passed between computers or between encodings, the risk of data corruption or errors increased.
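The conflict is easy to reproduce in Python. The sketch below picks two legacy codecs from the standard library, "latin-1" and "cp437" (the original IBM PC code page), as examples; any pair of conflicting encodings would do.

```python
# One byte, two legacy encodings, two different characters.
data = bytes([0xE9])
latin1_text = data.decode("latin-1")   # ISO-8859-1
cp437_text = data.decode("cp437")      # original IBM PC code page
print(repr(latin1_text), repr(cp437_text))

# Corruption when text is written in one encoding and read in another:
# "é" saved as UTF-8 (two bytes) but read back as ISO-8859-1.
mojibake = "\u00e9".encode("utf-8").decode("latin-1")
print(repr(mojibake))
```

Nothing in the bytes themselves says which encoding was used, so the reader has to know it from somewhere else.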

Character encodings existed for a handful of “large” languages.  But many languages lacked character support altogether.

Unicode characters - A Global Standard to Support ALL the World's Languages

Is Unicode a 16-bit encoding?
No. The Unicode code space runs from U+0000 to U+10FFFF, which needs 21 bits. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character is represented as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.
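To see those sizes for real characters, here is a minimal Python sketch. The helper name code_units is hypothetical, and the four sample characters are my own picks. The "-le" codec variants are used so that no byte order mark is added to the output.

```python
def code_units(text):
    """Count code units for text in each Unicode encoding form.

    Hypothetical helper for illustration. The little-endian codecs
    are used so that no byte order mark inflates the counts.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }

# A, e-acute, euro sign, and an emoji outside the 16-bit range.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The emoji U+1F600 lies above U+FFFF, so it takes four bytes in UTF-8 and two 16-bit code units (a surrogate pair) in UTF-16.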

Can Unicode text be represented in more than one way?
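Yes, there are several possible representations of Unicode data, including UTF-8, UTF-16, and UTF-32.
Unicode defines two mapping methods: the UTF (Unicode Transformation Format) encodings and the UCS (Universal Coded Character Set) encodings, the latter named after ISO/IEC 10646.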