
Character Data

Understanding and manipulating textual data is fundamental to much programming. Strings are a bedrock knowledge area that you need in order to understand many other programming languages and environments. Strings hold (mostly) printable characters that make sense to humans – we can read them and understand them! Since computers ultimately need to be useful to, and interact with, humans, strings will always be useful!

In many programming environments, character strings are still represented (by default) with 1 byte (8 bits) per character. Each unique character is represented by a unique 1-byte value, and a string is a sequence of these character values. The 1-byte value of 0 is often used to indicate the end of a string; we call it the null character. Note that the digit character ‘0’ is NOT represented by a 1-byte value of 0!
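As a small illustration, here is a C sketch (assuming an ASCII-compatible system) that stores a short string and prints the numeric values of ‘0’ and the null character, to show that the digit zero and the end-of-string marker are different values:

    #include <stdio.h>

    int main(void) {
        /* A string literal is stored as a sequence of 1-byte character
           values followed by a terminating null character (value 0). */
        char word[] = "cat";                       /* 4 bytes: 'c' 'a' 't' '\0' */

        printf("sizeof word = %zu\n", sizeof word); /* prints 4, not 3 */

        /* The character '0' is NOT the null character: it is the
           printable digit zero, whose code is 48 in ASCII. */
        printf("'0'  has value %d\n", '0');         /* 48 */
        printf("'\\0' has value %d\n", '\0');       /* 0  */

        return 0;
    }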

One byte is 8 bits, and 8 bits can hold only 256 unique values (why?). So 1-byte character values can represent English and Western-language characters.

The English standard uses only 7 bits and is known as ASCII. For Western languages in general, ASCII is extended into 8-bit ISO 8859-1. 8 bits can handle lowercase and uppercase letters, accented letters, numbers, punctuation marks, and other special symbols and codes for most Western languages.
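For example, the sketch below (again assuming an ASCII-compatible system) prints the numeric codes of a few characters; note that all of them fit in 7 bits (values below 128):

    #include <stdio.h>

    int main(void) {
        /* Each character is stored as a small integer; printing it
           with %d shows the numeric code behind the symbol. */
        char examples[] = { 'A', 'a', '0', ' ', '!' };

        for (int i = 0; i < 5; i++) {
            printf("'%c' = %d\n", examples[i], examples[i]);
        }
        return 0;
    }

This prints 'A' = 65, 'a' = 97, '0' = 48, ' ' = 32, and '!' = 33.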

For other languages around the world, Unicode is an accepted standard, with various bit encodings.
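One common Unicode encoding, UTF-8, uses more than one byte for characters outside ASCII. As an example, the letter ‘é’ is one byte (0xE9) in ISO 8859-1 but two bytes (0xC3 0xA9) in UTF-8. The sketch below prints the bytes of the UTF-8 version; it assumes the source file is saved as UTF-8, so the string literal holds the UTF-8 bytes:

    #include <stdio.h>

    int main(void) {
        /* Assuming the source file is encoded as UTF-8, this literal
           holds the two UTF-8 bytes for 'é', plus the null terminator. */
        const unsigned char e_acute[] = "é";

        for (int i = 0; e_acute[i] != 0; i++) {
            printf("byte %d: 0x%02X\n", i, e_acute[i]);
        }
        return 0;
    }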

In a Unix/Linux terminal, you can run the command man ascii to see an ASCII table, man iso-8859-1 to see an ISO 8859-1 table, and man utf-8 to learn more about one Unicode format.

The site https://www.charset.org/ is a useful reference for character sets and their code tables.