The Ultimate Guide to ASCII Encoding
ASCII is a character encoding that computers use to store and retrieve characters (letters, numbers, symbols, spaces, indentations, etc.) as bit patterns in memory and on hard drives. At a high level, "character encoding" means converting a symbol into a binary number, and then using a "character map" to read that binary number back as a particular character.
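To make that idea concrete, here is a minimal Python sketch of the round trip from character to bit pattern and back. The helper names to_bits and from_bits are invented for this illustration and are not part of any standard.

```python
def to_bits(ch: str, width: int = 7) -> str:
    """Look up a character's code number and write it as a binary string."""
    return format(ord(ch), f"0{width}b")


def from_bits(bits: str) -> str:
    """Read a binary string as a number and map it back to a character."""
    return chr(int(bits, 2))


print(to_bits("A"))          # 1000001 (decimal 65)
print(from_bits("1100001"))  # a (decimal 97)
```

Here Python's built-in ord and chr play the role of the character map.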
The earliest form of character encoding goes as far back as the electric telegraph. In fact, Morse code, and later the Baudot code, were some of the first standardized character codes ever created. A second layer of encoding called encryption or ciphering was also established by militaries of that time, but that is a rather different topic. It wasn't until the 1950s that the modern process toward ASCII began. IBM started this by developing encoding schemes for use in their 7000 Series computers.
IBM's Binary Coded Decimal (BCD) used a four-bit encoding on punchcards. It was a way of storing decimal numbers in a binary form. So instead of numbers running from 0000 (0) to 1111 (15), they ran from 0000 (0) to 1001 (9), with each group of four bits representing a single digit. Later, IBM created an extended version of BCD called Extended Binary Coded Decimal Interchange Code (EBCDIC). It was an 8-bit encoding system for all the standard printable characters.
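The four-bits-per-digit idea behind BCD can be sketched in a few lines of Python. This is only an illustration of the principle described above, not a reproduction of IBM's punchcard format, and the function names to_bcd and from_bcd are made up for the example.

```python
def to_bcd(number: int) -> str:
    """Encode a non-negative integer as BCD, four bits per decimal digit."""
    if number < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bits: str) -> int:
    """Decode space-separated four-bit groups back into an integer."""
    return int("".join(str(int(group, 2)) for group in bits.split()))


print(to_bcd(1963))                     # 0001 1001 0110 0011
print(from_bcd("0001 1001 0110 0011"))  # 1963
```

Each decimal digit gets its own four-bit group, so no group ever goes above 1001 (9).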
In that same year, 1963, ASCII was introduced. It uses a 7-bit encoding scheme, which can represent 128 different values. This 7-bit number format might seem odd. After all, aren't computers all 8-bit or 16-bit or 32-bit and so on? Today they are. But early computers were not constructed in that way. What's more, memory on computers was precious and there was no reason to use an extra bit if you didn't need it. A 6-bit code (which existed) wouldn't cover all the upper and lower case letters, numbers, and basic punctuation marks. But a 7-bit code did, with room to spare.
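The arithmetic behind the choice of seven bits can be checked quickly. The sketch below uses Python's string module as a stand-in for "letters, numbers, and basic punctuation"; that choice of character list is an assumption made for the example.

```python
import string

# Upper case, lower case, and the ten digits.
letters_and_digits = len(
    string.ascii_uppercase + string.ascii_lowercase + string.digits
)
# The punctuation marks and symbols found in ASCII.
punctuation = len(string.punctuation)

print(2 ** 6, 2 ** 7)                    # 64 128
print(letters_and_digits)                # 62
print(letters_and_digits + punctuation)  # 94
```

Sixty-four slots are used up almost entirely by letters and digits alone, while 128 slots leave room for punctuation, the space, and the control characters.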
As computers began to settle into an 8-bit (byte) structure, ASCII gradually turned into an unofficial 8-bit code, where the other 128 characters were not standardized. This state persisted for some time. In 1991, 8-bit became the official format as maintained by the ISO (International Organization for Standardization) for UTF-8.
The challenge that came up in this time, though, was that a 7-bit or 8-bit encoding could support only one alphabet. In order to support a broader swath of languages, the Unicode encoding schema was devised, along with the Universal Character Set. Unicode has several encoding forms. UTF-8 is the 8-bit encoding that is compatible with ASCII, and it has risen to replace ASCII as the predominant character encoding standard on the web today.
Additionally, UTF-16 and UTF-32 have come into use for languages with a lot of characters. However, Chinese, Japanese and Arabic can all be displayed in UTF-8. As a result, UTF-8 is by far the most common encoding format on the web. And for English speakers, things are particularly easy because the first 128 characters of ASCII are the same as those in Unicode. So for use in HTML, referencing an ASCII table to create a character will work regardless of what encoding format you are using.
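The compatibility between ASCII and UTF-8 is easy to see in practice. The following Python sketch, with a made-up helper called utf8_length and arbitrarily chosen sample characters, shows that plain English text produces identical bytes in both encodings, while characters outside ASCII simply take more bytes in UTF-8.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


text = "Hello"
# For pure ASCII text, the two encodings produce the same bytes.
print(text.encode("ascii") == text.encode("utf-8"))  # True

# Sample characters: Latin, accented Latin, Chinese, Arabic.
for ch in ["A", "é", "中", "ع"]:
    print(ch, utf8_length(ch))
```

The letter A fits in one byte, just as it does in ASCII; the other three characters need two or three bytes each.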
Where ASCII Fits In
ASCII stands for "American Standard Code for Information Interchange" and was created by the American Standards Association (later renamed the American National Standards Institute). Work on the ASCII standard started in 1960, and it was released in 1963. It was an extension of telegraphic codes and was first used by Bell data services. Major revisions were made over the years. Until 2007 it was the most widely used character encoding on the web, when it was overtaken by UTF-8.
The web's switch from ASCII and Microsoft's ANSI towards UTF-8 can be largely attributed to initiatives by Google, as internet usage was becoming more international and ASCII was only capable of displaying Latin characters. It is important to note that UTF-8 is a type of encoding, while Unicode is the character set. Because Unicode's first 128 characters are the same as ASCII, it is acceptable to refer to an ASCII table when generating characters in HTML. ASCII does have the ability to use an "escape sequence" in displaying alternative alphabets, which allowed it to become an international standard, but Unicode handles this more directly.
Unicode originated from Apple in 1987, and became the project of the Unicode Consortium in 1991. ASCII was created by the ASA, but further refinement of it continued as part of declarations from ISO. The encoding name UTF-8 is used by all the standards conforming to the Internet Assigned Numbers Authority (IANA), which means all HTML, CSS, and XML. IANA is a department of the larger ICANN, the non-profit that coordinates internet protocol parameters and domain names.
To summarize, ASCII evolved from telegraph code in the 60's, grew up, and became part of the Unicode character set, which is used by UTF-8, the most dominant encoding format on the web. Domain names and webpage code depend on having this unified character map to work properly. This means that at the very root of the modern internet, there exists a character format invented in the 1870's, computerized as ASCII in the 1960's, modernized for the web with Unicode in the 1990's, and broadly adopted through UTF-8's majority use in 2007.
Control Characters vs Printable Characters
There are two types of characters in ASCII: printable characters and control characters. The control characters occupy codes 0-31 and 127. They include all the parts of writing that allow for new paragraphs, tabs, end of line, file separators, and a lot of pieces which are mainly invisible to the reader. These control characters were created in a time when printed cards were a big part of the computing process. Some of those features have since been replaced, but a lot of the line formatting parts are still around today. Code 127 is the code for delete.
All of the printable characters are what you might expect. There are all the lower case characters (a-z) and upper case characters (A-Z), along with numbers, symbols, and punctuation marks: essentially everything seen on a typical keyboard. These principal characters comprise all written words.
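The split between the two groups comes down to a simple range check. Here is a minimal Python sketch; the function name classify is invented for the example, and it follows the ranges given above (0-31 and 127 as control characters, everything else up to 126 as printable).

```python
def classify(code: int) -> str:
    """Label an ASCII code as a control or a printable character."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes run from 0 to 127")
    if code <= 31 or code == 127:
        return "control"
    return "printable"


print(classify(9))    # control (horizontal tab)
print(classify(10))   # control (line feed)
print(classify(65))   # printable (the letter A)
print(classify(127))  # control (delete)

printable_count = sum(classify(code) == "printable" for code in range(128))
print(printable_count)  # 95
```

Note that this range check counts the space character (code 32) as printable, which is why the total comes to 95 rather than 94.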
Using ASCII in XML and HTML
Every HTML page has a character encoding format assigned to it. Unless otherwise specified, the HTML encoding will default to UTF-8. To use pure ASCII, ANSI, or any specialized format, all you need is a declaration in a meta tag.
For HTML 4:
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
In the charset attribute, you can specify UTF-8, ANSI, or ASCII (using charset="us-ascii"), or you could look up a specific character set to use, usually by declaring an ISO number. A full list can be found on the IANA character sets page.
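Why does the declared charset matter? The same stored byte can mean different things, or nothing at all, depending on which character set is used to read it. The Python sketch below is only an illustration of that point, using a hypothetical helper called decode_as; it is not how a browser is implemented.

```python
from typing import Optional


def decode_as(raw: bytes, charset: str) -> Optional[str]:
    """Try to read raw bytes using the given charset; None if they don't fit."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


raw = b"\xa2"  # a single byte with the value 162

print(decode_as(raw, "iso-8859-1"))  # ¢
print(decode_as(raw, "us-ascii"))    # None (162 is outside the 7-bit range)
print(decode_as(raw, "utf-8"))       # None (not a complete UTF-8 sequence)
print("¢".encode("utf-8"))           # b'\xc2\xa2'
```

In ISO-8859-1 the cent sign is one byte, in UTF-8 it is two, and in true 7-bit ASCII it does not exist at all.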
Character Code Insertion Format
Generally, however, when someone refers to using the ASCII code, you will want them to clarify whether they mean true US-ASCII declared with a meta tag, or whether they are just asking you to display a special character. In HTML, any time you want to use a special character, such as the cent symbol (¢) or an inverted question mark (¿), you can generally refer to it by its ASCII/Unicode number, by typing in a reference like this:
¢ in HTML looks like: &#0162;
¿ in HTML looks like: &#0191;
So you start with &#, followed by a four-digit number, and finish with a semicolon (;).
In this way, you are able to display characters based on their ASCII/Unicode number. Of course, control characters will perform a formatting function or not work at all, depending on which one you use and which real character set you have listed in your meta tag. So in HTML you see the "&#" number, but when displayed in your browser you will see the character.
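If you generate HTML from a script, these numeric references can be built and decoded programmatically. The Python sketch below uses the standard library's html.unescape; the helper numeric_reference is made up for the example, and it zero-pads to four digits to match the format described above.

```python
import html


def numeric_reference(ch: str) -> str:
    """Build a decimal character reference, zero-padded to four digits."""
    return f"&#{ord(ch):04d};"


print(numeric_reference("¢"))    # &#0162;
print(numeric_reference("¿"))    # &#0191;

# Going the other way: turn a reference back into the character.
print(html.unescape("&#0162;"))  # ¢
```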
HTML Special Entity Characters
Now, let's say, for example, you want to just show an & symbol on your page. You can't just type it into the HTML, but you can type in the corresponding ASCII or Unicode reference. HTML is a markup language, so while normal letters work fine, special characters, and especially the < > brackets, are critically important to how the browser reads and shows the HTML.
You don't always need to type in the Unicode/ASCII reference number, though. For HTML 4.0 and newer, there are special entities which work similarly to a Unicode reference, but instead of memorizing a number you memorize a word.
¢ in HTML looks like: &cent;
¿ in HTML looks like: &iquest;
A full list of these character references can be found at the W3 consortium.
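Named entities can also be looked up and produced from code. As a small illustration, Python's standard html module knows the entity names and can escape the characters that HTML treats specially; the sample string "AT&T <b>" is just an arbitrary piece of text chosen for the example.

```python
import html
from html.entities import name2codepoint

# The entity name maps to the same number used in a numeric reference.
print(name2codepoint["cent"])     # 162
print(html.unescape("&iquest;"))  # ¿

# Escaping turns &, < and > into entities so they display as text.
print(html.escape("AT&T <b>"))    # AT&amp;T &lt;b&gt;
```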
With all this lead-up, you might just be looking for an easy place to find an ASCII or Unicode reference. Look no further: we have references 000-127 here, and you can find the full Unicode format on Wikipedia. Note that characters 000-032 and 127 are not generally printable and are thus indicated with "NA."
Tools and Resources
There is a lot of history behind how character codes evolved, and behind the organizations which hold these standards together for the rest of us. With most internet developers and the W3C settling on UTF-8, that is how pages will be encoded, at least for the immediate future. If you start manually encoding in other formats, though, you are going to need some resources to help you, and it can be nice just to have a comprehensive reference around.
List of Resources
- IANA character sets page
- HTML Special Characters by the W3 consortium
- Full Unicode format on Wikipedia
- ASCII Table of just 0130-0255
- History of ASCII on ASCII-World
- List of Unicode characters on Wikipedia
No summary of ASCII would be complete without a reference to ASCII art. Special software can be used, or symbols hand-coded, to take on the shape of an image using nothing but symbols. This type of effect has existed since the 1980's, and was made popular on systems like the Commodore Amiga Computer. There is even a distinction between "Oldskool" ASCII art, which uses pure ASCII in the command line, and "Newskool", which uses the special characters in Unicode to make even more complex works of art.
Here's a picture of a zebra's head:
Q. What is the difference between ASCII, Unicode, and UTF-8?
A. ASCII is the older standard from the 1960's, whereas Unicode came into existence in the late 1980's. ASCII is only 128 or 256 characters, but Unicode has over 100,000. Unicode is the character table, and UTF-8 (or UTF-16 or UTF-32) is the encoding. Unicode 0-255 and ASCII are nearly identical, with just some minor differences on the control characters. UTF-8 is the most common encoding on the web today, and the default.
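To see the difference between the character table and the encoding, take one character and encode it three ways. In this Python sketch the euro sign is an arbitrary example, and the little-endian codec names are used so that no byte-order mark is added to the output.

```python
ch = "€"

# The Unicode number (code point) is the same no matter the encoding.
print(ord(ch))  # 8364

# The number of bytes stored depends on the encoding chosen.
sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
```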
Q. Do I need to declare what encoding type I'm using for my web page?
A. Only if you know you need to use a unique encoding type. If you don't declare one, most browsers will default to UTF-8. If you are creating a webpage in a foreign language, especially a non-Latin one, make sure that you either use UTF-8 or pick a special charset.
Q. Do I need to memorize any ASCII codes to write HTML?
A. Only if you're trying to be extremely efficient. Most web sites today are dynamic and generate the HTML for you, through systems like a content management system (CMS). If you are a developer, you will probably be using other programming languages in addition to HTML, and those languages might have special ways of generating those ASCII symbols. Finally, as discussed above, many of those codes use special character names in HTML instead of ASCII numbers.
Q. Does the character encoding differ on different operating systems?
A. Somewhat. Unicode is slightly different on Windows vs Unix/Linux. For example, Windows uses UTF-16LE while Linux normally uses UTF-8. Now of course, the encoding used by your operating system might differ from the encoding on a webpage, but your OS and the web-browser work together to convert the character codes into something your computer can display. Sometimes, in older operating systems, this conversion might not work and you would just see blank characters. (For example, it's something you might see visiting a foreign website on Windows XP.)
Q. ASCII Art is awesome! Where can I make my own?
A. AsciiWorld.com has some great galleries and tools in their software section, such as converters and "painters." Have fun!