by Tom Riecken
December 12, 2018

ASCII Encoding: The Definitive Guide

ASCII is a character encoding that computers use to store and retrieve characters (letters, numbers, symbols, spaces, indentations, etc.) as bit patterns in memory and on hard drives.

At a high level, “character encoding” means converting a symbol into a binary number and using a “character map” to read that number back as a character.

And MIME types allow users to send data beyond characters, like images and videos.

ASCII, Character Encoding, MIME Types

Character Encoding

The earliest form of character encoding goes as far back as the electric telegraph. In fact, Morse code and, later, the Baudot code were some of the first standardized character codes ever created.

A second layer of encoding called encryption or ciphering was also established by militaries of that time, but that is a rather different topic.

It wasn’t until the 1950s that we began the modern process toward ASCII. IBM started this by developing encoding schemes for use in their 7000 Series computers.

IBM’s Binary Coded Decimal (BCD) used a four-bit encoding on punchcards. It was a way of storing decimal numbers in a binary form.

So instead of numbers running from 0000 (0) to 1111 (15), they ran from 0000 (0) to 1001 (9) — each four bits representing a single digit.
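As a minimal sketch of the idea (the function names to_bcd and from_bcd are made up for illustration, and this is not IBM's actual punchcard layout), each decimal digit can be turned into its own four-bit group:

```python
def to_bcd(number: int) -> str:
    """Encode a non-negative integer as BCD: one 4-bit group per decimal digit."""
    if number < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bits: str) -> int:
    """Decode space-separated 4-bit groups back into an integer."""
    digits = []
    for group in bits.split():
        value = int(group, 2)
        if value > 9:
            # 1010 through 1111 are not valid decimal digits in BCD.
            raise ValueError(f"{group} is not a valid BCD digit")
        digits.append(str(value))
    return int("".join(digits))


print(to_bcd(1963))                    # 0001 1001 0110 0011
print(from_bcd("0001 1001 0110 0011")) # 1963
```

Note that the six patterns from 1010 to 1111 are simply unused, which is the trade-off BCD makes for keeping each digit separate.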

Later, IBM created an extended version of BCD called Extended Binary Coded Decimal Interchange Code (EBCDIC). It was an 8-bit encoding system for all the standard printable characters.

In 1963, around the time EBCDIC was being developed, ASCII was introduced.

It uses a 7-bit encoding scheme, which can represent 128 different values.

This 7-bit number format might seem odd. After all, aren’t computers all 8-bit or 16-bit or 32-bit and so on?

Today they are. But early computers were not constructed in that way.

What’s more, memory on computers was precious and there was no reason to use an extra bit if you didn’t need it. A 6-bit code (which existed) wouldn’t cover all the upper and lower case letters, numbers, and basic punctuation marks. But a 7-bit code did — with room to spare.
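The arithmetic behind that choice can be checked with a few lines of Python. This is only an illustrative sketch; the helper name code_capacity is an assumption, not part of any standard:

```python
import string


def code_capacity(bits: int) -> int:
    """Number of distinct values a code with the given bit width can represent."""
    return 2 ** bits


# Upper case letters, lower case letters, and digits, before any punctuation
# or control codes are counted.
letters_and_digits = len(string.ascii_letters) + len(string.digits)

print(letters_and_digits)   # 62
print(code_capacity(6))     # 64
print(code_capacity(7))     # 128
```

With 62 of the 64 six-bit values already taken by letters and digits, there is almost nothing left for punctuation, while seven bits leave plenty of room.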

As computers began to settle into an 8-bit (1-byte) structure, ASCII gradually turned into an unofficial 8-bit code, in which the additional 128 characters were not standardized.

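This state persisted for some time. ISO (the International Organization for Standardization) standardized 8-bit extensions of ASCII in the ISO 8859 series, and after the Unicode standard was first published in 1991, UTF-8 followed as an encoding built on 8-bit units.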

The challenge at this time, though, was that a 7-bit or 8-bit encoding could support only one alphabet.

In order to support a broader swath of languages, the Unicode encoding schema was devised, along with the Universal Character Set. Unicode has several encoding forms. UTF-8 is the 8-bit form that is compatible with ASCII, and it has risen to replace ASCII as the predominant character encoding standard on the web today.

Growth of UTF-8

Additionally, UTF-16 and UTF-32 have come into use for languages with a lot of characters. However, Chinese, Japanese, and Arabic can all be represented in UTF-8.

As a result, UTF-8 is by far the most common encoding format on the web. And for English speakers, things are particularly easy because the 128 ASCII characters are the same as the first 128 characters in Unicode.

So for use in HTML, referencing an ASCII table to create a character will work regardless of what encoding format you are using.
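A small Python sketch can illustrate this compatibility. The sample strings are arbitrary, and the variable names are only for illustration:

```python
text = "Hello, ASCII!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Pure ASCII text produces byte-for-byte identical output in both encodings.
print(ascii_bytes == utf8_bytes)        # True

# The Unicode code point of an ASCII character equals its ASCII code.
print([ord(char) for char in "Az"])     # [65, 122]

# A character outside ASCII needs more than one byte in UTF-8.
inverted_question = "¿".encode("utf-8")
print(inverted_question)                # b'\xc2\xbf'
```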

Where ASCII Fits In

ASCII stands for “American Standard Code for Information Interchange” and was created by the American Standards Association (later renamed the American National Standards Institute).

The ASCII standard was started in 1960 and released in 1963. It was an extension of telegraphic codes and was first used by Bell data services.

Major revisions were made over the years. ASCII remained the most widely used character encoding on the web until 2007, when UTF-8 overtook it.

The web’s switch from ASCII and Microsoft’s ANSI towards UTF-8 can be largely attributed to initiatives by Google, as internet usage was becoming more international and ASCII was only capable of displaying Latin characters.

What’s important to note is that UTF-8 is an encoding, while Unicode is the character set. Because Unicode’s first 128 characters are the same as ASCII, it is acceptable to refer to an ASCII table when generating characters in HTML.

ASCII does have the ability to use an “escape sequence” in displaying alternative alphabets, which allowed it to become an international standard, but Unicode handles this more directly.

Unicode originated in work by engineers at Xerox and Apple in 1987, and became the project of the Unicode Consortium in 1991. ASCII was created by the ASA, but further refinement of it continued through standards published by ISO.

The encoding name UTF-8 is the one registered with the Internet Assigned Numbers Authority (IANA), and it is the name used by standards that follow the IANA registry, including HTML, CSS, and XML. IANA is a function of the larger ICANN, the non-profit that coordinates internet protocol parameters and domain names.

To summarize, ASCII evolved from telegraph code in the 1960s, grew up, and became part of the Unicode character set, which is used by UTF-8, the dominant encoding format on the web.

Domain names and webpage code depend on having this unified character map to work properly.

This means that at the very root of the modern internet, there exists a character format invented in the 1870s, computerized as ASCII in the 1960s, modernized for the web with Unicode in the 1990s, and broadly adopted through UTF-8’s majority use in 2007.

Control Characters vs Printable Characters

There are two types of characters in ASCII: printable characters and control characters.

The control characters occupy codes 0-31 and 127. They cover the parts of writing that allow for new paragraphs, tabs, ends of lines, file separators, and many other functions that are mostly invisible to the reader.

These control characters were created at a time when punched cards were a big part of the computing process. Some of those features have since been replaced, but a lot of the line formatting parts are still around today. Code 127 is the code for delete, a meaning it keeps in Unicode.

All of the printable characters are what you might expect. There are all the lower case characters (a-z) and uppercase characters (A-Z), along with numbers, symbols, and punctuation marks: essentially everything seen on a typical keyboard. These principal characters make up all written English words.
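The split between the two groups can be sketched in Python. The helper is_control is a hypothetical name used only for this example:

```python
def is_control(code: int) -> bool:
    """Return True if an ASCII code (0-127) is a control character."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes run from 0 to 127")
    return code <= 31 or code == 127


control_codes = [code for code in range(128) if is_control(code)]
printable_codes = [code for code in range(128) if not is_control(code)]

print(len(control_codes), len(printable_codes))   # 33 95

# Two control characters that are still in everyday use: tab and line feed.
print(repr(chr(9)), repr(chr(10)))                # '\t' '\n'

# The printable range starts with the space (32) and ends with the tilde (126).
print(repr(chr(printable_codes[0])), chr(printable_codes[-1]))   # ' ' ~
```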

Using ASCII in XML and HTML

Every HTML page has a character encoding format assigned to it.

HTML5 expects documents to be encoded as UTF-8, although a browser may fall back to a legacy encoding when none is declared. To use pure ASCII, ANSI, or any other specialized format, all that needs to be done is to add a declaration in a meta tag.

For HTML 4:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

For HTML5:

<meta charset="UTF-8">

In the charset attribute, you can specify UTF-8, ANSI, or ASCII (by using charset="us-ascii"), or you can look up a specific character set to use, usually by declaring an ISO number. A full list can be found on the IANA character sets page.

Character Code Insertion Format

Generally, however, when someone refers to using the ASCII code, you will want them to clarify if they mean TRUE US-ASCII with a meta-tag, or if they are just asking you to display a special character.

In HTML, any time you want to use a special character, such as the cent symbol (¢) or an inverted question mark (¿), you can generally insert it with a numeric character reference based on its Unicode number, like this:

¢ in HTML looks like: &#0162;

¿ in HTML looks like: &#0191;

So you start with &#, followed by the character’s decimal code (the leading zeros shown here are optional), and finish with a semicolon (;).

In this way, you are able to display characters based on their ASCII/Unicode number.

Of course, control characters will perform a formatting function or not work at all, depending on which one you use and which real character set you have listed in your meta tag.

So in HTML you see the “&#” number, but when displayed in your browser you will see the character.
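To see the same mapping outside a browser, here is a small sketch using Python's standard html module. The helper to_numeric_reference is a made-up name, and the four-digit padding simply mirrors the style used in this guide:

```python
import html


def to_numeric_reference(char: str) -> str:
    """Build a decimal numeric character reference, zero-padded to four digits."""
    return f"&#{ord(char):04d};"


# Decoding: the reference turns back into the character, as a browser would show it.
print(html.unescape("&#0162; and &#0191;"))   # ¢ and ¿

# Encoding: the character's Unicode number goes between "&#" and ";".
print(to_numeric_reference("¢"))              # &#0162;

# Python can also produce references automatically when a character does not
# fit in the target encoding (here, strict ASCII).
print("5¢".encode("ascii", "xmlcharrefreplace"))   # b'5&#162;'
```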

HTML Special Entity Characters

Now, let’s say for example you want to just show an & symbol on your page.

You shouldn’t just type it into the HTML, because the browser may read it as the start of a character reference, but you can type in the corresponding ASCII or Unicode reference instead.

HTML is a markup language, so while normal letters work fine, special characters, and especially the < and > brackets, are critically important to how the browser reads and shows the HTML.

You don’t always need to type in the Unicode/ASCII reference number, though. For HTML 4.0 and newer, there are special entities which work similarly to a Unicode reference, but instead of memorizing a number you memorize a word.

¢ in HTML looks like: &cent;

¿ in HTML looks like: &iquest;
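As a rough illustration, Python's standard html module knows the same entity names, so it can be used to check what a name stands for or to escape markup characters before placing text in a page. The sample strings are arbitrary:

```python
import html
from html.entities import name2codepoint

# Named entities decode to the same characters as their numeric references.
print(html.unescape("&cent; &iquest;"))      # ¢ ¿
print(name2codepoint["cent"])                # 162
print(name2codepoint["iquest"])              # 191

# Escaping replaces the characters that have special meaning in HTML markup.
escaped = html.escape("Tom & Jerry <3")
print(escaped)                               # Tom &amp; Jerry &lt;3
```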

A full list of these character references can be found at the W3C (World Wide Web Consortium).

Reference Table

With all this lead-up, you might just be looking for an easy place to find an ASCII or Unicode reference. Look no further: we have references 000-127 here, and you can find the full Unicode format on Wikipedia.

Note that characters 000-031 and 127 are control characters and 032 is the space, so none of them has a visible glyph; they are indicated with “NA.”

ASCII

Code Char HTML Code Char HTML Code Char HTML Code Char HTML
0 NA &#0000; 32 NA &#0032; 64 @ &#0064; 96 ` &#0096;
1 NA &#0001; 33 ! &#0033; 65 A &#0065; 97 a &#0097;
2 NA &#0002; 34 " &#0034; 66 B &#0066; 98 b &#0098;
3 NA &#0003; 35 # &#0035; 67 C &#0067; 99 c &#0099;
4 NA &#0004; 36 $ &#0036; 68 D &#0068; 100 d &#0100;
5 NA &#0005; 37 % &#0037; 69 E &#0069; 101 e &#0101;
6 NA &#0006; 38 & &#0038; 70 F &#0070; 102 f &#0102;
7 NA &#0007; 39 ' &#0039; 71 G &#0071; 103 g &#0103;
8 NA &#0008; 40 ( &#0040; 72 H &#0072; 104 h &#0104;
9 NA &#0009; 41 ) &#0041; 73 I &#0073; 105 i &#0105;
10 NA &#0010; 42 * &#0042; 74 J &#0074; 106 j &#0106;
11 NA &#0011; 43 + &#0043; 75 K &#0075; 107 k &#0107;
12 NA &#0012; 44 , &#0044; 76 L &#0076; 108 l &#0108;
13 NA &#0013; 45 - &#0045; 77 M &#0077; 109 m &#0109;
14 NA &#0014; 46 . &#0046; 78 N &#0078; 110 n &#0110;
15 NA &#0015; 47 / &#0047; 79 O &#0079; 111 o &#0111;
16 NA &#0016; 48 0 &#0048; 80 P &#0080; 112 p &#0112;
17 NA &#0017; 49 1 &#0049; 81 Q &#0081; 113 q &#0113;
18 NA &#0018; 50 2 &#0050; 82 R &#0082; 114 r &#0114;
19 NA &#0019; 51 3 &#0051; 83 S &#0083; 115 s &#0115;
20 NA &#0020; 52 4 &#0052; 84 T &#0084; 116 t &#0116;
21 NA &#0021; 53 5 &#0053; 85 U &#0085; 117 u &#0117;
22 NA &#0022; 54 6 &#0054; 86 V &#0086; 118 v &#0118;
23 NA &#0023; 55 7 &#0055; 87 W &#0087; 119 w &#0119;
24 NA &#0024; 56 8 &#0056; 88 X &#0088; 120 x &#0120;
25 NA &#0025; 57 9 &#0057; 89 Y &#0089; 121 y &#0121;
26 NA &#0026; 58 : &#0058; 90 Z &#0090; 122 z &#0122;
27 NA &#0027; 59 ; &#0059; 91 [ &#0091; 123 { &#0123;
28 NA &#0028; 60 < &#0060; 92 \ &#0092; 124 | &#0124;
29 NA &#0029; 61 = &#0061; 93 ] &#0093; 125 } &#0125;
30 NA &#0030; 62 > &#0062; 94 ^ &#0094; 126 ~ &#0126;
31 NA &#0031; 63 ? &#0063; 95 _ &#0095; 127 NA &#0127;

ASCII Tools and Resources

There is a lot of history on how character codes evolved, and the organizations which hold these standards together for the rest of us. With most internet developers and the W3C settling on UTF-8, for at least the immediate future, that is how pages will be encoded.

If you start manually encoding in other formats, though, you’re going to need some resources to help you, and it can be nice just to have a comprehensive reference around.

List of Resources

ASCII Art

No summary of ASCII would be complete without a reference to ASCII art.

Special software can be used, or symbols hand-coded, to form the shape of an image using nothing but symbols. This type of effect has existed since the 1980s and was made popular on systems like the Commodore Amiga computer.

There is even a distinction between “Oldskool” ASCII art which uses pure ASCII in the command line, and “Newskool” which uses the special characters in Unicode to make even more complex works of art.

[Image: ASCII art of a zebra’s head]

ISO-8859-1

ISO-8859-1 is a character encoding standard. It was first published by the International Organization for Standardization (ISO) in 1987 as an extension to ASCII; the current edition dates from 1998.

ASCII and ISO-8859-1

The most famous character encoding standard is ASCII. ASCII used 7 bits of an eight-bit byte in order to encode the most basic 128 characters used for writing English. A number of system-specific uses were developed for the eighth (high-order) bit.

For example, one system used it to toggle between roman and italic printing styles. Other systems used it to encode additional characters. By using all eight bits, 256 characters can be encoded.

Since the original ASCII set didn’t include a number of characters needed to write in common non-English languages (such as letters with diacritical marks), extending the character set to 256 greatly increased its capabilities.

ISO-8859-1 is one of those extensions. It was intended to be an international, cross-platform standard. Since it is a superset of standard 7-bit ASCII, it is backward-compatible: a document encoded in ASCII could easily be decoded using ISO-8859-1.
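That backward compatibility can be sketched in a few lines of Python (the sample text and variable names are arbitrary):

```python
ascii_bytes = "Plain ASCII text".encode("ascii")

# Every ASCII byte has its high-order bit clear (value below 128) ...
print(all(byte < 128 for byte in ascii_bytes))    # True

# ... so decoding the same bytes as ISO-8859-1 gives the same text.
print(ascii_bytes.decode("iso-8859-1"))           # Plain ASCII text

# A byte with the eighth bit set is meaningful in ISO-8859-1 but not in ASCII.
high_byte = bytes([0xE9])
print(high_byte.decode("iso-8859-1"))             # é
try:
    high_byte.decode("ascii")
except UnicodeDecodeError:
    print("0xE9 is not valid ASCII")
```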

ISO-8859-1 and HTML

According to the standard, ISO-8859-1 was the default character encoding in HTML 4. However, most browsers supported a superset of ISO-8859-1, Windows-1252, which is commonly called ANSI.

ANSI assigns printable characters to most of the 32 code positions (128-159) that have no printable characters in ISO-8859-1. (Most of the time, when you see a list of ISO-8859-1 characters, it’s actually the full ANSI list.)
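The difference shows up when the same bytes are decoded both ways. This sketch assumes that Python's cp1252 codec stands in for what the text calls ANSI:

```python
import unicodedata

raw = bytes([0x80, 0x99])   # decimal 128 and 153

# Decoded as Windows-1252 ("ANSI"), these bytes are printable characters.
ansi_text = raw.decode("cp1252")
print(ansi_text)                                    # €™

# Decoded as ISO-8859-1, the same bytes map to control codes with no glyph.
latin1_text = raw.decode("iso-8859-1")
print([unicodedata.category(char) for char in latin1_text])   # ['Cc', 'Cc']
```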

Today, the HTML5 standard uses UTF-8, an encoding of Unicode, whose very large character set includes all the characters of the original ASCII, ISO-8859-1, and ANSI encodings.

However, most English-language HTML documents, even those explicitly declaring ISO-8859-1 or UTF-8 as their character set, actually use the smaller ASCII character set. There are two reasons for this:

  • ASCII can be typed on a standard QWERTY keyboard.

  • Many of the technologies used to generate HTML only support ASCII.

Since ISO-8859-1 and UTF-8 are both ASCII-compatible, this doesn’t usually cause any problems.

ISO-8859-1 and Character Entities

The extended set of characters available in ISO-8859-1 can be produced in an ASCII-only document by using HTML character entities. These are strings that begin with the ampersand (“&”) and terminate with a semicolon (“;”).

For example, the copyright symbol (the circle with a “C” in it) can be encoded directly using ISO-8859-1 or UTF-8. But since there is no “©” key on most keyboards, many people find it easier to type &copy;.

This is stored in the file as six ASCII characters: &, c, o, p, y, and ;. Web browsers then display the appropriate ISO-8859-1 character to the user.

Most of the non-ASCII ISO-8859-1 characters have named HTML character entities. Those that do not can be typed with their numerical code. The numerical code is actually the decimal (base 10) version of the binary encoding.

For example, the copyright symbol is encoded as 10101001 in binary, which is 169 in base 10. So you could type &copy; or &#169;.
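The same conversion can be checked with a short Python sketch (variable names are illustrative only):

```python
import html

# Binary 10101001 read as a base-2 number is decimal 169.
code = int("10101001", 2)
print(code)                              # 169

# That number identifies the copyright sign, both in ISO-8859-1 and in Unicode.
print(chr(code))                         # ©
print("©".encode("iso-8859-1"))          # b'\xa9'

# The named and numeric references decode to the same character.
numeric_reference = f"&#{code};"
print(numeric_reference)                 # &#169;
print(html.unescape("&copy;") == html.unescape(numeric_reference))   # True
```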

Non-ASCII Characters in ISO-8859-1 and ANSI

Characters 128-159 on this chart are ANSI characters not included in ISO-8859-1. The first 128 codes (0-127) in ISO-8859-1/ANSI are not included here, as they are identical to ASCII, which we’ve listed above.

Character HTML Name HTML Number Description
€ &euro; &#128; euro sign
‚ &sbquo; &#130; single low-9 quotation mark
ƒ &fnof; &#131; lowercase letter f with hook
„ &bdquo; &#132; double low-9 quotation mark
… &hellip; &#133; horizontal ellipsis
† &dagger; &#134; dagger
‡ &Dagger; &#135; double dagger
ˆ &circ; &#136; modifier letter circumflex accent
‰ &permil; &#137; per mille sign
Š &Scaron; &#138; capital letter S with caron
‹ &lsaquo; &#139; single left-pointing angle quotation mark
Œ &OElig; &#140; capital ligature OE
Ž (none in HTML 4) &#142; capital letter Z with caron
‘ &lsquo; &#145; left single quotation mark
’ &rsquo; &#146; right single quotation mark
“ &ldquo; &#147; left double quotation mark
” &rdquo; &#148; right double quotation mark
• &bull; &#149; bullet
– &ndash; &#150; en dash
— &mdash; &#151; em dash
˜ &tilde; &#152; tilde
™ &trade; &#153; trade mark sign
š &scaron; &#154; lowercase letter s with caron
› &rsaquo; &#155; single right-pointing angle quotation mark
œ &oelig; &#156; lowercase ligature oe
ž (none in HTML 4) &#158; lowercase letter z with caron
Ÿ &Yuml; &#159; capital letter Y with diaeresis
  &nbsp; &#160; non-breaking space
¡ &iexcl; &#161; inverted exclamation mark
¢ &cent; &#162; cent sign
£ &pound; &#163; pound sign (currency)
¤ &curren; &#164; currency sign
¥ &yen; &#165; yen/yuan sign
¦ &brvbar; &#166; broken vertical bar
§ &sect; &#167; section sign
¨ &uml; &#168; diaeresis
© &copy; &#169; copyright sign
ª &ordf; &#170; feminine ordinal indicator
« &laquo; &#171; left double angle quotation mark (guillemet)
¬ &not; &#172; not sign (logic)
­ &shy; &#173; soft/discretionary hyphen
® &reg; &#174; registered trade mark sign
¯ &macr; &#175; spacing macron / overline
° &deg; &#176; degree sign
± &plusmn; &#177; plus/minus sign
² &sup2; &#178; superscript two (squared)
³ &sup3; &#179; superscript three (cubed)
´ &acute; &#180; acute accent
µ &micro; &#181; micro sign
¶ &para; &#182; paragraph sign (pilcrow)
· &middot; &#183; middle dot
¸ &cedil; &#184; cedilla
¹ &sup1; &#185; superscript one
º &ordm; &#186; masculine ordinal indicator
» &raquo; &#187; right double angle quotation mark (guillemet)
¼ &frac14; &#188; one quarter fraction (1 over 4)
½ &frac12; &#189; one half fraction (1 over 2)
¾ &frac34; &#190; three quarters fraction (3 over 4)
¿ &iquest; &#191; inverted question mark
À &Agrave; &#192; capital letter A with grave accent
Á &Aacute; &#193; capital letter A with acute accent
 &Acirc; &#194; capital letter A with circumflex
à &Atilde; &#195; capital letter A with tilde
Ä &Auml; &#196; capital letter A with diaeresis
Å &Aring; &#197; capital letter A with ring above
Æ &AElig; &#198; capital AE ligature
Ç &Ccedil; &#199; capital letter C with cedilla
È &Egrave; &#200; capital letter E with grave accent
É &Eacute; &#201; capital letter E with acute accent
Ê &Ecirc; &#202; capital letter E with circumflex
Ë &Euml; &#203; capital letter E with diaeresis
Ì &Igrave; &#204; capital letter I with grave accent
Í &Iacute; &#205; capital letter I with acute accent
Î &Icirc; &#206; capital letter I with circumflex
Ï &Iuml; &#207; capital letter I with diaeresis
Ð &ETH; &#208; capital letter ETH (also used as the Dogecoin symbol)
Ñ &Ntilde; &#209; capital letter N with tilde
Ò &Ograve; &#210; capital letter O with grave accent
Ó &Oacute; &#211; capital letter O with acute accent
Ô &Ocirc; &#212; capital letter O with circumflex
Õ &Otilde; &#213; capital letter O with tilde
Ö &Ouml; &#214; capital letter O with diaeresis
× &times; &#215; multiplication sign
Ø &Oslash; &#216; capital letter O slash
Ù &Ugrave; &#217; capital letter U with grave accent
Ú &Uacute; &#218; capital letter U with acute accent
Û &Ucirc; &#219; capital letter U with circumflex
Ü &Uuml; &#220; capital letter U with diaeresis
Ý &Yacute; &#221; capital letter Y with acute accent
Þ &THORN; &#222; capital letter THORN
ß &szlig; &#223; lowercase letter sharp s (Eszett / scharfes S )
à &agrave; &#224; lowercase letter a with grave accent
á &aacute; &#225; lowercase letter a with acute accent
â &acirc; &#226; lowercase letter a with circumflex
ã &atilde; &#227; lowercase letter a with tilde
ä &auml; &#228; lowercase letter a with diaeresis
å &aring; &#229; lowercase letter a with ring above
æ &aelig; &#230; lowercase ae ligature
ç &ccedil; &#231; lowercase letter c with cedilla (cé cédille)
è &egrave; &#232; lowercase letter e with grave accent
é &eacute; &#233; lowercase letter e with acute accent
ê &ecirc; &#234; lowercase letter e with circumflex
ë &euml; &#235; lowercase letter e with diaeresis
ì &igrave; &#236; lowercase letter i with grave accent
í &iacute; &#237; lowercase letter i with acute accent
î &icirc; &#238; lowercase letter i with circumflex
ï &iuml; &#239; lowercase letter i with diaeresis
ð &eth; &#240; lowercase letter eth
ñ &ntilde; &#241; lowercase letter n with tilde
ò &ograve; &#242; lowercase letter o with grave accent
ó &oacute; &#243; lowercase letter o with acute accent
ô &ocirc; &#244; lowercase letter o with circumflex
õ &otilde; &#245; lowercase letter o with tilde
ö &ouml; &#246; lowercase letter o with diaeresis
÷ &divide; &#247; division sign
ø &oslash; &#248; lowercase letter o with slash
ù &ugrave; &#249; lowercase letter u with grave accent
ú &uacute; &#250; lowercase letter u with acute accent
û &ucirc; &#251; lowercase letter u with circumflex
ü &uuml; &#252; lowercase letter u with diaeresis
ý &yacute; &#253; lowercase letter y with acute accent
þ &thorn; &#254; lowercase letter thorn
ÿ &yuml; &#255; lowercase letter y with diaeresis

Unicode

Unicode is a standard for character encoding managed by The Unicode Consortium.

As we’ve discussed, computer systems don’t store characters (letters, numbers, symbols) literally; there’s no tiny picture of each letter in a document on your hard drive. Instead, each character is encoded as a series of binary bits (1s and 0s). For example, the code for the lowercase letter “a” is 01100001.
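A minimal Python sketch makes the mapping visible. The helper names to_bits and from_bits are assumptions for this example, and the sketch only handles ASCII text:

```python
def to_bits(text: str) -> str:
    """Show each ASCII character of the text as an 8-bit pattern."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def from_bits(bits: str) -> str:
    """Turn space-separated 8-bit patterns back into ASCII text."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


print(to_bits("a"))                             # 01100001
print(to_bits("abc"))                           # 01100001 01100010 01100011
print(from_bits("01100001 01100010 01100011"))  # abc
```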

But 01100001 is arbitrary — there’s nothing special about that string of bits that should make it the letter “a” — the computer industry has collectively agreed that it means “a.” So how does the entire industry come to agree on how to represent every possible character? With a character encoding standard. An encoding standard simply specifies all the possible characters available, and assigns each one a string of bits.

There have been several character encoding standards used around the world over the last several decades of computing. For a long time, the most universally-accepted standard was ASCII. The problem with ASCII is that it only encoded a relatively limited number of characters: 128 in the original 7-bit standard, and 256 at most in its 8-bit extensions. This excluded non-Latin languages, many important math and science symbols, and even some basic punctuation marks.

Aside from ASCII’s use in English and other languages which use the Latin alphabet, language groups using other alphabets tended to use their own character encoding. Since these encoding schemes were defined apart from each other, they often conflicted; it was impossible to use a single encoding scheme for multiple languages at the same time.

Unicode was originally conceived, and continues to be developed, specifically with the intent to overcome these challenges. The goal of Unicode is to provide a universal, unified, and unique code identifier for every grapheme in every language and writing system in the world.

UTF-8

Unicode has been implemented in several character encoding schemes, but the standard most widely used today is UTF-8. UTF-8 has become nearly universal for all types of modern computing.

UTF-8 encodes each character using one to four 8-bit units (bytes). ASCII used only 7 bits per character, usually stored in a single 8-bit byte. Unicode characters previously included in ASCII are represented in UTF-8 by a single byte, the same byte that was used in ASCII. This makes ASCII text forward-compatible with UTF-8. (This is one of the many reasons that UTF-8 became the universal standard: the transition was relatively easy.)

Using up to four bytes gives UTF-8 room for over a million code points (Unicode defines 1,114,112), allowing Unicode to encode characters from 129 scripts and writing systems as of this writing.
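To make the one-to-four-byte layout concrete, here is a hand-written sketch of a UTF-8 encoder, checked against Python's built-in codec. The function name utf8_encode is made up for illustration; in real code you would simply call str.encode("utf-8"):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for char in "a¢€😀":
    encoded = utf8_encode(ord(char))
    # The hand-rolled result should match Python's own UTF-8 codec.
    print(char, len(encoded), encoded == char.encode("utf-8"))
```

The four sample characters need one, two, three, and four bytes respectively, and only the first is an ASCII character.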

Resources for Understanding Unicode

Books on Unicode

  • Unicode Explained, by Jukka Korpela, provides a good overview of Unicode and various development challenges that come with implementing it
  • Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard, by Richard Gillam, is a helpful, if somewhat dated, explanation of Unicode, with a lot of Java-focused implementation specifics
  • Fonts and Encodings, by Yannis Haralambous, is not solely about Unicode, but might be the book most worth reading; it covers the history of encoding and representing text in computers, providing both a theoretical and practical foundation for understanding Unicode and a number of closely related subjects.

Unicode Reference Material

Once you have a basic understanding of Unicode, you’ll mostly find yourself needing to look up specific details — such as the exact encoding of a particular character.

  • The C/C++ Unicode Cheatsheet provides info on converting Microsoft C/C++ to Unicode
  • XML and Unicode Technology Reports is a list of technical reports covering various aspects of using XML and Unicode together
  • Decode Unicode provides an online Unicode dictionary with a beautiful UI, allowing you to view every defined Unicode character, even without local font support
  • Data on Languages provides searchable information on using Unicode character sets with various languages
  • Unicode Navigator provides an organized list of all the Unicode characters

Unicode Tools

  • Unicode Analyzer is a Chrome browser extension that provides information on Unicode text in web pages and documents
  • Character Identifier is a Firefox plugin that provides a context menu for finding more information about selected Unicode characters
  • For inserting Unicode characters in text fields on the web, try Unicode Symbols for Chrome or Unicode Input Tool for Firefox
  • UnicodeDataBrowser provides a GUI for easier reading of the UnicodeData.txt file
  • Polyglot 3000 automatically identifies the language of any text
  • Unicode provides a list of Unicode character keyboard layouts for various Unicode-supported scripts
  • Babel is a Python library for a wide range of internationalization and localization tasks
  • D-Type Unicode Text Engine is a C++ library for laying out, rendering, and editing high-quality Unicode text on any device, platform, or operating system
  • Nunicode is a C library for encoding and decoding of UTF-8 documents
  • Portable UTF-8 provides Unicode support for PHP strings
  • Tesseract OCR provides optical character recognition for Unicode text
  • Popchar is an improved character map that lets you easily find and type characters from the whole range of the Unicode space
  • Unicode Utilities provides a number of interesting and useful online tools for working with Unicode
  • Edicode provides a flexible online Unicode keyboard for typing text using various international scripts
  • Quickkey is a flexible keyboard extension for typing the first 65,000 defined Unicode characters
  • Unicode Code Converter converts any entered character code into several different encodings of the same character
  • CharFunk is a JavaScript utility for performing a number of interesting checks and operations on Unicode characters
  • Kreative Recode transforms text files from various encodings into Unicode
  • BabelMap Online provides an in-browser Unicode keyboard, with output in display characters as well as hex or decimal encoding

Text and Code Editors

Most of today’s text editors, code editors, and IDEs either use Unicode by default, or can easily handle Unicode. Sublime, Notepad++, Atom, and Eclipse are all set to UTF-8 as the default character encoding. Vim and Emacs may need a setting change to use UTF-8.

There are also a handful of code and text editors specifically designed to handle the extended Unicode character set:

  • MinEd is a Unicode text editor with contextual support for inserting characters from the full range of the Unicode character space
  • Classical Text Editor is an advanced editor for working with critical and scholarly editions of texts, including multi-lingual texts using a wide range of the Unicode character set

Unicode Fonts

The relationship between fonts and Unicode is a bit oblique. Unicode was created to be backward-compatible with ASCII — text formatted in ASCII can be decoded as Unicode with virtually no problem. And Unicode-encoded text can be displayed using ASCII fonts, as long as only the small set of characters that appear in ASCII are used.

Today, most fonts available on most computers are encoded with Unicode. So, from that standpoint, most fonts are “Unicode fonts.” However, most fonts do not support a particularly large set of the full Unicode standard.

Usually, this is not a problem; someone authoring a text in multiple languages, or with an extended character set, might use several different fonts: one for Latin script, one for each CJK language, and another for math symbols, for example. However, it can be useful sometimes to have single fonts which contain a large percentage of the Unicode character space. This might be needed when working in plain text and source code environments where using multiple fonts is not feasible, or when visual unity between multiple scripts is especially important.

The following are the most notable font projects providing extended Unicode support. For a more complete listing, including defunct and deprecated fonts, see this page of Unicode fonts. For typesetting Asian languages, see this list of CJK fonts.

  • Everson Mono is a monospace font created by one of the originators of the Unicode standard; its stated purpose is to provide glyphs for as much of the Unicode character space as possible, and (as of this writing) 92 Unicode character blocks are supported.
  • Noto is a large set of display fonts, developed by Google, which together provide support for a high majority of the Unicode character sets, with the intention to eventually support the entire Unicode standard.
  • Deja Vu Fonts is a font family providing wide coverage of the Unicode standard, with Serif, Sans, and Monospace versions.
  • GNU FreeFont is a family of fonts, providing Serif, Sans, and Mono type faces for 37 writing systems and 12 Unicode symbol ranges.
  • GNU Unifont is a monospace, bitmap font with complete coverage for the Unicode 8.0 Basic Multilingual Plane and wide, but incomplete, coverage for the Supplementary Multilingual Plane.

There are also a number of interesting fonts which encode a particular subset of the Unicode standard for specialized use.

  • Junicode is a set of fonts for Medievalists
  • Last Resort is a “font of last resort”; instead of conventional character glyphs, each glyph actually displays information about the Unicode character itself
  • Unicode Fonts for Ancient Scripts is a project to create a set of fonts for several ancient and classical alphabets
  • Unimath Plus provides an extended set of science and math symbols

And here are some additional Unicode font resources, if you still can’t find what you are looking for:

Emoji Resources

Emoji are those funny little smiley faces and thumbs up signs that you can put in your text messages. They are actually part of the Unicode standard. The Emoji portion of Unicode is not universally supported, so if you want to incorporate Emoji into your app or website, you may need some help. Here are resources that will help you use and build with Unicode emoji.

Emoji Reference

  • Emojipedia is a searchable database of Emoji characters
  • Can I Emoji? provides information on native support for Unicode emoji on iOS, Android, OS X, and Windows, as well as major browsers
  • WTF Emoji Foundation is a slightly serious organization dedicated to the advancement of emoji; they run the Emoji Dictionary.
  • Emoji cheat sheet provides a quick reference for Emoji type-in codes

Emoji Libraries

Emoji Keyboards and Collections

MIME Types

MIME stands for “Multipurpose Internet Mail Extensions.” It’s the Internet standard used to identify various file types transmitted online. Originally, it was developed for email that was sent over SMTP (Simple Mail Transfer Protocol) which is the Internet standard for email transmission. Nowadays, MIME is extremely important in other communication protocols such as HTTP.

MIME History

We’ve already discussed the history of ASCII and character encoding. But there is much more to the story of sending information than this.

With time, our messages started to get more complex and it became obvious that the plain ASCII text format of early email was not enough. Multimedia content such as images, audio, or video files wasn’t defined at all. The same applied to languages that didn’t use the English alphabet. The situation finally began to change when two people joined forces: Nathaniel Borenstein and Ned Freed.

Their proposal redefined the format of messages to allow for email to contain multiple objects in a single message; the use of non-ASCII characters as well as non-English languages; and the use of images, audio, and video. This was the birth of MIME which became the official standard in 1993.
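To make the idea of several objects in one message concrete, here is a minimal sketch using Python’s standard email package. The addresses, the subject, the file name, and the attachment bytes are placeholders invented for illustration; they are not taken from the original proposal.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build one email that carries both text and a (fake) image."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"   # placeholder address
    msg["To"] = "bob@example.com"       # placeholder address
    msg["Subject"] = "Holiday photo"

    # The text body becomes a text/plain part.
    msg.set_content("Here is the photo I mentioned.")

    # Adding an attachment turns the message into a multipart message.
    fake_png = b"\x89PNG\r\n\x1a\n"     # only the PNG signature, not a real image
    msg.add_attachment(fake_png, maintype="image", subtype="png",
                       filename="photo.png")
    return msg


msg = build_message()
print(msg.get_content_type())
for part in msg.iter_parts():
    # Each part carries its own type and transfer encoding.
    print(part.get_content_type(), part["Content-Transfer-Encoding"])
```

Each part of the message announces its own Content-Type, which is what lets a mail client decide how to display or save it.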

The proposal also defined the transfer encodings that a message body can use: 7bit, 8bit, binary, base64, and quoted-printable. These encodings exist so that data survives the trip intact, even through mail systems that handle only 7-bit ASCII text. The proposal also described the Content-Type header, which is needed to identify the type of data being transmitted.
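As a rough illustration of what base64 and quoted-printable do, the sketch below pushes the same non-ASCII bytes through both using Python’s standard library. The sample word is an arbitrary choice.

```python
import base64
import quopri

raw = "Café".encode("utf-8")          # 5 bytes; the é takes two of them

as_base64 = base64.b64encode(raw)     # output uses ASCII characters only
as_qp = quopri.encodestring(raw)      # ASCII stays readable, other bytes become =XX

print(as_base64)
print(as_qp)

# Both encodings are reversible, so the receiver gets the original bytes back.
assert base64.b64decode(as_base64) == raw
assert quopri.decodestring(as_qp) == raw
```

Base64 makes the data about a third larger but works for any bytes, while quoted-printable keeps mostly-ASCII text legible, which is why it suits text and base64 suits images and other binary files.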

What Are MIME Types?

MIME types are labels that identify the many file formats transmitted every day on the Internet. They are standardized by the IANA (Internet Assigned Numbers Authority). MIME is specified in a series of documents beginning with Request for Comments 2045 (RFC 2045), written by Freed and Borenstein and published by the IETF (Internet Engineering Task Force).

Structure

MIME types consist of a type and a subtype, two strings separated by a forward slash (for example, text/html). The type represents a category and can be discrete or multipart. Each type has its own set of subtypes. MIME types are not case sensitive, but they are traditionally written in lowercase.

Discrete types include text, image, audio, video, and application. Multipart types represent documents that are broken down into distinct parts, which often have different MIME types of their own. They include multipart/form-data and multipart/byteranges.

Some subtypes begin with the prefix “x-” or “vnd.” The “x-” prefix means the type hasn’t been registered with the IANA, and “vnd.” marks a vendor-specific type.
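The type/subtype structure is simple enough to take apart by hand. The following is a minimal sketch, not a full parser: the names MimeType and parse_mime_type are made up for this example, and parameters such as a charset are simply discarded.

```python
from typing import NamedTuple


class MimeType(NamedTuple):
    main_type: str       # the part before the slash, e.g. "application"
    subtype: str         # the part after the slash, e.g. "x-tar"
    unregistered: bool   # True when the subtype starts with "x-"
    vendor: bool         # True when the subtype starts with "vnd."


def parse_mime_type(value: str) -> MimeType:
    """Split a 'type/subtype' string and note the x- and vnd. prefixes."""
    # Drop any parameters such as "; charset=utf-8" and normalise the case.
    essence = value.split(";", 1)[0].strip().lower()
    main_type, slash, subtype = essence.partition("/")
    if not slash or not main_type or not subtype:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return MimeType(main_type, subtype,
                    subtype.startswith("x-"), subtype.startswith("vnd."))


print(parse_mime_type("application/x-tar"))
print(parse_mime_type("application/vnd.ms-excel"))
print(parse_mime_type("text/html; charset=utf-8"))
```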

Common MIME Types

Application:

  • application/msword (.doc)
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document (.docx)
  • application/vnd.openxmlformats-officedocument.wordprocessingml.template (.dotx)
  • application/vnd.ms-powerpoint (.ppt)
  • application/ecmascript (.es)
  • application/x-javascript (.js)
  • application/octet-stream (.bin, .exe)
  • application/pdf (.pdf)
  • application/postscript (.ps, .ai, .eps)
  • application/rtf (.rtf)
  • application/x-gtar (.gtar)
  • application/x-gzip (.gz)
  • application/x-java-archive (.jar)
  • application/x-java-serialized-object (.ser)
  • application/x-java-vm (.class)
  • application/x-tar (.tar)
  • application/zip (.zip)
  • application/x-7z-compressed (.7z)
  • application/x-rar-compressed (.rar)
  • application/x-shockwave-flash (.swf)
  • application/vnd.android.package-archive (.apk)
  • application/x-bittorrent (.torrent)
  • application/epub+zip (.epub)
  • application/vnd.ms-excel (.xls)
  • application/x-font-ttf (.ttf)
  • application/rss+xml (.rss, .xml)
  • application/vnd.adobe.air-application-installer-package+zip (.air)
  • application/x-debian-package (.deb)
  • application/json (.json)

Audio:

  • audio/x-midi (.mid, .midi)
  • audio/x-wav (.wav)
  • audio/mp4 (.mp4a)
  • audio/ogg (.ogg)
  • audio/mpeg (.mp3)

Image:

  • image/bmp (.bmp)
  • image/gif (.gif)
  • image/jpeg (.jpeg, .jpg, .jpe)
  • image/tiff (.tiff, .tif)
  • image/x-xbitmap (.xbm)
  • image/x-icon (.ico)
  • image/svg+xml (.svg)
  • image/png (.png)

Text:

  • text/html (.htm, .html)
  • text/plain (.txt)
  • text/richtext (.rtf, .rtx)
  • text/css (.css)
  • text/csv (.csv)
  • text/calendar (.ics)

Video:

  • video/mpeg (.mpg, .mpeg, .mpe)
  • video/ogg (.ogv)
  • video/quicktime (.qt, .mov)
  • video/x-msvideo (.avi)
  • video/mp4 (.mp4)
  • video/webm (.webm)
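
If you need these mappings in code, you rarely have to maintain the list yourself. As a minimal sketch, Python’s standard mimetypes module guesses a type from a file name. The helper name content_type_for and the fallback to application/octet-stream are choices made here for illustration, and the answers for less common extensions can vary with the system’s own MIME tables.

```python
import mimetypes


def content_type_for(filename: str) -> str:
    """Guess a MIME type from the file extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # Unknown extensions are treated as arbitrary binary data.
    return guessed or "application/octet-stream"


for name in ["index.html", "photo.png", "notes.txt", "mystery.unknown-ext"]:
    print(name, "->", content_type_for(name))
```

Note that this looks only at the file name; it does not inspect the file’s contents.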

Resources

MIME types allowed us to have a better and richer email experience. The following list of resources will help you learn more in-depth about how and why they came to be as well as how to properly configure a web server for MIME type support, and more.

Online Resources

The following list includes links to the five-part proposal that became the standard draft for MIME.

  • RFC 2045 (PDF): the first part of the proposal specifies the various headers used to describe the structure of MIME messages.
  • RFC 2046 (PDF): the second document defines the general structure of the MIME media typing system and the initial set of media types.
  • RFC 2047 (PDF): the third part of the proposal describes extensions which allow non-US-ASCII text data in Internet mail header fields.
  • RFC 2048 (PDF): the fourth part describes how new MIME types can be registered with IANA.
  • RFC 2049 (PDF): the fifth document describes MIME conformance criteria with examples of MIME message formats.
  • Media Types: a complete list of all media types, which also includes a link to the application for registering new media types.
  • The MIME Guys: How Two Internet Gurus Changed Email Forever: an article based on the interviews with Nathaniel Borenstein and Ned Freed which gives an interesting insight into their work.

Books

Although there aren’t any books dedicated solely to MIME types, there is still a decent number of books on closely related topics that dedicate a few chapters to them.

  • Internet Email Protocols, Standards and Implementation (1998) by Lawrence Hughes: aimed at more advanced users, this book strengthens the knowledge of essential concepts needed to develop email software and thoroughly describes the key Internet email protocols and extensions such as SMTP, POP3, IMAP, MIME, and DSN.
  • Programming Internet Email (1999) by David Wood: an essential guide that covers all the important concepts necessary to build applications on top of email capabilities. Topics covered include various email protocols, email formats including MIME types, and plenty of examples.
  • Essential Email Standards (1999) by Peter Loshin: this book is a must-have for anyone looking to get an in-depth understanding of email standards. It provides a thorough analysis of the most important RFCs published by IETF as well as their potential use. It also includes a fully searchable digital version of the book on a CD.
  • MH & xmh (2006) by Jerry Peek: this book is freely available online and published under the GNU-GPL license. The third chapter explains MIME types and multipart messages in great detail.

Expand Your Knowledge of MIME Types

MIME types may seem insignificant on the surface but they brought major changes in the way our email messaging works. This list of resources should pique your curiosity and provide you with a deeper understanding of how email and files transmitted over the Internet have transformed through the years.

Summary

Most people just type and don’t really think much about what is happening. A select few bother to think about the niceties of font design and typography.

But even smaller is the number of people who know, or care to know, what happens behind the scenes — how a keypress becomes a letter on their computer screen.

To everyone else, it is either transparent or trivial.

But as we’ve shown, the process of representing language is hardly trivial, and a huge amount of work has gone into making it as transparent as it is. The Unicode Consortium, along with countless developers, designers, and linguists, has made it possible for anyone to write any character, from any language, in any script, on any computer.

This is a notable achievement, and a necessary step toward universal literacy and universal access to computers and the internet.

FAQ

Q. What is the difference between ASCII, Unicode, and UTF-8?

A. ASCII is the older standard from the 1960s, whereas Unicode came into existence in the late 1980s.

ASCII defines only 128 characters (or 256 in the 8-bit “extended ASCII” variants), but Unicode defines well over 100,000.

Unicode is the character table; UTF-8 (or UTF-16 or UTF-32) is the encoding that turns its code points into bytes. The first 128 Unicode code points (0 to 127) are identical to ASCII, and the next 128 match the ISO 8859-1 (Latin-1) character set.

UTF-8 is the most common encoding on the web today — and the default.
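
The split between the character table and the encoding is easy to see in code. In this small sketch (the function name encoded_sizes and the sample characters are arbitrary), each character has one Unicode code point but needs a different number of bytes depending on the encoding.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the code point of one character and its size in each encoding."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# A plain letter, an accented letter, the euro sign, and an emoji (U+1F600).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(ch, encoded_sizes(ch))

# Plain ASCII text is byte-for-byte identical in UTF-8.
assert "ASCII".encode("ascii") == "ASCII".encode("utf-8")
```

UTF-8 spends one byte on ASCII characters and up to four on everything else, which is a large part of why it suits the mostly-ASCII markup of the web.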

Q. Do I need to declare what encoding type I’m using for my web page?

A. Yes, declaring it is the safest thing to do.

If you don’t declare one, the browser has to guess, and it may fall back to an older regional encoding rather than UTF-8. UTF-8 is the best choice for nearly every page, whatever its language or script. Declare it with <meta charset="utf-8"> near the top of the page, and pick a different charset only if you have a specific reason to.

Q. Do I need to memorize any ASCII codes to write HTML?

A. Only if you’re trying to be extremely efficient.

Most websites today are dynamic and generate the HTML for you, through systems like a content management system (CMS). If you are a developer, you will probably be using other programming languages in addition to HTML, and those languages might have special ways of generating those ASCII symbols.

Finally, as discussed above, many of those codes use special character names in HTML instead of ASCII numbers.

Q. Does the character encoding differ on different operating systems?

A. Somewhat.

Unicode itself is the same everywhere, but the default encoding differs between Windows and Unix/Linux. For example, Windows uses UTF-16LE internally, while Linux normally uses UTF-8.

Now, of course, the encoding used by your operating system might differ from the encoding on a webpage, but your OS and the web-browser work together to convert the character codes into something your computer can display.

Sometimes, in older operating systems, this conversion might not work and you would just see blank characters. (For example, it’s something you might see visiting a foreign website on Windows XP.)
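
The conversion only works when the software knows which encoding the bytes are in. Here is a small sketch, using an arbitrary sample string, of the same text in the two encodings mentioned above and of what a wrong guess looks like.

```python
text = "Hi \u00e9"                       # "Hi é"

utf8_bytes = text.encode("utf-8")        # typical on Linux and on the web
utf16_bytes = text.encode("utf-16-le")   # the form Windows uses internally

print(utf8_bytes)
print(utf16_bytes)

# Decoded with the matching encoding, both give back the same text.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text

# Decoded with the wrong one (Latin-1 here), the accented letter
# comes out as two unrelated characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```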

Q. ASCII Art is awesome! Where can I make my own?

A. AsciiWorld.com has some great galleries and tools in their software section, such as converters and “painters.” Have fun!


Text written by Tom Riecken with additional material by Adam Michael Wood. Compiled and edited by Frank Moraes. ASCII Zebra is in the public domain. Growth of UTF-8 by Krauss is licensed under CC BY-SA 4.0.

About Tom Riecken

Tom has worked as a web developer and data analyst. He is involved with several global and local "futurist" organizations, where he often facilitates discussions about the social impact of technology. His strongest recreational interests include spaceflight, astronomy, and realistic science fiction. He lives in Washington.