ASCII Encoding: The Definitive Guide

ASCII is a character encoding that computers use to store and retrieve characters (letters, numbers, symbols, spaces, indentations, etc.) as bit patterns in memory and on hard drives.

"Character encoding" at a high level means the conversion of a symbol into a binary number, and using a "character map" to read the binary number as a type of letter.

ASCII and Character Encoding

Character Encoding

The earliest form of character encoding goes as far back as the electric telegraph. In fact, Morse code and, later, the Baudot code were some of the first standardized character codes ever created.

A second layer of encoding called encryption or ciphering was also established by militaries of that time, but that is a rather different topic.

It wasn't until the 1950s that we began the modern process toward ASCII. IBM started this by developing encoding schemes for use in their 7000 Series computers.

IBM's Binary Coded Decimal (BCD) used a four-bit encoding on punchcards. It was a way of storing decimal numbers in a binary form.

So instead of numbers running from 0000 (0) to 1111 (15), they ran from 0000 (0) to 1001 (9) — each four bits representing a single digit.
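The digit-by-digit idea can be sketched in a few lines of Python. This is only an illustration of the principle, not IBM's actual punchcard format, and the function name `to_bcd` is made up for this example:

```python
def to_bcd(number: int) -> str:
    """Return the decimal digits of a non-negative integer as 4-bit groups."""
    if number < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # Each decimal digit (0-9) becomes its own 4-bit pattern, 0000 to 1001.
    return " ".join(format(int(digit), "04b") for digit in str(number))


print(to_bcd(1963))  # 0001 1001 0110 0011
```

Note that the patterns 1010 through 1111 never appear, which is the trade-off of spending four bits on each decimal digit.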

Later, in 1963, IBM created an extended version of BCD called Extended Binary Coded Decimal Interchange Code (EBCDIC). It was an 8-bit encoding system for all the standard printable characters.

ASCII was introduced that same year.

It uses a 7-bit encoding scheme, which allows for 128 different values.

This 7-bit number format might seem odd. After all, aren't computers all 8-bit or 16-bit or 32-bit and so on?

Today they are. But early computers were not constructed in that way.

What's more, memory on computers was precious and there was no reason to use an extra bit if you didn't need it. A 6-bit code (which existed) wouldn't cover all the upper and lower case letters, numbers, and basic punctuation marks. But a 7-bit code did — with room to spare.
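The arithmetic behind that choice can be checked with a short Python sketch. It uses the character classes from Python's standard `string` module as a stand-in for "letters, numbers, and basic punctuation"; that grouping is an assumption made for illustration, not the committee's original working list:

```python
import string

# Characters a basic English code needs: letters, digits, punctuation, space.
needed = (
    len(string.ascii_uppercase)    # 26
    + len(string.ascii_lowercase)  # 26
    + len(string.digits)           # 10
    + len(string.punctuation)      # 32 punctuation marks and symbols
    + 1                            # the space character
)

six_bit_capacity = 2 ** 6    # 64 possible codes
seven_bit_capacity = 2 ** 7  # 128 possible codes

print(needed)                        # 95
print(needed <= six_bit_capacity)    # False
print(needed <= seven_bit_capacity)  # True
```

The 33 codes left over in a 7-bit scheme are the ones ASCII uses for control characters.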

As computers began to settle into an 8-bit (1-byte) structure, ASCII gradually turned into an unofficial 8-bit code, where the upper 128 characters were not standardized.

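This state persisted for some time. Standardized 8-bit extensions such as ISO-8859-1 eventually defined the upper 128 positions, and in the early 1990s UTF-8 was designed as a byte-based encoding for the universal character set maintained by the Unicode Consortium and the ISO (International Organization for Standardization).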

The challenge that came up at this time, though, was that a 7-bit or 8-bit encoding could support only one alphabet.

To support a broader swath of languages, the Unicode encoding scheme was devised, along with the Universal Character Set. Unicode has several encoding forms. UTF-8 is the 8-bit form; it is compatible with ASCII and has risen to replace ASCII as the predominant character encoding standard on the web today.

Growth of UTF-8

UTF-16 and UTF-32 are also in use, and UTF-16 can be more compact for languages with a lot of characters. However, Chinese, Japanese, and Arabic can all be represented in UTF-8.

As a result, UTF-8 is by far the most common encoding format on the web. And for English speakers, things are particularly easy because the first 128 characters of Unicode are the same as those in ASCII.

So for use in HTML, referencing an ASCII table to create a character will work regardless of what encoding format you are using.
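A quick way to see this overlap is to encode the same text both ways. The following is a minimal Python sketch using only built-in codecs; the sample string is arbitrary:

```python
text = "Hello, ASCII!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For text made only of the first 128 characters, both encodings
# produce exactly the same bytes.
print(ascii_bytes == utf8_bytes)  # True

# The Unicode number of each character matches its ASCII number.
print([ord(ch) for ch in "Hi!"])  # [72, 105, 33]
```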

Where ASCII Fits In

ASCII stands for "American Standard Code for Information Interchange" and was created by the American Standards Association (later renamed the American National Standards Institute).

The ASCII standard was started in 1960 and released in 1963. It was an extension of telegraphic codes and was first used by Bell data services.

Major revisions were made over the years. It was the most widely used character encoding on the web until 2007, when it was overtaken by UTF-8.

The web's switch from ASCII and Microsoft's ANSI toward UTF-8 can be largely attributed to initiatives by Google, as internet usage was becoming more international and ASCII was only capable of displaying Latin characters.

What's important to note is that UTF-8 is an encoding, while Unicode is the character set. Because Unicode's first 128 characters are the same as ASCII, it is acceptable to refer to an ASCII table when generating characters in HTML.

ASCII does have the ability to use an "escape sequence" in displaying alternative alphabets, which allowed it to become an international standard, but Unicode handles this more directly.

Unicode originated in 1987 in work by engineers at Xerox and Apple, and became the project of the Unicode Consortium in 1991. ASCII was created by the ASA, but further refinement of it continued as part of declarations from ISO.

The encoding name UTF-8 is registered with the Internet Assigned Numbers Authority (IANA), and standards that rely on IANA's registry, including HTML, CSS, and XML, use that name. IANA is a department of the larger ICANN, the non-profit that coordinates internet protocol parameters and domain names.

To summarize, ASCII evolved from telegraph code in the 1960s, grew up, and became part of the Unicode character set, which is used by UTF-8, the dominant encoding format on the web.

Domain names and webpage code depend on having this unified character map to work properly.

This means that at the very root of the modern internet, there exists a character format invented in the 1870s, computerized as ASCII in the 1960s, modernized for the web with Unicode in the 1990s, and broadly adopted through UTF-8's majority use in 2007.

Control Characters vs Printable Characters

There are two types of characters in ASCII: printable characters and control characters.

The control characters occupy codes 0-31 and 127. They include all the parts of writing that allow for new paragraphs, tabs, ends of lines, and file separators, along with a lot of pieces that are mainly invisible.

These control characters were created at a time when punched cards were a big part of the computing process. Some of those features have since been replaced, but a lot of the line formatting parts are still around today. Code 127 is the code for delete.

All of the printable characters are what you might expect. There are all the lowercase characters (a-z) and uppercase characters (A-Z), along with numbers, symbols, and punctuation marks: essentially everything seen on a typical US keyboard. These principal characters make up all written English words.
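The split between the two groups is easy to express in code. Here is a minimal Python sketch; the helper name `is_ascii_control` is made up for this example, and it counts the space (code 32) among the printable characters:

```python
def is_ascii_control(code: int) -> bool:
    """True for ASCII control codes: 0-31 and 127 (delete)."""
    return 0 <= code <= 31 or code == 127


control = [c for c in range(128) if is_ascii_control(c)]
printable = [c for c in range(128) if not is_ascii_control(c)]

print(len(control))    # 33
print(len(printable))  # 95 (includes the space, code 32)
print(chr(printable[1]), chr(printable[-1]))  # ! ~
```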

Using ASCII in XML and HTML

Every HTML page has a character encoding format assigned to it.

UTF-8 is the standard encoding for HTML5, but a browser may fall back to a different encoding when none is declared. To use UTF-8, pure ASCII, ANSI, or any specialized format, all that needs to be done is to include a declaration in a meta tag.

For HTML 4:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

For HTML5:

<meta charset="UTF-8">

In the charset attribute, you can specify UTF-8, ANSI (with charset="windows-1252"), or ASCII (with charset="us-ascii"), or you could look up a specific character set to use, usually by declaring an ISO number. A full list can be found on the IANA character sets page.

Character Code Insertion Format

Generally, however, when someone refers to using the ASCII code, you will want them to clarify if they mean TRUE US-ASCII with a meta-tag, or if they are just asking you to display a special character.

In HTML, any time you want to use a special character, such as the cent symbol (¢) or an inverted question mark (¿), you can generally refer to it by its Unicode number (which, for these two characters, is the same as the 8-bit ISO-8859-1 number) by typing in a reference like this:

¢ in HTML looks like: &#0162;

¿ in HTML looks like: &#0191;

So you start with &#, followed by the character's decimal number (shown here padded to four digits, although the leading zeros are optional), and finish with a semicolon (;).

In this way, you are able to display characters based on their ASCII/Unicode number.

Of course, control characters will perform a formatting function or not work at all, depending on which one you use and which real character set you have listed in your meta tag.

So in HTML you see the "&#" number, but when displayed in your browser you will see the character.
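To see what a browser does with such a reference, you can decode one yourself. The sketch below uses Python's standard `html` module as a stand-in for a browser's parser; the helper name `numeric_reference` is made up for this example:

```python
import html

# Decoding: reference -> character
print(html.unescape("&#0162;"))  # ¢
print(html.unescape("&#162;"))   # ¢ (same result without the leading zero)
print(html.unescape("&#0191;"))  # ¿


def numeric_reference(char: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(char)};"


# Encoding: character -> reference
print(numeric_reference("¢"))  # &#162;
```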

HTML Special Entity Characters

Now, let's say for example you want to just show an & symbol on your page.

You can't reliably just type it into the HTML, but you can type in the corresponding ASCII or Unicode reference.

HTML is a markup language, so while normal letters work fine, special characters, and especially the < and > brackets, are critically important to how the browser reads and shows the HTML.
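If you generate HTML from a program, you normally let a library do this escaping for you. As one example (a sketch using Python's standard `html` module, with an arbitrary sample string), the markup-significant characters are replaced with safe references and can be converted back again:

```python
import html

raw = "Fish & chips <b>today</b>"

escaped = html.escape(raw)
print(escaped)  # Fish &amp; chips &lt;b&gt;today&lt;/b&gt;

# Unescaping restores the original text.
print(html.unescape(escaped) == raw)  # True
```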

You don't always need to type in the Unicode/ASCII reference number, though. For HTML 4.0 and newer, there are special entities which work similarly to a numeric reference, but instead of memorizing a number you memorize a word.

¢ in HTML looks like: &cent;

¿ in HTML looks like: &iquest;

A full list of these character references can be found at the World Wide Web Consortium (W3C).
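The named and numeric forms are interchangeable, which you can confirm programmatically. This sketch relies on Python's standard `html` and `html.entities` modules, whose entity table is assumed here to match what browsers accept for these common names:

```python
import html
from html.entities import name2codepoint

# The named entity and the numeric reference resolve to the same character.
print(html.unescape("&cent;") == html.unescape("&#162;"))  # True
print(html.unescape("&iquest;"))  # ¿

# name2codepoint maps an entity name to its Unicode number.
print(name2codepoint["cent"])    # 162
print(name2codepoint["iquest"])  # 191
```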

Reference Table

With all this lead-up, you might just be looking for an easy place to find an ASCII or Unicode reference. Look no further: we have references 000-127 here, and you can find the full Unicode format on Wikipedia.

Note that characters 000-032 and 127 are not generally printable and are thus indicated with "NA."

ASCII

Code Character HTML Code Character HTML Code Character HTML Code Character HTML
0 NA &#0000; 32 NA &#0032; 64 @ &#0064; 96 ` &#0096;
1 NA &#0001; 33 ! &#0033; 65 A &#0065; 97 a &#0097;
2 NA &#0002; 34 " &#0034; 66 B &#0066; 98 b &#0098;
3 NA &#0003; 35 # &#0035; 67 C &#0067; 99 c &#0099;
4 NA &#0004; 36 $ &#0036; 68 D &#0068; 100 d &#0100;
5 NA &#0005; 37 % &#0037; 69 E &#0069; 101 e &#0101;
6 NA &#0006; 38 & &#0038; 70 F &#0070; 102 f &#0102;
7 NA &#0007; 39 ' &#0039; 71 G &#0071; 103 g &#0103;
8 NA &#0008; 40 ( &#0040; 72 H &#0072; 104 h &#0104;
9 NA &#0009; 41 ) &#0041; 73 I &#0073; 105 i &#0105;
10 NA &#0010; 42 * &#0042; 74 J &#0074; 106 j &#0106;
11 NA &#0011; 43 + &#0043; 75 K &#0075; 107 k &#0107;
12 NA &#0012; 44 , &#0044; 76 L &#0076; 108 l &#0108;
13 NA &#0013; 45 - &#0045; 77 M &#0077; 109 m &#0109;
14 NA &#0014; 46 . &#0046; 78 N &#0078; 110 n &#0110;
15 NA &#0015; 47 / &#0047; 79 O &#0079; 111 o &#0111;
16 NA &#0016; 48 0 &#0048; 80 P &#0080; 112 p &#0112;
17 NA &#0017; 49 1 &#0049; 81 Q &#0081; 113 q &#0113;
18 NA &#0018; 50 2 &#0050; 82 R &#0082; 114 r &#0114;
19 NA &#0019; 51 3 &#0051; 83 S &#0083; 115 s &#0115;
20 NA &#0020; 52 4 &#0052; 84 T &#0084; 116 t &#0116;
21 NA &#0021; 53 5 &#0053; 85 U &#0085; 117 u &#0117;
22 NA &#0022; 54 6 &#0054; 86 V &#0086; 118 v &#0118;
23 NA &#0023; 55 7 &#0055; 87 W &#0087; 119 w &#0119;
24 NA &#0024; 56 8 &#0056; 88 X &#0088; 120 x &#0120;
25 NA &#0025; 57 9 &#0057; 89 Y &#0089; 121 y &#0121;
26 NA &#0026; 58 : &#0058; 90 Z &#0090; 122 z &#0122;
27 NA &#0027; 59 ; &#0059; 91 [ &#0091; 123 { &#0123;
28 NA &#0028; 60 < &#0060; 92 \ &#0092; 124 | &#0124;
29 NA &#0029; 61 = &#0061; 93 ] &#0093; 125 } &#0125;
30 NA &#0030; 62 > &#0062; 94 ^ &#0094; 126 ~ &#0126;
31 NA &#0031; 63 ? &#0063; 95 _ &#0095; 127 NA &#0127;

ASCII Tools and Resources

There is a lot of history on how character codes evolved, and the organizations which hold these standards together for the rest of us. With most internet developers and the W3C settling on UTF-8, for at least the immediate future, that is how pages will be encoded.

If you start manually encoding in other formats, though, you're going to need some resources to help you. It can also be nice just to have a comprehensive reference around.

List of Resources

ASCII Art

No summary of ASCII would be complete without a reference to ASCII art.

Special software can be used, or symbols hand-coded, to form the shape of an image using nothing but symbols. This type of effect has existed since the 1980s and was made popular on systems like the Commodore Amiga computer.

There is even a distinction between "Oldskool" ASCII art which uses pure ASCII in the command line, and "Newskool" which uses the special characters in Unicode to make even more complex works of art.

Here's a picture of a zebra's head:

ASCII Art Zebra

ISO-8859-1

ISO-8859-1 is a character encoding standard. It was first published by the International Organization for Standardization (ISO) in 1987, and revised in 1998, as an extension to ASCII.

ASCII and ISO-8859-1

The most famous character encoding standard is ASCII. ASCII used 7 bits of an eight-bit byte in order to encode the most basic 128 characters used for writing English. A number of system-specific uses were developed for the eighth (high-order) bit.

For example, one system used it to toggle between roman and italic printing styles. Other systems used it to encode additional characters. By using all eight bits, 256 characters can be encoded.

Since the original ASCII set didn't include a number of characters needed to write in common non-English languages (such as letters with diacritical marks), extending the character set to 256 greatly increased its capabilities.

ISO-8859-1 is one of those extensions. It was intended to be an international, cross-platform standard. Since it is a superset of standard 7-bit ASCII, it is backward-compatible: a document encoded in ASCII could easily be decoded using ISO-8859-1.
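The compatibility only runs in one direction, which a short experiment makes clear. This is a minimal Python sketch using built-in codecs; the sample strings are arbitrary:

```python
ascii_document = "Plain ASCII text".encode("ascii")

# The same bytes decode identically under ASCII and ISO-8859-1.
print(ascii_document.decode("iso-8859-1") == ascii_document.decode("ascii"))  # True

# The reverse is not guaranteed: ISO-8859-1 text may use byte values above 127.
latin1_document = "café".encode("iso-8859-1")
print(list(latin1_document))  # [99, 97, 102, 233]

try:
    latin1_document.decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False

print(decodes_as_ascii)  # False
```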

ISO-8859-1 and HTML

According to the standard, ISO-8859-1 was the default character encoding in HTML 4. However, most browsers supported a superset of ISO-8859-1, called ANSI (also known as Windows-1252).

ANSI assigns printable characters to most of the 32 code positions (128-159) that have no printable characters in ISO-8859-1. (Most of the time, when you see a list of ISO-8859-1 characters, it's actually the full ANSI list.)

Today, the HTML5 standard uses UTF-8, an encoding of the very large Unicode character set, which includes all the characters of the original ASCII, ISO-8859-1, and ANSI encodings.

However, most English-language HTML documents, even those explicitly declaring ISO-8859-1 or UTF-8 as their character set, actually use the smaller ASCII character set. There are two reasons for this:

  • ASCII can be typed on a standard QWERTY keyboard.

  • Many of the technologies used to generate HTML only support ASCII.

Since ISO-8859-1 and UTF-8 are both ASCII-compatible, this doesn't usually cause any problems.

ISO-8859-1 and Character Entities

The extended set of characters available in ISO-8859-1 can be produced in an ASCII-only document by using HTML character entities. These are strings that begin with the ampersand ("&") and terminate with a semicolon (";").

For example, the copyright symbol (the circle with a "C" in it) can be encoded directly using ISO-8859-1 or UTF-8. But since there is no "©" key on most keyboards, many people find it easier to type &copy;.

This is stored in the file as six ASCII characters: &, c, o, p, y, and ;. Web browsers then display the appropriate ISO-8859-1 character to the user.

Most of the non-ASCII ISO-8859-1 characters have named HTML character entities. Those that do not can be typed with their numerical code. The numerical code is actually the decimal (base 10) version of the binary encoding.

For example, the copyright symbol is encoded as 10101001 in binary, which is 169 in base 10. So you could type &copy; or &#169;.
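The conversion between the binary pattern, the decimal number, and the entity can be verified with a few lines of Python (a sketch using only built-ins and the standard `html` module):

```python
import html

code = int("10101001", 2)  # interpret the bit string as a base-2 number
print(code)                # 169
print(chr(code))           # ©

# Going the other way: character -> number -> 8-bit pattern
print(format(ord("©"), "08b"))  # 10101001

# Both HTML spellings decode to the same character.
print(html.unescape("&copy;") == html.unescape("&#169;"))  # True
```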

Non-ASCII Characters in ISO-8859-1 and ANSI

Characters 128-159 on this chart are ANSI characters not included in ISO-8859-1. The first 128 codes (0-127) in ISO-8859-1/ANSI are not included here, as they are identical to ASCII, which we've listed above.

Character HTML Name HTML Number Description
€ &euro; &#128; euro sign
‚ &sbquo; &#130; single low-9 quotation mark
ƒ &fnof; &#131; lowercase letter f with hook
„ &bdquo; &#132; double low-9 quotation mark
… &hellip; &#133; horizontal ellipsis
† &dagger; &#134; dagger
‡ &Dagger; &#135; double dagger
ˆ &circ; &#136; modifier letter circumflex accent
‰ &permil; &#137; per mille sign
Š &Scaron; &#138; capital letter S with caron
‹ &lsaquo; &#139; single left-pointing angle quotation mark
Œ &OElig; &#140; capital ligature OE
Ž   &#142; capital letter Z with caron
‘ &lsquo; &#145; left single quotation mark
’ &rsquo; &#146; right single quotation mark
“ &ldquo; &#147; left double quotation mark
” &rdquo; &#148; right double quotation mark
• &bull; &#149; bullet
– &ndash; &#150; en dash
— &mdash; &#151; em dash
˜ &tilde; &#152; tilde
™ &trade; &#153; trade mark sign
š &scaron; &#154; lowercase letter s with caron
› &rsaquo; &#155; single right-pointing angle quotation mark
œ &oelig; &#156; lowercase ligature oe
ž   &#158; lowercase letter z with caron
Ÿ &Yuml; &#159; capital letter Y with diaeresis
  &nbsp; &#160; non-breaking space
¡ &iexcl; &#161; inverted exclamation mark
¢ &cent; &#162; cent sign
£ &pound; &#163; pound sign (currency)
¤ &curren; &#164; currency sign
¥ &yen; &#165; yen/yuan sign
¦ &brvbar; &#166; broken vertical bar
§ &sect; &#167; section sign
¨ &uml; &#168; diaeresis
© &copy; &#169; copyright sign
ª &ordf; &#170; feminine ordinal indicator
« &laquo; &#171; left double angle quotation mark (guillemet)
¬ &not; &#172; not sign (logic)
­ &shy; &#173; soft/discretionary hyphen
® &reg; &#174; registered trade mark sign
¯ &macr; &#175; spacing macron / overline
° &deg; &#176; degree sign
± &plusmn; &#177; plus/minus sign
² &sup2; &#178; superscript two (squared)
³ &sup3; &#179; superscript three (cubed)
´ &acute; &#180; acute accent
µ &micro; &#181; micro sign
¶ &para; &#182; paragraph sign (pilcrow)
· &middot; &#183; middle dot
¸ &cedil; &#184; cedilla
¹ &sup1; &#185; superscript one
º &ordm; &#186; masculine ordinal indicator
» &raquo; &#187; right double angle quotation mark (guillemet)
¼ &frac14; &#188; one quarter fraction (1 over 4)
½ &frac12; &#189; one half fraction (1 over 2)
¾ &frac34; &#190; three quarters fraction (3 over 4)
¿ &iquest; &#191; inverted question mark
À &Agrave; &#192; capital letter A with grave accent
Á &Aacute; &#193; capital letter A with acute accent
Â &Acirc; &#194; capital letter A with circumflex
Ã &Atilde; &#195; capital letter A with tilde
Ä &Auml; &#196; capital letter A with diaeresis
Å &Aring; &#197; capital letter A with ring above
Æ &AElig; &#198; capital AE ligature
Ç &Ccedil; &#199; capital letter C with cedilla
È &Egrave; &#200; capital letter E with grave accent
É &Eacute; &#201; capital letter E with acute accent
Ê &Ecirc; &#202; capital letter E with circumflex
Ë &Euml; &#203; capital letter E with diaeresis
Ì &Igrave; &#204; capital letter I with grave accent
Í &Iacute; &#205; capital letter I with acute accent
Î &Icirc; &#206; capital letter I with circumflex
Ï &Iuml; &#207; capital letter I with diaeresis
Ð &ETH; &#208; capital letter ETH (also used as the Dogecoin symbol)
Ñ &Ntilde; &#209; capital letter N with tilde
Ò &Ograve; &#210; capital letter O with grave accent
Ó &Oacute; &#211; capital letter O with acute accent
Ô &Ocirc; &#212; capital letter O with circumflex
Õ &Otilde; &#213; capital letter O with tilde
Ö &Ouml; &#214; capital letter O with diaeresis
× &times; &#215; multiplication sign
Ø &Oslash; &#216; capital letter O slash
Ù &Ugrave; &#217; capital letter U with grave accent
Ú &Uacute; &#218; capital letter U with acute accent
Û &Ucirc; &#219; capital letter U with circumflex
Ü &Uuml; &#220; capital letter U with diaeresis
Ý &Yacute; &#221; capital letter Y with acute accent
Þ &THORN; &#222; capital letter THORN
ß &szlig; &#223; lowercase letter sharp s (Eszett / scharfes S)
à &agrave; &#224; lowercase letter a with grave accent
á &aacute; &#225; lowercase letter a with acute accent
â &acirc; &#226; lowercase letter a with circumflex
ã &atilde; &#227; lowercase letter a with tilde
ä &auml; &#228; lowercase letter a with diaeresis
å &aring; &#229; lowercase letter a with ring above
æ &aelig; &#230; lowercase ae ligature
ç &ccedil; &#231; lowercase letter c with cedilla (cé cédille)
è &egrave; &#232; lowercase letter e with grave accent
é &eacute; &#233; lowercase letter e with acute accent
ê &ecirc; &#234; lowercase letter e with circumflex
ë &euml; &#235; lowercase letter e with diaeresis
ì &igrave; &#236; lowercase letter i with grave accent
í &iacute; &#237; lowercase letter i with acute accent
î &icirc; &#238; lowercase letter i with circumflex
ï &iuml; &#239; lowercase letter i with diaeresis
ð &eth; &#240; lowercase letter eth
ñ &ntilde; &#241; lowercase letter n with tilde
ò &ograve; &#242; lowercase letter o with grave accent
ó &oacute; &#243; lowercase letter o with acute accent
ô &ocirc; &#244; lowercase letter o with circumflex
õ &otilde; &#245; lowercase letter o with tilde
ö &ouml; &#246; lowercase letter o with diaeresis
÷ &divide; &#247; division sign
ø &oslash; &#248; lowercase letter o with slash
ù &ugrave; &#249; lowercase letter u with grave accent
ú &uacute; &#250; lowercase letter u with acute accent
û &ucirc; &#251; lowercase letter u with circumflex
ü &uuml; &#252; lowercase letter u with diaeresis
ý &yacute; &#253; lowercase letter y with acute accent
þ &thorn; &#254; lowercase letter thorn
ÿ &yuml; &#255; lowercase letter y with diaeresis
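The difference between ANSI and ISO-8859-1 in the 128-159 range, and their agreement from 160 upward, can be checked directly. The sketch below uses Python's built-in `cp1252` codec, on the assumption that "ANSI" here means Windows-1252:

```python
euro_byte = bytes([128])

# In Windows-1252 ("ANSI"), byte 128 is the euro sign.
print(euro_byte.decode("cp1252"))  # €

# In ISO-8859-1, the same byte is a non-printing control code (U+0080).
print(repr(euro_byte.decode("iso-8859-1")))  # '\x80'

# From 160 upward the two encodings agree on every byte.
upper = bytes(range(160, 256))
print(upper.decode("cp1252") == upper.decode("iso-8859-1"))  # True
```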

Unicode

Unicode is a standard for character encoding managed by The Unicode Consortium.

As we've discussed, computer systems don't store characters (letters, numbers, symbols) literally; there's no tiny picture of each letter in a document on your hard drive. Instead, each character is encoded as a series of binary bits: 1s and 0s. For example, the code for the lowercase letter "a" is 01100001.
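You can inspect these bit patterns yourself. The following Python sketch prints the stored bits for a piece of ASCII text and reads them back; the helper names `to_bits` and `from_bits` are made up for this example:

```python
def to_bits(text: str) -> list:
    """Return the 8-bit pattern stored for each character of ASCII text."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


def from_bits(patterns: list) -> str:
    """Turn a list of 8-bit patterns back into text."""
    return bytes(int(p, 2) for p in patterns).decode("ascii")


print(to_bits("a"))   # ['01100001']
print(to_bits("Hi"))  # ['01001000', '01101001']
print(from_bits(["01001000", "01101001"]))  # Hi
```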

But 01100001 is arbitrary — there's nothing special about that string of bits that should make it the letter "a" — the computer industry has collectively agreed that it means "a." So how does the entire industry come to agree on how to represent every possible character? With a character encoding standard. An encoding standard simply specifies all the possible characters available, and assigns each one a string of bits.

There have been several character encoding standards used around the world over the last several decades of computing. For a long time, the most universally-accepted standard was ASCII. The problem with ASCII is that it only encoded a relatively limited number of characters — 256 at most. This excluded non-Latin languages, many important math and science symbols, and even some basic punctuation marks.

Aside from ASCII's use in English and other languages which use the Latin alphabet, language groups using other alphabets tended to use their own character encoding. Since these encoding schemes were defined apart from each other, they often conflicted; it was impossible to use a single encoding scheme for multiple languages at the same time.
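The conflict is easy to reproduce: one and the same byte means a different letter under each regional encoding. This Python sketch decodes a single byte with three of Python's built-in codecs, chosen here purely as examples of Western European, Cyrillic, and Greek encodings:

```python
raw = bytes([0xE9])  # one byte, value 233

decoded = {
    encoding: raw.decode(encoding)
    for encoding in ("iso-8859-1", "cp1251", "iso-8859-7")
}

for encoding, char in decoded.items():
    print(encoding, char)
# iso-8859-1 é
# cp1251 й
# iso-8859-7 ι
```

Without knowing which encoding the author used, a program cannot tell which of the three letters was intended.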

Unicode was originally conceived, and continues to be developed, specifically with the intent to overcome these challenges. The goal of Unicode is to provide a universal, unified, and unique code identifier for every grapheme in every language and writing system in the world.

UTF-8

Unicode has been implemented in several character encoding schemes, but the standard most widely used today is UTF-8. UTF-8 has become nearly universal for all types of modern computing.

UTF-8 encodes each character using one to four 8-bit code units (bytes). ASCII used only 7 bits per character, normally stored in a single 8-bit byte. Unicode characters previously included in ASCII are represented in UTF-8 by a single byte, the same byte that was used in ASCII. This makes ASCII text forward-compatible with UTF-8. (This is one of the many reasons that UTF-8 became the universal standard: transition was relatively easy.)
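The variable length is easy to observe by encoding a few characters and counting the bytes. This is a small Python sketch (it needs Python 3.8 or later for the separator argument to `bytes.hex`), with four sample characters chosen to hit each length:

```python
samples = {
    "a": "Latin letter, also in ASCII",
    "¢": "cent sign",
    "€": "euro sign",
    "\U0001F600": "grinning face emoji",
}

for char, description in samples.items():
    encoded = char.encode("utf-8")
    print(f"{description}: {len(encoded)} byte(s), {encoded.hex(' ')}")
# Latin letter, also in ASCII: 1 byte(s), 61
# cent sign: 2 byte(s), c2 a2
# euro sign: 3 byte(s), e2 82 ac
# grinning face emoji: 4 byte(s), f0 9f 98 80
```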

The 8×4 scheme provides UTF-8 with over a million code points, allowing Unicode to encode characters from 129 scripts and writing systems.

Resources for Understanding Unicode

Books on Unicode

  • Unicode Explained, by Jukka Korpela, provides a good overview of Unicode and various development challenges that come with implementing it
  • Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, by Richard Gillam, is a helpful, if somewhat dated, explanation of Unicode, with a lot of Java-focused implementation specifics
  • Fonts and Encodings, by Yannis Haralambous, is not solely about Unicode, but might be the book most worth reading; it covers the history of encoding and representing text in computers, providing both a theoretical and practical foundation for understanding Unicode and a number of closely related subjects.

Unicode Reference Material

Once you have a basic understanding of Unicode, you'll mostly find yourself needing to look up specific details — such as the exact encoding of a particular character.

  • The C/C++ Unicode Cheatsheet provides info on converting Microsoft C/C++ to Unicode
  • XML and Unicode Technology Reports is a list of technical reports covering various aspects of using XML and Unicode together
  • Decode Unicode provides an online Unicode dictionary with a beautiful UI, allowing you to view every defined Unicode character, even without local font support
  • Data on Languages provides searchable information on using Unicode character sets with various languages
  • Unicode Navigator provides an organized list of all the Unicode characters

Unicode Tools

  • Unicode Analyzer is a Chrome browser extension that provides information on Unicode text in web pages and documents
  • Character Identifier is a Firefox plugin that provides a context menu for finding more information about selected Unicode characters
  • For inserting Unicode characters in text fields on the web, try Unicode Symbols for Chrome or Unicode Input Tool for Firefox
  • UnicodeDataBrowser provides a GUI for easier reading of the UnicodeData.txt file
  • Polyglot 3000 automatically identifies the language of any text
  • Unicode provides a list of Unicode character keyboard layouts for various Unicode-supported scripts
  • Babel is a Python library for a wide range of internationalization and localization tasks
  • D-Type Unicode Text Engine is a C++ library for laying out, rendering, and editing high-quality Unicode text on any device, platform, or operating system
  • Nunicode is a C library for encoding and decoding of UTF-8 documents
  • Portable UTF-8 provides Unicode support for PHP strings
  • Tesseract OCR provides optical character recognition for Unicode text
  • Zvon Unicode Reference lets you input (type or paste) any Unicode character and then outputs detailed information about the character, including its Unicode numbers, HTML and MathML entity names, TeX instructions, and usability issues
  • Popchar is an improved character map that lets you easily find and type characters from the whole range of the Unicode space
  • Unicode Utilities provides a number of interesting and useful online tools for working with Unicode
  • Edicode provides a flexible online Unicode keyboard for typing text using various international scripts
  • Quickkey is a flexible keyboard extension for typing the first 65,000 defined Unicode characters
  • Unicode Code Converter converts any entered character code into several different encodings of the same character
  • CharFunk is a JavaScript utility for performing a number of interesting checks and operations on Unicode characters
  • Kreative Recode transforms text files from various encodings into Unicode
  • BabelMap Online provides an in-browser Unicode keyboard, with output in display characters as well as hex or decimal encoding

Text and Code Editors

Most of today's text editors, code editors, and IDEs either use Unicode by default, or can easily handle Unicode. Sublime, Notepad++, Atom, and Eclipse are all set to UTF-8 as the default character encoding. Vim and Emacs may need a setting change to use UTF-8.

There are also a handful of code and text editors specifically designed to handle the extended Unicode character set:

  • MinEd is a Unicode text editor with contextual support for inserting characters from the full range of the Unicode character space
  • Classical Text Editor is an advanced editor for working with critical and scholarly editions of texts, including multi-lingual texts using a wide range of the Unicode character set

Unicode Fonts

The relationship between fonts and Unicode is a bit oblique. Unicode was created to be backward-compatible with ASCII — text formatted in ASCII can be decoded as Unicode with virtually no problem. And Unicode-encoded text can be displayed using ASCII fonts, as long as only the small set of characters that appear in ASCII are used.

Today, most fonts available on most computers are encoded with Unicode. So, from that standpoint, most fonts are "Unicode fonts." However, most fonts do not support a particularly large set of the full Unicode standard.

Usually, this is not a problem; someone authoring a text in multiple languages, or with an extended character set, might use several different fonts: one for Latin script, another for each CJK language, and another for math symbols (for example). However, it can be useful sometimes to have single fonts which contain a large percentage of the Unicode character space. This might be needed when working in plain text and source code environments where using multiple fonts is not feasible, or when visual unity between multiple scripts is especially important.

The following are the most notable font projects providing extended Unicode support. For a more complete listing, including defunct and deprecated fonts, see this page of Unicode fonts. For typesetting Asian languages, see this list of CJK fonts.

  • Everson Mono is a monospace font created by one of the originators of the Unicode standard; its stated purpose is to provide glyphs for as much of the Unicode character space as possible, and (as of this writing) 92 Unicode character blocks are supported.
  • Noto is a large set of display fonts, developed by Google, which together provide support for a large majority of the Unicode character sets, with the intention to eventually support the entire Unicode standard.
  • Deja Vu Fonts is a font family providing wide coverage of the Unicode standard, with Serif, Sans, and Monospace versions.
  • GNU FreeFont is a family of fonts, providing Serif, Sans, and Mono type faces for 37 writing systems and 12 Unicode symbol ranges.
  • GNU Unifont is a monospace, bitmap font with complete coverage for the Unicode 8.0 Basic Multilingual Plane and wide, but incomplete, coverage for the Supplemental Multilingual Plane.

There are also a number of interesting fonts which encode a particular subset of the Unicode standard for specialized use.

  • Junicode is a set of fonts for Medievalists
  • Last Resort is a "font of last resort"; instead of conventional character glyphs, each glyph actually displays information about the Unicode character itself
  • Unicode Fonts for Ancient Scripts is a project to create a set of fonts for several ancient and classical alphabets
  • Unimath Plus provides an extended set of science and math symbols

Emoji Resources

Emoji are those funny little smiley faces and thumbs up signs that you can put in your text messages. They are actually part of the Unicode standard. The Emoji portion of Unicode is not universally supported, so if you want to incorporate Emoji into your app or website, you may need some help. Here are resources that will help you use and build with Unicode emoji.

Emoji Reference

  • Emojipedia is a searchable database of Emoji characters
  • Can I Emoji? provides information on native support for Unicode emoji on iOS, Android, OS X, and Windows, as well as major browsers
  • WTF Emoji Foundation is a slightly serious organization dedicated to the advancement of emoji; they run the Emoji Dictionary.
  • Emoji cheat sheet provides a quick reference for Emoji type-in codes

Summary

Most people just type and don't really think much about what is happening. A select few bother to think about the niceties of font design and typography.

But even smaller is the number of people who know, or care to know, what happens behind the scenes — how a keypress becomes a letter on their computer screen.

To everyone else, it is either transparent or trivial.

But as we've shown, the process of representing language is hardly trivial, and a huge amount of work has gone into making it as transparent as it is. The Unicode Consortium, along with countless developers, designers, and linguists, has made it possible for anyone to write any character, from any language, in any script, on any computer.

This is a notable achievement, and a necessary step toward universal literacy and universal access to computers and the internet.

FAQ

Q. What is the difference between ASCII, Unicode, and UTF-8?

A. ASCII is the older standard from the 1960s, whereas Unicode came into existence in the late 1980s.

ASCII is only 128 characters (or 256 in its 8-bit extensions), but Unicode has well over 100,000.

Unicode is the character table; UTF-8 (or UTF-16 or UTF-32) is the encoding. Unicode's first 128 code points (0-127) are identical to ASCII, and its first 256 (0-255) match ISO-8859-1.

UTF-8 is the most common encoding on the web today — and the default.

Q. Do I need to declare what encoding type I'm using for my web page?

A. It is a good idea, and it is essential if you need to use an unusual encoding type.

If you don't declare one, the browser has to guess the encoding, and the guess is not always UTF-8. If you are creating a webpage in a language that does not use the Latin alphabet, make sure that you are either using UTF-8 or else declaring a suitable charset.

Q. Do I need to memorize any ASCII codes to write HTML?

A. Only if you're trying to be extremely efficient.

Most websites today are dynamic and generate the HTML for you, through systems like a content management system (CMS). If you are a developer, you will probably be using other programming languages in addition to HTML, and those languages might have special ways of generating those ASCII symbols.

Finally, as discussed above, many of those codes use special character names in HTML instead of ASCII numbers.

Q. Does the character encoding differ on different operating systems?

A. Somewhat.

The Unicode character set is the same everywhere, but the preferred encoding differs between Windows and Unix/Linux. For example, Windows uses UTF-16LE internally, while Linux normally uses UTF-8.
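The same two characters produce different bytes under each encoding, even though the text itself is identical. Here is a small Python sketch of that difference (it needs Python 3.8 or later for the separator argument to `bytes.hex`; the sample string is arbitrary):

```python
text = "Aé"

utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")

print(utf8_bytes.hex(" "))     # 41 c3 a9
print(utf16le_bytes.hex(" "))  # 41 00 e9 00

# Decoding each with the matching codec gives back the same text.
print(utf8_bytes.decode("utf-8") == utf16le_bytes.decode("utf-16-le"))  # True
```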

Now, of course, the encoding used by your operating system might differ from the encoding on a webpage, but your OS and the web-browser work together to convert the character codes into something your computer can display.

Sometimes, in older operating systems, this conversion might not work and you would just see blank characters. (For example, it's something you might see visiting a foreign website on Windows XP.)

Q. ASCII Art is awesome! Where can I make my own?

A. AsciiWorld.com has some great galleries and tools in their software section, such as converters and "painters." Have fun!



Text written by Tom Riecken with additional material by Adam Michael Wood. Compiled and edited by Frank Moraes. ASCII Zebra is in the public domain. Growth of UTF-8 by Krauss is licensed under CC BY-SA 4.0.