Unicode Introduction and Resources

Unicode is a standard for character encoding managed by The Unicode Consortium.

A computer system doesn't store characters (letters, numbers, symbols) literally — there's no tiny picture of each letter in a document on your hard drive. As you probably know, each character is encoded as a series of binary bits: 1s and 0s. For example, the code for the lowercase letter "a" is 01100001.

But 01100001 is arbitrary — there's nothing special about that string of bits that should make it the letter "a" — the computer industry has collectively agreed that it means "a." So how does the entire industry come to agree on how to represent every possible character? With a character encoding standard. An encoding standard simply specifies all the possible characters available, and assigns each one a string of bits.
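As a concrete sketch of that mapping, Python exposes the agreed-upon character codes directly through the built-in `ord` and `chr` functions:

```python
# Every character maps to an agreed-upon number, and that number
# is what actually gets stored as bits.
for ch in "abc":
    code = ord(ch)                      # the character's numeric code
    print(ch, code, format(code, "08b"))  # character, number, bits

# The lowercase letter "a" is code 97, stored as the bits 01100001.
assert format(ord("a"), "08b") == "01100001"
assert chr(0b01100001) == "a"
```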

There have been several character encoding standards used around the world over the last several decades of computing. For a long time, the most universally accepted standard was ASCII. The problem with ASCII is that it encoded only a limited number of characters: 128 in the original 7-bit standard, and at most 256 in its extended variants. This excluded non-Latin languages, many important math and science symbols, and even some basic punctuation marks.

Aside from ASCII's use in English and other languages which use the Latin alphabet, language groups using other alphabets tended to use their own character encodings. Since these encodings were defined independently of each other, they often conflicted; it was impossible to use a single encoding scheme for multiple languages at the same time.

Unicode was originally conceived, and continues to be developed, specifically with the intent to overcome these challenges. The goal of Unicode is to provide a universal, unified, and unique code identifier for every grapheme in every language and writing system in the world.


Unicode has been implemented in several character encoding schemes, but the standard most widely used today is UTF-8. UTF-8 has become nearly universal for all types of modern computing.

UTF-8 encodes each character using one to four 8-bit bytes. ASCII used only 7 bits per character, stored in a single byte. Characters inherited from ASCII are represented in UTF-8 by a single byte holding the same value ASCII used, so any valid ASCII text is also valid UTF-8. (This backward compatibility is one of the many reasons that UTF-8 became the universal standard — the transition was relatively easy.)

The four-byte scheme provides UTF-8 with more than a million code points, allowing Unicode to encode characters from 129 scripts and writing systems.
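The variable-length scheme is easy to see in Python, where `str.encode` returns the actual UTF-8 bytes for a string:

```python
# ASCII characters take one byte in UTF-8; characters outside
# the ASCII range take two, three, or four bytes.
for ch in ["a", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())

assert len("a".encode("utf-8")) == 1    # same single byte as ASCII
assert len("é".encode("utf-8")) == 2    # Latin-1 Supplement: two bytes
assert len("€".encode("utf-8")) == 3    # Currency Symbols: three bytes
assert len("😀".encode("utf-8")) == 4   # emoji: four bytes
```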

Resources for Understanding Unicode

Books on Unicode

  • Unicode Explained, by Jukka Korpela, provides a good overview of Unicode and various development challenges that come with implementing it;
  • Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, by Richard Gillam, is a helpful, if somewhat dated, explanation of Unicode, with a lot of Java-focused implementation specifics;
  • Fonts and Encodings, by Yannis Haralambous, is not solely about Unicode, but might be the book most worth reading; it covers the history of encoding and representing text in computers, providing both a theoretical and practical foundation for understanding Unicode and a number of closely related subjects.

Unicode Reference Material

Once you have a basic understanding of Unicode, you'll mostly find yourself needing to look up specific details — such as the exact encoding of a particular character.

  • The C/C++ Unicode Cheatsheet provides info on converting Microsoft C/C++ to Unicode;
  • XML and Unicode Technology Reports is a list of technical reports covering various aspects of using XML and Unicode together;
  • Decode Unicode provides an online Unicode dictionary with a beautiful UI, allowing you to view every defined Unicode character, even without local font support;
  • Data on Languages provides searchable information on using Unicode character sets with various languages;
  • Unicode Navigator provides an organized list of all the Unicode characters;


  • Unicode Analyzer is a Chrome browser extension that provides information on Unicode text in web pages and documents;
  • Character Identifier is a Firefox plugin that provides a context menu for finding more information about selected Unicode characters;
  • For inserting Unicode characters in text fields on the web, try Unicode Symbols for Chrome or Unicode Input Tool for Firefox;
  • UnicodeDataBrowser provides a GUI for easier reading of the UnicodeData.txt file;
  • Polyglot 3000 automatically identifies the language of any text;
  • Unicode provides a list of Unicode character keyboard layouts for various Unicode-supported scripts;
  • AutoUniConv is an automatic Unicode converter, converting input in any character encoding into Unicode;
  • Babel is a Python library for a wide range of internationalization and localization tasks;
  • D-Type Unicode Text Engine is a C++ library for laying out, rendering, and editing high-quality Unicode text on any device, platform, or operating system;
  • Nunicode is a C library for encoding and decoding of UTF-8 documents;
  • OpenTop provides Unicode support for C++ and its standard library;
  • Portable UTF-8 provides Unicode support for PHP strings;
  • Tesseract OCR provides optical character recognition for Unicode text;
  • Zvon Unicode Reference lets you input (type or paste) any Unicode character and then outputs detailed information about the character, including its Unicode numbers, HTML and MathML entity names, TeX instructions, and usability issues;
  • Popchar is an improved character map that lets you easily find and type characters from the whole range of the Unicode space;
  • Unicode Utilities provides a number of interesting and useful online tools for working with Unicode;
  • Edicode provides a flexible online Unicode keyboard for typing text using various international scripts;
  • Quickkey is a flexible keyboard extension for typing the first 65,000 defined Unicode characters;
  • Unicode Code Converter converts any entered character code into several different encodings of the same character;
  • CharFunk is a JavaScript utility for performing a number of interesting checks and operations on Unicode characters;
  • Kreative Recode transforms text files from various encodings into Unicode;
  • BabelMap Online provides an in-browser Unicode keyboard, with output in display characters as well as hex or decimal encoding;

Text and Code Editors

Most of today's text editors, code editors, and IDEs either use Unicode by default, or can easily handle Unicode. Sublime, Notepad++, Atom, and Eclipse all use UTF-8 as the default character encoding. Vim and Emacs may need a setting change to use UTF-8.
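For reference, the usual settings look like the following (assuming a standard `~/.vimrc` and Emacs init file; recent versions of both editors often default to UTF-8 already):

```
" In ~/.vimrc (Vim):
set encoding=utf-8

;; In ~/.emacs or init.el (Emacs):
(prefer-coding-system 'utf-8)
```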

There are also a handful of code and text editors specifically designed to handle the extended Unicode character set:

  • MinEd is a Unicode text editor with contextual support for inserting characters from the full range of the Unicode character space;
  • Classical Text Editor is an advanced editor for working with critical and scholarly editions of texts, including multi-lingual texts using a wide range of the Unicode character set;

Unicode Fonts

The relationship between fonts and Unicode is a bit oblique. Unicode was created to be backwards compatible with ASCII — text formatted in ASCII can be decoded as Unicode with virtually no problem. And Unicode-encoded text can be displayed using ASCII fonts, as long as only the small set of characters that appear in ASCII are used.
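That compatibility is easy to demonstrate: the same bytes decode identically whether interpreted as ASCII or as UTF-8. A quick Python sketch:

```python
# ASCII text is already valid UTF-8: the identical byte sequence
# decodes to the same string under either interpretation.
data = b"Hello, world!"            # plain ASCII bytes
assert data.decode("ascii") == data.decode("utf-8")
print(data.decode("utf-8"))        # Hello, world!
```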

Today, most fonts available on most computers are encoded with Unicode. So, from that standpoint, most fonts are "Unicode fonts." However, most fonts do not support a particularly large set of the full Unicode standard.

Usually, this is not a problem; someone authoring a text in multiple languages, or with an extended character set, might use several different fonts — one for Latin script, another for CJK languages, and another for math symbols (for example). However, it can sometimes be useful to have a single font which covers a large percentage of the Unicode character space. This might be needed when working in plain text and source code environments where using multiple fonts is not feasible, or when visual unity between multiple scripts is especially important.

The following are the most notable font projects providing extended Unicode support. For a more complete listing, including defunct and deprecated fonts, see this page of Unicode fonts. For typesetting Asian languages, see this list of CJK fonts.

  • Everson Mono is a monospace font created by one of the originators of the Unicode standard; its stated purpose is to provide glyphs for as much of the Unicode character space as possible, and (as of this writing) 92 Unicode character blocks are supported;
  • Noto is a large set of display fonts, developed by Google, which together provide support for a large majority of the Unicode character set, with the intention of eventually supporting the entire Unicode standard;
  • DejaVu Fonts is a font family providing wide coverage of the Unicode standard, with Serif, Sans, and Monospace versions;
  • GNU FreeFont is a family of fonts, providing Serif, Sans, and Mono type faces for 37 writing systems and 12 Unicode symbol ranges;
  • GNU Unifont is a monospace, bitmap font with complete coverage for the Unicode 8.0 Basic Multilingual Plane and wide, but incomplete, coverage for the Supplemental Multilingual Plane.

There are also a number of interesting fonts which encode a particular subset of the Unicode standard for specialized use.

  • Junicode is a set of fonts for Medievalists;
  • Last Resort is a "font of last resort;" instead of conventional character glyphs, each glyph actually displays information about the Unicode character itself;
  • Unicode Fonts for Ancient Scripts is a project to create a set of fonts for several ancient and classical alphabets;
  • Unimath Plus provides an extended set of science and math symbols;

If you still can't find what you are looking for, many additional Unicode font resources are available online.

Emoji Resources

Emoji are those funny little smiley faces and thumbs up signs that you can put in your text messages. They are actually part of the Unicode standard. The Emoji portion of Unicode is not universally supported, so if you want to incorporate Emoji into your app or website, you may need some help. Here are resources that will help you use and build with Unicode emoji.
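Under the hood, an emoji is just another Unicode code point. For example, in Python (using U+1F44D THUMBS UP SIGN):

```python
# Emoji are ordinary Unicode characters with their own code points.
thumbs_up = "\U0001F44D"            # U+1F44D THUMBS UP SIGN
print(hex(ord(thumbs_up)))          # 0x1f44d

assert ord(thumbs_up) == 0x1F44D
# Like other characters above U+FFFF, it takes four bytes in UTF-8.
assert len(thumbs_up.encode("utf-8")) == 4
```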

Emoji Reference

  • Emojipedia is a searchable database of Emoji characters;
  • Can I Emoji? provides information on native support for Unicode emoji on iOS, Android, OS X, and Windows, as well as major browsers;
  • WTF Emoji Foundation is a slightly serious organization dedicated to the advancement of emoji; they run the Emoji Dictionary.
  • Emoji cheat sheet provides a quick reference for Emoji type-in codes.


Most people just type and don't really think much about what is happening. A select few bother to think about the niceties of font design and typography. But fewer still know, or care to know, what happens behind the scenes — how a keypress becomes a letter on their computer screen. To everyone else, it is either transparent or trivial.

But this process of representing language is hardly trivial, and a huge amount of work has gone into making it as transparent as it is. The Unicode Consortium, along with countless developers, designers, and linguists, has made it possible for anyone to write any character, from any language, in any script, on any computer. This is a notable achievement, and a necessary step toward universal literacy and universal access to computers and the internet.

Further Reading and Resources

We have more guides, tutorials, and infographics related to coding and development:

Before Unicode, it was common to visit websites where all the text was represented by empty boxes. Things have changed a lot. In our infographic Web Design Trends You'll Never Forget we run through how the web used to be.