Unicode Introduction and Resources
A computer system doesn't store characters (letters, numbers, symbols) literally; there's no tiny picture of each letter in a document on your hard drive. As you probably know, each character is encoded as a series of binary bits: 1s and 0s. For example, the code for the lowercase letter "a" is 01100001. But 01100001 is arbitrary: there's nothing special about that string of bits that should make it the letter "a"; the computer industry has simply agreed, collectively, that it means "a." So how does the entire industry come to agree on how to represent every possible character? With a character encoding standard. An encoding standard simply specifies all the possible characters available, and assigns each one a string of bits.
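To make the mapping concrete, here is a minimal Python sketch of the round trip between a character and its bit string. The helper names `char_to_bits` and `bits_to_char` are made up for this illustration; only `ord`, `chr`, `format`, and `int` are Python built-ins.

```python
def char_to_bits(ch: str) -> str:
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)  # numeric code assigned to the character
    if code > 0x7F:
        raise ValueError("this simple 8-bit view only covers ASCII characters")
    return format(code, "08b")  # zero-padded to 8 binary digits


def bits_to_char(bits: str) -> str:
    """Return the character whose code is the given binary string."""
    return chr(int(bits, 2))


print(char_to_bits("a"))         # 01100001
print(bits_to_char("01100001"))  # a
```

Nothing in the bits themselves says "a"; the program only produces the letter because it follows the same agreed-upon table as everyone else.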
There have been several character encoding standards used around the world over the last several decades of computing. For a long time, the most universally-accepted standard was ASCII. The problem with ASCII is that it only encoded a relatively limited number of characters: 128 in the original 7-bit standard, and 256 at most in the 8-bit extensions built on top of it. This excluded non-Latin languages, many important math and science symbols, and even some basic punctuation marks.
Aside from ASCII's use in English and other languages which use the Latin alphabet, language groups using other alphabets tended to use their own character encodings. Since these encodings were defined apart from each other, they often conflicted; it was impossible to use a single encoding scheme for multiple languages at the same time.
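The conflict is easy to reproduce. The sketch below decodes one and the same byte with three legacy encodings that ship with Python; the choice of byte (0xE9) and of these three encodings is just an example picked for this illustration.

```python
# One byte, three legacy single-byte encodings.
raw = bytes([0xE9])

# Codec names as registered in Python's standard library.
LEGACY_ENCODINGS = ["latin-1", "cp1251", "iso8859_7"]

decoded = {name: raw.decode(name) for name in LEGACY_ENCODINGS}

for name, ch in decoded.items():
    print(f"{name:>10}: {ch} (U+{ord(ch):04X})")

# Output:
#    latin-1: é (U+00E9)   Western European
#     cp1251: й (U+0439)   Cyrillic
#  iso8859_7: ι (U+03B9)   Greek
```

A file containing that byte carries no hint about which reading is correct, which is why mixing languages in one document was so fragile.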
Unicode was originally conceived, and continues to be developed, specifically with the intent to overcome these challenges. The goal of Unicode is to provide a universal, unified, and unique code identifier for every grapheme in every language and writing system in the world.
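As a small illustration of "one identifier per character," the following sketch prints the Unicode code point and official name for characters from four different scripts held in a single string. The sample string and the helper name `code_point_label` are assumptions made for this example; `unicodedata` is part of Python's standard library.

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a character's code point in the conventional U+XXXX style."""
    return f"U+{ord(ch):04X}"


# Latin, Greek, Cyrillic, and Japanese hiragana side by side.
text = "a\u03a9\u0436\u3042"  # a, Ω, ж, あ

for ch in text:
    print(code_point_label(ch), ch, unicodedata.name(ch))

# Output:
# U+0061 a LATIN SMALL LETTER A
# U+03A9 Ω GREEK CAPITAL LETTER OMEGA
# U+0436 ж CYRILLIC SMALL LETTER ZHE
# U+3042 あ HIRAGANA LETTER A
```

Because every character has its own number, the four scripts coexist in one string without any switching between encodings.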
Unicode has been implemented in several character encoding schemes, but the standard most widely used today is UTF-8. UTF-8 has become nearly universal for all types of modern computing.
UTF-8 encodes each character using one to four 8-bit units (bytes). ASCII used only 7 bits per character, normally stored in a single byte. Unicode characters previously included in ASCII are represented in UTF-8 by a single byte, the same byte value that was used in ASCII. This makes ASCII text forward-compatible with UTF-8. (This is one of the many reasons that UTF-8 became the universal standard: transition was relatively easy.)
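Both points can be checked in a few lines of Python. This is only a sketch: the four sample characters were picked for this example so that each UTF-8 length appears once, and `utf8_hex` is a helper name invented here.

```python
# One sample character per UTF-8 length: a, é, €, and the grinning-face emoji.
SAMPLES = ["a", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))


for ch in SAMPLES:
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {size} byte(s): {utf8_hex(ch)}")

# Output:
# U+0061 -> 1 byte(s): 61
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F600 -> 4 byte(s): f0 9f 98 80

# ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```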
This variable-length scheme of up to four bytes gives UTF-8 room for over a million code points (U+0000 through U+10FFFF), allowing Unicode to encode characters from 129 scripts and writing systems.
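The "over a million" figure follows from how the bits are laid out. The sketch below is a hand-written encoder, included only to show the layout; in practice you would call `str.encode("utf-8")`. The function name `utf8_encode` is an assumption for this example, and the byte patterns in the comments follow the published UTF-8 definition.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, not optimized)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:          # 0xxxxxxx (7 payload bits, same as ASCII)
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx (11 payload bits)
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx (16 bits)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits)
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Unicode stops at U+10FFFF, so the code space holds 0x110000 code points.
TOTAL_CODE_POINTS = 0x10FFFF + 1
print(TOTAL_CODE_POINTS)  # 1114112
```

Of the 32 bits in a four-byte sequence, 11 are fixed marker bits, leaving 21 for the character itself. That would allow about two million values, but Unicode caps the range at U+10FFFF, which gives the 1,114,112 code points counted above.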
Resources for Understanding Unicode
- An Introduction to Writing Systems and Unicode is a very thorough, even eloquent, explanation of character encoding generally, and Unicode in particular; if you can only read one thing on Unicode, this is the one to read;
- The Unicode Standard: A Technical Introduction is the official explanation of the Unicode standard;
- To the BMP and Beyond! is a tutorial on Unicode, suitable for classroom presentation or self-study;
- The Unicode Tutorial explains how Unicode works, including interesting details like combining characters, and how a Unicode parsing engine should function.
Books on Unicode
- Unicode Explained, by Jukka Korpela, provides a good overview of Unicode and various development challenges that come with implementing it;
- Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, by Richard Gillam, is a helpful, if somewhat dated, explanation of Unicode, with a lot of Java-focused implementation specifics;
- Fonts and Encodings, by Yannis Haralambous, is not solely about Unicode, but might be the book most worth reading; it covers the history of encoding and representing text in computers, providing both a theoretical and practical foundation for understanding Unicode and a number of closely related subjects.
Unicode Reference Material
Once you have a basic understanding of Unicode, you'll mostly find yourself needing to look up specific details — such as the exact encoding of a particular character.
- The C/C++ Unicode Cheatsheet provides info on converting Microsoft C/C++ to Unicode;
- XML and Unicode Technology Reports is a list of technical reports covering various aspects of using XML and Unicode together;
- Decode Unicode provides an online Unicode dictionary with a beautiful UI, allowing you to view every defined Unicode character, even without local font support;
- Data on Languages provides searchable information on using Unicode character sets with various languages;
- Unicode Navigator provides an organized list of all the Unicode characters;
- Unicode Analyzer is a Chrome browser extension that provides information on Unicode text in web pages and documents;
- Character Identifier is a Firefox plugin that provides a context menu for finding more information about selected Unicode characters;
- For inserting Unicode characters in text fields on the web, try Unicode Symbols for Chrome or Unicode Input Tool for Firefox;
- UnicodeDataBrowser provides a GUI for easier reading of the UnicodeData.txt file;
- Polyglot 3000 automatically identifies the language of any text;
- Unicode provides a list of Unicode character keyboard layouts for various Unicode-supported scripts;
- AutoUniConv is an automatic Unicode converter, converting input in any character encoding into Unicode;
- Babel is a Python library for a wide range of internationalization and localization tasks;
- D-Type Unicode Text Engine is a C++ library for laying out, rendering, and editing high-quality Unicode text on any device, platform, or operating system;
- Nunicode is a C library for encoding and decoding of UTF-8 documents;
- OpenTop provides Unicode support for C++ and its standard library;
- Portable UTF-8 provides Unicode support for PHP strings;
- Tesseract OCR provides optical character recognition for Unicode text;
- Zvon Unicode Reference lets you input (type or paste) any Unicode character and then outputs detailed information about the character, including its Unicode numbers, HTML and MathML entity names, TeX instructions, and usability issues;
- Popchar is an improved character map that lets you easily find and type characters from the whole range of the Unicode space;
- Unicode Utilities provides a number of interesting and useful online tools for working with Unicode;
- Edicode provides a flexible online Unicode keyboard for typing text using various international scripts;
- Quickkey is a flexible keyboard extension for typing the first 65,000 defined Unicode characters;
- Unicode Code Converter converts any entered character code into several different encodings of the same character;
- Kreative Recode transforms text files from various encodings into Unicode;
- BabelMap Online provides an in-browser Unicode keyboard, with output in display characters as well as hex or decimal encoding.
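For a quick lookup without leaving the terminal, Python's standard `unicodedata` module can answer the same kind of question as many of the tools above. This is a minimal sketch; the function name `describe` and the set of fields it returns are choices made for this example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect basic reference details for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),  # fallback if no name exists
        "category": unicodedata.category(ch),       # general category, e.g. "Sc"
        "utf8": ch.encode("utf-8").hex(),
        "utf16_be": ch.encode("utf-16-be").hex(),
    }


for field, value in describe("\u20ac").items():  # the euro sign
    print(f"{field:>10}: {value}")

# Output:
# code_point: U+20AC
#       name: EURO SIGN
#   category: Sc
#       utf8: e282ac
#   utf16_be: 20ac
```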
Text and Code Editors
Most of today's text editors, code editors, and IDEs either use Unicode by default, or can easily handle Unicode. Sublime, Notepad++, Atom, and Eclipse are all set to UTF-8 as the default character encoding. Vim and Emacs may need a setting change to use UTF-8.
There are also a handful of code and text editors specifically designed to handle the extended Unicode character set:
- MinEd is a Unicode text editor with contextual support for inserting characters from the full range of the Unicode character space;
- Classical Text Editor is an advanced editor for working with critical and scholarly editions of texts, including multi-lingual texts using a wide range of the Unicode character set.
The relationship between fonts and Unicode is a bit oblique. Unicode was created to be backwards compatible with ASCII — text formatted in ASCII can be decoded as Unicode with virtually no problem. And Unicode-encoded text can be displayed using ASCII fonts, as long as only the small set of characters that appear in ASCII are used.
Today, most fonts available on most computers are encoded with Unicode. So, from that standpoint, most fonts are "Unicode fonts." However, most fonts do not support a particularly large set of the full Unicode standard.
Usually, this is not a problem; someone authoring a text in multiple languages, or with an extended character set, might use several different fonts: one for Latin script, another for each CJK language, and another for math symbols (for example). However, it can be useful sometimes to have single fonts which contain a large percentage of the Unicode character space. This might be needed when working in plain text and source code environments where using multiple fonts is not feasible, or when visual unity between multiple scripts is especially important.
The following are the most notable font projects providing extended Unicode support. For a more complete listing, including defunct and deprecated fonts, see this page of Unicode fonts. For typesetting Asian languages, see this list of CJK fonts.
- Everson Mono is a monospace font created by one of the originators of the Unicode standard; its stated purpose is to provide glyphs for as much of the Unicode character space as possible, and (as of this writing) 92 Unicode character blocks are supported;
- Noto is a large set of display fonts, developed by Google, which together provide support for a large majority of Unicode characters, with the intention to eventually support the entire Unicode standard;
- Deja Vu Fonts is a font family providing wide coverage of the Unicode standard, with Serif, Sans, and Monospace versions;
- GNU FreeFont is a family of fonts, providing Serif, Sans, and Mono typefaces for 37 writing systems and 12 Unicode symbol ranges;
- GNU Unifont is a monospace, bitmap font with complete coverage for the Unicode 8.0 Basic Multilingual Plane and wide, but incomplete, coverage for the Supplementary Multilingual Plane.
There are also a number of interesting fonts which encode a particular subset of the Unicode standard for specialized use.
- Junicode is a set of fonts for Medievalists;
- Last Resort is a "font of last resort;" instead of conventional character glyphs, each glyph actually displays information about the Unicode character itself;
- Unicode Fonts for Ancient Scripts is a project to create a set of fonts for several ancient and classical alphabets;
- Unimath Plus provides an extended set of science and math symbols.
And here are some additional Unicode font resources, if you still can't find what you are looking for:
- SIL Fonts is a collection of fonts for various under-supported languages, created by SIL International, a global non-profit serving minority language communities;
- Unicode character ranges and the Unicode fonts that support them will help you find a font for any range of Unicode characters.
Emoji are those funny little smiley faces and thumbs up signs that you can put in your text messages. They are actually part of the Unicode standard. The Emoji portion of Unicode is not universally supported, so if you want to incorporate Emoji into your app or website, you may need some help. Here are resources that will help you use and build with Unicode emoji.
- Emojipedia is a searchable database of Emoji characters;
- Can I Emoji? provides information on native support for Unicode emoji on iOS, Android, OS X, and Windows, as well as major browsers;
- WTF Emoji Foundation is a slightly serious organization dedicated to the advancement of emoji; they run the Emoji Dictionary;
- Emoji cheat sheet provides a quick reference for Emoji type-in codes;
- Include Emoji in apps, and translate between several vendor standards, with this PHP Emoji library; or try this PHP7 emoji library that lets you reference Emoji by name within your code;
- Emoji for Python supports both the official Unicode emoji and several sets of aliases; Django developers can also use the django-emoji package;
- Emoji Golang provides Emoji support for the Go programming language;
- There are several gems for Emoji support in Ruby, but the one by Github is probably the best one to use;
- Emoji-Java provides Emoji support in Java;
- Coloremoji.sty makes it easy to include full-color Emoji in LaTeX documents;
- Npm, the package management system for Node.js, has several emoji packages;
- Emoji Syntax is a silly library for the Atom text editor that adds emoji to lines of code based on their meaning.
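Since emoji are ordinary Unicode characters, no special library is needed just to inspect them. The sketch below uses only Python's standard library; the two emoji were chosen to match the smiley face and thumbs up mentioned earlier, and `emoji_info` is a helper name invented for this example.

```python
import unicodedata

# Written as escapes so the source stays readable even without emoji font support.
GRINNING_FACE = "\U0001F600"
THUMBS_UP = "\U0001F44D"


def emoji_info(ch: str) -> tuple:
    """Return (code point label, official name, UTF-8 byte count)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))


for ch in (GRINNING_FACE, THUMBS_UP):
    print(emoji_info(ch))

# Output:
# ('U+1F600', 'GRINNING FACE', 4)
# ('U+1F44D', 'THUMBS UP SIGN', 4)
```

Both characters sit above U+FFFF, so they take four bytes in UTF-8. Whether they display as pictures still depends on the fonts and platform support covered by the resources above.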
Emoji Keyboards and Collections
- EmojiXpress for iOS is an Emoji collection and keyboard for the iPhone;
- Emojione is a cross-platform Emoji collection with Creative Commons licensed artwork free for developers;
- iDiversicons provides a wide range of diverse Emoji characters, and an iPhone keyboard.
Most people just type and don't really think much about what is happening. A select few bother to think about the niceties of font design and typography. But even smaller is the number of people who know, or care to know, what happens behind the scenes — how a keypress becomes a letter on their computer screen. To everyone else, it is either transparent or trivial.
But this process of representing language is hardly trivial, and a huge amount of work has gone into making it as transparent as it is. The Unicode Consortium, along with countless developers, designers, and linguists, has made it possible for anyone to write any character, from any language, in any script, on any computer. This is a notable achievement, and a necessary step toward universal literacy and universal access to computers and the internet.
Further Reading and Resources
We have more guides, tutorials, and infographics related to coding and development:
- The Ultimate Guide to ASCII Encoding: go further back in time when computers first started displaying characters.
- PostScript Introduction and Resources: learn all about the page display language that changed the world.
- Fonts For Web Design: a Primer: learn the basics of fonts and their use in web design.
Web Design Trends You'll Never Forget
Before Unicode, it was common to visit websites where all the text was represented by empty boxes. Things have changed a lot. In our infographic Web Design Trends You'll Never Forget we run through how the web used to be.