Regular Expressions Primer and Resource

A regular expression, regex or regexp for short, is a sequence of letters and symbols that defines a logical pattern. Strings of text can then be compared to the pattern in order to identify strings that match the logical pattern defined by the regex. On the basis of this comparison, regex can be used to identify strings of text that meet specific requirements or to validate that strings meet a required pattern.

If this explanation of regex seems a bit abstract, perhaps taking a look at a few common uses of regex will help clarify their usefulness. Regex are used all the time in computer programming. For example, here are a few common uses for regex:

  • To validate that an email address entered into a web form is a properly formulated email address.
  • To identify all files in a computer system that end with a certain file extension.
  • To check URLs requested of a web server and perform redirects if the URLs meet a regex pattern.

It's important to understand that regex is a logical system for describing patterns and not a language. However, regex has been implemented in many different programming languages and can also be used for searching text in many text editors.

History of Regex

Regex originated as a concept in theoretical computer science, formalized in 1956 by mathematician Stephen Cole Kleene. Initially, regex was purely theoretical. However, in 1968 it was used in a computer application for the first time when Ken Thompson incorporated it in the QED text editor. Thompson wasn't alone in his adoption of regex. Around the same time, Douglas T. Ross incorporated regex for lexical analysis in a compiler.

Other implementations soon followed these first ones. In the early 1970s, regex logical patterns were added to the Unix text editor, ed. Shortly thereafter, the regex parser built into ed was spun off as a standalone Unix utility called grep. At that point it was only a matter of time before regex support was incorporated into many Unix utilities and applications including vi, lex, sed, Awk, expr, Emacs, and more.

By the time regex had found its way into various corners of the Unix operating system it was entrenched. However, there was still room for improvement. So the original regex syntax developed by Kleene was improved in the 1980s when support for expanded regex patterns was added to Perl based on an expanded regex library written by Henry Spencer. Even after the addition of regex to Perl, Spencer continued to expand on the idea, and a later and more advanced iteration of his library was built into the Tcl programming language. That implementation eventually made its way into high-profile modern information management projects like PostgreSQL.

Throughout the 1980s, regex was never standardized. However, that changed in 1992 when regex was standardized in POSIX.2. Today, regex is supported by many different programming languages and text editors. Most modern implementations are POSIX compliant but extend the POSIX standard in ways that differ from one implementation to the next. As a result, while basic regex patterns generally behave the same everywhere, advanced patterns can vary considerably depending on the environment in which they are applied.

How Regex Works

A regular expression is a combination of two types of characters: literals and special characters. In combination, these characters define a logical pattern. Strings of text can be compared to this pattern to see if they fit the pattern defined by the expression.

Literal characters represent themselves. That means that a literal letter a represents the letter a and a literal number 1 represents the digit 1. However, regex isn't very valuable if limited to literal characters. Special characters are what make regex useful.

Special characters have a logical meaning within a regex pattern. For example, let's look at the dot. The dot, or period, is used to represent any character. So, .a would match any two-character sequence in which the second character was the literal letter a. That means that ba, 1a, -a, aa, and the letter a preceded by a space would all match that regex pattern.
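The dot example can be tried out directly. The following is a minimal sketch using Python's built-in re module (the choice of Python is an assumption; the primer itself is language-independent). The candidate strings are the ones listed above plus two that should not match.

```python
import re

# The pattern from the text: any single character followed by a literal "a".
pattern = re.compile(r".a")

# Strings from the paragraph above, plus two counterexamples.
candidates = ["ba", "1a", "-a", "aa", " a", "ab", "a"]

# fullmatch() succeeds only if the entire string fits the pattern.
results = {text: pattern.fullmatch(text) is not None for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

The first five strings match. "ab" fails because its second character is not a, and "a" fails because the dot needs a character of its own to consume.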

The dot isn't the only special character in regex. There are around a dozen special characters that can be combined with literals to describe complex logical patterns. Let's look at two examples of how literals and special characters are combined to define complex expressions:

  • If you wanted to make sure that a string of text was an email address you could test it against this regex pattern: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$.
  • To scan file names and catch any that ended with the .php file extension, you could use the following regex: ^.*\.php$.
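As a rough illustration of the email check, here is a sketch in Python. It assumes the intended pattern is ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$ and that matching should ignore case, since the pattern lists only uppercase letters. The helper name looks_like_email and the sample addresses are made up for this example.

```python
import re

# Assumed email pattern; IGNORECASE lets [A-Z] accept lowercase letters too.
EMAIL_RE = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$", re.IGNORECASE)


def looks_like_email(text: str) -> bool:
    """Return True if the whole string fits the simple email pattern."""
    # Hypothetical helper name, used only for this illustration.
    return EMAIL_RE.fullmatch(text) is not None


print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("jane.doe@example"))      # False: no dot and top-level part
print(looks_like_email("@example.com"))          # False: nothing before the @
```

A pattern like this only checks the general shape of an address. It cannot tell whether the address actually exists.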

If you're new to regex, those patterns are probably pretty confusing. To understand how those patterns work, you need to understand what all of the special characters mean.

Regex Special Characters

While each regex implementation varies a bit in some regards, they generally all treat these special characters the same way with few exceptions.

  • \: The backslash character is used to escape other special characters. So if you wish to escape another character, such as a dot, so that it will be interpreted as a literal character, you could do so by preceding the dot with a backslash like this: \.
  • ^: A caret is used to match the beginning position in a string. The regex ^a would match any string beginning with the literal letter a.
  • $: A dollar sign is used to match the ending position in a string. The regex a$ would match any string ending with the literal letter a.
  • .: The dot or period matches any character other than a newline (\n) character. The regex 1.3 would match any string with a literal one, any character, and a literal three such as 123, 1a3, and even 1 3.
  • |: The vertical pipe is a choice operator and can be interpreted as a stand-in for the word or. So the regex a|b could be read as "a or b" and would match either the letter a or the letter b.
  • *: An asterisk is used to match the preceding character zero or more times. So the regex .* would match any string at all as long as it did not contain a newline character.
  • ?: The question mark will match the preceding character zero or one times, but no more. The regex 123? would therefore match either 12 or 123.
  • +: A plus symbol matches the preceding character one or more times, but the character must appear at least once, unlike the asterisk which matches the preceding character zero or more times. So, the regex a*b+ would match ab and b, but not a because the b must appear at least once due to the use of the plus symbol.
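  • [...]: Brackets match a single character contained within the brackets. Alternatively, a caret can be added, like this [^...], to match any single character that is not contained within the brackets. Most special characters do not need to be escaped within brackets; they are interpreted as literals. The exceptions are the closing bracket, the backslash, the caret, and the hyphen, which can keep a special meaning inside brackets depending on their position. Brackets are often used to define ranges of characters. For example, the regex [0-9] would match any single digit and [A-Z] would match any uppercase letter in the English alphabet. Finally, brackets can list several characters side by side with no separator. So [ag] would match either of the letters a and g. A comma inside brackets is treated as a literal comma, so [a,g] would match a, g, or a comma.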
  • {...}: Curly braces, called explicit quantifiers, specify the number of times that the preceding character must appear. The regex ab{2}c can only be met by abbc. A second number can be added to create a range of acceptable values. So, ab{2,3}c would match either abbc or abbbc and [0-9]{1,2} would match any one or two digit number.
  • (...): Parentheses are used to mark a subexpression within a larger expression. So the regex (abc)* would match zero or more repetitions of abc, such as abc or abcabc. All three letters must appear together and in that order.
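The behavior of these special characters can be checked with a short table-driven sketch. It uses Python's re module as one possible implementation; other engines may differ in details. The anchors are tested with search(), which scans the whole string, and the remaining patterns with fullmatch(), which requires the entire string to fit. The sample strings are chosen for illustration only.

```python
import re

# Anchors: search() looks anywhere in the string, so ^ and $ pin the position.
starts_with_a = re.search(r"^a", "apple") is not None    # True
ends_with_a = re.search(r"a$", "banana") is not None     # True
not_at_start = re.search(r"^a", "banana") is not None    # False

# (pattern, text, expected result of matching the WHOLE text)
CASES = [
    (r"1.3", "1 3", True),          # dot matches any character, even a space
    (r"1.3", "13", False),          # the dot needs a character of its own
    (r"a|b", "b", True),            # alternation: a or b
    (r"a|b", "c", False),
    (r".*", "two\nlines", False),   # dot does not match a newline by default
    (r"123?", "12", True),          # the final 3 is optional
    (r"123?", "1233", False),       # but it may appear at most once
    (r"a*b+", "b", True),           # zero a's, one b
    (r"a*b+", "a", False),          # at least one b is required
    (r"[0-9]", "7", True),          # a range inside brackets
    (r"[^0-9]", "7", False),        # negated class
    (r"[a,g]", ",", True),          # a comma inside brackets is a literal
    (r"ab{2}c", "abbc", True),      # exactly two b's
    (r"ab{2,3}c", "abbbbc", False), # four b's is outside the 2-3 range
    (r"(abc)*", "abcabc", True),    # the group repeats as a unit
    (r"(abc)*", "abcab", False),    # a partial group does not count
]

# Evaluate every case and keep the actual outcome next to the expected one.
outcomes = [
    (pattern, text, expected, re.fullmatch(pattern, text) is not None)
    for pattern, text, expected in CASES
]

for pattern, text, expected, actual in outcomes:
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{pattern!r} vs {text!r}: {actual} ({status})")
```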

On their own, special characters are somewhat useful. It's in the combination of these special characters with literals that powerful patterns can be described. The list of special characters above includes a few simple special character combinations. Let's take a second look at the regex for identifying PHP files to see how these characters work together.

Here's the regex in question: ^.*\.php$. Let's take it one piece at a time.

  • ^ indicates the beginning of the test string, which in this case would be a file name.
  • .* work together to indicate that the file name can contain any number of any characters, but no newlines.
  • \.php begins with an escape character that ensures that the dot is interpreted as a literal rather than a special character. Next, the literal letters php indicate that after the name of the file, we want to check for the combination of a dot and the letters php.
  • $ indicates that the .php extension must appear at the end of the search string in order to be a match.

As you can see, in just nine characters this regex manages to create a logical pattern that can sift through complex file names and pinpoint those that end with the .php file extension.
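To see the PHP pattern at work, here is a small sketch in Python. The file names are invented for the example, and the pattern is applied exactly as written above, so it is case-sensitive.

```python
import re

# The pattern dissected above: anything, then a literal ".php" at the very end.
PHP_FILE_RE = re.compile(r"^.*\.php$")

# Hypothetical file names, made up for illustration.
file_names = [
    "index.php",
    "archive.2024.php",
    "index.php.bak",   # .php is not at the end
    "readme.txt",
    "php",             # no dot before php
    "INDEX.PHP",       # uppercase does not match the lowercase literals
]

php_files = [name for name in file_names if PHP_FILE_RE.match(name)]

print(php_files)  # ['index.php', 'archive.2024.php']
```

If uppercase extensions should count as well, Python's re.IGNORECASE flag is one way to relax the match.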

Unicode and Regex

One of the problems that can crop up when dealing with regex is how to deal with languages that use characters that aren't part of the modern English alphabet. Unicode is an encoding standard that attempts to solve this problem by assigning every character a unique numeric value, called a code point.

Some regex implementations support the use of Unicode to build regex patterns, meaning that regex patterns can be created that will be able to test strings that include characters from a variety of different languages. In addition, Unicode expressions can be combined with literals and special characters to build complex regex patterns.

Some of the most common Unicode regex patterns include:

  • \p{L}: Matches any letter in any language. So, \p{L}{2,4} matches any sequence of letters between two and four characters long.
  • \p{Z}: Matches any kind of space or separator character.
  • \p{N}: Matches numeric characters.
  • \p{P}: Matches punctuation characters.
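Not every implementation accepts the \p{...} syntax. Python's built-in re module, for instance, does not. The sketch below therefore swaps in a different technique: it uses Python's unicodedata module to look up each character's Unicode general category, which is the information that \p{L}, \p{Z}, \p{N}, and \p{P} are based on. The helper names has_property and all_letters are hypothetical and exist only for this example.

```python
import unicodedata


def has_property(char: str, prop: str) -> bool:
    """Return True if the character's general category starts with prop."""
    # Categories are two-letter codes such as "Lu" (uppercase letter),
    # "Nd" (decimal digit), "Zs" (space separator), "Po" (other punctuation).
    return unicodedata.category(char).startswith(prop)


def all_letters(text: str) -> bool:
    """Rough stand-in for matching a whole string against \\p{L}+."""
    return len(text) > 0 and all(has_property(ch, "L") for ch in text)


print(has_property("é", "L"))   # True: a letter outside the English alphabet
print(has_property("7", "N"))   # True: a numeric character
print(has_property(" ", "Z"))   # True: a space separator
print(has_property("!", "P"))   # True: punctuation
print(all_letters("naïve"))     # True
print(all_letters("abc1"))      # False: the digit is not a letter
```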

There are many additional Unicode properties you can use when building regular expressions, provided the implementation you're working with supports Unicode regex. To learn more, read Unicode Regular Expressions at Regular-Expressions.info.

Resources

We've covered just enough in this primer to give you a basic feel for how regex works and how you can use it to pinpoint specific bits of code and text in a text editor or to identify and validate data in a computer application. To learn more, check out the following resources, which provide more advanced information than what we've covered in this brief tutorial:

One of the best ways to get the hang of writing regular expressions is to start writing them. The following online tools will allow you to do just that, comparing your custom regex to a block of custom text on the fly so you can figure out how to formulate regular expressions that work:

  • RegExr: the tagline says it all, "Learn, Build, and Test RegEx."
  • Regex Pal: create a regex pattern and then test strings against it to make sure it does exactly what you want it to do.
  • Regex 101: not all regex implementations are created equal. Use this tool to test strings against a regex pattern and fine tune behavior based on the programming language where you'll be implementing the pattern. PHP, JavaScript, Python, and Erlang are supported.

Quite a few texts have been written about regex, but three stand out as being best in class. If you really want to master regex, any one of these three texts will go a long way towards getting you there:

Summary

Regular expressions are a language-independent tool used by computer programmers to build logical patterns. These patterns can then be used to identify strings of text that fit the pattern. Regex implementations abound and regex is supported by most modern programming languages and available within the search tools built into many text editors. Getting started with regex can be tricky, but mastering regex is a necessary step in the development of any computer programmer.

