Regular Expressions Tutorial

Regular expressions and special characters in regular expressions

Regular expressions describe a pattern of text and are used to search for text that fits that pattern. A single regular expression can match many different strings, provided they meet the criteria set by the expression. Regular expressions are used by various text editors, programming languages (e.g. Perl and PHP), and in Apache directives. For example, the RedirectMatch directive and the rewrite directives take regular expressions. So it's useful to know what these expressions mean and how to write them yourself when putting such directives in your .htaccess files. There are several types of regular expressions, which are very similar but differ in some details. The one used by Apache is Perl Compatible Regular Expressions (PCRE).

Regular expressions are often referred to as regexes (regex in the singular). A regular expression consists of literal characters and special characters. Special characters are also known as meta-characters. A good thing to remember is that regex searches are case sensitive by default and that whitespace characters are matched like any other character.
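The two points above can be checked with a short sketch. It uses Python's re module as a stand-in, on the assumption that it behaves like PCRE for these basics. The re.IGNORECASE flag is Python-specific and is shown only for contrast.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# Matching is case sensitive by default, so "hx" is not found in "HX".
case_sensitive = re.search(r"hx", "HX")
print(case_sensitive)  # None

# With Python's IGNORECASE flag the same search succeeds.
case_insensitive = re.search(r"hx", "HX", re.IGNORECASE)
print(case_insensitive is not None)  # True

# A space is a character like any other, so "hx" is not found in "h x".
with_space = re.search(r"hx", "h x")
print(with_space)  # None
```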

Literal Characters

Literal characters are pretty much the alphanumeric characters. They don't have any special meaning, and when they are used in regexes they match the exact same character. For example, an a is an a, 2 is 2, etc. However, sometimes a literal character can have a special meaning in a certain combination with a special character. For example, the combination \d stands for any digit from zero to nine. This is known as a shorthand character class; we'll come to this later.

Now let's proceed to the special characters.

The Dot

.

The dot is used to match any character. This includes alphanumeric characters, symbols and whitespaces. The only exception is the new line. For example, the regular expression:

hx.

would match hxa, hxb, hx7, hx&, etc. Keep in mind that the dot stands for exactly one character. So the above example regex won't match the whole of hxaa, for instance; it will match only the first three characters, hxa.
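The behaviour of the dot can be tried out with a small sketch. It is written in Python, whose re module is assumed to treat the dot the same way as PCRE. Here fullmatch requires the whole string to fit the pattern, while search looks for the pattern anywhere in the string.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
dot = re.compile(r"hx.")

# fullmatch succeeds only if the entire string fits the pattern.
results = {text: dot.fullmatch(text) is not None
           for text in ["hxa", "hx7", "hx&", "hx ", "hxaa", "hx", "hx\n"]}
for text, ok in results.items():
    print(repr(text), ok)

# search finds the pattern inside a longer string: only "hxa" is matched.
inside = dot.search("hxaa").group()
print(inside)  # hxa
```

The space in "hx " counts as a character, so it matches, while the new line in "hx\n" does not.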

The Asterisk (Star)

*

The asterisk means that the token/pattern that's right before it should be repeated zero or more times. So the regex:

hxi*

will match hx, hxi, hxii, hxiii, hxiiii, and so on. In this case the asterisk affects only the preceding character. However, it can be applied to whole groups and character sets. We'll come to this later in the tutorial.

The Plus

+

The plus sign is similar to the asterisk, but the preceding token is repeated one or more times. So if we use the example:

hxi+

it will match hxi, hxii, hxiii and so on, but it will not match just hx (as in the example with the asterisk).

The Question Mark

?

The question mark makes the pattern after which it appears optional. This means that the preceding character or group of characters can appear zero times or one time. For example:

hxi?

will match hx and hxi.
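To compare the asterisk, the plus and the question mark side by side, here is a minimal sketch in Python (assumed to behave like PCRE for these quantifiers). The helper name matching and the sample list are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
samples = ["hx", "hxi", "hxii", "hxiii"]

def matching(pattern):
    """Return the samples that fit the pattern from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"hxi*"))  # zero or more i: all four samples
print(matching(r"hxi+"))  # one or more i: everything except hx
print(matching(r"hxi?"))  # zero or one i: only hx and hxi
```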

The Round Brackets

()

The round brackets are used to group characters. This is done so that you can apply other meta-characters to the whole group instead of just to a single character. For example:

hx(id)*

will match hx, hxid, hxidid, hxididid, and so on. The asterisk is applied to the whole group within the brackets which in this case is id.

The other special characters connected with quantity and repetition can also be used with groups. For example:

hx(id)+

will match hxid, hxidid, etc.

And the regex:

hx(id)?

will match hx and hxid.
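The effect of grouping is easiest to see by comparing a pattern with and without the round brackets. The following is a sketch in Python (assumed to match PCRE here); the helper full is hypothetical and simply reports whether the whole string fits the pattern.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
def full(pattern, text):
    """True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# With the brackets, the asterisk repeats the whole group "id".
print(full(r"hx(id)*", "hxidid"))  # True

# Without them, the asterisk applies only to the "d".
print(full(r"hxid*", "hxidid"))    # False
print(full(r"hxid*", "hxiddd"))    # True

# The plus needs at least one "id"; the question mark allows at most one.
print(full(r"hx(id)+", "hx"))      # False
print(full(r"hx(id)?", "hxid"))    # True
print(full(r"hx(id)?", "hxidid"))  # False
```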

The Square Brackets

[]

The square brackets are used to specify character classes. With these classes you can control what characters should be matched. For example:

hx[id]

will match hxi and hxd, but not hxid. You can create more complex classes by using the hyphen to specify whole ranges. For example, the regex:

hx[a-z]

will match hxa, hxb, hxc, and so on until the end of the alphabet. In this example the character class in the brackets matches only lowercase letters. If, for example, you use the following regex:

hx[a-zA-Z0-9]

this will match hx plus any letter, whether it's lowercase or uppercase, and also any digit from zero to nine (e.g. hxa, hxA, hx2, etc.).

You can also apply special characters to the character class specified in the square brackets. For example:

hx[a-z_]+

will match hx followed by one or more characters, each of which is a lowercase letter or an underscore. What changes this from the previous examples is the + meta-character after the character class. This means that this example regex will match not only hx_, hxa, hxb, hxc, etc., but also hx_a, hxid, hx_abdfn, hxlmna, etc. The other quantifying characters * and ? can also be used in such a way.

Another interesting thing is that some special characters lose their special meaning inside a character class and are interpreted literally instead. For example, the dot, the plus, the asterisk, and the question mark are interpreted literally when they are placed in square brackets. The hyphen, which is used for specifying character ranges, is also interpreted literally if it's placed right after the opening bracket or right before the closing bracket. For example:

hx[.+*?-]

will match hx., hx+, hx*, hx?, and hx-.
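The character class examples from this section can be verified with a short sketch. It is written in Python, on the assumption that its handling of square brackets matches PCRE for these cases; the helper full is hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
def full(pattern, text):
    """True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# A class matches exactly one character out of the listed ones.
print(full(r"hx[id]", "hxi"), full(r"hx[id]", "hxid"))    # True False

# Ranges are case sensitive unless both cases are listed.
print(full(r"hx[a-z]", "hxq"), full(r"hx[a-z]", "hxQ"))   # True False
print(full(r"hx[a-zA-Z0-9]", "hxQ"))                      # True

# A quantifier after the class repeats the class as a whole.
print(full(r"hx[a-z_]+", "hx_abdfn"))                     # True

# Inside the class, . + * ? and a trailing hyphen are literal characters.
literal_hits = [c for c in ".+*?-a" if full(r"hx[.+*?-]", "hx" + c)]
print(literal_hits)  # ['.', '+', '*', '?', '-']
```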

There are some other special cases with some of the other meta-characters, but we'll come to this further down in the tutorial.

The Curly Brackets

{}

The curly brackets are used for specifying quantity. Unlike the asterisk and the plus characters, the curly brackets let you set the quantity precisely. Inside the brackets, put the minimum number first, then a comma, then the maximum number. For example:

hx(id){3,4}

will match hxididid and hxidididid.

If you don't specify a maximum number, there won't be a boundary for the maximum times the particular pattern can be repeated. For example:

hx(id){3,}

will match hxididid, hxidididid, hxididididid, and so on.

If you don't include a maximum number and you don't put a comma after the minimum number, the pattern will be matched exactly as many times as specified by the number. For example:

hx(id){3}

will match only hxididid.

The curly brackets can also be used after a character class. For example:

hx[a-z]{2,4}

will match hx combined with two, three or four lowercase letters (e.g. hxab, hxtmn, hxftop, etc.).
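The three forms of the curly brackets can be compared with a sketch like the one below. It is written in Python, which is assumed to treat {min,max}, {min,} and {n} the same way as PCRE. The helper repeats is made up for illustration and tests strings with zero to six repetitions of id.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
def repeats(pattern, max_n=6):
    """List how many repetitions of "id" after "hx" the pattern accepts."""
    return [n for n in range(max_n + 1)
            if re.fullmatch(pattern, "hx" + "id" * n)]

print(repeats(r"hx(id){3,4}"))  # [3, 4]
print(repeats(r"hx(id){3,}"))   # [3, 4, 5, 6]
print(repeats(r"hx(id){3}"))    # [3]

# The same works after a character class: two to four lowercase letters.
words = ["hxa", "hxab", "hxtmn", "hxftop", "hxabcde"]
accepted = [w for w in words if re.fullmatch(r"hx[a-z]{2,4}", w)]
print(accepted)  # ['hxab', 'hxtmn', 'hxftop']
```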

The Backslash

\

The backslash is an escape character. This means that it's used when you want a meta-character to be interpreted literally. You have to put the backslash before the special character. For example:

hx\+

will match hx+.

If you want a backslash to be interpreted literally, you have to put another backslash in front of it. For example:

hx\\

will match hx\.

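Unlike the dot, the plus, the asterisk and the question mark, the backslash keeps its role as a meta-character when it's used inside a character class. For example:

hx[e\]

will not match hx\. Here the backslash escapes the closing square bracket, so the character class is never closed and the pattern is invalid. As usual, you have to escape the backslash with another backslash: hx[e\\] will match hxe and hx\.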
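The escaping rules above can be sketched in Python as follows. Two assumptions apply: Python's re module is used in place of PCRE, and the patterns are written as raw strings (the r prefix) so that Python itself does not interpret the backslashes before the regex engine sees them.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# An escaped plus is a literal plus.
print(re.fullmatch(r"hx\+", "hx+") is not None)      # True

# Two backslashes in the pattern match one literal backslash.
# The ordinary string "hx\\" is the three characters h, x and \.
print(re.fullmatch(r"hx\\", "hx\\") is not None)     # True

# Inside a character class the backslash still has to be escaped.
print(re.fullmatch(r"hx[e\\]", "hx\\") is not None)  # True

# A single backslash escapes the closing bracket instead, so the class
# is never closed and Python rejects the pattern.
try:
    re.compile(r"hx[e\]")
    class_error = None
except re.error as exc:
    class_error = exc
print(class_error is not None)  # True
```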

Anchors

^$

The caret and the dollar sign are used as anchors. They are used to specify the beginning and the end of a string. The beginning is marked with ^, and the end with $. For example, let's say that you search through the string:

hostknox provides quality hosting

and you use the following regex:

^host

This will match only the host in hostknox because it's at the beginning of the string. If you search through the same string with the regex:

host$

there will be no matches because neither the host in hostknox, nor the host in hosting is at the end of the string. If, however, you use the same regex host$ to search through the string:

hostknox is a quality host

it will match the host that's at the end of the string. Keep in mind that the regex engine treats whitespace as characters too, so if there's a space after host in the above-mentioned string, there will be no matches.

When you use the caret right after the opening bracket of a character class, this means that all characters can be matched except for the ones listed after the caret in the character class. For example:

hx[^ie]

will match hx combined with any other character, but it will not match hxi and hxe. If the caret is in a character class but it's not the first character after the opening bracket, then it will be interpreted literally. For example:

hx[i^e]

will match hxi, hx^, and hxe.

The dollar sign is interpreted literally no matter where within a character class it appears. For example:

hx[$i]

will match hx$ and hxi.
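The anchor examples and the special cases inside character classes can be checked with a sketch like this one. It uses Python's re module, which is assumed to agree with PCRE on these points. The variable names are illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
first = "hostknox provides quality hosting"
second = "hostknox is a quality host"

# ^host matches only the host at the very beginning of the string.
start_match = re.search(r"^host", first)
print(start_match.span())  # (0, 4)

# host$ needs host to be the last four characters.
print(re.search(r"host$", first))  # None
end_match = re.search(r"host$", second)
print(end_match.start() == len(second) - 4)  # True

# A trailing space is a character too, so the match disappears.
print(re.search(r"host$", second + " "))  # None

# Inside a class: a leading caret negates, elsewhere it is literal,
# and the dollar sign is always literal.
negated = [c for c in "ie^a$" if re.fullmatch(r"hx[^ie]", "hx" + c)]
literal_caret = [c for c in "ie^a$" if re.fullmatch(r"hx[i^e]", "hx" + c)]
literal_dollar = [c for c in "ie^a$" if re.fullmatch(r"hx[$i]", "hx" + c)]
print(negated)         # ['^', 'a', '$']
print(literal_caret)   # ['i', 'e', '^']
print(literal_dollar)  # ['i', '$']
```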

Shorthand

Shorthand character classes exist as a quick way to use some common character classes. There are several shorthand character classes. Their syntax is a backslash followed by a letter. A backslash and a d:

\d

means any digit, so it's equivalent to [0-9]. A backslash followed by a capitalized D:

\D

stands for any character except a digit. This makes it equal to the character class [^\d].

A backslash followed by an s:

\s

matches whitespaces, including new lines. So, for example, in the string h x it will match the space between h and x.

A backslash followed by a capital S:

\S

means any character that is not a whitespace. This makes it equivalent to [^\s].

A backslash followed by a w:

\w

matches any word character. This includes letters, digits and the underscore. It's equivalent to [a-zA-Z0-9_].

A backslash and a capital W:

\W

matches non-word characters (e.g. symbols, whitespaces, etc.).

Shorthand can be used inside square brackets without losing its original meaning. For example:

[a-f\d]

will match a single character that is either a lowercase letter from a to f or a digit.
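To see which characters each shorthand class picks out, here is a minimal sketch in Python. One difference from the description above should be noted: for ordinary Python strings, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given. The flag is passed here so that the classes correspond to [0-9] and [a-zA-Z0-9_]. The sample text is made up.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
# re.ASCII keeps \d and \w limited to [0-9] and [a-zA-Z0-9_].
text = "h x_7!\n"

found = {name: re.findall(name, text, re.ASCII)
         for name in [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W"]}
for name, chars in found.items():
    print(name, chars)

# Shorthand inside square brackets: one character that is a-f or a digit.
hex_like = re.findall(r"[a-f\d]", "cafe 42 xyz", re.ASCII)
print(hex_like)  # ['c', 'a', 'f', 'e', '4', '2']
```

Each uppercase shorthand returns exactly the characters that its lowercase counterpart leaves out.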

The Vertical Bar (Pipe)

The pipe is used for alternation. At each position in the string, the search tries the alternatives from left to right, starting with the leftmost one, and stops as soon as one of them matches. For example:

hxi|hxe|hxa

will search the string for one of the three alternatives. It will try to match hxi first and will stop if it finds a match.
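The left-to-right order of alternation matters most when one alternative is a prefix of another. The following sketch shows this in Python, which is assumed to try alternatives in the same order as PCRE. The patterns hx|hxi and hxi|hx are additional examples made up for this illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# The search moves through the string; at the first position where any
# alternative fits, it stops. Here hxe appears earlier than hxi.
earliest = re.search(r"hxi|hxe|hxa", "say hxe then hxi").group()
print(earliest)  # hxe

# At a given position the leftmost alternative that fits wins, even if
# a later alternative would match more text.
short_first = re.match(r"hx|hxi", "hxi").group()
long_first = re.match(r"hxi|hx", "hxi").group()
print(short_first)  # hx
print(long_first)   # hxi
```

When one alternative is the beginning of another, listing the longer one first makes sure it gets a chance to match.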