Mastering Lookahead and Lookbehind


Note: For a quick summary of lookarounds, see the lookaround section of the "(?" syntax reference.

Lookarounds often confuse the regex apprentice. I believe this confusion promptly disappears once one simple point is firmly grasped: at the end of a lookahead or a lookbehind, the regex engine hasn't moved on the string. You can chain three more lookaheads after the first, and the regex engine still won't move. In fact, that's a useful technique.
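To make this point concrete, here is a minimal sketch in Python (my choice of language for illustration; the patterns themselves are not tied to it). The sample string is made up. It chains three lookaheads and shows that the resulting match starts and ends at the same offset, so nothing was consumed.

```python
import re

sample = "abc123"  # hypothetical sample string

# Three lookaheads in a row, all evaluated from position 0.
m = re.match(r"(?=abc)(?=.*\d)(?=\w{6}$)", sample)

# All three lookaheads succeeded, yet the engine has not moved:
# the match is the empty string sitting at offset 0.
start, end, text = m.start(), m.end(), m.group()
print(start, end, repr(text))  # 0 0 ''
```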

Lookahead Example: Simple Password Validation

Let's get our feet wet right away with an expression that validates a password. The technique shown here will be useful for all kinds of other data you might want to validate (such as email addresses or phone numbers).
Our password must meet four conditions:

1. The password must have between six and ten word characters.
2. It must include at least one lowercase character.
3. It must include at least one uppercase character.
4. It must include at least one digit.

With lookarounds, your feet stay planted on the string. You're just looking, not moving!
Our strategy will be to stand at the beginning of the string and look ahead several times. We'll look for the number of characters, then we'll look for the lowercase letter, and so on. If all the lookaheads are successful, we'll know it's just what we want…
We'll then simply gobble it all up with a plain .*

See, a lookaround is really a form of conditional. It says "if you see X, then go ahead and match that". In the case of a negative lookaround, it says "if you don't see Y, then go ahead and match that".

Let's start with condition #1. A string of six-to-ten word characters can be written like this:
\w{6,10}
To look for this at the beginning of the string, we embed the expression in a lookahead, and we anchor the lookahead at the beginning of the string, so that we start looking from there:
^(?=\w{6,10})
We don't want the string to contain anything else, so we add the dollar anchor in the lookahead:
^(?=\w{6,10}$)
Finally, if this lookahead is successful, we match the whole string:
^(?=\w{6,10}$).*
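As a quick check of this first step, the following sketch runs the expression against a few made-up candidates using Python's standard `re` module. The candidate strings are assumptions chosen to sit on either side of the length limits.

```python
import re

# Condition 1 only: six to ten word characters and nothing else.
length_check = re.compile(r"^(?=\w{6,10}$).*")

candidates = ["abc12", "abc123", "abcdefghij", "abcdefghijk", "abc 123"]
outcome = {c: bool(length_check.match(c)) for c in candidates}

for candidate, ok in outcome.items():
    print(repr(candidate), ok)
```

Only the six-character and ten-character candidates should pass: the others are too short, too long, or contain a space.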

So far, we have an expression that validates that a string has six to ten word characters. Setting the other conditions will be a simple matter of adding lookaheads, and we'll get to this in a moment. But first, let's pause to note that our first condition is a special one that is present in most password-validation-style tasks: it is the condition that specifies the allowable set of characters (in our case, only \w characters). This condition also happens to specify the string's length. When we're done, I'll show a simple variation that gets rid of the lookahead for that special condition (the allowable set of characters), but for now I want to walk you through the process of building the validation regex by adding conditions one by one, because that is the core technique to remember.

Now let's add lookaheads for the other conditions.

First, let's look ahead to check that the string contains at least one lowercase letter:
(?=.*?[a-z])

How does this lookahead work? The .*? lazily eats up anything until it meets one lowercase letter. Of course, once you exit the lookahead, the engine hasn't eaten up any string characters: we are back where we started, at the beginning of the string.

So at this stage, we could combine this lookahead with our earlier expression:
^(?=\w{6,10}$)(?=.*?[a-z]).*
We look ahead twice in a row, and if we're satisfied by what we see, we eat up the string.

Following the same idea, for condition #3 (at least one uppercase letter), we can use:
(?=.*?[A-Z])

For condition #4 (at least one digit), we use:
(?=.*?\d)

Pulling all the lookaheads together into one regex, here is our password-validating expression:
^(?=\w{6,10}$)(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d).*
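Here is one way the finished expression might be wrapped in a small helper. This is a sketch in Python; the function name `is_valid_password` and the sample passwords are hypothetical, not part of the original walkthrough.

```python
import re

PASSWORD_RE = re.compile(r"^(?=\w{6,10}$)(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d).*")

def is_valid_password(candidate: str) -> bool:
    """Return True if the candidate meets all four conditions."""
    return PASSWORD_RE.match(candidate) is not None

checks = {
    "Passw0rd": is_valid_password("Passw0rd"),      # meets every condition
    "password1": is_valid_password("password1"),    # no uppercase letter
    "PASSWORD1": is_valid_password("PASSWORD1"),    # no lowercase letter
    "Password": is_valid_password("Password"),      # no digit
    "Pw0rd": is_valid_password("Pw0rd"),            # too short
    "Passw0rd!!": is_valid_password("Passw0rd!!"),  # "!" is not a word character
}
print(checks)
```

One Python-specific caveat: on `str` patterns, `\w` and `\d` also match non-ASCII letters and digits unless the `re.ASCII` flag is passed, so the flag may be worth adding if the password must be plain ASCII.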

By now, you know I'm fond of unrolling my regexes by using PHP's comment mode. Unrolled, our password validation expression looks like this:

(?x)           # Comment mode
^              # Anchor: beginning of string
(?=\w{6,10}$)  # Look ahead: six to ten word characters, then the end of the string
(?=.*?[a-z])   # Look ahead: anything, then a lower-case letter
(?=.*?[A-Z])   # Look ahead: anything, then an upper-case letter
(?=.*?\d)      # Look ahead: anything, then one digit
.*             # The lookaheads worked. We like this string, so let's eat it all up!
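The same unrolled layout carries over to other languages that offer a free-spacing mode. As an illustration only, here is a sketch using Python's `re.VERBOSE` flag, which plays the role of the inline (?x) above; the constant name is hypothetical.

```python
import re

PASSWORD_VERBOSE = re.compile(
    r"""
    ^                # Anchor: beginning of string
    (?=\w{6,10}$)    # Look ahead: six to ten word characters, then the end
    (?=.*?[a-z])     # Look ahead: anything, then a lowercase letter
    (?=.*?[A-Z])     # Look ahead: anything, then an uppercase letter
    (?=.*?\d)        # Look ahead: anything, then one digit
    .*               # All lookaheads passed: consume the whole string
    """,
    re.VERBOSE,
)

accepted = PASSWORD_VERBOSE.match("Passw0rd") is not None
rejected = PASSWORD_VERBOSE.match("password") is None
print(accepted, rejected)  # True True
```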

What if you want the password to contain ten or more of any character? No problem, you can modify the first lookaround so that the regex looks like so:
^(?=.{10,}$)(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d).*


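What if you also want the password to contain at least one "special character"? You can add one more lookahead after the others, which checks for one character from a set of special characters. Since most of these characters are not word characters, this only makes sense alongside a first condition that allows them, such as the (?=.{10,}$) shown above.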
(?=.*?[-!@#$%^&*()_<>:|{}.?[\]])
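To see the extra lookahead at work, here is a sketch in Python that appends it to the "ten or more of any character" variant. The sample passwords are made up, and I have escaped the opening bracket inside the character class (`\[`), which is my own precaution for Python and not something the pattern above requires.

```python
import re

STRONG_RE = re.compile(
    r"^(?=.{10,}$)"                       # ten or more of any character
    r"(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d)"  # lowercase, uppercase, digit
    r"(?=.*?[-!@#$%^&*()_<>:|{}.?\[\]])"  # at least one special character
    r".*"
)

strong = {
    "Passw0rd!!": bool(STRONG_RE.match("Passw0rd!!")),  # all conditions met
    "Passw0rd12": bool(STRONG_RE.match("Passw0rd12")),  # no special character
    "Pw0!": bool(STRONG_RE.match("Pw0!")),              # too short
}
print(strong)
```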


Variation: removing the "special lookahead" for the set of allowable characters
Let's go back to our full validation regex:
^(?=\w{6,10}$)(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d).*

The dot-star at the end eats up all of the string's characters until the end of the string. Now look at our first condition, (?=\w{6,10}$). It too matches all the characters to the end of the string, but inside a lookahead. Well, since we're later going to match all characters anyway, the first lookahead and the dot-star are redundant. We can take the expression in the first lookahead, and use that instead of the dot-star to eat up the whole string:

^(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d)\w{6,10}$
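A quick way to convince yourself that the two forms accept the same strings is to run both over a handful of inputs. The sketch below does that in Python; the sample list is an assumption, and agreement on a few samples is a sanity check, not a proof.

```python
import re

with_lookahead = re.compile(r"^(?=\w{6,10}$)(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d).*")
without_lookahead = re.compile(r"^(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d)\w{6,10}$")

samples = ["Passw0rd", "password1", "Pw0rd", "Passw0rd!!", "ABCdef1234", "ABCdef12345"]

# For each sample, record whether both versions give the same verdict.
agreement = [
    bool(with_lookahead.match(s)) == bool(without_lookahead.match(s))
    for s in samples
]
print(agreement)
```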

This variation performs the same job of ensuring that only allowable characters are matched, and that the string is the correct length. But it allows us to get rid of a lookahead. This is more efficient, as it means that the regex engine reads the entire string one less time.

You don't have to use this variation: it is perfectly acceptable to build your validation regexps one condition at a time. But if you like to optimize, you'll probably remember that instead of an ugly dot-star at the end, you can pull out the "special lookahead" that validates that the string only contains characters from a specific set, and use its expression to match the entire string once the other conditions have been checked.


Lookarounds Stand their Ground

If I seem to be flogging a dead horse here, it's only because this point is a common source of confusion. As the password example made clear, lookarounds stand their ground. They start looking from the exact position where the regex engine is presently standing on the string. At the end of the lookaround, the engine hasn't moved, so if you have a second lookaround, it starts looking left on the string if it's a lookbehind, or right if it's a lookahead, starting from the exact same position.

So lookahead doesn't mean "look way ahead into the distance". It means "look what character your right foot is touching, and perhaps beyond". Sure, if a lookahead contains something like a dot-star (.*), it can let you see far off in the distance, but with binoculars—always starting from right where the engine is standing.

Lookarounds want to be Anchored

When a regex that starts with a lookaround fails, the engine does what it would do if the pattern started with anything else: it tries the regex again, starting at the second character. This means that potentially, a lookahead that is meant to be used only once ends up being used many times. In the password example, consider what happens if we leave off the anchor before the first lookahead. For now, just pretend that the plus sign I inserted at the end of {6,10}+ is not there (in this context, it doesn't mean "one or more").
(?=\w{6,10}+$).*

With this unanchored lookahead, if we have a seven-letter string, everything's fine. But what if we have a string that is made of a hundred letters without spaces, then one space? On the first character, the engine tries the lookahead. It fails, because after matching ten characters it fails to find the end of the string specified by the dollar sign. There is no anchor telling the regex engine to only try the lookahead at the start of the string, so it now tries the lookahead in the second position of the hundred-letter string. It fails again. The engine will try the lookahead starting at all of the hundred letters of the string! At the 91st character, it will find ten word characters followed by a space instead of the end of the string, so it will keep failing.
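The walk down the string can be imitated by hand. The sketch below is in Python and makes two assumptions worth flagging: it drops the plus sign (Python only accepts that syntax from version 3.11 on), and it uses `Pattern.match(text, pos)` to try the lookahead at one offset at a time, which mimics the behavior described above but does not measure what a real engine does internally, since engines may apply their own optimizations.

```python
import re

text = "a" * 100 + " "  # a hundred letters, then one space

lookahead = re.compile(r"(?=\w{6,10}$)")

# Try the lookahead at every offset, as an unanchored pattern would.
offsets_tried = 0
offsets_matched = 0
for pos in range(len(text) + 1):
    offsets_tried += 1
    if lookahead.match(text, pos):
        offsets_matched += 1

print(offsets_tried, offsets_matched)  # 102 0

# The full unanchored pattern finds nothing, after all that work.
unanchored_result = re.search(r"(?=\w{6,10}$).*", text)
```

With the caret in front, the engine has no reason to retry past the first offset, because the anchor cannot match anywhere else in this single-line string.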

From an efficiency standpoint, this kind of explosion is horrific. This makes the case for helping the regex engine along so it doesn't needlessly apply a lookaround multiple times. To do that, you anchor, either by using string anchors like the caret in our password pattern, or by placing literal text (such as letters that you know must be in the string) just before the lookaround.

A Note for Eager Regex Students Only
If you remove the plus sign, the explosion is even worse, compared with a baseline of ^(?=\w{6,10}+$).* What was that plus sign in the first place? In this context, the plus sign in \w{6,10}+ makes the quantifier possessive, which achieves the same as writing (?>\w{6,10}), an atomic group. It means that after finding ten characters and failing to find the end of the string, the lookahead gives up and the engine moves on down to the next character. Without the plus, before the engine moves down the string, the lookahead gives up one of the ten characters and tries to find nine characters (the next option down for the greedy {6,10}) and the end of the string… and so on until it fails to find six characters followed by the end of the string. It's only at that point that the engine realizes the lookahead has failed at this position in the string, and decides to move down to the next character. The anchored, atomic baseline fails after ten steps. The non-anchored, non-atomic version takes about 4000 steps before it gives up!

Two Main Ways of Using Lookarounds

As you probably know, there are four kinds of lookarounds:
(?= (Lookahead)
(?! (Negative Lookahead)
(?<= (Lookbehind)
(?<! (Negative Lookbehind)

What may not be so clear is that each of these lookarounds can be used in two main ways: before the expression to be matched, or after it. These two ways have a slightly different feel. Please don't obsess over the differences; rather, just cruise through these simple examples to become familiar with the types of effects you can achieve.

Lookaround Before the Match
(?=\d{3} dollars).{3} (Lookahead). Looks ahead for three digits followed by " dollars". Matches "100" in "100 dollars".

(?!\d{3} pesos)\d{3} (Negative Lookahead). Makes sure what follows is not three digits followed by " pesos". Matches "100" in "100 dollars".

(?<=USD)\d{3} (Lookbehind). Makes sure "USD" precedes the text to be matched. Matches "100" in "USD100".

(?<!USD)\d{3} (Negative Lookbehind). Makes sure "USD" does not precede the text to be matched. Matches "100" in "JPY100".
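These four "lookaround before the match" patterns can be tried out directly. The sketch below uses Python, with a hypothetical helper `first_match`, and adds one counter-example for each negative lookaround so you can see it refuse a string. The negative lookahead is written with the (?! syntax.

```python
import re

def first_match(pattern, text):
    """Return the first matched text, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

before = {
    "lookahead": first_match(r"(?=\d{3} dollars).{3}", "100 dollars"),
    "negative lookahead": first_match(r"(?!\d{3} pesos)\d{3}", "100 dollars"),
    "negative lookahead, pesos": first_match(r"(?!\d{3} pesos)\d{3}", "100 pesos"),
    "lookbehind": first_match(r"(?<=USD)\d{3}", "USD100"),
    "negative lookbehind": first_match(r"(?<!USD)\d{3}", "JPY100"),
    "negative lookbehind, USD": first_match(r"(?<!USD)\d{3}", "USD100"),
}
for name, value in before.items():
    print(name, "->", value)
```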

Lookaround After the Match
\d{3}(?= dollars) (Lookahead). Makes sure " dollars" follows the three digits to be matched. Matches "100" in "100 dollars".

\d{3}(?! dollars) (Negative Lookahead). Makes sure " dollars" does not follow the three digits to be matched. Matches "100" in "100 pesos".

.{3}(?<=USD\d{3}) (Lookbehind). Looks behind for "USD" followed by three digits. Matches "100" in "USD100".

\d{3}(?<!USD\d{3}) (Negative Lookbehind). Makes sure what precedes is not "USD" followed by three digits. Matches "100" in "JPY100".
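The "lookaround after the match" patterns can be checked the same way. Again this is a Python sketch with a hypothetical helper, `first_match`, plus a counter-example for each negative lookaround. Both lookbehinds here have a fixed width of six characters, which Python's `re` module accepts.

```python
import re

def first_match(pattern, text):
    """Return the first matched text, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

after = {
    "lookahead": first_match(r"\d{3}(?= dollars)", "100 dollars"),
    "negative lookahead": first_match(r"\d{3}(?! dollars)", "100 pesos"),
    "negative lookahead, dollars": first_match(r"\d{3}(?! dollars)", "100 dollars"),
    "lookbehind": first_match(r".{3}(?<=USD\d{3})", "USD100"),
    "negative lookbehind": first_match(r"\d{3}(?<!USD\d{3})", "JPY100"),
    "negative lookbehind, USD": first_match(r"\d{3}(?<!USD\d{3})", "USD100"),
}
for name, value in after.items():
    print(name, "->", value)
```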

What this all Means
The point of these eight examples is not to make you memorize different uses for lookarounds; but, rather, to expose you to the ways lookarounds operate depending on their position in the expression. As you can see, there are often two ways (at least!) of achieving the same result. For example, (?=\d{3} dollars).{3} and \d{3}(?= dollars) both match "100" in "100 dollars".

These methods have a different feel, but I wouldn't try to give them names, because as soon as you put one method in one box, you find that the other one also sometimes fits in the box. Now that you have felt these two basic "feels", efficient ways of using lookarounds to solve your regex problems will probably come to you naturally.

Lookbehinds are Fixed-Width Expressions (usually)

One thing to be aware of is that in most regex flavors, the expression in a lookbehind must match a fixed number of characters, meaning you cannot include something like \d+ in a lookbehind. The .NET and ABA flavors are two exceptions. Lookaheads do not have this restriction: you can include a .* in a lookahead.
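Python's standard `re` module is one of the flavors with this restriction, which makes it a convenient place to see the rule in action. The sketch below uses a hypothetical helper, `compiles`, that simply reports whether a pattern is accepted; the three patterns are made up for illustration.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_width_lookbehind = compiles(r"(?<=USD)\d{3}")    # always three characters
variable_width_lookbehind = compiles(r"(?<=\d+)USD")   # \d+ has no fixed width
open_ended_lookahead = compiles(r"\d{3}(?=.*USD)")     # lookaheads may use .*

print(fixed_width_lookbehind, variable_width_lookbehind, open_ended_lookahead)
# True False True
```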

In PHP versions later than 5.2.4, you can often get around this limitation of lookbehinds by using the very cool \K escape sequence.

