Mastering Quantifiers


The behavior of regex quantifiers is a common source of woes for the regex apprentice. You may have heard read that they can be "greedy" or "lazy", sometimes even "possessive"—but sometimes they don't seem to behave the way you had expected. Is there a bug in your regex engine?

As it turns out, there is more to quantifiers than just "greedy" and "lazy". This page digs deep into the details of quantifiers and shows you the traps you need to be aware of and the tricks you need to master in order to wield them effectively.

(direct link)
Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Quantifier Basics( greedy / docile / lazy / helpful / possessive )
Quantifier Cheat Sheet
The Greedy Trap
Lazy Quantifier Solution
Lazy Quantifiers are Expensive
Negated Class Solution
Tempered Greedy Token Solution
Explicit Greedy Alternation Solution
Unrolled Star Alternation Solution
The Lazy Trap
The Explosive Quantifier Trap
The Longest Match and Shortest Match Traps


(direct link)

Quantifier Basics

Before we dive into quantifier tricks and traps, let's have a quick reminder of the basics because I don't know what you've read so far and there's no shortage of incomplete regex tutorials.

Stick to what's immediately to the left
A regex quantifier such as + tells the regex engine to match a certain quantity of the character, token or subexpression immediately to its left. For instance,

✽ in A+ the quantifier + applies to the character A
✽ in \w* the quantifier * applies to the token \w
✽ in carrots? the quantifier ? applies to the character s—not to carrots
✽ in (?:apple,|carrot,)+ the quantifier + applies to the subexpression (?:apple,|carrot,)

One place where the "stick-to-the-left" rule is not immediately obvious is with the \Q…\E sequence that escapes all of the characters it contains. If you stick a + after such a sequence, should it apply to the whole sequence, or only to its last character? The engine treats the content of the sequence as a series of literals, so the quantifier only applies to the last character. For instance, \QC+\E+ matches all of C++++, but against C+C+ it only matches C+.

(direct link)
Greedy: As Many As Possible (longest match)
By default, a quantifier tells the engine to match as many instances of its quantified token or subpattern as possible. This behavior is called greedy.

For instance, take the + quantifier. It allows the engine to match one or more of the token it quantifies: \d+ can therefore match one or more digits. But "one or more" is rather vague: in the string 123, "one or more digits" (starting from the left) could be 1, 12 or 123. Which of these does \d+ match?

Because by default quantifiers are greedy, \d+ matches as many digits as possible, i.e. 123. For the match attempt that starts at a given position, a greedy quantifier gives you the longest match.

Do beware of this notion of "longest match": it refers to the longest match that can be found with a match attempt that starts at a given position in the string — not to the longest possible match that can be found if a pattern is applied repeatedly to various sections of a string. For more on this, see the section about the longest and shortest match traps.

(direct link)
Docile: Give Back When Needed
Greedy…
but docile.
Because by default a quantifier is greedy, the engine starts out by matching as many of the quantified token as it can swallow. For instance, with A+, the engine swallows as many A characters as it can find.

But if the quantified token has matched so many characters that the rest of the pattern can not match, the engine will backtrack to the quantified token and make it give up characters it matched earlier—one character or chunk at a time, depending on whether the quantifier applies to a single character or to a subpattern that can match chunks of several characters. After giving up each character or chunk, the engine tries once again to match the rest of the pattern. I call this behavior of greedy quantifiers docile.

For instance, against the string A tasty apple, using the regex .*apple, the token .* starts out by greedily matching every single character in the string. The engine then advances to the next token a, which fails to match as there are no characters left in the string. The engine backtracks into the .*, which gives up the e in apple. The engine once again advances to the next token, but the a fails to match the e. The engine again backtracks into the .*, which gives up the l. The process repeats itself until the .* has given up the a, at which stage the text tokens a, p, p, l and e are all able to match and the overall match is successful.

When you hear that A+ means "one or more A characters", that is therefore not the whole story. It is "one or more, but as many as possible (greedy), and giving back characters if needed in order to allow the rest of the pattern to match (docile)".

Suppose our entire string is AAA. Depending on which pattern we use to match the string, the quantified token A+ could end up matching A, AA or AAA. Consider these three patterns:

A+ — A+ matches AAA (as many as possible).

(A+). — A+ (captured to Group 1) matches AA, because to allow the dot to match, A+ (which starts out by matching AAA) has to give up one A.

(A+).. — A+ (captured to Group 1) matches A, because to allow the two dots to match, A+ (which starts out by matching AAA) has to give up two A characters.

(direct link)
Lazy: As Few As Possible (shortest match)
In contrast to the standard greedy quantifier, which eats up as many instances of the quantified token as possible, a lazy (sometimes called reluctant) quantifier tells the engine to match as few of the quantified tokens as needed. As you'll see in the table below, a regular quantifier is made lazy by appending a ? question mark to it.

Since the * quantifier allows the engine to match zero or more characters, \w*?E tells the engine to match zero or more word characters, but as few as needed—which might be none at all—then to match an E. In the string 123EEE, starting from the very left, "zero or more word characters then an E" could be 123E, 123EE or 123EEE. Which of these does \w*?E match?

Because the *? quantifier is lazy, \w*? matches as few characters as possible to allow the overall match attempt to succeed, i.e. 123—and the overall match is 123E. For the match attempt that starts at a given position, a lazy quantifier gives you the shortest match.

Do beware of this notion of "shortest match": it refers to the shortest match that can be found with a match attempt that starts at a given position in the string — not to the shortest possible match that can be found if a pattern is applied repeatedly to various sections of a string. For more on this, see the section about the longest and shortest match traps.

(direct link)
Helpful: Expand When Needed
Lazy…
but helpful.
With a lazy quantifier, the engine starts out by matching as few of the tokens as the quantifier allows. For instance, with A*, the engine starts out matching zero characters, since *allows the engine to match "zero or more".

But if the quantified token has matched so few characters that the rest of the pattern can not match, the engine backtracks to the quantified token and makes it expand its match—one step at a time. After matching each new character or subexpression, the engine tries once again to match the rest of the pattern. I call this behavior of lazy quantifiers helpful.

For instance, against the string Two_apples, using the regex .*?apples, the token .*? starts out by matching zero characters—the minimum allowed by the * quantifier. The engine then advances in the pattern and tries to match the a token against the T in Two. That fails, so the engine backtracks to the .*?, which expands to match the T. The engine advances both in the pattern and in the string and tries to match the a token against the w in Two. Once again, the engine has to backtrack. The .*? expands to match the w, then the a token fails against the o in Two. This process of advancing, failing, backtracking and expanding repeats itself until the .*? has expanded to match Two_. At that stage, the following token a is able to match, as are the p and all the tokens that follow. The match attempt succeeds.

As this example showed, because lazy quantifiers expand their match one step at a time in order to match only as much as needed, they cause the engine to backtrack at each step. They are expensive.

To fully grasp how lazy quantifiers work, let's look at one more example. The quantified token A*? matches zero or more A characters—as few as possible, expanding as needed. Against the string AA, depending on the overall pattern, A*? could end up matching no characters at all, A or AA. Consider how these three patterns match AA:

^(A*?)AA$ — A*? (captured to Group 1) matches no characters. After the anchor ^ asserts that the current position is the beginning of the string, A*? tries to match the least number of characters allowed by *, which is zero characters. The engine moves to the next token: the A, which matches the first A in AA. The next token matches the second A. The match attempt succeeds, and Group 1 ends up containing no characters.

^(A*?)A$ — A*? (captured to Group 1) matches one A. Initially, the A*? matches zero characters. The next token A matches the first A in AA. The engine advances to the next token, but the anchor $ fails to match against the second A. The engine sees that the A*? can expand. It backtracks and gives up the A, which the A*? now expands to match. The engine moves to the next token: the A matches the second A in the string. The $ anchor now succeeds. Group 1 ends up containing one A.

^A*?$ — A*? matches AA. After the A*? matches zero characters, the $ fails to match. The engine backtracks and allows the A*? to match one A. Once again, the $ fails to match (there is one A left in the string). The engine backtracks again and allows the A*? to expand to match the second A. This time, the $ anchor matches. Group 1 ends up containing AA.

(direct link)
Possessive: Don't Give Up Characters
In contrast to the standard docile quantifier, which gives up characters if needed in order to allow the rest of the pattern to match, a possessive quantifier tells the engine that even if what follows in the pattern fails to match, it will hang on to its characters.

As you'll see in the table below, a quantifier is made possessive by appending a + plus sign to it. Therefore, A++ is possessive—it matches as many characters as needed and never gives any of them back.

Whereas the regex A+. matches the string AAA, A++. doesn't. At first, the token A++ greedily matches all the A characters in the string. The engine then advances to the next token in the pattern. The dot . fails to match because there are no characters left to match. The engine looks if there is something to backtrack. But A++ is possessive, so it will not give up any characters. There is nothing to backtrack, and the pattern fails. In contrast, with A+., the A+ would have given up the final A, allowing the dot to match.

Possessive quantifiers match fragments of string as solid blocks that cannot be backtracked into: it's all or nothing. This behavior is particularly useful when you know there is no valid reason why the engine should ever backtrack into a section of matched text, as you can save the engine a lot of needless work.

In particular, when a match must fail, a possessive quantifier can help it to fail faster. For instance, suppose we want to match a string of digits that ends with E, as in 123E. We can use a possessive quantifier with the \d digit token:
\b\d++E
When we use this pattern against 123E, it matches in the same way as if we had used a non-possessive \d. Actually, in theory the match could be a hair faster because the \d++ quantifier doesn't need to remember positions where it may later need to backtrack—it's all or nothing.

Now let's use the same pattern against 13245. We expect the match to fail because the string doesn't end with an E. Let's see how the possessive and non-possessive versions compare.

✽ In the possessive version \b\d++E, after matching all the digits, the engine advances in the pattern and attempts the next token E. There are no characters left in the string, so this fails. Since the engine has nowhere to backtrack to, the match fails.

✽ In the non-possessive version \b\d+E, after failing to match the E token at the end of the string, unless the engine has been optimized to detect that the \d+ token and the E are mutually incompatible, it has positions to backtrack to. It backtracks into the \d+ and gives up the last character matched, which was the 5. It then advances in the pattern and tries the next token E against the 5. That fails, so the engine backtracks into the \d+ again, gives up the 4, advances in the pattern and tries the E against the 4. This fails, and the process repeats itself until the \d+ has given up everything except the 1, at which stage there is nothing left to backtrack and the pattern can finally fail.

As you can see, in the regular version the engine spends a lot of time in needless backtracking, whereas in the possessive version the "all or nothing" \d++ allows the match to fail right away.

It's worth noting that certain engines (such as PCRE) study the pattern before starting the match, notice that the token \d is mutually exclusive with the token E, and optimize the pattern by automatically turning the \d+ into a possessive \d++. This process is called auto-possessification. PCRE even allows you to turn it off with the special start-of-pattern modifier (*NO_AUTO_POSSESS)

Possessive quantifiers are supported in Java (which introduced the syntax), PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python.

In .NET, where possessive quantifiers are not available, you can use the atomic group syntax (?>…) (this also works in Perl, PCRE, Java and Ruby). The atomic group (?>A+) tells the engine that if the pattern fails after the A+ token, it is not allowed to backtrack into A+. This means that A+ will not give up any of its characters—it is like a solid block (hence the name atomic). As you can see, this is the same behavior as A++. In fact, A++ is syntactic sugar for (?>A+), as internally most engines convert the first to the second.


(direct link)

Quantifier Cheat Sheet

+once or more
A+One or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A+?One or more As, as few as needed to allow the overall pattern to match (lazy)
A++One or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
*zero times or more
A*Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*?Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
?zero times or once
A?Zero or one A, one if possible (greedy), giving up the character if the engine needs to backtrack (docile)
A??Zero or one A, zero if that still allows the overall pattern to match (lazy)
A?+Zero or one A, one if possible (greedy), not giving the character if the engine tries to backtrack (possessive)
{x,y}x times at least, y times at most
A{2,9}Two to nine As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A{2,9}?Two to nine As, as few as needed to allow the overall pattern to match (lazy)
A{2,9}+Two to nine As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
A{2,}
A{2,}?
A{2,}+
Two or more As, greedy and docile as above.
Two or more As, lazy as above.
Two or more As, possessive as above.
A{5}Exactly five As. Fixed repetition: neither greedy nor lazy.



(direct link)

The Greedy Trap

The classic trap with greedy quantifiers is that they may match more than you expect. Suppose you want to match tokens that begin with {START} and end with {END}. You may try this pattern:
{START}.*{END}
Note that Java will require that you escape the opening braces: \{

However, you will find that this pattern matches this entire string from start to finish:
{START} Mary {END} had a {START} little lamb {END}

…whereas we wanted to find two matches:
{START} Mary {END}
{START} little lamb {END}

Here is what happens. After matching {START}, the engine moves to the next token: .*
Because of the greedy quantifier, the dot-star matches all the characters to the very end of the string. The engine then moves to the next token: the { at the beginning of {END}. This fails to match because there are no characters left in the string.

But the engine sees that it can backtrack into the dot-star. First, the dot-star gives up the very last character in the string, i.e. }. The engine now tries to match the { token against this character, but fails. The dot-star then gives up the D. Again, the engine fails to match the { token against that character. Repeating this process, the dot-star gives up the N, the E and the {, and and the { token can finally match. Then the rest of the pattern END} matches. Therefore, the final match is the entire string. The dot-star has only given up as many characters as were needed to allow an overall match to succeed.

The best-known way to solve this problem is with lazy quantifiers. But lazy quantifiers have their own problems, and it is worth understanding other techniques to overcome the greed of an unfettered dot-star. We will look at five distinct solutions, which you all need to master on your way to your regex black belt.


(direct link)

Lazy Quantifier Solution

The easiest way is to make the dot-star lazy by adding a ? question mark:
{START}.*?{END}
The lazy .*? guarantees that the quantified dot only matches as many characters as needed for the rest of the pattern to succeed. Therefore, the pattern only matches one {START}…{END} item at a time, which is what we want.

Containing a Lazy Quantifier that Can Eat the Delimiter: Atomic Group
Suppose our regex pattern must match not only a {START}…{END} block, but some characters beyond that block, for instance \d+ digits. In such cases, we must tweak the lazy quantifier solution by embedding the lazy dot-star and the {END} delimiter together in an atomic group — like so:

{START}(?>.*?{END})
This is because if tokens (such as \d+) beyond {END} fail to match, the engine will backtrack and require the .*? to expand beyond the first {END}, perhaps reaching a second {END} where a match is possible. We don't want this. The atomic group (?>.*?{END}) forbids the engine from backtracking into the lazy .*? after the first {END} has been matched. There are other ways to solve this problem, which is discussed in the Lazy Trap section.

Whenever the token quantified by a lazy quantifier is able to eat the delimiter, as in the above example or something like \d*?9, remember to embed the token and the delimiter together in an atomic group: (?>\d*?9)

(direct link)
Lazy Quantifiers are Expensive
It's important to understand how the lazy .*? works in this example because there is a cost to using lazy quantifiers.

When it first encounters .*? the engine starts out by matching the minimum number of characters allowed by the quantifier—which is zero. The engine then advances in the pattern and tries the next token (which is {) against the M in Mary. This fails, so the engine backtracks and allows the .*? to expand its match by one item, so that it matches the M. Once again, the engine advances in the pattern. It now tries the { against the a in Mary. This fails, so the engine backtracks and allows the .*? to expand and match the a. The process then repeats itself—the engine advances, fails, backtracks, allows the lazy .*? to expand its match by one item, advances, fails and so on.

As you can see, for each character matched by the .*?, the engine has to backtrack. From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive".

On a modern processor, for simple patterns, this will likely not matter. But if you want to craft efficient regular expressions, you must pay attention to use lazy quantifiers only when they are needed. Lower on the page, I will introduce you a far more efficient way of doing things.


(direct link)

Negated Class Solution

Suppose we know that the character { will never be present between the delimiters {START} and {END}. Instead of the lazy quantifier, we can use a negated character class in our pattern:

{START}[^{]*{END}
The negated character class [^{]* greedily matches zero or more characters that are not an opening curly brace. Therefore, we are guaranteed that the dot-star will never jump over the {END} delimiter. This is a more direct and efficient way of matching between {START} and {END}.

Note that in this solution, we can fully trust the * that quantifies the [^{]. Even though it is greedy, there is no risk that [^{] will match too much as it is mutually exclusive with the { that starts {END}. This is the contrast principle from the regex style guide.


(direct link)

Tempered Greedy Token Solution

For the negated character class solution, we assumed that the character { would never be present between the delimiters {START} and {END}. Let's now remove this assumption.

Going back to our original naive .* pattern to match between the delimiters, one way to ensure that the .* doesn't jump over the first {END} is to temper it with a negative lookahead. That is what this pattern does:

{START}(?:(?!{END}).)*{END}
If you look closely, you'll see that we still have a kind of dot-star—a more complex one. In (?:(?!{END}).)*, the * quantifier applies to a dot, but it is now a tempered dot. The negative lookahead (?!{END}) asserts that what follows the current position is not the string {END}. Therefore, the dot can never match the opening brace of {END}, guaranteeing that we won't jump over the {END} delimiter.

When Not to Use this Technique
For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.

The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.

Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.

When to Use this Technique
Suppose our boss now tells us that we still want to match up to and including {END}, but that we also need to avoid stepping over a {MID} section, if it exists. Starting with the lazy dot-star version to ensure we match up to the {END} delimiter, we can then temper the dot to ensure it doesn't roll over {MID}:

{START}(?:(?!{MID}).)*?{END}
If more phrases must be avoided, we just add them to our tempered dot:
{START}(?:(?!{MID})(?!{RESTART}).)*?{END}
This is a useful technique to know about.


(direct link)

Explicit Greedy Alternation Solution

Staying with the idea that the character { may be present between the delimiters {START} and {END}, let's look at an elegant technique that is more efficient than both the tempered greedy dot and the lazy dot star. We'll start with an unoptimized version:

{START}(?:[^{]|{(?!END}))*{END}
We still have a greedy quantifier *. This time, it does not apply to a dot but to a non-capturing group (?:…) that contains an alternation.

✽ On the left side of the alternation, [^{] matches one character that is not an opening brace. We can safely do this because we know that a non-{ character will never make us roll over the {END} delimiter.

✽ On the right side of the alternation, we are allowed to match a { as long as it is not followed by END}: the negative lookahead (?!END}) asserts that what follows the position after { is not END}. In our language of quantifier techniques, this is a tempered opening brace.

The pattern can be further optimized. If we have several non-{ characters in a row (which will be the typical case), at the moment we have to enter and exit the alternation for every single character because the quantifier * applies to the non-capturing group (?:[^{]|{(?!END})). This seems inefficient. If we also had a quantifier on the [^{], we could match multiple non-{ characters without leaving the alternation.

To do so, the first idea would be to use [^{]+. However, this leads to a situation where the * quantifier applies to the + quantifier. If the pattern fails, the engine will explore all the ways that the two quantifiers can divide up the "pie of characters", leading to needlessly long backtracking and the situation I call an explosive quantifier (we'll explore in a later section). What we want is to match any non-{ characters as a solid block that cannot be backtracked into. We do this with a possessive quantifier [^{]++ or an atomic group (?>[^{]+).

While we're at it, we should also lock up the entire quantified alternation once we exit, because if {END} fails to match, backtracking into the alternation won't help. We also do this either with possessive quantifiers (turning the * into *+) or by wrapping the quantified alternation into an atomic group.

We can use the possessive version in Java, PCRE (C, PHP, R…), Perl, Ruby 2+ and the alternate regex module for Python:
{START}(?:[^{]++|{(?!END}))*+{END}
We can use the atomic version in every major engine except Python and JavaScript:
{START}(?>(?:(?>[^{]+)|{(?!END}))*){END}
In any version of this solution, we do away with the generic dot by explicitly stating what we want: either any number of non-{ characters; or a { as long as it is not followed by END}. This is a prime example of the Say What You Want (and What You Don't Want) principle from the regex style guide.

Note that for all the "normal" characters matched by the general case [^{]+ on the left side of the alternation, we don't need to look ahead. Indeed, we only look ahead when we encounter an opening brace—which might only be once, when we hit the {END} delimiter. Because we avoid the look-ahead-fail-backtrack rigmarole, we should expect this pattern to match faster than both the lazy dot-star and tempered-dot solutions, which both require "looking" at each step.

This is confirmed by pcretest:

✽ Running the patterns a million times each on the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the atomic lazy-dot-star version, 800 for the tempered-dot version and 400 for the Explicit Greedy Alternation solution.

✽ Lengthening the test string to {START} Mary Ate a Little Lamb {END}, the gaps between the versions increase drastically: 800 milliseconds per 10,000 runs for the lazy-dot-star, 2,300 for the tempered-dot, and only 500 for the explicit-greedy-alternation solution.

This solution takes a little more effort to write as you need to separate the brace case from the non-brace case, but it is well worth it if performance matters.


(direct link)

Unrolled Star Alternation Solution

The Explicit Greedy Alternation solution just above uses a subpattern of the form (A|B)*. In the trick to mimic an alternation quantified by a star, we see that this can be unrolled into A*(?:B+A*)*: zero or more As, then zero or more repetitions of one or more Bs followed by zero or one As.

Applying this formula to (?:[^{]|{(?!END}))*, we obtain this fifth solution:

{START}[^{]*(?:(?:{(?!END}))+[^{]*)*{END}
This solution has pros and cons. On the plus side, it is even faster than the Explicit Greedy Alternation solution it unrolls. pcretest reports that per 10,000 runs, the performance on the short test string is identical, but on the longer test string this solution clocks in at 400 milliseconds, compared to 500 milliseconds for the other. If you are looking to squeeze out every last drop of performance, this is the way to go.

On the minus side, the pattern is harder to read. While the intent of the alternation in the original is not immediate, it is not the case once the alternation is unrolled. Moreover, one of the elements of the alternation is now repeated—when (A|B)* becomes A*(?:B+A*)* there are now two As. If you ever change A in one place, you may forget to do it in the other—a maintenance hazard.

In my view, this is the kind of tweak that should be performed by the engine as an optimimization behind the curtain.


(direct link)

The Lazy Trap

In the right context, lazy quantifiers solve some problems. But if you become overconfident in their power, you can run into new problems. This section explains a common source of confusion.

We will use this string:
{START} Mary {END}00A {START} little lamb {END}01B
As before, we want to match an entire {START}…{END} group, except this time we want to extend the match after the {END} to include some digits and the letter B.

At first, it may seem reasonable to tack on \d+B at the end of our lazy quantifier solution:
{START}.*?{END}\d+B Looking at the bold string a few lines above, what do you think this pattern matches? Keep reading when you've made up your mind.

The pattern matches the entire string from the very beginning to the very end. Do you see why?

Lazy quantifiers can jump the fence.
The .*? is supposed to expand until {END}\d+B is able to match. Starting the match at the very start of the string, the .*? has no reason to stop expanding after the first {END} — where \d+B cannot match. The .*? therefore continues to expand until a position where {END}\d+B is able to match. Starting the match at the beginning of the string, the shortest match is the whole string.

The lesson: remember that the engine allows a lazy quantifier to expand its match as much as needed to allow an overall match. If forced to, a lazy quantifier may jump the fence you thought you had made for it.

To contain the .*? in .*?{END} to the section before the first instance of {END}, you need to tweak it or replace it using one of four techniques we have already seen:

✽ Bundle the characters preceding {END} together with {END} into an atomic group, forbidding the engine to backtrack and expand the .*? past the first {END}: (?>.*?{END})
✽ Use a Tempered Greedy Token: (?:(?!{END}).)*{END}
✽ Use an Explicit Greedy Alternation: (?:[^{]++|{(?!END}))*+{END}
✽ Use an Unrolled Star Alternation Solution: [^{]*(?:(?:{(?!END}))+[^{]*)*{END}


(direct link)

The Explosive Quantifier Trap

If you're not careful, you can easily build a regular expression that has the potential to backtrack for a billion years. The page would feel too cramped if I tried to cover this important topic here, so please visit the companion page about explosive quantifiers.


(direct link)

The Longest Match and Shortest Match Traps

You'll often hear that a greedy quantifier gives you the "longest match" while a lazy quantifier gives you the "shortest match". While true in a specific context, these assertions are a frequent source of confusion.

(direct link)
What does "longest match" mean?
Sometimes people will try to match a string such as 12 9876 with a regex such as \d+ and expect the engine to return 9876. The + quantifier is greedy, so isn't that the longest match?

This is a misunderstanding. When we talk of greedy or lazy quantifiers providing us the longest or shortest match, it is always in the context of a single match attempt starting at a fixed position in the string.

For the match attempt that starts at the beginning of the string, the longest match is 12, as opposed to the shortest match 1. The string 9876 is never considered in this match attempt as it is not at the beginning of the string.

If you ask your language or tool to return all the matches in the string, it will also return 9876 as a match found at position 3 (counting from 0). That will be the longest match for \d+ at that position, whereas 9 would be the shortest.

(direct link)
What does "shortest match" mean?
Sometimes people will try to match a string such as 123EEE 98EE with a regex such as \w+?E and expect the engine to return 98E. The +? quantifier is lazy, so isn't that the shortest match?

This is another misunderstanding. Once again, when we talk of greedy or lazy quantifiers providing us the longest or shortest match, it is in the context of a single match attempt starting at a fixed position in the string.

For the match attempt that starts at the beginning of the string, the shortest match is 123E, as opposed to the longest match 123EEE. The string 98EE is never considered in this match attempt as it is not at the beginning of the string.

Longest or Shortest Possible Match in a String
If you wanted the longest or shortest of the various matches returned by the engine, you could ask your programming language to sort them by length.

This would not be a good task to try to do directly in regex as there is no syntax to compare the number of times various quantifiers are applied. I have been lobbying for quantifier arithmetic, and hopefully this will some day enter the syntax through the big gate.




next
 A good long look at Regex Conditionals




1-1 of 1 Threads
Claude
August 25, 2014 - 21:35
Subject: greedy/lazy?

At the beginning of "The Longest Match and Shortest Match… ", you are using "greedy" twice. I am assuming that you mean "greedy" first and then "lazy".
Reply to Claude
Rex
August 26, 2014 - 08:24
Subject: RE: greedy/lazy?

Bonjour Claude, Thank you very much for reporting this typo. I spent last week entirely rewriting that page, so it's still fresh and I rely on kind readers like you to let me know about little bugs. :) Kindest regards, Rex


Leave a Comment






All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.