Regex Syntax Tricks


On this page, I'd like to collect some useful regex tricks. The idea here is not to assemble a cookbook of regex recipes to match this or that—for that, see the cookbook page and the many pages of tricks linked on the left.

Rather, the idea is to present more general regex syntax tricks—by which I mean that each of these tricks can be seen as a way to accomplish something which, if you read the manual to the letter, might first seem impossible as it is not covered by the syntax.

The earlier of these tricks will be well-known to regex heads, but I hope that even avid regexers will find something new further down.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Forcing failure: regex constructs that never match
Parity and beyond: check that string length is a multiple of n
Uppercase word that later appears in lowercase
Atomic Groups for Engines that Don't Support It
Pre-Defined Subroutines for Engines that Don't Support It
Mimic an Alternation Quantified by a Star
Miscellaneous Tricks


(direct link)

Forcing Failure: Regex Constructs that Never Match

Sometimes, you want to stop the regex engine in its track by throwing in an expression that always fails. Once the engine fails to match an expression or token, it attempts to backtrack in the hope of finding another path that succeeds. You might wish to exploit this path exploration, or design your pattern so that this backtracking never succeeds—below we'll examine why we might want a token that never matches. Before looking at the "standard solution", let's explore some options.

To force the engine to choke and turn back, one strategy is to require two incompatible things, such as using a lookahead to assert that the next character is an a, then immediately trying to match one character that is not an a: (?=a)[^a]. This can simply never match.

You can do the same with shorthand classes: (?=\d)\D, (?=\w)\W or (?=\s)\S. And you can twist the logic any way you like, as in (?=[^a])a, or by using a negative lookahead, as in (?!a)a. This latter version asserts that what follows is not an a, then tries to match an a.

This last pattern suggests that there is no need to involve entire character classes such as \D or [^a]. Indeed, (?=z)a is just as quite effective as our original (?=a)[^a].

With word boundaries, you could use \b\B (requiring that a position be both a boundary and not a boundary), A\bA (requiring a boundary between two letters, where there is none) or =\BA (requiring a non-boundary where there is a boundary). Ideas involving anchors ^ and $ are less reliable because of multi-line mode.

Standard Solution
To force a regex engine to fail and turn back, the most-used expression is probably this:

(?!)
Its compactness is unbeatable. How does it work? (?! … ) is a negative lookahead asserting that what is in the parentheses after the ?! cannot be matched. In (?!) there is nothing between the (?! and the closing parentheses, an absence that the engine interprets as the empty string.

The regex engine translates (?!) as "assert that at this position, the empty string cannot be matched." The empty string can always be matched, so engine fails to match the (?!) assertion. This causes it to backtrack in search of a different match, if any can be found.

I first came across this little gem in the Regular Expressions Cookbook, but I don't know who invented it.

In Perl, PCRE (C, PHP, R…) and Python's alternate regex engine, there is a special token that can never match, forcing the engine to fail and causing it to backtrack in search of another match. It can be written (*FAIL) or (*F). Internally, PCRE optimizes (?!) to (*F).

You can read all about (*FAIL) on my page about backtracking control verbs.

Why would you want a token that never matches?
Among other uses, tokens that always fail come in handy:
- in conditionals that check for balancing conditions in the string (e.g. the engine found an "opening construct", now you want it to match the corresponding closing construct),
- to explore the branches of a match tree.

For details, please read the use cases for on my page about backtracking control verbs.

When I use this trick, is the regex match guaranteed to fail?
Not necessarily. While the (?!) construct and its equivalents never match, that doesn't imply that the patterns containing them won't match: after failing to match (?!), the engine attempts to backtrack, and it may find an alternate path that succeeds.

If your pattern is designed in such a way that backtracking is fruitless, the match attempt will fail.

If you want to guarantee failure in all cases, please read how to force the match attempt to fail on my page about backtracking control verbs.


(direct link)

Parity and beyond: check that string length is a multiple of n

Without using string functions, can you check that a string has an even number of characters? This is easily done:

^(?:..)+$
The non-capturing group (?:..) matches two characters. The + quantifier repeats that one or more times.

To check that a string's length is a multiple of n, a more general version would be ^(?:.{n})+$

If your string spans several lines, you could start out with something like (?s)\A(?:.{n})+\z, using the single-line (DOTALL) mode s to ensure that the dot matches line breaks, and the A and z anchors to anchor the whole string. Each line break character will count as one character, so this is probably less useful.


(direct link)

Find an Upper-Case Word that Later Appears in Lowercase

This idea was shown on the inline modifiers section of the main syntax page.

(\b[A-Z]+\b)(?=.*(?=\b[a-z]+\b)(?i)\1) The upper-case word is captured to Group 1. Then we match characters up to a point where the lookahead (?=[a-z]+\b) can assert that what follows is a lower-case word, then we set case-insensitive mode on and we match the content of Group 1.

In .NET we could use a lookbehind instead of the double lookahead:
(\b[A-Z]+\b)(?=.*\b[a-z]+\b(?i)(?<=\1))


(direct link)

Atomic Groups for Engines that Don't Support It

Among the major engines, JavaScript and Python share the dubious distinction of not supporting atomic grouping syntax, such as (?>A+)

This hack lets you get around it: (?=(A+))\1. The lookahead asserts that the expression we want to make atomic—i.e. A+—can be found at the current position, and the parentheses capture it. The back-reference \1 then matches it. So far, this is just a long-winded way of matching A+. The magic happens if a token fails to match further down in the pattern. The engine will not backtrack into a lookaround, and for good reason: if its assertion is true at a given position, it's true! Therefore, Group 1 keeps the same value, and the engine has to give up what was matched by A+ in one block.

The page on Reducing (? … ) Syntax Confusion looked at two examples of atomic groups. Referring to that page for what they do, here are the equivalent "pseudo-atomic" groups in JavaScript and Python:

(?>A+)[A-Z]C translates to
(?=(A+))\1[A-Z]C

and
(?>A|.B)C translates to
(?=A|.B)\1C


(direct link)

Pre-Defined Subroutines for Engines that Don't Support It

Perl, PCRE (C, PHP, R…) and Python's alternate regex engine support the terrific (?(DEFINE) … ) syntax that allows you to pre-define named regex subroutines and to use them to build beautiful modular patterns such as the the regex to match numbers in plain English.

At present, the only other engine I know that supports subroutines, Ruby 1.9+, does not support the (?(DEFINE) … ) syntax. I came up with this trick in order to use pre-defined subroutines in these two engines.

As with (?(DEFINE) … ), the idea is to place the definitions in a section at the beginning of the regex—although you can put them anywhere before the subroutine is called. Each definition looks like this:

(?:(?<foo> … )(?!))?
Here the name of the subroutine is foo, and the actual subroutine goes in the place of the ellipsis . For instance, (?:(?<uc_word>\b[A-Z]+\b)(?!))? defines a subroutine for a word in all-caps.

Once the subroutine is defined, you call it with \g<foo> (Ruby). This makes it very convenient to call patterns repeatedly (especially if they are long), to maintain them in one place, and so on.

How does the trick work?
Before we look at a full-length example, let's discuss how this works. The external parentheses in (?:(?<foo> … )(?!))? define a non-capturing group. The group is made optional by the ? quantifier. As we'll soon see, the non-capturing group always fails, and the zero-or-one quantifier ? is what allows us to go on after that internal failure.

Within the non-capturing group, we start with a standard named group definition (?<foo> … ). This both creates a named group and a named subroutine.

The engine doesn't know that at this stage we only aim to define the subroutine without actually matching characters. Therefore, if it finds characters that match the pattern in the named group, it matches them. We don't want that—as our goal in our definitions is to simply define without moving our position in the string.

This is why after defining the named group and subroutine, we force the non-capturing group to fail by using (?!), which we saw earlier in the page with the forced failure trick. Of course we don't want the whole regex to fail, which is why we use the ?quantifier to make the non-capturing group optional.

Full Example
This example is identical to the one in the section on pre-defined subroutines: it defines a basic grammar, allowing us to match sentences such as five blue elephants solve many interesting problems and many large problems resemble some interesting cars

This is for Ruby:

(?x)   # whitespace mode
#######  DEFINITIONS  #########
# pre-define quant subroutine 
(?:(?<quant>many|some|five)(?!))?

# pre-define adj subroutine 
(?:(?<adj>blue|large|interesting)(?!))?

# pre-define object subroutine
(?:(?<object>cars|elephants|problems)(?!))?

# pre-define noun_phrase subroutine
(?:(?<noun_phrase>\g<quant>\ \g<adj>\ \g<object>)(?!))?

# pre-define verb subroutine
(?:(?<verb>borrow|solve|resemble)(?!))?
#######  END DEFINITIONS  #######

##### The regex matching starts here #####
\g<noun_phrase>\ \g<verb>\ \g<noun_phrase>	
	


(direct link)

Mimic an Alternation Quantified by a Star

You want to match a series or joe_ or bob_ tokens, as in:
joe_joe_joe_bob_joe_bob_
We allow an empty match.

Your first thought is probably (?:bob_|joe_)*
But can you perform the same task without using an alternation?

This will do it:
(?:bob_)*(?:(?:joe_)+(?:bob_)*)*
After (?:bob_)* matches zero or more bob_ tokens, we match (zero or more times) one or more joe_ tokens followed by zero or more bob_ tokens.

In more general terms, (A|B)* becomes A*(?:B+A*)*.

We use this technique in the Unrolled Star Alternation solution to the Greedy Trap problem.


(direct link)

Miscellaneous Tricks

In this section, I would like to add links to tricks that are described on other pages but do not fully satisfy this page's specification of regex syntax tricks that perform tasks that may not at first seem like they are covered by the engine's syntax.

double-negative delimiters
✽ I've implemented an infinite lookbehind demo for PCRE.




next
 Regex Cookbook





Be the First to Leave a Comment






All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.