The Elements of Good Regex Style


Knowing the English alphabet doesn't make you Hemingway. Likewise, knowing regex syntax doesn't make you literate with regex. Style is a hard thing to teach aspiring writers, but the rudiments of style can certainly be taught. Likewise, "good regex" is not easy to put into rules, and perhaps that's why I've never seen literature on the subject. This makes it all the more interesting to give it a try.

With the disclaimer that I do not present myself as the ultimate authority on efficient, graceful regex, this page presents some "rules" I have gradually distilled with practice. There is always more to learn, so new rules may appear and old ones may be rephrased or subsumed by others.

Before diving into the Elements of Regex Style, let's warm up with some considerations about matching strategy.


(direct link)

Should I Match it, or Should I Capture it?

Matches are not sacred. Feel free to throw them away!
One topic I've never seen discussed (but maybe I'm not reading widely and carefully enough) is the various strategies available to you in order to retrieve the data you need. Should you try to match it? Should you try to capture it? It's an implicit question in nearly every complex piece of regex you write.

In the examples on this site, you'll see a diverse use of matching and capturing. Sometimes, the match returns exactly the data we want. But often, the match is a "throwaway" that just gets us down the string, down to the portions we really care about, which we then capture.

Sometimes, then, the captured groups contain the data we're after. But at other times, you only use groups (and back references) to build an intricate expression that adds up to the overall match you're looking for.

If this sounds confusing, it doesn't need to be. There is only one thing you need to tell yourself:

The Match is Just Another Capture Group

Basically, you can imagine that there is a set of parentheses around your entire regex. These parentheses are just implied. They capture Group 0, which by convention we call "the match".

In fact, your language may already think that way. In PHP, if $match is the match array, $match[1] will contain Group 1, $match[2] will contain Group 2… and $match[0] ("Group 0") will contain the overall match. Likewise, in JavaScript the overall match will be in matchArray[0]. In Python and C#, you can (although those are not the only options) retrieve the overall match as match.group(0) and matchResult.Groups[0].Value.

Likewise, in regex replace operations, \1, \2, \3 (or $1, $2, $3, depending on the flavor) usually refer to capture groups 1, 2 and 3. Not by coincidence, \0 ($0) usually refer to the overall match.

Once you see that the match is just another group, the question of whether to match or to capture loses importance: You will be capturing anyhow.

What does this mean? You are the one who knows what data you want to match. Knowing this, use whatever means you need (whether it's an overall match or a sneaky capture in Group 3) in order to grab what you want. In the example about keeping the regex in sync with your string, we'll look at a technique that makes many captures—some useful, some not—and then leaves it to the code outside the regex to decide which of the capture groups are important. It's not a particularly efficient technique, but it works.

The only moderation I would add to the advice to "use whatever means you need" is that it's generally considered poor programming practice to spawn unnecessary capture groups, as they create overhead. So if you need parentheses in order to evaluate an expression but don't need to capture the data, make it a non-capturing group by using the (?: … ) syntax.


(direct link)

Should I Split, or should I Match All?

Here is a regex axiom that may come as a surprise:

Matching All and Splitting are two sides of the same coin.

Consider a list of fruits separated the word "and": apple and banana and orange and pear and cherry. You are interested in obtaining an array with all the names of fruits. To do so, you could match all the words that are not and (something like \b(?!and)\S+ would do). Another approach would be to split the string using the delimiter " and ". Both approaches would provide you with the same array: it's a bit like one of those drawings that can be interpreted in different ways depending on whether you focus on the white background or on the inked parts.

When you to want to match, I'll split you... When you want to split, I'll match you...
This is a simple example, but often you will gain considerable advantage by deciding to match rather than to split, or vice-versa. You'll often find that one way is easy and the other nearly impossible. Therefore, if someone tells you "I want to match all the…" or "I am trying to split by…", try not to rush down the first alley because they said "split" or "match": remember the other side of the coin.


(direct link)

The Elements of Regex Style

In the world of regex, it's appropriate to paraphrase Strunk & White:

To write good regex, say what you mean. Say it clearly.

The more specific your expressions, the faster your regex will match; and, often more importantly, the faster your regex will fail when no match is there to be found.

Here are a few "golden rules" that every regex craftsperson should keep in mind.

If some of these rules don't make complete sense to you right now, don't worry about it—just come here again after you've read some other sections, or in a couple months' time.

(direct link)
Whenever Possible, Anchor.
Anchors, such as the caret ^ for the beginning of a line and the dollar sign $ for the end of a line often provide the needed clue that ensures the engine finds a match in the right place. For instance, when we validate a string, they ensure that the engine matches the whole string, rather than a substring embedded in the string being examined. And anchors often save the engine a lot of backtracking. Be aware that anchors are not limited to ^ and $. Most engines have other useful built-in anchors, such as \A and \G (see the cheat sheet).

(direct link)
When You Know what You Want, Say It. When You Know what You Don't Want, Say It Too!
When you feed your regex engine a lot of .* "dot-star soup", the engine can waste a lot of energy running down the string then backtracking. Be as specific as possible, whether by using a literal B character, a \d digit class or a \b boundary. Another great way to be specific is to say what you don't want—whether what you don't want is… a double quote: [^"]… a digit: \D… or for the next three letters to be "boo": (?!boo)[a-z]{3}.

(direct link)
Contrast is Beautiful—Use It.
When you can, use consecutive tokens that are mutually exclusive in order to create contrast. This reduces backtracking and the need for boundaries in the broad sense of the term, in which I include lookarounds. For instance, let's say you're trying to validate strings that contain exactly three digits located at the end, as in ABC123. Something like ^.+\d{3}$ would not work, because . and \d are not mutually exclusive—this regex would match ABC123456. You may think to add a negative lookbehind: ^.+(?<!\d)\d{3}$. But if you use tokens that are mutually exclusive in the first place, you no longer need a lookaround: ^\D+\d{3}$ works straight out of the box. With time, you come to relish the beautiful contrast between \D and \d, between [^a-z] and [a-z]. This is a variation on When you know what you want, say it.

(direct link)
Want to Be Lazy? Think Twice.
Let's say you want to match all the characters between a set of curly braces. At first you might think of {.*?} because the lazy quantifier ensures you don't overshoot the closing brace. However, a lazy quantifier has a cost: at each step inside the braces, the engine tries the lazy option first (match no character), then tries to match the next token (the closing brace), then has to backtrack. Therefore, the lazy quantifier causes backtracking at each step (see Lazy Quantifiers Are Expensive). This is more efficient: {[^}]*}. This is a variation on Use Contrast and When you know what you want, say it.

(direct link)
A Time for Greed, a Time for Laziness.
A reluctant (lazy) quantifier can make you feel safe in the knowing that you won't eat more characters than needed and overshoot your match, but since lazy quantifiers cause backtracking at each step, using them can feel like bumping on a country road when you could be rolling down the highway. Likewise, a greedy quantifier may shoot down the string then backtrack all the way back when all you needed was a few nudges with a lazy quantifier.

(direct link)
On the Edges: Really Need Boundaries or Delimiters? Use Them—or Make Your Own!
Most regex engines provide the \b boundary, and sometimes others, which can be useful to inspect an edge of a substring. Depending on the engine, other boundaries may be available, but why stop there? In the right context, I believe in DIY boundaries. For instance, using lookarounds, you can make a boundary to check for changes from upper- to lower-case, which can be useful to split a CamelCase string: (?<=[a-z])(?=[A-Z]) However, do not overuse boundaries, because good contrast often make them redundant (see Use Contrast.)

(direct link)
Don't Give Up what You Can Possess.
Atomic groups (?> … ) and the closely-related possessive quantifiers can save you a lot of backtracking. Structured data often gives you chances to incorporate those in your expressions.

(direct link)
Don't Match what Splits Easily, and Don't Split what Matches Nicely.
I explained this point in the section about splitting vs. matching.

(direct link)
Design to Fail.
As Shakespeare famously wrote, "Any fool can write a regex that matches what it's meant to find. It takes genius to write a regex that knows early that its mission will fail." Take (?=.*fleas).*. It does a reasonable job of matching lines that contain fleas. But what of lines that don't have fleas? At the very start of the string, the engine looks all the way down the line. The lookahead fails, the regex engine moves to the second position in the string, and once again looks for fleas all the way down the line. At each position in the string, the engine repeats the lookahead, so that the pattern takes a long time to fail… In comparison, consider ^(?=.*fleas).*. The only difference is the caret anchor. It doesn't look like a big deal, but once the engine fails to find fleas at the start of the string, it stops because the lookahead is anchored at the start. This pattern is designed for failure, and it is much more efficient—O(N) vs. O(N2) for the first.

(direct link)
Trust the Dot-Star to Get You to the End of the Line
With all the admonishments against the dot-star, here is one of many cases where it can be useful. In a string such as @ABC @DEF, suppose you wish to match the last token that starts with @, but only if there is more than one token. If you simply wanted the last, you could use an anchor: @[A-Z]+$… but that will match the token even if it is the only one in the string. You might think to use a lookahead: @[A-Z].*\K@[A-Z]+(?!.*@[A-Z]). However, there is no need because the greedy .* already guarantees you that you are getting the last token! The dot-star matches all the way to the end of the line then backtracks, but only as far as needed: You can therefore simplify this to @[A-Z].*\K@[A-Z]+ Trust the dot-star to take you to the end of the line!


(direct link)

Two Mnemonic Devices to Check your Regexps

Greedy atoms anchor again.
Until you acquire a lot of practice, it's probably impossible to keep all these rules in mind at the same time. But remembering a few is better than remembering none, so if you're starting out, may I suggest a simple phrase to help remind yourself of tweaks that may improve the expression?

Greedy atoms anchor again.

✽ "Greedy" reminds you to check if some greedy quantifiers should be made lazy, and vice-versa. It also reminds you of the performance hit of lazy quantifiers (backtracking at each step), and of potential workarounds.
✽ "Atoms" reminds you to check if some parts of the expression should be made atomic (or use a possessive quantifier).
✽ "Anchor" reminds you to check if the expression should be anchored. By extension, it may remind you of boundaries, and whether to add them—or remove them.
✽ "Again" reminds you to check if parts of the expression could use the repeating subpattern syntax.

If you prefer short mnemonic devices, you may prefer the acronym AGRA, helpful to build the Taj Mahal of regular expressions, and named after the Indian city Agra, best known for the Taj Mahal:

A for Anchor
G for Greed
R for Repeat
A for Atomic

next  Everything you always wanted to know about
 the many pieces of regex syntax that
 start with the letters (?





Be the First to Leave a Comment






All comments are moderated.
Link spammers, this won't work for you.

To prevent automatic spam, we require that you type the two words below before you submit your comment.