Regex Cookbook


This page presents recipes for regex tasks you may have to solve. If you learn by example, this is a great spot to spend a regex vacation.

The page is a work in progress, so please forgive all the gaps: I thought it would be preferable to have an incomplete page now than a complete page in 25 years—if that is possible. I also haven't proofed this page as thoroughly as the others, so please report any bugs using the form at the bottom.

In O'Reilly's Regular Expressions Cookbook, many of the recipes focus on ultra-specialized tasks such as matching Canadian postal codes or US social security numbers. I have a lot of respect for Jan Goyvaerts, but for me that's a little weak.
To me, many of the "recipes" are a repeat of the same general concept. I don't find this approach conducive to challenging the mind and expanding one's understanding of regular expressions. So here I try the approach I would have liked to see in the book. This page tries to present expressions that are "topologically different" from one another to expose you to as many uses of regex syntax as possible—hoping to help the regex student improve his or her fluency. Making all examples "different" is not always possible (or desirable), but that's the general idea.

It's hard to place expressions into neat categories as there can be considerable overlap, but here is the general organization of the expressions on this page:

1. Capturing
2. Validating
3. Finding
4. Replacing and Inserting

I'll acknowledge that these distinctions are often a bit arbitrary, but they give you different things to look at.

Capturing

How do I parse the values of a complex string, such as a url's GET parameters? [parsing]
How do I capture text inside a set of parentheses? [parentheses]
How do I match text inside a set of parentheses that contains other parentheses? [complex parentheses]

How do I parse the values of a complex string, such as a url's GET parameters?
Suppose you wanted to extract the values for day, name and fruit from this string: site.org?day=7&name=adam&fruit=apple
It is very likely you would have ready-made tools to extract these values, such as the GET array. But if you had to do it with regex, you could use this:
\?day=(\d)&name=([^&]+)&fruit=(\w+)
The values are captured in Groups 1, 2 and 3. If not all strings contained all the parameters, you could make the components optional:
\?(?:day=(\d))?&?(?:name=([^&]+))?&?(?:fruit=(\w+))?
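
As a rough illustration, here is how the two expressions above might be applied. This is a minimal sketch in Python (the samples elsewhere on this page are PHP); the helper name parse_params and the constant names are hypothetical.

```python
import re

# Strict expression from the recipe: all three parameters, in this order.
STRICT = re.compile(r"\?day=(\d)&name=([^&]+)&fruit=(\w+)")

# Lenient expression from the recipe: each parameter is optional.
LENIENT = re.compile(r"\?(?:day=(\d))?&?(?:name=([^&]+))?&?(?:fruit=(\w+))?")

def parse_params(url, pattern=STRICT):
    """Return (day, name, fruit) from Groups 1-3, or None if nothing matches."""
    m = pattern.search(url)
    return m.groups() if m else None

print(parse_params("site.org?day=7&name=adam&fruit=apple"))
print(parse_params("site.org?name=adam", LENIENT))
```

With the lenient expression, a parameter that is absent simply leaves its group unset (None in Python).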

How do I capture text inside a set of parentheses?
This is a common request on forums. You have a file with text such as Acapulco airport (ACA) and you want to grab the text in the parentheses.

Here is a recipe to capture the text inside a single set of parentheses: \(([^()]*)\)

First we match the opening parenthesis: \(. Then we greedily match any number of any character that is neither an opening nor a closing parenthesis (we don't want nested parentheses for this example): [^()]*. This is the content of the parentheses, and it is placed within a set of regex parentheses in order to capture it into Group 1. Last, we match the closing parenthesis: \).
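
A quick way to try the recipe, sketched in Python. The function name inner_texts is made up for this example.

```python
import re

# An opening parenthesis, a captured run of non-parenthesis characters,
# then a closing parenthesis.
PAREN = re.compile(r"\(([^()]*)\)")

def inner_texts(text):
    """Return the Group 1 capture of every match in text."""
    return PAREN.findall(text)

print(inner_texts("Acapulco airport (ACA)"))   # ['ACA']
print(inner_texts("equation1: (1+(a+b))"))     # ['a+b'], the innermost pair only
```

The second call shows the limitation mentioned above: with nested parentheses, only the innermost pair matches.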

How do I match text inside a set of parentheses that contains other parentheses?
This requires a small tweak and a regex flavor that supports recursion. We're still going to match the opening parenthesis at the very start and the closing parenthesis at the very end. Inside, we'll match "stuff that's not parentheses" (or nothing), followed by zero or more sequences of (i) a repeat of the whole pattern (expressed below as (?R)), and (ii) more "stuff that's not parentheses" (expressed below as (?2)).

\((([^()]*+)(?:(?R)(?2))*)\)

I can't guarantee that this works in every situation as recursive patterns are fickle, but here's PHP code that tests the expression on various sets of nested parentheses.

<?php
$regex='~\((([^()]*+)(?:(?R)(?2))*)\)~';
$strings=array('Airport: (ACA)','equation1: (1+(a+b))','equation2: (1+(a+b)+c)','equation3: (1+(a+b)+(2+2)+c)',
'equation4: (1+(a+b)+(2+(7/5)-2)+c)');
foreach($strings as $string)
if(preg_match($regex,$string,$match)) echo $string.' <b>capture:</b> '.$match[1].'<br />';
?>
This is a bit different from the expression offered by Jeffrey Friedl in Mastering Regular Expressions: (?:[^()]++|\((?R)\))*, which you'd have to tweak before you could pop it in the code above in order to capture the contents of the parentheses: \(((?:[^()]++|(?R))*)\). In my tests, I have found this expression to be up to twenty percent faster when the match works as planned, but slower by the same amount when a parenthesis is missing.

Validating

What you can validate, you can also search for, so this section is also about finding.

How do I validate that a number is over 15? [values]
How do I validate that a time string is well-formed? [formats]
How do I validate that a list is made of certain items, in any order? [unordered list]
How can I validate that a string contains the text "75", but only once? [password validation technique]
How can I validate that a binary string contains ten 1s at the most? [reverse password validation technique]

How do I validate that a number is over 15?
This example gives you an idea of what you have to do to validate numbers within a certain range using regular expressions, and of why you should probably look for other methods first. Because you are working on the string rather than on values, you have to think of the position of the digits that may be used to create the numbers within your range. Here are two approaches to validating a number over 15 (both expressions treat 15 itself as valid).

^(?:1[5-9]|[2-9]\d|[1-9]\d\d+)$
With this approach, we progress in numerical order with multiple alternations, first trying to match numbers between 15 and 19, then numbers between 20 and 99, then numbers 100 and above.

^(?:1(?:[5-9]|\d\d+)|[2-9]\d+)$
With this approach, we look at two cases: either the first digit is a 1, or it is anything else.
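
One way to convince yourself that the two expressions are equivalent is to check them against a plain numeric comparison. This is a small Python sketch; the names BY_RANGE, BY_FIRST_DIGIT and agrees are hypothetical, and the comparison uses n >= 15 because both expressions accept 15.

```python
import re

BY_RANGE = re.compile(r"^(?:1[5-9]|[2-9]\d|[1-9]\d\d+)$")
BY_FIRST_DIGIT = re.compile(r"^(?:1(?:[5-9]|\d\d+)|[2-9]\d+)$")

def agrees(n):
    """True if both expressions accept str(n) exactly when n >= 15."""
    s = str(n)
    expected = n >= 15
    return (bool(BY_RANGE.match(s)) == expected
            and bool(BY_FIRST_DIGIT.match(s)) == expected)

print(all(agrees(n) for n in range(2000)))
```

Of course, if you can convert the string to a number, the comparison on the right-hand side is all you need.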

How do I validate that a time string is well-formed?
Here's an expression I came up with.
^(?:([0]?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d(?:\s?(?(1)(am|AM|pm|PM)))?$

It matches times in a variety of formats, such as 10:22am, 21:10, 08h55 and 7.15 pm.
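
Python's re module supports the (?(1)…) conditional used here, so the expression can be tried as is. In this sketch the function name is_time and the sample strings beyond the four listed above are my own.

```python
import re

# Group 1 is set only for 12-hour values, so the conditional (?(1)...)
# allows an am/pm suffix after those hours and nowhere else.
TIME = re.compile(
    r"^(?:([0]?\d|1[012])|(?:1[3-9]|2[0-3]))[.:h]?[0-5]\d"
    r"(?:\s?(?(1)(am|AM|pm|PM)))?$"
)

def is_time(s):
    return TIME.match(s) is not None

for sample in ["10:22am", "21:10", "08h55", "7.15 pm", "21:10pm", "10:75"]:
    print(sample, is_time(sample))
```

The last two samples are rejected: 21 is not a 12-hour value, so no suffix is allowed after it, and 75 is not a valid minute count.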

How do I validate that a list is made of certain items, in any order?
Scenario: you want to make sure that the string only contains items from a list, delimited by a comma (for instance). These items could be objects, numbers, names. For instance: 212, 415, 850. Here is a general solution:

Example 1: ^(?:peas|onions|carrots)(?:,(?:peas|onions|carrots))*+$
Example 2: ^(?:415|212|850)(?::(?:415|212|850))*+$ (note that here the delimiter is a colon.)

Explanation: You need one of the words to be present at least once. Then it is optionally followed by a comma and another word, multiple times.

If you are using PCRE, you can use the subroutine syntax (?1), which repeats the pattern of Group 1, for a more compact, maintainable expression: ^(peas|onions|carrots)(?:,(?1))*+$
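
In a language without subroutine calls, you can get the same maintainability by building the expression from the item list. Here is a minimal Python sketch; the helper list_validator is hypothetical, and it uses a plain * where the examples above use the possessive *+, an assumption made so that it runs on older Python versions.

```python
import re

def list_validator(items, delimiter=","):
    """Build an expression that accepts only delimited lists drawn from items."""
    alternatives = "|".join(re.escape(item) for item in items)
    sep = re.escape(delimiter)
    # One item, then zero or more (delimiter + item) sequences.
    return re.compile(rf"^(?:{alternatives})(?:{sep}(?:{alternatives}))*$")

veg = list_validator(["peas", "onions", "carrots"])
codes = list_validator(["415", "212", "850"], delimiter=":")

print(bool(veg.match("onions,peas,peas")))
print(bool(veg.match("onions,beans")))
print(bool(codes.match("212:415:850")))
```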

How can I validate that a string contains the text "75", but only once?
This is similar to the password validation presented on the Lookaround page: you set a number of conditions before matching the string.
^(?=.*?75)(?!.*?75.*?75).*$
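
The first lookahead requires one occurrence of 75, and the negative lookahead rejects a second one. A short Python check, with a hypothetical helper name:

```python
import re

ONCE_75 = re.compile(r"^(?=.*?75)(?!.*?75.*?75).*$")

def has_75_once(s):
    return ONCE_75.match(s) is not None

for sample in ["175 apples", "75 and 75", "no digits here", "7575"]:
    print(sample, has_75_once(sample))
```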

How can I validate that a binary string contains ten 1s at the most? ("reverse password validation technique")
This is a variation on the password validation technique: we look ahead to make sure that the string does not contain what we don't want, then we match.

^(?!0*(?:10*){10}1)[01]+$

After anchoring the expression, in the negative lookahead, we build a generic binary string that has at least 11 ones. This is what we don't want. To build that string, we state that it can start with any number of zeroes. Once the zeroes are consumed, we have a one, followed by optional zeroes. That's our first one. We repeat this ten times, bringing us to ten ones. Finally, we add one last one to get over the limit.
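
To check the logic, this Python sketch compares the expression with a plain count of the ones on every binary number below 2**12. The function name at_most_ten_ones is made up for the example.

```python
import re

AT_MOST_TEN_ONES = re.compile(r"^(?!0*(?:10*){10}1)[01]+$")

def at_most_ten_ones(s):
    return AT_MOST_TEN_ONES.match(s) is not None

# Every binary string from '0' to '111111111111' (twelve ones).
samples = [format(n, "b") for n in range(2 ** 12)]
print(all(at_most_ten_ones(s) == (s.count("1") <= 10) for s in samples))
```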

Finding

How do I match a number with one to ten digits? [boundaries]
How can I match all lines except those that contain a certain word? [exclusion]
How can I match paragraphs that contain MyWord, but only proper paragraphs starting with two carriage returns? [paragraphs]
Match numbers followed by letter or end of string
Match pairs of characters in the correct slots

Okay, let's start easy.

How do I match a number with one to ten digits?
You could do something like \b\d{1,10}\b. The boundaries are there to make sure you don't match a portion of a twenty-digit number when you really only want to match a number that has ten digits at the most. For this kind of simple syntax, I really recommend you print out the cheat sheet.
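
A short Python demonstration of what the boundaries buy you. The sample text is invented.

```python
import re

UP_TO_TEN_DIGITS = re.compile(r"\b\d{1,10}\b")

text = "id 42, phone 5551234567, serial 12345678901234567890"
print(UP_TO_TEN_DIGITS.findall(text))   # ['42', '5551234567']
```

No part of the twenty-digit serial number is matched, because a \b can never sit between two digits.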

How can I match all lines except those that contain a certain word?
Typically, this would be used in a case where you want to capture something on each line, except those that present certain features. Let's go with the simple case where you want to match all lines, except those that contain "BadWord". This will match your lines:

(?m)^(?!.*?BadWord).*$

If you want to exclude BadWord only when it stands on its own, set it apart with the \b boundary:

(?m)^(?!.*?\bBadWord\b).*$

Also note that this is a potential application of the best regex trick ever, for which I won't repeat the details—but know that you'll need to examine Group 1 captures, for which the page provides you with sample code in various languages.

(?m)^.*?\bBadWord\b.*$|(^.*$)
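
Here is how the lookahead version and the Group 1 version compare in practice, as a Python sketch on an invented four-line text. The variable names are hypothetical.

```python
import re

text = ("keep this line\n"
        "this has BadWord in it\n"
        "BadWordy is fine here\n"
        "last line")

# Lookahead version: lines with the standalone word never match.
by_lookahead = re.findall(r"(?m)^(?!.*?\bBadWord\b).*$", text)

# Alternation version: unwanted lines match too, but leave Group 1 unset,
# so we keep only the matches where Group 1 was captured.
by_group = [m.group(1)
            for m in re.finditer(r"(?m)^.*?\bBadWord\b.*$|(^.*$)", text)
            if m.group(1) is not None]

print(by_lookahead)
print(by_group)
```

Both lists contain the first, third and fourth lines.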

Match numbers followed by letter or end of string
In the string 00-11A22B33_44, suppose you are interested in matching numbers, provided they are followed by a letter or the end of the string.

You can solve that with: \d+(?=[A-Z]|$)
The lookahead (?=[A-Z]|$) asserts: what follows is either an uppercase character, or the end of the string—exactly what we want. The trick here is to not be shy to use the $ anchor in a context where it is not on its own, at the end of the string. Dollars are people too!

If you've only seen basic regex tutorials, you could be forgiven for assuming that the ^ anchor only ever appears at the very beginning of an expression, while the $ anchor always sits quietly at the very end.

You can use anchors anywhere in your pattern. They are assertions like any other.
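
On the sample string from this recipe, a Python check gives:

```python
import re

NUM_BEFORE_LETTER_OR_END = re.compile(r"\d+(?=[A-Z]|$)")

print(NUM_BEFORE_LETTER_OR_END.findall("00-11A22B33_44"))   # ['11', '22', '44']
```

The 00 and the 33 are skipped because they are followed by a hyphen and an underscore, while the 44 qualifies thanks to the $ inside the lookahead.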

How can I match paragraphs that contain MyWord, but only proper paragraphs starting with two carriage returns?
This question is about finding text within specific formatting. If a paragraph starts with a single carriage return, you are not interested. You are only interested in the first paragraph or those set off by two carriage returns.

On systems where a carriage return only inserts a newline character (such as Unix), you could start with this:

(?m)^(?<=^\A|\n\n).*?MyWord.*$

The lookbehind ensures that the line is either the first in the text, or that it is preceded by two newlines. On Windows, in the place of \n\n, you would want \r\n\r\n.

For something portable, on PCRE, use \R, which matches any newline sequence. Your expression would look like this:

(?m)^(?<=^\A|\R\R).*?MyWord.*$
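
Some engines need an adaptation here. Python's re module, for instance, rejects a lookbehind whose alternatives have different widths, and it has no \R. The sketch below is one possible rewrite under those constraints, not the expression above: it moves \A out of the lookbehind and assumes Unix newlines. The helper name paragraph_pattern is hypothetical.

```python
import re

def paragraph_pattern(word):
    """Match a line containing word that opens the text or follows a blank line."""
    # \A handles the first paragraph; the fixed-width lookbehind handles
    # paragraphs set off by two newlines.
    return re.compile(rf"(?m)^(?:\A|(?<=\n\n)).*?{re.escape(word)}.*$")

text = ("First MyWord paragraph\n"
        "same paragraph, MyWord again\n"
        "\n"
        "Second has MyWord too\n"
        "\n"
        "Third does not")

print(paragraph_pattern("MyWord").findall(text))
```

The second line contains the word but is preceded by a single newline, so it is skipped.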

Match pairs of characters in the correct slots
Suppose you want to match all two-digit numbers that start with a 6. Further, you think of your string as a series of pairs, so you would want to match 68 in 116822, but not in 168122.

Let's proceed step by step. To match the first pair that starts with a 6, you could use
^(?:[^6].)*(6\d) and retrieve the match from Group 1. The anchor ^ ensures that we start looking at the beginning of the string. The non-capture group (?:[^6].)* matches zero or more pairs of characters that do not start with a 6 (using the parity trick to stay in sync with the two-character slots in the string), then the parentheses around (6\d) capture our match to Group 1.

In Perl, PCRE (PHP, R…) or Ruby 2+, we could do away with the capturing group and match the string directly by using \K, which forces the engine to drop what was matched previously: ^(?:[^6].)*\K6\d. Likewise, in .NET, we could use infinite lookbehind: (?<=^(?:[^6].)*)6\d

But we don't want to match just one pair: from 00611122665564, we want to extract 61, 66 and 64. This is a place where the match continuation anchor \G comes in very handy. \G matches the beginning of the string, or the position immediately following the previous match. It is supported in .NET, Perl, PCRE (PHP, R…), Java and Ruby. It will ensure that our second and next matches do not fall out of sync with the two-character slots in the string. Here is the general option with capture groups:

\G(?:[^6].)*(6\d)
In engines that support \K, we would use \G(?:[^6].)*\K(6\d) to get a direct match.
And in .NET, we would use an infinite lookbehind: (?<=\G(?:[^6].)*)(6\d)
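
Python's re module has neither \G nor \K, but the same idea can be approximated by anchoring each attempt at the end of the previous match. In this sketch, pattern.match(string, pos) stands in for \G; that substitution and the function name sixes_in_slots are my assumptions, not part of the recipe.

```python
import re

# Same body as the \G expression above, minus the \G itself.
PAIR = re.compile(r"(?:[^6].)*(6\d)")

def sixes_in_slots(s):
    """Collect every two-character slot that starts with a 6."""
    found, pos = [], 0
    while True:
        # match() only succeeds at pos, which plays the role of \G.
        m = PAIR.match(s, pos)
        if m is None:
            return found
        found.append(m.group(1))
        pos = m.end()

print(sixes_in_slots("00611122665564"))   # ['61', '66', '64']
```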


Replacing and Inserting

I suggest you try to think of the regex replace feature of your language or text editor as not only a way to replace text, but also to insert. Remember that a regex pattern can match not only text strings but also positions in text. For instance, the pattern ^ matches the beginning of a string or line (depending on the engine and mode), and (?=@) matches a position preceding an at sign (@), without matching the characters themselves. When you use a replacement function on a position match, where no actual characters are matched, you are not really replacing anything: rather, you are inserting characters at the matched position.

Insert text at the beginning (or end) of a line
How do I replace one tag delimiter with another? ["surround" replacement]
How do I replace the string "//" in a whole file, but only when it is part of a path? [selective replacement]
How do I replace curly quotes ("smart quotes") with straight quotes? [utf8]
How do I convert a whole string to lowercase except certain words? [selective transformation]
How do I replace all words that appear on the black list, but not those on the white list? [black list]
How do I fix unclosed tags? [forced failure]

Insert text at the beginning (or end) of a line
To insert text at the beginning of a line, we simply match the position at the beginning of the line, without matching any characters. To do so, in all engines except Ruby, we must turn on multi-line mode, which allows the ^ anchor to match at the beginning of lines.

For instance, in .NET, Java, Perl, PCRE (PHP, R, …) and Python, you can use this regex to search:
(?m)^ and replace with your chosen line prefix.

Likewise, to insert a suffix at the end of lines, you can use this regex to search:
(?m)$ and replace with your chosen line suffix.
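
In Python, for example, both insertions are one-liners with re.sub. The prefix, suffix and sample text are arbitrary choices for this sketch.

```python
import re

text = "first\nsecond\nthird"

# Each match is an empty position, so nothing is removed: text is inserted.
prefixed = re.sub(r"(?m)^", "> ", text)
suffixed = re.sub(r"(?m)$", ";", text)

print(prefixed)
print(suffixed)
```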

How do I replace one tag delimiter with another?
Let's say you want to replace [square brackets] with <pointy brackets> without changing the stuff in the brackets.

Search: \[([^]]+)]
This search expression matches an opening bracket, then anything that is not a closing bracket, then a closing bracket. The content of the brackets is captured in Group 1.

Replace: <\1>
The replacement expression just places the capture (Group 1) within a brand new set of pointy brackets.
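
The same search and replace, sketched in Python (where \1 in the replacement also refers to Group 1):

```python
import re

def to_pointy(text):
    """Swap [square] delimiters for <pointy> ones, keeping the content."""
    return re.sub(r"\[([^]]+)]", r"<\1>", text)

print(to_pointy("see [square brackets] and [more]"))
```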

How do I replace the string "//" in a whole file, but only when it is part of a path?
Let's say in a page of text you want to replace all instances of // or \\ with a single forward slash. No problem, that's what your replace function is designed to do. In PHP: $string=preg_replace('~//|\\\\\\\\~','/',$string); (each literal backslash is escaped once for the regex and once more for the PHP string, so matching two of them takes eight backslashes in the source).
By the way, this is a great example of why something like a tilde (~) often works better than / as a delimiter. With / as a delimiter, the regex would look like this: $string=preg_replace('/\/\/|\\\\\\\\/','/',$string);.

The real "problem" is if you wanted to replace all instances of //, but only in parts of your text file that look like this: Document=root//folder1//folder2//(maybe_more_folders)//file.extension
✽ You can't do a plain replace, as instances of // that you don't want to touch would also be replaced.
✽ You can't capture the various parts of the file path into groups and build a generic replace string, because you don't know how many subfolders are in the string.

For this kind of problem, I use three distinct solutions depending on the context and my mood.

Solution #1: Variable-Width Lookbehind.
This simple solution works in ABA and RegexBuddy (.NET flavor), which have variable-width lookbehinds. You search for (?<=Document=.*)// and replace with a single slash.

Solution #2: Replace function with Callback.
This solution works if your programming language has a replace function that allows you to call another function. The replace function passes the whole match. The "callback function" works on the match and returns the replacement string. In this instance, Document=[^/]*+(?>//[^\s]+) matches the type of string we are looking for. In PHP, we can use:
$string=preg_replace_callback('~Document=[^/]*+(?>//[^\s]+)~',
               function ($match) {return str_replace('//','/',$match[0]);},
               $string);

Solution #3: Multiple replacements.
This solution works in environments where you can run a replace operation multiple times (until you exhaust any replacements to be made). For instance, in this case, we can safely assume that no path will have a hundred subfolders, so we can run the replace operation a hundred times. On my system, I can run this kind of operation in Directory Opus (for file renaming) and EditPad Pro.

The trick here is to build an expression that will continue to match the string you want to alter, even after you have made several replacements. In our example, (Document=[^/]*+(?>/(?!/)[^/\s]+)*+)(//) will capture everything before the first // in Group 1, then capture the first // in Group 2. You replace the match with \1 and a single /, then you repeat the operation as many times as necessary.
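
If you happen to be in a programming language rather than a text editor, the repeated replacement is a simple loop. This Python sketch stops as soon as a pass makes no replacement; the function name fix_paths and the cap of 100 rounds are assumptions, and the possessive and atomic syntax is swapped for plain groups so the sketch runs on older Python versions.

```python
import re

# Group 1: "Document=" plus any path segments already separated by a single
# slash. Group 2: the first remaining double slash.
STEP = re.compile(r"(Document=[^/]*(?:/(?!/)[^/\s]+)*)(//)")

def fix_paths(text, max_rounds=100):
    for _ in range(max_rounds):
        text, count = STEP.subn(r"\1/", text)
        if count == 0:      # nothing left to replace
            break
    return text

print(fix_paths("see http://x.org Document=root//folder1//folder2//file.extension"))
```

The // in the URL is left alone because it is not part of a Document= path.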

How do I replace curly quotes ("smart quotes") with straight quotes?
This is not a hard regex problem: we just want to replace some characters with another character. It's a character set problem. You need to know every Unicode code point for curly quotes (or, for single-byte "extended ASCII" text such as Windows-1252, the few byte values). The regex is self-explanatory: I'll just give you the solution, first for utf-8 then for ASCII.

For utf-8 text (which is what I have on my website), I use the two replace lines in the code below.

<?php
$string='“Take me to ‘the station’ ”, he said.';
echo 'Before: '.$string.'<br />';
$string=preg_replace('~[\x{0091}\x{0092}\x{2018}\x{2019}\x{201A}\x{201B}\x{2032}\x{2035}]~u',"'",$string); // single curly quotes
$string=preg_replace('~[\x{0093}\x{0094}\x{201C}\x{201D}\x{201E}\x{201F}\x{2033}\x{2036}]~u','"',$string); // double curly quotes
echo 'After: '.$string;
?>

Output:
Before: “Take me to ‘the station’ ”, he said.
After: "Take me to 'the station' ", he said.

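For text in a single-byte "extended ASCII" encoding such as Windows-1252 (plain 7-bit ASCII has no curly quotes), you can use this:
<?php
// Bytes 0x91 to 0x94 (decimal 145 to 148) are the Windows-1252 curly quotes.
$string="\x93Take me to \x91the station\x92 \x94, he said.";
echo 'Before: '.$string.'<br />';
$string=preg_replace('~[\x91\x92]~',"'",$string); // single curly quotes
$string=preg_replace('~[\x93\x94]~','"',$string); // double curly quotes
echo 'After: '.$string;
?>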

How do I convert a whole string to lowercase except certain words?
Input: Tomatoes AND orangeS AND ParsleY
We want to convert the whole sentence to lowercase, except the word AND. Here are three ways to handle this.

1. Match all words except AND, and replace them with their lowercase version using a callback function (preg_replace_callback in PHP).
Match: (?!\bAND\b)\s*\b\w+\s*

Here is a working example:
<?php
$string='Tomatoes AND orangeS AND ParsleY';
$regex='~(?!\bAND\b)\s*\b\w+\s*~';
$string=preg_replace_callback($regex,function ($m) {return strtolower($m[0]);} ,$string);
echo $string;
?>

2. Progressively match the whole string, capturing word groups in Group 1 and 'AND' in Group 2, then rebuild the string.
This is heavier programmatically, but, according to my benchmarks (running each piece of code a million times), it is about 33% faster, thanks to the averted callbacks.

<?php
$string='Tomatoes AND orangeS AND ParsleY';
$regex=',((?!\bAND\b)\s*\b\w+\s*)(\bAND\b|$),';
preg_match_all($regex, $string, $matches, PREG_PATTERN_ORDER );
$size=count($matches[1]);
$string='';
for ($i=0;$i<$size;$i++) $string.=strtolower($matches[1][$i]).$matches[2][$i];
echo $string."<br />";
?>

3. Use the best regex trick ever, for which I won't repeat the details, but know that you'll need to examine Group 1 captures, for which the page provides you with sample code in various languages. Here, the left side of the alternation matches the AND we want to leave alone, and Group 1 captures every other word:

\bAND\b|(\w+)
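
For this particular task, one way to apply the trick is to match AND on the left side of an alternation and capture every other word in Group 1. The expression \bAND\b|(\w+) in the Python sketch below is my own adaptation of the trick to this input, and the function name lower_except is hypothetical.

```python
import re

def lower_except(text, keep="AND"):
    """Lowercase every word except the standalone word keep."""
    pattern = re.compile(rf"\b{re.escape(keep)}\b|(\w+)")
    # Group 1 is set only for words we want to transform; when it is unset,
    # the match is the protected word and is returned unchanged.
    return pattern.sub(
        lambda m: m.group(1).lower() if m.group(1) else m.group(0),
        text,
    )

print(lower_except("Tomatoes AND orangeS AND ParsleY"))
```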

How do I replace all words that appear on the black list, but not those on the white list?
Let's say you want to replace all instances of the word sax with '###', even when it is part of other words such as "saxophone", but not when it is part of "Essax" and other words on a white list. And let's say you have a whole blacklist of "bad words" besides "sax", each word with its own whitelist of acceptable uses.

Crafting a custom regex for each word is a bit long. The easier procedure is to replace each instance of the "bad words" that occur in a white list word with something distinctive. For instance, add "@@@" to the end of every white list word that contains "sax"—turning "Essax" into "Essax@@@". With a simple lookahead, you can then replace sax everywhere, except when it is part of a word that ends in "@@@": sax(?!\w*@@@). Last, all you have to do is zap all the "@@@".
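
The three steps (mark, replace, unmark) might look like this in Python. The function name censor and its parameters are hypothetical, and the sketch assumes the marker never occurs in the original text.

```python
import re

def censor(text, bad="sax", whitelist=("Essax",), marker="@@@"):
    # Step 1: tag every whitelisted word with the marker.
    for word in whitelist:
        text = re.sub(rf"\b{re.escape(word)}\b",
                      lambda m: m.group(0) + marker, text)
    # Step 2: replace the bad word unless the rest of its word ends in the marker.
    text = re.sub(rf"{re.escape(bad)}(?!\w*{re.escape(marker)})", "###", text)
    # Step 3: zap the markers.
    return text.replace(marker, "")

print(censor("The saxophone from Essax is a sax"))
```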

How do I fix unclosed tags?
Here is an example I'm particularly fond of because it's a great use of conditionals. The problem: in this string
a<1bc<2>3>de<<4f5g
the numbers are supposed to live in complete tags, like so: <1>
Sometimes the opening tag is missing, sometimes the closing tag is missing, sometimes there are multiple opening tags, sometimes the tag is properly formed. To match these numbers, if we make both tags optional, as in <*(\d+)>*, then we will erroneously match the 5, which has no tag at all. To ensure there is at least one tag, one solution is to say "match opening tags and optionally match closing tags, OR optionally match opening tags and match closing tags". This looks like this:
Match: <+(\d+)>*|<*(\d+)>+
Replacement: <\1\2>

This works great, but the alternation can give the engine a lot of work. Isn't there a way to say "at least one of the tags has to be present"? With conditionals, there is:

Match: (<)*(\d+)(>)*(?(1)|(?(3)|(?!)))
Replacement: <\2>
The first part of the expression matches optional opening tags, a number, and optional closing tags. The opening tags are captured in Group 1. The number is captured in Group 2. The closing tags are captured in Group 3. After all this matching takes place (without using an alternation), a conditional expression checks that at least one of the two tags was present (and therefore captured). Here is the logic:
IF Group 1 was captured: (?(1)…THEN no need to match anything,
OTHERWISE (no Group 1 capture),
IF Group 3 was captured: (?(3)…THEN no need to match anything,
OTHERWISE (neither tag group was captured), THEN fail: (?!).

The key here is to force the regex to fail unless we are happy with the match. (See forced regex failure on the tricks page for more about forcing a regex to fail.)
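
Both versions can be tried in Python, whose re module supports conditionals and the empty negative lookahead (?!). In this sketch the constant names are made up, and the alternation version uses a callback to pick whichever number group was captured.

```python
import re

ALTERNATION = re.compile(r"<+(\d+)>*|<*(\d+)>+")
CONDITIONAL = re.compile(r"(<)*(\d+)(>)*(?(1)|(?(3)|(?!)))")

s = "a<1bc<2>3>de<<4f5g"

# Alternation: the number is in Group 1 or Group 2, depending on the branch.
fixed_a = ALTERNATION.sub(lambda m: "<" + (m.group(1) or m.group(2)) + ">", s)

# Conditional: the number is always in Group 2.
fixed_c = CONDITIONAL.sub(r"<\2>", s)

print(fixed_a)
print(fixed_c)
```

Both print a<1>bc<2><3>de<4>f5g: the untagged 5 is left alone.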

Now tell me… how neat is that?

Smiles,

Rex

