Using Regular Expressions with PHP


With the preg family of functions, PHP has a great interface to regex! Let's explore how it works and what it has to offer.

Pattern Delimiters

The first and most important thing to know about the preg functions is that they expect you to frame the regex patterns you feed them with one delimiter character on each side. For instance, if you choose "~" as a delimiter, for the regex pattern \b\w+\b, this is the string you would feed to a preg function: '~\b\w+\b~'

For the delimiter, you can choose any character that is not alphanumeric, whitespace or a backslash. But choose wisely, because if your delimiter appears in the pattern, it needs to be escaped. The forward slash is a popular delimiter, and strangely so since it needs to be escaped in all sorts of strings having to do with file paths. For instance, to match http://, do you really want your regex string to look like '/http:\/\//'?

Doesn't '~http://~' look better?

Rare characters such as "~", "%", "#" or "@" are more sensible and fairly popular choices.

I don't like the "#" because it clashes with the # you use in comment mode. Esthetically, my favorite is the tilde ("~") because it meets three criteria. First, it is discreet, which allows the actual regex to stand out. Many delimiters look like they belong to the expression, and that is confusing. Second, tildes rarely occur in my patterns, so I almost never have to escape them. Third, it is my favorite, which allows me to introduce some circular logic in this paragraph.
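To make the delimiter trade-off concrete, here is a minimal sketch. The sample URL and the variable names are made up for illustration. It runs the same regex with slash and tilde delimiters, then uses preg_quote() to escape the chosen delimiter in text that comes from a variable.

```php
<?php
// Hypothetical sample text; any string containing "http://" would do.
$url = 'Visit http://example.com today';

$with_slashes = '/http:\/\//';   // "/" as delimiter: every slash in the pattern is escaped
$with_tildes  = '~http://~';     // "~" as delimiter: nothing to escape

$a = preg_match($with_slashes, $url);
$b = preg_match($with_tildes, $url);

// When part of the pattern comes from a variable, preg_quote() escapes
// regex metacharacters, plus the delimiter passed as second argument.
$literal = 'http://';
$built   = '~' . preg_quote($literal, '~') . '~';
$c = preg_match($built, $url);

var_dump($a, $b, $c); // all three patterns find the same text
```

Both spellings describe the same regex; the tilde version is simply easier to read.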


Pattern Modifiers: either Inline or as Flags

The second thing to know about PHP regex patterns is that you can change their meaning by using modifiers, either as flags or inline. For instance, to look for "blob\d+" in case-insensitive fashion, you can add the "i" modifier in these two ways:

• As a flag at the end of the pattern: ~blob\d+~i
• Inline at the start of the pattern: ~(?i)blob\d+~

I tend to prefer inline modifier syntax, first because it jumps out at you when you start reading the regex, second because it is more portable across other regex flavors, and third because you can turn it off further down the string (for instance, (?-i) turns off the case-insensitive modifier).
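As a quick sketch of the two syntaxes (the test strings are invented for the occasion), the snippet below applies the "i" modifier as a flag, then inline, then turns it off halfway through a pattern with (?-i).

```php
<?php
$subject = 'BLOB42';

// Flag after the closing delimiter
$flag = preg_match('~blob\d+~i', $subject);

// Inline modifier at the start of the pattern
$inline = preg_match('~(?i)blob\d+~', $subject);

// Inline modifier switched off midway: "blob" is case-insensitive,
// but the "x" that follows (?-i) must be lowercase.
$partial_hit  = preg_match('~(?i)blob(?-i)x~', 'BLOBx');
$partial_miss = preg_match('~(?i)blob(?-i)x~', 'BLOBX');

var_dump($flag, $inline, $partial_hit, $partial_miss);
```

The first three calls should report a match and the last one should not, since its final "X" is uppercase.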

The modifiers page explains all the flags and shows how to set them. It also presents PCRE's Special Start-of-Pattern Modifiers, which include little-known modifiers such as (*LIMIT_MATCH=x).

Whatever you do, never use the cursed U flag or the (?U) modifier because they will draw a gang of raptorexes to your cubicle (not a good look!). The lowercase u flag, on the other hand, is fine: it makes the engine treat the pattern and the subject as UTF-8 strings. PCRE has no inline (?u) form; the in-pattern counterpart is the start-of-pattern modifier (*UTF8).
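Here is a small sketch of what the u flag changes. It assumes the subject is valid UTF-8 and spells the accented letter out as bytes so that the file's own encoding does not matter.

```php
<?php
$e_acute = "\xC3\xA9"; // "é" encoded as two UTF-8 bytes

// Without u, the dot matches a single byte, so ^.$ cannot cover both bytes.
$as_bytes = preg_match('~^.$~', $e_acute);

// With u, the dot matches one whole code point.
$as_chars = preg_match('~^.$~u', $e_acute);

var_dump($as_bytes, $as_chars);
```

In other words, without the flag the engine counts bytes; with it, it counts characters.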


The Preg functions

There are five major functions in the preg family:
preg_match
preg_match_all
preg_replace
preg_replace_callback
preg_split


Matching Once with Preg_Match()

This function is the most commonly seen in the world of PHP regex. It returns 1 if the pattern matches and 0 if it does not (or false if an error occurs), so you can test the result like a boolean. If you include a variable's name as a third parameter, such as $match in the example below, when there is a match, the variable will be filled with an array: element 0 for the entire match, element 1 for Group 1, element 2 for Group 2, and so on.

But a code box is worth a thousand words, so consider the following example.

$subject='Give me 10 eggs';
$pattern='~\b(\d+)\s*(\w+)$~';

$success = preg_match($pattern, $subject, $match);
if ($success) {
	echo "Match: ".$match[0]."<br />"; 
	echo "Group 1: ".$match[1]."<br />"; 
	echo "Group 2: ".$match[2]."<br />"; 
	}

Output:
Match: 10 eggs
Group 1: 10
Group 2: eggs

Notice how $match[0] contains the overall match? Considering that $match[1] contains Group 1, this is equivalent to saying that the whole match is "Group 0", which is in tune with an idea presented in the section about capturing vs. matching: "The Match is Just Another Capture Group".
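For what it's worth, here is a small sketch of the two outcomes, reusing the pattern above. The second subject string and the variable names are invented. It also unpacks the match array in one go, which is handy once you know the layout (whole match first, then the groups).

```php
<?php
$pattern = '~\b(\d+)\s*(\w+)$~';

// A subject that matches: preg_match returns 1 and fills $match.
$found = preg_match($pattern, 'Give me 10 eggs', $match);
[$whole, $count, $item] = $match;   // element 0, Group 1, Group 2

// A subject with no digits: preg_match returns 0 and $no_match is an empty array.
$missed = preg_match($pattern, 'Give me some eggs!', $no_match);

var_dump($found, $whole, $count, $item, $missed, $no_match);
```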


Finding All Matches with Preg_Match_All()

This terrific function gives you access to all of the pattern's matches. The matches (and the captured groups if any) are returned to an array. Depending on your needs, you can ask the function to organize the array of results in two distinct ways.

Consider this string and a regex pattern to match its lines:

$airports= 'San Francisco (SFO) USA
Sydney (SYD) Australia
Auckland (AKL) New Zealand';

$regex = '%(?m)^\s*+([^(]+?)\s\(([^)]+)\)\s+(.*)$%';

You want to isolate the airport's city, the airport code and the country. Here are the two ways to organize the array.

First Presentation: in the Order of the Pattern's Groups

In both presentations, $hits will contain the number of matches found (including 0 if none are found).

$hits = preg_match_all($regex,$airports,$matches,PREG_PATTERN_ORDER);

The output is below. Element 0 contains an array with the whole matches; element 1 contains an array with the Group 1 matches; element 2 contains an array with the Group 2 matches; and so on. This order (whole match, Group 1, Group 2, Group 3) can be said to be "the order of the regex pattern".

The flag for this presentation is PREG_PATTERN_ORDER (think of it as "the order of the regex pattern"). This is actually the function's default behavior, so you can freely drop the PREG_PATTERN_ORDER flag when you call the function.

Array
(
    [0] => Array    // The Whole Matches
        (
            [0] => San Francisco (SFO) USA
            [1] => Sydney (SYD) Australia
            [2] => Auckland (AKL) New Zealand
        )
    [1] => Array     // The Group 1 Matches
        (
            [0] => San Francisco
            [1] => Sydney
            [2] => Auckland
        )
    [2] => Array     // The Group 2 Matches
        (
            [0] => SFO
            [1] => SYD
            [2] => AKL
        )
    [3] => Array     // The Group 3 Matches
        (
            [0] => USA
            [1] => Australia
            [2] => New Zealand
        )
)	

Second Presentation: ordered by SET (one set for each match)

Again, $hits contains the number of matches found (including 0 if none are found).

$hits = preg_match_all($regex,$airports,$matches,PREG_SET_ORDER);

The output is below. Note that the outer array is organized "one SET for each match at a time". Element 0 contains an array with the first match (that array's element 0 is the whole match, element 1 is Group 1, element 2 is Group 2…) Element 1 contains an array with the second match (that array's element 0 is the whole match, element 1 is Group 1, element 2 is Group 2…)

Sometimes, this structure is exactly what you want. The flag for this presentation is PREG_SET_ORDER (think of it as "ordered by set").

Array
(
    [0] => Array     // The First Match
        (
            [0] => San Francisco (SFO) USA
            [1] => San Francisco
            [2] => SFO
            [3] => USA
        )
    [1] => Array     // The Second Match
        (
            [0] => Sydney (SYD) Australia
            [1] => Sydney
            [2] => SYD
            [3] => Australia
        )
    [2] => Array     // The Third Match
        (
            [0] => Auckland (AKL) New Zealand
            [1] => Auckland
            [2] => AKL
            [3] => New Zealand
        )
)	

To remember the flags, try to understand them as "in the order of the regex pattern" (PREG_PATTERN_ORDER), or "ordered by set" (PREG_SET_ORDER).
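To see when each layout is convenient, here is a minimal sketch using the airport data above (the newlines are written as \n so the snippet does not depend on the file's line endings). Pattern order is handy when you want one column, such as all the codes; set order is handy when you loop over the matches one record at a time.

```php
<?php
$airports = "San Francisco (SFO) USA\nSydney (SYD) Australia\nAuckland (AKL) New Zealand";
$regex = '%(?m)^\s*+([^(]+?)\s\(([^)]+)\)\s+(.*)$%';

// Pattern order: $by_pattern[2] is the whole "Group 2 column".
preg_match_all($regex, $airports, $by_pattern, PREG_PATTERN_ORDER);
$codes = $by_pattern[2];

// Set order: each element is one complete match, easy to unpack in a loop.
preg_match_all($regex, $airports, $by_set, PREG_SET_ORDER);
$lines = [];
foreach ($by_set as [, $city, $code, $country]) {   // skip element 0, the whole match
    $lines[] = "$code: $city, $country";
}

print_r($codes);
print_r($lines);
```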




Replacing with Preg_Replace()

For straight replacements (for instance, replacing '10' with '20'), you don't really need regex. In such cases, str_replace can be faster than the preg_replace regex function: $string=str_replace('10','20',$string);

The preg_replace function comes in when you need a regex pattern to match the string to be replaced, for instance if you only wanted to replace '10' when it stands alone but not when it is part of "101" or "File10".
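The difference can be sketched as follows. The sample sentence is invented, and \b word boundaries are just one way to express "stands alone".

```php
<?php
$subject = 'Take 10 from File10, not 101, then 10 more';

// str_replace touches every "10", including those inside File10 and 101.
$plain = str_replace('10', '20', $subject);

// With word boundaries, only the standalone 10s are replaced.
$regex = preg_replace('~\b10\b~', '20', $subject);

echo $plain, "\n", $regex, "\n";
```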

By default, the function replaces all of the matches in the original string, so make sure this is what you want. If you want to replace only 1 or 5 instances, specify this limit as a fourth argument.


Here is an example.

$subject='Give me 12 eggs then 12 more.';
$pattern='~\d+~';
$newstring = preg_replace($pattern, "6", $subject);
echo $newstring;

The Output:
Give me 6 eggs then 6 more.

This code replaces the two instances of "12" with "6". If you wanted to only replace the first instance, you would set the limit (1) as a fourth argument:

$newstring = preg_replace($pattern, "6", $subject,1);

This would output "Give me 6 eggs then 12 more."

If you want to know how many replacements are made, add a variable as a fifth parameter. This forces you to set the fourth parameter (the limit number of replacements). To set no limit, use -1. For instance, with

$newstring = preg_replace($pattern, "6", $subject,-1, $count);

The value of $count would be 2.

Using Captured Groups in the Replacement
In the replacement string, you can refer to capture groups. Group 1 is \1 or $1, Group 2 is \2 or $2, and so on. This means that the replacement string '\2###\1' will replace the matched text with the content of Group 2 followed by three hashes and the content of Group 1. (Write it in single quotes: inside a double-quoted PHP string, \2 and \1 would be read as octal character escapes.)

This technique is often used when you want to rearrange the sequence of a string. You might match a whole big string full of unwanted fluff, capture the portions you are interested in, and rearrange them how you like.
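As an illustration, here is a small sketch that rearranges a "Last, First" name (the sample data is made up). It also shows the ${1} form, which is useful when a group reference is immediately followed by a digit.

```php
<?php
$name = 'Brando, Marlon';
$regex = '~^(\w+),\s*(\w+)$~';

// Both reference styles produce the same result.
$dollar    = preg_replace($regex, '$2 $1', $name);
$backslash = preg_replace($regex, '\2 \1', $name);

// ${1}0 means "Group 1, then a literal 0"; a bare $10 would be read as Group 10.
$braces = preg_replace('~(\d+)~', '${1}0', 'Give me 12 eggs');

echo $dollar, "\n", $backslash, "\n", $braces, "\n";
```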


Note that as it makes one replacement after another, the regex engine keeps working on the original string—rather than switching to the latest version of the string.

For instance, using the string abcde, let's use the regex (?<=a)\w, which matches one word character preceded by an a:

$string = preg_replace('~(?<=a)\w~','a','abcde');

This produces aacde: only the "b" was replaced, because in the original string it is the only character that is preceded by an "a". If, on the other hand, the regex engine switched to the latest version of the string after making each substitution, when it came to "c", that character would also be preceded by an "a", and we would end with aaaaa.

Replacing an Invisible Delimiter
This is a trick that regex lovers are sure to enjoy. It is closely related to the technique of Splitting with an Invisible Delimiter, so I explain it in that section.


Sophisticated Replacements with Preg_Replace_Callback()

It's neat that preg_replace allows you to manipulate the replacement string by referring to captured groups. But let's face it, often you want to operate some far more complex substitutions on the text you match. This is when preg_replace_callback comes to the rescue.

Instead of specifying a literal replacement (or a replacement composed of literals and capture groups), preg_replace_callback lets you specify a replacement function. That function does its magic on the matched pattern and returns the replacement, which preg_replace_callback then plugs into place in the original string.

For instance, suppose you have a string where you need the last letter of each word to be converted to uppercase. First we'll look at the basic syntax, then we'll see an "inline syntax" that is more economical. In both cases, we'll use this regex:

\b(\w+)(\w)\b

This pattern simply matches each word separately (thanks to the \b word boundaries). As it does so, it captures all of a word's letters except its last into Group 1, and it captures the final letter into Group 2. (For this task, we're assuming that each word has at least two letters, so we're okay.)

Here's the basic way of doing the replacement.

$string = ("cool kids capitalize final letters");
$regex = "~\b(\w+)(\w)\b~";
$newstring = preg_replace_callback($regex,"LastToUpper",$string);
function LastToUpper($m) {
   return $m[1].strtoupper($m[2]);
   }
echo $newstring;

The Output: cooL kidS capitalizE finaL letterS

In the example above, you can see how preg_replace_callback specifies the name of the function that produces the replacement strings: "LastToUpper". The function LastToUpper is then defined. We know that preg_replace_callback sends one parameter to the substitution function, so we specify it and call it—arbitrarily—$m.

This $m that preg_replace_callback sends to the substitution function is the current match array, in the same form as the match array of preg_match. This means that $m[0] is the overall match, while $m[1] is Group 1, $m[2] is Group 2, and so on. This makes it easy for LastToUpper to return the word with the last letter capitalized: it is Group 1 (the initial letters) concatenated with the uppercase version of Group 2 (the last letter).

Here we did something simple, but you can appreciate how easy it would be to infuse our substitution with more logic. Suppose, for instance, that we want to capitalize the last letter of each word, but that when that letter is an "s", we want to substitute a "Z". Easily done: we just burn that logic into the callback function.

function LastToUpper($m) {
    $last = $m[2]=="s" ? "Z" : strtoupper($m[2]);
    return $m[1].$last;
   }
   
The Output: cooL kidZ capitalizE finaL letterZ

Lighter Version: Use an Anonymous Function
Usually, we have no use for the substitution function except for the particular regex we're working on. The second method is the same, except that instead of passing a function name in the second argument, we define the function "inline" in the call to preg_replace_callback.

$string = ("cool kids capitalize final letters");
$regex = "~\b(\w+)(\w)\b~";
$newstring = preg_replace_callback($regex,
      function($m) {return $m[1].strtoupper($m[2]);}
	  ,$string);
echo $newstring;

Same Output: cooL kidS capitalizE finaL letterS	

As you can see, our callback function has no name: it's an anonymous function, so we don't pollute the global namespace.
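To show a replacement that a plain replacement string could not express, here is one more sketch (the sentence is invented): the anonymous callback does arithmetic on each match, doubling every number it finds.

```php
<?php
$string = 'Give me 12 eggs then 3 more.';

$doubled = preg_replace_callback(
    '~\d+~',
    function ($m) {
        // $m[0] is the whole match; convert it, double it, return a string.
        return (string) ((int) $m[0] * 2);
    },
    $string
);

echo $doubled;
```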

With this, you're equipped to make some powerful substitutions.


Splitting with Preg_Split()

You are probably familiar with the explode() function, which takes some text with elements delimited by a string (such as a comma, or three stars: ***) and splits the text along the delimiter, fanning the elements into an array.

For instance, the following would print an array with "break", "my" and "string".

$string = ("break***my***string");
print_r(explode("***",$string));

Well, preg_split is the "adult" version of explode(). It too will split a string, but it will allow you to use variable delimiters, making it easy to extract interesting bits of text with unwanted (but specifiable) gunk in the middle.

For instance, let's assume that this time, the delimiter (or unwanted part) is a C-style comment (with optional spaces on the side for good measure), such as "/* This part is useless to us */". For the purpose of this example, we assume that we know that the delimiters are single C-style comments, meaning that there are no nested comments (that's a different exercise related to matching balanced parentheses).

No worries. The following will output "better", "regex", "today".

$string = ("better /* I want to improve */ regex/***COOL***/today");
$regex = "~\s*/\*.*?\*/\s*~";
print_r(preg_split($regex,$string));

The Output: Array ( [0] => better [1] => regex [2] => today )

Like preg_replace, preg_split has an optional parameter (in third place) that allows you to set a limit on the number of elements you want to fan to the array. There are also some flags that you can read about on the preg_split manual page.
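Here is a brief sketch of the limit parameter and of one of those flags, PREG_SPLIT_NO_EMPTY. The second sample string is made up to show what happens when the subject starts with a delimiter.

```php
<?php
$regex = '~\s*/\*.*?\*/\s*~';

// Limit of 2: split at the first delimiter only; the rest stays in one piece.
$string = 'better /* I want to improve */ regex/***COOL***/today';
$two = preg_split($regex, $string, 2);

// A subject that starts with a delimiter yields an empty first element...
$leading  = '/* lead */better/* mid */regex';
$with_gap = preg_split($regex, $leading);

// ...unless PREG_SPLIT_NO_EMPTY is set (-1 means "no limit").
$no_gap = preg_split($regex, $leading, -1, PREG_SPLIT_NO_EMPTY);

print_r($two);
print_r($with_gap);
print_r($no_gap);
```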

And now, here's a way of looking at things that's sure to interest the algorithm lovers among you:

Often, you can use preg_split instead of preg_match_all. In a way, both return matches. While preg_match_all specifies what you want, preg_split specifies what you want to remove. (Or, as we'll see below, what we want to set apart.)
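A tiny sketch of that equivalence, on an invented string: matching "runs of word characters" and splitting on "runs of non-word characters" produce the same array here.

```php
<?php
$text = 'red, green;blue  yellow';

// Say what you want: every run of word characters.
preg_match_all('~\w+~', $text, $m);
$matched = $m[0];

// Say what you want to remove: every run of non-word characters.
$split = preg_split('~\W+~', $text);

print_r($matched);
print_r($split);
```

The two approaches only diverge at the edges: if the text began or ended with a separator, preg_split would return an empty element there.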


Splitting without Losing
Sometimes you want to split a string without removing anything from it. Or we might only want to remove a certain section. Imagine a long ribbon with consecutive colors: red, blue, red, blue, red… So far, the splitting we have seen would remove all the reds to produce an array with all the blues. But another use of preg_split is to split the string into an array with the correct "bands of red and blue". For this, we use a flag: PREG_SPLIT_DELIM_CAPTURE.

Here's how it works. In the example below, our delimiter is a series of digits, for instance "123". Instead of throwing them away, we want to keep them.

$str = "We123Like456Delimiters";
$regex = "~(\d+)~";
print_r(preg_split($regex,$str,-1,PREG_SPLIT_DELIM_CAPTURE));

The Output:
Array: [0]=>We [1]=>123 [2]=>Like [3]=>456 [4]=>Delimiters

In our preg_split call, the third parameter -1 just states we don't want to limit the number of matches. What PREG_SPLIT_DELIM_CAPTURE actually does is to insert any captured groups into the array. This is why the (\d+) was in parentheses: we include the whole delimiter into the array.

But we don't have to keep the entire delimiter. Imagine for instance that your delimiter is of the form @@ABC123, where ABC are three capital letters and 123 are three digits. If you want to fan "ABC" and "123" into the array but lose the "@@", you would do this:

$str = "token1@@ABC123token2@@DEF456token3";
$regex = "~@@([A-Z]{3})(\d{3})~";
print_r(preg_split($regex,$str,-1,PREG_SPLIT_DELIM_CAPTURE));

The Output:
Array: [0]=>token1 [1]=>ABC [2]=>123 [3]=>token2 [4]=>DEF [5]=>456 [6]=>token3


Splitting with an Invisible Delimiter
Here is a lovely feature of splitting strings with regex. The preg_split function allows you to split a string with an invisible delimiter. For instance, consider a movie title written in camel case (perhaps because it was in a file name): TheDayMyVoiceBroke. You're interested in retrieving each word. But what's the delimiter?

There is an "invisible" delimiter: any position where the next character is a capital letter. This can be expressed as a simple lookahead: (?=[A-Z]). You could call that a "zero-width delimiter".

Let's see it at work:

$string = ("TheDayMyVoiceBroke");
$regex = "~(?=[A-Z])~";
$words = preg_split($regex,$string);
print_r($words);

The Output:
Array ( [0] => [1] => The [2] => Day [3] => My [4] => Voice [5] => Broke )

Magical!
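One detail in that output: element 0 is empty, because the position before the initial "T" also qualifies as a delimiter. If you'd rather not have it, a minimal sketch is to add the PREG_SPLIT_NO_EMPTY flag.

```php
<?php
$string = 'TheDayMyVoiceBroke';
$regex  = '~(?=[A-Z])~';

// Plain split: the zero-width match at the very start produces an empty first element.
$raw = preg_split($regex, $string);

// With PREG_SPLIT_NO_EMPTY (and -1 for "no limit"), empty pieces are dropped.
$words = preg_split($regex, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($raw);
print_r($words);
```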

But maybe we want to concatenate the words of the movie into a string, with spaces between the words? Before you reach for implode(" ",$words), consider that what we just did with preg_split, we can do with preg_replace. Here is the code and the output.

Replacing an Invisible Delimiter
$string = ("TheDayMyVoiceBroke");
$regex = "~(?=[A-Z])~";
echo preg_replace($regex," ",$string);

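The Output:
 The Day My Voice Broke

Note the leading space: the position before the initial "T" is also followed by a capital letter, so it receives a space too. To avoid it, trim the result with ltrim(), or use the regex ~\B(?=[A-Z])~, which does not match at the start of this string.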


More About preg Functions

The above functions have a few settings I haven't shown. PHP also has a few other preg functions, but they are of minor interest compared with the ones presented here. You can read about them in the preg function section of the PHP manual.

In Chapter 10.4 ("Missing" preg Functions) of Mastering Regular Expressions, Jeffrey Friedl also presents three functions he has programmed to "round off" the preg functions. I recommend you read the book, but if you're in a hurry you can find the functions in the code section of regex.info, Jeffrey's website. Hit Ctrl + F to search for "preg_regex_to_pattern", "preg_pattern_error" and "preg_regex_error".


A Powerful Lookbehind Alternative: \K

If your version of PHP is 5.2.4 or later (phpinfo is your friend), you can use a wonderful PCRE escape sequence: \K. In the middle of a pattern, \K says "reset the beginning of the reported match to this point". Anything that was matched before the \K goes unreported, a bit like in a lookbehind.

For example, on the string "Marlon Brando", the pattern (?i)marlon \Kbrando will return "Brando". Well, you could get "Brando" with a capture group or a lookbehind, so what's the big deal?

The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow open-ended quantifiers such as * or +: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.

For instance, let's say you want to match "Brando xx" in "Marlon Brando xx" (where xx are digits) but only if the string sits somewhere between a <tag> and a </tag>. You can't look behind for the start of the tag because you don't know how many characters are before "Marlon Brando", and variable-length lookbehinds are forbidden in PCRE.

One option is to match everything and capture "Brando xx" in a Group. Option 2 is to use \K, saving us the overhead of a capture group:

(?i)<tag>(?:(?!</tag).)*marlon \Kbrando \d+
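Here is a sketch of both options on an invented subject string. With \K, the text you want is the overall match; with the capture group, it lands in Group 1.

```php
<?php
$subject = '<tag>Starring Marlon Brando 24 and others</tag>';

// Option 2 from the text: \K drops everything matched so far from the reported match.
$with_k = '~(?i)<tag>(?:(?!</tag).)*marlon \Kbrando \d+~';
preg_match($with_k, $subject, $m);

// Option 1: match everything and capture the interesting part in Group 1.
$with_group = '~(?i)<tag>(?:(?!</tag).)*marlon (brando \d+)~';
preg_match($with_group, $subject, $g);

// Without an opening <tag>, neither pattern can match.
$outside = preg_match($with_k, 'Marlon Brando 24');

var_dump($m[0], $g[1], $outside);
```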


A Full "Advanced" PHP regex program that shows how to perform common regex tasks

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks—and some not-so-common ones as well.

This is what I have for you in the following complete PHP regex program. The program is featured on my page about the best regex trick ever.

This program performs the six most common regex tasks. The tweak is that it has no interest in the overall matches: the data we're seeking is in capture Group 1, if it is set.

As a side-benefit, the program and the article happen to provide an excellent overview of the (*SKIP)(*FAIL) syntax available in Perl and PHP. Just search throughout the article.

✽ Here is the article's Table of Contents
✽ Here is the explanation for the code
✽ Here is the PHP code


More about PHP Regex

For more details on PHP's PCRE regex flavor, I recommend a stroll through three pages of the PHP manual:

Pattern syntax
Modifiers (e.g. case insensitive)
Functions (e.g. preg_match)

If you are serious about learning all there is to know about PHP's PCRE regex flavor, then sooner or later you will want to head over to my PCRE documentation repository. With the permission of Philip Hazel, the creator of PCRE, this page contains the documentation for the latest PCRE release as well as other historical releases. It also contains a table showing in which versions of PCRE new syntax features were introduced, as well as links to other PCRE-related material on the site.


