Using Regular Expressions with Perl


If you're less interested in Perl regex in itself than in using Perl to build powerful command-line regex one-liners, visit the page on that topic.


A Word about Perl Delimiters

Before we start, a quick word about delimiters around Perl patterns is in order. You'll usually see Perl regex patterns expressed between forward slashes, as in /this pattern/, which is short for m/this pattern/. But you don't have to use forward slashes.

In the "long form", where m for "match" or s for "substitute" precedes the pattern, you can use almost any non-whitespace character as the delimiter. For instance, m~some pattern~ is valid. As discussed elsewhere, this is extremely convenient when your input would otherwise require you to escape forward slashes, as in anything involving HTML tags or URLs.
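As a minimal sketch of the delimiter choices described above, the snippet below extracts the host name from a URL three ways and then performs a substitution with # delimiters. The sample URL and the variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen because it is full of forward slashes.
my $url = 'http://example.com/docs/index.html';

# Default delimiters: every literal slash in the pattern must be escaped.
my $with_slashes = $url =~ /^http:\/\/([^\/]+)\// ? $1 : '';

# Long form with tildes: no escaping needed.
my $with_tildes = $url =~ m~^http://([^/]+)/~ ? $1 : '';

# Paired delimiters such as braces work too.
my $with_braces = $url =~ m{^http://([^/]+)/} ? $1 : '';

# The same freedom applies to substitutions: s#pattern#replacement#
(my $secure = $url) =~ s#^http://#https://#;

print "$with_slashes\n$with_tildes\n$with_braces\n$secure\n";
```

All three matches capture the same host name; only the amount of escaping differs.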



What makes Perl regex special?

By now, the bulk of Perl regex syntax has drifted to other engines. For instance, the PCRE engine, while not identical to Perl regex, supports esoteric Perl syntax such as backtracking control verbs and branch reset. In some cases, the flow runs the other way: recursion and named groups appeared in PCRE first and were later adopted by Perl.

Other engines have also extended regex syntax in useful directions and can, in some respects, be said to be ahead of Perl. In that category, I would place .NET's infinite lookbehind, capture collections, character class subtraction, balanced groups and right-to-left matching mode; and the fuzzy matching from Matthew Barnett's regex engine for Python.

So what makes Perl regex special today is not its syntax, unless we are talking about Perl 6 regex (the language has since been renamed Raku), which is another planet altogether and miles away from mainstream adoption.

I'll fully admit to not being fluent in Perl (I fumble around every time I need to do something more complicated than a Perl regex one-liner), but my impression as an outsider is that what makes Perl regex special today is two things:

✽ Regex integrates intimately within Perl code
✽ You can use code inside your regular expressions

These two things, of course, reduce to one: regex is tightly interwoven into the fabric of Perl. Indeed, to an outsider, Perl code often looks like one big regular expression.

Let me give you what I consider an exquisite example of the power afforded by integrating code within regular expressions. Consider this line of code:

if ('abc' =~ /\w+(?{print "$&\n";})(*F)/) {}
The first thing to notice is that the =~ operator (which stands for matches) does the heavy lifting performed by a match function in other languages. So the regular expression is not an argument in a function—it is specified directly on the right side of the =~ operator, between the / delimiters. How compact!

Forget the (?{print "$&\n";}) fragment for a moment. The regex pattern itself is no more than \w+(*F): match some word characters, then fail to match the (*F) token (the forced-failure token, which never matches), causing the engine to backtrack and gradually give up word characters while looking for another way to match. The magic is that each time the engine passes the \w+, before failing, it reads a capsule that contains a small piece of injected Perl code:

(?{print "$&\n";})
The code itself is inside the braces: a single print statement print "$&\n"; that outputs the current match (it helps to know that $& is a special variable that contains the match, just as $1 contains the content of capture group 1). As a result, the program prints the list of temporary matches at each point where the engine finishes matching \w+, corresponding to a full path exploration:
abc
ab
a
bc
b
c
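As a hedged variation on the same idea, the capsule can push each temporary match onto an array rather than print it, which makes the backtracking path easier to inspect afterwards. The names @seen and $matched are made up for this sketch.

```perl
use strict;
use warnings;

my @seen;   # hypothetical name: collects each temporary match

# Same pattern as above, except that the code capsule pushes $& onto an
# array instead of printing it. (*F) forces a failure after every attempt,
# so the overall match never succeeds.
my $matched = 'abc' =~ /\w+(?{ push @seen, $& })(*F)/;

print "Overall match succeeded? ", ($matched ? "yes" : "no"), "\n";
print "Temporary matches: @seen\n";
```

The overall match fails by design; the useful result is the side effect accumulated in @seen.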

And if that doesn't leave you in awe of Perl regular expressions… maybe nothing will.

Please note that via the (?C…) callout syntax, PCRE aims to provide similar functionality to Perl's "code capsules".


About this Page

At the moment, I am not planning a fully fleshed-out guided tour of Perl regex, although I certainly intend to add plenty of tasty material to this page over time. My pages are always in motion.

In the meantime, I don't want to leave you Perl coders high and dry, so I have something special to get you started. Actually, two things: the first is a page on using Perl to build powerful command-line regex one-liners. The second is the complete program that follows.


A Perl program that shows how to perform common regex tasks

Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks—and some not-so-common ones as well.

This is what I have for you in the following complete Perl regex program. It's taken from my page about the best regex trick ever, and it performs the six most common regex tasks. The first four tasks answer the most common questions we use regex for:

✽ Does the string match?
✽ How many matches are there?
✽ What is the first match?
✽ What are all the matches?
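For a plain pattern where the whole match is what you want, these four questions can each be answered in a line or two. This is only a sketch with invented sample data and variable names; the full program further down takes a different route because it inspects Group 1.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my $subject = 'Tarzan11 met Jane, then Tarzan22 and Tarzan34';
my $regex   = qr/Tarzan\d+/;

# Does the string match? A match in boolean context answers yes or no.
my $is_match = $subject =~ $regex ? 1 : 0;

# What are all the matches? With /g in list context, the match returns them all.
my @all = $subject =~ /$regex/g;

# How many matches are there? Count the list.
my $count = scalar @all;

# What is the first match? Wrap the pattern in a group: without /g,
# a match in list context returns the captured groups.
my ($first) = $subject =~ /($regex)/;

print "Match? $is_match\nCount: $count\nFirst: $first\nAll: @all\n";
```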

The last two tasks perform two other common regex tasks:

✽ Replace all matches
✽ Split the string
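In their simplest form, these two tasks might look like the sketch below. The sample string and variable names are invented, and the /r modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

my $subject = 'apples, pears;  plums ,figs';   # hypothetical sample data

# Replace all matches: /g replaces every occurrence, and /r (Perl 5.14+)
# returns the modified copy instead of changing $subject in place.
my $replaced = $subject =~ s/\s*[,;]\s*/ | /gr;

# Split the string on the same pattern.
my @parts = split /\s*[,;]\s*/, $subject;

print "$replaced\n";
print "$_\n" for @parts;
```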

If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in Perl. Bear in mind that the code inspects values captured in Group 1, so you'll have to tweak… but you'll have a solid base to understand how to do basic things, and fairly advanced ones as well.

As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a Perl pro might look at the code and see a more idiomatic way of testing an empty value or iterating a structure. If some idiomatic improvements jump out at you, please shoot me a comment.

Please note that usually you will choose to perform only one of the six tasks in the code, so your own code will be much shorter.


#!/usr/bin/perl
use strict;
use warnings;

# Group 1 captures the Tarzan\d+ tokens we want. The first two alternatives
# match, and thereby skip, the contexts we want to exclude.
# The literal braces are escaped so that newer Perls do not complain.
my $regex   = '\{[^}]+\}|"Tarzan\d+"|(Tarzan\d+)';
my $subject = 'Jane" "Tarzan12" Tarzan11@Tarzan22 {4 Tarzan34}';

# Put Group 1 captures in an array
my @group1Caps = ();
while ($subject =~ m/$regex/g) {
	# Group 1 is undefined when one of the excluded alternatives matched
	push(@group1Caps, $1) if defined $1;
}

######## The six main tasks we're likely to have ########

# Task 1: Is there a match?
print "*** Is there a Match? ***\n";
if (@group1Caps > 0) { print "Yes\n"; }
else { print "No\n"; }

# Task 2: How many matches are there?
print "\n*** Number of Matches ***\n";
print scalar(@group1Caps);

# Task 3: What is the first match?
print "\n\n*** First Match ***\n";
if (@group1Caps > 0) { print $group1Caps[0]; }

# Task 4: What are all the matches?
print "\n\n*** Matches ***\n";
foreach (@group1Caps) { print "$_\n"; }

# Task 5: Replace the matches
# With /e the replacement is evaluated as code: Group 1 matches become
# "Superman", and anything else is put back unchanged via $&.
(my $replaced = $subject) =~ s/$regex/defined $1 ? "Superman" : $&/eg;
print "\n*** Replacements ***\n";
print $replaced . "\n";

# Task 6: Split
# Start by replacing by something distinctive,
# as in Task 5. Then split.
my @splits = split(/Superman/, $replaced);
print "\n*** Splits ***\n";
foreach (@splits) { print "$_\n"; }







Smiles,

Rex
