PCRE Callouts


PCRE has a terrific feature: callouts, specified with the syntax (?C…), where the dots stand for an optional argument. For instance, (?C), (?C12) and (?C'beyond the digits') are all valid callouts.

If you call PCRE's matching function in the standard way, when the engine encounters (?C…), it ignores it and continues its match attempt.

However, if you specify a callout function before calling PCRE's matching function, then when the engine encounters (?C…), it temporarily suspends the match and passes control to that callout function, to which it provides information about the match so far. The callout function then performs any task you see fit, then it returns a code to the engine, letting it know whether to proceed normally with the rest of the match.

This feature enables PCRE to supply similar functionality to Perl's code capsules.

The goal of this page is to provide you with working code to get you started with callouts.

Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

Basic Syntax
The Callout Function
Testing PCRE Callouts
Testing PCRE Callouts using PCRE.NET
Program 1: Exploring Substrings
Program 2: Exploring Callout Properties
Program 3: Debugging with Auto Callout
Program 4: Infinite Lookbehind
Uses for Callouts
Further Details



Basic Syntax

The basic syntax for a callout within a PCRE pattern is (?C…)
The optional argument in the dots takes two forms: either an integer or a string, as in (?C12) or (?C'beyond the digits').

The argument is passed to the callout function, which can choose to use it or ignore it.

Argument as Identifier
One way to use the callout arguments is as identifiers. If you have several callouts in your pattern, the callout function can then test the value of the identifier to handle various cases.

Argument as Value
You can also instruct the callout function to do something directly with the value of the argument. For instance, if you have several callouts with integers—say (?C8), (?C16), (?C32)—the callout function might use the value in an expression. Likewise, if the argument is a string, the callout function might use that value directly, for instance by displaying it.
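As a minimal sketch of the identifier idea, the callout below switches on the callout number to tell two callouts apart. It assumes the PCRE.NET wrapper presented further down this page (PcreRegex, PcreCalloutResult, callout.Number); the pattern, the subject and the Describe helper are made up for illustration.

```csharp
using System;
using PCRE;

class Program
{
    // Hypothetical helper: maps a callout number to a label
    public static string Describe(int number)
    {
        switch (number)
        {
            case 1: return "digits branch";
            case 2: return "letters branch";
            default: return "unknown callout";
        }
    }

    static void Main()
    {
        // (?C1) sits at the end of the digits branch,
        // (?C2) at the end of the letters branch
        var regex = new PcreRegex(@"\d+(?C1)|[a-z]+(?C2)");
        regex.Match("abc", callout =>
        {
            // The number tells us which callout in the pattern was reached
            Console.WriteLine("Callout " + callout.Number + ": "
                              + Describe(callout.Number));
            return PcreCalloutResult.Pass;
        });
    }
}
```

The same switch could just as well use the number as a value, for instance by feeding it into a computation instead of a lookup.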

Argument as Code
In a dynamic language, a string callout argument might even contain a piece of code to be evaluated at run-time… Go for it, implement that on your company's website, everyone will love that security feature!

Form of the Argument
- You can omit the argument and just use (?C), in which case the argument will be set to 0.
- If using an integer argument, the value must be 255 or less.
- If using a string argument, various delimiters are possible: a set of {curly braces}, or a matching pair of any one of the characters in the class [`'"^%#$]
- The delimiter can be escaped within the string by doubling it, as in (?C'What''s Up?')



The Callout Function

The callout function can perform any tasks you see fit. It receives a lot of information about the match: the current position in the string and the pattern, the temporary match, and more—we'll explore these values in Program 2.

Return Values
After the callout function has done its job, you make it return a value to the engine.

- A zero tells the engine to resume its match attempt where it left off.
- A positive value tells the engine to fail at the current position in the pattern, just like a (?!) construct or a (*FAIL). This causes the engine to start backtracking in search of a matching path.
- A negative value tells the engine to fail the overall match (the current match attempt fails, and no further attempts are made).
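To compare the three kinds of return value side by side, here is a small sketch that runs the same pattern three times with a callout that always returns the same code. It assumes the PCRE.NET names used later on this page (PcreCalloutResult.Pass, Fail and Abort for zero, positive and negative, and PcreOptions.NoAutoPossess, which is explained in the section on autopossess). CountCallouts is a hypothetical helper.

```csharp
using System;
using PCRE;

class Program
{
    // Hypothetical helper: runs one match with a callout that always
    // returns `result`, and reports how many times the callout ran
    public static int CountCallouts(PcreCalloutResult result, out bool success)
    {
        int calls = 0;
        var regex = new PcreRegex(@"\w+(?C)", PcreOptions.NoAutoPossess);
        var match = regex.Match("abc", callout =>
        {
            calls++;
            return result;
        });
        success = match.Success;
        return calls;
    }

    static void Main()
    {
        var results = new[] {
            PcreCalloutResult.Pass,   // zero: resume the match
            PcreCalloutResult.Fail,   // positive: fail here and backtrack
            PcreCalloutResult.Abort   // negative: give up on the whole match
        };
        foreach (var result in results)
        {
            bool success;
            int calls = CountCallouts(result, out success);
            Console.WriteLine(result + ": " + calls
                              + " callout(s), success = " + success);
        }
    }
}
```

If the rules above hold, Pass should report a single callout and a successful match, Fail should report one callout for every path the engine explores before giving up, and Abort should stop right after the first callout without a match.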




Testing PCRE Callouts

To work with PCRE callouts, you either need to be using the PCRE library directly or to work in a language or tool that has implemented the callout feature.

In a tool (such as Notepad++, which supports PCRE), this is unlikely: how would you specify the callout function?

In a language other than C and C++ (which can call PCRE's functions directly), callouts may not be a priority for the developers of the language. For instance, in PHP, the preg_match function makes no room for callouts.

If you're using a .NET language such as C#, Visual Basic or F#, you're in luck.



Testing PCRE Callouts using PCRE.NET

Out of the box, .NET has a terrific regex engine. But on the page about C# regex, I also praise PCRE.NET, an alternate engine for .NET, a wrapper around the PCRE library provided by Lucas Trzesniewski. With this interface to PCRE, you can access all of PCRE's rich syntax, including callouts.

PCRE.NET is a snap to add to any .NET project. In Visual Studio,

✽ Press Ctrl + Q for the Quick Launch window, type nuget and select Manage Nuget Packages for Solution.
✽ In the search window, type pcre.net, making sure that the filters pull-down is set to All.
✽ Install.

To whet your appetite for callouts, I will provide four fully functional C# programs that demonstrate PCRE's callout functionality. The notes after each program explain some salient features of the API.



Program 1: Exploring Substrings

This program replicates the delightful Perl capsule explained on this page:

if ('abc' =~ /\w+(?{print "$&\n";})(*F)/) {}
It prints out substrings of the test string abc:
abc
ab
a
bc
b
c

For the why, please read the explanations.

Here is C# code to do the same. Granted, it is longer than the Perl one-liner, but you already knew that Perl and C# are different beasts.

using System;
using PCRE;

class Program {
    static void Main() {
        string subject = "abc";
        var combo_regex = new PcreRegex(@"\w+(?C'temp: ')(*FAIL)");
        combo_regex.Match(subject, callout => {
            Console.WriteLine(callout.String + callout.Match.Value);
            return PcreCalloutResult.Pass;
        });
        Console.WriteLine("Press Key");
        Console.ReadKey();
    }
}

Here is the output:

temp: abc
temp: ab
temp: a
temp: bc
temp: b
temp: c
Press Key

Callout Specified as Lambda
The key feature is that when we call the Match method, in addition to the standard subject string, we pass the callout function. There are several ways to pass the callout function. In this example, for brevity, we pass a lambda.

If you plan to reuse the callout function, it probably makes sense to pass it as a delegate. We will see how to do that in a later example.

Argument used as Value
One interesting feature is that the callout's argument is the string "temp: "
This string is output on every temporary match report via callout.String

Return Values
Note that the callout returns PcreCalloutResult.Pass
This maps to the zero value that tells the engine to resume the match attempt where it left off. The other possible return values are:

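PcreCalloutResult.Fail, equivalent to 1, telling the engine to fail at the current position in the pattern. The engine then backtracks in search of another path. Only when every path has been exhausted does the current match attempt fail, after which the engine, as usual, advances to the next position in the string and starts a new match attempt.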

PcreCalloutResult.Abort, equivalent to -1, telling the engine to fail the overall match (the current match attempt fails, and the engine does not advance in the string to try other attempts).

Match object discarded
Usually, when we call combo_regex.Match(), we assign the resulting match object to a variable. In this case, we don't care about the match object, so no assignment was made.

Alternate implementation
In the section on the callout function's return values, I mentioned that a positive value acts like a (*FAIL). This means we can obtain the same result as above by removing the (*FAIL) and returning a positive value, which PCRE.NET expresses as PcreCalloutResult.Fail.

This fragment outputs the same temporary matches as before:

string subject = "abc";
var combo_regex = new PcreRegex(@"\w+(?C)", PcreOptions.NoAutoPossess);
combo_regex.Match(subject, callout => {
    Console.WriteLine(callout.Match.Value);
    return PcreCalloutResult.Fail;
});

But there is one subtlety: PcreOptions.NoAutoPossess, coming up next.

The Ghost of Autopossess (and of other Optimizations)
The PcreOptions.NoAutoPossess option sets PCRE's PCRE2_NO_AUTO_POSSESS option, which can also be turned on inline by the (*NO_AUTO_POSSESS) start-of-pattern modifier. (Except that at the moment of writing there seems to be a bug with this latter syntax.)

As a reminder, the autopossess optimization turns some quantifiers into possessive quantifiers when the token that follows is incompatible with the quantified token (there is no shared ground, so no reason to backtrack). For instance, \d+\D is automatically optimized to \d++\D. For the same reason, the \w+ in our pattern is automatically optimized to \w++.

We need to turn that off, otherwise when the callout returns a positive value, the engine cannot backtrack into the atomic \w++, so the match attempt fails without further exploration. In this case, the engine advances to the next position in the string to try the next match attempt, yielding this much shorter output:

abc
bc
c

For the same reason, if you want to make sure that callouts always work as you expect, you should turn off other optimizations as well. Putting all the optimization killers in one place:

- PCRE2_NO_AUTO_POSSESS, set inline with (*NO_AUTO_POSSESS) or in PCRE.NET with PcreOptions.NoAutoPossess

- PCRE2_NO_START_OPTIMIZE, set inline with (*NO_START_OPT) or in PCRE.NET with PcreOptions.NoStartOptimize

- PCRE2_NO_DOTSTAR_ANCHOR, set inline with (*NO_DOTSTAR_ANCHOR) or in PCRE.NET with PcreOptions.NoDotStarAnchor



Program 2: Exploring Callout Properties

This second program is designed to explore the properties of the PcreCallout object passed to the callout function.

The simple pattern (?:([A-Z])\d(?C12))+ matches one uppercase letter followed by one digit, multiple times, for instance Q1G5. After matching each digit, we find the callout token (?C12). The 12 is a simple identifier that is passed to the callout function just in case we want to do something with it, which would come in handy if we had multiple callouts.

Please excuse the minimal indentation: I wanted all lines to fit inside the code box.

using System;
using PCRE;

class Program {
static void Main() {
    // This function shows info about the args it receives
    Func<PcreCallout, PcreCalloutResult> callout_info =
    delegate (PcreCallout info) {
        // In the pattern string, the position after (?C12) is 18
        Console.WriteLine("\nPosition in the Pattern: " + info.PatternPosition);
        // The position in the string when the callout is called:
        // 2, 4, 6
        Console.WriteLine("Position in the String: " + info.CurrentOffset);
        // This will print the 12 in C12
        Console.WriteLine("Callout Number: " + info.Number);
        // If we had a string identifier, as in (?C'combo'), we
        // would access it via info.String. See Program 1.
        // We didn't call Match with a string offset: 0
        Console.WriteLine("StringOffset: " + info.StringOffset);
        // The last group capture: Group 1
        Console.WriteLine("Last Capture Group: " + info.LastCapture);
        // Value of the last capture
        Console.WriteLine("Last Capture: " +
            info.Match.Groups[info.LastCapture].Value);
        // Temporary Match
        Console.WriteLine("Temporary Match: " + info.Match.Value);
        return PcreCalloutResult.Pass;
    };

    var callout_info_regex = new PcreRegex(@"(?:([A-Z])\d(?C12))+");
    string subject = "A1B2C3";
    var firstmatch = callout_info_regex.Match(subject, callout_info);
    if (firstmatch.Success) {
        Console.WriteLine("\nOverall Match: " + firstmatch.Value);
    }
    Console.WriteLine("Press Key");
    Console.ReadKey();
}
}

Here is the output:

Position in the Pattern: 18
Position in the String: 2
Callout Number: 12
StringOffset: 0
Last Capture Group: 1
Last Capture: A
Temporary Match: A1

Position in the Pattern: 18
Position in the String: 4
Callout Number: 12
StringOffset: 0
Last Capture Group: 1
Last Capture: B
Temporary Match: A1B2

Position in the Pattern: 18
Position in the String: 6
Callout Number: 12
StringOffset: 0
Last Capture Group: 1
Last Capture: C
Temporary Match: A1B2C3

Overall Match: A1B2C3
Press Key


Callout Specified as Delegate
In the previous example, we passed the callout function as a lambda. In this example, we create a function that can be reused (it shows information about the callout arguments) so we pass it as a delegate.

The Match method takes one more argument than usual: the callout delegate.

callout_info_regex.Match(subject, callout_info);

Capture Collection
Normally, in PCRE, when a capture group is quantified, as in (?:(\d)\D)+, the engine only returns the last value of the capture group. That is how most engines work. In contrast, the standard .NET engine has a feature called capture collections that lets you examine all intermediate captures.

PCRE callouts take you some of the way in the direction of capture collections. In this example, each pass displays the last capture group.



Program 3: Debugging with Auto Callout

If you set the PCRE2_AUTO_CALLOUT option, the engine acts as though there were a callout before each item in the pattern. Each callout has the same argument: 255, as in (?C255)

This option can be interesting if you want to inspect the progression of a match attempt—perhaps for debugging. Note that this kind of functionality is also offered in PCRE's bundled pcretest utility.

Here is a simple example in PCRE.NET, where the option is called AutoCallout. The simple pattern \d+9\b is designed to cause backtracking against our test string 1492 1999.

using System;
using PCRE;

class Program {
    static void Main() {
        var end_with_9 = new PcreRegex(@"\d+9\b", PcreOptions.AutoCallout);
        string subject = "1492 1999";
        var the_match = end_with_9.Match(subject, call => {
            Console.WriteLine(call.Match);
            return PcreCalloutResult.Pass;
        });
        if (the_match.Success) {
            Console.WriteLine("\nOverall Match: " + the_match.Value);
        }
        Console.WriteLine("Press Key");
        Console.ReadKey();
    }
}

Here is the output:

1492
149
14
149
1
492
49
4
49
92
9
2
1999
199
1999
1999

Overall Match: 1999
Press Key




Program 4: Infinite Lookbehind

One feature famously absent from Perl and PCRE is infinite lookbehind. The following program shows two simple ways of implementing this feature using a callout. The code is shown in C#, but Method 1 would work in any language that provides a full API to PCRE. Note that the code is meant as a stub—for instance error handling is absent.

Before diving in, here are the general ideas.

One Callout to Rule them All
If you're going to use a lot of callouts, and especially some fancy features such as infinite lookbehind, it makes sense to me to make one big callout function that can handle a number of common cases.

You might object that a collection of small methods is better. But remember, when we call the match function, we can only pass a single callout. Your pattern, on the other hand, might include several callouts to which you'd like to assign different tasks. This is when you need one callout that can handle multiple cases. If it grows too big, sure, you can let it distribute the work to other methods, but it remains the one entry point.

In the demo program, the CalloutSwitch delegate checks for callouts of this form: (?C'keyword:action'). Of course other forms can be checked as well. We will implement lookbehind with two methods, which will be passed with callouts in these shapes:

✽ Method 1 (pure PCRE): (?C'infinite_lb:c+ba')
✽ Method 2 (Frankenstein): (?C'.net_lb:(?<=abc+)')

Method 1: Pure PCRE Lookbehind Solution
✽ In the position where you want an infinite lookbehind such as (?<=abc+), place a callout such as (?C'infinite_lb:c+ba'). Note that the lookbehind pattern has been reversed.
✽ In the callout, we parse the reversed lookbehind regex out of the argument: c+ba
✽ We use the string position argument to extract a substring from the subject start up to that point, and reverse that substring.
✽ We attempt to match the reversed regex on the reversed substring, and pass a zero or "force backtrack" return value depending on whether that match attempt succeeds or fails.

Two details are worthy of note:
1. We must anchor the lookbehind pattern, so that the forward matching function only looks for it at the position immediately preceding the cursor. Instead of prepending a ^, we accomplish this with PCRE's PCRE2_ANCHORED option (expressed as PcreOptions.Anchored in PCRE.NET).
2. We disable optimizations (see the Ghost of Autopossess).

Method 2: Frankenstein Solution (PCRE marries .NET regex)
Reversing the pattern in the lookbehind as in the first method is not always obvious (we'll explore some limitations below). For such situations, I provide a Frankenstein solution that calls .NET regex from within a PCRE callout.

This time, the callout looks like (?C'.net_lb:(?<=abc+)')

And now… the code.

using System;
using System.Linq;
using PCRE;
using System.Text.RegularExpressions;

class Program {
    static string subject;

    // simple display of match results
    public static void display_match(PcreMatch theMatch) {
        Console.WriteLine(subject + " => " +
            (theMatch.Success ? "Match = " + theMatch.Value : "No Match"));
    }

    // One Callout to Rule them All
    /* Stub of callout delegate to handle lookbehinds and other constructs.
       Checks for callouts of this form: (?C'keyword:action')
       A lookbehind is specified as:
       Want this: (?<=abc+) => Write this (?C'infinite_lb:c+ba')
       Note that the lookbehind pattern is reversed */
    static Func<PcreCallout, PcreCalloutResult> CalloutSwitch =
    delegate (PcreCallout callData) {
        int pos = callData.CurrentOffset;
        // Check if the callout has this form: (?C'keyword:action')
        // Split on the first colon only, so that the action part
        // may itself contain colons, as in (?:...)
        string[] sides = callData.String.Split(new[] { ':' }, 2);
        if (sides.Length > 1) {
            switch (sides[0]) {
                // Method 1: Pure PCRE
                case "infinite_lb":
                    string subject_behind = subject.Substring(0, pos);
                    // Reverse the subject
                    string lookbehind_subject = new string(subject_behind
                                                    .ToCharArray()
                                                    .Reverse()
                                                    .ToArray());
                    var lookbehind_regex = new PcreRegex(sides[1],
                                                    PcreOptions.Anchored);
                    var lookbehind = lookbehind_regex
                                        .Match(lookbehind_subject);
                    return lookbehind.Success ? PcreCalloutResult.Pass :
                                                PcreCalloutResult.Fail;

                // Method 2: Frankenstein (PCRE marries .NET)
                case ".net_lb":
                    // In the case of a DotNet lookbehind, we expect
                    // something like (?C'.net_lb:(?<=abc+)')
                    // Ensure the lookbehind operates at the right spot
                    var dotnetRegex = new Regex("^.{" + pos.ToString() + "}"
                                                + sides[1]);
                    var dotnetLB = dotnetRegex.Match(subject);
                    return dotnetLB.Success ? PcreCalloutResult.Pass :
                                              PcreCalloutResult.Fail;

                // implement other interesting callouts
                case "neg_infinite_lb":
                    break;
                default:
                    break;
            }
        }
        // We didn't handle the callout: resume
        return PcreCalloutResult.Pass;
    };

    // Test it!
    static void Main() {
        var purePCRELookbehind = new PcreRegex(
            @"(?C'infinite_lb:c+ba')\d+",
            PcreOptions.NoAutoPossess |
            PcreOptions.NoStartOptimize |
            PcreOptions.NoDotStarAnchor
        );
        var frankensteinLookbehind = new PcreRegex(
            @"(?C'.net_lb:(?<=abc+)')\d+",
            PcreOptions.NoAutoPossess |
            PcreOptions.NoStartOptimize |
            PcreOptions.NoDotStarAnchor
        );

        // First subject: this should match 42
        subject = "05 AB99 abcc42 hp16";
        var theMatch = purePCRELookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);
        theMatch = frankensteinLookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);

        // Second subject: this should fail
        subject = "05 AB99 abcd42 hp16";
        theMatch = purePCRELookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);
        theMatch = frankensteinLookbehind.Match(subject, CalloutSwitch);
        display_match(theMatch);

        Console.WriteLine("Press Key");
        Console.ReadKey();
    }
}

Here is the output:

05 AB99 abcc42 hp16 => Match = 42
05 AB99 abcc42 hp16 => Match = 42
05 AB99 abcd42 hp16 => No Match
05 AB99 abcd42 hp16 => No Match

It works!
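One possible variation on Method 2, offered as a sketch and not as part of the program above: the ^.{pos} prefix relies on the dot, which in .NET does not match a line feed unless RegexOptions.Singleline is set, so it could misfire on multi-line subjects. The standard .NET alternative below uses \G together with the Regex.Match(String, Int32) overload, which starts the attempt at a given index while still letting the lookbehind see the text before it. LookbehindAt is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: does `lookbehind` succeed exactly at index `pos`?
    public static bool LookbehindAt(string subject, int pos, string lookbehind)
    {
        // \G is only satisfied at the start index passed to Match,
        // so the lookbehind is tested at pos and nowhere else
        var regex = new Regex(@"\G" + lookbehind);
        return regex.Match(subject, pos).Success;
    }

    static void Main()
    {
        string subject = "05 AB99 abcc42 hp16";
        // Index 12 is the position just before the 4 of 42
        Console.WriteLine(LookbehindAt(subject, 12, "(?<=abc+)"));
        // Index 3 is the position just before the A of AB99
        Console.WriteLine(LookbehindAt(subject, 3, "(?<=abc+)"));
    }
}
```

Inside the callout, the result of such a helper would map to PcreCalloutResult.Pass or PcreCalloutResult.Fail exactly as in the .net_lb case.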

Some Limitations
For the pure PCRE part of the demo, the lookbehind works so long as the pattern can easily be reversed. For (?<=abc+), the translation to c+ba was direct. But if our lookbehind pattern starts to contain some convoluted syntax, as in (?<=a(?=bc)), the reversal may not be so direct.

This kind of lookbehind creates what I've dubbed a "back to the future regex". It requires that we inspect not just the portion of string before the cursor, but also the portion after the cursor. As a guess, I might approach it by reversing the whole string (or an adequate portion if some smart rules can be found) and passing that string to the match function with an adequate offset.

Reversing the regex inside the (?<=a(?=bc)) lookbehind, we would pass (?<=cb)a to the match function. Now suppose the c was instead a c+. Our reversal would now contain an infinite lookbehind. You see the problem. For a case like this, the Frankenstein solution is the way to go.

One limitation that applies to both methods is cases when the lookbehind contains capture groups, as in (?<=(\d+)). There is no mechanism to relay those capture groups to the calling function.

Another limitation is if the lookbehind contains references to previous captures, as in (?<=\1\d+). When building the regex inside the callout, we'll need to replace the reference with the current content of the group, and, in the case of the pure PCRE method, to reverse it.

I'm sure there are many other limitations. Callout within the lookbehind itself, backtracking control verbs… Let your imagination run wild. The goal of these demos is only to explore some workarounds for infinite lookbehind. The main point is that for "standard cases", it looks like we can implement the feature.

Automatically Reversing the Pattern
For a more general solution, your callout would pass the lookbehind the way you want it, without reversing it: (?C'infinite_lb:abc+'), and you would then call a tokenizer that reverses the regex to c+ba… Easier said than done! If you implement such a pattern reverser, please let me know.



Uses for Callouts

I'm sure you already see that the potential of callouts is huge. Program 4 showed directions to implement infinite lookbehind. This section mentions two other "missing features" of PCRE.

Capture Collections
Program 2 showed how a callout can use the value stored in a quantified capture group on each pass. In that example, we displayed the value. If we added the values to a list, that would start to feel like a capture collection, except that when the engine backtracks, the list would end up with more elements than actually contributed to the match.

Maybe another callout to reset the list when you enter the quantified group could bring us closer to the goal.
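Here is a rough sketch of that list idea, reusing the pattern from Program 2 and the PCRE.NET members shown there (callout.Match.Groups). CollectCaptures is a hypothetical helper, and the sketch makes no attempt to undo entries when the engine backtracks or starts a new match attempt.

```csharp
using System;
using System.Collections.Generic;
using PCRE;

class Program
{
    // Hypothetical helper: records the value of Group 1 on every pass
    // through the quantified group
    public static List<string> CollectCaptures(string subject)
    {
        var captures = new List<string>();
        var regex = new PcreRegex(@"(?:([A-Z])\d(?C1))+");
        regex.Match(subject, callout =>
        {
            // Group 1 holds the letter captured on the current pass
            captures.Add(callout.Match.Groups[1].Value);
            return PcreCalloutResult.Pass;
        });
        return captures;
    }

    static void Main()
    {
        // Same subject as in Program 2
        Console.WriteLine(string.Join(", ", CollectCaptures("A1B2C3")));
    }
}
```

On a subject that forces the engine to backtrack or to try several starting positions, the list may contain entries that did not contribute to the final match, which is the limitation mentioned above.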

Balancing Groups
The standard .NET regex engine contains an unusual feature called balancing groups, which can be used instead of recursion (a feature absent from that engine) to check that certain constructs (such as (parentheses)) are properly balanced.

You could implement that feature with PCRE callouts. Upon matching an opening parenthesis, you place a callout such as (?C'open'), and upon matching a closing parenthesis, you place a callout such as (?C'close'). At the end of the pattern, you place a callout such as (?C'check_count').

In the callout function, you increment or decrement a counter depending on whether the identifier is open or close, returning a negative value to the engine if the counter falls below zero. At the end of the match, the callout function handles the check_count argument by checking that the counter is back to zero, indicating a balanced count.

This is a direct translation of the typical balancing groups recipe.
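The recipe might be sketched as follows in PCRE.NET. This is only an illustration under assumptions: the pattern, the IsBalanced helper and the choice of PcreCalloutResult.Abort for the negative return value are mine, and the counter is never restored on backtracking, which this particular pattern does not need because its three alternatives cannot match the same character.

```csharp
using System;
using PCRE;

class Program
{
    // Hypothetical helper: true if the parentheses in subject are balanced
    public static bool IsBalanced(string subject)
    {
        int depth = 0;
        var regex = new PcreRegex(
            @"^(?:[^()]|\((?C'open')|\)(?C'close'))*$(?C'check_count')");
        var match = regex.Match(subject, callout =>
        {
            switch (callout.String)
            {
                case "open":
                    depth++;
                    return PcreCalloutResult.Pass;
                case "close":
                    depth--;
                    // A negative counter means a stray closing parenthesis:
                    // give up on the whole match
                    return depth < 0 ? PcreCalloutResult.Abort
                                     : PcreCalloutResult.Pass;
                case "check_count":
                    // At the end, the counter must be back to zero
                    return depth == 0 ? PcreCalloutResult.Pass
                                      : PcreCalloutResult.Fail;
                default:
                    return PcreCalloutResult.Pass;
            }
        });
        return match.Success;
    }

    static void Main()
    {
        foreach (var s in new[] { "(a(b))", "(()", ")(" })
        {
            Console.WriteLine(s + " => " + IsBalanced(s));
        }
    }
}
```

For patterns where the engine can backtrack across a callout, the counter would need to be saved and restored, or recomputed from the temporary match, before this approach could be trusted.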



Further Details

If you plan to use callouts, you may want to be aware of some details that may influence their operation. For instance, PCRE's autopossess optimization may interfere with callouts.

This and other details are covered in the callout page of the PCRE documentation.

Another place to look for examples is the suite of tests for PCRE.NET.

Smiles,

Rex



