Regex Gotchas


On this page, I'd like to collect some regex phenomena that may trip or puzzle you for a moment. Often, such little problems lead the regex apprentice to discover that he hadn't fully understood one aspect of regex he thought he had already mastered. For me, such Gotcha! moments are a great source of satisfaction and learning.

I tried to write these problems in a Question and Answer format. The first ones are really simple in the sense that they would only trip you on your first day with regex. As the collection grows, I hope that further down the list even accomplished regexers will find something to arouse their interest.


Please note that this page is a recent venture. There aren't many Gotchas yet, but I intend to flesh out the collection over time.



Jumping Points
For easy navigation, here are some jumping points to various sections of the page:

The Right Case
Stuck on a Line
Only the Word, Please
Empty-Handed
It Doesn't Match Enough
Phantom Replacements
The Engine Doesn't Try All Options Inside the Lookahead



The Right Case

Question: I made this regex: [a-z]+. Why isn't it matching Cat?

By default, a regex pattern is case-sensitive, so [a-z] does not match the uppercase C. How you turn on case-insensitivity depends on your engine.
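As one illustration, here is a minimal sketch in Python using the standard re module. The flag name and the inline (?i) modifier are the Python spellings; other engines have their own switches, so treat this as an example rather than a universal recipe.

```python
import re

subject = "Cat"

# Default, case-sensitive search: the uppercase C is skipped,
# so the match starts at the first lowercase letter.
default_match = re.search(r"[a-z]+", subject)

# Case-insensitive search, turned on with a flag...
flag_match = re.search(r"[a-z]+", subject, re.IGNORECASE)

# ...or with an inline modifier at the start of the pattern.
inline_match = re.search(r"(?i)[a-z]+", subject)

print(default_match.group())  # at
print(flag_match.group())     # Cat
print(inline_match.group())   # Cat
```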



Stuck on a Line

Question: I made this regex: My .* cat. Why isn't it matching this string?
My dog
and my cat


By default, the . does not match line breaks. How you make it match carriage returns, new lines and other line break characters depends on your engine.
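For instance, in Python the switch is the re.DOTALL flag, or the inline (?s) modifier. The sketch below assumes the two lines of the example are separated by a single \n; other engines name this mode differently.

```python
import re

# The two-line string from the question, assuming a plain "\n" line break.
subject = "My dog\nand my cat"

# Default mode: the dot stops at the line break, so there is no match.
default_match = re.search(r"My .* cat", subject)

# DOTALL mode: the dot also matches the line break.
dotall_match = re.search(r"My .* cat", subject, re.DOTALL)

# Same thing with the inline modifier.
inline_match = re.search(r"(?s)My .* cat", subject)

print(default_match)                   # None
print(dotall_match.group() == subject) # True
```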



Only the Word, Please

Question: I made this regex: cat. My tool finds a match in the word certificate, but I only want to find cat on its own. What to do?

The easiest fix is to use word boundaries \b, which match positions where one side is a word character (letter, digit, underscore) and the other side is not a word character (for instance it is the beginning of the string or a space character). This gives you:
\bcat\b
Improved word boundary
The regex above will not find cat in _cat25, because there is no boundary between an underscore and a letter, nor between a letter and a digit: these are all what regex defines as word characters. If you think that digits and underscores should count as a word boundary, \b will therefore not work for you.

If you would like to use a boundary that detects the edge between an ASCII letter and a non-letter, you can make it yourself. See the section about a DIY "real word boundary" on the page about regex boundaries.
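To see the difference, here is a small Python sketch. The test string is made up for the illustration, and the lookaround-based "letter boundary" is only one assumed way to build such a boundary; it is not necessarily the construction given on the boundaries page.

```python
import re

# Hypothetical test string: cat inside a word, on its own, and inside _cat25.
subject = "certificate cat _cat25"

def match_starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Bare pattern: matches everywhere the three letters appear.
plain_starts = match_starts(r"cat", subject)

# \b boundaries: only the standalone word, since _ and digits are word characters.
word_boundary_starts = match_starts(r"\bcat\b", subject)

# Assumed DIY boundary: no ASCII letter immediately before or after.
letter_boundary_starts = match_starts(r"(?<![A-Za-z])cat(?![A-Za-z])", subject)

print(plain_starts)            # [7, 12, 17]
print(word_boundary_starts)    # [12]
print(letter_boundary_starts)  # [12, 17]
```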



Empty-Handed

Question: I made this regex: \d*|\w+. The engine reports that there is a match, but it is empty. Why isn't it matching Cat?

On the face of it, this pattern seems to match either digits or some word characters, so we might expect Cat to match. However, when the engine attempts a match at the position before the initial C, the \d* is able to match, since it is true that there are zero or more digits at that position. The match is returned, and the right side of the alternation is never visited.

In all engines, if the engine is instructed to find multiple matches, it will also find other "empty" matches after the C and the other letters.



One difference among engines is that Perl and PCRE (C, PHP, R…) will also match Cat. That is because after finding a zero-width match, these engines will attempt another match at the same position in the string, and backtrack into alternations and other subexpressions as needed to find a non-zero-width match.
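The first, empty match is easy to observe in Python. The two rewritten patterns in the sketch below are suggestions of mine rather than fixes taken from the text above: one makes the left branch unable to match nothing, the other puts the branch that can match nothing last.

```python
import re

subject = "Cat"

# Original pattern: \d* succeeds with zero digits before the C,
# so the first match is empty and \w+ is never tried there.
first_match = re.search(r"\d*|\w+", subject)

# Possible fix 1: require at least one digit on the left branch.
plus_match = re.search(r"\d+|\w+", subject)

# Possible fix 2: put the branch that can match nothing on the right.
reordered_match = re.search(r"\w+|\d*", subject)

print(first_match.span())       # (0, 0)
print(plus_match.group())       # Cat
print(reordered_match.group())  # Cat
```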



It Doesn't Match Enough

Question: I made this regex: [129]|18 to match the numbers 1, 2, 9 or 18. Why isn't it matching 18?

The character class [129] matches the 1 in 18. The match is returned, and the right side of the alternation is never visited. Anchors or boundaries would resolve this problem: ^(?:[129]|18)$

Conclusion: when setting up an alternation, be mindful of what each branch matches, especially on the left side, and all the more so if you are using stars and question marks. If the leftmost branch can never fail, for instance, you can be sure that the other branches of the alternation will never match.
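Here is a short Python sketch of the problem and of the anchored fix. The reordered pattern at the end is an extra suggestion of mine, on the assumption that you are searching inside a longer string and cannot anchor.

```python
import re

subject = "18"

# Original pattern: the left branch grabs the 1 and the engine stops there.
first_match = re.search(r"[129]|18", subject)

# Anchored version from the text: the whole string must be 1, 2, 9 or 18.
anchored_match = re.search(r"^(?:[129]|18)$", subject)

# The anchors also reject strings that merely start with an allowed number.
anchored_reject = re.search(r"^(?:[129]|18)$", "181")

# Assumed alternative: try the longer branch first.
reordered_match = re.search(r"18|[129]", subject)

print(first_match.group())      # 1
print(anchored_match.group())   # 18
print(anchored_reject)          # None
print(reordered_match.group())  # 18
```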



Phantom Replacements

Question: I made this regex: X* and want to use it with this replacement string: Y. When I run the replacement on string X, I get YY, and when I run it on A, I get YAY. What is happening?

This is the same problem as in Empty-Handed, but with a replacement. Against string X, the regex X* matches twice. At the position before the X, it matches X. At the position after the X, X* is allowed to match zero X characters, so it matches an empty string. All major languages will replace both matches with a Y, giving you YY. The one exception is Python's re module before Python 3.7, which skips an empty match that immediately follows a non-empty one and returns a single Y.

Note that on those older Python versions, the exception goes away when you use Matthew Barnett's regex module instead of re: print (regex.sub("(?V1)X*", "Y", "X") ) yields YY like every other engine.

Likewise, against the string A, the regex X* matches twice. At the position before the A, it matches the empty string. At the position after the A, it also matches an empty string. All major languages will replace both matches with a Y, giving you YAY.

Try it in Python:
import re
print(re.sub("X*", "Y", "A"))  # YAY
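To see where the phantom matches sit, you can list the span of every match. The sketch below does that in Python for the string A. The X+ pattern at the end is my own suggested fix, assuming you only want to replace actual runs of X.

```python
import re

def match_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# Against "A", X* finds two empty matches: before the A and after it.
spans_in_a = match_spans(r"X*", "A")
replaced_a = re.sub(r"X*", "Y", "A")

# Assumed fix: X+ cannot match the empty string, so nothing is replaced
# in "A" and exactly one replacement happens in "X".
plus_a = re.sub(r"X+", "Y", "A")
plus_x = re.sub(r"X+", "Y", "X")

print(spans_in_a)  # [(0, 0), (1, 1)]
print(replaced_a)  # YAY
print(plus_a)      # A
print(plus_x)      # Y
```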



The Engine Doesn't Try All Options Inside the Lookahead

Question: Here is my string:
_rabbit _dog _mouse DIC:cat:dog:mouse
The DIC section at the end is a list of allowed animals. I want to match all the _tokens named after an allowed animal, so I expect to match _dog and _mouse.
I made this regex:
_(?=.*:(\w+)\b)\1\b
But it only matches _mouse. It looks like the lookahead is not trying all the options. Why?

Because lookarounds are atomic (the link explains this example in detail). Once the engine leaves a lookaround, its assertion has either returned true or false. From the engine's standpoint, that is all it wants to know.

If a lookaround returns true, the engine tries to match the next tokens. If something fails further down the pattern, the engine has no reason to revisit the lookaround: true is always true.
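The behavior can be reproduced in Python. In the sketch below, the second pattern is one possible rewrite of mine, not necessarily the solution given on the linked page. It captures the token name first, then uses the lookahead only to check that the name appears after a colon further along, so the lookahead never has to be revisited.

```python
import re

subject = "_rabbit _dog _mouse DIC:cat:dog:mouse"

def full_matches(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Original pattern: the greedy .* inside the lookahead settles on the last
# entry (mouse), and the engine never goes back in to try cat or dog.
original = full_matches(r"_(?=.*:(\w+)\b)\1\b", subject)

# Assumed rewrite: capture the name first, then assert it is in the list.
rewritten = full_matches(r"_(\w+)\b(?=.*:\1\b)", subject)

print(original)   # ['_mouse']
print(rewritten)  # ['_dog', '_mouse']
```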



