Using Regular Expressions with Python
Lately, I have been working hard on beefing up the site. There are exciting new pages, and old ones have shiny new sections. The Python regex tutorial is not fully ready for prime-time, but it's one of four at the top of my priority list. I'm working on it!
In the meantime, I don't want to leave you Python coders high and dry, so below are two programs that show everything you need to get started with Python regex. But first, I feel that a word is in order about the feature set in the re module.
What's Missing from the re Module (lots!)…
…and why the alternate regex Module is One of the Best
Missing from the re module
So what's missing from the re module?
Here is a list I've cobbled together. It's incomplete but will give you an idea:
✽ Atomic groups and possessive quantifiers (added to re in Python 3.11)
✽ Unicode properties
✽ Variable-width lookbehind
✽ Splitting on zero-width matches (before Python 3.7)
✽ Subroutine calls and recursion
✽ Character class operations
✽ Branch reset
✽ Advanced features for inline modifiers such as (?i): setting them in the middle of a pattern, turning them off as in (?-i), and (before Python 3.6) applying them to a subexpression as in (?i:foo)
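A quick way to see some of these gaps for yourself is to ask re to compile a pattern and catch the error. The following is a minimal sketch, not a complete feature test: the helper name re_accepts and the four sample patterns are my own illustrations, and it leaves out the features whose support depends on the Python version.

```python
import re

def re_accepts(pattern):
    """Return True if the standard re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Hypothetical sample patterns, one per feature from the list above.
probes = {
    "variable-width lookbehind": r"(?<=a+)b",
    "Unicode property": r"\p{L}+",
    "branch reset": r"(?|(a)|(b))",
    "subroutine call": r"(\d+)-(?1)",
}

results = {name: re_accepts(p) for name, p in probes.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'} by re")
```

Each of these four patterns is rejected by re with a compile-time error, which makes this kind of probe handy when porting a pattern from Perl or PCRE.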
Why I use the regex package
In my view, the alternate regex package by Matthew Barnett may possibly be the very best regex engine available in a mainstream programming language. Before you Perl fans send me flame letters, I'll explain why: the regex package combines some of the advanced features of .NET (infinite lookbehind, capture collections, character class operations, right-to-left matching) with some of the advanced features of Perl, PCRE and Ruby (subroutines and recursion). It even has a fuzzy matching feature.
The recent additions of \K, (?(DEFINE)…) and (*SKIP)(*FAIL) make it a delight to translate advanced patterns from Perl or PCRE. If I could add anything from my perfect regex engine wish list to round out this amazing engine, it would be balancing groups and some kind of ground-breaking quantifier capture feature.
An iPython Notebook presentation about the regex package
Around the time I was thinking of putting together a presentation about Python regex for our local Python meetup, I received a message from Rex Dwyer, who kindly shared a presentation he had made for his local Python users' group. Synchronicity!
You can download the presentation here. It is an iPython notebook. I have confirmed that all the cells run in Jupyter for Python 3, but I haven't yet had the time to read the presentation.
Two Python programs that show how to perform common regex tasks
Whenever I start playing with the regex features of a new language, the thing I always miss the most is a complete working program that performs the most common regex tasks—and some not-so-common ones as well.
This is what I have for you in the following complete Python regex programs. There are two programs: a "simple" one and an "advanced" one. Yes, these terms are subjective.
Both programs perform the same six common regex tasks, but in different contexts. The first four tasks answer the most common questions we use regex for:
✽ Does the string match?
✽ How many matches are there?
✽ What is the first match?
✽ What are all the matches?
The last two tasks perform two other common regex tasks:
✽ Replace all matches
✽ Split the string
If you study this code, you'll have a terrific starting point for tweaking and testing your own expressions in Python.
Differences between the "Simple" and "Advanced" program
Here is the difference in a nutshell.
• The simple program assumes that the overall match and its capture groups are the data we're seeking. This is what you would expect.
• The advanced program assumes that we have no interest in the overall matches, but that the data we're seeking is in capture Group 1, if it is set.
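To make the second case concrete, here is a minimal sketch of the "Group 1, if it is set" idea. It is not the advanced program itself: the pattern and subject are hypothetical, chosen so that one branch of the alternation matches text we want to ignore (quoted strings) while the other branch captures what we want (bare numbers) in Group 1.

```python
import re

# Hypothetical pattern: the left branch swallows quoted text without
# capturing it; the right branch captures bare digits in Group 1.
pattern = re.compile(r'"[^"]*"|(\d+)')
subject = 'id 42 "room 101" and 7'

# Keep only the matches where Group 1 took part in the match.
wanted = [m.group(1) for m in pattern.finditer(subject)
          if m.group(1) is not None]

print(wanted)  # ['42', '7']
```

The overall matches include the quoted string "room 101", but since Group 1 is not set for that match, it is discarded.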
Have fun tweaking
With these two programs in hand, you should have a solid base to understand how to do basic things—and fairly advanced ones as well. I hope you'll have fun changing the pattern, deleting code fragments you don't need and tweaking those you do need.
As you can imagine, I am not fluent in all of the ten or so languages showcased on the site. This means that although the sample code works, a Python pro may look at the code and see a more idiomatic way of testing for an empty value or iterating over a structure. If some idiomatic improvements jump out at you, please shoot me a comment.
Python Regex Program #1: Simple
import re
# import regex # if you like good times
# intended to replace `re`, the regex module has many advanced
# features for regex lovers. http://pypi.python.org/pypi/regex

pattern = r'(\w+):(\w+):(\d+)'
subject = 'apple:green:3 banana:yellow:5'
regex = re.compile(pattern)

######## The six main tasks we're likely to have ########

# Task 1: Is there a match?
print("*** Is there a Match? ***")
if regex.search(subject):
    print("Yes")
else:
    print("No")

# Task 2: How many matches are there?
print("\n" + "*** Number of Matches ***")
matches = regex.findall(subject)
print(len(matches))

# Task 3: What is the first match?
print("\n" + "*** First Match ***")
match = regex.search(subject)
if match:
    print("Overall match: ", match.group(0))
    print("Group 1 : ", match.group(1))
    print("Group 2 : ", match.group(2))
    print("Group 3 : ", match.group(3))

# Task 4: What are all the matches?
print("\n" + "*** All Matches ***\n")
print("------ Method 1: finditer ------\n")
for match in regex.finditer(subject):
    print("--- Start of Match ---")
    print("Overall match: ", match.group(0))
    print("Group 1 : ", match.group(1))
    print("Group 2 : ", match.group(2))
    print("Group 3 : ", match.group(3))
    print("--- End of Match---\n")

print("\n------ Method 2: findall ------\n")
# if there are capture groups, findall doesn't return the overall match
# therefore, in that case, wrap the pattern in capturing parentheses
# the overall match becomes group 1, so other group numbers are bumped up!
wrappedpattern = "(" + pattern + ")"
wrappedregex = re.compile(wrappedpattern)
matches = wrappedregex.findall(subject)
if len(matches) > 0:
    for match in matches:
        # each match is a tuple: index 0 holds the wrapping group
        # (the overall match), indexes 1 to 3 the original groups
        print("--- Start of Match ---")
        print("Overall Match: ", match[0])
        print("Group 1: ", match[1])
        print("Group 2: ", match[2])
        print("Group 3: ", match[3])
        print("--- End of Match---\n")

# Task 5: Replace the matches
# simple replacement: reverse group
print("\n" + "*** Replacements ***")
print("Let's reverse the groups")
def reversegroups(m):
    return m.group(3) + ":" + m.group(2) + ":" + m.group(1)
replaced = regex.sub(reversegroups, subject)
print(replaced)

# Task 6: Split
print("\n" + "*** Splits ***")
# Let's split at colons or spaces
splits = re.split(r":|\s", subject)
for split in splits:
    print(split)
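The findall comments in the program are easier to follow once you see what findall returns for different numbers of capture groups. This short sketch reuses the program's subject string; the two extra patterns (with no group and with one group) are variations I made up for the comparison.

```python
import re

subject = 'apple:green:3 banana:yellow:5'

# No capture groups: a list of the overall matches.
no_groups = re.findall(r'\w+:\w+:\d+', subject)

# One capture group: a list of that group's text only.
one_group = re.findall(r'\w+:(\w+):\d+', subject)

# Several capture groups: a list of tuples, one item per group.
many_groups = re.findall(r'(\w+):(\w+):(\d+)', subject)

print(no_groups)    # ['apple:green:3', 'banana:yellow:5']
print(one_group)    # ['green', 'yellow']
print(many_groups)  # [('apple', 'green', '3'), ('banana', 'yellow', '5')]
```

In the last case the overall match is nowhere in the result, which is why the program wraps the whole pattern in an extra pair of parentheses.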
Python Regex Program #2: Advanced
The second full Python regex program is featured on my page about the best regex trick ever.
✽ Here is the article's Table of Contents
✽ Here is the explanation for the code
✽ Here is the Python code
re does not split on zero-width matches (before Python 3.7)
In most regex engines, you can split a string on a position, i.e. a zero-width match, obtained for instance by using boundaries or lookarounds.
For instance, you would use (?=-) to split when the next character is a dash.
However, for historical reasons, Python's re.split did not split on zero-width matches before Python 3.7. For instance, in those older versions,
re.split("(?=-)", "a-beautiful-day") returns ['a-beautiful-day'].
From Python 3.7 onward, the same call returns ['a', '-beautiful', '-day'].
To split on zero-width matches in older versions of Python, we need to use the regex module in V1 mode. For instance,
regex.split("(?V1)(?=-)", "a-beautiful-day") will return ['a', '-beautiful', '-day'], which is what we want.
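If installing the regex module is not an option, one possible workaround with the standard re module is to collect the zero-width match positions with finditer and slice the string yourself. This is only a sketch: the helper name split_at is hypothetical, and it deliberately ignores matches that consume characters.

```python
import re

def split_at(pattern, text):
    """Split text at every position where pattern matches with zero width."""
    # Keep only zero-width matches; their start offsets are the cut points.
    cuts = [m.start() for m in re.finditer(pattern, text)
            if m.start() == m.end()]
    pieces = []
    previous = 0
    for cut in cuts:
        pieces.append(text[previous:cut])
        previous = cut
    pieces.append(text[previous:])
    return pieces

print(split_at(r"(?=-)", "a-beautiful-day"))  # ['a', '-beautiful', '-day']
```

Because it relies only on finditer, this approach does not depend on how re.split treats empty matches.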