Two Tools: Grep in PCRE, Test your PCRE patterns


This page focuses on two little-known but delightful tools that are part of the official PCRE distribution: pcregrep and pcretest. It also lets you download Windows versions of both tools.

The first tool is a grep utility that uses the PCRE engine. For those who haven't spent much time in unix, grep is a command-line utility that searches for text patterns (using regular expressions of course) inside a set of files: all the files on your hard drive, if that's your pleasure.

The second is another command-line utility, one that will allow you to debug or test the performance of PCRE patterns far more thoroughly than you could through an interface such as PHP.

Like the rest of the PCRE library, the tools are available in source form on the PCRE website. On this page, I aim to make available Windows executables of the current version of the tools as soon as they are released.

Downloading the Tools

In this section, you can download Windows binaries of pcregrep and pcretest that I've compiled from the PCRE source code. In other words, these programs are ready to run. I would like to always have the latest version for you to download.

There are a number of compile options for pcregrep and pcretest. The ones below are "fully loaded" and include the "just-in-time", "utf" and "unicode_properties" options. As of 8.35, they are also compiled with the ANYCRLF option.

✽ Read the license
✽ Download pcregrep and pcretest

Version | pcregrep | pcretest
8.35 | download | download
8.34 | download | download
8.33 | download | download
8.32 | download | download
8.31 | download | download

Before we dive into how to use pcregrep and pcretest, some thanks are in order.

A Few Words of Thanks

This "little page" is only possible because of a long chain of work by a lot of highly-skilled people. I will start in reverse order, with the last (in that long chain) but not least.

When I started on this page, it had been twenty years since I'd compiled any of my own C code, and I was anxious at the thought of compiling in the age of Windows 7. I tried compiling PCRE with the CMake option mentioned in the help file, but got stuck for a silly reason. Of course I had no way of knowing that it was a silly reason. Daniel Richard took the time to examine my workflow and pinpoint what I was doing wrong. Thanks to him, I was able to solve the problem in minutes, which gave me the great joy of compiling PCRE on my own machine. Thank you, Daniel.

Before Daniel, there are the makers of MinGW and CMake, the two open-source projects that allowed me to so easily compile the source code into Windows executables. I don't know who these guys are, but they're awesome.

On the PCRE side, some people—I don't know their names—maintain the CMake file that makes it possible for jokers like me to compile PCRE. Thanks, guys!

Finally, to Philip Hazel (the father of the PCRE project), Zoltan Herczeg (the keeper of the Just-In-Time compiler) and others on their team, I am immensely grateful for that wonderful engine that has given me so much pleasure.

May you all live long lives on beautiful streets lined with chocolate fountains.

Installation Notes

No installation is required for either pcregrep or pcretest. However, if you want the grep tool to be at your fingertips when you need it, here is what I suggest you do.

✽ Rename pcregrep.exe to grep.exe. Life is too short to type extra letters.
✽ Copy grep.exe to the C:\Windows\System32 folder. This system folder is in the system's path variable, which means that when you try to run a program, Windows looks there. That will allow you to run pcregrep (now called grep) from any folder. I copy pcretest to the same folder, so that I don't need to remember where it lives.

To run grep, open a command prompt, which is never more than five keyboard strokes away: Windows key, type "cmd", press Enter. If you use a marvellous program called Directory Opus (probably the ultimate productivity application for Windows), you can also invoke a command prompt from the current folder by using a keyboard shortcut such as Ctrl + Shift + R. That's what I do.

From the command prompt… Start grepping, debugging or optimizing!

Quick Outline

The page is rather long, so I want to give you an idea of the outline.

✽ The section about pcregrep starts immediately below.
✽ To skip to pcretest, click the link.

What's special about pcregrep?

There are other versions of grep for Windows floating around. It's only natural, as so many unix people are attached to their command-line tools. This is no recent phenomenon: I remember a time in the mid-1980s when I was given a floppy disk with a number of unix-like commands, such as "ls" (filled with options) or "cp" or "mv". These were designed to replace the DOS commands we all used at the time. There was no "move" command in DOS. These unix-like utilities were awesome, and I used them for many years.

One grep version I tried lately is the one bundled with Cygwin. I don't like it because the regular expression syntax it uses is rudimentary. It has a "-P" flag for Perl-like regex, but the manual states that it is experimental, and indeed it worked poorly when I tried it. So I looked for a command-line Windows grep with solid regex matching, but I couldn't find anything… Until I remembered pcregrep, which I had once come across and hoped to compile some day. I spend more time in PHP than in other coding environments, so PCRE is my "home" regex flavor and I have come to love it. So what could beat pcregrep?

I should mention that pcregrep searches, but it does not replace. For replacing text across files on Windows, my workhorse is "ABA Replace", an amazing GUI tool with solid regex matching. You can read my review of ABA Replace on the Tools page. And yes, there are other GUI tools, such as the search function of Directory Opus, or PowerGrep, which is "not for me", even though I love some of Jan's other software.

Using pcregrep

Remember, we renamed "pcregrep" to "grep" to save on keyboard strokes. For the most part, the pcregrep utility has the same syntax as the usual GNU grep. If you don't know that syntax, don't worry, we'll start from scratch. Here is the basic pcregrep syntax.

grep list_of_options regex_pattern files_to_match

The full syntax is in the manual file which is included in the download. But manual pages can be confusing, so here are some examples that work on my Windows machine.

Command | Description
grep --help | Displays a list of the options you can use with the command. You can send the output to a file with "grep --help > grephelp.txt". But note that the manual page in the download has much more detail.
grep toto * | Looks for the string "toto" in all files in the current directory. Returns all the lines that match, with a little context. Complains that it cannot open directories.
grep -s toto * | As above, but the "s" option (silent) makes the engine shut up about the fact that it cannot open directories. That should probably be the default option on Windows: if you don't want to see the warnings, just get into the habit of adding an s to your options when you are searching all files.
grep -s toto *.txt | As above, but only looks in files with a "txt" extension.
grep -s \btoto\d{3}\b *.txt | As above, except that instead of looking for a simple string, we are using a regex pattern. Note that there is no delimiter. See the rest of the site for pattern syntax. This particular regex will match strings such as "toto123" as long as they are not embedded in a string of "word characters". You get the idea: going forward, to focus on grep features, many examples will use simple text matches instead of regex.
grep -r toto . | Looks for the string "toto" in all files, recursively from the current folder.
grep -r --include=.*\.txt toto . | Looks for the string "toto" in all files, recursively from the current folder, but only in files with a "txt" extension. Note that pcregrep uses a PCRE regex to specify the names of the files in which to search.
grep -r --exclude=.*\.dat toto . | Specifies file names to exclude from the search, using a PCRE regex to define the set of files to exclude.
grep -ri toto . | Same as "grep -r toto ." above, with the addition of the "i" option, which makes the search case-insensitive and allows the command to match "toTO". The "-ri" also showcases how to combine short options.
grep -r (?i)toto . | As above, but setting case-insensitivity in the regex itself. See the page about (? syntax.
grep -f patterns.dat *.txt | Reads patterns from a file called patterns.dat (one pattern per line, up to 100 patterns) and matches each pattern against all files with a "txt" extension!
grep -v toto *.txt | Inverts the match, so that only lines that do not match the pattern are reported.
grep -o \btoto\w\b *.txt | The "o" option only reports each line's match, without the context.
grep -l toto *.txt | The "l" option says to only list the names of the files that contain a match, without showing the matches.
grep -L toto *.txt | The "L" option says to only list the names of the files that do not contain a match. Not the same as "-vl", which shows files that contain lines that do not match (some lines may match).
grep -n toto *.txt | Adds the line number to the reported match.
grep -c toto *.txt | Only reports the number of matching lines in each file.
grep -NANYCRLF toto *.txt | By default, because this is Windows, grep treats \r\n (CRLF) as a new line. This option makes grep treat CR, LF or CRLF as new lines, which comes in handy if you are testing Unix files. See the manual for other options such as -NLF.
grep -so1 toto(\d{3}) * | We saw the s option before (silent). The "o1" option tells pcregrep to only report the Group 1 matches, in this case the three digits after "toto". You could likewise specify -o2 to only report Group 2 captures. This option should have an alias "g" for "group", in order to avoid confusion with "o", which means "only report the match (no context)".
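If it helps to see the logic behind a few of these options, here is a minimal Python sketch of what happens line by line for a plain search, -v, -i, -o and -o1. It relies on Python's re module, not on PCRE (the two agree on simple patterns like these but differ on advanced syntax), and the helper name grep_lines and its parameters are hypothetical names chosen for illustration; they are not part of pcregrep.

```python
import re

def grep_lines(pattern, lines, invert=False, only_group=None, ignore_case=False):
    """Rough imitation of grep's per-line logic (hypothetical helper).

    invert=True       behaves like -v (keep lines that do NOT match)
    only_group=0      behaves like -o  (report only the matched text)
    only_group=1      behaves like -o1 (report only Group 1), and so on
    ignore_case=True  behaves like -i
    """
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    results = []
    for number, line in enumerate(lines, start=1):
        if only_group is not None and not invert:
            # Report every match on the line, without the surrounding context
            for match in regex.finditer(line):
                results.append((number, match.group(only_group)))
        else:
            matched = regex.search(line) is not None
            if matched != invert:
                results.append((number, line))
    return results

sample = ["aslkj 242", "slkj totos lkj", "sdlkj toto444 sdfs", "TOTO123"]

print(grep_lines(r"\btoto\d{3}\b", sample))               # plain search
print(grep_lines(r"\btoto\d{3}\b", sample, invert=True))  # like -v
print(grep_lines(r"toto(\d{3})", sample, only_group=1))   # like -o1
print(grep_lines(r"toto", sample, ignore_case=True))      # like -i
```

The line numbers in the tuples play the role of the -n option. The sketch leaves out files, newline conventions and error messages, which is where the real tool does most of its work.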

Note that you don't have to use quotes around patterns, but you can, and indeed you must if your pattern happens to contain white space. Quotes are fine, but remember not to use delimiters on your patterns—this is not PHP.

There are many other cool options. For instance, you can specify how much context to display around each match. For a full reference, see the manual page.
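To picture what the recursive examples in the table do (-r combined with --include or --exclude), here is a small Python sketch. It is an illustration under stated assumptions, not a description of pcregrep's internals: I assume the include and exclude regexes are tested against the file name only, with an unanchored search, and the function names find_files and grep_recursive are hypothetical. Python's re module stands in for PCRE.

```python
import os
import re
import tempfile

def find_files(root, include=None, exclude=None):
    """Walk root recursively and yield paths whose file name passes the filters.

    Assumption: the include/exclude regexes are tested against the file name
    only (not the whole path), using an unanchored search.
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    for folder, _subfolders, names in os.walk(root):
        for name in sorted(names):
            if include_re and not include_re.search(name):
                continue
            if exclude_re and exclude_re.search(name):
                continue
            yield os.path.join(folder, name)

def grep_recursive(pattern, root, include=None, exclude=None):
    """Return (path, line_number, line) for every matching line under root."""
    regex = re.compile(pattern)
    hits = []
    for path in find_files(root, include, exclude):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path, number, line.rstrip("\r\n")))
    return hits

# Small demo in a throwaway folder
with tempfile.TemporaryDirectory() as demo:
    os.mkdir(os.path.join(demo, "sub"))
    with open(os.path.join(demo, "a.txt"), "w", encoding="utf-8") as f:
        f.write("hello\ntoto here\n")
    with open(os.path.join(demo, "sub", "b.dat"), "w", encoding="utf-8") as f:
        f.write("toto in a dat file\n")
    demo_hits = [
        (os.path.relpath(path, demo), number, line)
        for path, number, line in grep_recursive("toto", demo, include=r".*\.txt")
    ]
    print(demo_hits)
```

Only the "txt" file is reported, because the "dat" file never passes the include filter. Passing exclude=r".*\.dat" instead of the include argument would give the same result in this tiny folder.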

About the --color Option
Feel free to skip ahead to the much more exciting pcretest section, as these are just notes so I can remember a feature that I haven't yet managed to use the way I'd like.

Like GNU grep, pcregrep has an option to color the match, making it stand out from the context: grep --color toto *

Sadly, this does not work in the Windows shell (cmd.exe) and results in this strange output: "←[1;31mtoto←[00m", while no color is displayed. This is a limitation of the Windows shell rather than pcregrep. The PCREGREP_COLOR option, set to "1;31" by default in the make files, is an ANSI code that can manipulate colors on terminals that accept these codes, as on unix. Windows is a different OS, so we shouldn't expect it to work.
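For the curious, the strange output is easy to reproduce. The sketch below wraps each match in the same ANSI sequences: the escape character (which cmd.exe draws as a left arrow), then "[1;31m" for bold red, and "[00m" to reset. The function name colorize is a hypothetical helper, and Python's re module stands in for PCRE.

```python
import re

ESC = "\x1b"  # the escape character that cmd.exe displays as a left arrow

def colorize(line, pattern, sgr="1;31"):
    """Wrap each match in ANSI color codes, the way a --color option would."""
    start = f"{ESC}[{sgr}m"  # 1;31 = bold + red foreground
    reset = f"{ESC}[00m"     # 00 = back to the default attributes
    return re.sub(pattern, lambda m: start + m.group(0) + reset, line)

colored = colorize("slkj toto lkj", "toto")
print(repr(colored))  # repr shows the raw escape codes instead of interpreting them
```

On a terminal that understands ANSI codes, printing the string without repr shows "toto" in bold red. On a terminal that does not, you see the raw codes, as in the cmd.exe output quoted above.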

You can easily change the overall background and text color of the cmd shell, either from the menu (click on the icon at the top left), or from the command-line, by passing strings such as "color=1B" when launching cmd ("1" stands for a blue background, "B" stands for very light blue text). But to my knowledge there is no way to manipulate the color of individual text in cmd.exe.

The work-around is to use a different terminal. I tried pcregrep in Console, and the color option works, but I haven't yet figured out how to integrate Console in my system so that it launches in the right path. (I normally launch command shells from Directory Opus with a Ctrl + Shift + R shortcut, and they open in the right folder, in admin mode).

Last Words about pcregrep
I hope you get as much pleasure out of having that powerful grep utility at your fingertips as I do.

Okay, it's time to look at pcretest!

Using pcretest

In my mind, pcretest has two great uses: optimization and debugging.

First, let's talk about debugging. You could feed pcretest a pattern such as ~\btoto(\w+)\b~, and a subject such as "slkj tototata lkj". With the right parameters, pcretest would show you the exact path it took in order to produce a match:

--->slkj tototata lkj
 +0      ^                \b
 +2      ^                t
 +3      ^^               o
 +4      ^ ^              t
 +5      ^  ^             o
 +6      ^   ^            (\w+)
 +7      ^   ^            \w+
+10      ^       ^        )
+11      ^       ^        \b
+13      ^       ^        
 0: tototata
 1: tata

I hope you'll agree that this is rather nifty. It could come in handy for an expression that fails for unknown reasons. You'll be able to see exactly what is going on.

Now let's talk about optimization. The pcretest utility lets you run a regex on some data a million times (or however many times you like), and it reports the average time it needed to find a match.

This makes it easy to compare alternate expressions. When you read about techniques to optimize your regular expressions, you may be interested in running tests on alternate regex phrasings. You can do that in your programming language—I used PHP to test many tweaks suggested in Jeffrey Friedl's book—but pcretest gives you an even more powerful test bench. By the way, PCRE must have been seriously optimized since Jeffrey's book came out, because as mentioned in my page about regex tricks, none of the tweaks I tried seemed to make much difference. Perhaps partly thanks to Jeffrey's hints?
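To show what "testing in your programming language" can look like, here is a minimal sketch in Python using the standard timeit module. Keep in mind that it measures Python's re engine, not PCRE, so the numbers cannot be compared with those reported by pcretest. The function name average_match_time and the two candidate patterns are hypothetical examples of alternate phrasings.

```python
import re
import timeit

def average_match_time(pattern, subject, runs=100_000):
    """Return the average time in seconds for one search of subject."""
    regex = re.compile(pattern)  # compile once, outside the timed loop
    total = timeit.timeit(lambda: regex.search(subject), number=runs)
    return total / runs

subject = "slkj tototata lkj"
candidates = {
    "shorthand class": r"\btoto(\w+)\b",
    "explicit class": r"\btoto([A-Za-z0-9_]+)\b",
}
timings = {label: average_match_time(p, subject) for label, p in candidates.items()}
for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e6:.3f} microseconds per search")
```

The figures vary from run to run and from machine to machine, so it is worth repeating the measurement a few times before drawing conclusions about which phrasing is faster.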

Before looking at the pcretest command itself, you need to know that it usually operates on an input file. The file contains the regex to be tested, and the lines of text to test. The regex is on the first line, and must be enclosed in delimiters. Here is an example of a file that would work, with the regex delimited by tildes ("~").

~toto\d{3}~
aslkj 242
slkj totos lkj
sdlkj toto444 sdfs
sadflkj

If you wanted, you could add more regexes and data to that file. Just leave a blank line after the end of the data, then start your next regex, then add the data for that regex. Here is a file that would work with pcretest and contains two regexes. You can add as many regexes and lines of data as you like.

~toto\d{3}~
aslkj 242
slkj totos lkj
sdlkj toto444 sdfs
sadflkj

~\btoto(\w+)\b~
aslkj 242
slkj tototata lkj

The regexes can use the entire PCRE syntax, whether in the pattern itself (for instance with (?s) or \K), or after the delimiter, for instance with the G flag.

There is one particularly interesting flag for debugging: "C". It's the flag that produced the "trace output" I showed you a few paragraphs ago. To get that output, I just added "C" to the pattern in the IN.txt file: ~\btoto(\w+)\b~C.
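If you test many patterns, you may prefer to generate the input file rather than type it. Here is a minimal Python sketch that assembles the layout described above: the delimited regex with its optional flags on the first line, then the data lines, with a blank line between blocks. The function name build_pcretest_input and the tuple layout are hypothetical choices for illustration.

```python
def build_pcretest_input(cases, delimiter="~"):
    """Build the text of a pcretest input file.

    cases is a list of (regex, flags, subject_lines) tuples. Each regex is
    wrapped in the delimiter and followed by its data lines; blocks are
    separated by one blank line.
    """
    blocks = []
    for regex, flags, subjects in cases:
        if delimiter in regex:
            # Keeping the sketch simple: pick another delimiter instead of escaping
            raise ValueError(f"pattern contains the delimiter {delimiter!r}")
        first_line = f"{delimiter}{regex}{delimiter}{flags}"
        blocks.append("\n".join([first_line, *subjects]))
    return "\n\n".join(blocks) + "\n"

cases = [
    (r"toto\d{3}", "", ["aslkj 242", "slkj totos lkj", "sdlkj toto444 sdfs", "sadflkj"]),
    (r"\btoto(\w+)\b", "", ["aslkj 242", "slkj tototata lkj"]),
]
text = build_pcretest_input(cases)
print(text)

# The same pattern with the "C" flag, to request the trace output
traced = build_pcretest_input([(r"\btoto(\w+)\b", "C", ["slkj tototata lkj"])])
print(traced)
```

To use the result, save the text to a file such as IN.txt, for instance with open("IN.txt", "w").write(text).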

But enough about the input file. Let's now talk about the command itself. Here are some sample uses of pcretest that you could try with either of the files above. For a full reference, I highly recommend you read the manual page, which is part of the download.

Command | Description
pcretest -help | Displays a list of the options you can use with the command. You can send the output to a file with "pcretest -help > testhelp.txt". But note that the manual page in the download has much more detail.
pcretest -C | Outputs some information about the version of pcretest you are running, such as whether it supports UTF-8.
pcretest IN.txt OUT.txt | Reads the regex and data from IN.txt, outputs the result to OUT.txt, reporting the matches for each line. In the case of the second regex, which includes a capture group, pcretest also reports the Group 1 match.
pcretest -t 100000 IN.txt OUT.txt | As above, but runs 100,000 times, and reports both the matches and the average time per run. Start without the -t option, just in case there is an error in your expression.
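To give an idea of what the report looks like for each data line, here is a rough Python imitation: the data line is echoed, then either "No match" or the overall match as item 0 followed by each capture group. It is only an approximation using Python's re module. The real pcretest output contains more detail, and the function name report is a hypothetical helper.

```python
import re

def report(pattern, subjects):
    """Return report lines loosely modeled on pcretest's output."""
    regex = re.compile(pattern)
    lines = []
    for subject in subjects:
        lines.append(subject)  # echo the data line
        match = regex.search(subject)
        if match is None:
            lines.append("No match")
            continue
        lines.append(f" 0: {match.group(0)}")  # the overall match
        for index, value in enumerate(match.groups(), start=1):
            lines.append(f" {index}: {value}")  # each capture group
    return lines

print("\n".join(report(r"\btoto(\w+)\b", ["aslkj 242", "slkj tototata lkj"])))
```

For the second data line, the sketch prints "tototata" as item 0 and "tata" as item 1, which corresponds to the last two lines of the trace shown earlier on this page.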


There are a number of other options, some of which I don't even understand. So if you feel so inclined, dig into the manual page, and have fun with it!

1-4 of 4 Threads
NewWorld – Germany
March 11, 2014 - 03:07
Subject: Many thanks

I was having problems with Cygwin's grep and multi-line patterns, so thanks for the recommendation to use pcregrep instead. I love your site; everything is explained clearly. I finally learnt how to use lookaheads and behinds thanks to your clear instructions. Really great site. Thanks again.
Glen
February 13, 2014 - 05:55
Subject: problem with pcregrep for Windows

I tried the latest version on your site, saying:

pcregrep the pcregrep.txt

It prints every line of the input. If I use an invalid pattern like "zzz", it prints nothing.
Reply to Glen
Rex
February 13, 2014 - 09:39
Subject:

Hi Glen,

That sounds like expected behavior... if each line of the input contains "the". Is that not the case? Remember that a "line" can take several screen lines up to the next carriage return.
I tested what you said, and if I remove "the" from one of the lines of input, that line does not show in the output (as expected).
If you want to see more sober output, you can try something like

pcregrep -no the test.txt

For how this works, please see the examples on the page, as well as the documentation inside the zip file.

Warm regards,

Andy
Cesar Romani
December 21, 2013 - 17:50
Subject: pcregrep doesn't work

I'm using pcregrep on Windows 7. I have a folder with text files in utf-8 encoding, which contain the word "español. " If I open cmd on that folder and do:
pcregrep -u -I español *.txt

I don't get anything. Many thanks in advance,


Cesar
Reply to Cesar Romani
Andy
December 21, 2013 - 20:28
Subject: RE: pcregrep doesn't work

Hi Cesar, pcregrep works. Your PCRE syntax is not correct. No time to help you, please download the regexbuddy demo from the right column of my website to troubleshoot your PCRE syntax. Cordially, Rex
Gary
December 11, 2013 - 16:13
Subject: using --color with cmd.exe

There are a few ways to use the --color flag in cmd.exe. I used to use rlwrap.exe from cygwin but it had a few cygwin dependencies and was a bit heavy for just wrapping grep. If you wanted to wrap a full readline in cmd.exe it works great though. Now I use ansicon and it's very simple and much faster than rlwrap was. You can get the source from github under adoxa/ansicon and compile or they have a prebuilt one you can download. It builds a single exe and 2 DLLs (x86 & x64). I just run it like "ansicon.exe grep --color toto *" which just uses it for that one command. You can also run ansicon.exe by itself to enable for the current cmd.exe or you can install to autorun so it runs on every new cmd.exe. You can also use color prompts or other tools with ansi color output. I enjoyed your article and the info on pcretest, which was exactly what I was looking for.
Reply to Gary
Rex
December 22, 2013 - 18:52
Subject: RE: using --color with cmd.exe

Hi Gary, Thanks for the great tip! Wishing you a fun weekend, Rex

