On Which Line Number is the First Match?


Using only regex, can you tell on which line a match was found? You could do that with a Perl one-liner using $. to print the line number, but in pure regex, the answer should be "No: a regex cannot give you a line number."
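For comparison, here is a minimal Python sketch of the conventional approach, where the regex only finds the match and ordinary code does the counting. The function name first_match_line and the sample string are assumptions made up for this illustration.

```python
import re


def first_match_line(text, pattern):
    """Return the 1-based line number of the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # Every newline before the match start pushes the match one line down.
    return text.count("\n", 0, match.start()) + 1


sample = "first\nsecond\nthird target\n"
print(first_match_line(sample, r"target"))  # prints 3
```

The rest of the page is about getting the same number without that last counting step.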

And that is probably a fair answer. But this page presents tricks that allow you to return the line number using only regular expressions. They may not be tricks you want to put into practice, but they're a great excuse to look at three forms of advanced regex syntax (which form the backbone of the three solutions): recursion, self-referencing groups and balancing groups.

Input for the techniques
To demonstrate the techniques, we will use this input:

Paint it white
Paint it black
Why not blue?
Or red or brown?

Our aim will be to match the line number where the first instance of blue can be found. The techniques rely on a hack: at the bottom of the input, we will paste a list of digits, separated by a unique delimiter (something that will not appear anywhere else in the file). For our tests, we will use a ~ as a delimiter. Our input becomes:

Paint it white
Paint it black
Why not blue?
Or red or brown?

~1~2~3~4~5~6~7~8~9~10

If need be, generating that list of digits programmatically would be a simple matter.
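As a rough sketch of what "programmatically" could look like, the Python below appends one ~n entry per line of input. The helper name append_line_counter and its delimiter check are assumptions of this example, not something the techniques require.

```python
def append_line_counter(text, delimiter="~"):
    """Append a counter such as ~1~2~3 with one entry per input line."""
    if delimiter in text:
        # The first techniques rely on the delimiter being unique.
        raise ValueError("delimiter already appears in the input")
    line_count = len(text.splitlines())
    footer = "".join(f"{delimiter}{n}" for n in range(1, line_count + 1))
    return text.rstrip("\n") + "\n\n" + footer


sample = "Paint it white\nPaint it black\nWhy not blue?\nOr red or brown?\n"
print(append_line_counter(sample).splitlines()[-1])  # prints ~1~2~3~4
```

One entry per line is the minimum the patterns need. A longer list, like the ten entries used on this page, does no harm.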

Inspiration for these tricks: SQL
The inspiration for the main idea behind all three solutions is a classic database hack. Databases such as MySQL do not provide syntax to return a row number, so a well-known workaround is to join to a table of integers. Another use for a table of integers is to provide the equivalent of a for loop within a SELECT statement, letting you, for instance, generate a list of the 30 dates after the current date.


Outline

Here are jumping points to the techniques we'll look at.

Match Line Number Using Recursion
Match Line Number Using Self-Referencing Group
Self-Reference Variation: Reverse the Line Numbers
Match Line Number Using Balancing Groups


Match Line Number Using Recursion

This solution uses recursion, which is available in Perl, PCRE and Matthew Barnett's regex module for Python. In turn, PCRE is used in contexts such as PHP, R and Delphi. You can test this solution in Notepad++ or EditPad Pro.

The point of the recursion is not immediately obvious, so I'll give an overview before diving into the regex. The idea of the recursive structure, which lives inside a lookahead, is to balance each non-blue line with a digit. This is similar to what we do when we balance nested parentheses ((( … ))) using recursion, except that here we have: non-blue-line non-blue-line non-blue-line ~1~2~3. The last ~digit segment is captured to Group 2. Group 1, which contains the recursion, is optional, so the surrounding lookahead succeeds even when nothing is skipped. This matters because if blue is on the first line, no lines are skipped. After the lookahead, we match blue, then if Group 2 was set, we match it. Either way, we look for the next ~digit segment and return the digit as the match.

(?xsm)             # free-spacing mode, DOTALL, multi-line
(?=.*?blue)        # if blue isn't there, fail without delay

######    Recursive Section     ######
# This section aims to balance non-blue lines with digits, i.e.
# nonBlueLine,nonBlueLine,nonBlueLine ... ~1~2~3
# The last digit block is captured to Group 2, e.g. ~3
(?=                # lookahead
(                  # Group 1
   (?:               # skip one line that doesn't contain blue
      ^              # start of line
      (?:(?!blue)[^\r\n])*  # zero or more chars
                            # that do not start blue
      (?:\r?\n)      # newline
    ) 
    (?:(?1)|[^~]+)   # recurse Group 1 OR match all non-tilde chars
    (~\d+)           # Group 2: match a tilde and digits
)?                 # End Group 1
)                  # End lookahead. 

# Group 2, if set, now contains the ~digit block of the last line skipped
.*?               # lazily match chars up to... 
blue              # match blue
.*?               # lazily match chars up to... 
(?(2)\2)          # if Group 2 is set, match Group 2
~                 # Match the next tilde
\K                # drop what was matched so far
\d+               # match the next digits: this is the match    

In this live regex demo, you can see that the match is 3 (blue is on line 3). You can also inspect the content of Groups 1 and 2, and play with the input (move the first blue to other lines).
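If the pairing is hard to picture, here is a plain-Python analogy of what Group 1 does. It is not a regex (Python's built-in re module has no recursion), and the names skipped_lines and line_number_by_pairing are invented for this sketch. Each call skips one non-blue line, recurses for the lines that follow, and accounts for one ~digit block on the way back out.

```python
def skipped_lines(lines, needle, depth=0):
    """Mimic Group 1: skip one line without the needle, recurse, then
    account for one counter block on the way back out."""
    if depth >= len(lines) or needle in lines[depth]:
        return 0  # Group 1 cannot match here, so nothing is paired
    return skipped_lines(lines, needle, depth + 1) + 1


def line_number_by_pairing(text, needle="blue", delimiter="~"):
    body, _, footer = text.partition(delimiter)
    if needle not in body:
        return None  # plays the role of the (?=.*?blue) guard
    blocks = footer.split(delimiter)  # ['1', '2', '3', ...]
    consumed = skipped_lines(body.splitlines(), needle)
    # The block after the consumed ones belongs to the line with the needle.
    return int(blocks[consumed])


sample = (
    "Paint it white\nPaint it black\nWhy not blue?\nOr red or brown?\n"
    "\n~1~2~3~4~5~6~7~8~9~10"
)
print(line_number_by_pairing(sample))  # prints 3
```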


Match Line Number Using Self-Referencing Group

This technique uses a self-referencing capture group, that is, a group that refers to itself. It's not hard, but it may not be obvious if you haven't seen the technique before, so I'll give you an overview. We match the non-blue lines one by one. For each line we match, we look ahead to the string of digits at the bottom, and we use Group 1 to capture a portion of that string. This is Group 1: ((?(1)\1)~\d+). Group 1 says "if Group 1 is already set, match what Group 1 has captured so far. Then, regardless, match a tilde and some digits." This means with each non-blue line we match, Group 1 grows to capture an ever-longer portion of the digit string.

(?xsm)             # free-spacing mode, DOTALL, multi-line
(?=.*?blue)        # if blue isn't there, fail without delay

########### LINE SKIPPER / COUNTER ############
(?:                # start non-capture group
                   # the aim is to skip lines that don't contain blue
                   # and capture a corresponding digit sequence
   (?:               # skip one line that doesn't contain blue
      ^              # beginning of line
      (?:(?!blue)[^\r\n])*  # zero or more chars
                            # that do not start blue
      (?:\r?\n)      # newline chars
   )
   # With each line skipped, let Group 1 capture
   # an ever-growing portion of the string of numbers
   (?=               # lookahead
      [^~]+          # skip all chars that are not tildes
      (              # start Group 1
         (?(1)\1)    # if Group 1 is set, match Group 1
         # (?>\1?)   # alternate phrasing for the above
         ~\d+        # match a tilde and digits
      )              # end Group 1
   )                 # end lookahead
)*+                # end counter-line-skipper: zero or more times
                   # the possessive + forbids backtracking
.*?                # lazily match any chars up to...
blue               # match blue
[^~]+              # match any non-tilde chars
(?(1)\1)           # if Group 1 has been set, match it
# \1?              # alternate phrasing for the above
~                  # match a tilde
\K                 # drop what we matched so far
\d+                # match digits. This is the match!

In this live regex demo, you can see that the match is 3 (blue is on line 3). You can also inspect the content of Group 1 and play with the input (move the first blue to other lines).
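To see how Group 1 grows, here is a small Python trace of its value after each skipped line. It is a simulation in ordinary code rather than a regex (Python's built-in re module refuses a reference to a group that is still open), and trace_group1 is a name made up for this sketch.

```python
def trace_group1(text, needle="blue", delimiter="~"):
    """Return (values of Group 1 after each skipped line, line number)."""
    body, _, footer = text.partition(delimiter)
    if needle not in body:
        return [], None  # the overall match would fail
    blocks = [delimiter + block for block in footer.split(delimiter)]
    group1 = ""  # an unset group is modelled as an empty string
    history = []
    for line in body.splitlines():
        if needle in line:
            break
        # ((?(1)\1)~\d+): the previous capture plus the next block
        group1 += blocks[len(history)]
        history.append(group1)
    # After Group 1, the next block holds the line number we want.
    answer = blocks[len(history)][len(delimiter):]
    return history, int(answer)


sample = (
    "Paint it white\nPaint it black\nWhy not blue?\nOr red or brown?\n"
    "\n~1~2~3~4~5~6~7~8~9~10"
)
print(trace_group1(sample))  # prints (['~1', '~1~2'], 3)
```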


Self-Referencing Group Variation: Reverse the Line Numbers

In this interesting variation, we reverse the line numbers at the bottom of the file: ~10~9~8~7~6~5~4~3~2~1

This has several benefits. First, we can shoot all the way to the back of the file with a simple .* and know we have reached the digits' section. That is more satisfying than looking for the digits' section with [^~]+. Second, we don't have to worry that our "unique" delimiter (here a simple tilde ~) might be used somewhere else in the input: We shoot down to the end and backtrack from there. This makes the situation even more similar to being able to inspect a separate table or file.

The code is nearly the same: In the self-referencing group, instead of appending digits to the existing capture with ((?(1)\1)~\d+), we prepend them with (~\d+(?(1)\1)).

Our input becomes:

Paint it white
Paint it black
Why not blue?
Or red or brown?

~10~9~8~7~6~5~4~3~2~1

(?xsm)             # free-spacing mode, DOTALL, multi-line
(?=.*?blue)        # if blue isn't there, fail without delay

########### LINE SKIPPER / COUNTER ############
(?:                # start non-capture group
                   # the aim is to skip lines that don't contain blue
                   # and capture a corresponding digit sequence
   (?:               # skip one line that doesn't contain blue
      ^              # beginning of line
      (?:(?!blue)[^\r\n])*  # zero or more chars
                            # that do not start blue
      (?:\r?\n)      # newline chars
   )
   # With each line skipped, let Group 1 capture
   # an ever-growing portion of the string of numbers
   (?=               # lookahead
      .*             # Go to the end of the file
      (              # start Group 1
         ~\d+        # match a tilde and digits
         (?(1)\1)    # if Group 1 is set, match Group 1
      )              # end Group 1
   )                 # end lookahead
)*+                # end counter-line-skipper: zero or more times
                   # the possessive + forbids backtracking
.*?                # lazily match any chars up to...
blue               # match blue
.*                 # Get to the end of the data
~                  # match a tilde
\K                 # drop what we matched so far
\d+                # match digits. This is the match!
(?=                # Lookahead (this positions us in the right place)
   (?(1)\1)        # If Group 1 has been set, match it
)                  # End lookahead

In this live regex demo, you can see that the match is 3 (blue is on line 3). You can also inspect the content of Group 1 and play with the input (move the first blue to other lines).
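The prepending logic can be sketched in plain Python as well. This is a simulation rather than a regex, the name line_number_reversed is invented here, and the sketch assumes the counter sits alone on the last line of the input. Because it works backward from the end, a stray tilde in the body does not disturb it.

```python
def line_number_reversed(text, needle="blue", delimiter="~"):
    """Simulate (~\\d+(?(1)\\1)) against a reversed counter line."""
    # Assumption: the counter is the last line of the input.
    body, _, footer = text.rstrip("\n").rpartition("\n")
    if needle not in body:
        return None
    group1 = ""  # an unset group is modelled as an empty string
    for line in body.splitlines():
        if needle in line:
            break
        # Grow Group 1 leftward by the block just before the current capture.
        head = footer[: len(footer) - len(group1)]
        group1 = footer[head.rfind(delimiter):]
    # The answer is the block immediately to the left of Group 1.
    head = footer[: len(footer) - len(group1)]
    return int(head[head.rfind(delimiter) + 1:])


sample = (
    "Paint it ~white\nPaint it black\nWhy not blue?\nOr red or brown?\n"
    "\n~10~9~8~7~6~5~4~3~2~1"
)
print(line_number_reversed(sample))  # prints 3
```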


Match Line Number Using Balancing Groups

This version uses an outstanding regex feature exclusive to the .NET engine: balancing groups.

We use a group named c to serve as a counter of lines that don't contain blue. Of course there is no such thing as a "counter"… But each capture for Group c is added to the Capture Collection stack, and that stack has a length (which you can later inspect with match.Groups["c"].Captures.Count).

After "incrementing the counter" while skipping the non-blue lines (meaning that for each such line we add a capture to the Group c collection), we match the line with blue and get to the beginning of the digit sequence. There the fun begins: as long as we can decrement c (i.e., pop an element from Group c captures), we match a digit sequence. The digits matched at this point (if any) therefore correspond to the skipped lines. And the digit for the line containing blue is the next one in the sequence.

Don't let the explanation scare you: the code is probably simpler than the explanation!

For some reason .NET doesn't seem to do well with the capture-popping syntax (?<-c> … ) when it's inside a lookbehind, so instead of matching the line number directly, we will capture it to Group 1.

(?xsm)             # free-spacing, DOTALL, multi-line
(?=.*?blue)        # if blue isn't there, fail without delay
\A                 # Assert position at the beginning of the input 

########### LINE SKIPPER / COUNTER ############
(?<c>              # Add a capture to Group c for each line that
                   # doesn't contain blue. Think of Group c as
                   # a counter (we are only interested in the
                   # number of captures it contains)
   ^                      # beginning of line
   (?:(?!blue)[^\r\n])*   # zero or more chars
                          # that do not start blue
     (?:\r?\n)            # newline chars
)                  # end Group c
*                  # repeat Group c as long as we can
                   # find non-blue lines

###########    AFTER SKIPPING    ############
.*?               # lazily match any chars up to...
blue              # match blue
[^~]+             # match any non-tilde chars

###########    Number of Skipped Lines (if any)    ############
# To get to the number of skipped lines in the digit sequence, 
# for each Group c capture (each skipped line), we pop one 
# element from Group c ("decrement c") and match the next digits
(?(c)             # Conditional: If Group c has been set
   (?<-c>           # Pop one capture from Group c / "decrement c"
     ~\d+              # Match the next tilde and digits
   )                # End of popping / "decrementing" group
   *                # Zero or more times: We will only pop elements
                    # (and therefore match new digits) as long as
                    # Group c still contains captures
)                 # End Conditional checking if Group c has been set

#######    Finally: the next digits are the right ones  ########
~                  # Match the tilde
(\d+)              # Capture the digits to Group 1

In this live regex demo, inspect the captures: you will see that Group 1 is 3.
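For readers without a .NET engine at hand, the push-and-pop bookkeeping can be imitated with an ordinary Python list. This is only an analogy for the capture stack of Group c, not .NET code, and line_number_via_stack is a name invented for the sketch.

```python
def line_number_via_stack(text, needle="blue", delimiter="~"):
    """Imitate Group c: push per skipped line, pop per digit block."""
    body, _, footer = text.partition(delimiter)
    if needle not in body:
        return None  # plays the role of the (?=.*?blue) guard
    blocks = footer.split(delimiter)  # ['1', '2', '3', ...]
    stack = []
    for line in body.splitlines():
        if needle in line:
            break
        stack.append(line)  # (?<c> ... ) adds one capture to Group c
    position = 0
    while stack:  # (?<-c> ~\d+ )* can only run while c has captures
        stack.pop()
        position += 1  # one ~digits block consumed per pop
    # The next block is the one captured to Group 1.
    return int(blocks[position])


sample = (
    "Paint it white\nPaint it black\nWhy not blue?\nOr red or brown?\n"
    "\n~1~2~3~4~5~6~7~8~9~10"
)
print(line_number_via_stack(sample))  # prints 3
```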





perlancar
April 15, 2016 - 02:26
Subject: &. should be $.

The special variable that contains "current line number" is $. , not &. Also, in Perl, another alternative to get the line number is to use pos() inside embedded code. Pos() gives the character offset, and we can scan $_ for newlines to get line number + column number.
Rex
April 29, 2016 - 10:27
Subject: RE: &. should be $.

Typo fixed, thank you -- and thanks for your other idea as well.

