CapturedParens


Recursive Paren Capture Regex

Written to illustrate extended regex syntax. Shows use of code assertions, delayed execute code assertions, lookahead, non-capture grouping, and non-backtracking.

Posted for critique and discussion. Post corrections, ideas, improvements, note, kudos and flames in space marked below.

thanks, fishbot


#!/usr/bin/perl
#
# Illustration of various extended regex constructs
# Regex below "parses" strings for balanced parentheses
# If there are balanced parens, returns the strings 
# contained in the parens.  

use strict;

my $re;
my @strings;

$re = qr{
        \(                       # opening parens
           (                     # capture
             (?:
                (?= \))          # [1] null okay, if followed by )
                | (?> [^()]+ )   # [2] non-parens without backtracking
                | (??{ $re })    # [3] group with matching parens
             )*
           ) 
        (?{ push @strings, $1 }) # [4] save captured
        \)                       # close parens
        }x;

my $matched = "(this (a) is (b () d) a paren)(i)";
my $unmatched = "(this (is) fine () up until the last moment";

for ( $matched, $unmatched )
{
  @strings = ();
  
  print "Trying [ $_ ]:\n";
  unless ( m/^(?:$re)*$/o )
  {
     print "\tMatch failed.\n\n";
     next;
  }
  print "\t[ $_ ]\n" for @strings;
  print "\n";
}
__END__

Outputs:
Trying [ (this (a) is (b () d) a paren)(i) ]:
       [ a ]
       [  ]
       [ b () d ]
       [ this (a) is (b () d) a paren ]
       [ i ]

Trying [ (this (is) fine () up until the last moment ]:
       Match failed.


Notes:

  1. Positive lookahead assertion: This allows us to have empty brackets without having a null alternation. A null match would always succeed, forcing an "incorrect" recursion step. Theoretically, without the assertion, this would throw the pattern into an infinite loop, but the engine seems smart enough to avoid that. Why? (see below for answer)
  2. Non-backtracking grouping: all non-parens are swallowed. We never want to backtrack over this. If we get to the end of the string and there is an unmatched parenthesis, backtracking is useless and expensive. Note that if we allow backtracking here, our third failing string above causes perl to spin (nearly) infinitely.
  3. Deferred interpreted code assertion: This is the recursive step, matching a balanced paren group. Note that our "my $re" expression can't be folded into this assignment... we have to declare the variable before using it.
  4. Code assertion: Here we are pushing the captured text into an array. Since we only have one capturing group, each recursive matched gets assigned to $1, clobbering previous matches. Note that deepest, left-most recursions are pushed first.

discussion, abuse, etc here

Help me make this a better node!


Thanks fishbot! Great page.

Using and parsing advanced regular expressions ain't always easy. I usually need to remind myself of some of the keystrokes, and 'perldoc perlre' always seems too wordy. Does anybody know of a more advanced version of the handy one-page Perl Regex Quick Reference by Erudil? (http://www.erudil.com/preqr.pdf)

If you need to brush up on the concepts here, our own Elbie gave a talk with nice slides:

http://kw.pm.org/talks/1002-regex/

The good bits (common pitfalls, lookaround assertions, etc) start at: http://kw.pm.org/talks/1002-regex/slide17.html

-- DanielAllen


Answer to my question #1 above:

The answer is to be found in the Perl Documentation. See especially:

For example:

   $_ = 'bar';
   s/\w??/<$&>/g;

results in <><b><><a><><r><>. At each position of the string the
best match given by non-greedy ?? is the zero-length match, and 
the second best match is what is matched by \w. Thus zero-length 
matches alternate with one-character-long matches.

The short answer is that a zero-width assertion or pattern is only allowed to match once in a given pos(), unless, of course, you reset or mess around with pos().

The section I link above is quite interesting, in that it explains the notion of best match -vs- second best match, and when the regex engine takes it upon itself to use the second best match.

This is, needless to say, an impossibly obscure corner of PCRE. Enjoy.