Major Changes in the Regex Engine
=================================

In 5.9.x (pre-5.10) blead

=> mini-talk, kw.pm meeting, 16/11/2006, fishbot
   some examples loosely adapted from recent perlre patches

############################################################

Non-recursive:
--------------

Big thing: the regex engine is no longer recursive. The role of the
C stack is now played by a linked list of saved states on the heap.
This is a mild optimization, but it also means that regex complexity
is limited by your RAM, not your stack.

It also opened up a bunch of possibilities, and sparked interest in
hacking the regex engine. This led to features, features lead to
love, love leads to pain.

Other optimizations:
--------------------

Automatic tries from literal alternations: historically, if you
wanted to match:

    (foo|fob|friends)

it was more efficient to rewrite the regex:

    f(o(o|b)|riends)

The engine now does this -automatically- for alternations with
literal prefixes. It uses a trie internally, and Aho-Corasick for
matching. Quite a cool optimization.
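
A minimal sketch of what the trie optimization buys you in practice: the plain alternation and the hand-factored form accept exactly the same strings, so the rewrite is no longer worth doing by hand. The word list is made up for illustration and is not from the talk.

```perl
use strict;
use warnings;

# The plain alternation and the hand-factored equivalent from the text,
# anchored so that only whole words count.
my $plain    = qr/^(?:foo|fob|friends)$/;
my $factored = qr/^f(?:o(?:o|b)|riends)$/;

# Hypothetical sample inputs, chosen only for illustration.
my @words = qw(foo fob friends fo friend bar);

my %agree;
for my $w (@words) {
    my $p = $w =~ $plain    ? 1 : 0;
    my $f = $w =~ $factored ? 1 : 0;
    $agree{$w} = ($p == $f) ? 1 : 0;   # 1 when both patterns give the same answer
    printf "%-8s plain=%d factored=%d\n", $w, $p, $f;
}
```

On a perl that has the optimization, compiling the plain pattern under `use re 'debug'` should show a TRIE node in the compiled program, though the exact dump format varies between versions.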

############################################################

Named Capture Groups:

Syntax: (?<name>pattern)
        (?'name'pattern)

Example:

    "foobar" =~ m/(?<foo>foo)bar/ and print Dumper \%+;

    $VAR1 = {
              'foo' => 'foo'
            };

    "foobarfoobar" =~ m/(?<all>(?:(?<foo>foo)bar)*)/
        and print "foo = $+{foo}\nall = $+{all}\n";

    foo = foo
    all = foobarfoobar

You can also use \k<name> or \k'name' as a backreference inside the
pattern (the LHS):

    "foobar" =~ m/(?<first>.)\k<first>/ and print $+{first};
    # prints "o"

############################################################

Named Captures as sub-patterns:

Syntax: (?&name)

Example:

    my $reg = qr{
        (
            (?<vowel>[aeiou])   # match and declare
            b
            (?&vowel)           # 'call' named sub-pattern
        )
    }x;

    "foobar" =~ $reg and print $1;          # prints "oba"
    "foobar" =~ $reg and print $+{vowel};   # prints "o", leftmost

############################################################

Unnamed sub-patterns:

Syntax: (?PARNO)

Example:

    my $reg = qr{ ( ([aeiou]) b (?2) ) }x;

Also, can be relative:

    my $reg = qr{ ( ([aeiou]) b (?-1) ) }x;

(Very nice if you don't like counting.)

Also, you can use relative backreferences now too:

    my $reg = qr{ (\w) \R1 }x;   # doubled word chars

(That is the blead spelling at the time of the talk. Released 5.10
writes it \g{-1}, and \R there means a generic newline.)

Note that relative counting includes an enclosing (still open) group,
so you can now tail call a capturing group:

    my $reg = qr{ ( ([^aeiou][aeiou] (?-1)? ) ) }x;

    "fobarak" =~ $reg and print $1;   # prints "fobara"

You can also just recurse on the whole pattern:

Syntax: (?0) or (?R)

    my @found;
    my $reg = qr{
        (
            \(
                (?: (?R) | \w* )*
            \)
        )
        (?{ push @found, $^N })
    }x;

    "(((bar))()(foo))" =~ $reg and print Dumper \@found;

    # prints:
    $VAR1 = [
              '(bar)',
              '((bar))',
              '()',
              '(foo)',
              '(((bar))()(foo))'
            ];

############################################################

Control Verbs:

General form: (*VERB:NAME)

All current verbs:

    (*PRUNE)   (*PRUNE:NAME)
    (*SKIP)    (*SKIP:NAME)
    (*MARK:NAME)   (*:NAME)
    (*THEN)    (*THEN:NAME)
    (*COMMIT)
    (*FAIL)
    (*ACCEPT)

-------------------------------------------------------------
FAIL: fails at this point in the pattern, which forces the engine to
backtrack. Examples below.
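
As a baseline for the other verbs, here is a minimal sketch of (*FAIL) on its own. It records every substring the engine had matched when it hit the verb. The `@seen` array and `$ok` flag are names introduced here for illustration only.

```perl
use strict;
use warnings;

my @seen;   # every substring the engine had matched when it reached (*FAIL)

# (*FAIL) never matches, so the engine backtracks through every way
# a+b? can match at every starting position before giving up.
my $ok = 'aaab' =~ m{ a+b? (?{ push @seen, $& }) (*FAIL) }x;

print "$_\n" for @seen;
print "attempts: ", scalar(@seen), "\n";
print "overall match: ", ($ok ? "yes" : "no"), "\n";
```

The overall match fails, but the code block has already run once per attempt. Comparing this list with what a pattern prints once a verb such as (*PRUNE) or (*SKIP) is added shows which attempts that verb cut off.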
-------------------------------------------------------------
PRUNE: ditch the entire backtracking tree to its left if we
backtrack over it.

Example:

    'aaab' =~ m{ a+b? (*PRUNE) (?{ print "$&\n"; }) (*FAIL) }x;

Prints:

    aaab
    aab
    ab

-------------------------------------------------------------
SKIP: similar to PRUNE, but also forwards pos() past whatever we've
consumed at this point in the match, so the next attempt starts
there.

Example:

    'aaabaaab' =~ m{ a+b? (*SKIP) (?{ print "$&\n"; }) (*FAIL) }x;

Prints:

    aaab
    aaab

-------------------------------------------------------------
MARK: marks a point for a named SKIP to skip to.

Example:

    'aaabaaab' =~ m{
        a (*MARK:luke) a+b? (*SKIP:luke)
        (?{ print "$&\n"; })
        (*FAIL)
    }x;

Prints:

    aaab
    aab
    aaab
    aab

-------------------------------------------------------------
THEN: fails the current alternative if backtracked over, so the
engine moves straight on to the next branch of the alternation.

Example:

    'fooooaoaoarfoobar' =~ m{
        (
            f[ao]* (?{ print "$&\n" }) (*THEN) (x|f)
          |
            foobar
        )
        (?{ print "$&\n"; })
    }x;

Without the (*THEN):

    fooooaoaoa
    fooooaoao
    fooooaoa
    fooooao
    fooooa
    foooo
    fooo
    foo
    fo
    f
    foo
    fo
    f
    foobar

With:

    fooooaoaoa
    foo
    foobar

------------------------------------------------------------
COMMIT: fail the entire match if we backtrack over this.

    'aaabaaab' =~ m{ a+b? (*COMMIT) (?{ print "$&\n"; }) (*FAIL) }x;

Prints:

    aaab
    # match fails.

It's like SKIP, except we won't try again at the next pos(). It's
the matched chunk or nothing.

############################################################

More complex example:

Backtrack controls now let you be super-super greedy in matches,
yielding all matches, even overlapping ones:

    my $string = "foobar";
    my %stuff;
    $string =~ m{
        (foo|oobar|oob)
        (?{ $stuff{$^N}++ })
        (*FAIL)
    }ix;
    # note, this pattern actually -fails-

    print Dumper \%stuff;

    $VAR1 = {
              'foo' => 1,
              'oobar' => 1,
              'oob' => 1
            };
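
The same trick can be sketched so that it also keeps where each overlapping hit starts. This variant is an illustration, not from the talk. It assumes the documented behaviour that pos() inside a (?{ }) block reports the current match position in the target string. The `@hits` array and `$ok` flag are names introduced here.

```perl
use strict;
use warnings;

my $string = "foobar";
my @hits;   # [start offset, matched text] pairs, one per alternative that matched

# (*FAIL) rejects every candidate after the code block has recorded it,
# so the engine goes on to try every alternative at every start position.
my $ok = $string =~ m{
    (foo|oobar|oob)
    (?{ push @hits, [ pos() - length($^N), $^N ] })
    (*FAIL)
}ix;

printf "%d: %s\n", @$_ for @hits;
print "overall match: ", ($ok ? "yes" : "no"), "\n";
```

Because the pattern as a whole fails, $1 and friends are not set afterwards. Anything you want to keep has to be stashed from inside the code block.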