Regex Tutorial - New Features (5.005 and 5.6.0)
If you thought Perl's regular expressions were powerful, but complex and difficult to understand and apply, watch out! The more recent additions to Perl's regex arsenal might just be enough to make run screaming for the door---Perl's regular expressions are now even more powerful, more potentially complex, and decidedly non-regular by any definition. This tutorial will present most of the new regex features (both stable and experimental), and show you how to use them ... be afraid...
Several new regex features have been recently introduced into Perl---some in the 5.005xx series, and others in the new 5.6.0 release. This tutorial will introduce many of these new features, and a few older things just in case you weren't aware of them.
A quick run-down of the regex features you will find in this tutorial
are: precompiling regexen with the qr// operator, embedded
modifiers and localization, lookahead and lookbehind assertions,
independent subexpressions, conditional expressions, and embedded
code assertions. We will finish with mention of a few new re pragmas
that you may find useful. Some of these newer features are officially
still in an experimental stage.
In this section we will address a few of the things that you may already know about and which aren't considered experimental.
A new operator was introduced in 5.005, the qr// operator.
This is another quoting operator similar to q// and qq//, but which
quotes regular expressions and returns the regex precompiled into
Perl's internal regex format. A simple usage example is:
my $re = qr/foo.*bar/;
You probably are already aware of the /o modifier, which tells
perl to compile the pattern only once if it contains a variable. This
is useful in some circumstances, but not when you want to match data
against many patterns, perhaps stored in an array. Take a look at the
following code example, ignoring the commented out line that turns
the array of patterns into an array of compiled patterns:
#!/usr/bin/perl -w
use strict;
my @patterns = qw|
a.*b
foo.*bar
a.*b.*z
|;
# @patterns = map{qr/$_/} @patterns;
while(<>){
foreach my $re (@patterns){
print "$re: $_" if /$re/;
}
}
__END__
In the above code, for line of input read we loop through the
patterns in the array and print the line (and the regex) if it
matches against the line. Every time through the loop, the regex must
be recompiled---that's three regex compilations for every line in the
file. However, if we comment the qr// back into the code, then
the array holds already precompiled patterns and each time through
the loop perl recognizes this and only has to apply the pattern
against the line---there is no further regex compilation required.
The following shows timing results of two consecutive runs of each
version of the above program using /usr/dict/words as the
input file---on my system, this file has one word per line and is
10,754 lines long. As you might expect, compiling 3 regexen 10,754
times really sucks compared to compiling 3 regexen only once each.
Version without precompiled regexen:
$ time perl retesting.pl /usr/dict/words >junk
6.49user 0.01system 0:06.50elapsed 99%CPU
$ time perl retesting.pl /usr/dict/words >junk
6.46user 0.01system 0:06.46elapsed 100%CPU
Version with precompiled regexen:
$ time perl retesting.pl /usr/dict/words >junk
0.83user 0.01system 0:00.84elapsed 99%CPU
$ time perl retesting.pl /usr/dict/words >junk
0.84user 0.01system 0:00.85elapsed 99%CPU
Some of Perl's regex modifiers, in particular the /ismx
modifiers, can be embedded in the pattern rather than specified
after the final regex delimiter. This is done using syntax of the
form (?ismx), and any of the modifier letters included are
then turned on. You may also turn off modifiers using a dash so that
(?i-sm) will turn on the /i modifier and turn the
/sm modifiers off.
Having embeddable modifiers is useful for constructing patterns based on user input. Consider the following case:
#!/usr/bin/perl -w
use strict;
print "enter a pattern: ";
chomp(my $re = <STDIN>);
print "case sensitive? [y/n]: ";
chomp(my $cs = <STDIN>);
$cs = $cs =~ /^y/i ? 1 : 0;
while(<>){
if ($cs) {
print if /$re/i;
}else{
print if /$re/;
}
}
Now, that's not terrible, but we are forced to use a conditional test within the input loop, for every line in the input. Using an embedded modifier we can create the correct pattern before we enter the input loop:
#!/usr/bin/perl -w
use strict;
print "enter a pattern: ";
chomp(my $re = <STDIN>);
print "case sensitive? [y/n]: ";
chomp(my $cs = <STDIN>);
if ($cs =~ /^y/i) {
$re = "(?i)$re";
}
while(<>){
print if /$re/;
}
Don't ask me of what use embedding a /x modifier is because I
can't really think of a good one.
This same type of syntax can be used to embed a comment within a
regex as in /a(?# this is silly).*bcd/.
Ok, Perl has had lookahead assertions for quite some time, so I will
only mention the briefly. A lookahead assertion is a form of
anchoring a regex, it is a zero-width assertion that does not consume
any of the target string. For example: /a(?!b).*c/ will match
only if it finds an 'a' that is not immediately followed by a b, but
is followed by zero-or-more of anything and a 'c'. Similarly, this
regex: /a(?=b)(.*c)/ will only match if an 'a' is followed
immediately by a 'b' and then zero-or-more of anything followed by a
'c' ---and the $1 variable will include the 'b' and everything up to
and including the 'c'.
Perl also has the somewhat newer lookbehind assertions, which are
similar exept that they apply to what came immediately prior to the
current position: /(".*?")(?<!\\")/ will match a double
quote followed by anything up to the next double quote that is not
preceded by a backslash. The limitation on lookbehinds is that the
lookbehind expression must be a fixed width pattern---in other words,
indeterminate quantifiers are not allowed in the expression being
tested. There is also the positive lookbehind of the form
/(?<=)/ that is subject to the same rules and limitation.
Now, you probably already knew about the embedded modifier feature, but did you know that you could localize such a modifier to only have effect for part of the pattern? Well, in 5.005 you could do this by using the embedded modifier within a grouping construct and the modifier would only apply to the pattern within that construct:
/(?:(?i)foo)bar/ # foo is case insensitive, but bar is not
Now the non-capturing grouping construct, (?:...) has been
modified to take embedded modifiers between the ? and the : and
those modifiers apply only the pattern following the colon as in
/(?i:foo)bar/. You may also turn off modifiers using the same
syntax as the embedded modifier form discussed previously:
/(?i-s:f.*)bar.*quux/s, here the grouping is case insensitive and the
. is not allowed to match newlines between the 'f' and 'bar'.
Another recent (experimental) addition is the independent
subexpression with a syntax of: /(?>pattern)morepattern/. If an
independent subexpression matches the regex engine will not allow
backtracking within the subexpression if the remaining pattern
fails.
Let's explore the example in the perlre manpage in a little detail. In this example we want to match a non-empty substring contained within parentheses that may contain one more level of parentheses within it. Thus, in the string:
$_ = '(a(b(c)))';
We want our pattern to match the substring '(b(c))' and put it into the $1 variable. The following will do this:
$_ = '(a(b(c)))';
print "$1\n" if
m{ (
\(
(?:
[^()]+
|
\( [^()]* \)
)+
\)
)
}x;
However, there is a problem with this regex. Consider a case where
the match may get started but eventually fail -- on a long string
that 'eventually' may take a long long time. This is because we have
a pattern which contains an indeterminate quantifier within a
grouping that is also modified by an indeterminate quantifier -- in
other words, something of the form: (x+)+ . In such a case, the
regex engine may have to explore a potentially huge number of ways of
splitting up the string. Let's consider another string:
$_ = '(a(b(c)' . 'z' x 18;
print "$1\n" if
m{ (
\(
(?:
[^()]+
|
\( [^()]* \)
)+
\)
)
}x;
In this case, $1 should contain '(c)', but there is a great deal
of backtracking on failed attempts before it is found. Perl 5.6.0 is
much faster at doing this, but eventually suffers from the
exponential nature of the problem. To illustrate, here are timing runs
on the above using 5.00503 and 5.6.0 (I've cut out some of the output
of the time command):
$ time perl5.00503 t.pl
(c)
0:01.57elapsed
$ time perl5.6.0 t.pl
(c)
0:00.06elapsed
OK, Now let's grow that string out a little more to stretch the
limits of 5.6.0 -- now we replace 'z' x 18 with 'z' x 2000
and see the timing results:
$ time perl5.6.0 t.pl
(c)
2.69user 0.02system 0:02.74elapsed
In case you are wondering, 5.00503 doesn't finish for a very, very long time.
This is where the independent subexpression can save us. Consider this new regex:
$_ = '(a(b(c)' . 'z' x 2000;
print "$1\n" if
m{ (
\(
(?:
(?> [^()]+ ) # <-- independent subexpression
|
\( [^()]* \)
)+
\)
)
}x;
This feature was present in 5.005 as well, so we can now show a timings for each version using this construct:
$ time perl5.00503 t.pl
(c)
0:00.06elapsed
$ time perl5.6.0 t.pl
(c)
0:00.03elapsed
In fact, after changing the string again to 'z' x 1000000 the
timing is still acceptable:
$ time perl5.00503 t.pl
(c)
0:00.54elapsed
$ time perl5.6.0 t.pl
(c)
0:00.44elapsed
The reason for the dramatic speed increase is that no backtracking occurs within the independent subexpression, thus failures are detected very quickly. The regex fails starting from the first parenthesis before even getting to the 'z' characters, but starting from the second parenthesis it could succeed if it finds a closing parenthesis somewhere after the '(c)'. Here the independent subexpression matches the 'z' characters until the end of the string and no closing parenthesis is found -- but the subexpression won't allow give up any characters so the regex falls all the way back to the start of the 'z' characters and then backtracks to the beginning -- from there it bumps ahead to the third parenthesis, the subexpression matches the 'c', then neither alternation matches and the final parenthesis is matched giving the final success.
Finally, we arrive at some of the more esoteric features of the new regex engine that give us real parsing capabilities. The first of these is the conditional expression (experimental feature, present in 5.005 as well). This comes in two flavors:
(?(condition)yespattern)
(?(condition)yespattern|nopattern)
The condition itself can be either an integer (referring to the number of a backreference of capturing parentheses) or a lookahead (or lookbehind) zero-width assertion (an embedded code assertion may also be used, but we will get that shortly).
Let's consider that we want to match a substring containing no
parentheses: [^()]+. But, if it is surrounded by parentheses, then
we want to grab those as well.
$_ = '()abc)((def))(())ghi)';
In this string we want to use /g and pull out the 'abc' and then the '(def)', then the 'ghi'. One way to do this is with alternation trying to match the largest possibility first or else the smaller choice:
$_ = '()abc)((def))(())ghi)';
print "$1\n" while
m{
(
\([^()]+ \)
|
[^()]+
)
}gx;
But, with the conditional we can do away with the alternation and match a final parenthesis based on the condition of having matched an opening parenthesis:
$_ = '()abc)((def))(())ghi)';
print "$1\n" while
m{(
( \( )?
[^()]+
(?(2) \) )
)
}xg;
In this case, we optionally match an opening parenthesis into $2,
then a series of non-parentheses, and then, if $2 contains something
(it matched) we want to find a closing parenthesis, otherwise we
ignore this conditional pattern. In other words, this is an if
condition: if ($2) then match \).
So, obviously, the second form of this construct is an if/else
statement. As a variation on the above example, let's assume we only
want to extract delimited substrings (not containing delimiters) with
any of '(), {}, <>, []' as matched delimiters
$_ = 'a(bc)d{ef}g<h)<ij>[([k]';
We only want to get '(bc)', '{ef}', '<ij>', and '[k]'. The old and newer ways of doing this are:
$_ = 'a(bc)d{ef}g<h)<ij>[([k]';
print "$1\n" while # old way
m#
(
\([^(){}<>\[\]]+\)
|
\{[^(){}<>\[\]]+\}
|
<[^(){}<>\[\]]+>
|
\[[^(){}<>\[\]]+\]
)
#gx;
print "$1\n" while # new way
m#
(
(?: (\() | (\{) | (<) | (\[) )
[^(){}<>\[\]]+
(?(2)\)|(?(3)\}|(?(4)>|(?(5)\]))))
)
#gx;
This turns the 4 main alternations into 4 1-character alternations and a nested conditional. This isn't necessarily better -- it still has 4 alternations, and it will be slower in some cases and faster in others and whether it is more readable is questionable. I included this example so I could show a variation of it in a later section.
When used with a lookahead (behind) assertion, the extra parentheses around the condition are not used because the parentheses of the lookahead (behind) assertion serve to delimit the condition.
Another experimental feature is the ability to embed perl code
into your regular expressions -- and this comes in more than one
flavor. The first flavor is the (?{code}) construct, which is a
zero-width assertion. It can be used as the condition in a
conditional expression: (?(?{code})yespattern|nopattern). In this
case, the code is evaluated and the value of the last statement is
used to determine the truth value of the condition. When not used in
a condition, the assertion always succeeds, regardless of the truth
value of the result, and the result is stored in the $^R special
variable -- which can be used in later code expressions in the regex.
$_ = 'foobar';
/foo(?{print "hello world\n"})bar/;
This will match and print out ``hello world\n''.
$_ = 'foofoobar';
/(foo(?{print "hello world\n"}))+bar/;
And this will match, and ``hello world\n'' will be printed twice, once
for each repetition of the (foo(?{code}))+ segment.
It is worth noting that, at the present time, introducing variables
inside a code-expression without using my() creates global variables
that appear to be exempt from the use strict pragma inside the
regex. On the other hand, due to scoping rules and backtracking, you
probably want to use globals and use local() to limit the scope of
changes to them within segments of the regex that might be
backtracked over. But, you should realize that even though you can
use a global inside of a code expression without having delcared it
in a use vars pragma or without using its fully qualified name --
it is still a global and accessing it outside of the regex requires
the full package name (or declaration via use vars).
The example in the perlre manpage illustrates this with a regex that counts characters matched in a backtracked repetition:
$_ = 'a' x 8;
m<
(?{ $cnt = 0 })
(
a
(?{
local $cnt = $cnt + 1;
})
)*
aaaa
(?{ $res = $cnt })
>x;
Here, $cnt is a global that is initialized in the first code
expression. The second code expression resides inside a grouping
construct -- the <$cnt> variable is localized here so that changes
made to it may be unwound during backtracking. At the end of this
regex, $cnt will equal 0 again, and $res will equal 4. Had we
not localized the $cnt variable in the inner grouping, $res
would have equalled 8 at the end because backtracking would not have
unwound the changes to the variable.
The other flavor of embedded code is the postponed subexpression of
the form (??{code}). Here, the code is evaluated and the result is
used as the subexpression pattern to match -- the important thing to
note about this is that the code is evaluated at runtime, and not
interpolated -- so any variables within it will be evaluated on the
fly with their current values. This means we can even use $1 and
friends in a pattern and get the current results just like \1. Now
let's revisit our matched-delimiter problem from before -- we can put
a mapping of the closing delimiters in a hash, and now we can avoid
the alternation problem by using a character class:
my %map = (
'(' => '\)',
'[' => '\]',
'<' => '\>',
'{' => '\}',
);
$_ = 'a(bc)d{ef}g<h)<ij>[([k]';
print "$1\n" while
m#
( ([ \[ \( \{ < ])
(?> [^(){}<>\[\]]+)
(?(2)(??{$map{$2}}))
)
#gx;
Now, if $2 matches, our conditional will use the current value of
$2 in the lookup map. If the number of delimiters was larger this
mapping version would still remain simple and only require expanding
the initial character class and providing the additional mappings in
the hash.
In Mark Kvale's perlretut.pod (see below) an example is given which detects whether a string of binary 1's and 0's contains a Fibonacci spacing of the 1's as does the string:
1101010010001000001
That is, if you count the number of 0's between the 1's you'll get a sequence of 0,1,1,2,3,5. I have modified that example to detect if a string of characters represents a Fibonacci sequence (in this case the initial zero of the sequence is implied as the null character at the start of any string). Such a Fobonacci string would look like one of these:
abccdddeeeeeffffffffggggggggggggg
abaabbbccccc########aaaaaaaaaaaaa
In other words, groups of repeating characters, the size of which increases in a Fibonacci series.
#!/usr/bin/perl -w
use strict;
my $str = 'abaabbbccccc########aaaaaaaaaaaaa';
my ($s0,$s1,$s2) = (0,1,1);
my $largest = 0;
print "It is a Fibonacci sequence\n"
if $str =~ /
^
(\w)(?!\1)
(?:
(.)(??{"$2" x $s0})(?!\2)
(?{
$largest = $s0 + 1;
($s0, $s1, $s2) = ($s2 + $s1 - 1, $s2, $s2+ $s1);
})
){2,}
$
/x;
print "largest sequence was $largest\n";
use re 'eval';
# the only way to use both code and runtime interpolation
# in a regex (otherwise there is a security risk) as in:
use re 'eval'; # must use this for this example
$_ = 'foobar';
my $re = 'foo';
m/^$re(?{print "hello world\n"})/;
use re 'taint'
# automatically taint anything resulting from a pattern match
# on tainted data -- thus extracted substrings remain tainted
use re 'debug'
use re 'debugcolor'
# these produce debug output from the regex engine on any regexen
# used:
use re 'debug';
$_ = 'fooxxxbar';
m/f*o+.*bar/;
__END__
Compiling REx `f*o+.*bar'
size 11 first at 1
1: STAR(4)
2: EXACT <f>(0)
4: PLUS(7)
5: EXACT <o>(0)
7: STAR(9)
8: REG_ANY(0)
9: EXACT <bar>(11)
11: END(0)
floating `bar' at 1..2147483647 (checking floating) minlen 4
Guessing start of match, REx `f*o+.*bar' against `fooxxxbar'...
Found floating substr `bar' at offset 6...
Guessed: match at offset 0
Matching REx `f*o+.*bar' against `fooxxxbar'
Setting an EVAL scope, savestack=3
0 <> <fooxxxbar> | 1: STAR
EXACT <f> can match 1 times out of 32767...
Setting an EVAL scope, savestack=3
1 <f> <ooxxxbar> | 4: PLUS
EXACT <o> can match 2 times out of 32767...
Setting an EVAL scope, savestack=3
3 <foo> <xxxbar> | 7: STAR
REG_ANY can match 6 times out of 32767...
Setting an EVAL scope, savestack=3
6 <fooxxx> <bar> | 9: EXACT <bar>
9 <fooxxxbar> <> | 11: END
Match successful!
Freeing REx: `f*o+.*bar'
Copyright (c) April 2000, Andrew L Johnson. All rights reserved
The definitive guide to perl's regular expressions can be found in the perlre manpage. Mark Kvale also has two regex tutorials available that are currently available at:
http://keck.ucsf.edu/~kvale/perlrequick.pod
http://keck.ucsf.edu/~kvale/perlretut.pod
The first is a more basic primer and the second a more advanced tutorial including the newer regex features.