Synopsis
If you thought Perl’s regular expressions were powerful, but complex
and difficult to understand and apply, watch out! The more recent additions
to Perl’s regex arsenal might just be enough to make run screaming
for the door—Perl’s regular expressions are now even more
powerful, more potentially complex, and decidedly non-regular by any
definition. This tutorial will present most of the new regex features (both
stable and experimental), and show you how to use them … be
afraid…
Introduction
Several new regex features have been recently introduced into
Perl—some in the 5.005xx series, and others in the new 5.6.0 release.
This tutorial will introduce many of these new features, and a few older
things just in case you weren’t aware of them.
A quick run-down of the regex features you will find in this tutorial are:
precompiling regexen with the qr// operator, embedded modifiers
and localization, lookahead and lookbehind assertions, independent
subexpressions, conditional expressions, and embedded code assertions. We
will finish with mention of a few new re pragmas that you may find useful.
Some of these newer features are officially still in an experimental stage.
Things You Probably Already Know
In this section we will address a few of the things that you may already
know about and which aren’t considered experimental.
Precompiling Regular Expressions
A new operator was introduced in 5.005, the qr// operator. This is
another quoting operator similar to q// and qq//, but which quotes regular
expressions and returns the regex precompiled into Perl’s internal
regex format. A simple usage example is:
my $re = qr/foo.*bar/;
You probably are already aware of the /o modifier, which tells
perl to compile the pattern only once if it contains a variable. This is
useful in some circumstances, but not when you want to match data against
many patterns, perhaps stored in an array. Take a look at the following
code example, ignoring the commented out line that turns the array of
patterns into an array of compiled patterns:
#!/usr/bin/perl -w
use strict;
my @patterns = qw|
a.*b
foo.*bar
a.*b.*z
|;
# @patterns = map{qr/$_/} @patterns;
while(<>){
foreach my $re (@patterns){
print "$re: $_" if /$re/;
}
}
__END__
In the above code, for line of input read we loop through the patterns in
the array and print the line (and the regex) if it matches against the
line. Every time through the loop, the regex must be
recompiled—that’s three regex compilations for every line in
the file. However, if we comment the qr// back into the code, then
the array holds already precompiled patterns and each time through the loop
perl recognizes this and only has to apply the pattern against the
line—there is no further regex compilation required.
The following shows timing results of two consecutive runs of each version
of the above program using /usr/dict/words as the input
file—on my system, this file has one word per line and is 10,754
lines long. As you might expect, compiling 3 regexen 10,754 times really
sucks compared to compiling 3 regexen only once each.
Version without precompiled regexen:
$ time perl retesting.pl /usr/dict/words >junk
6.49user 0.01system 0:06.50elapsed 99%CPU
$ time perl retesting.pl /usr/dict/words >junk
6.46user 0.01system 0:06.46elapsed 100%CPU
Version with precompiled regexen:
$ time perl retesting.pl /usr/dict/words >junk
0.83user 0.01system 0:00.84elapsed 99%CPU
$ time perl retesting.pl /usr/dict/words >junk
0.84user 0.01system 0:00.85elapsed 99%CPU
Embedded Modifiers And Comments
Some of Perl’s regex modifiers, in particular the /ismx
modifiers, can be embedded in the pattern rather than specified after the
final regex delimiter. This is done using syntax of the form
(?ismx), and any of the modifier letters included are then turned
on. You may also turn off modifiers using a dash so that (?i-sm)
will turn on the /i modifier and turn the /sm modifiers
off.
Having embeddable modifiers is useful for constructing patterns based on
user input. Consider the following case:
#!/usr/bin/perl -w
use strict;
print "enter a pattern: ";
chomp(my $re = <STDIN>);
print "case sensitive? [y/n]: ";
chomp(my $cs = <STDIN>);
$cs = $cs =~ /^y/i ? 1 : 0;
while(<>){
if ($cs) {
print if /$re/i;
}else{
print if /$re/;
}
}
Now, that’s not terrible, but we are forced to use a conditional test
within the input loop, for every line in the input. Using an embedded
modifier we can create the correct pattern before we enter the input loop:
#!/usr/bin/perl -w
use strict;
print "enter a pattern: ";
chomp(my $re = <STDIN>);
print "case sensitive? [y/n]: ";
chomp(my $cs = <STDIN>);
if ($cs =~ /^y/i) {
$re = "(?i)$re";
}
while(<>){
print if /$re/;
}
Don’t ask me of what use embedding a /x modifier is because
I can’t really think of a good one.
This same type of syntax can be used to embed a comment within a regex as
in /a(?# this is silly).*bcd/.
Lookahead And Lookbehind Assertions
Ok, Perl has had lookahead assertions for quite some time, so I will only
mention the briefly. A lookahead assertion is a form of anchoring a regex,
it is a zero-width assertion that does not consume any of the target
string. For example: /a(?!b).*c/ will match only if it finds an
‘a’ that is not immediately followed by a b, but is followed by
zero-or-more of anything and a ‘c’. Similarly, this regex:
/a(?=b)(.*c)/ will only match if an ‘a’ is followed
immediately by a ‘b’ and then zero-or-more of anything followed
by a ‘c’ —and the $1 variable will include the
‘b’ and everything up to and including the ‘c’.
Perl also has the somewhat newer lookbehind assertions, which are similar
exept that they apply to what came immediately prior to the current
position: /(".*?")(?<!\")/ will match a double
quote followed by anything up to the next double quote that is not preceded
by a backslash. The limitation on lookbehinds is that the lookbehind
expression must be a fixed width pattern—in other words,
indeterminate quantifiers are not allowed in the expression being tested.
There is also the positive lookbehind of the form /(?<=)/ that
is subject to the same rules and limitation.
Things You Might Not Know
Localizing The Scope Of Embedded Modifiers
Now, you probably already knew about the embedded modifier feature, but did
you know that you could localize such a modifier to only have effect for
part of the pattern? Well, in 5.005 you could do this by using the embedded
modifier within a grouping construct and the modifier would only apply to
the pattern within that construct:
/(?:(?i)foo)bar/ # foo is case insensitive, but bar is not
Now the non-capturing grouping construct, (?:…) has been
modified to take embedded modifiers between the ? and the : and those
modifiers apply only the pattern following the colon as in
/(?i:foo)bar/. You may also turn off modifiers using the same
syntax as the embedded modifier form discussed previously:
/(?i-s:f.*)bar.*quux/s, here the grouping is case insensitive and
the . is not allowed to match newlines between the ‘f’ and
‘bar’.
Independent Subexpressions
Another recent (experimental) addition is the independent subexpression
with a syntax of: /(?>pattern)morepattern/. If an independent
subexpression matches the regex engine will not allow backtracking within
the subexpression if the remaining pattern fails.
Let’s explore the example in the perlre manpage in a little detail.
In this example we want to match a non-empty substring contained within
parentheses that may contain one more level of parentheses within it. Thus,
in the string:
$_ = '(a(b(c)))';
We want our pattern to match the substring ’(b©)’ and put
it into the $1 variable. The following will do this:
$_ = '(a(b(c)))';
print "$1\n" if
m{ (
\(
(?:
[^()]+
|
\( [^()]* \)
)+
\)
)
}x;
However, there is a problem with this regex. Consider a case where the
match may get started but eventually fail — on a long string that
‘eventually’ may take a long long time. This is because we have
a pattern which contains an indeterminate quantifier within a grouping that
is also modified by an indeterminate quantifier — in other words,
something of the form: (x+)+ . In such a case, the regex engine
may have to explore a potentially huge number of ways of splitting up the
string. Let’s consider another string:
$_ = '(a(b(c)' . 'z' x 18;
print "$1\n" if
m{ (
\(
(?:
[^()]+
|
\( [^()]* \)
)+
\)
)
}x;
In this case, $1 should contain ’(c)’, but
there is a great deal of backtracking on failed attempts before it is
found. Perl 5.6.0 is much faster at doing this, but eventually suffers from
the exponential nature of the problem. To illustrate, here are timing runs
on the above using 5.00503 and 5.6.0 (I’ve cut out some of the output
of the time command):
$ time perl5.00503 t.pl
(c)
0:01.57elapsed
$ time perl5.6.0 t.pl
(c)
0:00.06elapsed
OK, Now let’s grow that string out a little more to stretch the
limits of 5.6.0 — now we replace ‘z’ x 18 with
‘z’ x 2000 and see the timing results:
$ time perl5.6.0 t.pl
(c)
2.69user 0.02system 0:02.74elapsed
In case you are wondering, 5.00503 doesn’t finish for a very, very
long time.
This is where the independent subexpression can save us. Consider this new
regex:
$_ = '(a(b(c)' . 'z' x 2000;
print "$1\n" if
m{ (
\(
(?:
(?> [^()]+ ) # <-- independent subexpression
|
\( [^()]* \)
)+
\)
)
}x;
This feature was present in 5.005 as well, so we can now show a timings for
each version using this construct:
$ time perl5.00503 t.pl
(c)
0:00.06elapsed
$ time perl5.6.0 t.pl
(c)
0:00.03elapsed
In fact, after changing the string again to ‘z’ x
1000000 the timing is still acceptable:
$ time perl5.00503 t.pl
(c)
0:00.54elapsed
$ time perl5.6.0 t.pl
(c)
0:00.44elapsed
The reason for the dramatic speed increase is that no backtracking occurs
within the independent subexpression, thus failures are detected very
quickly. The regex fails starting from the first parenthesis before even
getting to the ‘z’ characters, but starting from the second
parenthesis it could succeed if it finds a closing parenthesis somewhere
after the ’(c)’. Here the independent subexpression
matches the ‘z’ characters until the end of the string and no
closing parenthesis is found — but the subexpression won’t
allow give up any characters so the regex falls all the way back to the
start of the ‘z’ characters and then backtracks to the
beginning — from there it bumps ahead to the third parenthesis, the
subexpression matches the ‘c’, then neither alternation matches
and the final parenthesis is matched giving the final success.
Things You Might Not Want To Know
Conditional Expressions
Finally, we arrive at some of the more esoteric features of the new regex
engine that give us real parsing capabilities. The first of these is the
conditional expression (experimental feature, present in 5.005 as well).
This comes in two flavors:
(?(condition)yespattern)
(?(condition)yespattern|nopattern)
The condition itself can be either an integer (referring to the number of a
backreference of capturing parentheses) or a lookahead (or lookbehind)
zero-width assertion (an embedded code assertion may also be used, but we
will get that shortly).
Let’s consider that we want to match a substring containing no
parentheses: [^()]+. But, if it is surrounded by parentheses, then
we want to grab those as well.
$_ = '()abc)((def))(())ghi)';
In this string we want to use /g and pull out the ‘abc’ and
then the ’(def)’, then the ‘ghi’. One way to do
this is with alternation trying to match the largest possibility first or
else the smaller choice:
$_ = '()abc)((def))(())ghi)';
print "$1\n" while
m{
(
\([^()]+ \)
|
[^()]+
)
}gx;
But, with the conditional we can do away with the alternation and match a
final parenthesis based on the condition of having matched an opening
parenthesis:
$_ = '()abc)((def))(())ghi)';
print "$1\n" while
m{(
( \( )?
[^()]+
(?(2) \) )
)
}xg;
In this case, we optionally match an opening parenthesis into $2, then a
series of non-parentheses, and then, if $2 contains something (it matched)
we want to find a closing parenthesis, otherwise we ignore this conditional
pattern. In other words, this is an if condition: if ($2) then
match \).
So, obviously, the second form of this construct is an if/else
statement. As a variation on the above example, let’s assume we only
want to extract delimited substrings (not containing delimiters) with any
of ’(), {}, <>, []’ as matched delimiters
$_ = 'a(bc)d{ef}g<h)<ij>[([k]';
We only want to get ’(bc)’, ’{ef}’,
’<ij>’, and ’[k]’. The old and newer ways of
doing this are:
$_ = 'a(bc)d{ef}g<h)<ij>[([k]';
print "$1\n" while # old way
m#
(
\([^(){}<>\[\]]+\)
|
\{[^(){}<>\[\]]+\}
|
<[^(){}<>\[\]]+>
|
\[[^(){}<>\[\]]+\]
)
#gx;
print "$1\n" while # new way
m#
(
(?: (\() | (\{) | (<) | (\[) )
[^(){}<>\[\]]+
(?(2)\)|(?(3)\}|(?(4)>|(?(5)\]))))
)
#gx;
This turns the 4 main alternations into 4 1-character alternations and a
nested conditional. This isn’t necessarily better — it still
has 4 alternations, and it will be slower in some cases and faster in
others and whether it is more readable is questionable. I included this
example so I could show a variation of it in a later section.
When used with a lookahead (behind) assertion, the extra parentheses around
the condition are not used because the parentheses of the lookahead
(behind) assertion serve to delimit the condition.
Embedded Code
Another experimental feature is the ability to embed perl code into your
regular expressions — and this comes in more than one flavor. The
first flavor is the (?{code}) construct, which is a zero-width
assertion. It can be used as the condition in a conditional expression:
(?(?{code})yespattern|nopattern). In this case, the code is
evaluated and the value of the last statement is used to determine the
truth value of the condition. When not used in a condition, the assertion
always succeeds, regardless of the truth value of the result, and the
result is stored in the $^R special variable — which can be
used in later code expressions in the regex.
$_ = 'foobar';
/foo(?{print "hello world\n"})bar/;
This will match and print out "hello world\n".
$_ = 'foofoobar';
/(foo(?{print "hello world\n"}))+bar/;
And this will match, and "hello world\n" will be printed twice,
once for each repetition of the (foo(?{code}))+ segment.
It is worth noting that, at the present time, introducing variables inside
a code-expression without using my() creates global variables that
appear to be exempt from the use strict pragma inside the regex.
On the other hand, due to scoping rules and backtracking, you probably want
to use globals and use local() to limit the scope of changes to
them within segments of the regex that might be backtracked over. But, you
should realize that even though you can use a global inside of a code
expression without having delcared it in a use vars pragma or
without using its fully qualified name — it is still a global and
accessing it outside of the regex requires the full package name (or
declaration via use vars).
The example in the perlre manpage illustrates this with a regex that counts
characters matched in a backtracked repetition:
$_ = 'a' x 8;
m<
(?{ $cnt = 0 })
(
a
(?{
local $cnt = $cnt + 1;
})
)*
aaaa
(?{ $res = $cnt })
>x;
Here, $cnt is a global that is initialized in the first code
expression. The second code expression resides inside a grouping construct
— the $cnt variable is localized here so that changes made
to it may be unwound during backtracking. At the end of this regex,
$cnt will equal 0 again, and $res will equal 4. Had we
not localized the $cnt variable in the inner grouping,
$res would have equalled 8 at the end because backtracking would
not have unwound the changes to the variable.
The other flavor of embedded code is the postponed subexpression of the
form (??{code}). Here, the code is evaluated and the result is
used as the subexpression pattern to match — the important thing to
note about this is that the code is evaluated at runtime, and not
interpolated — so any variables within it will be evaluated on the
fly with their current values. This means we can even use $1 and
friends in a pattern and get the current results just like \1. Now
let’s revisit our matched-delimiter problem from before — we
can put a mapping of the closing delimiters in a hash, and now we can avoid
the alternation problem by using a character class:
my %map = (
'(' => '\)',
'[' => '\]',
'<' => '\>',
'{' => '\}',
);
$_ = 'a(bc)d{ef}g<h)<ij>[([k]';
print "$1\n" while
m#
( ([ \[ \( \{ < ])
(?> [^(){}<>\[\]]+)
(?(2)(??{$map{$2}}))
)
#gx;
Now, if $2 matches, our conditional will use the current value of
$2 in the lookup map. If the number of delimiters was larger this
mapping version would still remain simple and only require expanding the
initial character class and providing the additional mappings in the hash.
In Mark Kvale’s perlretut.pod (see below) an example is given which
detects whether a string of binary 1’s and 0’s contains a
Fibonacci spacing of the 1’s as does the string:
1101010010001000001
That is, if you count the number of 0’s between the 1’s
you’ll get a sequence of 0,1,1,2,3,5. I have modified that example to
detect if a string of characters represents a Fibonacci sequence (in this
case the initial zero of the sequence is implied as the null character at
the start of any string). Such a Fobonacci string would look like one of
these:
abccdddeeeeeffffffffggggggggggggg
abaabbbccccc########aaaaaaaaaaaaa
In other words, groups of repeating characters, the size of which increases
in a Fibonacci series.
#!/usr/bin/perl -w
use strict;
my $str = 'abaabbbccccc########aaaaaaaaaaaaa';
my ($s0,$s1,$s2) = (0,1,1);
my $largest = 0;
print "It is a Fibonacci sequence\n"
if $str =~ /
^
(\w)(?!\1)
(?:
(.)(??{"$2" x $s0})(?!\2)
(?{
$largest = $s0 + 1;
($s0, $s1, $s2) = ($s2 + $s1 - 1, $s2, $s2+ $s1);
})
){2,}
$
/x;
print "largest sequence was $largest\n";
Epilogue: A Few Re Pragmas
use re 'eval';
# the only way to use both code and runtime interpolation
# in a regex (otherwise there is a security risk) as in:
use re 'eval'; # must use this for this example
$_ = 'foobar';
my $re = 'foo';
m/^$re(?{print "hello world\n"})/;
use re 'taint'
# automatically taint anything resulting from a pattern match
# on tainted data -- thus extracted substrings remain tainted
use re 'debug'
use re 'debugcolor'
# these produce debug output from the regex engine on any regexen
# used:
use re 'debug';
$_ = 'fooxxxbar';
m/f*o+.*bar/;
__END__
Compiling REx `f*o+.*bar'
size 11 first at 1
1: STAR(4)
2: EXACT <f>(0)
4: PLUS(7)
5: EXACT <o>(0)
7: STAR(9)
8: REG_ANY(0)
9: EXACT <bar>(11)
11: END(0)
floating `bar' at 1..2147483647 (checking floating) minlen 4
Guessing start of match, REx `f*o+.*bar' against `fooxxxbar'...
Found floating substr `bar' at offset 6...
Guessed: match at offset 0
Matching REx `f*o+.*bar' against `fooxxxbar'
Setting an EVAL scope, savestack=3
0 <> <fooxxxbar> | 1: STAR
EXACT <f> can match 1 times out of 32767...
Setting an EVAL scope, savestack=3
1 <f> <ooxxxbar> | 4: PLUS
EXACT <o> can match 2 times out of 32767...
Setting an EVAL scope, savestack=3
3 <foo> <xxxbar> | 7: STAR
REG_ANY can match 6 times out of 32767...
Setting an EVAL scope, savestack=3
6 <fooxxx> <bar> | 9: EXACT <bar>
9 <fooxxxbar> <> | 11: END
Match successful!
Freeing REx: `f*o+.*bar'
Author
Copyright © April 2000, Andrew L Johnson. All rights reserved
Other Resources
The definitive guide to perl’s regular expressions can be found in
the perlre manpage. Mark Kvale also has two regex tutorials available that
are currently available at:
http://keck.ucsf.edu/~kvale/perlrequick.pod
http://keck.ucsf.edu/~kvale/perlretut.pod
The first is a more basic primer and the second a more advanced tutorial
including the newer regex features.