Siaris
Simple Things
Articles and other Writings

a few words more …
 
Faqgrep:This is a simple, but useful Perl script, written up as an example of Literate Programming. The plain faqgrep script is in the code section. (pdf-download)
Regex Chapter:The free Regex chapter from my book — while the underlying language is Perl, the regex concepts are general enough. (pdf-download)
AdvancedRegex:Tutorial for a few of Perl’s (and other languages) more advanced regex features.

 

BloomFilter

simple things …
 

A bloom-filter (after Burton H. Bloom) is a method of superimposed coding on primary keys. The purpose is to provide a means of set membership queries utilizing minimal space — with the tradeoff of an adjustable rate of false positives.

See this article for a little background.

Latest changes

  • Tries the BitSet (C extension) before falling back on its own SimpleVector (support for BitVector removed as it had no real performance advantage over BitSet, and it had leaked).
  • BloomFilters are marshal-able, so you only need create your large filter once and then store or transmit it.

Downloads:

Or grab it from my darcs repository:

  darcs get http://repos.siaris.net/ruby/BloomFilter
Tree + Heap == Treap

simple things …
 

A treap is a randomized binary search tree using random priorities to probabilistically balance the tree.

Fetch a tarball:

Or grab it from my darcs repository:

  darcs get http://repos.siaris.net/ruby/Treap
WordFind Puzzle Generator

simple things …
 

A simple class for generating word-find style puzzles (simple plain text format at the moment).

Fetch the tarball here:

Or grab it from my darcs repository:

  darcs get http://repos.siaris.net/ruby/wordfind
Software Packages and Scripts

simple things …
 

Ruby

Ruby-BloomFilter:- a simple but effective bloomfilter class
Ruby-Treap:- a randomized binary search tree
Ruby-Async:- asynchronous sequential messaging (per object)
Ruby-WordFind:- generate wordfind puzzles

Perl

Perl-Faqgrep:- a simple script for searching the perlfaqs
Games-WordFind:- a class for generating wordfind style puzzles
Tree-Treap:- randomized binary search tree
Tie-Hash-Treap:- a hash interface for the treap (gives ordered hash)

 

Asynchronous sequential messages

simple things…
 

Adding asynchronous message queues to Ruby objects, with semi-transparent futures. See: this article

See the wiki-feedback page for the above article to see the derivation of these two versions:

Or darcs it:

  darcs get http://repos.siaris.net/ruby/Async
Tree-Treap

simple things…
 

A treap is a randomized binary search tree using random priorities to probabilistically balance the tree.

Fetch the tarball here:

Or grab it from my darcs repository:

  darcs get http://repos.siaris.net/perl/Tree-Treap
Tie-Hash-Treap

simple things …
 

A treap is a randomized binary search tree using random priorities to probabilistically balance the tree.

This gives a tied-hash interface to Tree-Treap (required)

Fetch the tarball here:

Or grab it from my darcs repository:

  darcs get http://repos.siaris.net/perl/Tie-Hash-Treap
Perl: Faqgrep

simple things are better …
 

A Perl script for searching the perlfaq entries for keywords. Written before the perldoc -q term option, but it is still a more convenient tool (IMHO).

Perl: Games-WordFind

 
 
A Perl module to generate word-find type puzzles. Output can be either HTML, LaTeX, or plain text.

Or use darcs and fetch it from my darcs repository:

  darcs get http://repos.siaris.net/perl/Games-WordFind
Perl Regex Tutorial - New Features (5.005 And 5.6.0)

Andrew L Johnson
 

Synopsis

If you thought Perl’s regular expressions were powerful, but complex and difficult to understand and apply, watch out! The more recent additions to Perl’s regex arsenal might just be enough to make run screaming for the door—Perl’s regular expressions are now even more powerful, more potentially complex, and decidedly non-regular by any definition. This tutorial will present most of the new regex features (both stable and experimental), and show you how to use them … be afraid…

Introduction

Several new regex features have been recently introduced into Perl—some in the 5.005xx series, and others in the new 5.6.0 release. This tutorial will introduce many of these new features, and a few older things just in case you weren’t aware of them.

A quick run-down of the regex features you will find in this tutorial are: precompiling regexen with the qr// operator, embedded modifiers and localization, lookahead and lookbehind assertions, independent subexpressions, conditional expressions, and embedded code assertions. We will finish with mention of a few new re pragmas that you may find useful. Some of these newer features are officially still in an experimental stage.

Things You Probably Already Know

In this section we will address a few of the things that you may already know about and which aren’t considered experimental.

Precompiling Regular Expressions

A new operator was introduced in 5.005, the qr// operator. This is another quoting operator similar to q// and qq//, but which quotes regular expressions and returns the regex precompiled into Perl’s internal regex format. A simple usage example is:

    my $re = qr/foo.*bar/;

You probably are already aware of the /o modifier, which tells perl to compile the pattern only once if it contains a variable. This is useful in some circumstances, but not when you want to match data against many patterns, perhaps stored in an array. Take a look at the following code example, ignoring the commented out line that turns the array of patterns into an array of compiled patterns:

    #!/usr/bin/perl -w
    use strict;

    my @patterns = qw|
        a.*b
        foo.*bar
        a.*b.*z
        |;

    # @patterns = map{qr/$_/} @patterns;

    while(<>){
        foreach my $re (@patterns){
            print "$re: $_" if /$re/;
        }
    }
    __END__

In the above code, for line of input read we loop through the patterns in the array and print the line (and the regex) if it matches against the line. Every time through the loop, the regex must be recompiled—that’s three regex compilations for every line in the file. However, if we comment the qr// back into the code, then the array holds already precompiled patterns and each time through the loop perl recognizes this and only has to apply the pattern against the line—there is no further regex compilation required.

The following shows timing results of two consecutive runs of each version of the above program using /usr/dict/words as the input file—on my system, this file has one word per line and is 10,754 lines long. As you might expect, compiling 3 regexen 10,754 times really sucks compared to compiling 3 regexen only once each.

    Version without precompiled regexen:

    $ time perl retesting.pl /usr/dict/words >junk
    6.49user 0.01system 0:06.50elapsed 99%CPU

    $ time perl retesting.pl /usr/dict/words >junk
    6.46user 0.01system 0:06.46elapsed 100%CPU

    Version with precompiled regexen:
    $ time perl retesting.pl /usr/dict/words >junk
    0.83user 0.01system 0:00.84elapsed 99%CPU

    $ time perl retesting.pl /usr/dict/words >junk
    0.84user 0.01system 0:00.85elapsed 99%CPU

Embedded Modifiers And Comments

Some of Perl’s regex modifiers, in particular the /ismx modifiers, can be embedded in the pattern rather than specified after the final regex delimiter. This is done using syntax of the form (?ismx), and any of the modifier letters included are then turned on. You may also turn off modifiers using a dash so that (?i-sm) will turn on the /i modifier and turn the /sm modifiers off.

Having embeddable modifiers is useful for constructing patterns based on user input. Consider the following case:

    #!/usr/bin/perl -w
    use strict;
    print "enter a pattern: ";
    chomp(my $re = <STDIN>);

    print "case sensitive? [y/n]: ";
    chomp(my $cs = <STDIN>);
    $cs = $cs =~ /^y/i ? 1 : 0;

    while(<>){
        if ($cs) {
            print if /$re/i;
        }else{
            print if /$re/;
        }
    }

Now, that’s not terrible, but we are forced to use a conditional test within the input loop, for every line in the input. Using an embedded modifier we can create the correct pattern before we enter the input loop:

    #!/usr/bin/perl -w
    use strict;
    print "enter a pattern: ";
    chomp(my $re = <STDIN>);

    print "case sensitive? [y/n]: ";
    chomp(my $cs = <STDIN>);
    if ($cs =~ /^y/i) {
        $re = "(?i)$re";
    }

    while(<>){
        print if /$re/;
    }

Don’t ask me of what use embedding a /x modifier is because I can’t really think of a good one.

This same type of syntax can be used to embed a comment within a regex as in /a(?# this is silly).*bcd/.

Lookahead And Lookbehind Assertions

Ok, Perl has had lookahead assertions for quite some time, so I will only mention the briefly. A lookahead assertion is a form of anchoring a regex, it is a zero-width assertion that does not consume any of the target string. For example: /a(?!b).*c/ will match only if it finds an ‘a’ that is not immediately followed by a b, but is followed by zero-or-more of anything and a ‘c’. Similarly, this regex: /a(?=b)(.*c)/ will only match if an ‘a’ is followed immediately by a ‘b’ and then zero-or-more of anything followed by a ‘c’ —and the $1 variable will include the ‘b’ and everything up to and including the ‘c’.

Perl also has the somewhat newer lookbehind assertions, which are similar exept that they apply to what came immediately prior to the current position: /(".*?")(?<!\")/ will match a double quote followed by anything up to the next double quote that is not preceded by a backslash. The limitation on lookbehinds is that the lookbehind expression must be a fixed width pattern—in other words, indeterminate quantifiers are not allowed in the expression being tested. There is also the positive lookbehind of the form /(?<=)/ that is subject to the same rules and limitation.

Things You Might Not Know

Localizing The Scope Of Embedded Modifiers

Now, you probably already knew about the embedded modifier feature, but did you know that you could localize such a modifier to only have effect for part of the pattern? Well, in 5.005 you could do this by using the embedded modifier within a grouping construct and the modifier would only apply to the pattern within that construct:

    /(?:(?i)foo)bar/ # foo is case insensitive, but bar is not

Now the non-capturing grouping construct, (?:…) has been modified to take embedded modifiers between the ? and the : and those modifiers apply only the pattern following the colon as in /(?i:foo)bar/. You may also turn off modifiers using the same syntax as the embedded modifier form discussed previously: /(?i-s:f.*)bar.*quux/s, here the grouping is case insensitive and the . is not allowed to match newlines between the ‘f’ and ‘bar’.

Independent Subexpressions

Another recent (experimental) addition is the independent subexpression with a syntax of: /(?>pattern)morepattern/. If an independent subexpression matches the regex engine will not allow backtracking within the subexpression if the remaining pattern fails.

Let’s explore the example in the perlre manpage in a little detail. In this example we want to match a non-empty substring contained within parentheses that may contain one more level of parentheses within it. Thus, in the string:

    $_ = '(a(b(c)))';

We want our pattern to match the substring ’(b©)’ and put it into the $1 variable. The following will do this:

    $_ = '(a(b(c)))';
    print "$1\n" if
        m{ (
              \(
                  (?:
                    [^()]+
                  |
                    \( [^()]* \)
                  )+
              \)
            )
         }x;

However, there is a problem with this regex. Consider a case where the match may get started but eventually fail — on a long string that ‘eventually’ may take a long long time. This is because we have a pattern which contains an indeterminate quantifier within a grouping that is also modified by an indeterminate quantifier — in other words, something of the form: (x+)+ . In such a case, the regex engine may have to explore a potentially huge number of ways of splitting up the string. Let’s consider another string:

    $_ = '(a(b(c)' . 'z' x 18;
    print "$1\n" if
        m{ (
              \(
                  (?:
                    [^()]+
                  |
                    \( [^()]* \)
                  )+
              \)
            )
         }x;

In this case, $1 should contain ’(c)’, but there is a great deal of backtracking on failed attempts before it is found. Perl 5.6.0 is much faster at doing this, but eventually suffers from the exponential nature of the problem. To illustrate, here are timing runs on the above using 5.00503 and 5.6.0 (I’ve cut out some of the output of the time command):

    $ time perl5.00503 t.pl
    (c)
    0:01.57elapsed
    $ time perl5.6.0 t.pl
    (c)
    0:00.06elapsed

OK, Now let’s grow that string out a little more to stretch the limits of 5.6.0 — now we replace ‘z’ x 18 with ‘z’ x 2000 and see the timing results:

    $ time perl5.6.0 t.pl
    (c)
    2.69user 0.02system 0:02.74elapsed

In case you are wondering, 5.00503 doesn’t finish for a very, very long time.

This is where the independent subexpression can save us. Consider this new regex:

    $_ = '(a(b(c)' . 'z' x 2000;
    print "$1\n" if
        m{ (
              \(
                  (?:
                    (?> [^()]+ )     # <-- independent subexpression
                  |
                    \( [^()]* \)
                  )+
              \)
            )
         }x;

This feature was present in 5.005 as well, so we can now show a timings for each version using this construct:

    $ time perl5.00503 t.pl
    (c)
    0:00.06elapsed
    $ time perl5.6.0 t.pl
    (c)
    0:00.03elapsed

In fact, after changing the string again to ‘z’ x 1000000 the timing is still acceptable:

    $ time perl5.00503 t.pl
    (c)
    0:00.54elapsed
    $ time perl5.6.0 t.pl
    (c)
    0:00.44elapsed

The reason for the dramatic speed increase is that no backtracking occurs within the independent subexpression, thus failures are detected very quickly. The regex fails starting from the first parenthesis before even getting to the ‘z’ characters, but starting from the second parenthesis it could succeed if it finds a closing parenthesis somewhere after the ’(c)’. Here the independent subexpression matches the ‘z’ characters until the end of the string and no closing parenthesis is found — but the subexpression won’t allow give up any characters so the regex falls all the way back to the start of the ‘z’ characters and then backtracks to the beginning — from there it bumps ahead to the third parenthesis, the subexpression matches the ‘c’, then neither alternation matches and the final parenthesis is matched giving the final success.

Things You Might Not Want To Know

Conditional Expressions

Finally, we arrive at some of the more esoteric features of the new regex engine that give us real parsing capabilities. The first of these is the conditional expression (experimental feature, present in 5.005 as well). This comes in two flavors:

    (?(condition)yespattern)
    (?(condition)yespattern|nopattern)

The condition itself can be either an integer (referring to the number of a backreference of capturing parentheses) or a lookahead (or lookbehind) zero-width assertion (an embedded code assertion may also be used, but we will get that shortly).

Let’s consider that we want to match a substring containing no parentheses: [^()]+. But, if it is surrounded by parentheses, then we want to grab those as well.

    $_ = '()abc)((def))(())ghi)';

In this string we want to use /g and pull out the ‘abc’ and then the ’(def)’, then the ‘ghi’. One way to do this is with alternation trying to match the largest possibility first or else the smaller choice:

    $_ = '()abc)((def))(())ghi)';
    print "$1\n" while
        m{
            (
                \([^()]+ \)
              |
                [^()]+
            )
        }gx;

But, with the conditional we can do away with the alternation and match a final parenthesis based on the condition of having matched an opening parenthesis:

    $_ = '()abc)((def))(())ghi)';
    print "$1\n" while
        m{(
           ( \( )?
           [^()]+
           (?(2) \) )
          )
         }xg;

In this case, we optionally match an opening parenthesis into $2, then a series of non-parentheses, and then, if $2 contains something (it matched) we want to find a closing parenthesis, otherwise we ignore this conditional pattern. In other words, this is an if condition: if ($2) then match \).

So, obviously, the second form of this construct is an if/else statement. As a variation on the above example, let’s assume we only want to extract delimited substrings (not containing delimiters) with any of ’(), {}, <>, []’ as matched delimiters

    $_ = 'a(bc)d{ef}g<h)<ij>[([k]';

We only want to get ’(bc)’, ’{ef}’, ’<ij>’, and ’[k]’. The old and newer ways of doing this are:

    $_ = 'a(bc)d{ef}g<h)<ij>[([k]';

    print "$1\n" while               # old way
        m#
            (
                \([^(){}<>\[\]]+\)
              |
                \{[^(){}<>\[\]]+\}
              |
                <[^(){}<>\[\]]+>
              |
                \[[^(){}<>\[\]]+\]
            )
        #gx;

    print "$1\n" while               # new way
        m#
            (
                (?: (\() | (\{) | (<) | (\[) )
                [^(){}<>\[\]]+
                (?(2)\)|(?(3)\}|(?(4)>|(?(5)\]))))
            )
        #gx;

This turns the 4 main alternations into 4 1-character alternations and a nested conditional. This isn’t necessarily better — it still has 4 alternations, and it will be slower in some cases and faster in others and whether it is more readable is questionable. I included this example so I could show a variation of it in a later section.

When used with a lookahead (behind) assertion, the extra parentheses around the condition are not used because the parentheses of the lookahead (behind) assertion serve to delimit the condition.

Embedded Code

Another experimental feature is the ability to embed perl code into your regular expressions — and this comes in more than one flavor. The first flavor is the (?{code}) construct, which is a zero-width assertion. It can be used as the condition in a conditional expression: (?(?{code})yespattern|nopattern). In this case, the code is evaluated and the value of the last statement is used to determine the truth value of the condition. When not used in a condition, the assertion always succeeds, regardless of the truth value of the result, and the result is stored in the $^R special variable — which can be used in later code expressions in the regex.

    $_ = 'foobar';
    /foo(?{print "hello world\n"})bar/;

This will match and print out "hello world\n".

    $_ = 'foofoobar';
    /(foo(?{print "hello world\n"}))+bar/;

And this will match, and "hello world\n" will be printed twice, once for each repetition of the (foo(?{code}))+ segment.

It is worth noting that, at the present time, introducing variables inside a code-expression without using my() creates global variables that appear to be exempt from the use strict pragma inside the regex. On the other hand, due to scoping rules and backtracking, you probably want to use globals and use local() to limit the scope of changes to them within segments of the regex that might be backtracked over. But, you should realize that even though you can use a global inside of a code expression without having delcared it in a use vars pragma or without using its fully qualified name — it is still a global and accessing it outside of the regex requires the full package name (or declaration via use vars).

The example in the perlre manpage illustrates this with a regex that counts characters matched in a backtracked repetition:

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })
       (
         a
         (?{
             local $cnt = $cnt + 1;
         })
       )*
       aaaa
       (?{ $res = $cnt })
     >x;

Here, $cnt is a global that is initialized in the first code expression. The second code expression resides inside a grouping construct — the $cnt variable is localized here so that changes made to it may be unwound during backtracking. At the end of this regex, $cnt will equal 0 again, and $res will equal 4. Had we not localized the $cnt variable in the inner grouping, $res would have equalled 8 at the end because backtracking would not have unwound the changes to the variable.

The other flavor of embedded code is the postponed subexpression of the form (??{code}). Here, the code is evaluated and the result is used as the subexpression pattern to match — the important thing to note about this is that the code is evaluated at runtime, and not interpolated — so any variables within it will be evaluated on the fly with their current values. This means we can even use $1 and friends in a pattern and get the current results just like \1. Now let’s revisit our matched-delimiter problem from before — we can put a mapping of the closing delimiters in a hash, and now we can avoid the alternation problem by using a character class:

    my %map = (
        '(' => '\)',
        '[' => '\]',
        '<' => '\>',
        '{' => '\}',
        );

    $_ = 'a(bc)d{ef}g<h)<ij>[([k]';
    print "$1\n" while
         m#
             ( ([ \[ \( \{ < ])
                (?> [^(){}<>\[\]]+)
               (?(2)(??{$map{$2}}))
             )
         #gx;

Now, if $2 matches, our conditional will use the current value of $2 in the lookup map. If the number of delimiters was larger this mapping version would still remain simple and only require expanding the initial character class and providing the additional mappings in the hash.

In Mark Kvale’s perlretut.pod (see below) an example is given which detects whether a string of binary 1’s and 0’s contains a Fibonacci spacing of the 1’s as does the string:

    1101010010001000001

That is, if you count the number of 0’s between the 1’s you’ll get a sequence of 0,1,1,2,3,5. I have modified that example to detect if a string of characters represents a Fibonacci sequence (in this case the initial zero of the sequence is implied as the null character at the start of any string). Such a Fobonacci string would look like one of these:

    abccdddeeeeeffffffffggggggggggggg
    abaabbbccccc########aaaaaaaaaaaaa

In other words, groups of repeating characters, the size of which increases in a Fibonacci series.

    #!/usr/bin/perl -w
    use strict;

    my $str = 'abaabbbccccc########aaaaaaaaaaaaa';
    my ($s0,$s1,$s2) = (0,1,1);
    my $largest = 0;
    print "It is a Fibonacci sequence\n"
        if $str =~ /
            ^
            (\w)(?!\1)
            (?:
                (.)(??{"$2" x $s0})(?!\2)
                (?{
                    $largest = $s0 + 1;
                    ($s0, $s1, $s2) = ($s2 + $s1 - 1, $s2, $s2+ $s1);
                  })
            ){2,}
            $
            /x;
    print "largest sequence was $largest\n";

Epilogue: A Few Re Pragmas

    use re 'eval';
    # the only way to use both code and runtime interpolation
    # in a regex (otherwise there is a security risk) as in:

    use re 'eval';  # must use this for this example
    $_ = 'foobar';
    my $re = 'foo';
    m/^$re(?{print "hello world\n"})/;

    use re 'taint'
    # automatically taint anything resulting from a pattern match
    # on tainted data -- thus extracted substrings remain tainted

    use re 'debug'
    use re 'debugcolor'
    # these produce debug output from the regex engine on any regexen
    # used:

    use re 'debug';
    $_ = 'fooxxxbar';
    m/f*o+.*bar/;
    __END__

    Compiling REx `f*o+.*bar'
    size 11 first at 1
       1: STAR(4)
       2:   EXACT <f>(0)
       4: PLUS(7)
       5:   EXACT <o>(0)
       7: STAR(9)
       8:   REG_ANY(0)
       9: EXACT <bar>(11)
      11: END(0)
    floating `bar' at 1..2147483647 (checking floating) minlen 4
    Guessing start of match, REx `f*o+.*bar' against `fooxxxbar'...
    Found floating substr `bar' at offset 6...
    Guessed: match at offset 0
    Matching REx `f*o+.*bar' against `fooxxxbar'
      Setting an EVAL scope, savestack=3
       0 <> <fooxxxbar>       |  1:  STAR
                               EXACT <f> can match 1 times out of 32767...
      Setting an EVAL scope, savestack=3
       1 <f> <ooxxxbar>       |  4:    PLUS
                               EXACT <o> can match 2 times out of 32767...
      Setting an EVAL scope, savestack=3
       3 <foo> <xxxbar>       |  7:      STAR
                               REG_ANY can match 6 times out of 32767...
      Setting an EVAL scope, savestack=3
       6 <fooxxx> <bar>       |  9:        EXACT <bar>
       9 <fooxxxbar> <>       | 11:        END
    Match successful!
    Freeing REx: `f*o+.*bar'

Author

Copyright © April 2000, Andrew L Johnson. All rights reserved

Other Resources

The definitive guide to perl’s regular expressions can be found in the perlre manpage. Mark Kvale also has two regex tutorials available that are currently available at:

    http://keck.ucsf.edu/~kvale/perlrequick.pod
    http://keck.ucsf.edu/~kvale/perlretut.pod

The first is a more basic primer and the second a more advanced tutorial including the newer regex features.

Regex Chapter

 
 
The free Regex chapter from my book — while the underlying language is Perl, the regex concepts are general enough.
Literate Programming example

 
 
This is a simple, but useful Perl script, written up as an example of Literate Programming. The plain faqgrep script is in the code section.
Perl: Faqgrep

 
 
A Perl script for searching the perlfaq entries for keywords. Written before the perldoc -q term option, but it is still a more convenient tool (IMHO).
Advanced Regex Features

 
 
Tutorial for a few of Perl’s (and other languages) more advanced regex features.