Tuesday, February 28. 2006
Introducing BunnyRegex: easy regular expressions, and mini-languages inside of PHP.
Regular expressions are hard. They are hard to create, and even harder to read after the fact. Regular expressions, while quite powerful, are a blight upon readable code. There is no easy way to know that '/^\d{4}\/\d{2}\/\d{2}$/' is a search for a string in the form of 'xxxx/xx/xx'. You have to know that ^ is the start of a line, \d is a digit, \/ is / escaped, etc.
Granted, once you know regular expressions, that knowledge is portable across whichever language you use, be it PHP, Perl, JavaScript, whatever. But getting to that point is not easy, and even after you are there, the fact remains: regular expressions are not human parseable. This is where BunnyRegex comes in.
BunnyRegex is a BSD Licensed class wrapper around the preg_* functions of PHP. The whole point of its existence is to make working with regular expressions less like the memorization and usage of ancient Norse runes (Elder Futhark) and more like, you know, actual programming. BunnyRegex is also a great example of Fluent Programming and the implementation of mini-languages (more on this in a new entry). BunnyRegex is available on my personal wiki.
So, here is some code illustrating how BunnyRegex works. We are going to do the same regular expression for the string 'xxxx/xx/xx'.
include("BunnyRegex.php");
$pattern = new BunnyRegex();
$pattern->bol() // ^
->digit() // \d
->exactly(4) // {4}
->string('/') // /
->digit() // \d
->exactly(2) // {2}
->string('/') // /
->digit() // \d
->exactly(2) // {2}
->eol(); // $
$pattern->match('2006/02/28'); // returns true
$pattern->numberOfMatches('2006/02/28'); // returns 1
There are a few things to notice here. First off, you will notice that the way this code is laid out is almost completely backwards from the way PCRE is normally written inside of PHP; usually the readable part of the code is the comments, and the unreadable part is the code itself. Second, you will notice that there is a chain of method calls, starting from the bol() (i.e. beginning of line) method and finishing off at the eol() method. And finally, you will notice that the regular expression can be re-used, as shown by the match() and numberOfMatches() method calls. So what is going on here?
Fluent Regular Expressions
Every method of BunnyRegex that helps build your regular expression returns an object reference to the current object. This means that you can chain together methods like this:
$pattern->startCapture()->startGroup()->upper()->lower()->oneOrMore()->endGroup()->moreThan(1)->endCapture();
which is equivalent to this:
$pattern->startCapture(); $pattern->startGroup(); $pattern->upper(); $pattern->lower(); $pattern->oneOrMore(); $pattern->endGroup(); $pattern->moreThan(1); $pattern->endCapture();
This method of programming is called "Fluent Programming" and depending on the situation, can make for much more readable code. When fluent programming is applied to BunnyRegex, it allows you to format your regular expression code in the same way that you format the rest of your code, with proper indentation of blocks. For example:
$pattern
->startCapture()
->startGroup()
->upper()
->lower()->oneOrMore()
->endGroup()
->moreThan(1)
->endCapture();
You can pretty well see what is going on here just by looking, as opposed to the string '((?:[A-Z][a-z]+){2,})'. What is lost, of course, is brevity; those of you who enjoy being able to pound out something as powerful as "return every CamelCase word to me as an array" in just one line are not going to be impressed. Neither are those of you who feel somehow special by knowing and understanding cryptic regular expressions. However, if you work with regular expressions seldom enough not to remember every minute character/meta-character combination, but frequently enough that it is a pain in the ass, then BunnyRegex is for you. The method names may even be slightly less memorable than the usual regex sigils, but they are about 100 times more readable.
I have found so far that the best way to format the regular expression is to give each component its own line, with the exception of the repetition operators (noneOrOne(), noneOrMore(), oneOrMore(), moreThan(), between(), etc.), which should be grouped with the entity they are repeating. The nonGreedy operator should follow the repetition operator.
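The chaining described above depends on one small trick: each builder method appends its fragment and then returns $this. Here is a minimal sketch of that idea. TinyRegexBuilder and its internals are hypothetical, written only for illustration; this is not BunnyRegex's actual source, though the method names mirror the ones used in this post.

```php
<?php
// Hypothetical TinyRegexBuilder: a sketch of the fluent style,
// NOT the real BunnyRegex implementation.
class TinyRegexBuilder
{
    private $parts = array();

    // Each builder method appends a fragment and returns $this,
    // which is what makes the chained calls possible.
    public function bol()       { $this->parts[] = '^';     return $this; }
    public function eol()       { $this->parts[] = '$';     return $this; }
    public function digit()     { $this->parts[] = '\d';    return $this; }
    public function upper()     { $this->parts[] = '[A-Z]'; return $this; }
    public function lower()     { $this->parts[] = '[a-z]'; return $this; }
    public function oneOrMore() { $this->parts[] = '+';     return $this; }

    public function exactly($n)
    {
        $this->parts[] = '{' . (int) $n . '}';
        return $this;
    }

    public function string($literal)
    {
        // preg_quote escapes regex metacharacters and the '/' delimiter.
        $this->parts[] = preg_quote($literal, '/');
        return $this;
    }

    // Returns the finished pattern wrapped in '/' delimiters.
    public function get()
    {
        return '/' . implode('', $this->parts) . '/';
    }
}

$builder = new TinyRegexBuilder();
$regex = $builder
    ->bol()
    ->digit()->exactly(4)
    ->string('/')
    ->digit()->exactly(2)
    ->string('/')
    ->digit()->exactly(2)
    ->eol()
    ->get();

echo $regex, "\n";                                // /^\d{4}\/\d{2}\/\d{2}$/
var_dump(preg_match($regex, '2006/02/28') === 1); // bool(true)
```

Because every method hands back the same object, the one-line chain and the indented multi-line layout are the same program; only the whitespace differs.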
Update:
I hit post just a little too soon on this one. Oops. Here is the rest.
As you can see, this is a lot like a mini-language that is implemented right inside of PHP, rather than just a simple class. This is where fluent programming really shines, because it allows you to do two things: 1) think about your problem in terms of the implementation of a mini-language and 2) wrap the solution of your problem in a reusable, exportable software component. This idea of implementing a language to solve your problem is not a new one; the Scheme guys have been doing it for years. But it is really quite an effective way to think about and solve your problem. If the language you implement is sufficiently general, you can turn around and use it to solve other similar problems.
Matching, Grepping, Replacing and Splitting with BunnyRegex
Once you have built your regular expression, you probably want to do something with it. About the easiest thing that you can do is to simply get the raw regular expression with get(). You can then use the normal preg_* functions on that string. However, BunnyRegex also gives you access to the same functions from within (with slightly different names and semantics). Use whichever method you feel is comfortable. I am not going to document these, because their behavior so closely mirrors the preg_* functions. Check the PHP Manual for more info.
- match($subject) is similar to preg_match, except it returns a boolean indicating whether the match succeeded.
- numberOfMatches($subject) is exactly the same as preg_match_all.
- captures($subject) will return the array of captures that preg_match would return in its 3rd argument.
- allCaptures($subject) is similar to the previous method, except it uses preg_match_all.
- grep($subjectArray) is preg_grep. It takes an array, and then returns an array of elements that match.
All of these methods take a (string) subject to match against as an argument, with the exception of grep, which takes an array.
- replace($replacement, $subject, [$limit]) is preg_replace. No frills, no lace.
- replaceWithCallback($callback, $subject, [$limit]) is preg_replace_callback. No frills, no lace.
The final method is split($subject, [$allowempty], [$limit]) which, like most of the other methods, is a simple wrapper around the similarly named preg_split function.
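For anyone who would rather work with the raw string, here is a rough sketch of the plain preg_* calls that the methods above correspond to. The pattern is written by hand; I am assuming it is the sort of thing get() would return (delimiters included), so treat that part as a guess rather than documented behavior. The comments name the BunnyRegex method each call lines up with.

```php
<?php
// Assumed to resemble what get() returns for a date pattern; hand-written here.
$regex   = '/\d{4}\/\d{2}\/\d{2}/';
$subject = 'Posted 2006/02/28, updated 2006/03/05';

// match(): did the pattern match at all?
$matched = preg_match($regex, $subject) === 1;

// numberOfMatches(): how many times the pattern matches
$count = preg_match_all($regex, $subject, $all);

// captures(): the array preg_match fills in as its 3rd argument
preg_match($regex, $subject, $captures);

// grep(): keep only the array elements that match
$hits = preg_grep($regex, array('2006/02/28', 'not a date'));

// replace() and split()
$replaced = preg_replace($regex, 'DATE', $subject);
$pieces   = preg_split($regex, $subject);

var_dump($matched);      // bool(true)
echo $count, "\n";       // 2
echo $captures[0], "\n"; // 2006/02/28
echo $replaced, "\n";    // Posted DATE, updated DATE
print_r($hits);
print_r($pieces);
```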
As I mentioned earlier, you can run as many of the match methods on your regular expression as you like. You can even build a regular expression, match it, and then continue building and matching it:
$r = new BunnyRegex();
$r->bol()
->whitespace()->noneOrMore()
->string('*@');
if ($r->match($myString))
{
$r->string('param');
if ($r->match($myString))
{
....
}
}
I am not sure just how useful this is to people, but it shows just how flexible BunnyRegex is.
At any rate, I hope that this class will be useful to the PHP community, especially those who are tired of programming in Elder Futhark.
You might consider adding two optional arguments to each construct to replace the chained repetition. For one thing, this prevents the broken output of "\d{4}{4}", and it also stays consistent and short.
Instead of
$pattern->bol()
->digit()->moreThan(3) // future proofing
->string('/')
->digit()->exactly(2)
->string('/')
->digit()->exactly(2)
->eol();
maybe something like
$pattern->bol()
->digit(4, '+')
->string('/')
->digit(2)
->string('/')
->digit(2)
->eol();
I would suggest keyword args, but I don't think PHP has them. Another syntactic issue is the grouping. What if someone starts the group, but forgets to end it?
$pattern->startGroup()
->upper()
->lower(1, '+');
I'm not quite sure how to fix this in PHP. I recently fought with a similar thing while writing a query language, but the solution was Lisp (in which it's easy to do) and not allowing access to the grouping from the wrappers in other languages. That probably won't fly in a regex syntax. So, you may very well have it as good as it gets.
It was certainly on my mind. In fact, if you do a checkout of the SVN, you can find a class that is half written called RabbitRegex. RabbitRegex isn't a subclass of BunnyRegex, but a class that sits on top of BunnyRegex. It actually is a fine example of AOP in PHP. (Something that I should really expand upon in a new entry, but I digress). At any rate, it is (or rather, will be) a fairly simplistic parser, keeping track of its state inside of a stack and a 'register' containing the previous sigil.
I abandoned it for now because I wasn't 100% positive of its usefulness. I figure BunnyRegex is about 80% complete. Adding syntax checking seems like a 20% gain for another 80% of effort.
But it sure is interesting to think about! :)
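For the curious, the stack-and-register idea might look roughly like the sketch below. The names (CheckedBuilder and its methods) are made up for illustration and this is not the half-written RabbitRegex from SVN; it only shows how a stack of open groups plus a record of the previous element could catch an unclosed group or a doubled quantifier such as "\d{4}{4}".

```php
<?php
// Hypothetical checker, loosely following the stack-plus-register idea.
// Not RabbitRegex; class and method names are invented for this sketch.
class CheckedBuilder
{
    private $pattern = '';
    private $openGroups = array(); // stack of groups that are still open
    private $previous = null;      // "register": kind of the last element added

    public function startGroup()
    {
        $this->pattern .= '(?:';
        $this->openGroups[] = 'group';
        $this->previous = 'open';
        return $this;
    }

    public function endGroup()
    {
        // array_pop() returns null when the stack is empty.
        if (array_pop($this->openGroups) === null) {
            throw new LogicException('endGroup() without a matching startGroup()');
        }
        $this->pattern .= ')';
        $this->previous = 'atom';
        return $this;
    }

    public function digit()
    {
        $this->pattern .= '\d';
        $this->previous = 'atom';
        return $this;
    }

    public function exactly($n)
    {
        // A quantifier is only valid directly after something it can repeat.
        if ($this->previous !== 'atom') {
            throw new LogicException('exactly() must follow something repeatable');
        }
        $this->pattern .= '{' . (int) $n . '}';
        $this->previous = 'quantifier';
        return $this;
    }

    public function get()
    {
        if (count($this->openGroups) > 0) {
            throw new LogicException('Unclosed group in pattern');
        }
        return '/' . $this->pattern . '/';
    }
}

$ok = new CheckedBuilder();
echo $ok->startGroup()->digit()->exactly(4)->endGroup()->get(), "\n"; // /(?:\d{4})/

try {
    $bad = new CheckedBuilder();
    $bad->digit()->exactly(4)->exactly(4); // the "\d{4}{4}" case
} catch (LogicException $e) {
    echo $e->getMessage(), "\n"; // exactly() must follow something repeatable
}
```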
I imagine that CL-PPCRE's structure will make it easy to just add another syntactic transform that takes different strings and translates them to the same s-expr. It's a really clean division of syntax and behavior. I don't know how PCRE or any non-PCRE libraries are implemented, but I'm guessing there will be a bit more work involved in making them do Perl6 regex.
dateRE->pattern = "__/__/__"
dateRE->length = 8
dateRE->chars = "00/00/00"
emailRE->pattern = "__@__.__"
emailRE->chars = "AA@AA.AA"
This would only work for very simple expressions, but I think this is far more concise and readable than the (evidently more powerful) class.
Does dateRE match "0000////" or "000/0/00"? Each subject has 8 characters in it, but the slashes are in different spots.
What about when you need to use the _ in part of your expression? Do you need to escape it? Is it possible?
The second example wouldn't be too hard to build in BunnyRegex (or straight up PCRE for that matter).
Great implementation though. It's very readable when you put it all together!
Great idea, and though I admire your idea enormously - I just have a hard time imagining I can teach noobs on forums about what an object is rather than explain what a regex is...
Nevertheless, I think it's cool and can think of some places where I can compose and aggregate this class to fit user needs.
I am really interested in seeing your thoughts on AOP in PHP.
Please keep sharing like this...
My rantings on AOP in PHP are finished and are available:
http://blog.jonnay.net/archives/637-Aspect-Oriented-Programming-in-PHP-as-a-contrast-to-other-languages..html
You're going to have to cut and paste. Stupid spammers are making blog comment sections and trackbacks less useful every day. >:P
What version of PHP is required for use?
When I coded BunnyRegex, I just coded it with no thought towards those who use PHP4. I actually don't feel bad about this, there are really so many good reasons to upgrade to PHP5. At any rate, I did decide that making a PHP4 version wouldn't be to diff
Tracked: Mar 05, 01:04
I've long been convinced that coincidence is never an accident. On a day to day basis, the questions that come through ##php will gather in bunches to the point where I can often just up-key to find the last time I answered the question, and state it a
Tracked: Apr 10, 13:45