Quick Look: Regular Expressions Cookbook

Regular expressions can be a great problem-solving tool, both when writing code and when editing text. The theory is pretty straightforward -- find (and sometimes manipulate) items based on a pattern you define, such as: Look for 10 digits, possibly with a space, dash or dot as a separator after the third and sixth digits, and maybe with parentheses around the first three digits. That's one way to go through text to check for U.S. phone numbers.

However, translating that sort of pattern into proper regular expression syntax can be somewhat challenging if you don't work with regexps, well, regularly. For example, a regexp for the telephone number pattern above would look something like this:

\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})
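For a quick sanity check, here's a minimal Python sketch that runs that pattern over some text. The helper name find_phone_numbers and the sample numbers are made up for illustration; only the pattern itself comes from above.

```python
import re

# The pattern from the article, unchanged. Its three groups capture
# the area code, the exchange and the last four digits.
PHONE_RE = re.compile(r"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})")


def find_phone_numbers(text):
    """Return every match, normalized to the form 555-123-4567.

    Hypothetical helper, not something from the Cookbook.
    """
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]


# Made-up numbers in three of the formats the pattern allows.
sample = "Call (555) 123-4567, 555.987.6543 or 5551112222."
print(find_phone_numbers(sample))
```

Because the pattern isn't anchored, it will also pick 10 digits out of a longer run of digits, so treat this as a way to find candidates rather than as strict validation.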

A regexp like that isn't the most intuitive or human-readable code. Since I enjoy the first part of using regexps for problem solving -- figuring out the pattern algorithm I need -- much more than the second part (actually writing the code), I was eager to take a look at O'Reilly's newly published Regular Expressions Cookbook. Like other programming books in the Cookbook series, this entry offers a slew of use-as-is code snippets to solve real-world programming problems. There are dozens of "recipes" for tasks like validating an e-mail address, reformatting names, finding repeated words, and extracting the file name from a Windows path. There's also a basic tutorial on regular expressions, although I wouldn't consider this a primary reference for learning regexp theory from scratch.

However, for those with even a beginner's knowledge, the Regular Expressions Cookbook is a valuable reference. In addition to offering code and explanations for tasks, the book offers variations for Perl, PCRE (Perl-Compatible Regular Expressions), .NET, Java, JavaScript, Python and Ruby (both native Ruby 1.8 and Oniguruma regexps in Ruby 1.9).

I learned a great deal from skimming this volume, from basics such as when I don't need to use an escape character (but thought I did) to more advanced (for me) concepts such as lookaround (i.e. checking "whether certain text can be matched without actually matching it").
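To make the lookaround idea concrete, here's a small Python sketch. The sample sentence and the variable names are invented for illustration. It shows that a lookahead checks the text that follows but leaves it out of the match.

```python
import re

# Positive lookahead: match "Google" only when " Docs" comes next.
followed_by_docs = re.compile(r"Google(?= Docs)")

# Negative lookahead: match "Google" only when " Docs" does NOT come next.
not_followed_by_docs = re.compile(r"Google(?! Docs)")

# Made-up sample text.
text = "Google Docs and Google Maps"

first = followed_by_docs.search(text)
second = not_followed_by_docs.search(text)

# In both cases the matched text is just "Google"; the lookahead
# only tests what follows and consumes nothing.
print(first.group(), first.span())
print(second.group(), second.span())
```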

I tested the Cookbook with this problem: Find the first match of the word "Google" in some HTML-coded text where the word is not within a hyperlink.

Finding

<a href="http://www.computerworld.com/s/article/9136345/Google_Update">Google</a>

is a fairly trivial task, but also screening out something like

<a href="http://www.computerworld.com/s/article/9025218/Google_turns_on_solar_panels_plans_10M_in_grants">the Google solar project</a>

is more complex. After working on the problem for an hour last night unaided, I turned to the Cookbook this morning and had the answer in 5 minutes, simply by modifying an existing lookaround recipe for finding words within XML-Style Comments.

Here's the recipe for finding all occurrences of the word TODO within a comment:

\bTODO\b(?=(?:(?!<!--).)*?-->)

The Cookbook explanation:

\bTODO\b matches TODO as a word and not part of another word (\b before and after signifies a word boundary)

(?= says the regexp that follows should match here

(?: instructs to "group but don't capture"

(?!<!--) asserts that <!-- should not match

. matches any single character

*? says to repeat the preceding group zero or more times, finding a match as soon as possible

--> matches those characters

) ends the (?= lookahead
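The breakdown above can be tried out with a short Python sketch. The function name todo_offsets and the sample strings are hypothetical, and the re.DOTALL flag is my own assumption (so that the dot also matches line breaks inside a multi-line comment), not something the recipe as quoted here specifies.

```python
import re

# The recipe as quoted above. re.DOTALL is an added assumption so that
# "." also matches newlines inside a comment that spans several lines.
TODO_IN_COMMENT = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", re.DOTALL)


def todo_offsets(text):
    """Return the start offset of every TODO that sits inside a comment.

    Hypothetical helper for illustration only.
    """
    return [m.start() for m in TODO_IN_COMMENT.finditer(text)]


# Made-up samples: one TODO inside a comment, one outside.
inside = "<!-- TODO: fix the footer -->"
outside = "TODO: call Bob <!-- reminder -->"

print(todo_offsets(inside))   # the TODO inside the comment is found
print(todo_offsets(outside))  # the TODO before the comment is skipped
```

The second sample is skipped because the lookahead runs into <!-- before it can reach a -->, which is exactly what the inner (?!<!--) check is there to catch.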

With that complex code already at my fingertips, I read earlier in the book that (?!regexp) is a "does not match" lookaround, as opposed to (?=regexp), which is a "does match." And so, here's my modification that finds occurrences of the word Google not within a hyperlink:

\bGoogle\b(?!(?:(?!<a.*?>).)*?a>)
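Here's a minimal Python sketch for trying that expression against the two links shown earlier. The function name first_unlinked_google and the extra test sentences are made up for illustration; the pattern is used exactly as written above.

```python
import re

# The modified pattern from the article, unchanged.
GOOGLE_OUTSIDE_LINK = re.compile(r"\bGoogle\b(?!(?:(?!<a.*?>).)*?a>)")


def first_unlinked_google(html):
    """Return the offset of the first "Google" outside a hyperlink, or None.

    Hypothetical helper for illustration only.
    """
    m = GOOGLE_OUTSIDE_LINK.search(html)
    return m.start() if m else None


# The two links quoted in the article.
linked = ('<a href="http://www.computerworld.com/s/article/9136345/'
          'Google_Update">Google</a>')
linked_phrase = ('<a href="http://www.computerworld.com/s/article/9025218/'
                 'Google_turns_on_solar_panels_plans_10M_in_grants">'
                 'the Google solar project</a>')

# Made-up text that adds a plain, unlinked mention after a link.
mixed = linked + " and then plain Google here."

print(first_unlinked_google(linked))         # no unlinked mention
print(first_unlinked_google(linked_phrase))  # no unlinked mention
print(first_unlinked_google(mixed))          # offset of the plain mention
```

A couple of caveats on this sketch: without a "dot matches newline" option the lookahead only sees the rest of the current line, and the closing a> test is loose enough to be fooled by any other tag that ends in those two characters. It is fine for a quick search, less so for arbitrary HTML.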

That one regular expression alone made this book a worthwhile read, and I expect it will offer many more time-saving coding tips. Regular Expressions Cookbook is a useful addition to a programmer's reference shelf.
