Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification

Last week, I described the standard quantifiers as greedy. This week,

we will look at non-greedy forms of quantifiers, but first let's

discuss just what it means to be greedy.

my $string = 'bcdabdcbabcd';

$string =~ m/^(.*)ab/;

print "$1\n"; # prints: bcdabdcb

The * is greedy; therefore, the .* portion of the regex will match as

much as it can and still allow the remainder of the regex to match. In

this case, it will match everything up to the last 'ab'. Actually,

the .* will match right to the end of the string, and then start

backing up until it can match an 'ab' (this is called backtracking).

To make the quantifier non-greedy you simply follow it with a '?'

symbol:

my $string = 'bcdabdcbabcd';

$string =~ m/^(.*?)ab/;

print "$1\n"; # prints: bcd

In this case the .*? portion attempts to match the least amount of data

while allowing the remainder of the regex to match. Here the regex

engine will match the beginning of the string, then it will try to

match zero of anything and check to see if the rest can match (that

fails). Next, it will match the 'b' and then check again if the 'ab'

can match (still fails). This continues until the the .*? has matched

the first 3 characters and then the following 'ab' is matched.

You can make any of the standard quantifiers that aren't exact non-

greedy by appending a '?' symbol to them: *?, +?, ??, {n,m}?, and {n,}?.

One thing to watch out for: given a pattern such as /^(.*?)%(.*?)/ one

could match and extract the first two fields of a like of % separated

data:

#!/usr/bin/perl -w

use strict;

$_ = 'Johnson%Andrew%AX321%37';

m/^(.*?)%(.*?)%/;

print "$2 $1\n";

And one can easily begin to think of each subexpression as

meaning 'match up to the next % symbol', but that isn't exactly what it

means. Let's say that the third field represents an ID tag and we want

to extract only those names of people with ID tags starting with 'A'.

We might be tempted to do this:

#!/usr/bin/perl -w

use strict;

while (

) { print "$2 $1\n" if m/^(.*?)%(.*?)%A/; } __DATA__ Johnson%Andrew%AX321%Manitoba Smith%John%BC142%Alberta This would print out: Andrew Johnson John%BC142 Smith But that isn't what we wanted at all -- what happened? Well, the second half of the regex does not say match up to the next % symbol and then match an 'A', it says, match up to the next % symbol that is followed by an 'A'. The pattern '(.*?)' part is not prevented from matching and proceeding past a % character if that is what is necessary to cause the whole regex to succeed. What we really wanted in this case was a negated character class: #!/usr/bin/perl -w use strict; while () { print "$2 $1\n" if m/^([^%]*)%([^%]*)%A/; } __DATA__ Johnson%Andrew%AX321%Manitoba Smith%John%BC142%Alberta Now we are saying exactly what we want: the first subexpression grabs zero or more of anything except a % character, then we match a % character, then the second subexpression also grabs zero or more of anything but a % character, and finally we match '%A' or we fail. To summarize, a greedy quantifier takes as much as it can get, and a non-greedy quantifier takes as little as possible (in both cases only while still allowing the entire regex to succeed). Take care in how you use non-greedy quantifiers -- it is easy to get fooled into using one where a negated character class is more appropriate. Next Week: Capturing, Grouping, and Backreferences

This story, "Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification" was originally published by ITworld.

Copyright © 2001 IDG Communications, Inc.

It’s time to break the ChatGPT habit
Shop Tech Products at Amazon