Search

9/24/2010

Performance of Greedy vs. Lazy Regex Quantifiers

Regular Expressions Cookbook

Problem
Match a pair of <p> and </p> XHTML tags and the text between them. The text between
the tags can include other XHTML tags
Solution
<p>.*?</p>
Take a look at one incorrect solution for the problem in this recipe:
<p>.*</p>
After matching the first <p> tag in the subject, the engine reaches ‹.*›. The dot matches
any character, including line breaks. The asterisk repeats it zero or more times. The
asterisk is greedy, and so ‹.*› matches everything all the way to the end of the subject
text. Let me say that again: ‹.*› eats up your whole XHTML file, starting with the first
paragraph.
When the ‹.*› has its belly full, the engine attempts to match the ‹<› at the end of the
subject text. That fails. But it’s not the end of the story: the regex engine backtracks.
The asterisk prefers to grab as much text as possible, but it’s also perfectly satisfied to
match nothing at all (zero repetitions). With each repetition of a quantifier beyond the
quantifier’s minimum, the regular expression stores a backtracking position. Those are
positions the engine can go back to, in case the part of the regex following the quantifier
fails.
When ‹<› fails, the engine backtracks by making the ‹.*› give up one character of its
match. Then ‹<› is attempted again, at the last character in the file. If it fails again, the
engine backtracks once more, attempting ‹<› at the second-to-last character in the file.
This process continues until ‹<› succeeds. If ‹<› never succeeds, the ‹.*› eventually runs
out of backtracking positions and the overall match attempt fails.
If ‹<› does match at some point during all that backtracking, ‹/› is attempted. If ‹/› fails,
the engine backtracks again. This repeats until ‹</p>› can be matched entirely.
So what’s the problem? Because the asterisk is greedy, the incorrect regular expression
matches everything from the first <p> in the XHTML file to the last </p>. But to correctly
match an XHTML paragraph, we need to match the first <p> with the first </p> that
follows it.
That’s where lazy quantifiers come in. You can make any quantifier lazy by placing a
question mark after it: ‹*?›, ‹+?›, ‹??›, and ‹{7,42}?› are all lazy quantifiers.

Lazy quantifiers backtrack too, but the other way around. A lazy quantifier repeats as
few times as it has to, stores one backtracking position, and allows the regex to continue.
If the remainder of the regex fails and the engine backtracks, the lazy quantifier
repeats once more. If the regex keeps backtracking, the quantifier will expand until its
maximum number of repetitions, or until the regex token it repeats fails to match.


Performance of Greedy vs. Lazy Regex Quantifiers
Consider the following simple example: When the regexes <.*?> and <[^>]*> are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around the ideas of reducing backtracking and the steps which are potentially required to match or fail at any given character position, and here we've reduced both cases to the absolute minimum.


jQuery Regular Expressions Review (Rev:20100921_1700)
/color|date|datetime|email|hidden|month|number|
password|range|search|tel|text|time|url|week/i
/\b(?:color|date|datetime|email|hidden|month|number|
password|range|search|tel|text|time|url|week)/i
In a manner similar to the previous example, adding (or "exposing") a word boundary anchor to the beginning of this regex improves the efficiency in the case of non-matches, by reducing the number of positions within the string where the regex engine attempts a match. Instead of attempting a match at every location within a target string, the engine now only needs to check on word boundaries.

沒有留言: