NetPDL Regular Expressions
This document briefly presents the regular expressions supported by NetPDL. Most of its content is available elsewhere on the Internet (e.g. Wikipedia, the RegularExpression.info tutorial, which is probably the best source, and the David Mertz tutorial). An excellent manual can be found in the pcrepattern.html file, which is distributed as part of the PCRE documentation.
Basic regular expressions
The simplest regular expression consists of a single literal character, e.g. a, or a string, e.g. is. If the string is Jack is a boy, the first pattern will match the a after the J, while the second pattern will match the word is. The first regex can match the second a too, but it will only do so when you tell the regex engine to continue searching through the string after the first match. How this is done depends on how the regex engine is invoked (for instance, some engines can be asked to look for the nth occurrence instead of the first one).
Regex engines are usually case sensitive by default. cat does not match Cat, unless you tell the regex engine to ignore differences in case.
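The behaviour described above can be tried out with a short sketch. It uses Python's standard re module purely as an illustration (an assumption of this example; NetPDL itself relies on a PCRE-style engine, whose invocation differs):

```python
import re

text = "Jack is a boy"

# re.search returns only the first (leftmost) match.
first_a = re.search(r"a", text)
first_is = re.search(r"is", text)

# re.finditer continues after each match, so it also reaches the second "a".
a_positions = [m.start() for m in re.finditer(r"a", text)]

# Matching is case sensitive unless a flag says otherwise.
strict = re.search(r"cat", "Cat")
relaxed = re.search(r"cat", "Cat", re.IGNORECASE)

print(first_a.start(), first_is.start(), a_positions)
print(strict, relaxed.group())
```

Here re.IGNORECASE plays the role of "telling the regex engine to ignore differences in case"; other engines expose the same option under different names.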
Special characters
There are several special characters, which are most easily presented by splitting them into groups.
Quantifiers
| Character | Meaning | Example pattern | Example matches |
+ | The preceding item (i.e. character, string or character class) can appear 1 or N times. | go+gle | gogle, google, … (but not ggle) |
* | The preceding item (i.e. character, string or character class) can appear 0, 1, or N times. | go*gle | ggle, gogle, google, … |
? | The preceding item (i.e. character, string or character class) can appear 0 or 1 times. | colou?r | color, colour |
{ and } | Curly braces define how many times the preceding item may occur. Within the braces there are two numbers, separated by a comma, giving the minimum and maximum number of occurrences. If the first number is missing, it is assumed to be zero; if the second one is missing, it is assumed to be infinity. If only one number is given (with no comma), exactly that many occurrences are matched. Note that some engines (including older PCRE releases) do not accept a missing first number and treat {,4} as literal text. | ab{2,4} | abb, abbb, abbbb |
ab{,4} | a, ab, abb, abbb, abbbb |
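The quantifier table can be checked with a small sketch. It is written with Python's re module as an illustrative stand-in; the helper name matches is hypothetical and only exists for this example:

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

words = ["ggle", "gogle", "google", "gooogle"]

plus_hits = matches(r"go+gle", words)    # one or more "o"
star_hits = matches(r"go*gle", words)    # zero or more "o"
optional_hits = matches(r"colou?r", ["color", "colour", "colouur"])
bounded_hits = matches(r"ab{2,4}", ["ab", "abb", "abbb", "abbbb", "abbbbb"])
exact_hits = matches(r"ab{3}", ["abb", "abbb", "abbbb"])

for name, hits in [("go+gle", plus_hits), ("go*gle", star_hits),
                   ("colou?r", optional_hits), ("ab{2,4}", bounded_hits),
                   ("ab{3}", exact_hits)]:
    print(name, "->", hits)
```

re.fullmatch is used so that each candidate has to match from its first to its last character, which mirrors the "Example matches" column.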
Operators
| Character | Meaning | Example pattern | Example matches |
| | The items preceding and following the vertical bar are alternatives: either one may match. | gray|grey | gray, grey |
( and ) | Parentheses define grouping, i.e. a new atom in the regular expression. The group behaves like a simpler regular expression embedded within a larger one, and this allows you to apply a regex operator, e.g. a repetition operator, to the entire group. Parentheses can also be used to enable backreferences, which are presented later. | (abc)+ | abc, abcabc, … |
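As a rough illustration of the two operators, the following sketch (Python's re module, used here only as an example engine) contrasts a quantifier applied to a group with the same quantifier applied to a single character:

```python
import re

# Alternation: either side of the bar may match.
gray_or_grey = re.findall(r"gray|grey", "gray cat, grey dog, green frog")

# With parentheses, + repeats the whole group "abc"...
grouped = re.fullmatch(r"(abc)+", "abcabc")
# ...without them, + repeats only the final "c".
ungrouped = re.fullmatch(r"abc+", "abcabc")

# A group also limits the scope of an alternation.
scoped = [w for w in ["gray", "grey", "groy"] if re.fullmatch(r"gr(a|e)y", w)]

print(gray_or_grey, bool(grouped), bool(ungrouped), scoped)
```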
Other characters
| Character | Meaning | Example pattern | Example matches |
. | Wildcard character: it matches any character except newline. | .s | as, is, … |
^ | Position special character: matches if the pattern following the ^ sign starts at the beginning of the line (or the beginning of the string). | ^Mary | ''(newline)''Mary |
$ | Position special character: matches if the pattern preceding the $ sign terminates at the end of the line (or the end of the string). | Mary$ | Mary''(endofline)'' |
[ and ] | It defines the starting / ending point of a character class. More details about character classes will be presented later. | [abc] | a, b, c |
\ | It defines the escape character and it is used to insert a special character (e.g. +, |, $, etc) within the regular expression, or an arbitrary ASCII character (e.g. the ASCII code 0x44). More details about using special characters and non-printable characters will be presented later. | \\, \x44 | \, D (the character with ASCII code 0x44) |
\n ''(n=1-9)'' | It matches the nth subexpression matched. This escape character can be used in case of backreference, which is presented later. |
Using special characters in regex
If you want to use any of the special characters (e.g. +, |, etc) as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
Note that 1+1=2, with the backslash omitted, is a valid regex. So you will not get an error message. But it will not match 1+1=2. It would match 111=2 in 123+111=234, due to the special meaning of the plus character.
If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d will match a single digit from 0 to 9. Please refer to the Character class section for more details.
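The 1+1=2 example can be reproduced with the sketch below. It assumes Python's re module as the engine; the exact wording of the error message for +1 is engine specific, so the code only records that an error occurred:

```python
import re

# Escaped plus: matches the literal text "1+1=2".
literal = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus: "1+" means "one or more 1", so "111=2" is found instead.
unescaped = re.search(r"1+1=2", "123+111=234")
unescaped_on_literal = re.search(r"1+1=2", "1+1=2")

# A quantifier with nothing to repeat is rejected when the pattern is compiled.
try:
    re.compile(r"+1")
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

# re.escape inserts the required backslashes automatically.
escaped = re.escape("1+1=2")

print(literal.group(), unescaped.group(), unescaped_on_literal)
print(compile_error)
print(escaped)
```

A helper such as re.escape is convenient when the text to be matched literally comes from user input rather than being typed into the pattern by hand.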
Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. The most common characters are the following:
| Escape | Meaning |
\t | Tab (ASCII 0x09) |
\r | Carriage return (0x0D) |
\n | Line feed (0x0A) |
\a | Bell (0x07) |
\e | Escape (0x1B) |
\f | Form feed (0x0C) |
\v | Vertical tab (0x0B) |
In addition, you can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9. Another way to search for a tab is to use \x09. Note that the leading zero is required.
If your regular expression engine supports Unicode, use \uFFFF rather than \xFF to insert a Unicode character. The euro currency sign occupies code point 0x20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC. Note that the exact syntax depends on the engine; PCRE, for instance, uses the \x{20AC} form instead of \u20AC.
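A brief sketch of these escapes, assuming Python's re module (which accepts \xHH and \uXXXX inside string patterns; other engines may need a different notation for code points above 0xFF):

```python
import re

# \t and \x09 both denote the tab character.
tab_named = re.search(r"\t", "a\tb")
tab_hex = re.search(r"\x09", "a\tb")

# \xA9 is the copyright sign in Latin-1 (and in Unicode).
copyright_hit = re.search(r"\xA9", "Copyright \u00a9 2024")

# \u20AC is the euro sign, code point U+20AC.
euro_hit = re.search(r"\u20AC", "Price: 10 \u20ac")

print(tab_named.start(), tab_hex.start())
print(copyright_hit.group(), euro_hit.group())
```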
Character class
Rather than naming only a single character, you can include a pattern in a regular expression that matches any of a set of characters. A set of characters can be given as a simple list inside square brackets; for example, [aeiou] will match any single lowercase vowel. For letter or number ranges you may also use only the first and last character of the range, with a dash in the middle; for example, [A-Ma-m] will match any lowercase or uppercase letter in the first half of the alphabet.
Within a character class, normal operators lose their meaning. The only two operators allowed in a character class are the hyphen (-) and circumflex (^). When used between two characters, the hyphen represents a range of characters. The circumflex, when used as the first character, negates the expression. If two patterns match the same string, the longest match wins. In case both matches are the same length, then the first pattern listed is used.
Other special characters allowed in a character class are the [] brackets (for further character classes or POSIX classes), and the \ as the general escape character.
Since many ranges of characters depend on the chosen locale setting (e.g., in some settings letters are organized as abc..yzABC..YZ while in some others as aAbBcC..yYzZ), the POSIX standard defines some classes or categories of characters, as shown in the following table:
| POSIX class | Similar to | Meaning |
[:upper:] | [A-Z] | uppercase letters |
[:lower:] | [a-z] | lowercase letters |
[:alpha:] | [A-Za-z] | upper- and lowercase letters |
[:alnum:] | [A-Za-z0-9] | digits, upper- and lowercase letters |
[:digit:] | [0-9] | digits |
[:xdigit:] | [0-9A-Fa-f] | hexadecimal digits |
[:punct:] | [.,!?:…] | punctuation |
[:blank:] | [ \t] | space and TAB |
[:space:] | [ \t\n\r\f\v] | blank characters |
[:cntrl:] | [\x00-\x1F\x7F] | control characters |
[:graph:] | [^ \t\n\r\f\v] | printed characters |
[:print:] | [^\t\n\r\f\v] | printed characters and space |
In addition, PCRE defines some escape sequences that can be used for character classes (some of them equivalent to POSIX classes):
| PCRE class | Similar to | Meaning |
\d | [0-9] | Any decimal digit |
\D | [^0-9] | Any character that is not a decimal digit |
\s | [ \t\n\r\f\v] | Any whitespace character |
\S | [^ \t\n\r\f\v] | Any character that is not a whitespace character |
\w | [A-Za-z0-9_] | Any “word” character (letters, digits and the underscore) |
\W | [^A-Za-z0-9_] | Any “non-word” character |
For example, [[:upper:]ab] should only match the uppercase letters and lowercase 'a' and 'b'.
It is generally agreed that [:print:] consists of [:graph:] plus the space character. However, in Perl regular expressions [:print:] matches [:graph:] union [:space:].
Page http://billposer.org/Linguistics/Computation/ascii.html includes an ASCII chart color-coded to show the POSIX classes.
Warning: please note that POSIX classes are supported only within a character class. Therefore, [:print:] is not a valid pattern, and [[:print:]] must be used instead.
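The sketch below exercises lists, ranges, negation and the \d and \w shorthands. It assumes Python's re module, which does not implement the POSIX bracket classes; the pattern [A-Zab] is therefore only an ASCII approximation of [[:upper:]ab], introduced for this example:

```python
import re

vowels = re.findall(r"[aeiou]", "regular expression")
first_half = re.findall(r"[A-Ma-m]", "NetPDL")
not_ab = re.findall(r"[^ab]", "abcab")

# Shorthand classes.
digits = re.findall(r"\d+", "port 80 and port 8080")
words = re.findall(r"\w+", "max_len = 1500")

# ASCII approximation of [[:upper:]ab] (assumption: no locale-specific letters).
upper_or_ab = re.findall(r"[A-Zab]", "A cab in NYC")

print(vowels, first_half, not_ab)
print(digits, words, upper_or_ab)
```

Note that \w+ keeps max_len together as one match, because the underscore counts as a word character.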
Pattern matching examples
| Expression | Matches |
abc | abc |
abc* | ab, abc, abcc, abccc, … |
abc+ | abc, abcc, abccc, … |
a(bc)+ | abc, abcbc, abcbcbc, … |
a(bc)? | a, abc |
[abc] | a, b, c |
[a-z] | Any letter, a through z |
[a\-z] | a, -, z |
[-az] | -, a, z |
[A-Za-z0-9]+ | One or more alphanumeric characters |
[ \t\n]+ | Whitespace |
[^ab] | anything except: a, b |
[a^b] | a, ^, b |
[a|b] | a, |, b |
a|b | a or b |
[[:print:]] | Any single printable character |
[01[:alpha:]%] | 0, 1, % or any alphabetic character |
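Several rows of this table hinge on where the hyphen, circumflex and vertical bar sit inside the brackets. A minimal sketch, again assuming Python's re module and a made-up sample string, makes the difference visible:

```python
import re

sample = "a-z^|b m"

escaped_hyphen = re.findall(r"[a\-z]", sample)  # hyphen escaped: a literal
leading_hyphen = re.findall(r"[-az]", sample)   # hyphen first: a literal
range_az = re.findall(r"[a-z]", sample)         # hyphen between: a range
caret_inside = re.findall(r"[a^b]", sample)     # caret not first: a literal
bar_inside = re.findall(r"[a|b]", sample)       # bar inside a class: a literal

print(escaped_hyphen, leading_hyphen, range_az)
print(caret_inside, bar_inside)
```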
Advanced Regular Expressions
Backreference
Backreference is enabled by using round parentheses. A backreference stores the part of the string matched by the part of the regular expression inside the parentheses. Remembering part of the regex match in a backreference slows down the regex engine because it has more work to do. If you do not use the backreference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.
The regex Set(Valu.)? matches Set or SetValue. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain Value, since the regex contained in the parentheses was able to match a string.
For instance, an HTTP header can contain the following line:
host: www.foo.com
If the user wants to print only the host name, they can define the following regular expression:
: ([[:print:]]*)
However, the regex engine must be configured to print only the first backreference, because the complete match will be : www.foo.com, while the backreference (i.e. the string matched by the parentheses alone) will be www.foo.com.
If you do not use the backreference, you can optimize the regular expression Set(Valu.)? into Set(?:Valu.)?. The question mark and the colon after the opening parenthesis are the special syntax that tells the regex engine that this pair of parentheses should not create a backreference. In particular, the ? indicates that some special processing is required for the group, and the colon indicates that the change we want is to turn off capturing. Other types of special processing are described in the Assertions section.
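The difference between the complete match and the first backreference can be seen in the following sketch. It assumes Python's re module; because that module has no [[:print:]], the range [ -~] (space through tilde) is used as an ASCII stand-in:

```python
import re

header = "host: www.foo.com"

# group(0) is the complete match, group(1) the first backreference.
match = re.search(r": ([ -~]*)", header)
whole_match = match.group(0)
host_name = match.group(1)

# Optional group: it takes part in one match but not in the other.
with_value = re.fullmatch(r"Set(Valu.)?", "SetValue")
without_value = re.fullmatch(r"Set(Valu.)?", "Set")

# Non-capturing version: same matches, no backreference is stored.
non_capturing = re.compile(r"Set(?:Valu.)?")

print(repr(whole_match), repr(host_name))
print(with_value.group(1), without_value.group(1), non_capturing.groups)
```

In this particular engine a group that did not take part in the match is reported as None rather than as an empty string; other engines may report it differently.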
Using backreferences inside the regular expression
Backreferences can not only be used after a match has been found, but also during the match.
For example, pattern
(sens|respons)e and \1ibility
matches “'''sens'''e and '''sens'''ibility” and “'''respons'''e and '''respons'''ibility”, but not “'''sens'''e and responsibility”.
A more complicated example is the following. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag. Here's how: <([A-Z][A-Z0-9]*)[^>]*>.*?</\1>. This regex contains only one pair of parentheses, which captures the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
You can reuse the same backreference more than once. ([a-c])x\1x\1 will match axaxa, bxbxb and cxcxc. If a backreference was not used in a particular match attempt (such as in the first example where the question mark made the first backreference optional), it is simply empty. Using an empty backreference in the regex is perfectly fine. It will simply be replaced with nothingness.
A backreference cannot be used inside itself. ([abc]\1) will not work. Depending on your regex flavor, it will either give an error message, or it will fail to match anything without an error message. Therefore, \0 cannot be used inside a regex, only in the replacement.
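The three patterns discussed in this section can be tried with the sketch below (Python's re module is assumed; the HTML fragment is an invented test input):

```python
import re

# \1 must repeat exactly what the first group captured.
pair = re.compile(r"(sens|respons)e and \1ibility")
same = pair.search("sense and sensibility")
mixed = pair.search("sense and responsibility")

# The closing tag has to carry the same name as the opening tag.
tag = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")
html = "<B>bold</B> and <I>italic</I>"
tags = [m.group(0) for m in tag.finditer(html)]
mismatched = tag.search("<B>bold</I>")

# The same backreference may be reused several times.
repeated = [s for s in ["axaxa", "bxbxb", "axbxa"]
            if re.fullmatch(r"([a-c])x\1x\1", s)]

print(bool(same), bool(mixed))
print(tags, mismatched, repeated)
```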
Greedy expressions
Quantifiers in regular expressions are greedy. That is, they match as much as they possibly can. Probably the easiest mistake to make in composing regular expressions is to match too much. When you use a quantifier, you want it to match everything (of the right sort) up to the point where you want to finish your match. But when using the “*”, “+”, or numeric quantifiers, it is easy to forget that the last bit you are looking for might occur later in a line than the one you are interested in.
If you find that your regular expressions are matching too much, a useful procedure is to reformulate the problem in your mind. Rather than thinking “what am I trying to match later in the expression?” ask yourself “what do I need to avoid matching in the next part?”. Often this leads to more parsimonious pattern matches. Often the way to avoid a pattern is to use the complement operator and a character class. The following example shows this approach:
| Pattern | Description | Test String | Matched String |
th.*s | Matches the longest string that starts with 'th' and ends with 's'. | Match the words that starts with 'th' and ends with 's'. | the words that starts with 'th' and ends with 's |
th[^s]*. | Matches the first string that starts with 'th' and ends with 's'. Please note that the last '.' character is used to match the final 's'. | Match the words that starts with 'th' and ends with 's'. | the words |
A second solution is to put a question mark after the quantifier, which transforms the quantifier into a non-greedy one:
| Pattern | Description | Test String | Matched String |
th.*s | Matches the longest string that starts with 'th' and ends with 's'. | Match the words that starts with 'th' and ends with 's'. | the words that starts with 'th' and ends with 's |
th.*?s | Matches the first string that starts with 'th' and ends with 's'. | Match the words that starts with 'th' and ends with 's'. | the words |
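The three patterns from the tables above can be compared on the same test string with a few lines of code (a sketch assuming Python's re module):

```python
import re

text = "Match the words that starts with 'th' and ends with 's'."

greedy = re.search(r"th.*s", text).group()      # runs to the last "s"
negated = re.search(r"th[^s]*.", text).group()  # stops at the first "s"
lazy = re.search(r"th.*?s", text).group()       # non-greedy quantifier

print(repr(greedy))
print(repr(negated))
print(repr(lazy))
```

On this input the negated character class and the non-greedy quantifier give the same result; they differ in how the engine gets there, since the character class never needs to backtrack over an "s".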
Assertions
An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. Some simple assertions are the following:
| Assertion | Description |
\b | Matches at a word boundary |
\B | Matches when not at a word boundary |
\A | Matches at start of subject |
\Z | Matches at end of subject or before newline at end |
\z | Matches at end of subject |
\G | Matches at first matching position in subject |
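A few of these simple assertions are illustrated below. The sketch assumes Python's re module, which supports \b, \B and \A but does not provide \G, and whose end-of-subject escapes differ from PCRE's; only the portable ones are shown:

```python
import re

text = "cat concat cathode"

# \b on both sides: only the standalone word "cat".
whole_words = re.findall(r"\bcat\b", text)

# \B before "cat": only occurrences that start inside a word.
inside = [m.start() for m in re.finditer(r"\Bcat", text)]

# \A anchors the pattern to the very start of the subject.
at_start = re.search(r"\Acat", text)
not_at_start = re.search(r"\Aconcat", text)

print(whole_words, inside, bool(at_start), bool(not_at_start))
```

None of these assertions adds characters to the match: \bcat\b still returns just cat.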
Another trick of advanced regular expression tools is lookahead / lookbehind assertions. These are similar to regular grouped subexpressions, except they do not actually grab what they match.
There are two advantages to using lookahead assertions. On the one hand, a lookahead assertion can function in a similar way to a group that is not backreferenced; that is, you can match something without counting it in backreferences. More significantly, however, a lookahead assertion can specify that the next chunk of a pattern has a certain form, but let a different subexpression actually grab it (usually for purposes of backreferencing that other subexpression).
There are two kinds of assertions: positive and negative. As you would expect, a positive assertion specifies that something does come next, and a negative one specifies that something does not come next.
Lookahead assertions
Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example,
\w+(?=;)
matches a word followed by a semicolon, but does not include the semicolon in the match, and
foo(?!bar)
matches any occurrence of “foo” that is not followed by “bar”. Note that the apparently similar pattern
(?!foo)bar
does not find an occurrence of “bar” that is preceded by something other than “foo”; it finds any occurrence of “bar” whatsoever, because the assertion (?!foo) is always true when the next three characters are “bar”. A lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the most convenient way to do it is with (?!) because an empty string always matches, so an assertion that requires there not to be an empty string must always fail.
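The lookahead patterns above behave as follows on a few invented test strings (a sketch assuming Python's re module):

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
word_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: "foo" not followed by "bar".
foo_not_bar = [m.start()
               for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# Misplaced lookahead: every "bar" matches, even the one after "foo".
misplaced = re.findall(r"(?!foo)bar", "foobar bar")

# An empty negative lookahead can never succeed.
always_fails = re.search(r"(?!)", "anything")

print(word_before_semicolon, foo_not_bar, misplaced, always_fails)
```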
Lookbehind assertions
Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example,
(?<!foo)bar
does find an occurrence of “bar” that is not preceded by “foo”. The contents of a lookbehind assertion are usually restricted such that all the strings it matches must have a fixed length; some exceptions (and some more tricks) are available but are not covered in this document.
The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed width and then try to match. If there are insufficient characters before the current position, the match is deemed to fail.
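A short sketch of both kinds of lookbehind, assuming Python's re module. That module enforces the fixed-length restriction strictly, and [ -~] is again used as an ASCII stand-in for [[:print:]]:

```python
import re

# Negative lookbehind: "bar" not preceded by "foo".
bar_hits = [m.start()
            for m in re.finditer(r"(?<!foo)bar", "foobar bar crowbar")]

# Positive lookbehind: the ": " prefix is required but not returned.
host = re.search(r"(?<=: )[ -~]*", "Host: www.foo.com\n").group()

# A variable-length lookbehind is rejected by this engine.
try:
    re.compile(r"(?<=a+)b")
    variable_width_error = None
except re.error as exc:
    variable_width_error = str(exc)

print(bar_hits, repr(host))
print(variable_width_error)
```

Compared with a capturing group, the lookbehind version returns the host name as the complete match, so no backreference has to be selected afterwards.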
Atomic groups can be used in conjunction with lookbehind assertions to specify efficient matching at the end of the subject string. Consider a simple pattern such as
abcd$
when applied to a long string that does not match. Because matching proceeds from left to right, PCRE will look for each “a” in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as
^.*abcd$
the initial .* matches the entire string at first, but when this fails (because there is no following “a”), it backtracks to match all but the last character, then all but the last two characters, and so on. Once again the search for “a” covers the entire string, from right to left, so we are no better off. However, if the pattern is written as
^(?>.*)(?<=abcd)
or, equivalently, using the possessive quantifier syntax,
^.*+(?<=abcd)
there can be no backtracking for the .* item; it can match only the entire string. The subsequent lookbehind assertion does a single test on the last four characters. If it fails, the match fails immediately. For long strings, this approach makes a significant difference to the processing time.
Pattern matching examples
| Expression | Test String | Matched String |
: ([[:print:]]*) | Host: www.foo.com\n | : www.foo.com (in case the backreference is used: www.foo.com) |
(?<=: )[[:print:]]* | Host: www.foo.com\n | www.foo.com |