The patterns in the input (see Rules Section 4.2) are written using an extended set of regular expressions. These are:
x    .    [xyz]    [abj-oZ]    [^A-Z]    [^A-Z\n]    "[xyz]\"foo"    \X    \0    \123    \x2a
(r)    {name}
r*    r+    r?
rs
r{m, n}    1 <= m <= n: match `r' at least m, but at most n times; called an interval expression.
r{m,}    1 <= m: match `r' m or more times.
r{m}    1 <= m: match `r' exactly m times.
r|s
r/s    ^r    r$
<s>r    <s1,s2,s3>r
<<EOF>>    <s1,s2><<EOF>>
Note that inside a character class, all regular expression operators lose
their special meaning except escape (\) and the character class operators
-, ], and, at the beginning of the class, ^.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence. For example,
foo|bar*
is the same as
(foo)|(ba(r*))
since the * operator has higher precedence than concatenation, and
concatenation higher than alternation (|). This pattern therefore matches
either the string `foo' or the string `ba' followed by zero-or-more `r's. To
match `foo' or zero-or-more repetitions of the string `bar', use:
foo|(bar)*
And to match a sequence of zero or more repetitions of `foo' and `bar':
(foo|bar)*
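The grouping described above can be checked with a small C++ sketch. It does not use flexc++: std::regex with its default ECMAScript grammar serves only as a stand-in, on the assumption (true for these three operators) that it ranks *, concatenation and | in the same order. The helper name fullMatch is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true when `text` is matched in full by `pattern`.
// std::regex is a stand-in here, not the flexc++ matcher; it merely
// ranks '*', concatenation and '|' in the order discussed above.
bool fullMatch(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, fullMatch("foo|bar*", "barrr") holds while fullMatch("foo|bar*", "barbar") does not; fullMatch("foo|(bar)*", "barbar") and fullMatch("(foo|bar)*", "foobarfoo") both hold.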
Note that concatenation has a higher precedence than the interval expression. This is different from many other regular expression engines. It conforms, however, to the lex standard.
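Taken literally, this rule suggests that a pattern such as ab{2} is grouped as (ab){2}, i.e., two copies of `ab', rather than an `a' followed by two `b's. The sketch below only contrasts the two groupings. It is not flexc++ code: std::regex binds the interval more tightly than concatenation, so the grouping described here has to be spelled out with parentheses. The helper name wholeMatch is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true when `text` is matched in full by `pattern` (std::regex,
// ECMAScript grammar). In this engine the interval binds tighter than
// concatenation, which is the opposite of the rule described above.
bool wholeMatch(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

Here wholeMatch("(ab){2}", "abab") holds, which is the reading the rule above assigns to ab{2}; wholeMatch("ab{2}", "abb") shows the reading common to many other engines.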
Also note that name expansion has about the same precedence as grouping (using parentheses to influence the precedence of the other operators in the regular expression). Because flexc++ treats an expanded name as a group, the lookahead operator may not be used in a name definition (a named pattern, defined in the definition section): only one lookahead operator is allowed in a regular expression. Flex did allow the lookahead operator and the `^' operator (the begin anchor) in a name definition, so pay attention to the difference.
In addition to characters and ranges of characters, character classes can also
contain character class expressions. These are expressions enclosed inside
[: and :] delimiters (which themselves must appear between the [
and ] of the character class; other elements may occur inside the
character class, too). The valid expressions are:
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
These expressions all designate a set of characters equivalent to the
corresponding standard C isXXX function. For example, [:alnum:] designates
those characters for which isalnum() returns true - i.e., any alphabetic
or numeric character.
Consequently, the following character classes are all equivalent:
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:][0-9]]
[a-zA-Z0-9]
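The equivalence can be verified against the <cctype> functions themselves. The following is a minimal sketch, assuming the default "C" locale and an ASCII-compatible character set; the function names are hypothetical.

```cpp
#include <cctype>

// Membership test for [a-zA-Z0-9], written out explicitly (assumes ASCII).
bool inExplicitRange(int value)
{
    return (value >= 'a' && value <= 'z')
        || (value >= 'A' && value <= 'Z')
        || (value >= '0' && value <= '9');
}

// Counts the byte values on which the four spellings disagree.
int countDisagreements()
{
    int disagreements = 0;
    for (int value = 0; value != 256; ++value)
    {
        bool alnum = std::isalnum(value) != 0;                    // [[:alnum:]]
        bool alphaOrDigit = std::isalpha(value) != 0
                         || std::isdigit(value) != 0;             // [[:alpha:][:digit:]]
        bool alphaOrRange = std::isalpha(value) != 0
                         || (value >= '0' && value <= '9');       // [[:alpha:][0-9]]
        bool explicitRange = inExplicitRange(value);              // [a-zA-Z0-9]

        if (alnum != alphaOrDigit || alnum != alphaOrRange || alnum != explicitRange)
            ++disagreements;
    }
    return disagreements;
}
```

Under the stated assumptions countDisagreements() returns 0; in other locales the first three classes may contain additional characters that [a-zA-Z0-9] does not.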
A negated character class such as the example [^A-Z] above will match a
newline unless \n (or an equivalent escape sequence) is one of the
characters explicitly present in the negated character class (e.g.,
[^A-Z\n]). This is unlike how many other regular expression tools treat
negated character classes, but unfortunately the inconsistency is historically
entrenched. Matching newlines means that a pattern like [^"]* can match
the entire input unless there's another quote in the input.
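The effect can be illustrated with a hand-written stand-in for a pattern of the form [^...]*. This is a sketch, not flexc++ output: the hypothetical function negatedClassStar returns the length of the longest prefix consisting only of characters outside the excluded set, and treats a newline like any other character, as described above.

```cpp
#include <cstddef>
#include <string>

// Length of the longest prefix of `input` whose characters are all absent
// from `excluded`; this models a negated character class followed by '*'.
// A newline is skipped over unless it is listed in `excluded`.
std::size_t negatedClassStar(std::string const &excluded, std::string const &input)
{
    std::size_t length = 0;
    while (length != input.size()
           && excluded.find(input[length]) == std::string::npos)
        ++length;
    return length;
}
```

For the input "one\ntwo", negatedClassStar("\"", ...) covers all 7 characters (the model of [^"]*), whereas negatedClassStar("\"\n", ...) stops after 3 (the model of [^"\n]*).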
Flexc++ allows negation of character class expressions by prepending ^ to
the POSIX character class name:
[:^alnum:] [:^alpha:] [:^blank:]
[:^cntrl:] [:^digit:] [:^graph:]
[:^lower:] [:^print:] [:^punct:]
[:^space:] [:^upper:] [:^xdigit:]
The '{-}' operator computes the difference of two character classes. For example, '[a-c]{-}[b-z]' represents all the characters in the class '[a-c]' that are not in the class '[b-z]' (which in this case, is just the single character 'a'). The '{-}' operator is left associative, so '[abc]{-}[b]{-}[c]' is the same as '[a]'. Be careful not to accidentally create an empty set, which will never match.
The '{+}' operator computes the union of two character classes. For example, '[a-z]{+}[0-9]' is the same as '[a-z0-9]'. This operator is useful when preceded by the result of a difference operation, as in '[[:alpha:]]{-}[[:lower:]]{+}[q]', which is equivalent to '[A-Zq]' in the "C" locale.
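Both operators are ordinary set operations, which can be modelled with one bit per byte value. The sketch below is an illustration only, assuming the "C" locale and an ASCII-compatible character set; CharSet, range, where, difference, unite and members are hypothetical names, not part of flexc++.

```cpp
#include <bitset>
#include <cctype>
#include <string>

using CharSet = std::bitset<256>;    // one bit per byte value

// Builds the set for an inclusive range such as [a-c].
CharSet range(unsigned char first, unsigned char last)
{
    CharSet set;
    for (int value = first; value <= last; ++value)
        set.set(value);
    return set;
}

// Builds the set of byte values accepted by a predicate, e.g. for [[:alpha:]].
template <typename Predicate>
CharSet where(Predicate accept)
{
    CharSet set;
    for (int value = 0; value != 256; ++value)
        if (accept(value))
            set.set(value);
    return set;
}

// Models lhs{-}rhs: members of lhs that are not in rhs.
CharSet difference(CharSet const &lhs, CharSet const &rhs)
{
    return lhs & ~rhs;
}

// Models lhs{+}rhs: members of either set.
CharSet unite(CharSet const &lhs, CharSet const &rhs)
{
    return lhs | rhs;
}

// Lists the members of a set in ascending byte order.
std::string members(CharSet const &set)
{
    std::string out;
    for (int value = 0; value != 256; ++value)
        if (set.test(value))
            out += static_cast<char>(value);
    return out;
}
```

Applying the operators from left to right, members(difference(range('a', 'c'), range('b', 'z'))) yields "a", and uniting the difference of the alphabetic and lowercase sets with range('q', 'q') yields the uppercase letters followed by q.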
A rule can have at most one instance of trailing context (the / operator
or the $ operator). The start condition, ^, and <<EOF>>
patterns can only occur at the beginning of a pattern and, like
/ and $, cannot be grouped inside parentheses. A ^ that does not
occur at the beginning of a rule, or a $ that does not occur at the end of
a rule, loses its special properties and is treated as a normal character.
The following are invalid:
foo/bar$
<sc1>foo<sc2>bar
Note that the first of these can be written `foo/bar\n'.
The following will result in $ or ^ being treated as a normal
character:
foo|(bar$)
foo|^bar
If the desired meaning is a `foo' or a `bar'-followed-by-a-newline, the following could be used (the special | action is explained below, see Actions):
foo |
bar$ /* action goes here */
A similar trick will work for matching a `foo' or a `bar'-at-the-beginning-of-a-line.