Chapter 5: Patterns

The patterns in the input (see Rules Section 4.2) are written using an extended set of regular expressions. These are:

x
match the character `x'

.
any character (byte) except newline

[xyz]
a character class; in this case, the pattern matches either an `x', a `y', or a `z'

[abj-oZ]
a ``character class'' with a range in it; matches an `a', a `b', any letter from `j' through `o', or a `Z'

[^A-Z]
a ``negated character class'', i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.

[^A-Z\n]
any character EXCEPT an uppercase letter or a newline

"[xyz]\"foo"
the literal string: `[xyz]"foo'

\X
if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C interpretation of `\X'. Otherwise, a literal `X' (used to escape operators such as `*')

\0
a NUL character (ASCII code 0)

\123
the character with octal value 123

\x2a
the character with hexadecimal value 2a

(r)
match an `r'; parentheses are used to override precedence (see below)

{name}
the expansion of the `name' definition (see Format of the input file, Section 4).

r*
zero or more r's, where r is any regular expression. Note that this will also match the empty string!

r+
one or more r's

r?
zero or one r's (that is, ``an optional r''). This will also match the empty string!

rs
the regular expression `r' followed by the regular expression `s'; called concatenation

r{m,n}
where 1 <= m <= n: match `r' at least m, but at most n times; called interval expression.

r{m,}
where 1 <= m: match `r' m or more times.

r{m}
where 1 <= m: match `r' exactly m times.

r|s
either an `r' or an `s'

r/s
an `r' but only if it is followed by an `s'. The text matched by `s' is included when determining whether this rule is the longest match, but is then returned to the input before the action is executed. So the action only sees the text matched by `r'. This type of pattern is called trailing context.

^r
an `r', but only at the beginning of a line (i.e., when just starting to scan, or right after a newline has been scanned).

r$
an `r', but only at the end of a line (i.e., just before a newline). Equivalent to `r/\n'.

<s>r
an `r', but only in start condition s (see Start Conditions, Section 7, for a discussion of start conditions).

<s1,s2,s3>r
same, but in any of start conditions s1, s2, or s3.

<<EOF>>
an end-of-file.

<s1,s2><<EOF>>
an end-of-file when in start condition s1 or s2

Note that inside a character class, all regular expression operators lose their special meaning except escape (\) and the character class operators -, ], and, at the beginning of the class, ^.

The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence. For example,

foo|bar*

is the same as

(foo)|(ba(r*))

since the * operator has higher precedence than concatenation, and concatenation higher than alternation (|). This pattern therefore matches either the string `foo' or the string `ba' followed by zero-or-more `r's. To match `foo' or zero-or-more repetitions of the string `bar', use:

foo|(bar)*

And to match a sequence of zero or more repetitions of `foo' and `bar':

(foo|bar)*
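The three groupings can also be checked outside flexc++. The sketch below uses C++ std::regex (ECMAScript grammar). That is not flexc++'s matcher, but for `*', concatenation, and `|' it is assumed here to rank the operators in the same order. The helper name fullMatch is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is matched in full by `pattern`.
// std::regex only stands in for flexc++ here; it is not the flexc++ matcher.
bool fullMatch(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, fullMatch("foo|bar*", "barrr") holds while fullMatch("foo|bar*", "barbar") does not; fullMatch("foo|(bar)*", "barbar") holds while fullMatch("foo|(bar)*", "barrr") does not; and fullMatch("(foo|bar)*", "foobarfoo") holds.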

Note that concatenation has a higher precedence than the interval expression. This is different from many other regular expression engines. It conforms, however, to the lex standard.

Also note that name expansion has about the same precedence as grouping (using parentheses to influence the precedence of the other operators in the regular expression). Because flexc++ treats a name expansion as a group, the lookahead operator cannot be used in a name definition (a named pattern, defined in the definition section): only one lookahead operator is allowed in a regular expression. Flex did allow the lookahead operator and the `^' operator (the begin anchor) in a name definition, so pay attention to this difference.

In addition to characters and ranges of characters, character classes can also contain character class expressions. These are expressions enclosed inside [: and :] delimiters, which themselves must appear between the [ and ] of the character class (other elements may occur inside the character class, too). The valid expressions are:

     
         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

These expressions all designate a set of characters equivalent to the corresponding standard C isXXX function. For example, [:alnum:] designates those characters for which isalnum() returns true - i.e., any alphabetic or numeric character.

For example, the following character classes are all equivalent:

 
         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:][0-9]]
         [a-zA-Z0-9]
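The equivalence rests on the C isXXX functions, so it can be checked directly against them. The following sketch assumes an ASCII platform and the default "C" locale (in other locales isalpha() may accept more characters, and the last class would then differ). The function name disagreements is chosen for this illustration only.

```cpp
#include <cctype>

// Counts the byte values 0..255 on which the formulations disagree.
// Assumes ASCII and the default "C" locale.
int disagreements()
{
    int count = 0;
    for (int ch = 0; ch != 256; ++ch)
    {
        // [[:alnum:]]
        bool alnum = std::isalnum(ch) != 0;
        // [[:alpha:][:digit:]]
        bool alphaOrDigit = std::isalpha(ch) != 0 || std::isdigit(ch) != 0;
        // [a-zA-Z0-9]
        bool explicitRange = (ch >= 'a' && ch <= 'z')
                          || (ch >= 'A' && ch <= 'Z')
                          || (ch >= '0' && ch <= '9');

        if (alnum != alphaOrDigit || alnum != explicitRange)
            ++count;
    }
    return count;
}
```

Under the stated assumptions the function returns 0, i.e., the classes select the same characters.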

A negated character class such as the example [^A-Z] above will match a newline unless \n (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., [^A-Z\n]). This is unlike how many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
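The effect of a newline-matching negated class can be observed with C++ std::regex, which (as far as this behavior is concerned) treats negated classes the same way. The sketch is an analogy, not flexc++ itself, and the helper name leadingMatch is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the text matched by `pattern` at the very start of
// `input`, or an empty string if there is no match at that position.
std::string leadingMatch(std::string const &pattern, std::string const &input)
{
    std::smatch match;
    std::regex const re(pattern);

    // match_continuous anchors the match at the first character of input
    if (std::regex_search(input, match, re,
                          std::regex_constants::match_continuous))
        return match.str();

    return "";
}
```

For the input "one\ntwo\"three" the pattern [^"]* runs across the newline and stops only at the quote, whereas a class that also lists the newline character stops at the end of the first line.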

Flexc++ allows negation of character class expressions by prepending ^ to the POSIX character class name.

                
         [:^alnum:] [:^alpha:] [:^blank:]
         [:^cntrl:] [:^digit:] [:^graph:]
         [:^lower:] [:^print:] [:^punct:]
         [:^space:] [:^upper:] [:^xdigit:]

The '{-}' operator computes the difference of two character classes. For example, '[a-c]{-}[b-z]' represents all the characters in the class '[a-c]' that are not in the class '[b-z]' (which in this case, is just the single character 'a'). The '{-}' operator is left associative, so '[abc]{-}[b]{-}[c]' is the same as '[a]'. Be careful not to accidentally create an empty set, which will never match.

The '{+}' operator computes the union of two character classes. For example, '[a-z]{+}[0-9]' is the same as '[a-z0-9]'. This operator is useful when preceded by the result of a difference operation, as in, '[[:alpha:]]{-}[[:lower:]]{+}[q]', which is equivalent to '[A-Zq]' in the "C" locale.
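The two operators are plain set operations on byte values, which the following sketch models with std::bitset. It is only a model of the semantics described above, not flexc++'s implementation; the names CharSet, range, difference, unite, and members are made up for this illustration.

```cpp
#include <bitset>
#include <string>

using CharSet = std::bitset<256>;       // one bit per byte value

// Hypothetical helper: the class [first-last].
CharSet range(unsigned char first, unsigned char last)
{
    CharSet set;
    for (unsigned ch = first; ch <= last; ++ch)
        set.set(ch);
    return set;
}

// Models lhs{-}rhs: characters in lhs that are not in rhs.
CharSet difference(CharSet const &lhs, CharSet const &rhs)
{
    return lhs & ~rhs;
}

// Models lhs{+}rhs: characters in either class.
CharSet unite(CharSet const &lhs, CharSet const &rhs)
{
    return lhs | rhs;
}

// Lists the members of a class in ascending byte order.
std::string members(CharSet const &set)
{
    std::string result;
    for (unsigned ch = 0; ch != 256; ++ch)
        if (set.test(ch))
            result += static_cast<char>(ch);
    return result;
}
```

Left associativity matters: evaluating '[abc]{-}[b]{-}[c]' as difference(difference(abc, b), c) leaves only `a', whereas the right-associative reading difference(abc, difference(b, c)) would leave `a' and `c'.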

A rule can have at most one instance of trailing context (the / operator or the $ operator). The start condition, ^, and <<EOF>> patterns can only occur at the beginning of a pattern and, like / and $, cannot be grouped inside parentheses. A ^ that does not occur at the beginning of a rule, or a $ that does not occur at the end of a rule, loses its special properties and is treated as a normal character.

The following are invalid:

                
         foo/bar$
         <sc1>foo<sc2>bar

Note that the first of these can be written `foo/bar\n'.

The following will result in $ or ^ being treated as a normal character:

 
         foo|(bar$)
         foo|^bar

If the desired meaning is a `foo' or a `bar'-followed-by-a-newline, the following could be used (the special | action is explained below, see Actions):

                
         foo      |
         bar$     /* action goes here */

A similar trick will work for matching a `foo' or a `bar'-at-the-beginning-of-a-line.