lex, section 3.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression
integer
matches the string integer wherever it appears and the expression
a57D
looks for the string a57D.

Operators. The operator characters are
" \ [ ] ^ - ? . * + | ( ) $ / { } % < >
and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus
xyz"++"
matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless but unnecessary to quote an ordinary text character; the expression
"xyz++"
is the same as the one above. Thus by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list above of current operator characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as in
xyz\+\+
which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a rule. Any blank character not contained within [] (see below) must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character indicates ranges. For example,
[a-z0-9<>_]
indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many more characters than it is in EBCDIC). If it is desired to include the character - in a character class, it should be first or last; thus
[-+0-9]
matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting set of characters is to be complemented with respect to the computer character set. Thus
[^abc]
matches all characters except a, b, or c, including all special or control characters; or
[^a-zA-Z]
is any character which is not a letter. The \ character provides the usual escapes within character class brackets.
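
The character classes above happen to be written the same way in Python's re module, so their membership can be tried directly. This is only an illustration, not Lex itself; the helper name members is introduced here for the example.

```python
import re

name_chars = re.compile(r"[a-z0-9<>_]")   # lower case, digits, < > and _
signed_digit = re.compile(r"[-+0-9]")     # leading - is a literal minus
not_abc = re.compile(r"[^abc]")           # complement of {a, b, c}
non_letter = re.compile(r"[^a-zA-Z]")     # anything that is not a letter

def members(char_class, candidates):
    """Return the candidate characters that belong to the class."""
    return [c for c in candidates if char_class.fullmatch(c)]
```

One difference worth noting: Python rejects a reversed range such as [z-a] with an error, so the remark that ranges may be given in either order applies to Lex only.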

Arbitrary character. To match almost any character, use the operator character
.
which is the class of all characters except newline. Escaping into octal is possible although non-portable:
[\40-\176]
matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).
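
As a rough check of these two classes, the sketch below uses Python's re module, where . likewise excludes newline by default. The octal escapes are written with three digits (\040) because that is the unambiguous form in Python; the function name is hypothetical.

```python
import re

any_but_newline = re.compile(r".")
printable_ascii = re.compile(r"[\040-\176]")  # octal 40 (blank) to 176 (tilde)

def is_printable_ascii(text):
    """True if every character of text lies in the octal range 40-176."""
    return all(printable_ascii.fullmatch(ch) for ch in text)
```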

Optional expressions. The operator ? indicates an optional element of an expression. Thus
ab?c
matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.
a*
is any number of consecutive a characters, including zero; while
a+
is one or more instances of a. For example,
[a-z]+
is all strings of lower case letters. And
[A-Za-z][A-Za-z0-9]*
indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages.
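
The operators ?, * and + carry the same meaning in Python's re module, so the expressions above can be tried there. The sketch below is illustrative only; is_identifier and whole are names chosen for this example.

```python
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_identifier(text):
    """True if text is a letter followed by any number of letters or digits."""
    return identifier.fullmatch(text) is not None

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None
```

Here whole("a*", "") is true because * allows zero repetitions, while whole("a+", "") is false because + requires at least one.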

Alternation and Grouping. The operator | indicates alternation:
(ab|cd)
matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level;
ab|cd
would have sufficed. Parentheses can be used for more complex expressions:
(ab|cd+)?(ef)*
matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
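
The listed strings can be confirmed against the same expression in Python's re module, which shares this grouping and alternation syntax. The check below requires the whole string to match; the function name is an assumption of this sketch.

```python
import re

grouped = re.compile(r"(ab|cd+)?(ef)*")

def matches_whole(text):
    """True if the entire text is matched by (ab|cd+)?(ef)*."""
    return grouped.fullmatch(text) is not None

accepted = [s for s in ["abefef", "efefef", "cdef", "cddd"] if matches_whole(s)]
rejected = [s for s in ["abc", "abcd", "abcdef"] if not matches_whole(s)]
```

The rejected strings fail because the optional group allows only one of ab or cd+, not both in sequence.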

Context sensitivity. Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream). This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [] operators. If the very last character is $, the expression will only be matched at the end of a line (when immediately followed by newline). The latter operator is a special case of the / operator character, which indicates trailing context. The expression
ab/cd
matches the string ab, but only if followed by cd. Thus
ab$
is the same as
ab/\n
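
Python's re module has no / operator, but a lookahead assertion gives a close approximation of trailing context: the following text is required yet is not part of the match. The sketch below rests on that substitution and is not how Lex itself implements the feature; the names are chosen for this example.

```python
import re

ab_before_cd = re.compile(r"ab(?=cd)")               # approximates Lex ab/cd
ab_at_line_end = re.compile(r"ab(?=\n)")             # approximates Lex ab$ and ab/\n
ab_at_line_start = re.compile(r"^ab", re.MULTILINE)  # approximates Lex ^ab

def matched_text(pattern, text):
    """Return the text consumed by the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group() if m else None
```

In each case only ab is consumed; the cd or newline that follows remains available for the next match.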
Left context is handled in Lex by start conditions as explained in section 10. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by
<x>
using the angle bracket operator characters. If we considered "being at the beginning of a line" to be start condition ONE, then the ^ operator would be equivalent to
<ONE>
Start conditions are explained more fully later.
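
The idea of treating "beginning of line" as a start condition can be sketched with a toy scanner in Python. Everything here is hypothetical: the rule table, the token names, and the conditions ONE and OTHER are invented for illustration, and the longest-match selection is an assumption about scanner behavior that this section does not describe.

```python
import re

# Hypothetical rule table: (start condition or None, pattern, token name).
# A rule with condition None is active in every condition.
RULES = [
    ("ONE", re.compile(r"#.*"), "COMMENT"),  # like <ONE>#.* : only at line start
    (None, re.compile(r"[a-z]+"), "WORD"),
    (None, re.compile(r"#"), "HASH"),
    (None, re.compile(r"\n"), "NEWLINE"),
    (None, re.compile(r"[ \t]+"), None),     # whitespace is skipped
]

def scan(text):
    """Tokenize text, tracking whether the scanner is at the start of a line."""
    tokens = []
    pos = 0
    condition = "ONE"  # the beginning of the input counts as a line start
    while pos < len(text):
        best = None
        for cond, pattern, name in RULES:
            if cond is not None and cond != condition:
                continue  # rule is not active in the current start condition
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, name)  # keep the longest match; ties go to the earlier rule
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        m, name = best
        if name is not None:
            tokens.append((name, m.group()))
        condition = "ONE" if m.group().endswith("\n") else "OTHER"
        pos = m.end()
    return tokens
```

With this table a # at the start of a line begins a COMMENT, while a # elsewhere is reported as a plain HASH, which is the effect the <ONE> prefix is meant to convey.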

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example
{digit}
looks for a predefined string named digit and inserts it at that point in the expression. The definitions are given in the first part of the Lex input, before the rules. In contrast,
a{1,5}
looks for 1 to 5 occurrences of a.
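
The two uses of braces can be told apart mechanically: a name inside braces is replaced by its definition, while numbers are left alone as a repetition count. The following Python sketch illustrates this under stated assumptions; expand_definitions is a hypothetical helper, and wrapping each expansion in parentheses is a choice made here to keep the inserted text grouped, not something this section specifies.

```python
import re

# A brace pair enclosing a name (not numbers) marks a definition to expand.
NAME_IN_BRACES = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_definitions(pattern, definitions):
    """Replace each {name} with its definition; leave {m,n} repetitions alone."""
    def substitute(match):
        # Parenthesizing the expansion is an assumption of this sketch.
        return "(" + definitions[match.group(1)] + ")"
    return NAME_IN_BRACES.sub(substitute, pattern)

number = expand_definitions("{digit}+", {"digit": "[0-9]"})
```

Here number becomes ([0-9])+, whereas expand_definitions("a{1,5}", {}) returns a{1,5} unchanged, and that repetition is also understood by Python's re module.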

Finally, initial % is special, being the separator for Lex source segments.