The definitions of regular expressions are very similar to those
in QED [5].
A regular
expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding
characters in the strings being compared)
and operator characters (which specify
repetitions, choices, and other features).
The letters of the alphabet and the digits are
always text characters; thus the regular expression
integer
matches the string
integer
wherever it appears
and the expression
a57D
looks for the string
a57D.
Operators.
The operator characters are
" \ [ ] ^ - ? . * + | ( ) $ / { } % < >
and if any of them is to be used as a text character, it
must be escaped.
The quotation mark operator (")
indicates that whatever is contained between a pair of quotes
is to be taken as text characters.
Thus
xyz"++"
matches the string
xyz++
when it appears. Note that a part of a string may be quoted.
It is harmless but unnecessary to quote an ordinary
text character; the expression
"xyz++"
is the same as the one above.
Thus by quoting every non-alphanumeric character
being used as a text character, the user can avoid remembering
the list of current operator characters given above,
and is safe should further extensions to Lex
lengthen the list.
An operator character may also be turned into a text character
by preceding it with \ as in
xyz\+\+
which
is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into
an expression; normally, as explained above, blanks or tabs end
a rule.
Any blank character not contained within [] (see below) must
be quoted.
Several normal C escapes with \
are recognized: \n is newline, \t is tab, and \b is backspace.
To enter \ itself, use \\.
Since newline is illegal in an expression, \n must be used;
it is not
required to escape tab and backspace.
Every character but blank, tab, newline and the list above is always
a text character.
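As a rough illustration only, the effect of escaping can be tried with Python's re module. This is an analogue and not Lex itself: Python has no "..." quoting operator, so the sketch below uses backslash escapes, with the standard re.escape function standing in for quoting. The helper name matches_whole is hypothetical.

```python
import re


def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Backslash escapes turn the + operator into a text character.
escaped = r"xyz\+\+"

# re.escape escapes every character that could be an operator, which is
# the closest Python counterpart to quoting a part of the expression.
quoted = "xyz" + re.escape("++")

print(matches_whole(escaped, "xyz++"))
print(matches_whole(quoted, "xyz++"))
# Unescaped, + is a repetition operator rather than a text character.
print(matches_whole(r"xyz+", "xyzzz"))
```

As with quoting in Lex, escaping everything that is not alphanumeric avoids having to remember which characters are operators.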
Character classes.
Classes of characters can be specified using the operator pair [].
The construction
[abc]
matches a
single character, which may be
a,
b,
or
c.
Within square brackets,
most operator meanings are ignored.
Only three characters are special:
these are \ - and ^. The - character
indicates ranges. For example,
[a-z0-9<>_]
indicates the character class containing all the lower case letters,
the digits,
the angle brackets, and underline.
Ranges may be given in either order.
Using - between any pair of characters that are
not both upper case letters, both lower case letters, or both digits
is implementation dependent and produces a warning message.
(For example, [0-z] covers many more characters in ASCII
than it does in EBCDIC.)
If it is desired to include the
character - in a character class, it should be first or
last; thus
[-+0-9]
matches all the digits and the two signs.
In character classes,
the ^ operator must appear as the first character
after the left bracket; it indicates that the resulting set of characters
is to be complemented with respect to the computer character set.
Thus
[^abc]
matches all characters except a, b, or c, including
all special or control characters; or
[^a-zA-Z]
is any character which is not a letter.
The \ character provides the usual escapes within
character class brackets.
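The character classes above can be tried out in Python's re module, which happens to use the same bracket syntax for these cases. This is a sketch for experimentation, not Lex; the names in_class and the four compiled patterns are hypothetical.

```python
import re

LOWER_DIGIT_MISC = re.compile(r"[a-z0-9<>_]")  # ranges plus < > and _
SIGN_OR_DIGIT = re.compile(r"[-+0-9]")         # leading - is a literal
NOT_ABC = re.compile(r"[^abc]")                # complemented class
NOT_LETTER = re.compile(r"[^a-zA-Z]")          # anything but a letter


def in_class(char_class, ch: str) -> bool:
    """Return True if the single character ch belongs to the class."""
    return char_class.fullmatch(ch) is not None


print(in_class(LOWER_DIGIT_MISC, "<"))
print(in_class(SIGN_OR_DIGIT, "-"))
# A complemented class also contains control characters and newline.
print(in_class(NOT_ABC, "\n"))
print(in_class(NOT_LETTER, "Q"))
```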
Arbitrary character.
To match almost any character, use the operator character
.
which is the class of all characters except newline.
Escaping into octal is possible although non-portable:
[\40-\176]
matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).
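For experimentation, both expressions behave the same way in Python's re module, which is used here only as an analogue of Lex. The helper names below are hypothetical, and the octal range assumes an ASCII-compatible character set, as the text already warns.

```python
import re

ANY_BUT_NEWLINE = re.compile(r".")
PRINTABLE_ASCII = re.compile(r"[\40-\176]")  # octal 40 (blank) to 176 (tilde)


def is_arbitrary(ch: str) -> bool:
    """Return True if the . operator matches the single character ch."""
    return ANY_BUT_NEWLINE.fullmatch(ch) is not None


def is_printable_ascii(ch: str) -> bool:
    """Return True if ch lies in the octal range 40 through 176."""
    return PRINTABLE_ASCII.fullmatch(ch) is not None


print(is_arbitrary("x"), is_arbitrary("\n"))
print(is_printable_ascii(" "), is_printable_ascii("~"))
print(is_printable_ascii("\t"), is_printable_ascii("\x7f"))
```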
Optional expressions.
The operator
?
indicates
an optional element of an expression.
Thus
ab?c
matches either
ac
or
abc.
Repeated expressions.
Repetitions of classes are indicated by the operators
*
and
+.
a*
is any number of consecutive
a
characters, including zero; while
a+
is one or more instances of
a.
For example,
[a-z]+
matches any non-empty string of lower case letters.
And
[A-Za-z][A-Za-z0-9]*
indicates all alphanumeric strings with a leading
alphabetic character.
This is a typical expression for recognizing identifiers in
computer languages.
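The optional and repetition operators can be sketched in Python's re module, where ?, * and + carry the same meaning for these expressions. The code is an analogue for experimentation rather than Lex, and find_identifiers is a hypothetical helper name.

```python
import re

OPTIONAL_B = re.compile(r"ab?c")                    # ac or abc
ZERO_OR_MORE_A = re.compile(r"a*")                  # includes the empty string
ONE_OR_MORE_A = re.compile(r"a+")                   # at least one a
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")    # leading letter required


def find_identifiers(text: str) -> list:
    """Return every identifier-shaped substring, scanning left to right."""
    return IDENTIFIER.findall(text)


print(OPTIONAL_B.fullmatch("ac") is not None)
print(ZERO_OR_MORE_A.fullmatch("") is not None)
print(ONE_OR_MORE_A.fullmatch("") is not None)
print(find_identifiers("x1 = 42 + total"))
```

The last call skips 42 because a digit cannot begin an identifier under this expression.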
Alternation and Grouping.
The operator |
indicates alternation:
(ab|cd)
matches either
ab
or
cd.
Note that parentheses are used for grouping, although
they are
not necessary at the outermost level;
ab|cd
would have sufficed.
Parentheses
can be used for more complex expressions:
(ab|cd+)?(ef)*
matches such strings as
abefef,
efefef,
cdef,
or
cddd;
but not
abc,
abcd,
or
abcdef.
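The accepted and rejected strings listed above can be checked mechanically. The sketch below uses Python's re module as a stand-in for Lex and tests each string as a whole, which is an assumption about how "matches" is meant here; the name accepts is hypothetical.

```python
import re

GROUPED = re.compile(r"(ab|cd+)?(ef)*")


def accepts(text: str) -> bool:
    """Return True if the whole of text is matched by the expression."""
    return GROUPED.fullmatch(text) is not None


for s in ["abefef", "efefef", "cdef", "cddd", "abc", "abcd", "abcdef"]:
    print(s, accepts(s))
```

The last three strings fail because the optional group may be used only once, so ab cannot be followed by a second alternative such as cd.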
Context sensitivity.
Lex will recognize a small amount of surrounding
context. The two simplest operators for this are
^
and
$.
If the first character of an expression is
^,
the expression will only be matched at the beginning
of a line (after a newline character, or at the beginning of
the input stream).
This can never conflict with the other meaning of
^,
complementation
of character classes, since that only applies within
the [] operators.
If the very last character is
$,
the expression will only be matched at the end of a line (when
immediately followed by newline).
The latter operator is a special case of the
/
operator character,
which indicates trailing context.
The expression
ab/cd
matches the string
ab,
but only if followed by
cd.
Thus
ab$
is the same as
ab/\n
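Python's re module has no / operator, but a lookahead assertion gives a similar effect: the trailing text must be present and is not consumed. The sketch below is an approximation of the Lex behavior under that assumption, and match_spans is a hypothetical helper.

```python
import re

# Lex ab/cd: match ab only when cd follows; cd is not part of the match.
TRAILING = re.compile(r"ab(?=cd)")

# Lex ab$ is ab/\n: ab must be followed by a newline. Python's own $
# also matches at the very end of the input, so the lookahead form is
# the closer analogue.
END_OF_LINE = re.compile(r"ab(?=\n)")

# Lex ^ab: ab at the beginning of a line or of the input stream.
START_OF_LINE = re.compile(r"^ab", re.MULTILINE)


def match_spans(pattern, text: str) -> list:
    """Return (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


print(match_spans(TRAILING, "xxabcd abce"))
print(match_spans(END_OF_LINE, "ab\ncab"))
print(match_spans(START_OF_LINE, "ab\nxab\nab"))
```

In each case the reported span covers only ab, which mirrors the way trailing context is examined but left in the input.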
Left context is handled in Lex by
start conditions
as explained in section 10. If a rule is only to be executed
when the Lex automaton interpreter is in start condition
x,
the rule should be prefixed by
<x>
using the angle bracket operator characters.
If we considered ``being at the beginning of a line'' to be
start condition
ONE,
then the ^ operator
would be equivalent to
<ONE>
Start conditions are explained more fully later.
Repetitions and Definitions.
The operators {} specify
either repetitions (if they enclose numbers)
or
definition expansion (if they enclose a name). For example
{digit}
looks for a predefined string named
digit
and inserts it
at that point in the expression.
The definitions are given in the first part of the Lex
input, before the rules.
In contrast,
a{1,5}
looks for 1 to 5 occurrences of
a.
Finally, an initial % is special, being the separator for Lex source segments.
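The two uses of {} can be imitated in Python as a rough sketch. Numeric repetition is supported directly by the re module, while definition expansion is simulated here by textual substitution. The table DEFINITIONS, its two entries, and the function expand are all hypothetical, and wrapping each definition in a group is an assumption of this sketch rather than a statement about what Lex does.

```python
import re

# Hypothetical definitions, playing the role of the first part of Lex input.
DEFINITIONS = {"digit": "[0-9]", "letter": "[A-Za-z]"}

# A name in braces; {1,5} does not match because it starts with a digit.
NAME_IN_BRACES = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(pattern: str, definitions: dict) -> str:
    """Replace each {name} by its definition, leaving {m,n} untouched."""
    def substitute(match):
        # Group the inserted text so that a following operator applies
        # to the whole definition.
        return "(?:" + definitions[match.group(1)] + ")"
    return NAME_IN_BRACES.sub(substitute, pattern)


identifier = re.compile(expand("{letter}({letter}|{digit})*", DEFINITIONS))
one_to_five = re.compile(expand("a{1,5}", DEFINITIONS))

print(identifier.fullmatch("a57D") is not None)
print(one_to_five.fullmatch("aaa") is not None)
print(one_to_five.fullmatch("aaaaaa") is not None)
```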