Flex is a preprocessor that turns a scanner specification into C or C++ compatible code. It is essentially the same program as Lex, with a few changes.
A Flex program has three sections: definitions, rules, and user subroutines. The sections are separated by a pair of percent signs (%%) in the first column. The form is roughly:
    definitions
    %%
    rules
    %%
    user subroutine section
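The definitions section may begin with C code enclosed in %{ and %}, which is copied verbatim into the generated program:

    %{
    code to copy into the final program
    %}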
After this can come pattern macros and a few other options.
Pattern macros are defined by giving a name followed by the definition. For example:

    name [a-z][a-z0-9]*
The rules section is a list of patterns, each followed by its action:

    pattern action
    pattern action
    pattern action
The user subroutine section, of course, contains user subroutines and the main program which usually invokes the yylex subroutine.
Patterns are regular expressions for matching strings.
Any character that does not have special meaning matches itself.
\t | matches tab |
\n | matches newline |
. | dot matches any char except a newline |
\ | is used to make any special character lose its special status and match just the character, except in the special cases where the \ character has a meaning in C such as tab or newline. |
* | matches 0 or more copies of the pattern preceding the * |
+ | matches 1 or more copies of the previous pattern |
? | matches 0 or 1 copies of the previous pattern |
{x,y} | matches x through y copies of the previous pattern |
[ ] | character set. If ^ is the first character, the set is the inverse or complement of the characters that follow. A - between two characters means a range: [a-z] means the lowercase letters, [aeiou] means the vowels, and [a-zA-Z] means all letters. Note that a negated set such as [^a-z] will match \n unless \n is listed in the set. |
/ | means "match if followed by". For example, dog/cat matches the three characters "dog" if they are followed by the three characters "cat". "cat" is not actually matched. |
$ | at the end of a pattern matches at the end of line |
^ | at the beginning of a pattern matches beginning of line |
| | alternation. cat|dog means cat or dog. |
"..." | takes the contents of the quoted string as literal characters and not as special pattern characters. |
( ) | grouping. (cat|dog)+ is one or more occurrences of cat or dog. |
{name} | is the invocation of the pattern macro named name that was defined in the definition section. |
Warning: Any time you want to match a string that contains suspicious characters such as > or <, enclose the constant characters in quotes. For example, dogs"<"[0-9] matches the word dogs followed by < followed by a digit. See A Regular Expression Primer for a comparison with other tools that use regular expressions.
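Several of the operators above (*, +, ?, {x,y}, character sets, alternation, and grouping) can be tried out without Flex. The following is a minimal sketch that uses the POSIX regcomp/regexec functions instead of Flex's own matcher, so it only approximates Flex's behavior. The helper name full_match is hypothetical and is not part of Flex or POSIX.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of Flex): returns 1 if the whole string s
 * matches the POSIX extended regular expression `pattern`, 0 if it does
 * not, and -1 if the pattern is too long or fails to compile. */
int full_match(const char *pattern, const char *s)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && s != NULL);

    /* Wrap the pattern as ^(pattern)$ so that it must cover all of s. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || n >= (int)sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, full_match("[a-z][a-z0-9]*", "x25") returns 1, full_match("[0-9]{2,4}", "12345") returns 0, and full_match("[^aeiou]", "\n") returns 1, which illustrates that a negated set matches a newline. Trailing context (/), quoted strings, and {name} macros are Flex features and are not available through this interface.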
The four rules for matching tokens are:
A simple example that labels numbers and words:

    %{
    #include <stdio.h>
    %}
    letter [a-zA-Z]
    %%
    [0-9]+      { printf("NUM: %s\n", yytext); }
    {letter}+   { printf("WORD: %s\n", yytext); }
    .|\n        ;
    %%
    int main(void)
    {
        yylex();
        return 0;
    }
If this lex file is stored in numwords.l then this can be compiled by using:
    flex -d numwords.l    # -d turns on tracing for *hardcore* debugging
    gcc -g lex.yy.c -lfl -o numwords
This runs flex on the lex file to generate lex.yy.c. The -d option adds code that turns on tracing in your scanner; most of the time you do not want or need it. gcc then compiles lex.yy.c with the debug option -g, links in the flex library with -lfl, and puts the executable in numwords using the -o option. In this case g++ would work as well.
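To get a feel for what the scanner built from numwords.l does, it can help to compare it with a hand-written loop. The sketch below is an approximation written for illustration only; it is not the code Flex generates, and the name scan_numwords is hypothetical. It writes its output into a caller-supplied buffer rather than standard output so that the result is easy to inspect.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical hand-written approximation of the numwords scanner.
 * Appends "NUM: ..." and "WORD: ..." lines to out and returns the
 * number of tokens written. Stops early if out runs out of room. */
int scan_numwords(const char *in, char *out, size_t outsz)
{
    size_t used = 0;
    int tokens = 0;

    assert(in != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    while (*in != '\0') {
        const char *start = in;
        const char *label;
        int n;

        if (isdigit((unsigned char)*in)) {
            /* like [0-9]+ : take the longest run of digits */
            while (isdigit((unsigned char)*in))
                in++;
            label = "NUM";
        } else if (isalpha((unsigned char)*in)) {
            /* like {letter}+ : take the longest run of letters */
            while (isalpha((unsigned char)*in))
                in++;
            label = "WORD";
        } else {
            /* like the .|\n rule: skip the character silently */
            in++;
            continue;
        }

        n = snprintf(out + used, outsz - used, "%s: %.*s\n",
                     label, (int)(in - start), start);
        if (n < 0 || (size_t)n >= outsz - used) {
            out[used] = '\0';   /* drop the partial line and stop */
            break;
        }
        used += (size_t)n;
        tokens++;
    }
    return tokens;
}
```

Given the input "abc123 de, 45\n", this writes the four lines WORD: abc, NUM: 123, WORD: de, and NUM: 45. The Flex version expresses the same behavior in three rules.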
A makefile for building a Flex scanner with g++:

    # name of thing to be built goes here. e.g. this builds t0
    BIN = t0
    CC = g++
    CFLAGS = -g -DCPLUSPLUS
    SRCS = $(BIN).l
    OBJS = lex.yy.o
    LIBS = -lfl

    $(BIN): $(OBJS)
    	$(CC) $(CFLAGS) $(OBJS) $(LIBS) -o $(BIN)

    lex.yy.c: $(BIN).l
    	flex $(BIN).l

    clean:
    	rm -f $(OBJS) $(BIN) lex.yy.c $(BIN).tab.h $(BIN).tab.c
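A word count example that prints the number of lines, words, and characters:

    %{
    #include <stdio.h>
    int c = 0, w = 0, l = 0;
    %}
    word [^ \t\n]+
    eol  \n
    %%
    {word}  { w++; c += yyleng; }
    {eol}   { c++; l++; }
    .       { c++; }
    %%
    int main(void)
    {
        yylex();
        printf("%d %d %d\n", l, w, c);
        return 0;
    }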
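For comparison, the same counting logic can be written by hand. This is a sketch under the assumption that the input is already in memory as a string; the struct and function names are hypothetical and do not come from the Flex example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the sketch below. */
struct counts {
    int lines;
    int words;
    int chars;
};

/* Counts lines, words, and characters the way the three rules do:
 * a word is a maximal run of characters other than space, tab, newline. */
struct counts count_text(const char *s)
{
    struct counts k = {0, 0, 0};

    assert(s != NULL);
    while (*s != '\0') {
        if (*s == '\n') {               /* the {eol} rule */
            k.lines++;
            k.chars++;
            s++;
        } else if (*s == ' ' || *s == '\t') {   /* the . rule */
            k.chars++;
            s++;
        } else {                        /* the {word} rule */
            k.words++;
            while (*s != '\0' && *s != ' ' && *s != '\t' && *s != '\n') {
                k.chars++;              /* same effect as c += yyleng */
                s++;
            }
        }
    }
    return k;
}
```

Calling count_text("hello world\nfoo\tbar baz\n") yields 2 lines, 5 words, and 24 characters.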
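A C++ example that prints a running counter each time a number is matched:

    %{
    #include <iostream>
    using namespace std;
    int counter = 0;    /* why here and not in main? */
    %}
    %%
    [0-9]*  { cout << counter++; /* why no endl? */ }
    %%
    int main()
    {
        yylex();
        return 0;
    }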
An example that prints everything enclosed in angle brackets:

    %{
    #include <stdio.h>
    %}
    %%
    "<"[^>]*>   { printf("VALUE: %s\n", yytext); /* match stretches across \n */ }
    .|\n        ;
    %%
    int main(void)
    {
        yylex();
        return 0;
    }
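The reason the match stretches across \n is that the negated set [^>] accepts every character except >, including a newline. A small hand-written sketch of the same search makes this visible. The function name next_tag is hypothetical, and the code uses strchr rather than anything generated by Flex.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: finds the next "<...>" span at or after s.
 * Returns a pointer to the '<' and stores the span length (including
 * both brackets) in *len, or returns NULL if there is no such span. */
const char *next_tag(const char *s, size_t *len)
{
    const char *open;
    const char *close;

    assert(s != NULL && len != NULL);

    open = strchr(s, '<');
    if (open == NULL)
        return NULL;

    /* Any character other than '>' may appear inside, newline included. */
    close = strchr(open + 1, '>');
    if (close == NULL)
        return NULL;

    *len = (size_t)(close - open) + 1;
    return open;
}
```

On the text "x <a\nhref> y <b>", the first call finds the 8-character span that starts at the first < and contains the newline, and a second call starting just past it finds <b>.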