A grammar is used to specify the syntax of a language. It answers the question: What sentences are in the language and what are not? A sentence is a finite sequence of symbols from the vocabulary of the language. The grammar we discuss here is for a context free languages. The grammar is a context free grammar or CFG.
A grammar has:
Backus-Naur Form (BNF) is a notation for expressing a CFG. The notation:
Here is an example grammar for simple English sentences:
<sentence> ::= <subject> <predicate> <subject> ::= <article> <noun> <predicate> ::= <verb> <direct-object> <direct-object> ::= <article> <noun> <article> ::= THE | A <noun> ::= MAN | DOG <verb> ::= BITES | PETS
Is the sentence "THE DOG BITES A MAN" in the language? That is does
<sentence> -> THE DOG BITES A MAN
MAN BITES DOG is not in the grammar.
Parse tree is a tree based notation for describing the way a sentence could be built from a grammar. There might be more than one parse tree for a given sentence in a language (see ambiguity).
Derivation is the list of steps used in construction of a specific parse tree for a sentence from a grammar.
Left Most Derivation is a derivation in which the left most nonterminal is always replaced first.
Parse is to show how a sentence could be built from a grammar.
Metasymbols are symbols outside the language to prevent circularity in talking about the grammar.
Many of the linguistic structures in a computer language come from a small set of basic idioms. Here are some basic forms for some common things you might want in your language:
start symbol:
This is also a grammar for that language:
Here is a more hierarchical way to do the same thing:
Note that the first part handles this "listing" of the subparts which
can either be an X or Y. It is often clarifying to do grammars
hierarchically.
Note the hierarchical approach.
Sometimes you need to ask yourself is this a separating delimiter or
a terminating delimiter?
<sentence> ::= X | X <sentence>
<sentence> ::= X | <sentence> X
This is an example that there can be multiple correct
grammars for the same language.
<sentence> ::= X | Y | <sentence> X | <sentence> Y
<sentence> ::= <sub> | <sentence> <sub>
<sub> ::= X | Y
<sentence> ::= <listx> ";"
<listx> ::= X | X <listx>
<sentence> ::= <list> | <sentence> <list>
<list> ::= <listx> ";"
<listx> ::= X | X <listx>
<listx> ::= X | X "," <listx>
<listx> ::= X ";" | X ";" <listx>
This is a terminating delimiter case.
Hierarchically it looks like:
<sentence> ::= <sub> | <sentence> <sub>
<sub> ::= X ","
<arglist> ::= "(" ")" | "(" <stmtlist> ")"
<stmtlist> ::= <stmt> | <stmtlist> "," <stmt>
<stmt> ::= X | Y | Z
<tdecl> ::= <type> <varlist> ";"
<varlist> ::= <var> | <varlist> "," <var>
<var> ::= X | Y | Z
<type> ::= int | bool | string
<tdecl> ::= <type> <varlist> ";"
<varlist> ::= <var> | <varlist> "," <var>
<var> ::= X | Y | Z
<type> ::= static <basictype> | <basictype>
<basictype> ::= int | bool | string
Notice how I slipped the static as an option before the type
allowing either static followed by the basic type or
just the basictype.
<tree> ::= "(" <list> ")" // deal with parentheses
<list> ::= <thing> | <list> "," <thing> // do the list of things
<thing> ::= <tree> | <name> // things are more trees or names
<name> ::= ant | bat | cow | dog
Operator precedence is rule for determining the order of execution of heterogeneous operators in an expression. precedence is handled by grouping operators of same precedence in the same production. You can see that + and - have the same precedence as does * and /. The operators with highest precedence occur farther down in the grammar, that is, an expression is a sum of products which is product of exponentiation.
Finally, products of sums is denoted by putting the sum in parentheses as in:
666*(42+496)To get this affect the parenthesized expression is put in the same place that any variable could be put. In order to return to the top of the grammar (see Tips section below) the parentheses act as a counter for the number of times you return to the "top" of the grammar.
<exp> ::= <exp> + <addpart> | <exp> - <addpart> | <addpart> <addpart> ::= <addpart> * <mulpart> | <addpart> / <mulpart> | <mulpart> <mulpart> ::= <group> ^ <mulpart> | <group> <group> ::= <var> | ( <exp> ) <var> ::= X | Y | Z
The processing of sentences in a language to get their meaning usually uses the parse trees. The parse tree is a way to understand the structure of what was said in the sentence. If for a given grammar G there exists a sentence in the language for which there is more than one parse tree then G is said to be ambiguous. You only need one example sentence to make G ambiguous! Note that the language is not ambiguous, the grammar is. Also note that it is OK for there to be more than one derivation for a sentence using an unambiguous grammar. Derivation doesn't make the grammar ambiguous, parse trees do.
<sentence> ::= X | <sentence> X | X <sentence>
This grammar is ambiguous because there exists a sentence that has more than one parse tree. For example XX has two parse trees. Here is another way to get ambiguity for that language:
<sentence> ::= X | <sentence> <sentence>
Try the sentence XXX on above grammar. Can you can find two parse trees? The fix is just to do:
<sentence> ::= X | <sentence> X
Here is another very similar ambiguous grammar:
<binary-string> ::= 0 | 1 | <binary-string> <binary-string>
Here is a fix:
<binary-string> ::= 0 <binary-string> | 1 <binary-string> | 0 | 1
Ambiguity effects meaning:
<sentence> ::= <expression>
<expression> ::= <expression> + <expression> |
<expression> * <expression> |
<identifier>
<identifier> ::= X | Y | Z
These are in the language:
X
X + Y
X + Y * Z
While the first two expressions are fine,
X + Y * Z has a multiple parse trees. We want the
multiply to be done first when we do a depth first traversal of
the parse tree to extract the structure. How do we do this?
We need to control operator precedence.
Here we have added to the ambiguous grammar above to make multiply done
before addition (BUT IT IS STILL AMBIGUOUS)
<sentence> ::= <expression>
<expression> ::= <term> | <expression> + <expression>
<term> ::= <identifier> | <term> * <term>
<identifier> ::= X | Y | Z
Operator Associativity determines the order of execution of homogeneous operators. Essentially is it left to right or right to left. Here we have modified the grammar to be left to to right for both + and *.
<sentence> ::= <expression>
<expression> ::= <term> | <expression> + <term>
<term> ::= <identifier> | <term> * <identifier>
<identifier> ::= X | Y | Z
Suppose we want to have both + and - be of equal precedence? We can do that
here:
<sentence> ::= <expression>
<expression> ::= <term> | <expression> + <term> | <expression> - <term>
<term> ::= <identifier> | <term> * <identifier>
<identifier> ::= X | Y | Z
Note: we cannot have two operators of the same precedence in which one is left to right associative and
the other is right to left associative. For example:
<sentence> ::= <expression>
<expression> ::= <term> | <expression> + <term> | <term> - <expression>
<identifier> ::= X | Y | Z
If we did then the grammar would be ambiguous To see this
assume for a moment that + and left to right as normal but - is right to left. How do we parse:
X+Y-Z? Is it evaluated as if it is (X+Y)-Z or is it evaluated as X+(Y-Z)? Our grammar doesn't tell us.
<sentence> ::= <expression>
<expression> ::= <term> | <expression> + <term>
<term> ::= <identifier> | <term> * <identifier> | <expression>
<identifier> ::= X | Y | Z
The fix is to mark each time through the loop by forcing something
to be "removed" from the sentence each time. In this case we
have parentheses:
<sentence> ::= <expression>
<expression> ::= <term> | <expression> + <term>
<term> ::= <identifier> | <term> * <identifier> | ( <expression> )
<identifier> ::= X | Y | Z
Now if you go through the loop 12 times you are nested 12 deep in
parens but if you go through it 317 times then you are nested 317 deep.
A grammar is ambiguous if you have two or more paths through the grammar to a symbol. For example starting with <aardvark>:
<aardvark> :: = <cat> | <dog> <cat> ::= <zebra> <dog> ::= <zebra>then you get at least two parse trees for <zebra> starting with <aardvark>. So this grammar is ambiguous. Note that
<aardvark> :: = ( <cat> ) | <dog> <cat> ::= <zebra> <dog> ::= <zebra>is NOT ambiguous since <zebra> must go through <dog> or it will pickup a set of parentheses "marking" that it went via <cat>.
In this next grammar just adding on stuff to a working grammar has created two ways to get from expression to X making the grammar ambiguous.
<expression> ::= <term> | <expression> + <term> | <identifier>
<term> ::= <identifier> | <term> * <identifier>
<identifier> ::= X | Y | Z | <expression>
The first parse tree is:
<expression> -> <term> -> <identifier> -> XThe second is:
<expression> -> <identifier> -> X
In summary there are 3 ways to get ambiguity.
<sentence> ::= X | <sentence> <sentence>
<expression> ::= <term> | <expression> + <term>
<term> ::= <identifier> | <term> * <identifier>
<identifier> ::= X | Y | Z | <expression>
<expression> ::= <term> | <expression> + <term> | <identifier>
<term> ::= <identifier> | <term> * <identifier>
<identifier> ::= X | Y | Z | <expression>
if ( exp ) if ( exp ) stmt else stmtif you have the grammar:
stmt : IF '(' exp ')' stmt ELSE stmt
| IF '(' exp ')' stmt
Let's now solve this my creating separate nonterminals for if-statements with a paired else and without. Let a "matched if" be one in which an else is matched with the the most recent unmatched if. Let an "unmatched if" be an if with no matching else. This gives us 6 possible cases for an if-statement whose then-part or else-part may also be an if-statement. They are:
IF '(' <exp> ')' <matched> // 1
IF '(' <exp> ')' <unmatched> // 2
IF '(' <exp> ')' <matched> ELSE <matched> // 3
IF '(' <exp> ')' <matched> ELSE <unmatched> // 4
IF '(' <exp> ')' <unmatched> ELSE <matched> // 5
IF '(' <exp> ')' <unmatched> ELSE <unmatched> // 6
To have our rule of most recent unmatched if matches the else we must discard rules 5 and 6.
This gives us a grammar like:
<stmts> ::= <stmts> <stmt> | <stmt>
<stmt> ::= <matched> | <unmatched>
<matched> ::= IF '(' <exp> ')' <matched> ELSE <matched>
| <other-stmts>
<unmatched> ::= IF '(' <exp> ')' <matched>
| IF '(' <exp> ')' <unmatched>
| IF '(' <exp> ')' <matched> ELSE <unmatched>
Here is a common mistake with the famous dangling else grammar:
<stmts> ::= <stmts> <stmt> | <stmt>
<stmt> ::= <matched> | <unmatched> | <other-stmts>
<matched> ::= IF '(' <exp> ')' <matched> ELSE <matched>
| <other-stmts>
<unmatched> ::= IF '(' <exp> ')' <matched>
| IF '(' <exp> ')' <unmatched>
| IF '(' <exp> ')' <matched> ELSE <unmatched>
In this case there is more than one way to get to <other-stmts>
Here is another common mistake with the dangling else grammar:
<stmts> ::= <stmts> <stmt> | <stmt>
<stmt> ::= <matched> | <unmatched> | <other-stmts>
<matched> ::= IF '(' <exp> ')' <matched> ELSE <matched>
<unmatched> ::= IF '(' <exp> ')' <matched>
| IF '(' <exp> ')' <unmatched>
| IF '(' <exp> ')' <matched> ELSE <unmatched>
Not only is this grammar incomplete it has a trap in it.
Suppose you get to
<matched> or
<unmatched> then you can never escape those two states.
Another problem with grammars is they do not describe all the sentences of language you had in mind. Let's look at that along with the idea that a CFG is context free.
In a context free language you can make any substitution for a left-hand-side nonterminal suggested by any production without regard to the context in which the nonterminal occurs. For example
<dogs> ::= corgi | othermeans that anywhere a dog nonterminal occurs you can replace it with a corgi or an other. Let's use this mathematical property of grammars to see if there is a difference between two languages specified by these two grammars.
What is the difference between L_a and L_b?
L_a:
<sentence> ::= <exp>
<exp> ::= <term> | <exp> + <term>
<term> ::= <id> | <term> * <id>
<id> ::= X | Y | Z | ( <exp> )
L_b:
<sentence> ::= <exp>
<exp> ::= <term> | <exp> + <term>
<term> ::= <id> | <term> * <id> | ( <exp> )
<id> ::= X | Y | Z
Note that by using the rule of context freeness we can transform the
grammar of L_a to:
<sentence> ::= <exp>
<exp> ::= <term> | <exp> + <term>
<term> ::= <id> | <term> * <id> | (<exp>) | <term> * (<exp>)
<id> ::= X | Y | Z
so this adds
Here is another common mistake with the dangling else grammar:
X*(Y)
Therefore X*(Y)
is in L_a but not in L_b!!!
<stmts> ::= <stmts> <stmt> | <stmt>
<stmt> ::= <matched> | <unmatched> | <other-stmts>
<matched> ::= IF '(' <exp> ')' <matched> ELSE <matched>
<unmatched> ::= IF '(' <exp> ')' <matched>
| IF '(' <exp> ')' <unmatched>
| IF '(' <exp> ')' <matched> ELSE <unmatched>
Not only is this grammar incomplete it has a trap in it.
Suppose you get to
<matched> or
<unmatched> then you can never escape those two states.
Tips for grammar writing
<list>::= <list> dogs | "(" <list> ")" | if <list> then dogs | my <list>
can never get rid of the nonterminal since any replacement has a
<list> in it.
A fix would be change the language:
<list>::= <list> dogs | "(" <list> ")" | if <list> then dogs | <other>
or
<list>::= <list> dogs | "(" <list> ")" | if <list> then dogs | cats
<num> ::= <digit> <num> | <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Here we add a production for the signed number:
<signed> ::= + <num> | - <num> | <num> <--- added
<num> ::= <digit> <num> | <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Here we add a production for putting a nonzero digit
at the beginning. How was 0 handled?
<signed> ::= + <num> | - <nznum> | <nznum>
<nznum> ::= <nzdigit> <num> | <digit> <--- added
<num> ::= <digit> <num> | <digit>
<digit> ::= 0 | <nzdigit>
<nzdigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Extended BNF
Extended BNF tries to make lists and optional arguments easy.
Extended BNF uses metachars [], {}, (), *, +
Note: sometimes (as in our book) {} means {}*.
EBNF often says nothing about precedence or associativity and
this is often defined with text about the operators. Here are
a bunch of examples:
Robert Heckendorn
Up One Level
Last updated: