PL: Parsing of PLs • Jonathan Cook

From old CS 471 notes…

Parsing the Syntax of Programming Languages

The Problem

We know that context free grammars are nice abstractions for specifying the syntax of programming languages.

But parsing of arbitrary CFG’s can be up to O(N^3) time!

So, compilers are typically built around a parsing algorithm that will only allow some subset of CFG to be defined.

Two main approaches: top-down or bottom-up parsing algorithms.

Two main sub-CFG grammar classes: LL and LR.

Recall

A lexeme is a small syntactic unit (sequence of characters) that become the “words” in the grammar. Each lexeme group is represented in the grammar by a token

Starting with the start symbol, replacing any one nonterminal with any one of its rule bodies, until all nonterminals are gone, results in a sequence of sentential forms, the last one being one possible sentence in the language defined by the grammar

Example on page 122

Replacement can happen leftmost, rightmost, or in random order

One Top-Down Method: Recursive Descent

Basic idea: every nonterminal in the grammar is turned into a function in the program.

A global variable “NextToken” always has the next token to consume in the input stream. A call to “Lex()” causes the lexer to get the next token.

Parsing begins at the top-level rule, the “Program()” function.

Every nonterminal in a rule is a call to that nonterminal’s parse function (can be recursive).

Where a rule matches a terminal, a check must be made that the existing token matches the expected token. If not, then an error is raised.

When a rule uses up a token, then it must call Lex() to get the next token.

Additional code would be inserted to actually do something with the lexemes as the program is parsed: the basic parsing code just either completes (returns true) if the program is parsed correctly, or aborts (returns false) if the program is syntactically incorrect.

LL Grammars

Recursive descent is an example of a top-down parsing method.

It reads the input stream from Left to right, and produces a Leftmost derivation of the grammar.

Thus it is an LL algorithm.

LL parsing cannot handle CFG rules that are left recursive, either directly or indirectly.

Direct recursion:

E -> E + T | T
T -> T * F | F
F -> (E) | id

Rewrite using rules in Sec 4.4.2:

E -> T E1
E1 -> + T E1 | empty
T -> F T1
T1 -> * F T1 | empty
F -> (E) | id

For indirect recursion, find set of nonterminals that will occur first for each right hand side derivation rule.

If sets are disjoint, then grammar can be left factored to remove indirect left recursion.

A -> aB | bAb | Bb
B -> cB | d

First sets for the three RHS rules of A are {a}, {b}, {c,d}. So this grammar can be left factored.

LR Grammars

Another parsing algorithm takes a bottom-up approach, matching lowest-level nonterminals until finally the whole input matches the topmost nonterminal rule.

LR parsing algorithms take input Left to right, and then produce a Rightmost derivation tree.

This sounds counter-intuitive, because the rightmost nonterminal in a sentential form must be matched.

Textbook’s term definitions:

The handle of a right sentential form is (informally) the thing that must be matched to ensure a rightmost derivation.

A phrase of a right sentential form is some part of the form that can be matched in one or more derivation steps.

A simple phrase is a phrase that can be matched in one step.

The handle of a right sentential form is its leftmost simple phrase.

E => E + T
  => E + T * F
  => E + T * id
  => E + F * id
  => E + id * id
  => T + id * id
  => F + id * id
  => id + id * id

Bottom up parsers use a shift-reduce algorithm.

Either shift and push state info onto a stack;

Or reduce by taking the top of the stack and reducing it down to a higher nonterminal.

Example from textbook, pp194-195:

Methods:

Start with state 0 on stack.
Action “S#” means shift token and state # onto stack.
Action “R#” means reduce top of stack using grammar rule #, then look at (state,nonterminal) pair on top of stack and use the goto rule to put a new state on top of stack.

Grammar with numbered rules:

1. E -> E + T
2. E -> T
3. T -> T * F
4. T -> F
5. F -> ( E )
6. F -> id

State table with actions for terminals and goto states for nonterminals:

Example parse of id + id * id $

State	id	+	*	(	)	$	E	T	F
0	S5	-	-	S4	-	-	1	2	3
1	-	S6	-	-	-	accept	-	-	-
2	-	R2	S7	-	R2	R2	-	-	-
3	-	R4	R4	-	R4	R4	-	-	-
4	S5	-	-	S4	-	-	8	2	3
5	-	R6	R6	-	R6	R6	-	-	-
6	S5	-	-	S4	-	-	-	9	3
7	S5	-	-	S4	-	-	-	-	10
8	-	S6	-	-	S11	-	-	-	-
9	-	R1	S7	-	R1	R1	-	-	-
10	-	R3	R3	-	R3	R3	-	-	-
11	-	R5	R5	-	R5	R5	-	-	-

Stack	Input	Action
0	id + id * id $	Shift 5
0 id 5	+ id * id $	Reduce 6 (goto 0,F)
0 F 3	+ id * id $	Reduce 4 (goto 0,T)
0 T 2	+ id * id $	Reduce 2 (goto 0,E)
0 E 1	+ id * id $	Shift 6
0 E 1 + 6	id * id $	Shift 5
0 E 1 + 6 id 5	* id $	Reduce 6 (goto 6,F)
0 E 1 + 6 F 3	* id $	Reduce 4 (goto 6,T)
0 E 1 + 6 T 9	* id $	Shift 7
0 E 1 + 6 T 9 * 7	id $	Shift 5
0 E 1 + 6 T 9 * 7 id 5	$	Reduce 6 (goto 7,F)
0 E 1 + 6 T 9 * 7 F 10	$	Reduce 3 (goto 6,T)
0 E 1 + 6 T 9	$	Reduce 1 (goto 0,E)
0 E 1	$	Accept