Dr. J's Compiler Construction Lecture Notes
lecture #1 began here
Why study compilers?
Most CS students do not go on to write a commercial compiler someday, but
that's not why we study compilers. We study compiler construction for the
following reasons:
- Writing a compiler gives experience with large-scale
applications development. Your compiler program may be the largest
program you write as a student. Experience working with really big
data structures and complex interactions between algorithms will
help you out on your next big programming project.
- Compiler writing is one of the shining triumphs of CS theory.
It demonstrates the value of theory over the impulse to just "hack up"
a solution.
- Compiler writing is a basic element of programming language research.
Many language researchers write compilers for the languages they design.
- Many applications have similar properties to one or more phases of
a compiler, and compiler expertise and tools can help an application
programmer working on other projects besides compilers.
CS 370 is labor intensive. Famous computer scientist Dan Berry of the
University of Waterloo has argued convincingly that there is no software
development method for writing large programs that doesn't involve pain:
pain is inevitable in software development (Berry's Theorem). From my own
experience as a student, I postulate Jeffery's Corollary: there is no way
to learn the skills necessary for writing big programs without pain.
A good CS course includes pain, and teaches pain management and minimization.
The questions we should ask, then, are: (a) should CS majors be required
to spend a lot of time becoming really good programmers? and (b) are we
providing students with the assistance and access to the tools and
information they need to accomplish their goals with the minimal doses
of inevitable pain that are required?
Some Tools we will use
Labs and lectures will discuss all of these, but if you do not know them
already, the sooner you go learn them, the better.
- C and "make".
- If you are not expert with these yet, you will be a lot closer
by the time you pass this class.
- lex and yacc
- These are compiler writers' tools, but they are useful for other
kinds of applications; almost anything with a complex file format
to read in can benefit from them.
- gdb
- If you do not know a source-level debugger well, start learning.
You will need one to survive this class.
- e-mail
- Regularly e-mailing your instructor is a crucial part of class
participation. If you aren't asking questions, you aren't doing
your job as a student.
- web
- This is where you get your lecture notes, homeworks, and labs,
and turn in all your work.
- virtual environment
- We have a 3D video game / chat tool available that can help us
handle questions when one of us is not on campus.
Compilers - What Are They and What Kinds of Compilers are Out There?
The purpose of a compiler is: to translate a program in some language (the
source language) into a lower-level language (the target
language). The compiler itself is written in some language, called
the implementation language. To write a compiler you have to be
very good at programming in the implementation language, and have to
think about and understand the source language and target language.
There are several major kinds of compilers:
- Native Code Compiler
- Translates source code into hardware (assembly or machine code)
instructions. Example: gcc.
- Virtual Machine Compiler
- Translates source code into an abstract machine code, for execution
by a virtual machine interpreter. Example: javac.
- JIT Compiler
- Translates virtual machine code to native code. Operates within
a virtual machine. Example: Sun's HotSpot Java virtual machine.
- Preprocessor
- Translates source code into simpler or slightly lower level source code,
for compilation by another compiler. Examples: cpp, m4.
- Pure interpreter
- Executes source code on the fly, without generating machine code.
Example: Lisp.
Phases of a Compiler
- Lexical Analysis:
- Converts a sequence of characters into words, or tokens
- Syntax Analysis:
- Converts a sequence of tokens into a parse tree
- Semantic Analysis:
- Manipulates parse tree to verify symbol and type information
- Intermediate Code Generation:
- Converts parse tree into a sequence of intermediate code instructions
- Optimization:
- Manipulates intermediate code to produce a more efficient program
- Final Code Generation:
- Translates intermediate code into final (machine/assembly) code
Example of the Compilation Process
Consider the example statement; its translation to machine code
illustrates some of the issues involved in compiling.
position = initial + rate * 60
30 or so characters, from a single line of source code, are first
transformed by lexical analysis into a sequence of 7 tokens. Those
tokens are then used to build a tree of height 4 during syntax analysis.
Semantic analysis may transform the tree into one of height 5, that
includes a type conversion necessary for real addition on an integer
operand. Intermediate code generation uses a simple traversal
algorithm to linearize the tree back into
a sequence of machine-independent three-address-code instructions.
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Optimization of the intermediate code allows the four instructions to
be reduced to two machine-independent instructions. Final code generation
might implement these two instructions using 5 machine instructions, in
which the actual registers and addressing modes of the CPU are utilized.
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
lecture #2 began here
Announcements
Reading!
I hope you have already been reading! Make sure you read the class lecture
notes, the related sections of the text, and please ask questions about
whatever is not totally clear. You can ask questions in class,
via e-mail, in the virtual environment, or on the
class message board.
Note: although last year's CS 370 lecture notes are ALL available to you up
front, I generally revise each lecture's notes, making additions,
corrections and adaptations to this year's homeworks, the night before each
lecture. The best time to print hard copies of the lecture notes is one
day at a time, right before the lecture is given.
Overview of Lexical Analysis
A lexical analyzer, also called a scanner, typically has the
following functionality and characteristics.
What is a "token" ?
In compilers, a "token" is:
- a single word of source code input (a.k.a. "lexeme")
- an integer code that refers to a single word of input
- a set of lexical attributes computed from a single word of input
Programmers think about all this in terms of #1. Syntax checking uses
#2. Error reporting, semantic analysis, and code generation require #3. In
a compiler written in C, you allocate a C struct for each token to store
#3.
Worth Mentioning
Here are the names of several important tools closely related to
compilers. You should learn any of these terms that you do not already know.
- interpreter
- a language processor program that translates and executes source
code directly, without compiling it to machine code.
- assembler
- a translator from human readable (ASCII text) files of machine
instructions into the actual binary code (object files) of a machine.
- linker
- a program that combines (multiple) object files to make an executable.
Converts names of variables and functions to numbers (machine addresses).
- loader
- Program to load code. On some systems, different executables start at
different base addresses, so the loader must patch the executable with
the actual base address of the executable.
- preprocessor
- Program that processes the source code before the compiler sees it.
Usually, it implements macro expansion, but it can do much more.
- editor
- Editors may operate on plain text, or they may be wired into the rest
of the compiler, highlighting syntax errors as you go, or allowing
you to insert or delete entire syntax constructs at a time.
- debugger
- Program to help you see what's going on when your program runs.
Can print the values of variables, show what procedure called what
procedure to get where you are, run up to a particular line, run
until a particular variable gets a special value, etc.
- profiler
- Program to help you see where your program is spending its time, so
you can tell where you need to speed it up.
Auxiliary data structures
You were presented with the phases of the compiler, from lexical and syntax
analysis, through semantic analysis, and intermediate and final code
generation. Each phase has an input and an output to the next phase.
But there are a few data structures
we will build that survive across multiple phases: the literal table,
the symbol table, and the error handler.
- literal (lexeme) table
- a table that stores lexeme values, such as strings and variable
names, that may occur in many places. Only one copy of each
unique string and name needs to be allocated in memory.
- symbol table
- a table that stores the names defined (and visible) within each
particular scope. Scopes include: global, and procedure (local).
More advanced languages have more scopes such as class (or record)
and package.
- error handler
- errors in lexical, syntax, or semantic analysis all need a common
reporting mechanism, that shows where the error occurred (filename,
line number, and maybe column number are useful).
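A common reporting mechanism can be as small as one function that every phase calls. The following is a minimal sketch, not a required design; the names format_error and error_count are hypothetical, and the "file:line:column: message" layout is just one reasonable assumption about what the output should look like.

```c
#include <assert.h>
#include <stdio.h>

int error_count = 0;       /* total errors reported so far, all phases */

/* format_error: hypothetical helper. Writes "file:line:column: message"
 * into buf and returns the value returned by snprintf(). */
int format_error(char *buf, size_t size, const char *filename,
                 int line, int column, const char *message)
{
    assert(buf != NULL && filename != NULL && message != NULL);
    error_count++;
    return snprintf(buf, size, "%s:%d:%d: %s",
                    filename, line, column, message);
}
```

A lexical, syntax, or semantic error would each fill in the location from the offending token and then print buf to stderr.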
Reading Named Files in C using stdio
In this class you are opening and reading files. Hopefully this is review
for you; if not, you will need to learn it quickly. To do any "standard
I/O" file processing, you start by including the header:
#include <stdio.h>
This defines a data type (FILE *) and gives prototypes for
relevant functions. The following code opens a file using a string filename,
reads the first character (into an int variable, not a char, so that it can
detect end-of-file; EOF is not a legal char value).
FILE *f = fopen(filename, "r");
int i = fgetc(f);
if (i == EOF) /* empty file... */
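The fragment above omits the check that fopen() succeeded. Here is a slightly fuller sketch of the same idea, assuming all you want is a character count; the function name count_chars is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* count_chars: hypothetical helper, not part of any assignment.
 * Returns the number of characters in the named file,
 * or -1 if the file cannot be opened. */
long count_chars(const char *filename)
{
    FILE *f;
    int c;              /* int, not char, so EOF can be detected */
    long n = 0;

    assert(filename != NULL);
    f = fopen(filename, "r");
    if (f == NULL)
        return -1;      /* nonexistent or unreadable file */
    while ((c = fgetc(f)) != EOF)
        n++;
    fclose(f);          /* every successful fopen() gets an fclose() */
    return n;
}
```

Calling count_chars() on a nonexistent file returns -1 instead of crashing inside fgetc().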
Command line argument handling and file processing in C
The following example is from Kernighan & Ritchie's "The C Programming
Language", page 162.
#include <stdio.h>
/* cat: concatenate files, version 1 */
int main(int argc, char *argv[])
{
    FILE *fp;
    void filecopy(FILE *, FILE *);
    if (argc == 1)
        filecopy(stdin, stdout);
    else
        while (--argc > 0)
            if ((fp = fopen(*++argv, "r")) == NULL) {
                printf("cat: can't open %s\n", *argv);
                return 1;
            }
            else {
                filecopy(fp, stdout);
                fclose(fp);
            }
    return 0;
}
void filecopy(FILE *ifp, FILE *ofp)
{
    int c;
    while ((c = getc(ifp)) != EOF)
        putc(c, ofp);
}
Warning: while using and adapting the above code is fair game in this
class, the yylex() function is very different than the filecopy() function!
It takes no parameters! It returns an integer every time it finds a token!
So if you "borrow" from this example, delete filecopy() and write yylex()
from scratch. Multiple students have fallen into this trap before you.
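To make the contrast concrete, here is a rough sketch of the shape a hand-written yylex() takes: no parameters, input through the global yyin, the lexeme left in the global yytext, and one integer returned per token. The category codes WORD and OTHER and the helper count_tokens() are invented for this sketch, and the token rules are deliberately simplistic; your assignment's rules will differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define WORD  257          /* hypothetical category codes */
#define OTHER 258

FILE *yyin;                /* set before the first call to yylex() */
char yytext[256];          /* lexeme of the most recent token */

/* yylex: no parameters; returns one category code per token, 0 at EOF. */
int yylex(void)
{
    int c, n = 0;

    assert(yyin != NULL);
    while ((c = getc(yyin)) != EOF && isspace(c))
        ;                              /* skip any amount of whitespace */
    if (c == EOF)
        return 0;
    yytext[n++] = (char)c;
    if (isalpha(c)) {
        while ((c = getc(yyin)) != EOF && isalnum(c))
            if (n < (int)sizeof(yytext) - 1)
                yytext[n++] = (char)c;
        if (c != EOF)
            ungetc(c, yyin);           /* one character of lookahead */
        yytext[n] = '\0';
        return WORD;
    }
    yytext[n] = '\0';                  /* any other single character */
    return OTHER;
}

/* count_tokens: the loop main() would run after opening the file. */
int count_tokens(FILE *in)
{
    int count = 0;

    yyin = in;
    while (yylex() != 0)
        count++;
    return count;
}
```

In a real main(), the fopen() call from the cat example stays, and the call to filecopy() is replaced by a loop like the one in count_tokens().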
A Brief Introduction to Make
It is not a good idea to write a large program like a compiler as a single
source file. For one thing, every time you make a small change, you would
need to recompile the whole program, which will end up being many thousands
of lines. For another thing, parts of your compiler may be generated by
"compiler construction tools" which will write separate files. In any case,
this class will require you to use multiple source files, compiled
separately, and linked together to form your executable program. This
would be a pain, except we have "make" which takes care of it for us.
Make uses an input file named "makefile", which stores in ASCII text form
a collection of rules for how to build a program from its pieces. Each
rule shows how to build a file from its source files, or dependencies.
For example, to compile a file under C:
foo.o : foo.c
	gcc -c foo.c
The first line says that to build foo.o you need foo.c, and the second line,
which must begin with a tab, gives a command line to
execute whenever foo.o should be rebuilt, i.e. when it is missing or
when foo.c has been changed and needs to be recompiled.
The first rule in the makefile is what "make" builds by default, but
note that make dependencies are recursive: before it checks whether
it needs to rebuild foo.o from foo.c it will check whether foo.c needs
to be rebuilt using some other rule. Because of this post-order
traversal of the "dependency graph", the first rule in your makefile
is usually the last one that executes when you type "make". For a
C program, the first rule in your makefile would usually be the
"link" step that assembles object files into an executable as in:
compiler: foo.o bar.o baz.o
	gcc -o compiler foo.o bar.o baz.o
There is a lot more to "make" but we will take it one step at a time.
This article on Make may be useful to you. You can find other useful on-line
documentation on "make" (manual page, Internet reference guides, etc) if you look.
A couple finer points for HW#1
extern vs. #include: when do you use the one, when the other?
public interface to yylex(): no, you can't add your own parameters
Regular Expressions
The notation we use to precisely capture all the variations that a given
category of token may take is called "regular expressions" (or, less
formally, "patterns"; the word "pattern" is really vague, and there are
lots of other notations for patterns besides regular expressions).
Regular expressions are a shorthand notation
for sets of strings. In order to even talk about "strings" you have
to first define an alphabet, the set of characters which can
appear.
- Epsilon (ε) is a regular expression denoting the set
containing the empty string
- Any letter in the alphabet is also a regular expression denoting
the set containing a one-letter string consisting of that letter.
- For regular expressions r and s,
r | s
is a regular expression denoting the union of r and s
- For regular expressions r and s,
r s
is a regular expression denoting the set of strings consisting of
a member of r followed by a member of s
- For regular expression r,
r*
is a regular expression denoting the set of strings consisting of
zero or more occurrences of r.
- You can parenthesize a regular expression to specify operator
precedence (otherwise, alternation is like plus, concatenation
is like times, and closure is like exponentiation)
Although these operators are sufficient to describe all regular languages,
in practice everybody uses extensions:
- For regular expression r,
r+
is a regular expression denoting the set of strings consisting of
one or more occurrences of r. Equivalent to rr*
- For regular expression r,
r?
is a regular expression denoting the set of strings consisting of
zero or one occurrence of r. Equivalent to r|ε
- The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z.
[^abc] is short for: any character other than a, b, or c.
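If you want to experiment with these operators before we get to lex, one option is the POSIX regcomp()/regexec() functions from <regex.h>. The sketch below is only for experimenting; the helper name matches() is hypothetical, and POSIX extended syntax is close to, but not identical to, the notation above (for instance it has no symbol for ε, so r? stands in for r|ε, and ^ and $ are used here to force a match of the whole string).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* matches: hypothetical helper. Returns 1 if string s matches the POSIX
 * extended regular expression re, 0 if it does not, and -1 if re does
 * not compile. Anchor re with ^ and $ to require a whole-string match. */
int matches(const char *re, const char *s)
{
    regex_t compiled;
    int result;

    assert(re != NULL && s != NULL);
    if (regcomp(&compiled, re, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    result = (regexec(&compiled, s, 0, NULL, 0) == 0);
    regfree(&compiled);
    return result;
}
```

For example, matches("^[a-z]+$", "abc") is 1, while matches("^[a-z]+$", "") is 0 because r+ requires at least one occurrence.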
lecture #3 began here
What is a "lexical attribute" ?
A lexical attribute is a piece of information about a token. These typically
include:
| category | an integer code used to check syntax |
| lexeme | actual string contents of the token |
| line, column, file | where the lexeme occurs in source code |
| value | for literals, the binary data they represent |
Avoid These Common Bugs in Your Homeworks!
- yytext or yyinput were not declared global
- main() does not have its required argc, argv parameters!
- main() does not call yylex() in a loop or check its return value
- getc() EOF handling is missing or wrong! check EVERY call to getc() for EOF!
- opened files not (all) closed! file handle leak!
- end-of-comment code doesn't check for */
- yylex() is not doing the file reading
- yylex() does not skip multiple spaces, mishandles spaces at the front
of input, or requires certain spaces in order to function OK
- extra or bogus output not in assignment spec
- = instead of ==
Some Regular Expression Examples
In a previous lecture we saw regular expressions, the preferred notation for
specifying patterns of characters that define token categories. The best
way to get a feel for regular expressions is to see examples. Note that
regular expressions form the basis for pattern matching in many UNIX tools
such as grep, awk, perl, etc.
What is the regular expression for each of the different lexical items that
appear in C programs? How does this compare with another, possibly simpler
programming language such as BASIC?
| lexical category | BASIC | C |
| operators | the characters themselves | For operators that are also regular expression operators, we need to mark them with double quotes or backslashes to indicate that we mean the character, not the regular expression operator. Note that several operators have a common prefix; the lexical analyzer needs to look ahead to tell whether an = is an assignment or is followed by another =, for example. |
| reserved words | the concatenation of characters; case insensitive | Reserved words are also matched by the regular expression for identifiers, so a disambiguating rule is needed. |
| identifiers | no _; $ at ends of some; 2 significant letters!?; case insensitive | [a-zA-Z_][a-zA-Z0-9_]* |
| numbers | ints and reals, starting with [0-9]+ | 0x[0-9a-fA-F]+ etc. |
| comments | REM.* | C's comments are tricky regexps |
| strings | almost ".*"; no escapes | escaped quotes |
| what else? | | |
lex(1) and flex(1)
These programs generally take a lexical specification given in a .l file
and create a corresponding C language lexical analyzer in a file named
lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.
The C code generated by lex has the following public interface. Note the
use of global variables instead of parameters, and the use of the prefix
yy to distinguish scanner names from your program names. This prefix is
also used in the YACC parser generator.
FILE *yyin; /* set this variable prior to calling yylex() */
int yylex(); /* call this function once for each token */
char yytext[]; /* yylex() writes the token's lexeme to an array */
/* note: with flex, I believe extern declarations must read
extern char *yytext;
*/
int yywrap(); /* called by lex when it hits end-of-file; see below */
The .l file format consists of a mixture of lex syntax and C code fragments.
The percent sign (%) is used to signify lex elements. The whole file is
divided into three sections separated by %%:
header
%%
body
%%
helper functions
The header consists of C code fragments enclosed in %{ and %} as well as
macro definitions consisting of a name and a regular expression denoted
by that name. lex macros are invoked explicitly by enclosing the
macro name in curly braces. Following are some example lex macros.
letter [a-zA-Z]
digit [0-9]
ident {letter}({letter}|{digit})*
The body consists of a sequence of regular expressions for different
token categories and other lexical entities. Each regular expression can
have a C code fragment enclosed in curly braces that executes when that
regular expression is matched. For most of the regular expressions this
code fragment (also called a semantic action) consists of returning
an integer that identifies the token category to the rest of the compiler,
particularly for use by the parser to check syntax. Some typical regular
expressions and semantic actions might include:
" " { /* no-op, discard whitespace */ }
{ident} { return IDENTIFIER; }
"*" { return ASTERISK; }
"." { return PERIOD; }
You also need regular expressions for lexical errors such as unterminated
character constants, or illegal characters.
The helper functions in a lex file typically compute lexical attributes,
such as the actual integer or string values denoted by literals. One
helper function you have to write is yywrap(), which is called when lex
hits end of file. If you just want lex to quit, have yywrap() return 1.
If your yywrap() switches yyin to a different file and you want lex to continue
processing, have yywrap() return 0. The lex and flex libraries (-ll or -lfl)
have a default yywrap() function that returns 1, and flex has the directive
%option noyywrap which allows you to skip writing this function.
A Short Comment on Lexing C Reals
C float and double constants have to have at least one digit, either
before or after the required decimal. This is a pain (note that the decimal
point must be quoted, since an unquoted . matches any character):
([0-9]+"."[0-9]*|[0-9]*"."[0-9]+) ...
You might almost be happier if you wrote
([0-9]*"."[0-9]*) { return (strcmp(yytext,".")) ? REAL : PERIOD; }
You-all know C's ternary e1 ? e2 : e3 operator, don't ya? It's an if-then-else
expression, very slick.
Lex extended regular expressions
Lex further extends the regular expressions with several helpful operators.
Lex's regular expressions include:
- c
- normal characters mean themselves
- \c
- backslash escapes remove the meaning from most operator characters.
Inside character sets and quotes, backslash performs C-style escapes.
- "s"
- Double quotes mean to match the C string given as itself.
This is particularly useful for multi-byte operators and may be
more readable than using backslash multiple times.
- [s]
- This character set operator matches any one character among those in s.
- [^s]
- A negated-set matches any one character not among those in s.
- .
- The dot operator matches any one character except newline: [^\n]
- r*
- match r 0 or more times.
- r+
- match r 1 or more times.
- r?
- match r 0 or 1 time.
- r{m,n}
- match r between m and n times.
- r1r2
- concatenation. match r1 followed by r2
- r1|r2
- alternation. match r1 or r2
- (r)
- parentheses specify precedence but do not match anything
- r1/r2
- lookahead. match r1 when r2 follows, without
consuming r2
- ^r
- match r only when it occurs at the beginning of a line
- r$
- match r only when it occurs at the end of a line
lecture #4 began here
Announcements
Next homework I promise: I will ask the TA to run your program with a
nonexistent file as a command-line argument!
Lexical Attributes and Token Objects
Besides the token's category, the rest of the compiler may need several
pieces of information about a token in order to perform semantic analysis,
code generation, and error handling. These are stored in an object instance
of class Token, or in C, a struct. The fields are generally something like:
struct token {
    int category;
    char *text;
    int linenumber;
    int column;
    char *filename;
    union literal value;
};
The union literal will hold computed values of integers, real numbers, and
strings. In your homework assignment, I am requiring you to compute
column #'s; not all compilers require them, but they are easy. Also: in
our compiler project we are not worrying about optimizing our use of memory,
so I am not requiring you to use a union.
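Here is one way allocating and filling in such a struct might look. It is a sketch under assumptions, not the required design: the names new_token() and free_token() are hypothetical, and a plain int field ival stands in for the union since the union is optional.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct token {
    int category;
    char *text;        /* private copy of the lexeme */
    int linenumber;
    int column;
    char *filename;    /* shared with other tokens, not copied */
    int ival;          /* stands in for the union; for integer literals */
};

/* new_token: hypothetical constructor; returns NULL if memory runs out. */
struct token *new_token(int category, const char *text,
                        int line, int column, char *filename)
{
    struct token *t;

    assert(text != NULL);
    t = malloc(sizeof *t);             /* size of the struct, not of a pointer */
    if (t == NULL)
        return NULL;
    t->text = malloc(strlen(text) + 1);
    if (t->text == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->text, text);             /* yytext is overwritten by the next token */
    t->category = category;
    t->linenumber = line;
    t->column = column;
    t->filename = filename;
    t->ival = 0;
    return t;
}

/* free_token: releases a token and its copy of the lexeme. */
void free_token(struct token *t)
{
    if (t != NULL) {
        free(t->text);
        free(t);
    }
}
```

The copy of the lexeme matters: storing the yytext pointer itself would leave every token pointing at whatever the scanner matched last.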
Flex Manpage Examplefest
To read a UNIX "man page", or manual page, you type "man command"
where command is the UNIX program or library function you need information
on. Read the man page for man to learn more advanced uses ("man man").
It turns out the flex man page is intended to be pretty complete, enough
so that we can draw our examples from it. Perhaps what you should figure
out from these examples is that flex is actually... flexible. The first
several examples use flex as a filter from standard input to standard
output.
- Line Counter/Character Counter
%{
int num_lines = 0, num_chars = 0;
%}
%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;
%%
int main()
{
    yylex();
    printf( "# of lines = %d, # of chars = %d\n",
            num_lines, num_chars );
    return 0;
}
- Toy compiler example
/* scanner for a toy Pascal-like language */
%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ {
printf( "An integer: %s (%d)\n", yytext,
atoi( yytext ) );
}
{DIGIT}+"."{DIGIT}* {
printf( "A float: %s (%g)\n", yytext,
atof( yytext ) );
}
if|then|begin|end|procedure|function {
printf( "A keyword: %s\n", yytext );
}
{ID} printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
"{"[^}\n]*"}" /* eat up one-line comments */
[ \t\n]+ /* eat up whitespace */
. printf( "Unrecognized character: %s\n", yytext );
%%
int main( int argc, char **argv )
{
    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;
    yylex();
    return 0;
}
On the use of character sets (square brackets) in lex and similar tools
A student recently sent me an example regular expression for comments that read:
COMMENT [/*][[^*/]*[*]*]]*[*/]
One problem here is that square brackets are not parentheses, they do not nest,
they do not support concatenation or other regular expression operators. They
mean exactly: "match any one of these characters" or for ^: "match any one
character that is not one of these characters". Note also that you
can't use ^ as a "not" operator outside of square brackets: you
can't write the expression for "stuff that isn't */" by saying (^ "*/").
lecture #5 began here
Finite Automata
A finite automaton (FA) is an abstract, mathematical machine, also known as a
finite state machine, with the following components:
- A set of states S
- A set of input symbols E (the alphabet)
- A transition function move(state, symbol) : new state(s)
- A start state S0
- A set of final states F
The word finite refers to the set of states: there is a fixed size
to this machine. No "stacks", no "virtual memory", just a known number of
states. The word automaton refers to the execution mode: there is
no instruction set, there is no sequence of instructions, there is just a
hardwired short loop that executes the same instruction over and over:
while ((c = getchar()) != EOF) S = move(S, c);
DFAs
The type of finite automaton that is easiest to understand and simplest to
implement (say, even in hardware) is called a deterministic finite automaton
(DFA). The word deterministic here refers to the return value of
function move(state, symbol), which goes to at most one state.
Example:
S = {s0, s1, s2}
E = {a, b, c}
move = { (s0,a):s1; (s1,b):s2; (s2,c):s2 }
S0 = s0
F = {s2}
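The example automaton can be written out in C almost line for line. This is a minimal sketch; the function names dfa_move() and dfa_accepts() and the DEAD marker for "no transition" are choices made for this illustration, not part of the formal definition.

```c
#include <assert.h>
#include <stddef.h>

enum { S0, S1, S2, DEAD = -1 };

/* dfa_move: the transition function of the example above;
 * DEAD means there is no transition on that symbol. */
static int dfa_move(int state, int symbol)
{
    if (state == S0 && symbol == 'a') return S1;
    if (state == S1 && symbol == 'b') return S2;
    if (state == S2 && symbol == 'c') return S2;
    return DEAD;
}

/* dfa_accepts: returns 1 if the DFA is in the final state s2 after
 * reading all of input, 0 otherwise. */
int dfa_accepts(const char *input)
{
    int state = S0;                    /* start state */

    assert(input != NULL);
    for (; *input != '\0'; input++) {
        state = dfa_move(state, (unsigned char)*input);
        if (state == DEAD)
            return 0;
    }
    return state == S2;                /* F = {s2} */
}
```

This machine accepts exactly the strings described by the regular expression abc*, so dfa_accepts("abcc") is 1 and dfa_accepts("a") is 0.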
Finite automata correspond in a 1:1 relationship to transition diagrams;
from any transition diagram one can write down the formal automaton in
terms of items #1-#5 above, and vice versa. To draw the transition diagram
for a finite automaton:
- draw a circle for each state s in S; put a label inside the circles
to identify each state by number or name
- draw an arrow between Si and Sj, labeled with x
whenever the transition says to move(Si, x) : Sj
- draw a "wedgie" into the start state S0 to identify it
- draw a second circle inside each of the final states in F
The Automaton Game
If I give you a transition diagram of a finite automaton, you can hand-simulate
the operation of that automaton on any input I give you.
DFA Implementation
The nice part about DFA's is that they are efficiently implemented
on computers. What DFA does the following code correspond to? What
is the corresponding regular expression? You can speed this code
fragment up even further if you are willing to use goto's or write
it in assembler.
state = 0; input = getchar();
for(;;)
    switch (state) {
    case 0:
        switch (input) {
        case 'a': state = 1; input = getchar(); break;
        case 'b': input = getchar(); break;
        default: printf("dfa error\n"); exit(1);
        }
        break;
    case 1:
        switch (input) {
        case EOF: printf("accept\n"); exit(0);
        default: printf("dfa error\n"); exit(1);
        }
    }
Deterministic Finite Automata Examples
A lexical analyzer might associate different final states with different
token categories:
C Comments:
Nondeterministic Finite Automata (NFA's)
Notational convenience motivates more flexible machines in which function
move() can go to more than one state on a given input symbol, and some
states can move to other states even without consuming an input symbol
(ε-transitions).
Fortunately, one can prove that for any NFA, there is an equivalent DFA.
They are just a notational convenience. So, finite automata help us get
from a set of regular expressions to a computer program that recognizes
them efficiently.
NFA Examples
ε-transitions make it simpler to merge automata:
multiple transitions on the same symbol handle common prefixes:
factoring may optimize the number of states. Is this picture OK/correct?
C Pointers, malloc, and your future
For most of you success as a computer scientist may boil down to whether you can
master the concept of dynamically allocated memory. In C this means pointers and
the malloc() family of functions. Here are some tips:
- Draw "memory box" pictures of your variables. Pencil and paper
understanding of memory leads to correct running programs.
- Always initialize local pointer variables. Consider this code:
void f() {
    int i = 0;
    struct tokenlist *current, *head;
    ...
    foo(current);
}
Here, current is passed in as a parameter to foo, but it is a
pointer that hasn't been pointed at anything. I cannot tell you how many
times I personally have written bugs myself or fixed bugs in student code,
caused by reading or writing to pointers that weren't pointing at anything
in particular. Local variables that weren't initialized point at random
garbage. If you are lucky this is a coredump, but you might not be lucky,
you might not find out where the mistake was, you might just get a wrong answer.
This can all be fixed by
struct tokenlist *current = NULL, *head = NULL;
- Avoid this common C bug:
struct token *t = (struct token *)malloc(sizeof(struct token *));
This compiles, but causes coredumps during program execution. Why?
- Check your malloc() return value to be sure it is not NULL.
Sure, modern programs will "never run out of memory". Wrong. malloc() can return
NULL even on big machines. Operating systems often place limits on memory
so as to protect themselves from runaway programs or hacker attacks.
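Putting the tips together, here is a sketch of appending to a token list with initialized pointers, a correctly sized malloc(), and a checked return value. The struct layout and the names tokenlist_append() and tokenlist_free() are assumptions for illustration; your tokenlist will hold a real token rather than a bare category.

```c
#include <assert.h>
#include <stdlib.h>

struct tokenlist {
    int category;               /* stands in for a full token */
    struct tokenlist *next;
};

/* tokenlist_append: adds a node at the end of the list.
 * Returns 0 on success, -1 if malloc() fails. */
int tokenlist_append(struct tokenlist **head, struct tokenlist **tail,
                     int category)
{
    struct tokenlist *node;

    assert(head != NULL && tail != NULL);
    node = malloc(sizeof *node);       /* sizeof the struct, via the pointer */
    if (node == NULL)
        return -1;
    node->category = category;
    node->next = NULL;                 /* never leave a pointer uninitialized */
    if (*head == NULL)
        *head = node;                  /* first node in the list */
    else
        (*tail)->next = node;
    *tail = node;
    return 0;
}

/* tokenlist_free: releases every node in the list. */
void tokenlist_free(struct tokenlist *head)
{
    while (head != NULL) {
        struct tokenlist *next = head->next;
        free(head);
        head = next;
    }
}
```

The caller starts with both pointers set to NULL, as in the fix above, and passes their addresses: tokenlist_append(&head, &current, category).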
Regular expression examples
Can you draw an NFA corresponding to the following?
(a|c)*b(a|c)*
(a|c)*|(a|c)*b(a|c)*
(a|c)*(b|ε)(a|c)*
Regular expressions can be converted automatically to NFA's
Each rule in the definition of regular expressions has a corresponding
NFA; NFA's are composed using ε transitions. This is called
"Thompson's construction". We will work
examples such as (a|b)*abb in class and during lab.
- For ε, draw two states with a single ε transition.
- For any letter in the alphabet,
draw two states with a single transition labeled with that letter.
- For regular expressions r and s, draw r | s
by adding a new start state with ε transitions to the start
states of r and s, and a new final state with ε transitions
from each final state in r and s.
- For regular expressions r and s, draw rs
by adding ε transitions from the final states of r to the
start state of s.
- For regular expression r, draw r*
by adding new start and final states, and ε transitions
- from the new start state to the new final state (zero occurrences),
- from the old final states back to the old start state (repetition),
- from the new start to the old start and from the old final
states to the new final state.
- For parenthesized regular expression (r) you can use the NFA for r.
lecture #6 began here
NFA's can be converted automatically to DFA's
In: NFA N
Out: DFA D
Method: Construct transition table Dtran (a.k.a. the "move function").
Each DFA state is a set of
NFA states. Dtran simulates in parallel all possible moves N can make
on a given string.
Operations to keep track of sets of NFA states:
- ε_closure(s)
- set of states reachable from state s via ε
- ε_closure(T)
- set of states reachable from any state in set T via ε
- move(T,a)
- set of states to which there is an NFA transition from states in T on symbol a
NFA to DFA Algorithm:
Dstates := {ε_closure(start_state)}
while T := unmarked_member(Dstates) do {
    mark(T)
    for each input symbol a do {
        U := ε_closure(move(T,a))
        if not member(Dstates, U) then
            insert(Dstates, U)
        Dtran[T,a] := U
    }
}
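The algorithm above can be sketched in C by representing each set of NFA states as a bitmask. The NFA wired in below is one Thompson-style NFA for (a|b)*abb with states numbered 0 through 10 and final state 10; that numbering, the array names eps and delta, and the bound MAXD are all assumptions of this sketch. "Marked" states are simply those with index below t.

```c
#include <assert.h>

#define NNFA 11            /* NFA states 0..10 */
#define NSYM 2             /* alphabet: 0 means a, 1 means b */
#define MAXD 32            /* generous bound on DFA states for this example */

/* eps[s]: set of states reachable from s by a single ε transition. */
static const unsigned eps[NNFA] = {
    1u<<1 | 1u<<7,  /* 0 */
    1u<<2 | 1u<<4,  /* 1 */
    0,              /* 2 */
    1u<<6,          /* 3 */
    0,              /* 4 */
    1u<<6,          /* 5 */
    1u<<1 | 1u<<7,  /* 6 */
    0, 0, 0, 0      /* 7..10 */
};
/* delta[s][a]: set of states reachable from s on symbol a. */
static const unsigned delta[NNFA][NSYM] = {
    [2] = { 1u<<3, 0 },
    [4] = { 0, 1u<<5 },
    [7] = { 1u<<8, 0 },
    [8] = { 0, 1u<<9 },
    [9] = { 0, 1u<<10 },
};

/* eps_closure: all states reachable from set T via zero or more ε moves. */
static unsigned eps_closure(unsigned T)
{
    unsigned closure = T, old;
    int s;
    do {
        old = closure;
        for (s = 0; s < NNFA; s++)
            if (closure & (1u << s))
                closure |= eps[s];
    } while (closure != old);
    return closure;
}

/* nfa_move: states reachable from any state in T on symbol a. */
static unsigned nfa_move(unsigned T, int a)
{
    unsigned U = 0;
    int s;
    for (s = 0; s < NNFA; s++)
        if (T & (1u << s))
            U |= delta[s][a];
    return U;
}

unsigned Dstates[MAXD];    /* each DFA state is a set of NFA states */
int Dtran[MAXD][NSYM];     /* DFA transition table */
int ndstates;

/* subset_construction: fills Dstates and Dtran; returns the DFA state count. */
int subset_construction(void)
{
    int t, a, u;

    ndstates = 0;
    Dstates[ndstates++] = eps_closure(1u << 0);
    for (t = 0; t < ndstates; t++) {           /* t is the next unmarked state */
        for (a = 0; a < NSYM; a++) {
            unsigned U = eps_closure(nfa_move(Dstates[t], a));
            for (u = 0; u < ndstates && Dstates[u] != U; u++)
                ;                              /* is U already a member? */
            if (u == ndstates) {
                assert(ndstates < MAXD);
                Dstates[ndstates++] = U;
            }
            Dtran[t][a] = u;
        }
    }
    return ndstates;
}
```

A DFA state is final if its set contains NFA state 10, which can be tested with Dstates[i] & (1u << 10).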
Practice converting NFA to DFA
OK, you've seen the algorithm, now can you use it?
...
...did you get:
OK, how about this one:
lecture #7 began here
Some Remarks
- I have a collection of compiler textbooks in my office, which I will
make available as "loaners" from class period to class period, all
you have to do is sign a return contract in blood.
- If you checked out the class web page, you saw a solution to HW#1
was posted awhile ago... I will try to do this for future assignments
also, but not immediately, so as to allow students a few days of
lateness without a heavy penalty.
- Whether we return the same or a different category for integer constants
and for line numbers depends very much on the grammar we use to parse
our language.
Lexical Analysis and the Literal Table
In many compilers, the memory management components of the compiler interact
with several phases of compilation, starting with lexical analysis.
- Efficient storage is necessary to handle large input files.
- There is a colossal amount of duplication in lexical data:
variable names, strings and other literal values duplicate frequently
- What token type to use may depend on previous declarations.
A hash table or other efficient data structure can avoid this duplication.
The software engineering design pattern to use is called the "flyweight".
Major Data Structures in a Compiler
- token
- contains an integer category, lexeme, line #, column #, filename...
We could build these into a linked list, but instead we'll use them
as leaves in a tree structure.
- syntax tree
- contains grammar information about a sequence of related tokens.
leaves contain lexical information (tokens). internal nodes
contain grammar rules and pointers to tokens or other tree nodes.
- symbol table
- contains variable names, types, and information needed to generate
code for a name (such as its address, or constant value). Look ups
are by name, so we'll need a hash table.
- intermediate & final code
- We'll need linked lists or similar structures to hold sequences of machine
instructions
Literal Table: Usage Example
Example abbreviated from [ASU86]: Figure 3.18, p. 109. Use "install_id()"
instead of "strdup()" to avoid duplication in the lexical data.
%{
/* #define's for token categories LT, LE, etc. */
%}
ws [ \t\n]+
digit [0-9]
id [a-zA-Z_][a-zA-Z_0-9]*
num {digit}+(\.{digit}+)?
%%
{ws} { /* discard */ }
if { return IF; }
then { return THEN; }
else { return ELSE; }
{id} { yylval.id = install_id(); return ID; }
{num} { yylval.num = install_num(); return NUMBER; }
"<" { yylval.op = LT; return RELOP; }
">" { yylval.op = GT; return RELOP; }
%%
install_id()
{
/* insert yytext into the literal table */
}
install_num()
{
/* insert (binary number corresponding to?) yytext into the literal table */
}
So how would you implement a literal table using a hash table? We will see
more hash tables when it comes time to construct the symbol tables with which
variable names and scopes are managed, so you had better become fluent.
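One possible answer, as a minimal sketch: a chained hash table that stores each distinct lexeme once and hands back the shared copy. The names intern, struct literal, and NBUCKETS, the bucket count, and the hash function are all assumptions made for illustration; an install_id() could call something like intern(yytext).

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211            /* assumed table size; any modest prime works */

struct literal {
    char *text;                 /* the one stored copy of the lexeme */
    struct literal *next;       /* next entry in the same bucket */
};

static struct literal *buckets[NBUCKETS];

/* Simple multiplicative string hash, reduced to a bucket index. */
static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the shared copy of s, inserting it the first time it is seen.
   Returns NULL only if memory runs out. */
const char *intern(const char *s)
{
    unsigned h = hash(s);
    struct literal *p;

    for (p = buckets[h]; p != NULL; p = p->next)
        if (strcmp(p->text, s) == 0)
            return p->text;     /* already in the table: no new copy */

    p = (struct literal *)malloc(sizeof(struct literal));
    if (p == NULL) return NULL;
    p->text = (char *)malloc(strlen(s) + 1);
    if (p->text == NULL) { free(p); return NULL; }
    strcpy(p->text, s);
    p->next = buckets[h];       /* push onto the front of the bucket chain */
    buckets[h] = p;
    return p->text;
}
```

Because equal lexemes share one pointer, later phases can compare names with == instead of strcmp(). That is the flyweight idea in practice.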
lecture #8 began here
Constructing your Token inside yylex()
A student recently asked if it was OK to allocate a token structure
inside main() after yylex() returns the token. This is not OK because
in the next phase of your compiler you will not be calling yylex() yourself;
the automatically generated parser will call it. There is a way for
the parser to grab your token if you've stored it in a global variable,
but there is not a way for the parser to build the token structure itself.
Syntax Analysis
Parsing is the act of performing syntax analysis to verify an input
program's compliance with the source language. A by-product of this process
is typically a tree that represents the structure of the program.
Context Free Grammars
A context free grammar G has:
- A set of terminal symbols, T
- A set of nonterminal symbols, N
- A start symbol, s, which is a member of N
- A set of production rules of the form A -> w,
where A is a nonterminal and w is a string of terminal and
nonterminal symbols.
A context free grammar can be used to generate strings in the
corresponding language as follows:
let X = the start symbol s
while there is some nonterminal Y in X do
apply any one production rule using Y, e.g. Y -> w
When X consists only of terminal symbols, it is a string of the language
denoted by the grammar. Each iteration of the loop is a
derivation step. If an iteration has several nonterminals
to choose from at some point, the rules of derivation would allow any of these
to be expanded. In practice, parsing algorithms tend to always choose the
leftmost nonterminal, or the rightmost nonterminal, resulting in
leftmost derivations or rightmost derivations.
Context Free Grammar Examples
Well, OK, so how much of the C language grammar can we come up
with in class today? Start with expressions, work on up to statements, and
work from there up to entire functions, and programs.
lecture #9 began here
Dr. Pontelli is looking for a web developer, did everyone see that ad?
I too am looking for student research assistants.
Grammar Ambiguity
The grammar
E -> E + E
E -> E * E
E -> ( E )
E -> ident
allows two different derivations for strings such as "x + y * z".
The grammar is ambiguous, but the semantics of the language dictate
a particular operator precedence that should be used. One way to
eliminate such ambiguity is to rewrite the grammar. For example,
we can force the precedence we want by adding some nonterminals and
production rules.
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> ident
Given the arithmetic expression grammar from last lecture:
How can a program figure that x + y * z is legal?
How can a program figure out that x + y (* z) is illegal?
A brief aside on casting your mallocs
If you don't put a prototype for malloc(), C thinks it returns an int.
#include <stdlib.h>
includes prototypes for malloc(), free(), etc. malloc() returns a void *.
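void * means "pointer that points at nothing", or "pointer that points
at anything". C converts a void * to any other object pointer type
implicitly (C++ does not), but casting it to what you are really pointing
at makes the intent explicit, as in: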
union lexval *l = (union lexval *)malloc(sizeof(union lexval));
Note the stupid duplication of type information; no language is perfect!
Anyhow, always cast your mallocs. The program may work without the cast,
but you need to fix every warning, so you don't accidentally let a serious
one through.
Recursive Descent Parsing
Perhaps the simplest parsing method, for a large subset of context free
grammars, is called recursive descent. It is simple because the algorithm
closely follows the production rules of nonterminal symbols.
- Write 1 procedure per nonterminal rule
- Within each procedure, a) match terminals at appropriate positions,
and b) call procedures for non-terminals.
- Pitfalls:
- left recursion is FATAL
- must distinguish between several
production rules, or potentially, one has to
try all of them via backtracking.
Recursive Descent Parsing Example #1
Consider the grammar we gave above. There will be functions for
E, T, and F. The function for F() is the "easiest" in some sense: based
on a single token it can decide which production rule to use. The
parsing functions return 0 (failed to parse) if the nonterminal in
question cannot be derived from the tokens at the current point.
A nonzero return value of N would indicate success in parsing using
production rule #N.
int F()
{
int t = yylex();
if (t == IDENT) return 6;
else if (t == LP) {
if (E() && (yylex() == RP)) return 5;
}
return 0;
}
Comment #1: if F() is in the middle of a larger parse of E() or T(), F()
may succeed, but the subsequent parsing may fail. The parse may have
to backtrack, which would mean we'd have to be able to put
tokens back for later parsing. Add a memory (say, a gigantic array or
linked list for example) of already-parsed tokens
to the lexical analyzer, plus backtracking logic to E() or T() as needed.
The call to F() may get repeated following a different production rule
for a higher nonterminal.
Comment #2: in a real compiler we need more than "yes it parsed" or
"no it didn't": we need a parse tree if it succeeds, and we need a
useful error message if it didn't.
Question: for E() and T(), how do we know which production rule to try?
Option A: just blindly try each one in turn.
Option B: look at the first (current) token, only try those rules that
start with that token (1 token lookahead). If you are lucky, that
one token will uniquely select a production rule. If that is always
true through the whole grammar, no backtracking is needed.
Question: how do we know which rules start with whatever token we are
looking at? Can anyone suggest a solution, or are we stuck?
lecture #10 began here
Announcements
- Homework #3 minor extension
- Midterm exam: Thursday March 16
- The first midterm exam will cover lexical analysis and syntax analysis
Removing Left Recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | ident
We can remove the left recursion by introducing new nonterminals
and new production rules.
E -> T E'
E' -> + T E' | ε
T -> F T'
T' -> * F T' | ε
F -> ( E ) | ident
Getting rid of such immediate left recursion is not enough; one must
also get rid of indirect left recursion, where two or more nonterminals are
mutually left-recursive.
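One can rewrite a CFG to remove all left recursion (Algorithm 4.1; the
algorithm assumes a grammar with no cycles and no ε productions).
Order the nonterminals A1, A2, ..., An.
for i := 1 to n do begin
for j := 1 to i-1 do begin
replace each production Ai -> Aj γ with productions
Ai -> δ1 γ | δ2 γ | ... | δk γ
where Aj -> δ1 | δ2 | ... | δk are all the current Aj productions
end
eliminate immediate left recursion among the Ai productions
end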
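Once the E/T/F grammar has been rewritten as E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> ( E ) | ident, it maps directly onto recursive descent procedures with no backtracking. The sketch below is a minimal illustration under simplifying assumptions: each token is a single character, an ident is one lowercase letter, and there is no whitespace. The names ip, Eprime, Tprime, and parse_expr are made up for the example.

```c
static const char *ip;          /* current position in the input */

static int E(void);

/* F -> ( E ) | ident */
static int F(void)
{
    if (*ip == '(') {
        ip++;
        if (!E()) return 0;
        if (*ip != ')') return 0;
        ip++;
        return 1;
    }
    if (*ip >= 'a' && *ip <= 'z') { ip++; return 1; }
    return 0;
}

/* T' -> * F T' | epsilon */
static int Tprime(void)
{
    if (*ip == '*') { ip++; return F() && Tprime(); }
    return 1;                   /* epsilon: consume nothing */
}

/* T -> F T' */
static int T(void) { return F() && Tprime(); }

/* E' -> + T E' | epsilon */
static int Eprime(void)
{
    if (*ip == '+') { ip++; return T() && Eprime(); }
    return 1;                   /* epsilon: consume nothing */
}

/* E -> T E' */
static int E(void) { return T() && Eprime(); }

/* Returns 1 if the whole string s is derivable from E, else 0. */
int parse_expr(const char *s)
{
    ip = s;
    return E() && *ip == '\0';
}
```

With this, parse_expr("x+y*z") returns 1 and parse_expr("x+y(*z)") returns 0: after parsing x+y the parser is left looking at '(' with no rule that can consume it.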
Removing Left Recursion, part 2
Left recursion can be broken into three cases
case 1: trivial
A : A α | β
The recursion must always terminate by A finally deriving β so you
can rewrite it to the equivalent
A : β A'
A' : α A' | ε
Example:
E : E op T | T
can be rewritten
E : T E'
E' : op T E' | ε
case 2: non-trivial, but immediate
In the more general case, there may be multiple recursive productions
and/or multiple non-recursive productions.
A : A α1 | A α2 | ... | β1 | β2
As in the trivial case, you get rid of left-recursing A and introduce an A'
A : β1 A' | β2 A' | ...
A' : α1 A' | α2 A' | ... | ε
case 3: mutual recursion
- Order the nonterminals in some order 1 to N.
- Rewrite production rules to eliminate all
nonterminals in leftmost positions that refer to a "previous" nonterminal.
When finished, all productions' right hand symbols start with a terminal
or a nonterminal that is numbered equal or higher than the nonterminal
on the left hand side.
- Eliminate the direct left recursion as per cases 1-2.
Left Recursion Versus Right Recursion: When does it Matter?
A student came to me once with what they described as an operator precedence
problem where 5-4+3 was computing the wrong value (-2 instead of 4). What
it really was, was an associativity problem due to the grammar:
E : T + E | T - E | T
The problem here is that right recursion is forcing right associativity, but
normal arithmetic requires left associativity. Several solutions are:
(a) rewrite the grammar to be left recursive, or (b) rewrite the grammar
with more nonterminals to force the correct precedence/associativity,
or (c) if using YACC or Bison, there are "cheat codes" we will discuss later
to allow it to be majorly ambiguous and specify associativity separately
(look for %left and %right in YACC manuals).
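To see the associativity problem in running code, here is a small sketch that evaluates the same input two ways. It assumes operands are unsigned decimal integers with no whitespace; the function names eval_right and eval_left are made up for this example. The first evaluator follows the right-recursive grammar E : T + E | T - E | T literally; the second uses a loop, which is how a left-recursive rule is usually coded by hand.

```c
#include <ctype.h>

static const char *p;           /* current position in the input */

/* T : an unsigned decimal integer */
static int number(void)
{
    int v = 0;
    while (isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

/* E : T + E | T - E | T      (right recursion => right associative) */
static int expr_right(void)
{
    int left = number();
    if (*p == '+') { p++; return left + expr_right(); }
    if (*p == '-') { p++; return left - expr_right(); }
    return left;
}

/* E : E + T | E - T | T      (coded as a loop => left associative) */
static int expr_left(void)
{
    int value = number();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int right = number();
        value = (op == '+') ? value + right : value - right;
    }
    return value;
}

int eval_right(const char *s) { p = s; return expr_right(); }
int eval_left(const char *s)  { p = s; return expr_left(); }
```

eval_right("5-4+3") computes 5-(4+3) = -2, the student's wrong answer, while eval_left("5-4+3") computes (5-4)+3 = 4.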
Recursive Descent Parsing Example #2
The grammar
S -> A B C
A -> a A
A -> ε
B -> b
C -> c
maps to pseudocode like the following. (:= is an assignment operator)
procedure S()
if A() & B() & C() then succeed # matched S, we win
end
procedure A()
if yychar == a then { # use production 2
yychar := scan()
return A()
}
else
succeed # production rule 3, match ε
end
procedure B()
if yychar == b then {
yychar := scan()
succeed
}
else fail
end
procedure C()
if yychar == c then {
yychar := scan()
succeed
}
else fail
end
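The same parser can be written in C. This is a sketch, assuming each token is a single character read from a string; input, scan, and parse are names invented for the example, and yychar plays the same role as in the pseudocode. Success and failure are returned as 1 and 0.

```c
static const char *input;       /* remaining input */
static int yychar;              /* current token; 0 means end of input */

/* Return the next token and advance, or 0 at end of input. */
static int scan(void) { return *input ? *input++ : 0; }

/* A -> a A | epsilon */
static int A(void)
{
    if (yychar == 'a') {        /* use A -> a A */
        yychar = scan();
        return A();
    }
    return 1;                   /* use A -> epsilon, which always succeeds */
}

/* B -> b */
static int B(void)
{
    if (yychar == 'b') { yychar = scan(); return 1; }
    return 0;
}

/* C -> c */
static int C(void)
{
    if (yychar == 'c') { yychar = scan(); return 1; }
    return 0;
}

/* S -> A B C */
static int S(void) { return A() && B() && C(); }

/* Returns 1 if s is in the language a*bc, else 0. */
int parse(const char *s)
{
    input = s;
    yychar = scan();            /* prime the one-token lookahead */
    return S() && yychar == 0;  /* must also have consumed all input */
}
```

Note the check for leftover input in parse(): S() succeeding on a prefix is not the same as the whole string being legal.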
Backtracking?
Could your current token begin more than one of your possible production rules?
Try all of them, remember and reset state for each try.
S -> cAd
A -> ab
A -> a
Left factoring can often solve such problems:
S -> cAd
A -> a A'
A'-> b
A'-> (ε)
One can also perform left factoring to reduce or
eliminate the lookahead or backtracking needed to tell which production rule
to use. If the end result has no lookahead or backtracking needed, the
resulting CFG can be parsed by a "predictive parser" and coded easily in a
conventional language. If backtracking is needed, a recursive descent
parser takes more work to implement, but is still feasible.
As a more concrete example:
S -> if E then S
S -> if E then S else S
can be factored to:
S -> if E then S S'
S'-> else S | ε
Some More Parsing Theory
Automatic techniques for constructing parsers start with computing some
basic functions for symbols in the grammar. These functions are useful
in understanding both recursive descent and bottom-up LR parsers.
First(a)
First(a) is the set of terminals that begin strings derived from a,
which can include ε.
- First(X) starts with the empty set.
- if X is a terminal, First(X) is {X}.
- if X -> ε is a production, add ε to First(X).
- if X is a non-terminal and X -> Y1 Y2 ... Yk is a production,
add First(Y1) - ε to First(X).
for (i = 1; i < k and Yi can derive ε; i++)
add First(Yi+1) - ε to First(X)
If every one of Y1 ... Yk can derive ε, add ε to First(X).
First(a) examples
By the way, this stuff is all in section 4.3 in your text.
Last time we looked at an example with E, T, and F, and + and *.
The first-set computation was not too exciting and we need more
examples.
stmt : if-stmt | OTHER
if-stmt: IF LP expr RP stmt else-part
else-part: ELSE stmt | ε
expr: IDENT | INTLIT
What are the First() sets of each nonterminal?
Follow(A)
Follow(A) for nonterminal A is the set of terminals that can appear
immediately to the right of A in some sentential form S -> aAxB...
To compute Follow, apply these rules to all nonterminals in the grammar:
- Add $ to Follow(S)
- if A -> aBb then add First(b) - ε to Follow(B)
- if A -> aB or A -> aBb where ε is in First(b), then add
Follow(A) to Follow(B).
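The First and Follow rules can be run to a fixed point by machine. Below is a minimal sketch in C for the stmt / if-stmt / else-part / expr grammar given earlier, assuming each set fits in the bits of one unsigned int. The enum names, the EPS and END markers (for ε and $), and compute_first_follow are assumptions made for this illustration.

```c
enum { IF, LP, RP, ELSE, OTHER, IDENT, INTLIT, END, EPS,  /* terminals, $, epsilon */
       STMT, IF_STMT, ELSE_PART, EXPR, NSYMBOLS };        /* nonterminals */
#define NTERMS STMT             /* symbols numbered below this are not nonterminals */
#define BIT(x) (1u << (x))

struct prod { int lhs; int len; int rhs[6]; };

static const struct prod grammar[] = {
    { STMT,      1, { IF_STMT } },
    { STMT,      1, { OTHER } },
    { IF_STMT,   6, { IF, LP, EXPR, RP, STMT, ELSE_PART } },
    { ELSE_PART, 2, { ELSE, STMT } },
    { ELSE_PART, 0, { 0 } },                  /* else-part -> epsilon */
    { EXPR,      1, { IDENT } },
    { EXPR,      1, { INTLIT } },
};
#define NPRODS ((int)(sizeof grammar / sizeof grammar[0]))

unsigned first[NSYMBOLS], follow[NSYMBOLS];   /* bit sets of terminals */

/* First of the string rhs[from..len-1]; has EPS if the whole string can vanish. */
static unsigned first_of_string(const int *rhs, int from, int len)
{
    unsigned set = 0;
    for (int i = from; i < len; i++) {
        set |= first[rhs[i]] & ~BIT(EPS);
        if (!(first[rhs[i]] & BIT(EPS))) return set;   /* rhs[i] cannot vanish */
    }
    return set | BIT(EPS);
}

void compute_first_follow(void)
{
    int changed;

    for (int t = 0; t < NTERMS; t++) first[t] = BIT(t);   /* First(terminal) */
    do {                                      /* repeat until no First set grows */
        changed = 0;
        for (int p = 0; p < NPRODS; p++) {
            const struct prod *g = &grammar[p];
            unsigned add = first_of_string(g->rhs, 0, g->len);
            if ((first[g->lhs] | add) != first[g->lhs]) {
                first[g->lhs] |= add;
                changed = 1;
            }
        }
    } while (changed);

    follow[STMT] |= BIT(END);                 /* add $ to Follow(start symbol) */
    do {                                      /* repeat until no Follow set grows */
        changed = 0;
        for (int p = 0; p < NPRODS; p++) {
            const struct prod *g = &grammar[p];
            for (int i = 0; i < g->len; i++) {
                int B = g->rhs[i];
                if (B < NTERMS) continue;     /* Follow is for nonterminals only */
                unsigned rest = first_of_string(g->rhs, i + 1, g->len);
                unsigned add = rest & ~BIT(EPS);
                if (rest & BIT(EPS)) add |= follow[g->lhs];
                if ((follow[B] | add) != follow[B]) {
                    follow[B] |= add;
                    changed = 1;
                }
            }
        }
    } while (changed);
}
```

After compute_first_follow(), first[STMT] holds {IF, OTHER}, first[ELSE_PART] holds {ELSE, ε}, follow[STMT] holds {$, ELSE}, and follow[EXPR] holds {RP}. Try working these out by hand first and then compare.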
On resizing arrays in C
The sval attribute in homework #2 is a perfect example of a problem which a
BCS major might not be expected to manage, but a CS major should be able to
do by the time they graduate. This is not to encourage any of you to consider
BCS, but rather, to encourage you to learn how to solve problems like these.
The problem can be summarized as: step through yytext, copying each piece
out to sval, removing doublequotes and plusses between the pieces, and
evaluating CHR$() constants.
Space allocated with malloc() can be increased in size by realloc().
realloc() is awesome. But, it COPIES and MOVES the old chunk of
space you had to the new, resized chunk of space, and frees the old
space, so you had better not have any other pointers pointing at
that space if you realloc(), and you have to update your pointer to
point at the new location realloc() returns.
i = 0; j = 0;
while (yytext[i] != '\0') {
if (yytext[i] == '\"') {
/* copy string into sval */
i++;
while (yytext[i] != '\"') {
sval[j++] = yytext[i++];
}
}
else if ((yytext[i] == 'C') || (yytext[i] == 'c')) {
/* handle CHR$(...) */
i += 5;
k = atoi(yytext + i);
sval[j++] = k; /* might check for 0-255 */
while (yytext[i] != ')') i++;
}
/* else we can just skip it */
i++;
}
sval[j] = '\0'; /* NUL-terminate our string */
There is one more problem: how do we allocate memory for sval, and how big
should it be?
- Solution #1: sval = malloc(strlen(yytext)+1) is very safe, but wastes
space.
- Solution #2: you could malloc a small amount and grow the array as
needed.
sval = strdup("");
...
sval = appendstring(sval, yytext[i]); /* instead of sval[j++] = yytext[i] */
where the function appendstring could be:
char *appendstring(char *s, char c)
{
size_t i = strlen(s);
s = realloc(s, i+2);
s[i] = c;
s[i+1] = '\0';
return s;
}
Note: it is very inefficient to grow your array one character at
a time; in real life people grow arrays in large chunks at a time.
- Solution #3: use solution one and then shrink your array when you
find out how big it actually needs to be.
sval = malloc(strlen(yytext)+1);
/* ... do the code copying into sval; be sure to NUL-terminate */
sval = realloc(sval, strlen(sval)+1);
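Growing in chunks, as suggested under solution #2, might look like the sketch below. It assumes a small bookkeeping struct that remembers the allocated capacity and doubles it whenever the buffer fills; struct strbuf, strbuf_append, and the starting size of 16 are all invented for this illustration.

```c
#include <stdlib.h>

struct strbuf {
    char *data;                 /* NUL-terminated contents, or NULL when empty */
    size_t len;                 /* characters stored, not counting the NUL */
    size_t cap;                 /* bytes currently allocated */
};

/* Append c to b, doubling the allocation when it is full.
   Returns 1 on success, 0 if realloc() fails (b is left unchanged). */
int strbuf_append(struct strbuf *b, char c)
{
    if (b->len + 2 > b->cap) {                  /* need room for c and the NUL */
        size_t newcap = b->cap ? b->cap * 2 : 16;
        char *q = (char *)realloc(b->data, newcap);
        if (q == NULL) return 0;                /* old block is still valid */
        b->data = q;                            /* realloc may have moved it */
        b->cap = newcap;
    }
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
    return 1;
}
```

Start with struct strbuf b = { NULL, 0, 0 }; realloc(NULL, n) behaves like malloc(n), so no separate first allocation is needed. Doubling means that appending n characters triggers only about log2(n) reallocations instead of n.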
lecture #11 began here
YACC
YACC ("yet another compiler compiler") is a popular tool which originated at
AT&T Bell Labs. YACC takes a context free grammar as input, and generates a
parser as output. Several independent, compatible implementations (AT&T
yacc, Berkeley yacc, GNU Bison) for C exist, as well as many implementations
for other popular languages.
YACC files end in .y and take the form
declarations
%%
grammar
%%
subroutines
The declarations section defines the terminal symbols (tokens) and
nonterminal symbols. The most useful declarations are:
- %token a
- declares terminal symbol a; YACC can generate a set of #define's
that map these symbols onto integers, in a y.tab.h file. Note: don't
#include your y.tab.h file from your grammar .y file. YACC generates the
same definitions and declarations directly in the .c file, and including
the y.tab.h file will cause duplication errors.
- %start A
- specifies the start symbol for the grammar (defaults to nonterminal
on left side of the first production rule).
The grammar gives the production rules, interspersed with program code
fragments called semantic actions that let the programmer do what's
desired when the grammar productions are reduced. They follow the
syntax
A : body ;
Where body is a sequence of 0 or more terminals, nonterminals, or semantic
actions (code, in curly braces) separated by spaces. As a notational
convenience, multiple production rules may be grouped together using the
vertical bar (|).
Bottom Up Parsing
Bottom up parsers start from the sequence of terminal symbols and work
their way back up to the start symbol by repeatedly replacing grammar
rules' right hand sides by the corresponding non-terminal. This is
the reverse of the derivation process, and is called "reduction".
Example. For the grammar
(1) S->aABe
(2) A->Abc
(3) A->b
(4) B->d
the string "abbcde" can be parsed bottom-up by the following reduction
steps:
abbcde
aAbcde
aAde
aABe
S
Handles
Definition: a handle is a substring that
- matches a right hand side of a production rule in the grammar and
- whose reduction to the nonterminal on the left hand side of that
grammar rule is a step along the reverse of a rightmost derivation.
Shift Reduce Parsing
A shift-reduce parser performs its parsing using the following structure
Stack Input
$ w$
At each step, the parser performs one of the following actions.
- Shift one symbol from the input onto the parse stack
- Reduce one handle on the top of the parse stack. The symbols
from the right hand side of a grammar rule are popped off the
stack, and the nonterminal symbol is pushed on the stack in their place.
- Accept is the operation performed when the start symbol is alone
on the parse stack and the input is empty.
- Error actions occur when no successful parse is possible.
The YACC Value Stack
- YACC's parse stack contains only "states"
- YACC maintains a parallel set of values
- $ is used in semantic actions to name elements on the value stack
- $$ denotes the value associated with the LHS (nonterminal) symbol
- $n denotes the value associated with RHS symbol at position n.
- Value stack typically used to construct the parse tree
- Typical rule with semantic action: A : b C d { $$ = tree(R,3,$1,$2,$3); }
- The default value stack is an array of integers
- The value stack can hold arbitrary values in an array of unions
- The union type is declared with %union and is named YYSTYPE
Getting Lex and Yacc to talk
YACC uses a global variable named yylval, of type YYSTYPE, to receive
lexical information from the scanner. Whatever is in this variable
each time yylex() returns to the parser will get copied over to the
top of the value stack when the token is shifted onto the parse stack.
You can either declare that struct token may appear in the %union,
and put a mixture of struct node and struct token on the value stack,
or you can allocate a "leaf" tree node, and point it at your struct
token. Or you can use a tree type that allows tokens to include
their lexical information directly in the tree nodes. If you have
more than one %union type possible, be prepared to see type conflicts
and to declare the types of all your nonterminals.
Getting all this straight takes some time; you can plan on it. Your best
bet is to draw pictures of how you want the trees to look, and then make the
code match the pictures. No pictures == "Dr. J will ask to see your
pictures and not be able to help if you can't describe your trees."
Declaring value stack types for terminal and nonterminal symbols
Unless you are going to use the default (integer) value stack, you will
have to declare the types of the elements on the value stack. Actually,
you do this by declaring which
union member is to be used for each terminal and nonterminal in the
grammar.
Example: in the cocogram.y that I gave you we could add a %union declaration
with a union member named treenode:
%union {
nodeptr treenode;
}
This will produce a compile error if you haven't declared a nodeptr type
using a typedef, but that is another story. To declare that a nonterminal
uses this union member, write something like:
%type <treenode> function_definition
Terminal symbols use %token to perform the corresponding declaration.
If you had a second %union member (say struct token *tokenptr) you
might write:
%token <tokenptr> SEMICOL
Announcements
Having trouble debugging your grammar? "bison -v" generates a .output
file that gives the gory details of conflicts and such.
lecture #12 began here
Announcements
In honor of Dr. Jeffery's 10th anniversary, a minor extension in Homework #3.
Conflicts in Shift-Reduce Parsing
"Conflicts" occur when an ambiguity in the grammar creates a situation
where the parser does not know which step to perform at a given point
during parsing. There are two kinds of conflicts that occur.
- shift-reduce
- a shift reduce conflict occurs when the grammar indicates that
different successful parses might occur with either a shift or a reduce
at a given point during parsing. The vast majority of situations where
this conflict occurs can be correctly resolved by shifting.
- reduce-reduce
- a reduce reduce conflict occurs when the parser has two or more
handles at the same time on the top of the stack. Whatever choice
the parser makes is just as likely to be wrong as not. In this case
it is usually best to rewrite the grammar to eliminate the conflict,
possibly by factoring.
Example shift reduce conflict:
S->if E then S
S->if E then S else S
In many languages two nested "if" statements produce a situation where
an "else" clause could legally belong to either "if". The usual rule
(to shift) attaches the else to the nearest (i.e. inner) if statement.
Example reduce reduce conflict:
(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist, p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist, E
(9) elist -> E
By the point the stack holds ...id LP id
the parser will not know which rule to use to reduce the id: (5) or (7).
Further Discussion of Reduce Reduce and Shift Reduce Conflicts
The following grammar, based loosely on our expression grammar from
last time, illustrates a reduce reduce conflict, and how you have to
exercise care when using epsilon productions. Epsilon productions
were helpful for some of the grammar rewriting methods, such as removing
left recursion, but used indiscriminately, they can cause much trouble.
T : F | F T2 ;
T2 : p F T2 | ;
F : l T r | v ;
The reduce-reduce conflict occurs after you have seen an F. If the next
symbol is a p there is no question of what to do, but if the next symbol
is the end of file, do you reduce by rule #1 or #4 ?
A slightly different grammar is needed to demonstrate a shift-reduce conflict:
T : F g;
T : F T2 g;
T2 : t F T2 ;
T2 : ;
F : l T r ;
F : v ;
This grammar is not much different than before, and has the same problem,
but the surrounding context (the "calling environments") of F cause the
grammar to have a shift-reduce instead of reduce-reduce. Once again,
the trouble is after you have seen an F and dwells on the question of
whether to reduce the epsilon production, or instead to shift, upon
seeing a token g.
The .output file generated by "bison -v" explains these conflicts in
considerable detail. Part of what you need to interpret them are the
concepts of "items" and "sets of items" discussed below.
YACC precedence and associativity declarations
YACC headers can specify precedence and associativity rules for otherwise
heavily ambiguous grammars. Precedence increases in the order of these
declarations: operators declared later bind more tightly. Example:
%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr: expr ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr POWER expr
;
YACC error handling and recovery
- Use special predefined token
error where errors expected
- On an error, the parser pops states until it enters one that has an
action on the error token.
- For example: statement: error ';' ;
- The parser must see 3 good tokens before it decides it has recovered.
- yyerrok tells parser to skip the 3 token recovery rule
- yyclearin throws away the current (error-causing?) token
- yyerror(s) is called when a syntax error occurs (s is the error message)
Improving YACC's Error Reporting
yyerror(s) overrides the default error message, which usually just says either
"syntax error" or "parse error", or "stack overflow".
You can easily add information in your own yyerror() function, for example
GCC emits messages that look like:
goof.c:1: parse error before '}' token
using a yyerror function that looks like
void yyerror(char *s)
{
fprintf(stderr, "%s:%d: %s before '%s' token\n",
yyfilename, yylineno, s, yytext);
}
You could instead use the error recovery mechanism to produce better messages.
For example
lbrace : LBRACE | { error_code=MISSING_LBRACE; } error ;
Where LBRACE is an expected token {
This uses a global variable error_code to pass parse information to yyerror().
Another related option is to call yyerror() explicitly with a better message
string, and tell the parser to recover explicitly:
package_declaration: PACKAGE_TK error
{ yyerror("Missing name"); yyerrok; } ;
But, using error recovery to perform better error reporting runs against
conventional wisdom that you should use error tokens very sparingly.
What information from the parser determined we had an error in the first
place? Can we use that information to produce a better error message?
LR Syntax Error Messages: Advanced Methods
The pieces of information that YACC/Bison use to determine that there
is an error in the first place are the parse state (yystate) and the
current input token (yychar). These are exactly the pieces of information
one might use to produce better diagnostic error messages without
relying on the error recovery mechanism and mucking up the grammar
with a lot of extra production rules that feature the error token.
Even just the parse state is enough to do pretty good error messages.
yystate is not part of YACC's public interface, though, so you may
have to play some tricks to pass it as a parameter into yyerror() from
yyparse(). Say, for example:
#define yyerror(s) __yyerror(s,yystate)
Inside __yyerror(msg, yystate) you can use a switch statement or a global
array to associate messages with specific parse states. But, figuring
out which parse state means which syntax error message would be by trial
and error.
A tool called Merr is available that lets you generate this yyerror
function from examples: you supply the sample syntax errors and messages,
and Merr figures out which parse state integer goes with which message.
Merr also uses the yychar (current input token) to refine the diagnostics
in the event that two of your example errors occur on the same parse state.
See the Merr web page.
lecture #13 began here
Announcements
The TA's HW2 grades are available from the TA. The distribution (out of 80) was
76, 74, 74, 74, 73, 72, 66, 65, 55, 52, 46, 35, 30, 30, 30, 15, 14
1/3rd of the class got an "A". The rest of you need to visit the TA, see how
the grades were measured, see the professor, and most important, get a lexical
analyzer working well enough to complete the later assignments in this course.
If your grade was below 70, you probably want to get it working and resubmit
it, I have asked the TA to accept resubmissions and average the grades
(example: you got a 30, fixed it and resubmitted it and got a 70; your overall
grade is a 50). This option is valid until the due date for the next homework.
After all of this adjustment, you are being graded relative to your
peers, not on an absolute 90/80/... scale. Depending on your peers'
performance, a 60% score at the end of the semester could be a "B" for all I
know. The purpose of the late penalty is to encourage you not to fall
further and further behind as the semester progresses, and to encourage you
to in fact catch up if you do fall behind.
For HW3 (syntax checker), make sure your tar file
unpacks OK and that "make" just works for us out of the box. In your paper
turnin, make sure you DO include the lex .l and yacc .y files, and make
sure you do NOT include the .c files generated from the lex .l and
yacc .y files (lex.yy.c, y.tab.c, whatever). Include all .h files and your
makefile.
For HW3, test your work on as many test cases as possible.
Midterm Exam is coming up, March 16. Midterm review March 14.
Three more lectures before that.
LR vs. LL vs. LR(0) vs. LR(1) vs. LALR(1)
The first char ("L") means input tokens are read from the left
(left to right). The second char ("R" or "L") means parsing
finds the rightmost (in reverse, for LR), or leftmost, derivation. Relevant
if there is ambiguity in the grammar. (0) or (1) or (k) after
the main lettering indicates how many lookahead tokens are
used. (0) means you only look at the parse stack, (1) means you
use the current token in deciding what to do, shift or reduce.
(k) means you look at the next k tokens before deciding what
to do at the current position.
LR Parsers
LR denotes a class of bottom up parsers that is capable of handling virtually
all programming language constructs. LR is efficient; it runs in linear time
with no backtracking needed. The class of languages handled by LR is a proper
superset of the class of languages handled by top down "predictive parsers".
LR parsing detects an error as soon as it is possible to do so. Generally,
building an LR parser is too big and complicated a job to do by hand, so we use
tools to generate LR parsers.
The LR parsing algorithm is given below.
ip = first symbol of input
repeat {
s = state on top of parse stack
a = *ip
case action[s,a] of {
SHIFT s': { push(a); push(s'); ip++ }
REDUCE A->beta: {
pop 2*|beta| symbols; s' = new state on top
push A
push goto(s', A)
}
ACCEPT: return 0 /* success */
ERROR: { error("syntax error", s, a); halt }
}
}
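Here is the same driver loop as a small C sketch. It assumes a toy grammar, (1) S -> ( S ) and (2) S -> x, with an action/goto table worked out by hand for it; the grammar, the table, and all of the names are illustrative assumptions, not something YACC generated. One simplification: only states are pushed on the stack (each state implies its grammar symbol), so a reduce pops |beta| entries rather than 2*|beta|.

```c
enum { T_LP, T_RP, T_X, T_END, NTOKENS };      /* terminals: (  )  x  $ */
enum { ERR, SHIFT, REDUCE, ACCEPT };           /* ERR is 0, so blank entries are errors */

struct action { int kind; int arg; };          /* arg: next state or rule number */

static const struct action action_table[6][NTOKENS] = {
    /*            (            )            x            $         */
    /* 0 */ { {SHIFT, 2}, {ERR, 0},    {SHIFT, 3}, {ERR, 0}    },
    /* 1 */ { {ERR, 0},   {ERR, 0},    {ERR, 0},   {ACCEPT, 0} },
    /* 2 */ { {SHIFT, 2}, {ERR, 0},    {SHIFT, 3}, {ERR, 0}    },
    /* 3 */ { {ERR, 0},   {REDUCE, 2}, {ERR, 0},   {REDUCE, 2} },
    /* 4 */ { {ERR, 0},   {SHIFT, 5},  {ERR, 0},   {ERR, 0}    },
    /* 5 */ { {ERR, 0},   {REDUCE, 1}, {ERR, 0},   {REDUCE, 1} },
};
static const int goto_S[6] = { 1, -1, 4, -1, -1, -1 };   /* goto[state, S] */
static const int rhs_len[3] = { 0, 3, 1 };               /* |beta| for rules 1, 2 */

/* Map one input character to a token code, or -1 if it is not a token. */
static int token_of(char c)
{
    switch (c) {
    case '(':  return T_LP;
    case ')':  return T_RP;
    case 'x':  return T_X;
    case '\0': return T_END;
    default:   return -1;
    }
}

/* Returns 1 if the string is accepted, 0 on a syntax error. */
int lr_parse(const char *ip)
{
    int stack[256], top = 0;
    stack[0] = 0;                               /* start state */
    for (;;) {
        int a = token_of(*ip);
        if (a < 0) return 0;
        struct action act = action_table[stack[top]][a];
        switch (act.kind) {
        case SHIFT:
            if (top + 1 >= 256) return 0;       /* parse stack overflow */
            stack[++top] = act.arg;
            ip++;                               /* consume the token */
            break;
        case REDUCE:
            top -= rhs_len[act.arg];            /* pop the handle */
            stack[top + 1] = goto_S[stack[top]];
            top++;                              /* push goto(s', S) */
            break;
        case ACCEPT:
            return 1;
        default:
            return 0;                           /* ERROR entry */
        }
    }
}
```

For example, lr_parse("((x))") returns 1, and lr_parse("(x") returns 0 because state 4 has no action on end of input.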
Constructing SLR Parsing Tables:
Note: in Spring 2006 this material is FYI but you will not be
examined on it.
Definition: An LR(0) item of a grammar G is a production
of G with a dot at some position of the RHS.
Example: The production A->aAb gives the items:
A -> . a A b
A -> a . A b
A -> a A . b
A -> a A b .
Note: A production A-> ε generates
only one item:
A -> .
Intuition: an item A-> α . β denotes:
- α - we have already seen a string
derivable from α
- β - we hope to see a string derivable
from β
Functions on Sets of Items
Closure: if I is a set of items for a grammar G, then closure(I)
is the set of items constructed as follows:
- Every item in I is in closure(I).
- If A->α . Bβ
is in closure(I) and B->γ
is a production, then add B-> .γ
to closure(I).
These two rules are applied repeatedly until no new items can
be added.
Intuition: If A -> α . B β is in
closure(I) then we hope to see a string derivable from B in the
input. So if B-> γ is a production,
we should hope to see a string derivable from γ.
Hence, B->.γ is in closure(I).
Goto: if I is a set of items and X is a grammar symbol, then goto(I,X)
is defined to be:
goto(I,X) = closure({[A->αX.β] | [A->α.Xβ]
is in I})
Intuition:
- [A->α.Xβ]
is in I => we've seen a string derivable
from α; we hope to see a string derivable
from Xβ.
- Now suppose we see a string derivable from X
- Then, we should "goto" a state where we've seen
a string derivable from αX, and where
we hope to see a string derivable from β.
The item corresponding to this is [A->αX.β]
- Example: Consider the grammar
E -> E+T | T
T -> T*F | F
F -> (E) | id
Let I = {[E -> E . + T]} then:
goto(I,+) = closure({[E -> E+.T]})
= closure({[E -> E+.T], [T -> .T*F], [T -> .F]})
= closure({[E -> E+.T], [T -> .T*F], [T -> .F], [F-> .(E)], [F -> .id]})
= { [E -> E + .T],[T -> .T * F],[T -> .F],[F -> .(E)],[F -> .id]}
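The closure and goto functions can be sketched in C for this same expression grammar. This is only an illustration under a few assumptions of mine: items are (production number, dot position) pairs, the single character 'i' stands in for the token id, uppercase letters are the nonterminals, and the function is named go_to because goto is a C keyword.

```c
#include <ctype.h>

/* The expression grammar, with 'i' standing in for the token id.
   Uppercase letters are nonterminals. */
struct prod { char lhs; const char *rhs; };
static const struct prod G[] = {
    {'E', "E+T"}, {'E', "T"},
    {'T', "T*F"}, {'T', "F"},
    {'F', "(E)"}, {'F', "i"},
};
enum { NPROD = sizeof G / sizeof G[0] };

struct item { int prod; int dot; };   /* dot = # of RHS symbols already seen */

static int has_item(const struct item *set, int n, struct item it)
{
    for (int i = 0; i < n; i++)
        if (set[i].prod == it.prod && set[i].dot == it.dot) return 1;
    return 0;
}

/* Grow set[0..n-1] to its closure in place; returns the new count.
   cap is the capacity of set[]; items that would not fit are dropped. */
int closure(struct item *set, int n, int cap)
{
    for (int i = 0; i < n; i++) {     /* n grows as items are appended */
        char next = G[set[i].prod].rhs[set[i].dot];   /* symbol after the dot */
        if (!isupper((unsigned char)next)) continue;  /* terminal or end of RHS */
        for (int p = 0; p < NPROD; p++) {
            struct item it = { p, 0 };
            if (G[p].lhs == next && !has_item(set, n, it) && n < cap)
                set[n++] = it;
        }
    }
    return n;
}

/* Compute goto(I, X) into out[]; returns the number of items. */
int go_to(const struct item *I, int n, char X, struct item *out, int cap)
{
    int m = 0;
    if (X == '\0') return 0;
    for (int i = 0; i < n; i++)
        if (G[I[i].prod].rhs[I[i].dot] == X && m < cap) {
            out[m].prod = I[i].prod;  /* same production, dot moved past X */
            out[m].dot = I[i].dot + 1;
            m++;
        }
    return closure(out, m, cap);
}
```

Starting from the one-item set {[E -> E . + T]}, which is production 0 with the dot at position 1, go_to with X = '+' produces the same five items as the worked example above.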
The Sets of Items Construction
- Given a grammar G with start symbol S, construct the augmented
grammar by adding a special production S'->S where S' does
not appear in G.
- Algorithm for constructing the canonical collection of LR(0)
items for an augmented grammar G':
begin
C := { closure({[S' -> .S]}) };
repeat
for each set of items I in C:
for each grammar symbol X:
if goto(I,X) is not empty and goto(I,X) is not in C then
add goto(I,X) to C;
until no new sets of items can be added to C;
return C;
end
Valid Items: an item A -> β1.β2
is valid for a viable prefix αβ1 if
there is a derivation:
S' =>*rm αAω =>*rm αβ1β2ω
Suppose A -> β1.β2 is valid for αβ1,
and αβ1 is on the parsing stack
- if β2 != ε,
we should shift
- if β2 = ε,
A -> β1 is the handle,
and we should reduce by this production
Note: two valid items may tell us to do different things for the
same viable prefix. Some of these conflicts can be resolved using
lookahead on the input string.
Constructing an SLR Parsing Table
- Given a grammar G, construct the augmented grammar by adding
the production S' -> S.
- Construct C = {I0, I1, ..., In},
the set of sets of LR(0) items for G'.
- State i is constructed from Ii, with parsing actions
determined as follows:
- [A -> α.aβ] is in
Ii, where a is a terminal; goto(Ii,a) = Ij
: set action[i,a] = "shift j"
- [A -> α.] is in
Ii : set action[i,a] to "reduce A -> α"
for all a in FOLLOW(A), where A != S'
- [S' -> S.] is in Ii :
set action[i,$] to "accept"
- goto transitions constructed as follows: for all non-terminals:
if goto(Ii, A) = Ij, then goto[i,A] = j
- All entries not defined by (3) & (4) are made "error".
If there are any multiply defined entries, the grammar is not SLR.
- Initial state S0 of parser: the state constructed from
I0, the set of items containing [S' -> .S]
Example:
S -> aABe FIRST(S) = {a} FOLLOW(S) = {$}
A -> Abc FIRST(A) = {b} FOLLOW(A) = {b,d}
A -> b FIRST(B) = {d} FOLLOW(B) = {e}
B -> d FIRST(S') = {a} FOLLOW(S') = {$}
I0 = closure([S'->.S])
= closure([S'->.S],[S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.ABe])
= closure([S->a.ABe],[A->.Abc],[A->.b]) = I2
goto(I2,A) = closure([S->aA.Be],[A->A.bc])
= closure([S->aA.Be],[A->A.bc],[B->.d]) = I3
goto(I2,b) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9
lecture #14 began here
On Tree Traversals
Trees are classic data structures. Trees have nodes and edges, so they are
a special case of graphs. Tree edges are directional, with roles "parent"
and "child" attributed to the source and destination of the edge.
A tree has the property that every node has zero or one parent. A node
with no parents is called a root. A node with no children is called a leaf.
A node that is neither a root nor a leaf is an "internal node". Trees have
a size (total # of nodes), a height (maximum count of nodes from root to a leaf),
and an "arity" (maximum number of children in any one node).
Parse trees are k-ary, where there is a
variable number of children bounded by a value k determined by the grammar.
You may wish to consult your old data structures book, or look at some books
from the library, to learn more about trees if you are not totally
comfortable with them.
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
struct tree {
   short label; /* what production rule this came from */
   short nkids; /* how many children it really has */
   struct tree *child[1]; /* array of children, size varies 0..k */
};
struct tree *alctree(int label, int nkids, ...)
{
   int i;
   va_list ap;
   struct tree *ptr = malloc(sizeof(struct tree) +
                             (nkids-1)*sizeof(struct tree *));
   if (ptr == NULL) {fprintf(stderr, "alctree out of memory\n"); exit(1); }
   ptr->label = label;
   ptr->nkids = nkids;
   va_start(ap, nkids);
   for(i=0; i < nkids; i++)
      ptr->child[i] = va_arg(ap, struct tree *);
   va_end(ap);
   return ptr;
}
Besides a function to allocate trees, you need to write one or more recursive
functions to visit each node in the tree, either top to bottom (preorder),
or bottom to top (postorder). You might do many different traversals on the
tree in order to write a whole compiler: check types, generate machine-
independent intermediate code, analyze the code to make it shorter, etc.
You can write 4 or more different traversal functions, or you can write
1 traversal function that does different work at each node, determined by
passing in a function pointer, to be called for each node.
void postorder(struct tree *t, void (*f)(struct tree *))
{
   /* postorder means visit each child, then do work at the parent */
   int i;
   if (t == NULL) return;
   /* visit each child */
   for (i=0; i < t->nkids; i++)
      postorder(t->child[i], f);
   /* do work at parent */
   f(t);
}
You would then be free to write as many little helper functions as you
want, for different tree traversals, for example:
void printer(struct tree *t)
{
   if (t == NULL) return;
   printf("%p: %d, %d children\n", (void *)t, t->label, t->nkids);
}
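Some traversals need to return a value computed from the children rather than just do work at each node. As a small sketch in the same postorder style, the size and height defined earlier could be computed as follows. The function names tree_size and tree_height are my own for this example, and the struct is repeated only so the fragment stands alone.

```c
#include <stddef.h>

struct tree {
    short label;
    short nkids;
    struct tree *child[1];
};

/* size: total number of nodes in the tree rooted at t */
int tree_size(const struct tree *t)
{
    int n = 1;                          /* count this node */
    if (t == NULL) return 0;
    for (int i = 0; i < t->nkids; i++)
        n += tree_size(t->child[i]);
    return n;
}

/* height: maximum count of nodes on any path from t down to a leaf */
int tree_height(const struct tree *t)
{
    int tallest = 0;
    if (t == NULL) return 0;
    for (int i = 0; i < t->nkids; i++) {
        int h = tree_height(t->child[i]);
        if (h > tallest) tallest = h;
    }
    return tallest + 1;                 /* children first, then the parent */
}
```

Both functions are tiny examples of a synthesized computation: the answer at a node depends only on the answers from its children.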
Semantic Analysis
Semantic ("meaning") analysis refers to a phase of compilation in which the
input program is studied in order to determine what operations are to be
carried out. The two primary components of a classic semantic analysis
phase are variable reference analysis and type checking. These components
both rely on an underlying symbol table.
What we have at the start of semantic analysis is a syntax tree that
corresponds to the source program as parsed using the context free grammar.
Semantic information is added by annotating grammar symbols with
semantic attributes, which are defined by semantic rules.
A semantic rule is a specification of how to calculate a semantic attribute
that is to be added to the parse tree.
So the input is a syntax tree...and the output is the same tree, only
"fatter" in the sense that nodes carry more information.
Other outputs of semantic analysis are error messages reporting many
types of semantic errors.
Two typical examples of semantic analysis include:
- variable reference analysis
- the compiler must determine, for each use of a variable, which
variable declaration corresponds to that use. This depends on
the semantics of the source language being translated.
- type checking
- the compiler must determine, for each operation in the source code,
the types of the operands and resulting value, if any.
Notations used in semantic analysis:
- syntax-directed definitions
- high-level (declarative) specifications of semantic rules
- translation schemes
- semantic rules and the order in which they get evaluated
In practice, attributes get stored in parse tree nodes, and the
semantic rules are evaluated either (a) during parsing (for easy rules) or
(b) during one or more (sub)tree traversals.
Two Types of Attributes:
- synthesized
- attributes computed from information contained within one's children.
These are generally easy to compute, even on-the-fly during parsing.
- inherited
- attributes computed from information obtained from one's parent or siblings
These are generally harder to compute. Compilers may be able to jump
through hoops to compute some inherited attributes during parsing,
but depending on the semantic rules this may not be possible in general.
Compilers resort to tree traversals to move semantic information around
the tree to where it will be used.
Attribute Examples
Isconst and Value
Not all expressions have constant values; the ones that do may allow
various optimizations.
| CFG | Semantic Rule |
|---|---|
| E1 : E2 + T | E1.isconst = E2.isconst && T.isconst; if (E1.isconst) E1.value = E2.value + T.value |
| E : T | E.isconst = T.isconst; if (E.isconst) E.value = T.value |
| T1 : T2 * F | T1.isconst = T2.isconst && F.isconst; if (T1.isconst) T1.value = T2.value * F.value |
| T : F | T.isconst = F.isconst; if (T.isconst) T.value = F.value |
| F : ( E ) | F.isconst = E.isconst; if (F.isconst) F.value = E.value |
| F : ident | F.isconst = FALSE |
| F : intlit | F.isconst = TRUE; F.value = intlit.ival |
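The isconst and value rules translate almost directly into a postorder traversal. Here is a minimal sketch, assuming a simplified binary expression node of my own invention (struct expr and the N_ codes are not the course's tree type) and assuming every operator node has both operands.

```c
#include <stddef.h>

enum { N_INTLIT, N_IDENT, N_ADD, N_MUL };

struct expr {
    int kind;
    int ival;                    /* lexical attribute, used by N_INTLIT only */
    struct expr *left, *right;   /* operands, used by N_ADD and N_MUL only */
    int isconst;                 /* synthesized attributes */
    int value;
};

/* Compute isconst and value bottom-up; operator nodes are assumed
   to have two non-NULL operands. */
void fold(struct expr *e)
{
    if (e == NULL) return;
    switch (e->kind) {
    case N_INTLIT:
        e->isconst = 1;
        e->value = e->ival;
        break;
    case N_IDENT:
        e->isconst = 0;
        break;
    case N_ADD:
    case N_MUL:
        fold(e->left);           /* children first ... */
        fold(e->right);
        e->isconst = e->left->isconst && e->right->isconst;
        if (e->isconst)          /* ... then the parent */
            e->value = (e->kind == N_ADD)
                     ? e->left->value + e->right->value
                     : e->left->value * e->right->value;
        break;
    }
}
```

For the tree of 2 + 3 * 4 the root ends up with isconst true and value 14; replace any leaf with an identifier and isconst becomes false all the way up to the root.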
lecture #15 began here
Questions from the board and from the floor
Symbol Table Module
Symbol tables are used to resolve names within name spaces. Symbol
tables are generally organized hierarchically according to the
scope rules of the language. Although initially concerned with simply
storing the names of the various symbols that are visible in each scope, symbol
tables take on additional roles in the remaining phases of the compiler.
In semantic analysis, they store type information. And for code generation,
they store memory addresses and sizes of variables.
- mktable(parent)
- creates a new symbol table, whose scope is local to (or inside) parent
- enter(table, symbolname, type, offset)
- insert a symbol into a table
- lookup(table, symbolname)
- lookup a symbol in a table; returns structure pointer including type and offset. lookup operations are often chained together progressively from most local scope on out to global scope.
- addwidth(table)
- sums the widths of all entries in the table. ("widths" = #bytes, sum of
widths = #bytes needed for an "activation record" or "global data section").
Worry not about this method until code generation you wish to implement.
- enterproc(table, name, newtable)
- enters the local scope of the named procedure
Variable Reference Analysis
The simplest use of a symbol table would check:
- for each variable, has it been declared? (undeclared error)
- for each declaration, is it already declared? (redeclared error)
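As a rough sketch of how mktable, enter, and lookup might fit together for these two checks, here is a version that uses a linked list per scope instead of a hash table. The struct layouts, the int stand-in for a type, and the return conventions (NULL for undeclared, -1 for redeclared) are assumptions for illustration only.

```c
#include <stdlib.h>
#include <string.h>

struct sym {
    const char *name;
    int type;                /* stand-in for a real type representation */
    int offset;
    struct sym *next;
};

struct symtab {
    struct symtab *parent;   /* enclosing scope, NULL for the global scope */
    struct sym *syms;
};

struct symtab *mktable(struct symtab *parent)
{
    struct symtab *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->parent = parent;
    t->syms = NULL;
    return t;
}

/* Search one scope only. */
static struct sym *lookup_local(struct symtab *t, const char *name)
{
    for (struct sym *s = t->syms; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

/* Search from the most local scope out to the global scope.
   A NULL result means the name is undeclared. */
struct sym *lookup(struct symtab *t, const char *name)
{
    for (; t != NULL; t = t->parent) {
        struct sym *s = lookup_local(t, name);
        if (s != NULL) return s;
    }
    return NULL;
}

/* Returns 0 on success, -1 if name is already declared in this scope
   (or if memory runs out). The name string is not copied. */
int enter(struct symtab *t, const char *name, int type, int offset)
{
    struct sym *s;
    if (lookup_local(t, name) != NULL) return -1;
    s = malloc(sizeof *s);
    if (s == NULL) return -1;
    s->name = name;
    s->type = type;
    s->offset = offset;
    s->next = t->syms;
    t->syms = s;
    return 0;
}
```

Note that enter only checks the current scope, so a local may shadow a global of the same name, while lookup chains outward through the parent pointers.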
Reading Tree Leaves
In order to work with your tree, you must be able to tell, preferably
trivially easily, which nodes are tree leaves and which are internal nodes,
and for the leaves, how to access the lexical attributes.
Options:
- encode in the parent what the types of children are
- encode in each child what its own type is (better)
How do you do option #2 here?
Perhaps the best approach to all this is to unify the tokens and parse tree
nodes with something like the following, where perhaps an nkids value of -1
is treated as a flag that tells the reader to use
lexical information instead of pointers to children:
struct node {
int code; /* terminal or nonterminal symbol */
int nkids;
union {
struct token { ... } leaf;
struct node *kids[9];
}u;
} ;
There are actually nonterminal symbols with 0 children (nonterminal with
a righthand side with 0 symbols) so you don't necessarily want to use
an nkids of 0 as your flag to say that you are a leaf.
Type Checking
Perhaps the primary component of semantic analysis in many traditional
compilers consists of the type checker. In order to check types, one first
must have a representation of those types (a type system) and then one must
implement comparison and composition operators on those types using the
semantic rules of the source language being compiled. Lastly, type checking
will involve adding (mostly-) synthesized attributes through those parts of
the language grammar that involve expressions and values.
Type Systems
Types are defined recursively according to rules defined by the source
language being compiled. A type system might start with rules like:
- Base types (int, char, etc.) are types
- Named types (via typedef, etc.) are types
- Types composed using other types are types, for example:
- array(T, indices) is a type. In some
languages indices always start with 0, so array(T, size) works.
- T1 x T2 is a type (specifying, more or
less, the tuple or sequence T1 followed by T2;
x is a so-called cross-product operator).
- record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
- in languages with pointers, pointer(T) is a type
- (T1 x ... Tn) -> Tn+1 is a
type denoting a function mapping parameter types to a return type
- In some languages, type expressions may contain variables whose values
are types.
In addition, a type system includes rules for assigning these types
to the various parts of the program; usually this will be performed
using attributes assigned to grammar symbols.
lecture #16 began here
Midterm Exam Review
The Midterm will cover lexical analysis, finite automata, context free
grammars, syntax analysis, and parsing. Sample problems:
- Write a regular expression for numeric quantities of U.S. money
that start with a dollar sign, followed by one or more digits.
Require a comma between every three digits, as in $7,321,212.
Also, allow but do not require a decimal point followed by two
digits at the end, as in $5.99
- Use Thompson's construction to write a non-deterministic finite
automaton for the following regular expression, an abstraction
of the expression used for real number literal values in C.
(d+pd*|d*pd+)(ed+)?
- Write a regular expression, or explain why you can't write a
regular expression, for Modula-2 comments which use (* *) as
their boundaries. Unlike C, Modula-2 comments may be nested,
as in (* this is a (* nested *) comment *)
- Write a context free grammar for the subset of C expressions
that include identifiers and function calls with parameters.
Parameters may themselves be function calls, as in f(g(x)),
or h(a,b,i(j(k,l)))
- What are the FIRST(E) and FOLLOW(T) in the grammar:
E : E + T | T
T : T * F | F
F : ( E ) | ident
- What is the ε-closure(move({2,4},b)) in the following NFA?
That is, suppose you might be in either state 2 or 4 at the time
you see a symbol b: what NFA states might you find yourself in
after consuming b?
(automata to be written on the board)
Q: What else is likely to appear on the midterm?
A: questions that allow you to demonstrate that you know the difference
between a DFA and an NFA, questions about lex and flex and tokens
and lexical attributes, questions about context free grammars:
ambiguity, factoring, removing left recursion, etc.
On the mysterious TYPE_NAME
The C language typedef construct is an example where all the beautiful
theory we've used up to this point breaks down. Once a typedef is
introduced (which can first be recognized at the syntax level), certain
identifiers should be legal type names instead of identifiers. To make
things worse, they are still legal variable names: the lexical analyzer
has to know whether the syntactic context needs a type name or an
identifier at each point in which it runs into one of these names. This
sort of feedback from syntax or semantic analysis back into lexical
analysis is not un-doable but it requires extensions added by hand to
the machine generated lexical and syntax analyzer code.
typedef int foo;
foo x; /* a normal use of typedef... */
foo foo; /* try this on gcc! is it a legal global? */
void main() { foo foo; } /* what about this ? */
370-C does not support typedef's and without working typedef's the
TYPE_NAME token simply will never occur. Typedef's are fair game for
extra credit points.
Representing C (C++, Java, etc.) Types
The type system is represented using data structures in the compiler's
implementation language.
In the symbol table and in the parse tree attributes used in type checking,
there is a need to represent and compare source language types. You might
start by trying to assign a numeric code to each type, kind of like the
integers used to denote each terminal symbol and each production rule of the
grammar. But what about arrays? What about structs? There are an infinite
number of types; any attempt to enumerate them will fail. Instead, you
should create a new data type to explicitly represent type information.
This might look something like the following:
struct c_type {
   int base_type; /* 1 = int, 2=float, ... */
   union {
      struct array {
         int size;
         struct c_type *elemtype;
      } a;
      struct c_type *p;
      struct struc {
         char *label;
         struct field **f;
      } s;
   } u;
};
struct field {
   char *name;
   struct c_type *elemtype;
};
Given this representation, how would you initialize a variable to
represent each of the following types:
int [10][20]
struct foo { int x; char *s; }
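For the array case, one possible answer is sketched below; the struct case is left as the exercise. The code T_ARRAY = 3 is an assumption of mine (the notes only fix 1 = int and 2 = float), the struct member of the union is omitted to keep the fragment short, and total_elems is a made-up helper that shows how the nested representation is walked.

```c
#include <stddef.h>

/* Codes 1 and 2 follow the comment in the notes; T_ARRAY = 3 is assumed. */
enum { T_INT = 1, T_FLOAT = 2, T_ARRAY = 3 };

struct c_type {
    int base_type;
    union {
        struct array {
            int size;
            struct c_type *elemtype;
        } a;
        struct c_type *p;
    } u;
};

static struct c_type int_type = { T_INT, { .a = { 0, NULL } } };

/* int [20]: one row of 20 ints */
static struct c_type row_type = { T_ARRAY, { .a = { 20, &int_type } } };

/* int [10][20]: an array of 10 rows */
static struct c_type matrix_type = { T_ARRAY, { .a = { 10, &row_type } } };

/* Count the scalar elements in a (possibly nested) array type. */
int total_elems(const struct c_type *t)
{
    int n = 1;
    while (t != NULL && t->base_type == T_ARRAY) {
        n *= t->u.a.size;
        t = t->u.a.elemtype;
    }
    return n;
}
```

The point to notice is that int [10][20] is not a special two-dimensional kind of type: it is an array whose element type happens to be another array.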
Example Semantic Rules for Type Checking
| grammar rule | semantic rule
|
| E1 : E2 PLUS E3
| E1.type = check_types(PLUS, E2.type, E3.type)
|
Where check_types() returns a (struct c_type *) value. One of the values
it should be able to return is Error. The operator (PLUS) is included in
the check types function because behavior may depend on the operator --
the result type for array subscripting works differently than the result
type for the arithmetic operators, which may work differently (in some
languages) than the result type for logical operators that return booleans.
Type Promotion and Type Equivalence
When is it legal to perform an assignment x = y? When x and y are
identical types, sure. Many languages such as C have automatic
promotion rules for scalar types such as shorts and longs.
The results of type checking may include not just a type attribute,
they may include a type conversion, which is best represented by
inserting a new node in the tree to denote the promoted value.
Example:
int x;
long y;
y = y + x;
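A very reduced sketch of a check_types() that handles this int/long case might look like the following. Everything here is simplified and assumed for illustration: the type codes are renumbered so that a larger code means a wider type, struct c_type is cut down to its base_type field, only PLUS is handled, and the wider operand type is taken as the result type. The helper needs_conversion marks where a conversion node would be inserted in the tree.

```c
#include <stddef.h>

/* Assumed codes for this sketch only; T_INT < T_LONG encodes "narrower". */
enum { T_ERROR, T_INT, T_LONG, T_ARRAY };
enum { PLUS = 1 };

struct c_type { int base_type; };

static struct c_type error_type = { T_ERROR };

static int is_integral(const struct c_type *t)
{
    return t != NULL && (t->base_type == T_INT || t->base_type == T_LONG);
}

/* Result type of "a op b", or a pointer to error_type. */
struct c_type *check_types(int op, struct c_type *a, struct c_type *b)
{
    if (op != PLUS || !is_integral(a) || !is_integral(b))
        return &error_type;
    return (a->base_type >= b->base_type) ? a : b;   /* wider type wins */
}

/* True if the operand must be converted to reach the result type. */
int needs_conversion(const struct c_type *operand, const struct c_type *result)
{
    return operand->base_type != result->base_type;
}
```

For y + x above, the result type is long, needs_conversion is true for x and false for y, so only x would get a conversion node above it in the tree.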
For records/structures, some languages use name equivalence, while
others use structure equivalence. Features like typedef complicate
matters. If you have a new type name MY_INT that is defined to be
an int, is it compatible to pass as a parameter to a function that
expects regular int's? Object-oriented languages also get interesting
during type checking, since subclasses usually are allowed anyplace
their superclass would be allowed.
Implementing Structs
- storing and retrieving structs by their label -- the struct label is
how structs are identified. You do not have to do typedefs and such.
The labels can be keys in a separate hash table, similar to the global
symbol table. You can put them in the global symbol table so long as
you can tell the difference between them and variable names.
- You have to store fieldnames and their types, from where the struct is
declared. You could use a hash table for each struct, but a linked list
is OK as an alternative.
- You have to use the struct information to check the validity of each
dot operator like in rec.foo. To do this you'll have to lookup rec
in the symbol table, where you store rec's type. rec's type must be
a struct type for the dot to be legal, and that struct type should
include a hash table or linked list that gives the names and types of
the fields -- where you can lookup the name foo to find its type.
lecture #17 began here
Run-time Environments
How does a compiler (or a linker) compute the addresses for the various
instructions and references to data that appear in the program source code?
To generate code for it, the compiler has to "lay out" the data as it will
be used at runtime, deciding how big things are, and where they will go.
- Relationship between source code names and data objects during execution
- Procedure activations
- Memory management and layout
- Library functions
lecture #18 began here
Announcements
-
Affinity Research Group Workshop this Saturday, 9-3 in SH 124.
Extra credit: 20 points will be added to your midterm exam grade
for attending and providing sincere attention at this workshop.
Lunch is also provided.
- HW#5 is available
Scopes and Bindings
Variables may be declared explicitly or implicitly in some languages.
Scope rules for each language determine how to go from names to declarations.
Each use of a variable name must be associated with a declaration.
This is generally done via a symbol table. In most compiled languages
it happens at compile time (in contrast, for example, with LISP).
Environment and State
Environment maps source code names onto storage addresses (at compile time),
while state maps storage addresses into values (at runtime). Environment
relies on binding rules and is used in code generation; state operations
are loads/stores into memory, as well as allocations and deallocations.
Environment is concerned with scope rules, state is concerned with things
like the lifetimes of variables.
Runtime Memory Regions
Operating systems vary in terms of how they organize program memory
for runtime execution, but a typical scheme looks like this:
| code |
|---|
| static data |
|---|
| stack (grows down) |
|---|
| heap (may grow up, from bottom of address space) |
The code section may be read-only, and shared among multiple instances
of a program. Dynamic loading may introduce multiple code regions, which
may not be contiguous, and some of them may be shared by different programs.
The static data area may consist of two sections, one for "initialized data",
and one for uninitialized data (i.e. all zeros at the beginning).
Some OS'es place the heap at the very end of the address space, with a big
hole so either the stack or the heap may grow arbitrarily large. Other OS'es
fix the stack size and place the heap above the stack and grow it down.
Questions to ask about a language, before writing its code generator
- May procedures be recursive? (Duh, all modern languages...)
- What happens to locals when a procedure returns? (Lazy deallocation rare)
- May a procedure refer to non-local, non-global names?
(Pascal-style nested procedures, and object field names)
- How are parameters passed? (Many styles possible, different
declarations for each (Pascal), rules hardwired by type (C)?)
- May procedures be passed as parameters? (Not too awful)
- May procedures be return values? (Adds complexity for non-local names)
- May storage be allocated dynamically? (Duh, all modern languages...
but some languages do it with syntax (new), others with a library (malloc))
- Must storage be deallocated explicitly? (or is there a garbage collector?)
Activation Records
Activation records organize the stack, one record per method/function call.
| | return value |
| | parameter |
| | ... |
| | parameter |
| | previous frame pointer (FP) |
| | saved registers |
| | ... |
| FP--> | saved PC |
| | local |
| | ... |
| | local |
| | temporaries |
| SP--> | ... |
At any given instant, the live activation records form a chain and
follow a stack discipline. Over the lifetime of the program, this
information (if saved) would form a gigantic tree. If you remember
prior execution up to a current point, you have a big tree whose
rightmost edge consists of the live activation records, while the non-rightmost
tree nodes are an execution history of prior calls.
"Modern" Runtime Systems
The preceding discussion has been mainly about traditional languages such as
C. Object-oriented programs might be much the same, only every activation
record has an associated object instance; they need one extra "register" in
the activation record. In practice, modern OO runtime systems have many
more differences than this, and other more exotic language features imply
substantial differences in runtime systems. Here are a few examples of
features found in runtimes such as the Java Virtual Machine and .Net CLR.
Goal-directed programs have an activation tree each instant, due to
suspended activations that may be resumed for additional results. The
lifetime view is a sort of multidimensional tree, with three types of nodes.
Having Trouble Debugging?
To save yourself on the semester project in this class, you really do have
to learn gdb and/or ddd as well as you can. Sometimes it can help you
find your bug in seconds where you would have spent hours without it. But
only if you take the time to read the manual and learn the debugger.
To work on segmentation faults: recompile all .c files with -g and run your
program inside gdb to the point of the segmentation fault. Type the gdb
"where" command. Print the values of variables on the line mentioned in the
debugger as the point of failure. If it is inside a C library function, use
the "up" command until you are back in your own code, and then print the
values of all variables mentioned on that line.
There is one more tool you should know about, which is useful for certain
kinds of bugs, primarily subtle memory violations. It is called electric
fence. To use electric fence you add
/home/uni1/jeffery/ef/ElectricFence-2.1/libefence.a
to the line in your makefile that links your object files together to
form an executable.
lecture #19 began here
Need Help with Type Checking?
- Implement the C Type Representation given in lecture #16.
- Read the Book
- What OPERATIONS (functions) do you need, in order to check
whether types are correct? What parameters will they take?
Intermediate Code Generation
Goal: list of machine-independent instructions for each procedure/method
in the program. Basic data layout of all variables.
Can be formulated as syntax-directed translation
- add new attributes where necessary, e.g. for expression E we might have
- E.place
- the name that holds the value of E
- E.code
- the sequence of intermediate code statements evaluating E.
- new helper functions, e.g.
newtemp()
- returns a new temporary variable each time it is called
newlabel()
- returns a new label each time it is called
- actions that generate intermediate code formulated as semantic rules
| Production | Semantic Rules |
|---|---|
| S -> id ASN E | S.code = E.code || gen(ASN, id.place, E.place) |
| E -> E1 PLUS E2 | E.place = newtemp(); E.code = E1.code || E2.code || gen(PLUS,E.place,E1.place,E2.place); |
| E -> E1 MUL E2 | E.place = newtemp(); E.code = E1.code || E2.code || gen(MUL,E.place,E1.place,E2.place); |
| E -> MINUS E1 | E.place = newtemp(); E.code = E1.code || gen(NEG,E.place,E1.place); |
| E -> LP E1 RP | E.place = E1.place; E.code = E1.code; |
| E -> IDENT | E.place = id.place; E.code = emptylist(); |
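One way these rules might be turned into a tree traversal is sketched below. Several simplifications are assumptions of mine: the node type struct expr is invented, instructions are appended to one global text buffer in postorder instead of building E.code lists (which gives the same E1.code || E2.code || gen(...) order), the temporary is allocated after the children are visited, and the instruction text uses the ADD/MUL/NEG mnemonics in an "op x,y,z" layout chosen just for this example.

```c
#include <stdio.h>
#include <string.h>

enum { N_IDENT, N_PLUS, N_MUL, N_NEG };

struct expr {
    int kind;
    const char *name;            /* N_IDENT only */
    struct expr *left, *right;   /* right is unused for N_NEG */
    char place[16];              /* E.place */
};

static char code[1024];          /* all generated instructions, in order */
static int ntemps;

/* Write a fresh temporary name (t1, t2, ...) into buf. */
static void newtemp(char *buf, size_t n)
{
    snprintf(buf, n, "t%d", ++ntemps);
}

/* Append one instruction to the code buffer; z may be NULL. */
static void gen(const char *op, const char *x, const char *y, const char *z)
{
    size_t used = strlen(code);
    snprintf(code + used, sizeof code - used, "%s %s,%s%s%s\n",
             op, x, y, z ? "," : "", z ? z : "");
}

void codegen(struct expr *e)
{
    switch (e->kind) {
    case N_IDENT:                /* E.place = id.place; no code */
        snprintf(e->place, sizeof e->place, "%s", e->name);
        break;
    case N_NEG:
        codegen(e->left);
        newtemp(e->place, sizeof e->place);
        gen("NEG", e->place, e->left->place, NULL);
        break;
    default:                     /* N_PLUS or N_MUL */
        codegen(e->left);
        codegen(e->right);
        newtemp(e->place, sizeof e->place);
        gen(e->kind == N_PLUS ? "ADD" : "MUL",
            e->place, e->left->place, e->right->place);
        break;
    }
}
```

For the tree of a + b * c this leaves two instructions in the buffer, a MUL into t1 followed by an ADD into t2, and the root's place is t2.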
Three-Address Code
Basic idea: break down source language expressions into simple pieces that:
- translate easily into real machine code
- form a linearized representation of a syntax tree
- allow us to check our own work to this point
- allow machine independent code optimizations to be performed
- increase the portability of the compiler
Instruction set:
| mnemonic | C equivalent | description |
|---|---|---|
| ADD, SUB, MUL, DIV | x := y op z | store result of binary operation on y and z to x |
| NEG | x := op y | store result of unary operation on y to x |
| ASN | x := y | store y to x |
| ADDR | x := &y | store address of y to x |
| LCONT | x := *y | store contents pointed to by y to x |
| SCONT | *x := y | store y to location pointed to by x |
| GOTO | goto L | unconditional jump to L |
| BLESS,... | if x rop y then goto L | binary conditional jump to L |
| BIF | if x then goto L | unary conditional jump to L |
| BNIF | if !x then goto L | unary negative conditional jump to L |
| PARM | param x | store x as a parameter |
| CALL | call p,n,x | call procedure p with n parameters, store result in x |
| RET | return x | return from procedure, use x as the result |
Declarations (Pseudo instructions):
These declarations list size units as "bytes"; in a uniform-size environment
offsets and counts could be given in units of "slots", where a slot (4 bytes
on 32-bit machines) holds anything.
| global x,n1,n2 | declare a global named x at offset n1 having n2 bytes of space |
| proc x,n1,n2 | declare a procedure named x with n1 bytes of parameter space and n2 bytes of local variable space |
| local x,n | declare a local named x at offset n from the procedure frame |
| label Ln | designate that label Ln refers to the next instruction |
| end | declare the end of the current procedure |
TAC Adaptations for Object Oriented Code
| x := y field z | lookup field named z within y, store address to x |
| class x,n1,n2 | declare a class named x with n1 bytes of class variables and n2 bytes of class method pointers |
| field x,n | declare a field named x at offset n in the class frame |
| new x | create a new instance of class name x |
Variable Allocation and Access Issues
Given a variable name, how do we compute its address?
- globals
- easy, symbol table lookup
- locals
- easy, symbol table gives offset in (current) activation record
- objects
- easy, symbol table gives offset in object, activation record has
pointer to object in a standard location
- locals in some enclosing block/method/procedure
- ugh. Pascal, Ada, and friends offer their own unique kind of pain.
Q: does the current block support recursion? Example: for procedures
the answer would be yes; for nested { { } } blocks in C the answer
would be no.
- if no recursion, just count back some number of frame pointers based
on source code nesting
- if recursion, you need an extra pointer field in activation record
to keep track of the "static link", follow static link back some
# of times to find a name defined in an enclosing scope
Sizing up your Regions and Activation Records
Add a size field to every symbol table entry. Many types are not required
for your C370 project but we might want to discuss them anyhow.
- The size of integers is 4 (for x86; varies by CPU).
- The size of reals is... ? (for x86; varies by CPU).
- The size of strings is... <= 256? You could allocate static
256 character arrays in the global area, but better to do them as a
descriptor consisting of a length and a pointer.
- The size of arrays is (sizeof (struct descrip)) * the number of elements? Do we know an array size?
- Are arrays all int, or all real, or can they be mixed?
(in BASIC and other dynamic languages, they can be mixed!)
- Are there arrays of strings? -- yes
- what about sizes of structs?
You do this sizing up once for each scope. The size of each scope is the
sum of the sizes of symbols in its symbol table.
Run Time Type Information
Some languages would need the type information around at runtime; for
example, dynamic object-oriented languages. It's almost the case that one
just writes the type information, or symbol table information that includes
type information, into the generated code in this case, but perhaps one
wants to attach it to the actual values held at runtime.
struct descrip {
short type;
short size;
union {
char *string;
int ival;
float rval;
struct descrip *array;
/* ... for other types */
} value;
};
Compute the Offset of Each Variable
Add an address field to every symbol table entry.
The address contains a region plus an offset in that region.
No two variables may occupy the same memory at the same time.
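Given sizes, the offsets can be handed out with a running sum over the entries in a scope. The sketch below assumes a flat array of entries and ignores alignment and padding; the names struct entry, enum region, and assign_offsets are made up for this example.

```c
#include <stddef.h>

enum region { R_GLOBAL, R_LOCAL, R_PARAM };

struct entry {
    const char *name;
    int size;              /* bytes, filled in while sizing up the scope */
    enum region region;    /* address = region + offset */
    int offset;
};

/* Assign consecutive offsets to entries[0..n-1] in one region;
   returns the total number of bytes the region needs. */
int assign_offsets(struct entry *entries, int n, enum region r)
{
    int next = 0;
    for (int i = 0; i < n; i++) {
        entries[i].region = r;
        entries[i].offset = next;   /* starts where the previous one ended */
        next += entries[i].size;
    }
    return next;
}
```

Because each variable starts where the previous one ended, no two variables in the same region overlap, and the return value is the size of the region.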
Locals and Parameters are not Contiguous
For each function you need either to manage two separate regions
for locals and for parameters, or else you need to track where
in that region the split between locals and parameters will be.
Basic Blocks
Basic blocks are defined to be sequences of one or more instructions in which
there are no jumps into or out of the middle. In the most extreme
case, every instruction is a basic block. Start from that perspective
and then lump adjacent instructions together if nothing can come between
them.
What are the basic blocks in the following 3-address code?
("read" is a 3-address code to read in an integer.)
read x
t1 = x > 0
if t1 == 0 goto L1
fact = 1
label L2
t2 = fact * x
fact = t2
t3 = x - 1
x = t3
t4 = x == 0
if t4 == 0 goto L2
t5 = addr const:0
param t5 ; "%d\n"
param fact
call p,2
label L1
halt
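A common way to find basic blocks mechanically is to mark "leaders", the instructions that start a block. The sketch below uses a conservative rule that I am assuming for illustration: the first instruction, every label, and every instruction right after a jump is a leader. Instructions are reduced to three kinds, and calls are treated as ordinary instructions.

```c
enum kind { OTHER, LABEL, JUMP };   /* JUMP covers goto and conditional branches */

/* Set leader[i] to 1 if instruction i starts a basic block, else 0.
   Returns the number of basic blocks. */
int find_leaders(const enum kind *code, int n, int *leader)
{
    int blocks = 0;
    for (int i = 0; i < n; i++) {
        leader[i] = (i == 0)                /* first instruction */
                 || code[i] == LABEL        /* possible jump target */
                 || code[i - 1] == JUMP;    /* follows a jump */
        blocks += leader[i];
    }
    return blocks;
}
```

Each block then runs from one leader up to, but not including, the next leader. Applied to the listing above, this rule marks five leaders.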
Basic blocks are often used in order to talk about
specific types of optimizations that rely on basic blocks. So if they are
used for optimization, why did I introduce basic blocks now? You can view
every basic block as a hamburger; it will be a lot easier to eat if you
sandwich it inside a pair of labels (first and follow)!
Intermediate Code for Control Flow
Code for control flow (if-then, switches, and loops) consists of
code to test conditions, and the use of goto instructions and
labels to route execution to the correct code. Each chunk of code
that is executed together (no jumps into or out of it) is called
a basic block. The basic blocks are nodes in a control flow graph,
where goto instructions, as well as falling through from one basic block
to another, are edges connecting basic blocks.
Depending on your source language's semantic rules for things like
"short-circuit" evaluation for boolean operators, the operators
like || and && might be similar to + and * (non-short-circuit) or
they might be more like if-then code.
A general technique for implementing control flow code is to add
new attributes to tree nodes to hold labels that denote the
possible targets of jumps. The labels in question are sort of
analogous to FIRST and FOLLOW; for any given list of instructions
corresponding to a given tree node,
we might want a .first attribute to hold the label for the beginning
of the list, and a .follow attribute to hold the label for the next
instruction that comes after the list of instructions. The .first
attribute can be easily synthesized. The .follow attribute must be
inherited, from a parent or from a sibling.
The labels have to actually be allocated and attached to instructions
at appropriate nodes in the tree corresponding to grammar production
rules that govern control flow. An instruction in the middle of a
basic block needs neither a first nor a follow.
| Production | Attribute Manipulations |
|---|---|
| S->if E then S1 | E.true = newlabel();
E.false = S.follow;
S1.follow = S.follow;
S.code = E.code || gen(LABEL, E.true)||
S1.code
|
| S->if E then S1 else S2
| E.true = newlabel();
E.false = newlabel();
S1.follow = S.follow;
S2.follow = S.follow;
S.code = E.code || gen(LABEL, E.true)||
S1.code || gen(GOTO, S.follow) ||
gen(LABEL, E.false) || S2.code
|
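The two rules above can be sketched in C as follows. This is an illustration under simplifying assumptions, not the course's reference implementation: code is built as text in a caller-supplied buffer, E is a single relational test given as a string, S1.code and S2.code arrive already generated, and the helper names `emit` and `gen_test` are hypothetical. Only `newlabel` comes from the table.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical label allocator: labels print as L1, L2, ... */
static int label_count = 0;
int newlabel(void) { return ++label_count; }

/* Append formatted text to the NUL-terminated buffer out (capacity cap). */
static void emit(char *out, size_t cap, const char *fmt, ...)
{
    size_t len = strlen(out);
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(out + len, cap - len, fmt, ap);
    va_end(ap);
}

/* E.code for one relational test such as "x > 0":
   jump to E.true if it holds, otherwise to E.false. */
static void gen_test(char *out, size_t cap, const char *rel,
                     int e_true, int e_false)
{
    emit(out, cap, "if %s goto L%d\n", rel, e_true);
    emit(out, cap, "goto L%d\n", e_false);
}

/* S -> if E then S1 */
void gen_if(char *out, size_t cap, const char *rel,
            const char *s1_code, int s_follow)
{
    int e_true = newlabel();
    gen_test(out, cap, rel, e_true, s_follow);   /* E.false = S.follow */
    emit(out, cap, "label L%d\n", e_true);
    emit(out, cap, "%s", s1_code);
}

/* S -> if E then S1 else S2 */
void gen_if_else(char *out, size_t cap, const char *rel,
                 const char *s1_code, const char *s2_code, int s_follow)
{
    int e_true = newlabel();
    int e_false = newlabel();
    gen_test(out, cap, rel, e_true, e_false);
    emit(out, cap, "label L%d\n", e_true);
    emit(out, cap, "%s", s1_code);
    emit(out, cap, "goto L%d\n", s_follow);      /* skip the else part */
    emit(out, cap, "label L%d\n", e_false);
    emit(out, cap, "%s", s2_code);
}
```

Note that neither function emits the label for S.follow itself; as in the table, that label belongs to whatever code comes after the statement.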
Exercise: OK, so what does a while loop look like?
lecture #20 began here
Announcement
Co-op positions are available for fall 2006 at Los Alamos
National Laboratory, in the Computing, Telecommunications, and
Networking Division.
LANL is seeking outstanding SOPHOMORE, JUNIOR AND NON-GRADUATING
SENIOR LEVEL Computer Science majors to work in
the areas of networking, desktop support, high performance
computing or software engineering. Positions are available
for the fall 2006 semester. MUST HAVE A GPA OF 3.0 OR HIGHER.
To request a referral go to www.nmsu.edu/pment, click on
"Co-op Job Listings", Job #86, or call the co-op office at
646-4115. LANL also requires a cover letter; please email it
to coop@nmsu.edu with "attn: LANL cover letter" in the subject
line.
Co-op Office
646-4115
More on Generating Code for Boolean Expressions
Last time we started to look at code generation for control structures
such as if's and while's. Of course, before we can see the big
picture on these we have to understand how to generate code for the
boolean expressions that control these constructs.
Comparing Regular and Short Circuit Control Flow
Different languages have different semantics for booleans; for example
Pascal treats them as identical to arithmetic operators, while the
C family of languages (and many others) specify "short-circuit"
evaluation, in which the remaining operands are not evaluated once the
boolean result is known. Some ("kitchen-sink" design) languages
have two sets of boolean operators: short circuit and non-short-circuit.
(Does anyone know a language that has both?)
Implementation techniques for these alternatives include:
- treat boolean operators the same as arithmetic operators, and evaluate
each and every one into temporary variable locations.
- add extra attributes to keep track of code locations that are
targets of jumps. The attributes store linked lists of the jump
instructions whose targets must be backpatched once a destination label
is known. Boolean expressions' results evaluate to jump instructions and program
counter values (where you get to in the code implies what the boolean
expression results were).
- one could change the machine execution model so it implicitly routes
control from expression failure to the appropriate location. In
order to do this one would
  - mark boundaries of code in which failure propagates
  - maintain a stack of such marked "expression frames"
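The second technique (backpatching) can be sketched in C as below. The function names `makelist`, `merge` and `backpatch` follow common textbook usage and are not taken from these notes; the `target` array, holding one jump destination per instruction with -1 meaning "not yet known", is likewise an assumption made for illustration.

```c
#include <stdlib.h>

/* A linked list of jump instructions whose destination is still unknown. */
struct jumplist {
    int instr;                 /* index of the jump instruction */
    struct jumplist *next;
};

/* Create a one-element list holding jump instruction instr. */
struct jumplist *makelist(int instr)
{
    struct jumplist *p = malloc(sizeof *p);
    if (p == NULL)
        abort();
    p->instr = instr;
    p->next = NULL;
    return p;
}

/* Concatenate two lists, e.g. the true-lists of both operands of ||. */
struct jumplist *merge(struct jumplist *a, struct jumplist *b)
{
    struct jumplist *p = a;
    if (a == NULL)
        return b;
    while (p->next != NULL)
        p = p->next;
    p->next = b;
    return a;
}

/* The destination is now known: fill it into every jump on the list.
   The list is consumed (freed), since each jump is patched only once. */
void backpatch(struct jumplist *list, int *target, int dest)
{
    while (list != NULL) {
        struct jumplist *next = list->next;
        target[list->instr] = dest;
        free(list);
        list = next;
    }
}
```

With this scheme a boolean expression carries a "true list" and a "false list" as attributes instead of label names, and the enclosing statement calls `backpatch` once it knows where each outcome should go.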
Non-short Circuit Example
a<b || c<d && e<f
translates into
100: if a<b goto 103
     t1 = 0
     goto 104
103: t1 = 1
104: if c<d goto 107
     t2 = 0
     goto 108
107: t2 = 1
108: if e<f goto 111
     t3 = 0
     goto 112
111: t3 = 1
112: t4 = t2 AND t3
     t5 = t1 OR t4
Short-Circuit Example
a<b || c<d && e<f
translates into
    if a<b goto L1
    if c<d goto L2
    goto L3
L2: if e<f goto L1
L3: t = 0
    goto L4
L1: t = 1
L4: ...
Note: L3 might instead be the target E.false; L1 might instead be E.true;
no computation of a 0 or 1 into t might be needed at all.
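A recursive generator for this kind of jumping code might look like the sketch below, which passes E.true and E.false down the tree as inherited label numbers. The tree type `struct bexpr`, the helper `emit`, and the function `gen_bool` are all hypothetical names. The output is deliberately naive: it contains redundant gotos and labels that the hand-written code above avoids.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

enum bkind { B_REL, B_AND, B_OR };

/* Hypothetical boolean expression tree. */
struct bexpr {
    enum bkind kind;
    const char *rel;                    /* for B_REL, e.g. "a<b" */
    const struct bexpr *left, *right;   /* for B_AND and B_OR */
};

/* Append formatted text to the NUL-terminated buffer out (capacity cap). */
static void emit(char *out, size_t cap, const char *fmt, ...)
{
    size_t len = strlen(out);
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(out + len, cap - len, fmt, ap);
    va_end(ap);
}

/* Generate code that jumps to label t if e is true and to f if it is
   false. *next is the highest label number used so far. */
void gen_bool(const struct bexpr *e, int t, int f, int *next,
              char *out, size_t cap)
{
    int mid;
    switch (e->kind) {
    case B_REL:
        emit(out, cap, "if %s goto L%d\n", e->rel, t);
        emit(out, cap, "goto L%d\n", f);
        break;
    case B_OR:
        mid = ++*next;     /* left false: still have to try the right */
        gen_bool(e->left, t, mid, next, out, cap);
        emit(out, cap, "L%d:\n", mid);
        gen_bool(e->right, t, f, next, out, cap);
        break;
    case B_AND:
        mid = ++*next;     /* left true: still have to check the right */
        gen_bool(e->left, mid, f, next, out, cap);
        emit(out, cap, "L%d:\n", mid);
        gen_bool(e->right, t, f, next, out, cap);
        break;
    }
}
```

The right operand's code is never reached when the left operand already decides the result, which is exactly the short-circuit behavior; no 0 or 1 is ever stored in a temporary.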
While Loops
So, a while loop, like an if-then, would have attributes similar to:
| Production | Attribute Manipulations |
|---|---|
| S->while E do S1 | E.true = newlabel();
E.false = S.follow;
S1.follow = E.first;
S.code = gen(LABEL, E.first) || E.code || gen(LABEL, E.true)||
S1.code ||
gen(GOTO, E.first)
|
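As a rough C sketch of the while rule, under the same kind of simplifying assumptions as before: E is a single relational test given as a string, S1.code is passed in already generated, and the helper `emit` is a hypothetical name. The table treats E.first as a synthesized attribute; here it is simply allocated with `newlabel`.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical label allocator: labels print as L1, L2, ... */
static int label_count = 0;
int newlabel(void) { return ++label_count; }

/* Append formatted text to the NUL-terminated buffer out (capacity cap). */
static void emit(char *out, size_t cap, const char *fmt, ...)
{
    size_t len = strlen(out);
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(out + len, cap - len, fmt, ap);
    va_end(ap);
}

/* S -> while E do S1 */
void gen_while(char *out, size_t cap, const char *rel,
               const char *s1_code, int s_follow)
{
    int e_first = newlabel();   /* top of the loop; also S1.follow */
    int e_true = newlabel();    /* start of the loop body */

    emit(out, cap, "label L%d\n", e_first);
    emit(out, cap, "if %s goto L%d\n", rel, e_true);
    emit(out, cap, "goto L%d\n", s_follow);      /* E.false = S.follow */
    emit(out, cap, "label L%d\n", e_true);
    emit(out, cap, "%s", s1_code);
    emit(out, cap, "goto L%d\n", e_first);       /* re-test the condition */
}
```

The only structural difference from if-then is the label in front of E.code and the goto back to it after the body.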
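C for-loops are trivially transformed into while loops, so they pose no new
code generation issues: for (init; test; incr) body behaves like
init; while (test) { body; incr; }. The one caveat is that a continue
inside the body must jump to incr rather than straight back to the test.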