lecture #1 began here
In order to make this scheduled class time more bearable (for me), I propose to hold office hours Tuesday and Thursday after class from 12-1pm in the Corbett Center "food court". You may eat with me, or just ask questions. I am always going to be eating the Mexican "Daily" Special at the food court, which has not changed at NMSU in three years (The answer to official NM State Paradox: one red and one green).
Of course, I am also available in my office by appointment, and sometimes happy to take drop-in visits. E-mail works best though.
On the other hand, if we tried to look at, say, GCC as our compiler of study, it is very unlikely that most of you would enjoy the process. Most real compilers are quite complex, and were not designed to be "read". An exception is [Fraser/Hanson 95], see below.
For the reasons given above, in this course I propose to describe for you, and provide essential documentation on, my own open source compiler for the Unicon language, and the compilers for the Icon language upon which Unicon builds.
For example, figuring out how to implement high-level constructs has moved beyond while loops and function calls to more advanced topics such as object-orientation and concurrency. Improving the efficiency of generated code might mean special-purpose transformations that take advantage of novel features of modern hardware, or it might mean customizing a virtual machine, or mixing virtual machine code and native code.
One big difference between this course and the first compiler course is that in the first course, the major focus is on writing a compiler from scratch, starting from the ground up with lexical and syntax analysis. In this course, we will instead study some existing compilers that are part of a language family (the Icon and Unicon languages), and your assignments and semester project will add to or improve those compilers, rather than requiring you to write a whole compiler yourself.
The first two versions of the Icon language were developed in Ratfor in the late 1970's. Starting around 1980 this implementation was discarded and the language was rewritten from scratch in C on UNIX. A virtual machine was adopted and refined over a decade. Over time, many parts originally written in assembler were rewritten in C as the semantics of the language came to be better understood.
Around 1990, another radical rewrite was produced in order to support an optimizing compiler with much higher performance than the virtual machine. The new compiler was written from scratch, but its runtime system was produced by ingeniously adapting the code from the virtual machine. At approximately the time of Ralph Griswold's retirement in the mid-1990's, the Icon language was more or less frozen; the culminating work on the subject being the 3rd edition of "The Icon Programming Language", published in 1997.
But people in the Internet Age wanted applications to do more than they did in the 1980's. Additions to the open-source language were made by various authors. Around 2000, several such additions were merged together by a group led by Clint Jeffery under the name Unicon; the project was placed under the GPL and moved to SourceForge.net.
The most recent major addition to the language, done at NMSU, has been a set of portable high-level 3D graphics facilities, built on top of but much easier to learn and use than OpenGL. Current and ongoing work includes the addition of portable audio and video support, in order to accommodate the needs of Collaborative Virtual Environments. With such extensions, the goal is not to add access to a popular C or C++ or Java API, but rather to design higher level abstractions at the level most programmers are comfortable working, and then figure out how to provide those abstractions.
lecture #2 began here
           +-bin------ executable binaries and support files
           +-config--+ configurations
           |         +-unix---+
           |         |        +-intel_linux-
           |         |        +-sun_gcc-----
           |         +-win32---
           |         |        +-gcc---------
           |         |        +-msvc--------
 /-unicon--+-src------ source code (C)
           |         +-common--
           |         +-h-------
           |         +-icont---
           |         +-iconc---
           |         +-rtt-----
           |         +-runtime-
           +-tests---- tests
           |         +-general-
           |         +-posix---
           +-unicon--- source code (Unicon)
                     +-unicon--
lecture #3 began here
s ? expr causes a string s to be the subject of scanning by expr. Within expr, s is a default argument to all string functions, and a scanning position is maintained and used to move through the string looking for specified patterns. move(rel) and tab(abs) move to a position and return the substring between the old and new position. Several functions such as find() and many() return positions within the string that can be used as the destinations for tab(). The unary operator =x is tab(match(x)), moving the position past x if x is the next thing in the string.
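For those of you who think in C, it may help to picture the scanning environment as a subject string plus a position. The following is only a sketch under that assumption: struct scan and the helpers scan_tab(), scan_match() and scan_eq() are names invented here for illustration, and this is not how the Icon runtime implements scanning. Positions are 1-based, as in Icon, and only forward moves are handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a scanning environment: subject plus 1-based position. */
struct scan {
    const char *subject;
    size_t len;
    size_t pos;              /* 1 .. len+1, like Icon's &pos */
};

/* Like tab(abs): move to position newpos and copy the substring between the
 * old and new positions into out. Returns 0 ("failure") if newpos is out of
 * range or out is too small. Only forward moves are handled in this sketch. */
int scan_tab(struct scan *s, size_t newpos, char *out, size_t outsz)
{
    if (newpos < s->pos || newpos > s->len + 1) return 0;
    size_t n = newpos - s->pos;
    if (n + 1 > outsz) return 0;
    memcpy(out, s->subject + s->pos - 1, n);
    out[n] = '\0';
    s->pos = newpos;
    return 1;
}

/* Like match(x): if x is next in the subject, return the position just past
 * it, otherwise return 0 ("failure"). Does not move the position. */
size_t scan_match(const struct scan *s, const char *x)
{
    size_t n = strlen(x);
    if (s->pos - 1 + n > s->len) return 0;
    if (memcmp(s->subject + s->pos - 1, x, n) != 0) return 0;
    return s->pos + n;
}

/* Like =x, that is tab(match(x)): skip past x if it is next. */
int scan_eq(struct scan *s, const char *x)
{
    size_t p = scan_match(s, x);
    if (p == 0) return 0;
    s->pos = p;
    return 1;
}
```

Even this toy version shows how much bookkeeping (lengths, bounds checks, output buffers) the Icon scanning environment does for you.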
It is reasonable to ask whether Icon string scanning makes processing of strings by hand easier than doing it in C or Java, and whether Icon string scanning is similar to the scanning done by compiler scanners.
$include "ytab_h.icn"   # yacc's token categories
global yytext           # lexeme
global yyin             # source file we are reading
global yytoken          # token (a record)
global yylineno, yycolno, yyfilename    # source location
global O, D, L, H, R, FS, IS, W, idchars
procedure init_csets()
   O := '01234567'
   D := &digits
   L := &letters ++ '_'
   H := &digits ++ 'abcdefABCDEF'
   R := &digits ++ &letters
   FS := 'fFlL'
   IS := 'uUlL'
   W := ' \t\v'
   idchars := L ++ D
end
record token(tok, s, line, column, filename)
procedure reswords()
static t
initial {
t := table([Beginner+Ender, IDENT])
t["abstract"] := [0, ABSTRACT]
t["break"] := [Beginner+Ender, BREAK]
t["by"] := [0, BY]
t["case"] := [Beginner, CASE]
t["class"] := [0, CLASS]
t["create"] := [Beginner, CREATE]
t["default"] := [Beginner, DEFAULT]
t["do"] := [0, DO]
t["else"] := [0, ELSE]
t["end"] := [Beginner, END]
t["every"] := [Beginner, EVERY]
t["fail"] := [Beginner+Ender, FAIL]
t["global"] := [0, GLOBAL]
t["if"] := [Beginner, IF]
t["import"] := [0, IMPORT]
t["initial"] := [Beginner, iconINITIAL]
t["initially"] := [Ender, INITIALLY]
t["invocable"] := [0, INVOCABLE]
t["link"] := [0, LINK]
t["local"] := [Beginner, LOCAL]
t["method"] := [0, METHOD]
t["next"] := [Beginner+Ender, NEXT]
t["not"] := [Beginner, NOT]
t["of"] := [0, OF]
t["package"] := [0, PACKAGE]
t["procedure"] := [0, PROCEDURE]
t["record"] := [0, RECORD]
t["repeat"] := [Beginner, REPEAT]
t["return"] := [Beginner+Ender, RETURN]
t["static"] := [Beginner, STATIC]
t["suspend"] := [Beginner+Ender, SUSPEND]
t["then"] := [0, THEN]
t["to"] := [0, TO]
t["until"] := [Beginner, UNTIL]
t["while"] := [Beginner, WHILE]
}
return t
end
This "big-inhale" model did not work well on original 128K PDP-11 UNIX computers, but works well in this century. At present, the code assumes Unicon source files are less than a megabyte -- a lazy programmer's error. Although Unicon programs are much shorter than C programs, an upper limit of 1MB is bound to be reached someday. Homework: fix the lexical analyzer so it works correctly on files of arbitrarily large size.
procedure yylex_reinit()
yytext := ""
yylineno := 0
yycolno := 1
lastchar := ""
if type(yyin) == "file" then
buffer := reads(yyin, 1000000)
else
buffer := yyin
tokflags := 0
end
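The 1MB assumption comes from the fixed size passed to reads(). For comparison, here is how a big-inhale with no size limit might look in C. This is a sketch only; slurp() is a hypothetical helper written for these notes, not part of the Icon or Unicon sources, and it simply doubles its buffer whenever the buffer fills up.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read all of fp into a malloc'ed, NUL-terminated buffer, growing it as
 * needed. Stores the byte count in *lenp. Returns NULL on allocation failure
 * or read error. The caller frees the result. */
char *slurp(FILE *fp, size_t *lenp)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;
    for (;;) {
        /* always leave one byte free for the terminating NUL */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (n == 0) break;                  /* end of file or error */
        if (cap - len - 1 == 0) {           /* full: double the capacity */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) { free(buf); return NULL; }
    buf[len] = '\0';
    *lenp = len;
    return buf;
}
```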
This little procedure is entirely hidden from the regular lexical analyzer code by writing that regular code in a helper function yylex2(), and writing the semicolon insertion logic in a yylex() function that calls yylex2 when it needs a new token.
Initialization for the yylex() function shows the static variables used to implement the one token of lookahead. If the global variable buffer doesn't hold a string anymore, /buffer will succeed and it must be that we are at end-of-file and should return 0.
procedure yylex()
static saved_tok, saved_yytext
local rv, ender
initial {
if /buffer then
yylex_reinit()
}
if /buffer then {
if \debuglex then
write("yylex() : 0")
return 0
}
If we inserted a semicolon last time we were called, the saved_tok will
be the first token of the next line; we should return it.
if \saved_tok then {
rv := saved_tok
saved_tok := &null
yytext := saved_yytext
yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
if \debuglex then
write("yylex() : ",tokenstr(rv), "\t", image(yytext))
return rv
}
Otherwise, we should obtain the next token by calling yylex2().
We have to check for end of file, remember if the last token could
end an expression, call yylex2(), and update buffer to be the smaller
string remaining after the token.
ender := iand(tokflags, Ender)
tokflags := 0
if *buffer=0 then {
buffer := &null
if \debuglex then
write("yylex() : EOFX")
return EOFX
}
buffer ? {
if rv := yylex2() then {
buffer := tab(0)
}
else {
buffer := &null
yytext := ""
if \debuglex then
write("yylex() : EOFX")
return EOFX
}
}
After fetching a new token, we have to decide whether to insert a semicolon
or not. This is based on the local variable ender (whether the previous token
could end an expression) and the global variable tokflags (which holds both
whether the current token could begin an expression, and whether a newline
occurred between the last token and the current token). iand() is a bitwise
AND, equivalent to the C language & operator, used to pick bits out of a set
of boolean flags encoded as bits within an integer.
if ender~=0 & iand(tokflags, Beginner)~=0 & iand(tokflags, Newline)~=0 then {
saved_tok := rv
saved_yytext := yytext
yytext := ";"
rv := SEMICOL
}
Returning a token requires allocation of a token() record instance, which
is stored in a global variable.
yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
if \debuglex then
write("yylex() : ", tokenstr(rv), "\t", image(yytext))
return rv
end
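The semicolon insertion test in yylex() boils down to three bit tests. Written in C, where iand() becomes the & operator, it might look like the sketch below. The flag values and the function name need_semicolon() are assumptions made for illustration; the real flag values are whatever the Unicon sources define.

```c
/* Hypothetical flag values; the real ones are defined by the Unicon sources. */
enum { Beginner = 1, Ender = 2, Newline = 4 };

/* prev_flags: flags of the previous token.
 * cur_flags: flags of the token just fetched, including Newline if a line
 * break came between the two tokens.
 * Returns nonzero if a semicolon should be inserted between them. */
int need_semicolon(int prev_flags, int cur_flags)
{
    return (prev_flags & Ender) != 0
        && (cur_flags & Beginner) != 0
        && (cur_flags & Newline) != 0;
}
```

For example, an identifier (both a Beginner and an Ender) followed on the next line by another identifier gets a semicolon between them, while the same two tokens on one line do not.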
procedure yylex2()
static punc_table
initial {
init_csets()
reswords := reswords()
punc_table := table(uni_error)
punc_table["'"] := do_literal
punc_table["\""] := do_literal
punc_table["!"] := do_bang
punc_table["%"] := do_mod
punc_table["&"] := do_and
punc_table["*"] := do_star
punc_table["+"] := do_plus
punc_table["-"] := do_minus
punc_table["."] := do_dot
punc_table["/"] := do_slash
punc_table[":"] := do_colon
punc_table["<"] := do_less
punc_table["="] := do_equal
punc_table[">"] := do_greater
punc_table["?"] := do_qmark
punc_table["@"] := do_at
punc_table["\\"] := do_backslash
punc_table["^"] := do_caret
punc_table["|"] := do_or
punc_table["~"] := do_tilde
punc_table["("] := do_lparen
punc_table[")"] := do_rparen
punc_table["["] := do_lbrack
punc_table["]"] := do_rbrack
punc_table["{"] := do_lbrace
punc_table["}"] := do_rbrace
punc_table[","] := do_comma
punc_table[";"] := do_semi
punc_table["$"] := do_dollar
every punc_table[!&digits] := do_digits
every punc_table["_" | !&letters] := do_letters
}
The main lexical analyzer code strips comments and whitespace, and
calls the function table for the first non-whitespace character it
finds. Note support for #line directives, and the use of string
scanning.
yycolno +:= *yytext
repeat {
if pos(0) then fail
if ="#" then {
if ="line " then {
if yylineno := integer(tab(many(&digits))) then {
=" \""
yyfilename := tab(find("\"")|0)
}
}
tab(find("\n") | 0)
next
}
if ="\n" then {
yylineno +:= 1
yycolno := 1
if tokflags < Newline then
tokflags +:= Newline
next
}
if tab(any(' ')) then { yycolno +:= 1; next }
if tab(any('\v\^l')) then { next }
if tab(any('\t')) then {
yycolno +:= 1
while (yycolno-1) % 8 ~= 0 do yycolno +:= 1
next
}
yytext := move(1)
return punc_table[yytext]()
}
end
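The tab handling in the loop above advances the column to the next tab stop, assuming tab stops every 8 columns. The same arithmetic in C, as a small sketch (column_after_tab() is a name made up for this example):

```c
/* Given the 1-based column at which a tab character sits, return the column
 * of the character that follows it, assuming tab stops at columns
 * 1, 9, 17, ... This mirrors the while loop in yylex2(). */
int column_after_tab(int col)
{
    col += 1;
    while ((col - 1) % 8 != 0)
        col += 1;
    return col;
}
```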
The functions in the punctuation table select integer codes and match
the rest of the lexeme. do_comma() illustrates an unambiguous token
selection, while do_plus() illustrates a more common case where the
"+" character could start any of 5 different tokens depending on the
character(s) that follow it. Tokens starting with "letters" are looked
up in a reserved words
table, which tells whether they are special, or just a variable name.
procedure do_comma()
return COMMA
end
procedure do_plus()
if yytext ||:= =":" then {
if yytext ||:= ="=" then { return AUGPLUS }
return PCOLON
}
if yytext ||:= ="+" then {
if yytext ||:= =":=" then {return AUGUNION}
return UNION
}
tokflags +:= Beginner
return PLUS
end
procedure do_letters()
yytext ||:= tab(many(idchars))
x := reswords[yytext]
tokflags +:= x[1]
return x[2]
end
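For comparison with do_plus(), here is roughly what the same longest-match decision looks like when written by hand in C, examining one character at a time. This is a sketch, not code from the Icon sources: the enum values and the function scan_plus() are hypothetical, and the input is assumed to be a NUL-terminated string whose first character is '+'.

```c
#include <string.h>

/* Hypothetical token codes; the real ones come from the yacc-generated header. */
enum plus_tok { PLUS, PCOLON, AUGPLUS, UNION, AUGUNION };

/* s points at a '+' character in a NUL-terminated string. Selects the longest
 * token starting there and stores its length in *lenp. The branches follow
 * the same order as do_plus(). */
enum plus_tok scan_plus(const char *s, size_t *lenp)
{
    if (s[1] == ':') {
        if (s[2] == '=') { *lenp = 3; return AUGPLUS; }   /* +:=  */
        *lenp = 2; return PCOLON;                         /* +:   */
    }
    if (s[1] == '+') {
        if (strncmp(s + 2, ":=", 2) == 0) { *lenp = 4; return AUGUNION; } /* ++:= */
        *lenp = 2; return UNION;                          /* ++   */
    }
    *lenp = 1;
    return PLUS;                                          /* +    */
}
```

Note that the C version has to report the lexeme length separately, whereas the Unicon version gets it for free from the scanning position.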
lecture #4 began here
buffer := reads(yyin, 1000000)
buffer := ""; while buffer ||:= reads(yyin, 1000000)
buffer := reads(yyin, stat(yyin).size)
The lexical analyzer has a header src/h/lexdef.h containing some global declarations and useful macros. src/common/lextab.h contains a "token table" similar to Unicon's reswords table, and an "operator table" encoding for each operator both its beginner/ender flags and whether it is unary, binary, or is used both ways. lextab.h also includes a function getopr() that takes 400 lines of switches and if-statements to examine characters one at a time and calculate the longest operator when several operator characters are adjacent.
Although the lexical analyzer is mostly handwritten, there are a couple of juicy specification files, src/common/tokens.txt and src/common/op.txt, that define the tokens and operators; these are processed by an Icon program (src/common/mktoktab.icn) to generate certain .h files in the ultrarare event that a new token or operator is added. These operations are so rare that the Makefiles do not even consider them by default.
The main "lexical analyzer" file is src/common/yylex.h. The logic in this lexical analyzer is similar to Unicon's lexical analyzer. Once semicolon insertion, whitespace, and comments are handled and a new actual token is being examined, the code boils down to four possibilities:
if (isalpha(c) || (c == '_')) { /* gather ident or reserved word */
if ((t = getident(c, &cc)) == NULL)
goto loop;
}
else if (isdigit(c) || (c == '.')) { /* gather numeric literal or "." */
if ((t = getnum(c, &cc)) == NULL)
goto loop;
}
else if (c == '"' || c == '\'') { /* gather string or cset literal */
if ((t = getstring(c, &cc)) == NULL)
goto loop;
}
else { /* gather longest legal operator */
if ((n = getopr(c, &cc)) == -1)
goto loop;
t = &(optab[n].tok);
yylval = OpNode(n);
}
Aside from syntax and the larger number of lines of code, some of the main
differences between the C-based lexical analyzer and the Unicon lexical
analyzer for this language include:
yyin := ""
every yyin ||:= preprocessor(fName, uni_predefs) do yyin ||:= "\n"
#begdef foo(x)
...body of foo can be as long as you like
...more body of foo ...
#enddef
Note that the conventional ANSI C preprocessor has already "tokenized" the input into chunks in order to look for macro names and such; the rtt lexical analyzer gets text chunks one at a time, and categorizes them. src/rtt/rttlex.c is only 430 lines of code, but the C preprocessor that is doing much of the work is another 6500 lines, in the src/preproc/ directory! rttlex.c also keeps its size down by utilizing the symbol table code (a hash table) to handle the reserved words.
There is some amount of lexical awkwardness in rtt due to the extended C syntax. For example, p ** q in regular C might mean "multiply p by what q points at"; under rtt C code fragments may use this meaning but the extra syntax (for type inferencing) uses ** as a set intersection operator, and a global variable flag has to keep track and change the lexical analyzer behavior depending on whether one is in "type mode" or "regular C mode".
The yylex() function, since its preprocessor has already provided it with the sequence of characters that will comprise the token, is mainly looking at the characters to select what integer code to return. For example, for strings that might be identifiers:
if (yylval.t->tok_id == Identifier) {
/*
* See if this is an identifier, a reserved word, or typedef name.
*/
sym = sym_lkup(yylval.t->image);
if (sym != NULL)
yylval.t->tok_id = sym->tok_id;
}
do
tok = cpp_get_token (parse_in);
while (tok->type == CPP_PADDING);
...
switch (tok->type) {
/* 10 branches for different C preprocessor categories,
that require special handling, including errors and constants
*/
}
return tok->type;
Some apparent lexical-analysis information is located more in the parser
file, c-parse.y. For example, a reserved words array (reswords) lives
in that file, including the actual spelling of all the C reserved
words. There are extra complications from the fact that the gcc C front
end supports gcc, g++, and GNU Objective C.
At first glance, the lexical analyzer appears to live in Parser/tokenizer.[ch], about 1500 lines of C. The lexical analyzer is interesting in that Python is often used in an interactive interpreter mode; the lexical analyzer has to worry about whether it is reading from a file or a terminal.
There are about 53 token categories in Python. A lot of the lexical analyzer code handles Unicode and includes tricks for international character handling; for example, the source encoding is specified in an optional special comment.
Because the lexical analyzer is feeding an interpreter, Python tokens are full-blown Python objects, not just C structs.
The main lexical analysis function is int tok_get(tok, pstart, pend), 400 lines or so, which reads from a function tok_nextc(), calculates an indentation level (used in Python parsing), and handles a lot of weird special cases. One point here, true in most lexical analyzers, is the need to look ahead at characters in order to decide what token a certain character might be. If you are string scanning, lookahead is very simple, but if you are working with file I/O it can get complicated.
do {
c = tok_nextc(tok);
} while (c == ' ' || c == '\t' || c == '\014');
...
if (c == '#') { /* skip comment code */... }
if (c == EOF) { /* return ENDMARKER or ERRORTOKEN */ }
if (isalpha(c) || c == '_') { /* identifier */ }
if (c == '\n') { /* newline */ }
if (c == '.') { /* period, or number starting with period? */ }
if (isdigit(c)) { /* number */ }
if (c == '\'' || c == '"') { /* string */ }
if (c == '\\') { /* line continuation */ }
...
/* Check for two-character token */
{
int c2 = tok_nextc(tok);
int token = PyToken_TwoChars(c, c2);
if (token != OP) {
int c3 = tok_nextc(tok);
int token3 = PyToken_ThreeChars(c, c2, c3);
if (token3 != OP) {
token = token3;
} else {
tok_backup(tok, c3);
}
*p_start = tok->start;
*p_end = tok->cur;
return token;
}
tok_backup(tok, c2);
}
...
return PyToken_OneChar(c);
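The tok_backup() calls above are the file I/O version of lookahead: read one character too many, then give it back. With plain C stdio the same idea can be sketched using getc() and ungetc(). The token codes and the function next_eq_token() below are invented for this illustration and have nothing to do with Python's actual token numbering.

```c
#include <stdio.h>

/* Hypothetical token codes for this sketch only. */
enum { TOK_EOF, TOK_ASSIGN, TOK_EQEQ, TOK_OTHER };

/* Reads one token from fp, distinguishing "=" from "==". One character of
 * lookahead is pushed back with ungetc() when it turns out not to be part
 * of the token. Any other character is returned as TOK_OTHER. */
int next_eq_token(FILE *fp)
{
    int c = getc(fp);
    if (c == EOF) return TOK_EOF;
    if (c != '=') return TOK_OTHER;
    int c2 = getc(fp);
    if (c2 == '=') return TOK_EQEQ;
    if (c2 != EOF) ungetc(c2, fp);   /* not ours: give it back */
    return TOK_ASSIGN;
}
```

Standard C only guarantees one character of pushback with ungetc(), which is one reason tokenizers that need deeper lookahead tend to manage their own buffers.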
Cute code:
int
PyToken_OneChar(int c)
{
switch (c) {
case '(': return LPAR;
case ')': return RPAR;
...
default: return OP;
}
}
int
PyToken_TwoChars(int c1, int c2)
{
switch (c1) {
case '=':
switch (c2) {
case '=': return EQEQUAL;
}
break;
...
return OP;
}
int
PyToken_ThreeChars(int c1, int c2, int c3)
{
switch (c1) {
case '<':
switch (c2) {
case '<':
switch (c3) {
case '=':
return LEFTSHIFTEQUAL;
}
break;
}
break;
...
return OP;
}
lecture #5 began here
The start symbol for the grammar is named program, and the
semantic action code fragment for this nonterminal calls the rest of the
compiler (semantic analysis and code generation) directly on the root of
the syntax tree, rather than storing it in a global variable for the
main() procedure to examine.
program : decls EOFX { Progend($1);} ;
Many context free grammar rules are recursive, with an empty production to
terminate the recursion. The rule for declarations is typical:
decls : { $$ := EmptyNode }
| decls decl {
if yynerrs = 0 then iwrites(&errout,".")
$$ := node("decls", $1, $2)
} ;
The "semantic action" (code fragment) for every production rule builds a
syntax tree node and assigns it to $$ for the nonterminal left-hand side of
the rule.
Another common grammar pattern is a production rule that has many different alternatives, such as the one for individual declarations:
decl : record
| proc
| global
| link
| package
| import
| invocable
| cl
;
For such "unary" productions, the child's syntax tree node suffices for the parent; no
new tree node is needed.
Some nonterminals mostly correspond to a specific sequence of terminals, as is the case for package references:
packageref : IDENT COLONCOLON IDENT { $$ := node("packageref", $1,$2,$3) }
| COLONCOLON IDENT { $$ := node("packageref", $1,$2) }
;
The lexical analyzer has already constructed a valid "leaf" for each terminal symbol, so if a production rule has only one terminal symbol in it, for a syntax tree we can simply use the leaf for that nonterminal (for a parse tree, we would need to allocate an extra unary internal node):
lnkfile : IDENT ; | STRINGLIT ;
The expressions (which comprise about half of the grammar) use a separate nonterminal for each level of precedence instead of YACC's tricks for resolving precedence. This may be up to around 20 levels of nonterminals. A typical rule looks like:
expr6 : expr7 ;
| expr6 PLUS expr7 { $$ := node("Bplus", $1,$2,$3);} ;
| expr6 DIFF expr7 { $$ := node("Bdiff", $1,$2,$3);} ;
| expr6 UNION expr7 { $$ := node("Bunion", $1,$2,$3);} ;
| expr6 MINUS expr7 { $$ := node("Bminus", $1,$2,$3);} ;
The "B" stands for "binary", to distinguish these operators from their unary brethren.
The 20 levels of nonterminals approach is inherited from Icon
and probably makes the parser way larger than it has to be.
Potential syntax homework!
lecture #6 began here
Icon employed a relatively clever approach to doing syntax error messages with YACC -- the parse state at the time of error was enough to do fairly good diagnoses. But, every time the grammar changed, the parse state numbers could change wildly. For Unicon I developed the Merr tool, which associates parse error example fragments with the corresponding diagnostic error message, and detects/infers the parse state for you, reducing the maintenance problem when changing the grammar. Merr also considers the current input token in deciding what error message to emit, making it fundamentally more precise than Icon's approach.
proc : prochead SEMICOL locals initial procbody END {
Proc1($1,$2,$3,$4,$5,$6);
} ;
The definitions of macros such as Proc1 are in icont/tgrammar.c and the
analogous cgrammar.c file. For the compilers, they generally do the typical
syntax tree construction, as in
#define Proc1(x1,x2,x3,x4,x5,x6) $$ = tree6(N_Proc,x1,x1,x4,x5,x6)
In icont, there are many exceptions to this general case, as many kinds of work seem to be done during parsing. The macros do not all assign to $$, especially for unary nodes with a leaf/token on the righthand side, where the leaf can be propagated up to the parent, as for IDENT.
For all its generality, the Icon grammar hardwires processing at a global declaration level. It seems to require code generation on a per-procedure basis, since the rule for declarations has no macro for combining them:
decls : ; | decls decl ;
This seems to make sense for icont but not for iconc, which does substantial analysis on the entire program; iconc is most likely doing its whole program analysis using the global symbol table in which all declarations are stored. In any case there should probably be a macro for decls:decls decl.
For another thing, the Icon grammar repeatedly uses a technique which I have scrupulously avoided: putting semantic actions in the middle of a rule.
prochead: PROCEDURE IDENT {Prochead1($1,$2);} LPAREN arglist RPAREN {
Prochead2($1,$2,$3,$4,$5,$6);
} ;
The reason to avoid this is that YACC implements a mid-rule action as a new
nonterminal with an empty (epsilon) production, and adding new epsilon
productions frequently introduces reduce-reduce conflicts, which can be a
major problem. When is it safe to use this technique? When there is only one
possible parse at the location of the epsilon. Why do it? Because (in this
example) you might want to create some data structures for later rules to
populate.
A more urgent tweak is to replace calls to yyerror() to give the Icon compilers more control over their syntax error messages. A regular call yyerror("syntax error") is replaced by a call that passes the current input token and parse state in (yyerror(yychar, yylval, yy_state)). Other messages (mainly stack overflow) are rerouted as internal parser errors rather than programmer errors.
To get rid of pscript, the best solution would be to switch icont/iconc over to using the Merr syntax error generator tool.
rttgram.y stores three different kinds of things on the YACC value stack: token pointers, treenode pointers, and long integers. It uses many different syntax tree node constructors, with names indicating how many children. node0..node5 all take an int label, a token pointer, and a number of node *'s. The token pointer is a lexical "anchor" for error reporting purposes. The use of 6 functions is avoiding C's ... variable arguments construct for no good reason.
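For the curious, a single variadic constructor could replace the node0..node5 family. The sketch below assumes a simplified node layout invented for this example (rtt's real struct differs, and the token "anchor" is omitted); mknode() is a hypothetical name.

```c
#include <stdarg.h>
#include <stdlib.h>

/* Hypothetical tree node for this sketch; rtt's real node struct differs. */
struct node {
    int label;
    int nkids;
    struct node *kids[];     /* flexible array member (C99) */
};

/* One constructor for any number of children:
 *     mknode(label, n, kid1, ..., kidn)
 * Returns NULL if allocation fails. The caller frees the node. */
struct node *mknode(int label, int nkids, ...)
{
    struct node *n = malloc(sizeof *n + (size_t)nkids * sizeof n->kids[0]);
    if (n == NULL) return NULL;
    n->label = label;
    n->nkids = nkids;
    va_list ap;
    va_start(ap, nkids);
    for (int i = 0; i < nkids; i++)
        n->kids[i] = va_arg(ap, struct node *);
    va_end(ap);
    return n;
}
```

The trade-off is that the compiler cannot check the number or types of the children passed after nkids, so a miscounted call compiles cleanly and fails at run time.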
There is something to be said for a system that allows an entire large language's grammar to be specified in 107 lines of code. However, the parser generator has no "semantic action" flexibility, so modifying the parser or adapting it for other tools is a painful exploration of C code that lives nowhere near the grammar rules, and it is not easy to customize behavior on a per-rule basis as is done heavily in Unicon/Icon/GCC.
lecture #7 began here
The external public interface of the preprocessor is line-oriented, consisting of a generator preprocessor(filename, predefinedsyms) which suspends each line of the output, one after another. Its invocation from the main() procedure looks like:
yyin := ""
every yyin ||:= preprocessor(fName, uni_predefs) do yyin ||:= "\n"
Since the preprocessor outputs line-by-line, there is a mismatch between it and the lexical analyzer's big-inhale model. The preprocessor could be modified to fit better with the lexical analyzer or vice versa.
The preprocessor function takes the filename to read from, along with a table of predefined symbols which allows the preprocessor to respond to lines like
$ifdef _SQL
based on what libraries are available and how Unicon was built on a given platform.
The preprocessor() function itself starts each call off with initializations:
static nonpunctuation
initial {
nonpunctuation := &letters ++ &digits ++ ' \t\f\r'
}
preproc_new(fname,predefined_syms)
The initialization code opens fname, creates empty stacks to keep track of nested
$ifdef's and $include's, initializes counters to 0 and so forth.
The preprocessor is line-oriented. For each line, it looks for a preprocessor directive, and if it does not find one, it just scans for symbols to replace and returns the line. The main loop looks like
while line := preproc_read() do line ? {
preproc_space() # eat whitespace
if (="#" & match("line")) | (="$" & any(nonpunctuation)) then {
suspend preproc_scan_directive()
}
else {
&pos := 1
suspend preproc_scan_text()
}
}
The procedures preproc_scan_directive() and preproc_scan_text() work on
special and ordinary lines, respectively. The line is not a parameter
because it is held in the current string scanning environment. The
preproc_scan_directive() starts by discarding whitespace and identifying
the first word on the line (which must be a valid preprocessor directive).
A case expression handles the various directives (define, undef, ifdef, etc.).
Defined symbols are stored in a table. $ifdef and $ifndef are handled using
a global variable preproc_if_state to track the boolean conditions.
A count of $ifdef's is maintained, in order to handle matching $endif's.
Include files are handled using a stack, but an additional set of filenames is kept to prevent infinite recursion when files include each other. When a new include directive is encountered it is checked against the preproc_include_set and if OK, it is opened. The including file (and its associated name, line, etc) are pushed onto a list named preproc_file_stack. It is possible to run out of open files under this model, although this is not easy under modern operating systems.
Include files are searched on an include file path, consisting of a list of directories given on an optional environment variable (LPATH) followed by a list of standard directories. The standard directories are expected to be found relative to the location of the virtual machine binaries.
The procedure preproc_scan_text has the relatively simple job of replacing any symbols by their definitions within an ordinary source line. Since macros do not have parameters, it is vastly simpler than in a C preprocessor. The main challenges are to avoid macro substitutions when a symbol is in a comment or within quotes (string or cset literals). An additional issue is to handle multiline string literals, which occur in Icon when a string literal is not closed on a line, and instead the line ends with an underscore indicating that it is continued on the next line. Skipping over quoted text sounds simple, but is trickier than it looks. Escape characters mean you can't just look for the closing quote without considering what comes before it, and you can't just look at the preceding character since it might have been escaped, as in "\\". The code looks similar to:
repeat {
while tab(upto('"\\')) do {
case move(1) of {
"\\": move(1)
default: {
break break
}
}
}
# ...
if not match("_",,-1) then
break
&subject := preproc_read() | fail
# ...
}
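To see why escapes make this trickier than it looks, here is the core of the quote-skipping loop as a C sketch. It is not taken from ipp.c or any other Icon source; find_close_quote() is a hypothetical helper, and it ignores the underscore line-continuation case entirely.

```c
#include <stddef.h>

/* s points just past an opening quote character q. Returns a pointer to the
 * matching closing quote, or NULL if the text ends first. A backslash escapes
 * whatever character follows it, so an escaped quote or an escaped backslash
 * never ends the literal by itself. */
const char *find_close_quote(const char *s, char q)
{
    while (*s != '\0') {
        if (*s == '\\') {
            if (s[1] == '\0') return NULL;   /* dangling backslash */
            s += 2;                          /* skip the escaped character */
        } else if (*s == q) {
            return s;
        } else {
            s++;
        }
    }
    return NULL;
}
```

The key point is that escapes are consumed left to right, two characters at a time, rather than by peeking backward from a candidate closing quote.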
The code in preproc_read() for reading a line does a regular Icon read(); end of file causes the preprocessor file_stack to be popped for the previous file's information. Performance has not been perceived as a significant problem, but it would be interesting to convert preproc_read() to use a big-inhale model to see if any statistical difference could be observed. When an include is encountered under a big-inhale, the saved state would contain the string of remaining file contents, instead of the open file value.
typedef struct fstruct { /* input file structure */
struct fstruct *prev; /* previous file */
char *fname; /* file name */
long lno; /* line number */
FILE *fp; /* stdio file pointer */
int m4flag; /* nz if preprocessed by m4 */
int ifdepth; /* $if nesting depth when opened */
} infile;
Compared with Unicon, there is extra code for handling standard input ("-") and
a command-line option to pipe the input through the m4 preprocessor before
the Icon preprocessor works on it! This might be overengineering a bit, and
it is a portability problem (works only on UNIX).
The list of directives is given as an array whose elements include the name (used in an old-fashioned linear search) and a function pointer for handling the various directives. The array of function pointers helps organize the code, but one can't help thinking a binary search or a hash table might be good. However, the search is performed only once per preprocessor directive, directives are rare in most programs, and there are only 10 of them.
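The general shape of such a table is sketched below. The three handlers and their return codes are placeholders invented for this example (the real table has 10 entries and real handlers), and dispatch_directive() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical handlers; each returns a code so the sketch can be checked. */
static int do_define(const char *arg) { (void)arg; return 1; }
static int do_undef(const char *arg)  { (void)arg; return 2; }
static int do_ifdef(const char *arg)  { (void)arg; return 3; }

/* Table pairing each directive name with the function that handles it. */
static const struct directive {
    const char *name;
    int (*handler)(const char *arg);
} directives[] = {
    { "define", do_define },
    { "undef",  do_undef  },
    { "ifdef",  do_ifdef  },
};

/* Linear search by name; returns the handler's result, or -1 if the
 * directive is unknown. */
int dispatch_directive(const char *name, const char *arg)
{
    size_t n = sizeof directives / sizeof directives[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(directives[i].name, name) == 0)
            return directives[i].handler(arg);
    return -1;
}
```

With a table this small, the linear search costs at most a handful of strcmp() calls per directive line.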
The C preprocessor interface is character-oriented, not line-oriented. This allows function ppch() to substitute for whatever character fetching function the lexical analyzer was using prior to the preprocessor's existence. ppch() has some semi-complicated buffering and is a gigantic, messy function (150 lines).
Symbol definitions are stored in yet another implementation of C hash tables. There are several in the compiler, each a little different but mostly redundant.
Since it does a similar job to ipp.c, only more, there are a lot of similar features in rtt's preprocessor. As far as I know, no attempt was made to share code. ANSI C has sticky, detailed rules for how macro parameters are applied, which are better than those of pre-ANSI C, whose preprocessor semantics were not entirely well-defined. Besides the complex parameterized macros, a C preprocessor has to have a mini-expression evaluation mechanism to handle complex boolean conditional expressions such as
#if (THIS || THAT) && (defined(OTHER) || (ALT1 && ALT2))
This almost amounts to the whole C expression evaluation rules, all over again in the preprocessor (preproc/evaluate.c).
Because RTT is written in C and Icon runs on many operating systems, there is a fair amount of conditional code in the preprocessor, especially in the handling of the search through standard system directories for include files. UNIX has certain conventions, but many platforms have multiple C compilers installed, and some platforms (anyone use MVS?) have complicated naming for system include's. Some of this #ifdef code probably needs to be removed as it refers to dead compilers on ancient platforms.
lecture #8 began here
Ultimately, semantics means meaning, and the true "meaning" of a program isn't just a property of the source code, but of the semantics of the language and of the underlying platform (the CPU or virtual machine, the runtime system, and the operating system). For this reason we will be studying semantics long after we move beyond syntax trees and the information the compiler is able to add to them. Similarly, the analyses performed by the optimization phases of a compiler revolve around understanding the program's semantics well enough to handle special cases while preserving correct behavior.
In conventional YACC, a %union declaration is necessary to handle the varying types of objects on the value stack, including the type used for syntax tree nodes, but iyacc has no need of this awkward mechanism: the value stack, like all structure types, can hold any type of value in each slot. Similarly, tree nodes can hold children of any type, potentially eliminating any awkwardness of mixing tokens and internal nodes. Of course, you do still have to check what kind of value you are working with.
record treenode(label, children)
holds one node's worth of information. For convenience, a procedure node(label, kids[]) takes an arbitrary number of parameters and constructs the list of children for you. Leaves have a null children field.
Earlier we saw that the start symbol of the Unicon grammar had a semantic action that called a procedure Progend(). We will cover most of that procedure next week since it is all about object-orientation, but at the end of Progend(), a call to yyprint() performs the tree traversal for code generation. A classic tree traversal pattern would look like:
procedure traverse(node)
if node is an internal node {
every child := ! node.children do traverse(child)
generate code for this internal node (postfix)
}
else
generate code for this leaf
end
The code generator traversal yyprint() is a lot more complicated than that,
but fits the general pattern. The main work done at various nodes is to
write some text to the output file, yyout. Most ordinary internal nodes are
of type treenode as described above. But because there are several kinds of
internal nodes and several kinds of leaves, the "if node is an internal node"
is implemented as a case expression. Besides a regular treenode, the other
kinds of internal nodes are objects of type declaration, class, and argument
list. For regular treenodes, another case expression on the node's label
field is used to determine
what kind of code to generate, if any, besides visiting children and
generating their code.
The default behavior for an internal node is to just visit the children, generating their code. For ordinary syntax constructs (if, while, etc.) this works great and a copy of the code is written out, token by token. But several exceptions occur, mainly for the pieces of Unicon syntax that extend Icon's repertoire. For example, packages and imports are not in Icon and require special treatment.
procedure yyprint(node)
static lasttok
case type(node) of {
"treenode" : {
case node.label of {
"package": { } # handled by semantic analysis
"import": { print_imports(node.children[2]) }
# implement packages via name mangling
"packageref": {
if *node.children = 2 then
yyprint(node.children[2]) # ::ident
else { # ident :: ident
yyprint(node.children[1])
writes(yyout, "__")
outcol +:= ((* writes(yyout, node.children[3].s)) + 2)
}
}
New syntax constructs such as procedure parameter defaults and type
restrictions, and variable initializers, are other examples where
the default traversal would output things illegal in Icon.
They are implemented by skipping some of
the children (assignment and value) in the regular pass, and
adding extra code elsewhere, discussed below.
"varlist2"|"stalist2": { yyprint(node.children[1]) }
"varlist4"|"stalist4": {
yyprint(node.children[1])
yyprint(node.children[2])
yyprint(node.children[3])
}
Much of this special logic is orchestrated by the code for traversing a
procedure; it can visit its arguments and variable declarations and apply
special rules to them.
"proc": {
yyprint(node.children[1])
every yyprint(node.children[2 to 3])
if exists_statlists(node.children[3]) then {
ini := node.children[4]
yyprint("\ninitial {")
if ini ~=== EmptyNode then { # append into existing initial
yyprint(ini.children[2])
yyprint(";\n")
}
yystalists(node.children[3])
yyprint("\n}\n")
}
else
every yyprint(node.children[4])
(node.children[1].fields).coercions()
yyvarlists(node.children[3])
yyprint(node.children[5])
yyprint(node.children[6])
}
The default behavior of visiting one's children is very simple, as is the
handling of other kinds of internal nodes, which are objects.
For the objects, a method Write() is invoked.
"error": fail
default:
every yyprint(!node.children)
}
"declaration__state" | "Class__state" | "argList__state":
node.Write(yyout)
The outer case expression of yyprint() continues with various kinds of leaf
(token) nodes. These mainly know how to write their lexemes out. But, a
lot of effort is made to try to keep line and column number information
consistent. Variables outline and outcol are maintained as each token is
written out. Integers and string literals found in the syntax tree are
written out as themselves. Since they have no attached lexical attributes,
they are a bit suspect in terms of maintaining debugging consistency.
It turns out the reason they occur at all, and the reason they have no
source lexical attributes, is that artificial syntax subtrees
are generated to handle certain object-oriented constructs, and within
those subtrees strings and integers may be placed, which do not correspond
to anywhere in the source code.
"integer": {
writes(yyout, node); outcol +:= *string(node)
}
"string": {
node ? {
while writes(yyout, tab(find("\n")+1)) do {
outline+:=1; outcol:=1;
}
node := tab(0)
}
writes(yyout, node); outcol +:= *node
}
"Normally", tokens are written out at exactly the line and column they
appear at in the source code. But a myriad of constructs may bump them
around. If the output falls behind (in lines, or columns) extra whitespace
can be inserted to stay in sync. If output gets ahead by lines, a #line
directive can back it up, but if output gets ahead by columns, there is
nothing much one can do, except make sure subsequent tokens don't
accidentally get attached/concatenated onto earlier tokens. This occurs,
for example, when the output code for an object-oriented construct in an
expression is longer than the source expression, perhaps due to name mangling.
Specific token combinations are checked, but the list here may be incomplete
(possible BUG!).
For source tokens, not only might the line and column change, the
filename could be different as well.
"token": {
if outfilename ~== node.filename | outline > node.line then {
write(yyout,"\n#line ", node.line-1," \"", node.filename,"\"")
outline := node.line
outcol := 1
outfilename := node.filename
}
while outline < node.line do {
write(yyout); outline +:= 1; outcol := 1
}
if outcol >= node.column then {
# force space between idents and reserved words, and other
# deadly combinations (need to add some more)
if ((\lasttok).tok = (IDENT|INTLIT|REALLIT) & reswords[node.s][2]~=IDENT)|
(((\lasttok).tok = NMLT) & (node.tok = MINUS)) |
((\lasttok).tok = node.tok = PLUS) |
((\lasttok).tok = node.tok = MINUS) |
((reswords[(\lasttok).s][2]~=IDENT) & (node.tok=(IDENT|INTLIT|REALLIT)))|
((reswords[(\lasttok).s][2]~=IDENT) & (reswords[node.s][2]~=IDENT))
then
writes(yyout, " ")
}
else
while outcol < node.column do { writes(yyout, " "); outcol +:= 1 }
Most tokens' lexemes are finally written out by writing node.s:
writes(yyout, node.s)
outcol +:= *node.s
lasttok := node
}
"null": { }
default: write("its a ", type(node))
}
end
lecture #9 began here
The Unicon OOP facilities were originally prototyped as a semester class project in a "special topics" graduate course. Writing the prototype in a very high-level language like Icon, and developing it as a preprocessor with name mangling, allowed the initial class mechanism to be developed in a single evening, and a fairly full, usable system with working inheritance to be developed in the first weekend. By the end of the semester, the system was robust enough to write it in itself, and it was released to the public shortly afterwards as a package for Icon called "Idol". Many many improvements were made after this point, often at the suggestion of users.
An initial design goal was to make the absolute smallest additions to the language that were necessary to support object-orientation. Classes were viewed as a version of Icon's record data type, retaining its syntax for fields (member variables), but appending a set of associated procedures. Because records have no concept of public and private, neither did classes. Another graduate student criticized this lack of privacy, and for several versions, everything was made private unless an explicit public keyword was used. But eventually support for privacy was dropped on the grounds that it added no positive capabilities and was un-Iconish. The existence of classes with hundreds of "getter" and "setter" methods was considered a direct proof that "private" was idiotic in a rapid prototyping language.
class A(x,y)
method m()
write("hello")
end
end
These five lines generate 25 lines for Icont to translate into virtual
machine code. The first two lines are line directives showing from whence
this source code originated:
#line 0 "/tmp/uni13804206"
#line 0 "a.icn"
Global declarations (including procedures) would be passed through the preprocessor pretty nearly intact, but for the class, we get a bunch of very different code. Methods are written out, with names mangled to a classname_methodname format.
procedure A_m(self)
#line 2 "a.icn"
write("hello");
end
Two record types are defined, one for the class instances and one for the
"methods vector", or "operation record". The methods vector is instantiated
exactly once in a global variable in classname__oprec format.
record A__state(__s,__m,x,y)
record A__methods(m)
global A__oprec
The default constructor for a class takes fields as parameters and uses them directly for initialization purposes. The first time it is called, a methods vector is created. Instances are given a pointer to themselves in an __s field (mainly for historical reasons) and to the methods vector in an __m field. Current NMSU grad student Sumant Tambe did an independent study project to get rid of __s and __m with partial success, but his work is not finished or robust enough to be enabled by default.
procedure A(x,y)
local self,clone
initial {
if /A__oprec then Ainitialize()
}
self := A__state(&null,A__oprec,x,y)
self.__s := self
return self
end
procedure Ainitialize()
initial A__oprec := A__methods(A_m)
end
# Build local_vars from the params and local var expressions.
local_vars := set()
extract_identifiers(node.children[1].fields, local_vars)
extract_identifiers(node.children[3], local_vars)
Eventually, every identifier in every expression is checked against local_vars, and if not found there, against the class variables stored in a variable self_vars:
self_vars := set()
every insert(self_vars, c.foreachmethod().name)
every insert(self_vars, c.foreachfield())
every insert(self_vars, (!c.ifields).ident)
every insert(self_vars, (!c.imethods).ident)
For an IDENT node, the tests boil down to:
if node.tok = IDENT then {
if not member(\local_vars, node.s) then {
if member(\self_vars, node.s) then
node.s := "self." || node.s
else
node.s := mangle_sym(node.s)
}
}
Undeclared locals and globals are mangled to include the current
package name if there is one.
class subclass : super1 : super2 : ... ( ...fields... )
The semantics of inheritance, and particularly of multiple inheritance, are interesting in Unicon; the implementation is relatively simple. An example of inheritance is given by class Class, from uni/unicon/idol.icn
class declaration(name,fields,tag,lptoken,rptoken)
...
end
...
class Class : declaration (supers, methods, text, imethods, ifields, glob, linkfile, dir, unmangled_name, supers_node)
A unique perspective on inheritance in Unicon comes from the actual acquisition of inherited data fields and methods by the subclass. Some object-oriented languages do this inheritance "by aggregation", creating a copy of the superclass in the subclass. This is fine, but it makes "overriding" an anomaly, when overriding the parent with new/different behavior is entirely routine. Unicon instead inherits by the child looking for things in the parent (and the parent's parent, etc.) that it doesn't already have. In the above example, class Class effectively appends 5 fields from class declaration onto the end of its field list. The generated code for instances looks like
record Class__state(__s,__m,
supers,methods,text,imethods,ifields,
glob,linkfile,dir,unmangled_name,supers_node,
name,fields,tag,lptoken,rptoken)
The inheritance semantics is called "closure based" because the process
of looking for things to add from parent superclasses iterates until no
new information can be added, after which the subclass is said to be closed
on its parents. Other forms of closure appear frequently in CS.
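The "iterate until nothing new can be added" idea can be sketched in C as a fixed-point loop over a small boolean matrix. This is only an illustration of closure in general: the matrix representation, MAXCLASS, and close_supers() are assumptions made for the sketch, and unlike the Unicon implementation it keeps no track of the order in which superclasses are found.

```c
#include <stdbool.h>

#define MAXCLASS 8   /* hypothetical limit for this sketch */

/*
 * supers[c][s] is true when class s is already known to be a superclass
 * of class c. close_supers() keeps adding supers-of-supers until a full
 * pass adds nothing new; it returns the number of passes made.
 */
int close_supers(bool supers[MAXCLASS][MAXCLASS], int nclasses)
{
    int passes = 0;
    bool added = true;

    while (added) {
        added = false;
        passes++;
        for (int c = 0; c < nclasses; c++)
            for (int s = 0; s < nclasses; s++) {
                if (!supers[c][s])
                    continue;
                for (int ss = 0; ss < nclasses; ss++)
                    if (supers[s][ss] && !supers[c][ss]) {
                        supers[c][ss] = true;   /* new information */
                        added = true;
                    }
            }
    }
    return passes;
}
```

The last pass always adds nothing; that empty pass is what tells us the relation is closed.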
lecture #10 began here
Java answer: there is no concrete multiple inheritance in Java, only abstract multiple inheritance via interfaces.
Method transitive_closure() is one of the cleaner demonstrations of why Unicon is a fun language in which to write complex algorithms. It walks through a class graph, and yet it is not recursive.
method transitive_closure()
count := supers.size()
while count > 0 do {
added := taque()
every sc := supers.foreach() do {
if /(super := classes.lookup(sc)) then
halt("class/transitive_closure: couldn't find superclass ",sc)
every supersuper := super.foreachsuper() do {
if / self.supers.lookup(supersuper) &
/added.lookup(supersuper) then {
added.insert(supersuper)
}
}
}
count := added.size()
every self.supers.insert(added.foreach())
}
end
Now, given what I've said about Unicon providing a depth-first inheritance
hierarchy semantics, what is wrong with this picture? The code is stable
and hasn't needed changes in several years, so I am not fishing for syntax
bugs, or claiming that there is a bug. But there is something odd. A
chocolate "peanut butter cup" is available in my office for the first
correct description of the problem.
The method resolve() within class Class finds the inherited fields and methods from the linearized list of superclasses.
#
# resolve -- primary inheritance resolution utility
#
method resolve()
#
# these are lists of [class , ident] records
#
self.imethods := []
self.ifields := []
ipublics := []
addedfields := table()
addedmethods := table()
every sc := supers.foreach() do {
if /(superclass := classes.lookup(sc)) then
halt("class/resolve: couldn't find superclass ",sc)
every superclassfield := superclass.foreachfield() do {
if /self.fields.lookup(superclassfield) &
/addedfields[superclassfield] then {
addedfields[superclassfield] := superclassfield
put ( self.ifields , classident(sc,superclassfield) )
if superclass.ispublic(superclassfield) then
put( ipublics, classident(sc,superclassfield) )
} else if \strict then {
warn("class/resolve: '",sc,"' field '",superclassfield,
"' is redeclared in subclass ",self.name)
}
}
every superclassmethod := (superclass.foreachmethod()).name() do {
if /self.methods.lookup(superclassmethod) &
/addedmethods[superclassmethod] then {
addedmethods[superclassmethod] := superclassmethod
put ( self.imethods, classident(sc,superclassmethod) )
}
}
every public := (!ipublics) do {
if public.Class == sc then
put (self.imethods, classident(sc,public.ident))
}
}
end
Unicon generates in each source directory an NDBM database (named uniclass.dir and uniclass.pag) that includes a mapping from class name to: what file the class lives in, plus, what superclasses, fields, and methods appear in that class. From these specifications, "link" declarations are generated for superclasses within subclass modules, plus the subclass can perform inheritance resolution. The code to find a class specification is given in idol.icn's fetchspec(). A key fragment looks like
if f := open(dir || "/" || env, "dr") then {
if s := fetch(f, name) then {
close(f)
return db_entry(dir, s)
}
close(f)
}
Unicon searches for "link" declarations in a particular order, given by the current directory followed by directories in an IPATH (Icode path, or perhaps Icon path) environment variable, followed by system library directories such as ipl/lib and uni/lib. This same list of directories is searched for inherited classes.
The string stored in uniclass.dir and returned from fetch() for class Class is:
idol.icn class Class : declaration(supers,methods,text,imethods,ifields,glob,linkfile,dir,unmangled_name,supers_node) ismethod isfield Read ReadBody has_initially ispublic foreachmethod foreachsuper foreachfield isvarg transitive_closure writedecl WriteSpec writemethods Write resolve end
procedure Progend(x1)
package_level_syms := set()
package_level_class_syms := set()
set_package_level_syms(x1)
scopecheck_superclass_decs(x1)
outline := 1
outcol := 1
#
# export specifications for each class
#
native := set()
every cl := classes.foreach_t() do {
cl.WriteSpec()
insert(native, cl)
}
#
# import class specifications, transitively
#
repeat {
added := 0
every super := ((classes.foreach_t()).foreachsuper() | !imports) do {
if /classes.lookup(super) then {
added := 1
readspec(super)
cl := classes.lookup(super)
if /cl then halt("can't inherit class '",super,"'")
iwrite(" inherits ", super, " from ", cl.linkfile)
writelink(cl.dir, cl.linkfile)
outline +:= 1
}
}
if added = 0 then break
}
#
# Compute the transitive closure of the superclass graph. Then
# resolve inheritance for each class, and use it to apply scoping rules.
#
every (classes.foreach_t()).transitive_closure()
every (classes.foreach_t()).resolve()
scopecheck_bodies(x1)
if \thePackage then {
every thePackage.insertsym(!package_level_syms)
}
#
# generate output
#
yyprint(x1)
write(yyout)
Besides classes and packages, Unicon adds to Icon one additional syntax construct in support of this kind of program: type checking and coercion of parameters. Parameters and return values are the points at which type errors usually occur, during an integration phase in a large project where one person's code calls another. The type checking and coercion syntax was inspired by the type checks done by the Icon runtime system at the boundary where Icon program code calls the C code for a given function or operator.
One additional comment about types is that the lack of types in declarations for ordinary variables such as "local x" does not prevent the Icon compiler iconc from determining the exact types of well over 90% of uses at compile time using type inference. Type checking can generally be done at compile time even if variable declarations do not refer to types... as long as the type information is available across file and module boundaries.
lecture #11 began here
C++ was not very large to begin with, but after its initial development it changed greatly and grew almost every mechanism any programmer could want; in particular, it seemed to play the doppelganger to Ada, acquiring every feature of Ada necessary to kill it and take its place. A C++ compiler is a very large and complicated undertaking.
AT&T C++ was very buggy in 1985 and it took several years for compilers to stabilize and for the language to commercialize successfully. This would have failed had it originated someplace smaller than AT&T, and it almost failed anyhow. The C++ community owes a great debt to Borland, without whom the computing world would not be what it is today. Microsoft's agenda of killing Borland forced it into the C++ business, late and apparently against its will, but many current students seem to think Microsoft invented C++ and that it is the only systems language on the only platform that matters.
One of the main object-orientation implementation features of C++ that is almost unique is the support for non-virtual methods. C++ experts may disagree but arguably this is a performance hack that adds complexity to the language for the sake of execution speed of the generated code. Basically the symbol table for each class must track virtual and non-virtual properties, and generate C-like function calls for non-virtuals while generating virtuals through a methods vector (aka virtual function table), with an extra memory reference.
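Here is roughly what the two kinds of call look like once lowered to C. This is a hand-written sketch, not the output of any particular C++ compiler; the class Shape and all the function names are invented for the example.

```c
/* Hypothetical lowering of a C++-like class "Shape" into C. */
struct Shape;

/* Methods vector (virtual function table): one slot per virtual method. */
struct ShapeVtbl {
    int (*area)(const struct Shape *self);
};

struct Shape {
    const struct ShapeVtbl *vtbl;  /* extra pointer carried by every instance */
    int w, h;
};

/* Non-virtual method: an ordinary function, called directly. */
int Shape_width(const struct Shape *self) { return self->w; }

/* Two implementations of the virtual method area(). */
static int Rect_area(const struct Shape *self)     { return self->w * self->h; }
static int Triangle_area(const struct Shape *self) { return self->w * self->h / 2; }

const struct ShapeVtbl Rect_vtbl     = { Rect_area };
const struct ShapeVtbl Triangle_vtbl = { Triangle_area };

/* Virtual call: load vtbl, load the slot, then call through the pointer. */
int call_area(const struct Shape *s) { return s->vtbl->area(s); }
```

The call to Shape_width() is resolved at compile time, while call_area() cannot know which function it reaches until it has followed the vtbl pointer at run time.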
Another feature whose implementation in C++ or Java raises additional challenges is function overloading. The symbol table must hash not on method names but on method signatures. Signatures can generally be represented by strings. When automatic type conversions come into play, multiple possible interpretations of a call require more semantic checking than in simpler languages' compilers. For example, with methods f(char c) and f(float x), a call like f(32) is ambiguous and produces a semantic error (kind of like the multiple inheritance ambiguity we saw earlier).
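As a sketch of the "signatures as strings" idea, the following C function builds a symbol table key such as f(char) from a method name and its parameter type names. The function make_signature() and the key format are assumptions for illustration; real compilers use their own mangling schemes.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build a lookup key such as "f(char)" from a method name and its
 * parameter type names. Returns 0 on success, -1 if buf is too small.
 */
int make_signature(char *buf, size_t size, const char *name,
                   const char *const *types, int ntypes)
{
    size_t used;
    int n = snprintf(buf, size, "%s(", name);

    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;
    for (int i = 0; i < ntypes; i++) {
        /* separate parameter types with commas */
        n = snprintf(buf + used, size - used, "%s%s", i ? "," : "", types[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    if (used + 2 > size)          /* room for ')' and the terminator */
        return -1;
    buf[used++] = ')';
    buf[used] = '\0';
    return 0;
}
```

With keys like these, f(char) and f(float) hash to different symbol table entries even though both methods are named f. Deciding which entry a call like f(32) should reach is the separate, harder problem.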
Operator overloading adds a similar challenge to the semantic analysis of operators; sometimes + is an ADD instruction and sometimes it is a function call.
The Pascal language used a virtual machine instruction set called p-code to improve the language's portability. For each machine, the compiler would generate the same instructions, but the representation details of those instructions might vary from machine to machine, e.g. little-endian versus big-endian machines. The instruction set was very simple, enabling it to be implemented by single instructions or small sequences of instructions on most typical hardware of that day.
The modern Java virtual machine has similar goals, but with the modern twist of (in theory) true machine-independent VM code. Instruction portability is solved at a tiny cost in performance, but Java programs still have portability problems: they have traded instruction compatibility problems for library version compatibility problems. Microsoft's C# and .Net CLR started out as cheap replacements for Java, but CLR quickly gained a much larger goal, that of supporting many different languages, such as Visual BASIC and C#, with good interoperability and performance.
Smalltalk, Prolog, and Icon/Unicon have Pascal-style virtual machines, with the added twist of higher-level language semantics, semantics complicated enough that they do not map easily down to the underlying hardware. For such languages, a natural migration path is to develop a virtual machine that captures the semantics and allows experience and experimentation, followed (optionally, much later) by a compiler that produces native code.
The virtual machine instruction set does not go into the details of data types the way the JVM does. It is as if "values" and "memory" are virtual, not just "instructions".
Some changes since the early days: itran+ilink merged to become icont. In future, these should just go ahead and jump into the VM itself (iconx).
Appendices B and C of the IC document the virtual machine instruction set. In addition, a crude Ucode Code Generation Guide is available.
| d-word |
|---|
| v-word |
The vword has either: nothing (if dword holds null type), or an integer value, or a pointer to char, or a pointer to a "block".
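In C terms, a two-word descriptor along these lines could be declared as in the sketch below. The field names and type codes are my own guesses for illustration and should not be read as the runtime system's actual declarations; flags in the d-word are left out entirely.

```c
#include <stdint.h>

typedef intptr_t word;   /* assumption: a word is wide enough to hold a pointer */

union block;             /* heap blocks are left opaque in this sketch */

/* Hypothetical type codes stored in the d-word. */
enum { T_NULL = 0, T_INTEGER = 1, T_STRING = 2, T_LIST = 3 };

struct descrip {
    word dword;                 /* type code (flags omitted here) */
    union {
        word integr;            /* an integer value */
        char *sptr;             /* pointer to characters */
        union block *bptr;      /* pointer to a block */
    } vword;
};

/* Build an integer descriptor. */
struct descrip make_int(word n)
{
    struct descrip d;
    d.dword = T_INTEGER;
    d.vword.integr = n;
    return d;
}

/* Build the null value: only the d-word is meaningful. */
struct descrip make_null(void)
{
    struct descrip d;
    d.dword = T_NULL;
    d.vword.integr = 0;
    return d;
}
```

Every value, whatever its type, occupies exactly these two words, so the interpreter can copy descriptors around without looking inside them.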
lecture #12 began here
A: Read section 3.1 of the IC for Griswold's answer to this question. To the reasons given there, I would add:
The question you should be asking is: why is it easier to write a VM for a platform than a backend for a compiler which generates C code for that platform?
There is a gigantic union type, union block, to denote a value that is a block that has not been examined yet. Each type of block then has a separate struct type associated with it. Some types require multiple block types to handle their representation.
Most blocks' sizes are statically determined by their type. An exception to this rule would be for records; the sizes of different records vary, so a size in bytes is stored in the block. Note: this is stupid, the size is a property of the record type, why are we wasting 4 bytes per instance to store its size in bytes? Fixing it could easily be a homework problem.
Here's a straw-man one-word descriptor design to allow us to consider the matter. You might come up with a better one-word design; this is kind of an exercise in identifying the minimal changes.
Unicon has 25 type codes, so 5 bits would suffice for them. Four more bits are used in current descriptors for flags NVPT. On a 32-bit platform, that leaves us with 23 bits: only enough to address 8MB of address space. Is this useful? Let's see.
For integers, numbers larger than 4M or smaller than -4M would spill out into "large integer" blocks. One would have to do some statistics to tell how often integers are in between 4M and 2B.
For blocks, if we play tricks, like aligning blocks on 8 byte-boundaries, we get three bits back! 64MB of blocks, hurray. But we have introduced a bit of fragmentation where memory is wasted, and the block pointer is now computed as
BlkLoc(d) = (d & 0x7FFFFF) << 3;
This might be slower than just d.vword.bptr, which is just fetching the memory 4 bytes off from the address of d. The extra instructions that must be fetched may cancel out whatever is saved by fetching less memory for the smaller descriptor.
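The arithmetic of the straw-man layout can be checked with a few lines of C. The bit positions chosen here (5 type bits on top, then 4 flag bits, then a 23-bit payload) are one possible arrangement that I am assuming for the sketch, and all the function names are hypothetical.

```c
#include <stdint.h>

#define PAYLOAD_BITS 23
#define PAYLOAD_MASK 0x7FFFFFu            /* low 23 bits */
#define FLAG_SHIFT   PAYLOAD_BITS         /* 4 flag bits above the payload */
#define TYPE_SHIFT   (PAYLOAD_BITS + 4)   /* 5 type bits at the top */

/* Pack a type code (0..31), flags (0..15) and a 23-bit payload. */
uint32_t pack(uint32_t type, uint32_t flags, uint32_t payload)
{
    return (type << TYPE_SHIFT) | (flags << FLAG_SHIFT) | (payload & PAYLOAD_MASK);
}

uint32_t type_of(uint32_t d)  { return d >> TYPE_SHIFT; }
uint32_t flags_of(uint32_t d) { return (d >> FLAG_SHIFT) & 0xFu; }

/* Small integers: sign-extend the 23-bit payload. */
int32_t int_of(uint32_t d)
{
    uint32_t v = d & PAYLOAD_MASK;
    if (v & 0x400000u)                    /* bit 22 is the sign bit */
        return (int32_t)v - 0x800000;
    return (int32_t)v;
}

/* Does n fit in the payload, or must it spill to a large-integer block? */
int fits_small(int32_t n) { return n >= -0x400000 && n < 0x400000; }

/* Blocks aligned on 8-byte boundaries: store offset/8, giving a 64MB range. */
uint32_t block_payload(uint32_t offset) { return offset >> 3; }
uint32_t block_offset(uint32_t d)       { return (d & PAYLOAD_MASK) << 3; }
```

Note that every access now costs a mask, a shift, or both, which is exactly the trade-off against the smaller memory footprint.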
There might be additional tricks we could play to increase the addressable block memory, like allocating a different region for each data type. This would allow types to be aligned on larger boundaries than just 8-bytes, and allow each type its own 64MB or larger region (by adding the type's region address to the pointer). But now, following these pointers is slower, there is another addition along with all that bit twiddling.
And what about strings? They are majorly special-cased for high-performance in the current implementation, and many Icon programs do heavy string processing. Do we redesign strings to use a block, so we can store the length and a pointer? If so, we've added another level of indirection to strings, and we are paying a new 12 byte block for every string. Can we fit the length and pointer into the one-word descriptor? The existing design would allow us 31 bits for string length and string pointer. We'd quickly hit unacceptable limits, such as only allowing 8MB of string space (23 bits) and strings of max length 256 (8 bits). How about storing the length at the beginning of every string like Pascal and some BASIC's? If we use all 31 bits for the char *, substring sectioning becomes a more expensive copy operation, and some string concatenation optimizations are no longer possible.
We might actually want to implement two internal types for strings: short strings (length <= 256, 8MB of them) and long strings (using a type code and an external block), mirroring the integer implementation.
One thing is for sure: one-word descriptors would add some strong limits on memory (on 32-bit machines, that is) and potentially a lot of complexity. If anyone wants to do them for a semester project, it would be a high-risk gamble with a high potential payoff, especially on 64-bit machines.
There are some special cases where assignments need special treatment. Special keyword variables have semantic rules which must be enforced, such as &subject must always be a string. These special cases are handled by trapped variables. A reference to a trapped variable allocates a block with a special typecode so that a subsequent assignment (if there is one) can enforce the semantic rules. The Icont program (and hence Unicon) is lazy about this and allocates the trapped variables just in case, even in syntactic contexts where the variable will be dereferenced rather than assigned.
lecture #13 began here
The core of the Unicon VM lives in runtime/interp.r, which has a gigantic function named interp(). This function does a fetch/decode/execute loop on VM instructions. Some instructions are implemented immediately; most invoke runtime system functions in other modules. The fetch-decode-execute loop looks like
for (;;) {
...
lastop = GetOp; /* Instruction fetch */
...
switch ((int)lastop) {
/* ... something like 100 cases, 25 for binary operators ... */
}
}
The most exciting thing about this VM loop is that, when an expression
generates a result (and might have more), the way expressions suspend
is by calling interp() recursively.
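To see the shape of such a loop in isolation, here is a toy stack machine in C. Its opcodes and the function run() are invented for this sketch and have nothing to do with the real instruction set; in particular it makes no attempt to show suspension by recursive calls.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine, not the Unicon VM's set. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 64

/*
 * Run a program of int-encoded instructions and return the value left
 * on top of the stack by OP_HALT. Returns -1 on stack overflow/underflow
 * or an unknown opcode (a simplification: -1 is also a legal result).
 */
long run(const int *code)
{
    long stack[STACK_MAX];
    size_t sp = 0;                /* number of values on the stack */
    const int *ipc = code;        /* interpreter program counter */

    for (;;) {
        int op = *ipc++;          /* instruction fetch */
        switch (op) {             /* decode, then execute */
        case OP_PUSH:             /* operand follows the opcode */
            if (sp == STACK_MAX)
                return -1;
            stack[sp++] = *ipc++;
            break;
        case OP_ADD:
            if (sp < 2)
                return -1;
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_MUL:
            if (sp < 2)
                return -1;
            stack[sp - 2] *= stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
            return sp ? stack[sp - 1] : -1;
        default:
            return -1;
        }
    }
}
```

Feeding it the sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4, OP_MUL, OP_HALT computes (2+3)*4.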
The RTL code is illustrated by the implementation of the size operator (unary asterisk, as in *s). Compared with C, RTL has a documentation string, modified function header syntax, an "abstract" clause, a "type_case" operation, and separate code fragments for the different types. When the type of x is unknown, this expands to a switch statement executed at runtime, but when type information is known, the correct branch is selected at compile time.
"*x - return size of string or object x."
operator{1} * size(x)
abstract {
return integer
}
type_case x of {
string: inline {
return C_integer StrLen(x);
}
list: inline {
return C_integer BlkLoc(x)->list.size;
}
table: inline {
return C_integer BlkLoc(x)->table.size;
}
set: inline {
return C_integer BlkLoc(x)->set.size;
}
cset: inline {
register word i;
i = BlkLoc(x)->cset.size;
if (i < 0)
i = cssize(&x);
return C_integer i;
}
record: inline {
return C_integer BlkLoc(x)->record.recdesc->proc.nfields;
}
coexpr: inline {
return C_integer BlkLoc(x)->coexpr.size;
}
file: inline {
int status = BlkLoc(x)->file.status;
#ifdef Dbm
if ((status & Fs_Dbm) == Fs_Dbm) {
int count = 0;
DBM *db = (DBM *)BlkLoc(x)->file.fd;
datum key = dbm_firstkey(db);
while (key.dptr != NULL) {
count++;
key = dbm_nextkey(db);
}
return C_integer count;
}
#endif /* Dbm */
#ifdef ISQL
if ((status & Fs_ODBC) == Fs_ODBC) { /* ODBC file */
struct ISQLFile *fp;
int rc;
#if (ODBCVER >= 0x0351)
SQLLEN numrows;
#else /* ODBCVER >= 0x0351 */
SQLINTEGER numrows;
#endif /* ODBCVER >= 0x0351 */
fp = (struct ISQLFile *) BlkLoc(x)->file.fd;
rc = SQLRowCount(fp->hstmt, &numrows);
return C_integer(numrows);
}
#endif /* ISQL */
runerr(1100, x); /* not ODBC file */
}
default: {
/*
* Try to convert it to a string.
*/
if !cnv:tmp_string(x) then
runerr(112, x); /* no notion of size */
inline {
return C_integer StrLen(x);
}
}
}
end
lecture #14 began here
The icont source code lives in src/icont. The code generator apparently lives in icont/tcode.c, an 1100 line source file. The rest of the t*.c files are related to the compiler front-end (itran); the l*.c files are for the linker (ilink).
Code generation to the .u1 VM code file proceeds on a per-procedure basis, when the grammar action macro Procdcl(x) calls a C function codegen(x) on a parse tree node for a procedure. codegen(t) just resets a label counter and calls traverse(t), the "real" code generator function. The start nonterminal's action macro Progend calls gout(), which writes global (.u2) information out. The .u2 is then concatenated with the .u1 to form a .u file.
Function traverse(t) traverses the syntax (sub)tree rooted at t. It is a simple function with a long switch statement with 35 or so branches for different kinds of tree nodes. The bodies of most branches call helper functions to generate specific instructions, plus call traverse() recursively on child nodes.
static int traverse(t)
register nodeptr t;
{
register int lab, n, i;
struct loopstk loopsave;
static struct loopstk loopstk[LoopDepth]; /* loop stack */
static struct loopstk *loopsp;
static struct casestk casestk[CaseDepth]; /* case stack */
static struct casestk *casesp;
static struct creatstk creatstk[CreatDepth]; /* create stack */
static struct creatstk *creatsp;
n = 1;
switch (TType(t)) {
...
Interestingly, this code has static limits on the maximum lexical depth
of major control structures, which may be nested at most 20 deep (loops),
10 deep (case expressions), and 10 deep (co-expression creations).
These static limits do not appear to be checked. This is very possibly
a cause of (rare) coredumps from icont on especially large, complicated
.icn source files, such as the 475KB uniconlex.icn generated by ulex
by one of your illustrious classmates for an earlier homework. POSSIBLE
HOMEWORK: add checks, raise or better yet eliminate static limits on loopstk,
casestk, and creatstk!
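One way to approach that homework is to replace each fixed array with a stack that grows on demand. The following is a minimal sketch, not the code in tcode.c: the two fields shown in struct loopstk are assumed for illustration, and struct loopstack, loop_push() and loop_pop() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for icont's struct loopstk; the real fields may differ. */
struct loopstk {
    int nextlab;    /* label for "next" */
    int breaklab;   /* label for "break" */
};

/* A growable stack: capacity doubles on demand instead of being fixed. */
struct loopstack {
    struct loopstk *items;
    size_t depth;      /* number of entries in use */
    size_t capacity;   /* number of entries allocated */
};

/* Push one entry; returns 0 on success, -1 if memory is exhausted. */
int loop_push(struct loopstack *s, struct loopstk entry)
{
    if (s->depth == s->capacity) {
        size_t newcap = s->capacity ? s->capacity * 2 : 4;
        struct loopstk *p = realloc(s->items, newcap * sizeof *p);
        if (p == NULL)
            return -1;             /* report failure instead of overrunning */
        s->items = p;
        s->capacity = newcap;
    }
    s->items[s->depth++] = entry;
    return 0;
}

/* Pop the innermost entry; popping an empty stack is a caller bug. */
struct loopstk loop_pop(struct loopstack *s)
{
    assert(s->depth > 0);
    return s->items[--s->depth];
}
```

A zero-initialized struct loopstack is a valid empty stack. The same pattern would apply to casestk and creatstk.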
The case branches for different node types rely on macros to pick out information from the tree nodes, so the actual representation of the tree is entirely hidden from this code! The actual tree representation as well as the definitions of these macros live in src/icont/tree.h. The macros hide it, but in reality, each "child" of a tree node can be either a (long) int, a (char *) string, or a pointer to another tree node:
union field {
long n_val; /* integer-valued fields */
char *n_str; /* string-valued fields */
nodeptr n_ptr; /* subtree pointers */
};
struct node {
int n_type; /* node type */
char *n_file; /* name of file containing source program */
int n_line; /* line number in source program */
int n_col; /* column number in source program */
union field n_field[1]; /* variable-content fields */
};
Tree0(t) picks out child 0,
which had better be a tree node, Val0(t) picks out child 0, which had
better be a (long integer) "immediate value" in the tree, etc.
This allows the code for each kind of tree node to be fairly simple.
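The macro definitions themselves are not reproduced in these notes. A plausible sketch, assuming they simply index the n_field array and select one member of the union, is shown below; the macro bodies and the allocator new_node() are assumptions, not copied from tree.h.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node *nodeptr;

union field {
    long n_val;       /* integer-valued fields */
    char *n_str;      /* string-valued fields */
    nodeptr n_ptr;    /* subtree pointers */
};

struct node {
    int n_type;               /* node type */
    char *n_file;             /* source file name */
    int n_line;               /* line number */
    int n_col;                /* column number */
    union field n_field[1];   /* really n fields; node is allocated oversize */
};

/* Assumed definitions: each macro picks one interpretation of child i. */
#define TType(t)  ((t)->n_type)
#define Tree0(t)  ((t)->n_field[0].n_ptr)
#define Tree1(t)  ((t)->n_field[1].n_ptr)
#define Val0(t)   ((t)->n_field[0].n_val)
#define Str0(t)   ((t)->n_field[0].n_str)

/* Hypothetical allocator: room for nfields children (nfields >= 1). */
nodeptr new_node(int type, int nfields)
{
    nodeptr t = calloc(1, sizeof(struct node) +
                          (size_t)(nfields - 1) * sizeof(union field));
    if (t != NULL)
        t->n_type = type;
    return t;
}
```

Nothing in the node records which union member is valid for a given child; the code generator has to know that from the node type, which is why Tree0(t) "had better be" a tree node.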
The following example illustrates code generation for perhaps the most common syntax construct, binary operators. The code uses function emit() to generate individual VM instructions, and helper functions setloc() and binop() to do some of the dirty work. The overall template is: push a (null) descriptor to make space for the result, generate code for the left operand, generate code for the right operand, and generate the instruction for the operator. The pnull at the beginning may seem unnecessary; in many stack machines the protocol is: pop two operands and push the result in their place. The pnull is not an accident for Icon and Unicon; see if you can come up with a guess as to why you might not always want to throw away the operands right away and overwrite them with your result. Note that the tree node order for binary operators is "prefix" order: the operator in child 0, followed by operands in children 1 and 2.
case N_Augop: /* augmented assignment */
case N_Binop: /* or a binary operator */
emit("pnull");
traverse(Tree1(t));
if (TType(t) == N_Augop)
emit("dup");
traverse(Tree2(t));
setloc(t);
binop((int)Val0(Tree0(t)));
free(Tree0(t));
Working our way backwards through this code:
There are actually several variations on the emit() function: emitlab() for labels, emit(), emitl() for instructions that reference labels, emitn() for instructions with a numeric argument, and emits() for instructions with a string argument. These functions all use good old fprintf, writing to a global variable codefile, and every single write is checked for failure to avoid surprises. Strangely enough, writecheck() is a function; it should probably be inline or a macro.
static void emit(s)
char *s;
{
writecheck(fprintf(codefile, "\t%s\n", s));
}
I have omitted some of the details, such as opening files, but this whole
code generator is one of the simplest "production compiler" arrangements
I have ever seen. One of the most useful possible homework #4's I could
assign would be to rewrite this VM code generator for the Unicon translator,
so it would not need to write out Icon source code and then call the
icont program which re-parses everything. If we got rid of the itran
part of icont, it would eliminate ~6K lines of C code from
our distribution. But, there are enough details that it might be too
large for a homework. However, it would make an excellent semester project.
Function ilink() in link.c performs three passes, one to obtain global information from ucode files, one to omit unreferenced procedures, and one to do the "real" code generation. We will focus on the third pass. Looping through the list of files to link is done by walking through a linked list:
lfls = llfiles;
while ((lf = getlfile(&lfls)) != 0) {
For each file, function gencode() is called. After all files have
been visited, a function gentables() generates several global tables
containing information about record types and fields, globals and
statics.
Function gencode() reads from a global open FILE named infile; it relies on helper functions such as getopc() to handle the details of reading "tokens" from the ucode file. The majority of gencode() is a giant switch statement on the opcodes of each (real and pseudo) instruction:
while ((op = getopc(&name)) != EOF) {
switch (op) {
For many opcodes, the binary instruction is very straightforward.
Sixty-three opcodes share the following switch branch:
case Op_Plus:
newline();
lemit(op, name);
break;
The opcodes that don't share this branch must have special needs or
instruction immediate in-line operands (as opposed to operands pushed as
descriptors onto the interpreter stack), such as: constants, for which the
instruction and the value (or a pointer to it) are both emitted into the
binary bytecode. The procedure call instruction and the record field
operator VM instruction similarly have an integer code operand (specifying
how many parameters, and which field, respectively) rather than a
descriptor. Some of these special cases are mentioned in later sections.
The helper functions used in the common case are illustrative and worth
covering in detail. newline() actually moves the input file past
the next line feed, i.e. to the next instruction. lemit() is the helper
function that does the real work, except that it is just a wrapper
(with optional debugging code) around outop(op), which is a macro that
calls intout(n) for opcodes. Instruction operands would instead be
generated by wordout(), which emits a word that tends to be
long-sized/pointer-sized, which might or might not be the same number
of bytes as intout() emits.
word pc = 0; /* simulated program counter */
Similar to emit(), at the binary level the actual code is "emitted" by helper functions such as wordout(), which emits a word. This might be trickier to get consistent than it sounds, at least when one is writing for "all possible platforms". Counterintuitively, these helper functions don't actually write to files, but rather write bytes to a (gigantic) in-memory image of the icode file (named codep), an array of bytes that is grown dynamically.
/*
* wordout(i) outputs i as a word that is used by the runtime system
* WordSize bytes must be moved from &oword[0] to &codep[0].
*/
static void wordout(oword)
word oword;
{
int i;
union {
word i;
char c[WordSize];
} u;
CodeCheck(WordSize);
u.i = oword;
for (i = 0; i < WordSize; i++)
codep[i] = u.c[i];
codep += WordSize;
pc += WordSize;
}
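The CodeCheck() macro is not shown above. The following sketch illustrates the general idea of a dynamically grown in-memory image; it is not the linker's actual code. It assumes word is a C long, tracks the write position as an offset (codeused) rather than a moving pointer, and uses hypothetical names codeb, codesize, code_reserve() and word_at().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef long word;                /* assumption: a word is a C long */
#define WordSize sizeof(word)

static char *codeb = NULL;        /* base of the in-memory icode image */
static size_t codeused = 0;       /* bytes emitted so far */
static size_t codesize = 0;       /* bytes currently allocated */
static word pc = 0;               /* simulated program counter */

/* Ensure room for n more bytes, doubling the image as needed; 0 on success. */
int code_reserve(size_t n)
{
    if (codeused + n > codesize) {
        size_t newsize = codesize ? codesize * 2 : 64;
        char *p;
        while (newsize < codeused + n)
            newsize *= 2;
        p = realloc(codeb, newsize);
        if (p == NULL)
            return -1;
        codeb = p;                /* the image may have moved */
        codesize = newsize;
    }
    return 0;
}

/* Append one word in native byte order, like the union trick above. */
int wordout(word oword)
{
    if (code_reserve(WordSize) != 0)
        return -1;
    memcpy(codeb + codeused, &oword, WordSize);
    codeused += WordSize;
    pc += (word)WordSize;
    return 0;
}

/* Read back the word stored at a byte offset in the image. */
word word_at(size_t offset)
{
    word w;
    memcpy(&w, codeb + offset, WordSize);
    return w;
}
```

Because realloc() may move the image, anything that refers to emitted code is safer stored as an offset (such as pc) than as a raw pointer into the buffer.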
In a previous software engineering course, a 25K LOC student project used enough record types that a size problem was identified. Happily, other researchers in the Smalltalk community had solved a similar problem in Smalltalk. An undergraduate student named Richard Hatch experimented with several techniques and eventually came up with a simple, cheap compression technique, which is described in a UTSA technical report. In a very large program, Field Table Compression has been demonstrated to take a 464KB field table and reduce it to 24KB, without requiring decompression and at an execution cost of around 2%.
lecture #15 began here
lecture #16 began here
write("hello " x)
would print
hello Mervyn
Suggest how this feature might be implemented in the lexical analyzer of a compiler (not just for "write()", but for all expressions). What problems would occur if we added this feature to Unicon? Are there any semantic problems deep enough that we should NOT do it?
t := ["Washington": "Delaware", "Egypt": "Nile", 3.14: 3]
(This creates a table of size 3, with keys "Washington", "Egypt", and 3.14 mapped onto elements "Delaware", "Nile", and integer 3. It makes no sense.)
optionalExprs : exprs* ;
How would you handle YACC's notation for semantic actions to allow
this kind of production rule to build syntax trees?
lecture #17 began here
I want to talk about runtime system data layouts and garbage collection, give you some more examples of runtime system code, and then move on to the optimizing compiler. But first, let's talk about a little compiler research problem I have:
global foundsome
procedure main(av)
if av[1] == "-phase1" then {
f := open("ytab_h.icn") | stop("can't open ytab_h.icn")
f2 := open("meta.err2", "w") | stop("can't write meta.err2")
write(f2, "# Merr2 token samples")
while line := read(f) do {
line ? {
if tab(any('#$')) & ="define " &
(tok:=tab(many(&letters))) then {
write(f2, tok, " = ",map(tok))
}
}
}
close(f)
close(f2)
}
else if av[1] == "-phase2" then {
f := open("meta.err2")
if not ((line := read(f)) == "# Merr2 token samples") then
stop("# Merr2 token samples expected")
L := []
while line := read(f) do {
line ? {
if ="%%" then break
if tab(find("= ")+2) & &pos < *&subject then
put(L, tab(0))
}
}
close(f)
#
# now, iterate merr generating fragments
# that generate errors
#
foundsome := 0
iteration := 1
repeat {
generate_errs(L, iteration)
legalprefixes := newlegalprefixes
write("iteration ", iteration, " complete, found ",
foundsome, " prefixes ", *legalprefixes)
iteration +:= 1
if foundsome = 0 & iteration > 2 then break
}
}
end
global legalprefixes, newlegalprefixes, statesserved
# generate errors for fragments of length = i tokens
procedure generate_errs(L,i)
initial statesserved := set()
newlegalprefixes := []
every err := generrs(L, i) do {
f2 := open("errfrag.icn","w") | stop("can't open errfrag.icn")
write(f2, err)
close(f2)
system("unicon -E errfrag &> err.foo")
# writes(".")
f3 := open("err.foo")
Lmsgs := []
while(msgline := read(f3)) do {
# write(msgline)
put(Lmsgs, msgline)
}
close(f3)
if find("unexpected end of file"|"No errors"|" \"\":", msgline) then {
put(newlegalprefixes, err)
}
else if i = 1 then {
write(err, " not deemed a legal prefix:")
every write("\t", !Lmsgs)
}
if find("syntax error", msgline) then {
msgline ? {
tab(find("syntax error"))
="syntax error"
if =" (" & stt := integer(tab(many(&digits))) then {
if not member(statesserved, stt) then {
insert(statesserved, stt)
every write(!Lmsgs)
f2 := open("meta.err2", "a")
write(f2, err)
write(f2, "::: syntax error")
close(f2)
foundsome +:= 1
}
}
}
}
}
end
# generate errors for fragments of length = i tokens that start with t
procedure generrs(L, i)
write("generrs ", i, " L ", image(L))
if i = 1 then suspend !L
else {
suspend !legalprefixes ||" "|| !L
}
end
lecture #18 began here
struct b_real { /* real block */
word title; /* T_Real */
double realval; /* value */
};
struct b_cset { /* cset block */
word title; /* T_Cset */
word size; /* size of cset */
unsigned int bits[CsetSize]; /* array of bits */
};
Possible future change:
struct b_file { /* file block */
word title; /* T_File */
FILE *fd; /* stdio file pointer, socket, or wbp */
word status; /* file status */
struct descrip fname; /* file name (string qualifier) */
};
Imminent future change:
struct b_lelem { /* list-element block */
word title; /* T_Lelem */
word blksize; /* size of block */
union block *listprev; /* previous list-element block */
union block *listnext; /* next list-element block */
word nslots; /* total number of slots */
word first; /* index of first used slot */
word nused; /* number of used slots */
struct descrip lslots[1]; /* array of slots */
};
struct b_list { /* list-header block */
word title; /* T_List */
word size; /* current list size */
word id; /* identification number */
union block *listhead; /* pointer to first list-element block */
union block *listtail; /* pointer to last list-element block */
};
In a previous course similar to this one, I had students make insert
and delete into the middle of a list cheaper, by splitting list
element blocks. Another possible change:
struct b_proc { /* procedure block */
word title; /* T_Proc */
word blksize; /* size of block */
union { /* entry points for */
int (*ccode)(); /* C routines */
uword ioff; /* and icode as offset */
pointer icode; /* and icode as absolute pointer */
} entryp;
word nparam; /* number of parameters */
word ndynam; /* number of dynamic locals */
word nstatic; /* number of static locals */
word fstatic; /* index (in global table) of first static */
struct descrip pname; /* procedure name (string qualifier) */
struct descrip lnames[1]; /* list of local names (qualifiers) */
};
struct b_record { /* record block */
word title; /* T_Record */
word blksize; /* size of block */
word id; /* identification number */
union block *recdesc; /* pointer to record constructor */
struct descrip fields[1]; /* fields */
};
/*
* Alternate uses for procedure block fields, applied to records.
*/
#define nfields nparam /* number of fields */
#define recnum nstatic /* record number */
#define recid fstatic /* record serial number */
#define recname pname /* record name */
struct b_table { /* table-header block */
word title; /* T_Table */
word size; /* current table size */
word id; /* identification number */
word mask; /* mask for slot num, equals n slots - 1 */
struct b_slots *hdir[HSegs]; /* directory of hash slot segments */
struct descrip defvalue; /* default table element value */
};
struct b_slots { /* set/table hash slots */
word title; /* T_Slots */
word blksize; /* size of block */
union block *hslots[HSlots]; /* array of slots (HSlots * 2^n entries) */
};
struct b_telem { /* table-element block */
word title; /* T_Telem */
union block *clink; /* hash chain link */
uword hashnum; /* for ordering chain */
struct descrip tref; /* entry value */
struct descrip tval; /* assigned value */
};
Tables are now pretty ingenious. HSegs defaults to 12, HSlots defaults to 8.
What is the starting size of an empty table in bytes? Possible future
change:
struct b_tvsubs { /* substring trapped variable block */
word title; /* T_Tvsubs */
word sslen; /* length of substring */
word sspos; /* position of substring */
struct descrip ssvar; /* variable that substring is from */
};
struct b_tvtbl { /* table element trapped variable block */
word title; /* T_Tvtbl */
union block *clink; /* pointer to table header block */
uword hashnum; /* hash number */
struct descrip tref; /* entry value */
};
lecture #19 began here
Today's lecture is mostly from the text, Chapter 11.
Reality: actually there is a third heap (static region), and what happens when we Really run out of space in a heap?
Historical version #1: "fixed regions", statics OK via malloc, halt with an error if garbage collection fails to produce the space you need.
Historical version #2: "expandable regions", on UNIX only, grow the address space using brk/sbrk if garbage collection fails, mallocs are no longer OK.
Current: "multiple regions": maintain a linked list of "regions", malloc a new (larger) one whenever you run out of space, garbage collect both new and old regions.
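The "multiple regions" idea can be sketched as a chain of bump-allocated regions. This is an illustration only: struct region here is a hypothetical, stripped-down descriptor, region_alloc() and region_count() are invented names, the sizes are arbitrary, and the garbage collection that the real runtime attempts before growing is reduced to a comment. It also assumes callers request sizes that keep allocations suitably aligned.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical region descriptor; the runtime's real struct has more fields. */
struct region {
    char *base;             /* start of region */
    char *freep;            /* next free byte */
    char *end;              /* one past the last byte */
    struct region *prev;    /* older region, or NULL */
};

static struct region *curr = NULL;   /* region currently allocated from */

static struct region *new_region(size_t size, struct region *prev)
{
    struct region *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->base = malloc(size);
    if (r->base == NULL) {
        free(r);
        return NULL;
    }
    r->freep = r->base;
    r->end = r->base + size;
    r->prev = prev;
    return r;
}

/* Bump-allocate nbytes; on exhaustion, chain a new region twice as large. */
void *region_alloc(size_t nbytes)
{
    void *p;
    if (curr == NULL || (size_t)(curr->end - curr->freep) < nbytes) {
        size_t newsize = curr ? 2 * (size_t)(curr->end - curr->base) : 1024;
        struct region *r;
        while (newsize < nbytes)
            newsize *= 2;
        /* A real runtime would try a garbage collection before growing. */
        r = new_region(newsize, curr);
        if (r == NULL)
            return NULL;
        curr = r;
    }
    p = curr->freep;
    curr->freep += nbytes;
    return p;
}

/* Number of regions in the chain; a collector must consider all of them. */
size_t region_count(void)
{
    size_t n = 0;
    struct region *r;
    for (r = curr; r != NULL; r = r->prev)
        n++;
    return n;
}
```

The old region stays on the list because live data in it may still be referenced, which is why collection has to look at both new and old regions.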
/*
* AlcBlk - allocate a block.
*/
#begdef AlcBlk(var, struct_nm, t_code, nbytes)
{
/*
* Ensure that there is enough room in the block region.
*/
if (DiffPtrs(blkend,blkfree) < nbytes && !reserve(Blocks, nbytes))
return NULL;
/*
* Decrement the free space in the block region by the number of bytes
* allocated and return the address of the first byte of the allocated
* block.
*/
blktotal += nbytes;
var = (struct struct_nm *)blkfree;
blkfree += nbytes;
var->title = t_code;
}
#enddef
...
#begdef alclist_raw_macro(f,e_list,e_lelem)
/*
* alclist - allocate a list header block in the block region.
* A corresponding list element block is also allocated.
* Forces a g.c. if there's not enough room for the whole list.
* The "alclstb" code inlined so as to avoid duplicated initialization.
*
* alclist_raw() - as per alclist(), except initialization is left to
* the caller, who promises to initialize first n==size slots w/o allocating.
*/
struct b_list *f(uword size, uword nslots)
{
register struct b_list *blk;
register struct b_lelem *lblk;
register word i;
if (!reserve(Blocks, (word)(sizeof(struct b_list) + sizeof (struct b_lelem)
+ (nslots - 1) * sizeof(struct descrip)))) return NULL;
EVVal(sizeof (struct b_list), e_list);
EVVal(sizeof (struct b_lelem) + (nslots-1) * sizeof(struct descrip), e_lelem);
AlcFixBlk(blk, b_list, T_List)
AlcVarBlk(lblk, b_lelem, T_Lelem, nslots)
blk->size = size;
blk->id = list_ser++;
blk->listhead = blk->listtail = (union block *)lblk;
lblk->nslots = nslots;
lblk->first = 0;
lblk->nused = size;
lblk->listprev = lblk->listnext = (union block *)blk;
/*
* Set all elements beyond size to &null.
*/
for (i = size; i < nslots; i++)
lblk->lslots[i] = nulldesc;
return blk;
}
#enddef
#ifdef MultiThread
alclist_raw_macro(alclist_raw_0,0,0)
alclist_raw_macro(alclist_raw_1,E_List,E_Lelem)
#else /* MultiThread */
alclist_raw_macro(alclist_raw,0,0)
#endif /* MultiThread */
All of this doesn't matter unless a program exhibits "pathological memory behavior" of some sort. Pathological memory behavior could be defined as: when time spent allocating or deallocating memory is more than proportional to the amount of data being computed/processed. Example: a program that allocates a list of size n for each data element processed. Allocations are O(n^2). Example: a program that is garbage collecting every k instructions, where the garbage collecting is taking time proportional to N, the total data footprint of the program.
if (Qual(d)) postqual(&d); else if (Pointer(d)) markblock(&d);
Fortunately, when the Iconc compiler came along a general solution was found: tended locals link themselves onto a linked list when a function is called, and unlink themselves when that C function returns. Generally, any pointer that may point into the heap should be declared "tended" if a function may in fact cause a garbage collection.
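The linking and unlinking of tended locals can be sketched in plain C as follows. This is a simplified illustration, not the code the runtime generates: struct descrip is reduced to two fields, and tend_frame, tend_top and count_tended() are hypothetical names. count_tended() stands in for the collector's marking phase walking the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified descriptor: a type/flag word and a value word. */
struct descrip {
    long dword;
    void *vword;
};

/* One frame of tended descriptors, chained to the caller's frame. */
struct tend_frame {
    struct tend_frame *previous;
    int num;                    /* number of descriptors used in d[] */
    struct descrip d[2];        /* this function's tended locals */
};

static struct tend_frame *tend_top = NULL;   /* innermost active frame */

/* What marking would do: visit every tended descriptor on the chain. */
int count_tended(void)
{
    int n = 0;
    struct tend_frame *f;
    for (f = tend_top; f != NULL; f = f->previous)
        n += f->num;
    return n;
}

int inner(void)
{
    struct tend_frame frame;
    int seen;
    frame.num = 1;
    frame.d[0].dword = 0;
    frame.d[0].vword = NULL;
    frame.previous = tend_top;  /* link on entry */
    tend_top = &frame;

    seen = count_tended();      /* stands in for an allocation that collects */

    tend_top = frame.previous;  /* unlink on exit */
    return seen;
}

int outer(void)
{
    struct tend_frame frame;
    int seen;
    frame.num = 2;
    frame.d[0].dword = 0;
    frame.d[0].vword = NULL;
    frame.d[1] = frame.d[0];
    frame.previous = tend_top;  /* link on entry */
    tend_top = &frame;

    seen = inner();             /* the callee's frame chains onto ours */

    tend_top = frame.previous;  /* unlink on exit */
    return seen;
}
```

While inner() runs, the chain exposes three descriptors (one from inner, two from outer), so a collection triggered deep in the call tree can still find and update every caller's heap pointers.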
Note that you are always collecting for one particular region, but that you have to validate all pointers in all regions (if you have multiple regions) in order to find all the pointers in the collecting region. A smarter "generational" system would keep track of (rare?) pointers from old regions to new data to reduce the marking time.
There are four global/static arrays: firstd, firstp, ptrno, and bsizes that are of use in the garbage collection process. Their subscripts are the type codes. Firstd tells for each type, if it is a block type that holds descriptors, where is the location of the first descriptor within it. Firstp tells the same information for types that hold (union block *) pointers instead of descriptors. ptrno tells how many (union block *) pointers a block holds, and bsizes holds the size in bytes for each block (0 for variable-size blocks, which have a size in bytes as their second word). Unless you modify the garbage collector, the reason you care about this information is that if you need to add a new block type to the garbage collected runtime system, you need to add entries for it to these arrays.
Sweeping the string region consists of: taking your array of (pointers to) "live" string descriptors, sorting them, and sliding the clumps of descriptors up into a contiguous chunk (rmemmgt.r, scollect()). You overwrite the StrLoc() of every string descriptor as you walk down this list.
static void scollect(extra)
word extra;
{
register char *source, *dest;
register dptr *qptr;
char *cend;
if (qualfree <= quallist) {
/*
* There are no accessible strings. Thus, there are none to
* collect and the whole string space is free.
*/
strfree = strbase;
return;
}
/*
* Sort the pointers on quallist in ascending order of string
* locations.
*/
qsort((char *)quallist, (int)(DiffPtrs((char *)qualfree,(char *)quallist)) /
sizeof(dptr *), sizeof(dptr), (QSortFncCast)qlcmp);
/*
* The string qualifiers are now ordered by starting location.
*/
dest = strbase;
source = cend = StrLoc(**quallist);
/*
* Loop through qualifiers for accessible strings.
*/
for (qptr = quallist; qptr < qualfree; qptr++) {
if (StrLoc(**qptr) > cend) {
/*
* qptr points to a qualifier for a string in the next clump.
* The last clump is moved, and source and cend are set for
* the next clump.
*/
while (source < cend)
*dest++ = *source++;
source = cend = StrLoc(**qptr);
}
if ((StrLoc(**qptr) + StrLen(**qptr)) > cend)
/*
* qptr is a qualifier for a string in this clump; extend
* the clump.
*/
cend = StrLoc(**qptr) + StrLen(**qptr);
/*
* Relocate the string qualifier.
*/
StrLoc(**qptr) = StrLoc(**qptr) + DiffPtrs(dest,source) + (uword)extra;
}
/*
* Move the last clump.
*/
while (source < cend)
*dest++ = *source++;
strfree = dest;
}
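To experiment with the clump-sliding logic outside the runtime system, the same algorithm can be restated in standalone C. This is a simplified sketch: struct qual is a stand-in for a string descriptor, compact_strings() is a hypothetical name, the extra relocation offset is omitted, and memmove() replaces the byte-copy loops.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct qual {        /* simplified string descriptor */
    size_t len;      /* string length */
    char *loc;       /* pointer into the string region */
};

/* Order qualifier pointers by ascending string location. */
static int qlcmp(const void *a, const void *b)
{
    const struct qual *qa = *(struct qual *const *)a;
    const struct qual *qb = *(struct qual *const *)b;
    return (qa->loc > qb->loc) - (qa->loc < qb->loc);
}

/*
 * Slide live strings to the front of the region that starts at base.
 * live[] holds pointers to the n live qualifiers, all pointing into
 * that region. Returns the new free pointer.
 */
char *compact_strings(char *base, struct qual **live, size_t n)
{
    char *dest = base, *source, *cend;
    size_t i;

    if (n == 0)
        return base;                     /* the whole region is free */
    qsort(live, n, sizeof live[0], qlcmp);
    source = cend = live[0]->loc;
    for (i = 0; i < n; i++) {
        struct qual *q = live[i];
        if (q->loc > cend) {             /* gap: q starts the next clump */
            memmove(dest, source, (size_t)(cend - source));
            dest += cend - source;
            source = cend = q->loc;
        }
        if (q->loc + q->len > cend)      /* q extends the current clump */
            cend = q->loc + q->len;
        q->loc = dest + (q->loc - source);   /* relocate the qualifier */
    }
    memmove(dest, source, (size_t)(cend - source));   /* last clump */
    return dest + (cend - source);
}
```

Overlapping strings (substrings that share storage) land in the same clump and keep sharing storage after the move, which is the reason for sorting by location first.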
Sweeping the block region consists of: sliding the live blocks up. All
the descriptors that point at the block were turned into a linked list
pointed to by its title word; when you slide it up you walk through the
linked list updating their pointers to point at the new location
(rmemmgt.r, adjust() and compact()). One gross part about this is, it
is split into two passes: adjust() to point all the live pointers
where the blocks are going to be, and a second pass (compact())
to actually memmove the blocks up into place. Extra brownie points
(say, a Coke or a piece of chocolate) if you can tell me why.
static void adjust(source,dest)
char *source, *dest;
{
register union block **nxtptr, **tptr;
/*
* Loop through to the end of allocated block region, moving source
* to each block in turn and using the size of a block to find the
* next block.
*/
while (source < blkfree) {
if ((uword)(nxtptr = (union block **)BlkType(source)) > MaxType) {
/*
* The type field of source is a back pointer. Traverse the
* chain of back pointers, changing each block location from
* source to dest.
*/
while ((uword)nxtptr > MaxType) {
tptr = nxtptr;
nxtptr = (union block **) *nxtptr;
*tptr = (union block *)dest;
}
BlkType(source) = (uword)nxtptr | F_Mark;
dest += BlkSize(source);
}
source += BlkSize(source);
}
}
/*
* compact - compact good blocks in the block region. (Phase III of garbage
* collection.)
*/
static void compact(source)
char *source;
{
register char *dest;
register word size;
/*
* Start dest at source.
*/
dest = source;
/*
* Loop through to end of allocated block space, moving source
* to each block in turn, using the size of a block to find the next
* block. If a block has been marked, it is copied to the
* location pointed to by dest and dest is pointed past the end
* of the block, which is the location to place the next saved
* block. Marks are removed from the saved blocks.
*/
while (source < blkfree) {
size = BlkSize(source);
if (BlkType(source) & F_Mark) {
BlkType(source) &= ~F_Mark;
if (source != dest)
mvc((uword)size,source,dest);
dest += size;
}
source += size;
}
/*
* dest is the location of the next free block. Now that compaction
* is complete, point blkfree to that location.
*/
blkfree = dest;
}
lecture #20 began here
"Speed isn't everything... it is the ONLY thing." - Seymour CrayIcon was not designed with execution speed as a first priority: first priority was programmer speed. This is consistent with Moore's law and the software crisis. But compared with other popular "scripting language", Icon fares well; Keith Waclena of University of Chicago documented many languages' performance on several benchmarks that show Icon comfortably ahead of most competing languages (Perl, Tcl, Python, etc.) on performance, although it remains comfortably behind C. But for a significant portion of real world applications, the speed difference between Icon/Unicon and C/C++ will rule out the former, or relegate it to a "prototyping language" role in those application domains where speed is critical, and limit it to a prototyping simulation role in domains with hard real-time constraints.
For the next month, much of our attention will be devoted to consideration of analysis and optimization techniques in Iconc, using it as a real world example as well as a "poster child" candidate for humanitarian aid. The optimization and analysis Iconc does includes:
Note that once you limit programmer time to a finite value (6 months, 70 years, whatever), C/C++ do not win by as much, or do not necessarily win at all. Our visitor last week, Dr. Thomas, was complaining because it was taking the icont linker (a C program) 43 seconds on a medium-speed (Pentium 4) processor to link ivib, a program of approximately 165 source files, 27K LOC, and 200 or so classes. Icont is a C program, so it should blaze with speed. Profiling showed 96% of the link time was spent in a single function, where code was traversing long linked lists with naive code to search for every field name in every record type. The programmer who wrote it (a grad student) no doubt knew hash tables well, but didn't have time to do them. A linker written in Unicon would no doubt avoid this performance bug due to the availability of built-in hash tables.
Ken Walker wrote that a motivation for his compiler was "to have a vehicle for exploring optimization techniques".
Ken Walker stayed on a year after his graduation as a post-doc, working to make the compiler "production grade". The compiler was released publicly. Some users were wildly enthusiastic and reported speedups of 500-1000+ %; others reported code generation bugs (especially in large programs), and relatively little speedup for what seemed like excessively slow compiles of excessively large C code output. Despite a heroic effort, more work was needed to polish and mature the compiler.
Ralph Griswold, facing retirement, decided by 1993 or 1994 that he could not maintain (i.e. fix bugs in) Ken's optimizing compiler. Since then it became "deprecated" in the sense that it is included in Icon distributions but not built by default. In 1996, undergraduate Anthony Jones at U of Texas San Antonio did a Bachelor's Honors Thesis on several improvements to iconc. In one year this undergraduate achieved: a 65% reduction in runtime memory requirements for type inferencing, a 10-30% reduction in compile times, and a 5-10% improvement in the execution speed of the resulting code. As we will see from generated code samples, more significant improvements will be fairly "easy" to achieve.
Of the 25K LOC in iconc/, 5K LOC are a type inferencer (typeinfer.c) while 7K+ are the C code generator (ccode.c, codegen.c). The other half of the compiler is divided into many smaller tasks (compiler front end, various analysis and optimizations, symbol tables, etc.). For our purposes it is perhaps most interesting to study iconc from the outside in, so build it using "make Iconc" from the top level, and try it out on some sample programs.
lecture #21 began here
ccomp.o: ccomp.c
	$(CC) $(CFLAGS) -DICONC_XLIB="\"$(LIBS)\"" -c ccomp.c
After a configure step, you also have to either "make Unicon" before trying to "make Iconc", or else you have to tweak config/unix/Config/runtime.mak so that the make Iconc will know to rebuild libtp (tweak available in CVS or ~jeffery/unicon/unicon/config/unix/Config).
Techniques for implementing this kind of behavior on top of conventional imperative control structures were done originally for Icon's VM in assembler, done more elegantly by O'Bagy ("the recursive interpreter") and refined by Walker in his compiler. Later, Proebsting devised yet another, even more elegant formulation for Jcon, a JVM implementation of Icon.
Iconc uses continuations in its generated code, with three tweaks to the basic paradigm:
*result = /* ... descriptor value produced */
return A_Continue;
When a routine fails, result is not filled in and "return A_Resume" is executed. In the caller, this is checked by something like:
switch (operation(args, &result)) {
case A_Continue: break;
case A_Resume: goto failure_label;
}
This handles the various cases, but begs to be simplified in as many
situations as possible, such as: when the operation cannot fail.
Consider the operation i to j. This operation can be implemented in Icon with a procedure like
procedure To(i, j)
while i <= j do {
suspend i
i +:= 1
}
fail
end
It can be implemented by an analogous C function similar to the following (for simplicity, C ints are used here instead of Icon values).
int to(i, j, result, succ_cont)
int i, j;
int *result;
int (*succ_cont)();
{
int signal;
while (i <= j) {
*result = i;
signal = (*succ_cont)();
if (signal != A_Resume)
return signal;
++i;
}
return A_Resume;
}
There is no explicit failure label in this code, but it is possible to view
the code as if an implicit failure label occurs before the ++i.
Note that the above example shows a very common pattern, which is: if
a signal is returned that isn't one the function knows about, it returns
that signal. In this sense, signals can be thought of as a simplified
kind of exception, and can certainly be used to return through many
levels of C calls back to some point high above in the call tree.
The Icon expression
every write(1 to 3)
can be compiled into the following code (for simplicity, the write function
has been translated into printf and scoping issues for result have been
ignored). Note that the every simply introduces failure.
switch (to(1, 3, &result, sc)) { /* standard signal-handling code */
...
}
int sc() {
printf("%d\n", result);
return A_Resume;
}
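The fragments above can be assembled into a small self-contained program piece to trace the control flow. This is a sketch under stated assumptions: the numeric values chosen for A_Resume and A_Continue are invented, result is a global to sidestep the scoping issue, and the continuation also records what it prints in a hypothetical seen[] array so the sequence can be checked; run_every() is likewise a made-up wrapper.

```c
#include <assert.h>
#include <stdio.h>

#define A_Resume   (-1)   /* assumed signal values; iconc's real codes may differ */
#define A_Continue (-2)

static int result;        /* where the generator leaves each value */
static int seen[8];       /* values observed by the continuation */
static int nseen = 0;

/* Generator for "i to j": calls the success continuation once per value. */
int to(int i, int j, int *res, int (*succ_cont)(void))
{
    int signal;
    while (i <= j) {
        *res = i;
        signal = succ_cont();
        if (signal != A_Resume)
            return signal;    /* unknown signal: pass it up to our caller */
        ++i;                  /* implicit failure label: resumption lands here */
    }
    return A_Resume;          /* no more values: the generator itself fails */
}

/* Success continuation for "every write(1 to 3)": write, then fail. */
int sc(void)
{
    printf("%d\n", result);
    if (nseen < 8)
        seen[nseen++] = result;
    return A_Resume;          /* "every" introduces failure to force resumption */
}

/* The expression as a whole; returns the final signal from to(). */
int run_every(void)
{
    return to(1, 3, &result, sc);
}
```

Each value is produced while to() is still active on the C stack, so "suspending" costs a nested C call rather than any saving and restoring of state. The final A_Resume from to() is the failure of the whole every expression.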
The generated code for how execution is to proceed forward (i.e. the success continuation) has to unwind all the suspended generators off the C stack. A location-specific signal (unique integer code) is returned, which is just returned by C functions until the bounded expression is encountered, where the signal can be converted to a goto:
switch (move(1, &trashcan, sc)) {
case 1:
goto L1;
case A_Resume:
goto L1;
}
L1: /* bounding label & failure label */
int function-name(nargs, args, result, succ_cont)
int nargs; dptr args; dptr result;
continuation succ_cont;
{
...
}
succ_cont is allowed to be NULL, indicating that a function will never
be resumed, so iconc already does the optimization I mentioned as a
candidate for the VM.
lecture #22 began here
Walker's dissertation uses some really weird characters; I've figured out how to get them into HTML, and get them to view OK in OpenOffice, but not how to print them (.ps and .pdf show blank for those characters). If anyone has a method of getting ŝ and ĥ to be printable in OpenOffice, better than inserting them as images, I'd be very grateful.
case type(rv := f(x)) of {
"integer": do_int(rv)
"string": do_string(rv)
"list": do_list(rv)
...
}
This is a fairly common pattern in many applications. If selecting a case
is extremely cheap, it is not too bad, but often it is really a long nested
sequence of if-then-elseif's we are replacing here.
procedure f(x, cont_int, cont_str, cont_list)
...
# ... at point where we would return from f, we usually know its type
# write:
cont_str(s)
# instead of: return s
... same for other cases (int, list, etc.)
end
...
f(x, do_int, do_str, do_list)
...
If you want more on continuation passing style, I'll keep dribbling out a
bit on it in future lectures, but the examples from the Wiki page are not
especially excellent and we will need to look further.
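The same idea can be written in C with function pointers, as a rough sketch. The names int_cont, str_cont and the counters are invented for illustration, and f() here is a toy that classifies its argument as a decimal integer or a plain string; the point is only that f selects the continuation at the place where it already knows the type, so the caller needs no case on type().

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*int_cont)(long);          /* continuation for integer results */
typedef void (*str_cont)(const char *);  /* continuation for string results */

static int ints_seen = 0;
static int strs_seen = 0;
static long last_int = 0;

void do_int(long n)
{
    last_int = n;
    ints_seen++;
}

void do_string(const char *s)
{
    (void)s;
    strs_seen++;
}

/*
 * f knows the type of what it produces at each exit point, so it calls
 * the matching continuation directly instead of returning a value that
 * the caller would have to test.
 */
void f(const char *x, int_cont k_int, str_cont k_str)
{
    char *end;
    long n = strtol(x, &end, 10);
    if (end != x && *end == '\0')
        k_int(n);       /* instead of: return an integer */
    else
        k_str(x);       /* instead of: return a string */
}
```

A call such as f("42", do_int, do_string) runs do_int(42) with no intermediate type test in the caller.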
Variables may take different types at runtime:
x := read()
x := numeric(x)
The most general implementation of subsequent operations on x would have to check the types (x may be string, integer, real, or null after this code executes, assuming it was null to begin with).
Usually you can identify a unique type, or maybe a unique type or else a null; in the above example, the function call numeric(x) knows x is a string...unless it is a null because read() failed. The typecheck for numeric(x) might be simplified in this case. In the following example, the second use of variable x can guarantee exactly one type (string) even though the value was supplied by read(), which can fail. No typecheck is needed.
if x := read() then y := x || ";"
But how does the compiler analyze the types for all possible programs, not just for these little toy examples?
Before we run off and try and infer types for the whole language, we need to understand much simpler approximations of both the language semantics we are dealing with and the type information we wish to compute. Start with: ignore semantics except program counter and variable bindings. In the following example, there are four program points at which we may compute information (labeled 1: through 4:).
procedure main()
local s, n
# 1:
s := read()
# 2:
every n := 1 to 2 do {
# 3:
write(s[n])
}
# 4:
end
If the standard input is "abc", the program has the following bindings:
1: [s = null, n = null]
2: [s = "abc", n = null]
3: [s = "abc", n = 1]
3: [s = "abc", n = 2]
4: [s = "abc", n = 2]
A first approximation to what we might want to eventually compute might be:
for every program point, what is the set of all possible bindings?
Since read() can return any string, often there will be infinite sets of
bindings possible at any given point. Type inferencing abstracts the
values and produces abstract bindings of type information:
1: [s = null, n = null]
2: [s = string, n = null]
3: [s = string, n = integer]
4: [s = string, n = integer]
By the way, what is wrong with this analysis? If given an empty file,
read() can fail, so reality is more like:
1: [s = null, n = null]
2: [s = string ++ null, n = null]
3: [s = string ++ null, n = integer]
4: [s = string, n = integer]
Location 4 does not list null as a possibility...why?
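The step from concrete bindings to abstract bindings can be sketched in a few lines of Python. This is not how iconc works (it analyzes the program, it does not watch runs); the sketch only shows what "abstracting the values" means. The function names are hypothetical, only three types are modeled, and the second run assumes that subscripting a null value stops the program with a run-time error.

```python
def type_of(value):
    """Name of the Icon-style type of a concrete value (three types modeled)."""
    if value is None:
        return "null"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"type not modeled in this sketch: {value!r}")

def abstract_bindings(runs):
    """For each program point, collect the set of types each variable held
    in any of the observed runs."""
    table = {}
    for run in runs:
        for point, bindings in run:
            at_point = table.setdefault(point, {})
            for var, value in bindings.items():
                at_point.setdefault(var, set()).add(type_of(value))
    return table

# The run on input "abc", as listed in the notes.
RUN_ABC = [
    (1, {"s": None, "n": None}),
    (2, {"s": "abc", "n": None}),
    (3, {"s": "abc", "n": 1}),
    (3, {"s": "abc", "n": 2}),
    (4, {"s": "abc", "n": 2}),
]

# A run on an empty file: read() fails and s stays null.  Assumption: s[n]
# on a null value is a run-time error, so point 4 is never reached.
RUN_EMPTY = [
    (1, {"s": None, "n": None}),
    (2, {"s": None, "n": None}),
    (3, {"s": None, "n": 1}),
]

types = abstract_bindings([RUN_ABC, RUN_EMPTY])
```

With both runs included, `types[2]["s"]` and `types[3]["s"]` contain both `"string"` and `"null"`, while `types[4]["s"]` contains only `"string"`.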
In any case, this is getting a bit ahead of our thread, and we should restrain ourselves from reality for a bit.
procedure main()
every write(1 to 3)
end
Here is the parse tree and the flow graph. Note that the flow graph can be derived automatically from the parse tree.
[Figure: parse tree and flow graph for the program above]
if x = 7 then {
...
# x is 7 and y is 3
}
else {
...
# (x is null and y is 1) or (x is "abc" and y is 2)
}
x := y + 2
The assignment at the bottom of this has two incoming edges; information
from the two paths is merged via "union".
If-expressions and the like split out two edges, each carrying more specific information based on which edge was taken.
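A small Python sketch of both directions, using the bindings from the if-then-else example. The representation (an environment as a frozen set of variable/value pairs, an edge as a set of environments) and the names `env`, `merge`, and `split` are assumptions made for illustration only.

```python
def env(**bindings):
    """An environment as a hashable set of (variable, value) pairs."""
    return frozenset(bindings.items())

def merge(*edges):
    """Join point: the outgoing information is the union of the sets of
    environments arriving on the incoming edges."""
    merged = set()
    for edge in edges:
        merged |= edge
    return merged

def split(envs, predicate):
    """Conditional: partition the incoming environments by a test, giving
    the more specific sets that flow along the true and false edges."""
    true_edge = {e for e in envs if predicate(dict(e))}
    false_edge = envs - true_edge
    return true_edge, false_edge

# End of the then-part and end of the else-part in the example.
then_edge = {env(x=7, y=3)}
else_edge = {env(x=None, y=1), env(x="abc", y=2)}

# Information available at  x := y + 2 , which has two incoming edges.
at_assignment = merge(then_edge, else_edge)

# Splitting the merged set on "x = 7" recovers the two more specific sets.
true_edge, false_edge = split(at_assignment, lambda e: e["x"] == 7)
```

The merged set holds three environments; splitting it on the condition gives back exactly `then_edge` and `else_edge`.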
lecture #23 began here
Thanks to Kosta for solving my ŝ and ĥ problem (answer was: run a newer version of OpenOffice on a newer Linux distribution).
% iconc typeinf-example.icn
Translating to C:
typeinf-example.icn:
No errors; no warnings
Compiling and linking C code:
/tmp/ccrIzgC3.o(.text+0x214): In function `P000_main':
: undefined reference to `O11_subsc'
collect2: ld returned 1 exit status
*** C compile and link failed ***
Most likely you have accidentally got yourself an incomplete Iconc runtime system library rt.a.
[] subsc 11 2(ud,d) {0,1} fr_e t
"x[y] - access yth character or element of x."
0
1 use_trap $c int use_trap = 0; $e
tcase2 1 4
1 T2 lst abstr nil . vartyp 1 f
if2 ! cnv1 ci 2
call 1 s fr_e_ t 0 0 2 $c dptr dx $e $c &($r1 ) $e $c dptr y $e $c &($m2 ) $e
call 2 s fr___ t 0 0 2 $c C_integer y $e $c $r2 $e $c dptr dx $e $c &($r1 ) $e
1 T11 lst abstr = . vartyp 1 C3 vartyp 2 ++ . vartyp 1 C4 new T6 1 vartyp 1
call 3 s _r_e_ t 0 0 2 $c dptr dx $e $c &($m1 ) $e $c dptr y $e $c &($m2 ) $e
1 T10 lst lst abstr nil . vartyp 1 C2
if1 ! cnv1 ci 2
lst if1 cnv1 T1 2
block _ 0 $c $fail $e
runerr2 101 2
call 4 s fr___ t 0 0 2 $c C_integer y $e $c $r2 $e $c dptr dx $e $c &($r1 ) $e
1 T12 lst abstr nil ++ typ T0 typ T1
call 5 s fr_e_ t 0 0 2 $c dptr y $e $c &($m2 ) $e $c dptr dx $e $c &($m1 ) $e
lst lst if2 && is v 0 is T0 1
lst abstr nil new T5 1 vartyp 0
block t 0 $c $m3 = 1; $e
if2 cnv1 ts 1
abstr nil typ T0
runerr2 114 1
if1 ! cnv1 ci 2
lst if1 cnv1 T1 2
block _ 0 $c $fail $e
runerr2 101 2
call 0 s fr___ t 0 0 4 $c dptr x $e $c &($m0 ) $e $c int use_trap $e $c $r3 $e $c dptr dx $e $c &($r1 ) $e $c C_integer y $e $c $r2 $e
$end
If you build with type tracing available (by default it is not, but unless you can prove to me that it has a measurable performance impact I would think we would want to permanently enable this feature), then whenever the environment variable TYPTRC is set, iconc writes a "trace" to the file it names. If the "filename" starts with a |, it is instead a program to which the type trace is piped as its standard input.
procedure main()
local s, n
# 1:
s := read()
# 2:
every n := 1 to 2 do {
# 3:
write(s[n])
}
# 4:
write(s)
end
If we compile this program with TYPTRC set, the trace looks like this:
typeinf-example.icn (9,20) tvtbl
typeinf-example.icn (9,20) tvsubs
string(s) sub-types: 1
integer(i) sub-types: 1
record(R) sub-types: 0
proc sub-types: 1
coexpr(C) sub-types: 1
tvsubs(sstv) sub-types: 2
tvtbl(tetv) sub-types: 2
null(n) sub-types: 1
cset(c) sub-types: 1
real(r) sub-types: 1
list(L) sub-types: 2
table(T) sub-types: 1
file(f) sub-types: 1
set(S) sub-types: 1
kywdint sub-types: 1
kywdsubj sub-types: 1
kywdpos sub-types: 1
kywdevent sub-types: 1
kywdwin sub-types: 1
kywdstr sub-types: 1
**** iteration 1 ****
(null) (0,0) main()
typeinf-example.icn (5,19) read() =>> {s}
typeinf-example.icn (5,12) s := {s}
typeinf-example.icn (7,23) to({i}, {i}) =>> {i}
typeinf-example.icn (7,18) n := {i}
typeinf-example.icn (9,20) subsc({var:s}->{s n}, {i}) =>> {s sstv1}
typeinf-example.icn (9,18) write({s}) =>> {s}
typeinf-example.icn (12,15) write({s n}) =>> {s n}
**** iteration 2 ****
(null) (0,0) main()
typeinf-example.icn (5,19) read() =>> {s}
typeinf-example.icn (5,12) s := {s}
typeinf-example.icn (7,23) to({i}, {i}) =>> {i}
typeinf-example.icn (7,18) n := {i}
typeinf-example.icn (9,20) subsc({var:s}->{s n}, {i}) =>> {s sstv1}
typeinf-example.icn (9,18) write({s}) =>> {s}
typeinf-example.icn (12,15) write({s n}) =>> {s n}
**** inferencing time: 0 milliseconds
**** inferencing space: 0 bytes
We now return you to your regularly scheduled type-inference lecture. :-)
Last lecture, we looked at an abstract interpretation model ("collecting semantics") based on values, and then noted that doing abstract interpretation based on values would require too much memory (at least in the general case). We said we'd progress from that to type inferencing by three successive abstractions, each of which throws additional information away until we get the nice, usable type inferencing semantics of Iconc.
One interesting thing to note about last lecture is that we were using the program flow graph, and that the information was represented in the edges of the flow graph, so here is an example (as if you needed one) where the "edge" of a graph is a crucial, information-rich data structure, not just a "pointer" to a node.
every i := 1 to 3 do
every j := 4 to 6 do
every k := 2 to 5 do
# ... do something with i,j,k
Consider the flow graph node for the innermost "every" expression.
This node has only failure paths (every never "succeeds"), and in
fact, it has two failure paths: control may resume the generator
(4 to 6) if it has additional results, or it may resume the generator
(1 to 3) if j was already 6 and the middle generator was exhausted.
In general there may be an arbitrary nesting of generators and multiple
failure edges out of a node in the flow graph. This certainly poses
a problem in trying to do values-based abstract interpretation, but
for type inferencing what we will do is abstract it away.
Given these two concepts, the first step toward type inferencing is to take the union of the environments propagated along all the failure paths from a node in the collecting semantics, and propagate that union along each of the failure paths. We are forgetting the distinction between the failure edge that would resume the j-loop and the one that would resume the i-loop, and sending the same information along both edges. We are betting that it won't affect the types we compute for expressions, not in a significant way anyhow.
envir[1] = store[1] × heap[1]
store[1] = variables → values
values = integers ∪ strings ∪ ... ∪ pointers ∪ variables
heap[1] = pointers → lists, where lists = integers → variables
Consider
a := ["abc"]
Let p1 be the pointer to the list and let
v1 be the (anonymous) variable within the
list. The resulting environment, e ∈ envir[1],
might be
e = (s, h), where s ∈ store[1], h ∈ heap[1]
s(a) = p1
s(v1) = "abc"
h(p1) = L1, where L1 ∈ lists
L1(1) = v1
If the statement
a[1] := "xyz"
is executed, the subscripting operation dereferences a producing p1, then uses the heap to find L1, which it applies to 1 to produce the result v1. The only change in the environment at this point is to temporary variables that are not shown. The assignment then updates the store, producing
e1 = (s1, h)
s1(a) = p1
s1(v1) = "xyz"
Assignment does not change the heap. On the other hand, the expression
put(a, "xyz")
adds the string xyz to the end of the list; if it is executed in the environment e, it alters the heap along with adding a new variable to the store.
e1 = (s1, h1)
s1(a) = p1
s1(v1) = "abc"
s1(v2) = "xyz"
h1(p1) = L2
L2(1) = v1
L2(2) = v2
If a formal model were developed for the collecting semantics, it would have an environment similar to the one in Model 1. However, it would need a third component with which to represent the backtracking stack.
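The Model 1 environment and the two updates can be mimicked with plain dictionaries. This is a sketch under stated assumptions: pointers and variables are represented by their names as strings, the store and heap are Python dicts, and `assign_element` and `put` are hypothetical helper names, not runtime system functions.

```python
import copy

def initial_env():
    """Environment e after  a := ["abc"] .
    store: variables -> values; heap: pointers -> lists; list: index -> variable."""
    store = {"a": "p1", "v1": "abc"}
    heap = {"p1": {1: "v1"}}
    return store, heap

def assign_element(env, var, index, value):
    """var[index] := value  changes the store only."""
    store, heap = copy.deepcopy(env)
    pointer = store[var]                  # dereference a, giving p1
    element_var = heap[pointer][index]    # L1(1) = v1
    store[element_var] = value            # update the anonymous variable
    return store, heap

def put(env, var, value, fresh_var):
    """put(var, value)  extends the list in the heap and adds a variable
    to the store."""
    store, heap = copy.deepcopy(env)
    lst = heap[store[var]]
    lst[len(lst) + 1] = fresh_var         # the list gains a new element variable
    store[fresh_var] = value
    return store, heap

e = initial_env()
after_assign = assign_element(e, "a", 1, "xyz")
after_put = put(e, "a", "xyz", "v2")
```

After the assignment the heap component is unchanged and only `v1` differs in the store; after `put` both components differ from `e`.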
lecture #24 began here
Store will map sets of variables to sets of values; heap maps sets of pointers to sets of lists.
Environments contain a store and a heap; what the abstraction buys us is that each program location has only 1 environment.
envir[2] = store[2] × heap[2]
store[2] = 2^variables → 2^values
heap[2] = 2^pointers → 2^lists
In this model, the semantics of an operation like + is to produce the set of values that could be produced.
Example: if Model 1 had these two environments possible at a given program location, for Model 2 we want to lump them together:
e1, e2 ∈ envir[1]
e1 = (s1, h1)
s1(x) = 1
s1(y) = p1
h1(p1) = L1
e2 = (s2, h2)
s2(x) = 2
s2(y) = p1
h2(p1) = L2
Under Model 2 the program point is annotated with the single environment ê ∈ envir[2], where
ê = ( ŝ , ĥ )
ŝ ({x}) = {1, 2}
ŝ ({y}) = {p1}
ŝ ({x, y}) = {1, 2, p1 }
ĥ ( {p1 } ) = { L1, L2}
In going to Model 2, information is lost. In the last example, the fact that x = 1 is paired with p1 = L1 and x = 2 is paired with p1 = L2 is not represented in Model 2.
Definitions of operators such as + under this abstraction, where we are working with sets of values, are pretty easy, for example
{1, 3, 5} + {2, 4} = {1 + 2, 1 + 4, 3 + 2, 3 + 4, 5 + 2, 5 + 4}
= {3, 5, 5, 7, 7, 9}
= {3, 5, 7, 9}
When using variables, instead of just constants, the information we lost
by lumping the environments together is going to mean that our resulting
sets are bigger (less precise) than they could be. Consider
z := x + y
Suppose we lump together the environments
[x = 1, y = 2, z = 0]
[x = 3, y = 2, z = 0]
[x = 5, y = 4, z = 0]
After lumping environments together, our calculation of the set of possible values of zed includes some environmentally impossible possibilities, such as the value 7. This is a "conservative overestimate" for the values, which we can live with.
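The loss of precision is easy to reproduce. The following Python sketch (function names are hypothetical) computes z once per environment, which is exact, and once from the lumped sets of x and y values, which is the Model 2 view.

```python
def add_sets(xs, ys):
    """Model 2 addition: every sum of a value from xs and a value from ys."""
    return {x + y for x in xs for y in ys}

ENVS = [
    {"x": 1, "y": 2, "z": 0},
    {"x": 3, "y": 2, "z": 0},
    {"x": 5, "y": 4, "z": 0},
]

def exact_z(envs):
    """Evaluate z := x + y separately in each environment."""
    return {e["x"] + e["y"] for e in envs}

def lumped_z(envs):
    """Lump the environments first, then add the resulting sets."""
    xs = {e["x"] for e in envs}
    ys = {e["y"] for e in envs}
    return add_sets(xs, ys)

impossible = lumped_z(ENVS) - exact_z(ENVS)
```

Here `exact_z(ENVS)` is `{3, 5, 9}`, `lumped_z(ENVS)` is `{3, 5, 7, 9}`, and the difference is exactly the value 7 mentioned above.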
Three classifications of basic types:
types = { {}, integers, strings,..., (integers ∪ strings),..., (integers ∪ strings ∪ ... ∪ {vk}) }
The store and heap both map types to types. The heap is ALWAYS just mapping a pointer type to a variable type h(Pi)=Vi, so it can sort of be skipped, or computed mechanically, in the type computations.
From Model 2 to Model 3, you pick the smallest type that fits the sets of values you had, for example
{1, 4, 5, "23", "0"}
becomes
integer ∪ string
Suppose an environment from Model 2 is
e ∈ envir[2]
e = (s, h)
s({a}) = { p1 , p2}
s({ v1 }) = {1, 2}
s({ v2 }) = {1}
s({ v3 }) = {12.03}
h({ p1 }) = { L1 , L2 }
h({ p2 }) = { L3 }
L1 (1) = v1
L2 (1) = v1
L2 (2) = v2
L3 (1) = v3
Suppose the pointers p1 and p2 are
both created at program point 1. Then the associated pointer
type is P1 and the associated variable type is V1.
The corresponding environment in Model 3 is
ê ∈ envir[3]
ê = ( ŝ , ĥ )
ŝ ({a}) = P1
ŝ ( V1 ) = integer ∪ real
ĥ ( P1 ) = V1
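As a rough illustration of the Model 2 to Model 3 step, the Python sketch below maps sets of values to unions of basic type names and then computes the type stored under V1 for the example above. The function names are hypothetical, only three basic types are modeled, and the membership of V1 (v1, v2, v3) is taken from the statement that p1 and p2 are both created at program point 1.

```python
def basic_type(value):
    """Name of the basic type of one concrete value (three types modeled)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not modeled in this sketch")
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "real"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"type not modeled in this sketch: {value!r}")

def abstract_values(values):
    """Smallest union of basic types that covers a set of values."""
    return frozenset(basic_type(v) for v in values)

def abstract_variable_type(store, members):
    """Type stored under a variable type: the union over all the Model 2
    variables that the variable type stands for."""
    result = set()
    for var in members:
        result |= abstract_values(store[var])
    return frozenset(result)

# Model 2 store entries for the list element variables in the example.
STORE2 = {"v1": {1, 2}, "v2": {1}, "v3": {12.03}}
V1_MEMBERS = ["v1", "v2", "v3"]

v1_type = abstract_variable_type(STORE2, V1_MEMBERS)
```

`abstract_values({1, 4, 5, "23", "0"})` gives integer ∪ string, and `v1_type` comes out as integer ∪ real, matching ŝ(V1) above.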
lecture #25 began here
Generators prolong the lifetime of temporary values. For example, in
expr1 = find(expr2, expr3)
the temporary storing the result of expr1 cannot be discarded when find() produces its result and we do the compare. If find gets resumed, the compare will be repeated with the same expr1 result as before.
The methodology we will use is an attribute grammar. Attribute grammars may be review for some of you, but the gist of them is: they are rules for computing semantic information for the syntax rules of a CFG. There are two kinds of attributes (synthesized and inherited), defined by whether their information comes from below (synthesized) or above (inherited) in the tree.
v1 = f1(); /* expr1 */
while(--v1) {
v2 = f2(); /* (new) result of find */
v3 = v1 + v2; /* use operands */
f3(v3); /* surrounding / later expression*/
}
v4 = 8;
Now, how many variables do we really need?
What can, and can't, be reused here, among v1-v4?
In order to do this kind of work on temporaries for Icon/Unicon, the extent of the (implicit) loops must be determined, at least approximately. To recognize the implicit loops and calculate their extent, iconc's liveness analysis has to know, for every expression, whether it can fail and whether it can generate multiple results. For built-ins, these are encoded in result sequence clauses: {0} means always fails, {1} means always returns exactly one result, {*} means return 0 or more results, {0,2} means 0 to two results, etc.
Consider the following example
every write(\f, !x, ".")
For this expression we might write the following sequence of instructions:
The implicit loop is from v8 back to the v5 := !v4. Lifetimes are at the left. The dots show liveness being extended by the implicit loop. Categorize v1-v8 based on their relationship to the implicit loop. To do liveness, we are going to do the analysis to detect the implicit loops.
Some terms:
lecture #26 began here
Resumption points: these include generators, and subexpressions that contain generators. (actually, there are a few more things, not quite generators, that count as resumption points). Resumption points are the starts of implicit loops. In our little example, what are the resumption points?
Liveness analysis uses four attributes: resumer, failer, gen, and lifetime.
expr ::= literal {
expr.failer := expr.resumer
expr.gen := false
}
&fail is not used much in explicit source code, but it can be used
to implement "every" and similar expressions. expr.node is a reference
to itself, since &fail will always be its "rightmost operation".
expr ::= &fail {
expr.failer := expr.node
expr.gen := false
}
Addition exemplifies the flow of information for non-base-case expressions.
If it looks like the resumer/failer information flow follows a particular
tree traversal order, that is probably not an accident.
As you may recall about attribute grammars, one writes rules at each level to say how the attribute information is to be passed around, but the actual algorithm to compute the attributes might be complex and involve multiple passes through the tree; do not confuse this with YACC semantic actions:
expr ::= expr1 + expr2 {
expr2.resumer := expr.resumer
expr2.lifetime := expr.node
expr1.resumer := expr2.failer
if expr2.gen & (expr.resumer ≠ null) then
expr1.lifetime := expr.resumer
else
expr1.lifetime := expr.node
expr.failer := expr1.failer
expr.gen := (expr1.gen | expr2.gen)
}
/expr shows some interesting differences from addition; besides being a
unary operator, it can fail.
expr ::= /expr1 {
if expr.resumer = null then
expr1.resumer := expr.node
else
expr1.resumer := expr.resumer
expr1.lifetime := expr.node
expr.failer := expr1.failer
expr.gen := expr1.gen
}
!expr is an example of a generator. Variable lifetimes have to be
extended as far forward as might get back to here.
expr ::= !expr1 {
if expr.resumer = null then {
expr1.resumer := expr.node
expr1.lifetime := expr.node
}
else {
expr1.resumer := expr.resumer
expr1.lifetime := expr.resumer
}
expr.failer := expr1.failer
expr.gen := true
}
The not operator bounds its expression: if the expr succeeds (so that not fails), the expr is not going to get resumed; and if the expr fails and not inverts that into a success, not cannot be resumed for additional results.
expr ::= not expr1 {
expr1.resumer := null
expr1.lifetime := null
if expr.resumer = null then
expr.failer := expr.node
else
expr.failer := expr.resumer
expr.gen := false
}
Alternation is another generator with interesting semantics:
expr ::= expr1 | expr2 {
expr2.resumer:= expr.resumer
expr2.lifetime := expr.lifetime
expr1.resumer := expr.resumer
expr1.lifetime := expr.lifetime
expr.failer := expr2.failer
expr.gen := true
}
if-then-else bounds the control expression, but not the then- or else-part
expr ::= if expr1 then expr2 else expr3 {
expr3.resumer := expr.resumer
expr3.lifetime := expr.lifetime
expr2.resumer := expr.resumer
expr2.lifetime := expr.lifetime
expr1.resumer := null
expr1.lifetime := null
if expr.resumer = null & (expr1.failer ≠ null | expr2.failer ≠ null) then
expr.failer := expr.node
else
expr.failer := expr.resumer
expr.gen := (expr2.gen | expr3.gen)
}
every-loops look like:
expr ::= every expr1 do expr2 {
expr2.resumer := null
expr2.lifetime := null
expr1.resumer := expr.node
expr1.lifetime := null
expr.failer := expr.node
expr.gen := false
}
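To see how the inherited attributes (resumer, lifetime) and the synthesized ones (failer, gen) interact, here is a small Python sketch that evaluates three of the rules above (literal, addition, and !) on the tree for a + !x. It is only an illustration: the `Node` class and `analyze` function are hypothetical, variable references are treated like literals, and the right-to-left recursion is one evaluation order that happens to satisfy the dependencies in these three rules.

```python
class Node:
    def __init__(self, kind, label, *kids):
        self.kind = kind          # "leaf", "+", or "!"
        self.label = label        # only used to identify nodes when debugging
        self.kids = kids
        self.resumer = None       # inherited
        self.lifetime = None      # inherited
        self.failer = None        # synthesized
        self.gen = False          # synthesized

def analyze(expr):
    """Compute failer and gen for expr, assuming expr.resumer and
    expr.lifetime were already set by the parent (None stands for null)."""
    if expr.kind == "leaf":
        expr.failer = expr.resumer
        expr.gen = False
    elif expr.kind == "+":
        e1, e2 = expr.kids
        e2.resumer = expr.resumer
        e2.lifetime = expr
        analyze(e2)                       # right operand first: e1 needs e2.failer
        e1.resumer = e2.failer
        if e2.gen and expr.resumer is not None:
            e1.lifetime = expr.resumer
        else:
            e1.lifetime = expr
        analyze(e1)
        expr.failer = e1.failer
        expr.gen = e1.gen or e2.gen
    elif expr.kind == "!":
        (e1,) = expr.kids
        if expr.resumer is None:
            e1.resumer = expr
            e1.lifetime = expr
        else:
            e1.resumer = expr.resumer
            e1.lifetime = expr.resumer
        analyze(e1)
        expr.failer = e1.failer
        expr.gen = True
    else:
        raise ValueError(f"rule not modeled in this sketch: {expr.kind}")
    return expr

# Tree for  a + !x , analyzed with nothing resuming the whole expression.
a = Node("leaf", "a")
x = Node("leaf", "x")
bang = Node("!", "!x", x)
plus = Node("+", "a + !x", a, bang)
analyze(plus)
```

For this tree the sketch reports that the sum is a generator (`plus.gen` is true), that the left operand a is resumed by the ! node (`a.resumer is bang`), and that a's lifetime extends to the addition (`a.lifetime is plus`).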
lecture #27 began here
Earlier we saw examples of the RTL language in which the runtime system is written, an extended dialect of C. Now we can say: the extensions to the C language support the type inferencing and liveness analyses.
Example operation headers:
function{0,1+} move(i)
function{} bal(c1,c2,c3,s,i,j)
operator{1} [...] llist(elems[n])
operator{0,1} / null(underef x -> dx)
keyword{3} regions
Sample type checks (outside inline/body = done at compile time):
if is:list(x) then ...
if cnv:string(s) then ...
There are also forms of these that provide default values, and versions that convert all the way down to common C types.
Sample C code implementations:
inline { extended C }
body { extended C }
Comments on the "database" rt.db. It contains the entire code for each operation, in pre-parsed form, except for "body" code fragments, for which it contains the name of the function to call. From the database version of the operation, the code generator can produce a custom ("in-line") version of the invocation of that operation, taking into account the type inferencing information that is known, which usually eliminates all, or at least some, of the type checks. Ideally, the compiler produces a small piece of type-specific in-line C code.
Besides the basic types and structure types discussed earlier in the type inferencing, there are many additional types necessary to handle the full semantics of the language, spelled out in detail in Chapter 19. Most structure types get one subtype per constructor location. Records do not. One type per record declaration is considered enough since records are used more consistently/uniformly than other structure types that provide generic "glue" for different purposes in different places.
Although conceptually every edge in the flow graph needs an entire store for the whole program, in practice, only local variables have to be maintained in such ministores; globals and statics can be shared in a globalstore. Further: the edges of the flow graph are not represented explicitly, but recomputed on the fly during forward traversals on the syntax tree. On a per-procedure basis, there are three stores: one for all incoming calls, one for all resumptions, and one for all return/suspend/fails.
Four parameters: argc, argv, &result, and success_continuation. Various shortcuts on special cases: argv, result, and success_continuation are skipped by procedures that won't use them.
In the outer function, a "procedure frame" variable is declared; continuations have to refer to it via pointer to struct. Global variables pfp and argp are used when switching between frames, and e.g. for a continuation to find the procedure's frame.
struct PF00_main {
struct p_frame *old_pfp;
dptr old_argp;
dptr rslt;
continuation succ_cont;
struct {
struct tend_desc *previous;
int num;
struct descrip d[5];
} tend;
};
Template for an empty procedure p():
static int P01_p(args, rslt)
dptr args;
dptr rslt;
{
struct PF01_p frame;
register int signal;
int i;
frame.old_pfp = pfp;
pfp = (struct p_frame *)&frame;
frame.old_argp = argp;
frame.rslt = rslt;
frame.succ_cont = NULL;
for (i = 0; i < 3; ++i)
frame.tend.d[i].dword = D_Null;
argp = args;
frame.tend.num = 3;
frame.tend.previous = tend;
tend = (struct tend_desc *)&frame.tend;
/* translation of the body of procedure p */
L10: /* bound */
L4: /* proc fail */
tend = frame.tend.previous;
pfp = frame.old_pfp;
argp = frame.old_argp;
return A_Resume;
L8: /* proc return */
tend = frame.tend.previous;
pfp = frame.old_pfp;
argp = frame.old_argp;
return A_Continue;
}
If a continuation fails or returns, the signal is converted into a goto
when it reaches the outer function.
Sample continuation. This might get generated, e.g. in "if a = (1 | 2)":
static int P02_main()
{
register struct PF00_main *rpfp;
rpfp = (struct PF00_main *)pfp;
switch (O0o_numeq(2, &(rpfp->tend.d[1]), &trashcan, (continuation)NULL))
{
case A_Continue:
break;
case A_Resume:
return A_Resume;
}
return 4; /* bound */
}
Sample code for (return if a=(1|2) then "yes" else "no"):
frame.tend.d[1].dword = D_Var;
frame.tend.d[1].vword.descptr = &frame.tend.d[0] /* a */;
frame.tend.d[2].dword = D_Integer;
frame.tend.d[2].vword.integr = 1;
switch (P02_main()) {
case A_Resume:
goto L2 /* alt */;
case 4 /* bound */:
goto L4 /* bound */;
}
L2: /* alt */
frame.tend.d[2].dword = D_Integer;
frame.tend.d[2].vword.integr = 2;
switch (P02_main()) {
case A_Resume:
goto L5 /* else */;
case 4 /* bound */:
goto L4 /* bound */;
}
L4: /* bound */
rslt->vword.sptr = "yes";
rslt->dword = 3;
goto L6 /* end if */;
L5: /* else */
rslt->vword.sptr = "no";
rslt->dword = 2;
L6: /* end if */
deref(rslt, rslt);
goto L7 /* proc return */;
Amusing observation: sample code from Ken's dissertation shows a deref()
on the return, which obviously is a noop here.
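The control flow of the generated code above can be paraphrased in Python, which may make the role of the signals easier to see. This is only a loose model: Python has no goto, so the labels turn into a loop and return statements; the signal names, the dictionary used as a frame, and both function names are hypothetical.

```python
A_RESUME = "resume"   # the continuation failed: try the next alternative
BOUND = "bound"       # the continuation succeeded: leave the bounded expression

def numeq_continuation(frame):
    """Plays the role of the sample continuation: compare a with the
    alternative currently stored in the frame."""
    if frame["a"] == frame["alt"]:
        return BOUND
    return A_RESUME

def if_a_eq_1_or_2(a):
    """Outer function for:  if a = (1 | 2) then "yes" else "no" ."""
    frame = {"a": a, "alt": None}
    for alternative in (1, 2):            # the two arms of (1 | 2)
        frame["alt"] = alternative
        signal = numeq_continuation(frame)
        if signal == BOUND:               # corresponds to "goto L4 /* bound */"
            return "yes"
        # A_RESUME: fall through to the next arm, as in "goto L2 /* alt */"
    return "no"                           # corresponds to "goto L5 /* else */"
```

Calling `if_a_eq_1_or_2(2)` fails the first comparison, resumes the alternation, succeeds on the second arm, and returns "yes"; any other value than 1 or 2 runs out of alternatives and returns "no".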
lecture #28 began here
This section contains supplementary material (examples) on type inferencing and liveness analysis.