lecture #1 began here
In order to make this scheduled class time more bearable (for me), I propose to hold office hours Tuesday and Thursday after class from 12-1pm in the Corbett Center "food court". You may eat with me, or just ask questions. I am always going to be eating the Mexican "Daily" Special at the food court, which has not changed at NMSU in three years (The answer to official NM State Paradox: one red and one green).
Of course, I am also available in my office by appointment, and sometimes happy to take drop-in visits. E-mail works best though.
On the other hand, if we tried to look at, say, GCC as our compiler of study, it is very unlikely that most of you would enjoy the process. Most real compilers are quite complex, and were not designed to be "read". An exception is [Fraser/Hanson 95], see below.
For the reasons given above, in this course I propose to describe for you, and provide essential documentation on, my own open source compiler for the Unicon language, and the compilers for the Icon language upon which Unicon builds.
For example, figuring out how to implement high-level constructs has moved beyond while loops, or function calls, to more advanced topics such as object-orientation, or concurrency. Improving the efficiency of generated code might mean: special-purpose transformations to take advantage of novel features of modern hardware, or it might mean: customizing a virtual machine, or mixing virtual machine code and native code.
One big difference between this course and the first compiler course is this: in the first compiler course, the major focus is on writing a compiler from scratch, starting from the ground up with lexical and syntax analysis. In this course, we will instead study some existing compilers that are part of a language family (the Icon and Unicon languages), and your assignments and semester project will add to or improve those compilers, rather than requiring you to write a whole compiler yourself.
The first two versions of the Icon language were developed in Ratfor in the late 1970's. Starting around 1980 this implementation was discarded and the language rewritten from scratch in C on UNIX. A virtual machine was adopted and refined over a decade. Over time, many parts originally written in assembler were rewritten in C as the semantics of the language came to be better understood.
Around 1990, another radical rewrite was produced in order to support an optimizing compiler with much higher performance than the virtual machine. The new compiler was written from scratch, but its runtime system was produced by ingeniously adapting the code from the virtual machine. At approximately the time of Ralph Griswold's retirement in the mid-1990's, the Icon language was more or less frozen; the culminating work on the subject being the 3rd edition of "The Icon Programming Language", published in 1997.
But people in the Internet Age wanted applications to do more than they did in the 1980's. Additions to the open-source language were made by various authors. Around 2000, several such additions were merged together by a group led by Clint Jeffery under the name Unicon; the project was placed under the GPL and moved to SourceForge.net.
The most recent major addition to the language, done at NMSU, has been a set of portable high-level 3D graphics facilities, built on top of but much easier to learn and use than OpenGL. Current and on-going work includes the addition of portable audio and video support, in order to accommodate the needs of Collaborative Virtual Environments. With such extensions, the goal is not to add access to a popular C or C++ or Java API, but rather, to design higher level abstractions at the level most programmers are comfortable working, and then figure out how to provide those abstractions.
lecture #2 began here
          +-bin------          executable binaries and support files
          +-config--+          configurations
          |         +-unix---+
          |         |        +-intel_linux-
          |         |        +-sun_gcc-----
          |         +-win32---
          |         |        +-gcc---------
          |         |        +-msvc--------
/-unicon--+-src------          source code (C)
          |         +-common--
          |         +-h-------
          |         +-icont---
          |         +-iconc---
          |         +-rtt-----
          |         +-runtime-
          +-tests----          tests
          |         +-general-
          |         +-posix---
          +-unicon---          source code (Unicon)
                    +-unicon--
lecture #3 began here
s ? expr causes a string s to be the subject of scanning by expr. Within expr, s is a default argument to all string functions, and a scanning position is maintained and used to move through the string looking for specified patterns. move(rel) and tab(abs) move to a position and return the substring between the old and new position. Several functions such as find() and many() return positions within the string that can be used as the destinations for tab(). The unary operator =x is tab(match(x)), moving the position past x if x is the next thing in the string.
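For readers more at home in C, the scanning environment can be pictured as a subject string plus a current position. The following is only a minimal sketch under that assumption, not how the Icon runtime implements scanning. The names scan_env, scan_init, scan_tab, scan_move, scan_match and scan_eq are hypothetical; offsets are 0-based rather than Icon's 1-based positions, and failure is reported with NULL, -1 or 0 instead of Icon's goal-directed failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scanning environment: a subject string and a 0-based offset. */
typedef struct {
    const char *subject;
    size_t len;
    size_t pos;
} scan_env;

void scan_init(scan_env *e, const char *s) {
    assert(s != NULL);
    e->subject = s;
    e->len = strlen(s);
    e->pos = 0;
}

/* Like tab(abs): move to offset `to` and return a malloc'd copy of the text
 * between the old and new positions. Returns NULL if `to` is out of range. */
char *scan_tab(scan_env *e, size_t to) {
    if (to > e->len) return NULL;
    size_t lo = e->pos < to ? e->pos : to;
    size_t hi = e->pos < to ? to : e->pos;
    char *out = malloc(hi - lo + 1);
    if (out == NULL) return NULL;
    memcpy(out, e->subject + lo, hi - lo);
    out[hi - lo] = '\0';
    e->pos = to;
    return out;
}

/* Like move(rel): move relative to the current position. */
char *scan_move(scan_env *e, long rel) {
    long to = (long)e->pos + rel;
    if (to < 0) return NULL;
    return scan_tab(e, (size_t)to);
}

/* Like match(x): offset just past x if x comes next, otherwise -1. */
long scan_match(const scan_env *e, const char *x) {
    size_t n = strlen(x);
    if (n > e->len - e->pos) return -1;
    if (strncmp(e->subject + e->pos, x, n) != 0) return -1;
    return (long)(e->pos + n);
}

/* Like =x, that is tab(match(x)): advance past x and return 1, or return 0. */
int scan_eq(scan_env *e, const char *x) {
    long to = scan_match(e, x);
    if (to < 0) return 0;
    e->pos = (size_t)to;
    return 1;
}
```

Scanning "#line 42" with scan_eq(&e, "#") and then scan_eq(&e, "line ") leaves the position just before "42", which scan_tab(&e, e.len) then returns. The caller has to free each returned substring, which is one of the chores Icon's scanning hides.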
It is reasonable to ask whether Icon string scanning makes processing of strings by hand easier than doing it in C or Java, and whether Icon string scanning is similar to the scanning done by compiler scanners.
$include "ytab_h.icn"                    # yacc's token categories
global yytext                            # lexeme
global yyin                              # source file we are reading
global yytoken                           # token (a record)
global yylineno, yycolno, yyfilename     # source location

global O, D, L, H, R, FS, IS, W, idchars

procedure init_csets()
O := '01234567'
D := &digits
L := &letters ++ '_'
H := &digits ++ 'abcdefABCDEF'
R := &digits ++ &letters
FS := 'fFlL'
IS := 'uUlL'
W := ' \t\v'
idchars := L ++ D
end
record token(tok, s, line, column, filename)
procedure reswords()
static t
initial {
t := table([Beginner+Ender, IDENT])
t["abstract"] := [0, ABSTRACT]
t["break"] := [Beginner+Ender, BREAK]
t["by"] := [0, BY]
t["case"] := [Beginner, CASE]
t["class"] := [0, CLASS]
t["create"] := [Beginner, CREATE]
t["default"] := [Beginner, DEFAULT]
t["do"] := [0, DO]
t["else"] := [0, ELSE]
t["end"] := [Beginner, END]
t["every"] := [Beginner, EVERY]
t["fail"] := [Beginner+Ender, FAIL]
t["global"] := [0, GLOBAL]
t["if"] := [Beginner, IF]
t["import"] := [0, IMPORT]
t["initial"] := [Beginner, iconINITIAL]
t["initially"] := [Ender, INITIALLY]
t["invocable"] := [0, INVOCABLE]
t["link"] := [0, LINK]
t["local"] := [Beginner, LOCAL]
t["method"] := [0, METHOD]
t["next"] := [Beginner+Ender, NEXT]
t["not"] := [Beginner, NOT]
t["of"] := [0, OF]
t["package"] := [0, PACKAGE]
t["procedure"] := [0, PROCEDURE]
t["record"] := [0, RECORD]
t["repeat"] := [Beginner, REPEAT]
t["return"] := [Beginner+Ender, RETURN]
t["static"] := [Beginner, STATIC]
t["suspend"] := [Beginner+Ender, SUSPEND]
t["then"] := [0, THEN]
t["to"] := [0, TO]
t["until"] := [Beginner, UNTIL]
t["while"] := [Beginner, WHILE]
}
return t
end
This "big-inhale" model did not work well on original 128K PDP-11 UNIX computers, but works well in this century. At present, the code assumes Unicon source files are less than a megabyte -- a lazy programmer's error. Although Unicon programs are much shorter than C programs, an upper limit of 1MB is bound to be reached someday. Homework: fix the lexical analyzer so it works correctly on files of arbitrarily large size.
procedure yylex_reinit()
yytext := ""
yylineno := 0
yycolno := 1
lastchar := ""
if type(yyin) == "file" then
buffer := reads(yyin, 1000000)
else
buffer := yyin
tokflags := 0
end
This little procedure is entirely hidden from the regular lexical analyzer code by writing that regular code in a helper function yylex2(), and writing the semicolon insertion logic in a yylex() function that calls yylex2 when it needs a new token.
Initialization for the yylex() function shows the static variables used to implement the one token of lookahead. If the global variable buffer doesn't hold a string anymore, /buffer will succeed and it must be that we are at end-of-file and should return 0.
procedure yylex()
static saved_tok, saved_yytext
local rv, ender
initial {
if /buffer then
yylex_reinit()
}
if /buffer then {
if \debuglex then
write("yylex() : 0")
return 0
}
If we inserted a semicolon last time we were called, the saved_tok will
be the first token of the next line; we should return it.
if \saved_tok then {
rv := saved_tok
saved_tok := &null
yytext := saved_yytext
yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
if \debuglex then
write("yylex() : ",tokenstr(rv), "\t", image(yytext))
return rv
}
Otherwise, we should obtain the next token by calling yylex2().
We have to check for end of file, remember if the last token could
end an expression, call yylex2(), and update buffer to be the smaller
string remaining after the token.
ender := iand(tokflags, Ender)
tokflags := 0
if *buffer=0 then {
buffer := &null
if \debuglex then
write("yylex() : EOFX")
return EOFX
}
buffer ? {
if rv := yylex2() then {
buffer := tab(0)
}
else {
buffer := &null
yytext := ""
if \debuglex then
write("yylex() : EOFX")
return EOFX
}
}
After fetching a new token, we have to decide whether to insert a semicolon
or not. This is based on local variable ender (whether the previous token
could end an expression) and global variable tokflags (which holds both whether
the current token could begin an expression, and whether a newline occurred
between the last token and the current token). iand() is a bitwise AND,
equivalent to the C language & operator, used to pick bits out of a set of
boolean flags encoded as bits within an integer.
if ender~=0 & iand(tokflags, Beginner)~=0 & iand(tokflags, Newline)~=0 then {
saved_tok := rv
saved_yytext := yytext
yytext := ";"
rv := SEMICOL
}
Returning a token requires allocation of a token() record instance, which
is stored in a global variable.
yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
if \debuglex then
write("yylex() : ", tokenstr(rv), "\t", image(yytext))
return rv
end
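The semicolon insertion test in yylex() boils down to three bit tests. Here is a small C sketch of just that decision. The flag values are assumptions chosen for illustration (the real values of Beginner, Ender and Newline are not shown in these notes), and needs_semicolon() is a hypothetical name.

```c
#include <assert.h>

/* Flag bits; these numeric values are assumed for illustration only. */
enum { Beginner = 1, Ender = 2, Newline = 4 };

/* A semicolon is inserted when the previous token could end an expression,
 * the current token could begin one, and a newline separated the two. */
int needs_semicolon(int prev_flags, int cur_flags) {
    /* Reject flag words that carry bits this sketch does not know about. */
    assert((prev_flags & ~(Beginner | Ender | Newline)) == 0);
    assert((cur_flags & ~(Beginner | Ender | Newline)) == 0);
    return (prev_flags & Ender) != 0
        && (cur_flags & Beginner) != 0
        && (cur_flags & Newline) != 0;
}
```

For example, needs_semicolon(Ender, Beginner | Newline) is 1, while the same two tokens on one line, needs_semicolon(Ender, Beginner), give 0.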
procedure yylex2()
static punc_table
initial {
init_csets()
reswords := reswords()
punc_table := table(uni_error)
punc_table["'"] := do_literal
punc_table["\""] := do_literal
punc_table["!"] := do_bang
punc_table["%"] := do_mod
punc_table["&"] := do_and
punc_table["*"] := do_star
punc_table["+"] := do_plus
punc_table["-"] := do_minus
punc_table["."] := do_dot
punc_table["/"] := do_slash
punc_table[":"] := do_colon
punc_table["<"] := do_less
punc_table["="] := do_equal
punc_table[">"] := do_greater
punc_table["?"] := do_qmark
punc_table["@"] := do_at
punc_table["\\"] := do_backslash
punc_table["^"] := do_caret
punc_table["|"] := do_or
punc_table["~"] := do_tilde
punc_table["("] := do_lparen
punc_table[")"] := do_rparen
punc_table["["] := do_lbrack
punc_table["]"] := do_rbrack
punc_table["{"] := do_lbrace
punc_table["}"] := do_rbrace
punc_table[","] := do_comma
punc_table[";"] := do_semi
punc_table["$"] := do_dollar
every punc_table[!&digits] := do_digits
every punc_table["_" | !&letters] := do_letters
}
The main lexical analyzer code strips comments and whitespace, and
calls the function table for the first non-whitespace character it
finds. Note support for #line directives, and the use of string
scanning.
yycolno +:= *yytext
repeat {
if pos(0) then fail
if ="#" then {
if ="line " then {
if yylineno := integer(tab(many(&digits))) then {
=" \""
yyfilename := tab(find("\"")|0)
}
}
tab(find("\n") | 0)
next
}
if ="\n" then {
yylineno +:= 1
yycolno := 1
if tokflags < Newline then
tokflags +:= Newline
next
}
if tab(any(' ')) then { yycolno +:= 1; next }
if tab(any('\v\^l')) then { next }
if tab(any('\t')) then {
yycolno +:= 1
while (yycolno-1) % 8 ~= 0 do yycolno +:= 1
next
}
yytext := move(1)
return punc_table[yytext]()
}
end
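The punc_table idea, a table of procedures indexed by the first character of the lexeme, carries over to C as an array of function pointers. This is a sketch only: the token codes and handler names below are hypothetical, the handlers just return a code rather than matching the rest of the lexeme, and it is not how the C-based Icon lexical analyzer is organized.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes for the sketch. */
enum { TOK_ERROR = 0, TOK_COMMA, TOK_SEMI, TOK_NUMBER, TOK_NAME };

typedef int (*lex_fn)(void);

static int do_error(void)  { return TOK_ERROR; }
static int do_comma(void)  { return TOK_COMMA; }
static int do_semi(void)   { return TOK_SEMI; }
static int do_digits(void) { return TOK_NUMBER; }
static int do_name(void)   { return TOK_NAME; }

/* One handler per possible first character. */
static lex_fn dispatch[256];

void init_dispatch(void) {
    for (int i = 0; i < 256; i++)
        dispatch[i] = do_error;            /* default, like table(uni_error) */
    dispatch[','] = do_comma;
    dispatch[';'] = do_semi;
    dispatch['_'] = do_name;
    for (int i = 0; i < 256; i++) {
        if (isdigit(i)) dispatch[i] = do_digits;
        if (isalpha(i)) dispatch[i] = do_name;
    }
}

/* Call the handler selected by the first character of the lexeme. */
int classify(unsigned char first) {
    assert(dispatch[first] != NULL);       /* init_dispatch() must run first */
    return dispatch[first]();
}
```

After init_dispatch(), classify(',') selects the comma handler and classify('x') the identifier handler; any character without an entry falls through to the error handler.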
The functions in the punctuation table select integer codes and match
the rest of the lexeme. do_comma() illustrates an unambiguous token
selection, while do_plus() illustrates a more common case where the
"+" character could start any of 5 different tokens depending on the
character(s) that follow it. Tokens starting with "letters" are looked
up in a reserved words
table, which tells whether they are special, or just a variable name.
procedure do_comma()
return COMMA
end
procedure do_plus()
if yytext ||:= =":" then {
if yytext ||:= ="=" then { return AUGPLUS }
return PCOLON
}
if yytext ||:= ="+" then {
if yytext ||:= =":=" then {return AUGUNION}
return UNION
}
tokflags +:= Beginner
return PLUS
end
procedure do_letters()
yytext ||:= tab(many(idchars))
x := reswords[yytext]
tokflags +:= x[1]
return x[2]
end
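For comparison, here is what the same longest-match decision for "+" might look like in C, examining characters one at a time. It is a sketch that mirrors the logic of do_plus() above; the numeric token values and the name lex_plus() are hypothetical, and the input is assumed to be a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical numeric codes for the five tokens that start with '+'. */
enum { PLUS = 1, PCOLON, AUGPLUS, UNION, AUGUNION };

/* s points at a '+'. Returns the token code for the longest operator that
 * starts there and stores the lexeme length in *len. */
int lex_plus(const char *s, size_t *len) {
    assert(s[0] == '+');
    if (s[1] == ':') {
        if (s[2] == '=') { *len = 3; return AUGPLUS; }    /* +:=  */
        *len = 2; return PCOLON;                          /* +:   */
    }
    if (s[1] == '+') {
        if (s[2] == ':' && s[3] == '=') {                 /* ++:= */
            *len = 4; return AUGUNION;
        }
        *len = 2; return UNION;                           /* ++   */
    }
    *len = 1; return PLUS;                                /* +    */
}
```

Note how much of the C version is index bookkeeping that the Unicon version gets for free from the scanning position.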
lecture #4 began here
buffer := reads(yyin, 1000000)
buffer := ""; while buffer ||:= reads(yyin, 1000000)
buffer := reads(yyin, stat(yyin).size)
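The three lines above appear to be a fixed-size read, a loop that keeps appending chunks until end of file, and a single read sized by asking the file system. For comparison, a C version of the second approach might look like the following. It is a sketch under stated assumptions: slurp() is a hypothetical name, the initial capacity of 4096 is arbitrary, and only standard C library calls are used.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read everything remaining in fp into a malloc'd, NUL-terminated buffer.
 * Returns NULL on I/O or allocation failure. The caller frees the result. */
char *slurp(FILE *fp, size_t *out_len) {
    assert(fp != NULL);
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;
    for (;;) {
        /* Always leave one byte free for the terminating NUL. */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (n == 0) break;                 /* end of file or error */
        if (cap - len <= 1) {              /* buffer is full: double it */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) { free(buf); return NULL; }
            buf = bigger;
            cap *= 2;
        }
    }
    if (ferror(fp)) { free(buf); return NULL; }
    buf[len] = '\0';
    if (out_len != NULL) *out_len = len;
    return buf;
}
```

Doubling the capacity keeps the number of reallocations logarithmic in the file size, so no upper limit on the source file is built in.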
The lexical analyzer has a header src/h/lexdef.h containing some global declarations and useful macros. src/common/lextab.h contains a "token table" similar to Unicon's reswords table, and an "operator table" encoding for each operator both its beginner/ender flags and whether it is unary, binary, or is used both ways. lextab.h also includes a function getopr() that takes 400 lines of switches and if-statements to examine characters one at a time and calculate the longest operator when several operator characters are adjacent.
Although mostly handwritten, there are a couple of juicy specification files, src/common/tokens.txt and src/common/op.txt, that define the tokens and operators; these are processed by an Icon program (src/common/mktoktab.icn) to generate certain .h files in the ultra-rare event that a new token or operator is added. These operations are so rare that the Makefiles do not even consider them by default.
The main "lexical analyzer" file is src/common/yylex.h. The logic in this lexical analyzer is similar to Unicon's lexical analyzer. When semicolon insertion, whitespaces, and comments are handled and a new actual token is being examined, the code boils down to four possibilities:
if (isalpha(c) || (c == '_')) { /* gather ident or reserved word */
if ((t = getident(c, &cc)) == NULL)
goto loop;
}
else if (isdigit(c) || (c == '.')) { /* gather numeric literal or "." */
if ((t = getnum(c, &cc)) == NULL)
goto loop;
}
else if (c == '"' || c == '\'') { /* gather string or cset literal */
if ((t = getstring(c, &cc)) == NULL)
goto loop;
}
else { /* gather longest legal operator */
if ((n = getopr(c, &cc)) == -1)
goto loop;
t = &(optab[n].tok);
yylval = OpNode(n);
}
Aside from syntax and the larger number of lines of code, some of the main
differences between the C-based lexical analyzer and the Unicon lexical
analyzer for this language include:
yyin := ""
every yyin ||:= preprocessor(fName, uni_predefs) do yyin ||:= "\n"
#begdef foo(x)
...body of foo can be as long as you like
...more body of foo ...
#enddef
Note that the conventional ANSI C preprocessor has already "tokenized" the input into chunks in order to look for macro names and such; the rtt lexical analyzer gets text chunks one at a time, and categorizes them. src/rtt/rttlex.c is only 430 lines of code, but the C preprocessor that is doing much of the work is another 6500 lines, in the src/preproc/ directory! It also keeps its size down by utilizing the symbol table code (a hash table) to handle the reserved words.
There is some amount of lexical awkwardness in rtt due to the extended C syntax. For example, p ** q in regular C might mean "multiply p by what q points at"; under rtt C code fragments may use this meaning but the extra syntax (for type inferencing) uses ** as a set intersection operator, and a global variable flag has to keep track and change the lexical analyzer behavior depending on whether one is in "type mode" or "regular C mode".
The yylex() function, since its preprocessor has already provided it with the sequence of characters that will comprise the token, is mainly looking at the characters to select what integer code to return. For example, for strings that might be identifiers:
if (yylval.t->tok_id == Identifier) {
/*
* See if this is an identifier, a reserved word, or typedef name.
*/
sym = sym_lkup(yylval.t->image);
if (sym != NULL)
yylval.t->tok_id = sym->tok_id;
}
do
tok = cpp_get_token (parse_in);
while (tok->type == CPP_PADDING);
...
switch (tok->type) {
/* 10 branches for different C preprocessor categories,
that require special handling, including errors and constants
*/
}
return tok->type;
Some apparent lexical-analysis information is located more in the parser
file, c-parse.y. For example, a reserved words array (reswords) lives
in that file, including the actual spelling of all the C reserved
words. There are extra complications from the fact that the gcc C front
end supports gcc, g++, and GNU Objective C.
At first glance, the lexical analyzer appears to live in Parser/tokenizer.[ch], about 1500 lines of C. The lexical analyzer is interesting in that Python is often used in an interactive interpreter mode; the lexical analyzer has to worry about whether it is reading from a file or a terminal.
There are about 53 token categories in Python. A lot of lexical analyzer code handles Unicode, and includes tricks for international character handling; for example, the source encoding is specified in an optional special comment.
Because the lexical analyzer is feeding an interpreter, Python tokens are full-blown Python objects, not just C structs.
The main lexical analysis function is int tok_get(tok, pstart, pend), 400 lines or so, which reads from a function tok_nextc(), calculates an indentation level (used in Python parsing), and handles a lot of weird special cases. One point here, true in most lexical analyzers, is the need to look ahead at characters in order to decide what token a certain character might be. If you are string scanning, lookahead is very simple, but if you are working with file I/O it can get complicated.
do {
c = tok_nextc(tok);
} while (c == ' ' || c == '\t' || c == '\014');
...
if (c == '#') { /* skip comment code */... }
if (c == EOF) { /* return ENDMARKER or ERRORTOKEN */ }
if (isalpha(c) || c == '_') { /* identifier */ }
if (c == '\n') { /* newline */ }
if (c == '.') { /* period, or number starting with period? */ }
if (isdigit(c)) { /* number */ }
if (c == '\'' || c == '"') { /* string */ }
if (c == '\\') { /* line continuation */ }
...
/* Check for two-character token */
{
int c2 = tok_nextc(tok);
int token = PyToken_TwoChars(c, c2);
if (token != OP) {
int c3 = tok_nextc(tok);
int token3 = PyToken_ThreeChars(c, c2, c3);
if (token3 != OP) {
token = token3;
} else {
tok_backup(tok, c3);
}
*p_start = tok->start;
*p_end = tok->cur;
return token;
}
tok_backup(tok, c2);
}
...
return PyToken_OneChar(c);
Cute code:
int
PyToken_OneChar(int c)
{
switch (c) {
case '(': return LPAR;
case ')': return RPAR;
...
default: return OP;
}
}
int
PyToken_TwoChars(int c1, int c2)
{
switch (c1) {
case '=':
switch (c2) {
case '=': return EQEQUAL;
}
break;
...
return OP;
}
int
PyToken_ThreeChars(int c1, int c2, int c3)
{
switch (c1) {
case '<':
switch (c2) {
case '<':
switch (c3) {
case '=':
return LEFTSHIFTEQUAL;
}
break;
}
break;
...
return OP;
}
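The lookahead-and-backup pattern used by tok_get() can be sketched with nothing but standard I/O, since ungetc() guarantees one character of pushback. The code below is an illustration of that pattern, not Python's tokenizer: read_op(), next_is() and the OP_ codes are hypothetical, and only a handful of operators are recognized.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical operator codes for the sketch. */
enum { OP_ERR = 0, OP_ASSIGN, OP_EQEQ, OP_LT, OP_LSHIFT, OP_LSHIFTEQ };

/* Peek at the next character: consume it if it is `want`, otherwise push it
 * back so that the next read sees it again. */
static int next_is(FILE *fp, int want) {
    int c = getc(fp);
    if (c == want) return 1;
    if (c != EOF) ungetc(c, fp);
    return 0;
}

/* Read the longest operator among =, ==, <, << and <<=. Anything else is
 * left unread and reported as OP_ERR. */
int read_op(FILE *fp) {
    assert(fp != NULL);
    int c = getc(fp);
    switch (c) {
    case '=':
        return next_is(fp, '=') ? OP_EQEQ : OP_ASSIGN;
    case '<':
        if (!next_is(fp, '<')) return OP_LT;
        return next_is(fp, '=') ? OP_LSHIFTEQ : OP_LSHIFT;
    default:
        if (c != EOF) ungetc(c, fp);
        return OP_ERR;
    }
}
```

At most one character is ever pushed back at a time, which is all that standard C promises for ungetc(). A tokenizer that needs deeper lookahead has to manage its own buffer, which is roughly where the complication mentioned above comes from.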
lecture #5 began here
The start symbol for the grammar is named program, and the
semantic action code fragment for this nonterminal calls the rest of the
compiler (semantic analysis and code generation) directly on the root of
the syntax tree, rather than storing it in a global variable for the
main() procedure to examine.
program : decls EOFX { Progend($1);} ;
Many context free grammar rules are recursive, with an empty production to
terminate the recursion. The rule for declarations is typical:
decls : { $$ := EmptyNode }
| decls decl {
if yynerrs = 0 then iwrites(&errout,".")
$$ := node("decls", $1, $2)
} ;
The "semantic action" (code fragment) for every production rule builds a
syntax tree node and assigns it to $$ for the nonterminal left-hand side of
the rule.
Another common grammar pattern is a production rule that has many different alternatives, such as the one for individual declarations:
decl : record
| proc
| global
| link
| package
| import
| invocable
| cl
;
For such "unary" productions, the child's syntax tree node suffices for the parent; no
new tree node is needed.
Some nonterminals mostly correspond to a specific sequence of terminals, as is the case for package references:
packageref : IDENT COLONCOLON IDENT { $$ := node("packageref", $1,$2,$3) }
| COLONCOLON IDENT { $$ := node("packageref", $1,$2) }
;
The lexical analyzer has already constructed a valid "leaf" for each terminal symbol, so if a production rule has only one terminal symbol in it, for a syntax tree we can simply use that leaf as the node for the nonterminal (for a parse tree, we would need to allocate an extra unary internal node):
lnkfile : IDENT ;
| STRINGLIT ;
The expressions (which comprise about half of the grammar) use a separate nonterminal for each level of precedence instead of YACC's tricks for resolving precedence. This may be up to around 20 levels of nonterminals. A typical rule looks like:
expr6 : expr7 ;
| expr6 PLUS expr7 { $$ := node("Bplus", $1,$2,$3);} ;
| expr6 DIFF expr7 { $$ := node("Bdiff", $1,$2,$3);} ;
| expr6 UNION expr7 { $$ := node("Bunion", $1,$2,$3);} ;
| expr6 MINUS expr7 { $$ := node("Bminus", $1,$2,$3);} ;
The "B" stands for "binary", to distinguish these operators from their unary brethren.
The 20 levels of nonterminals approach is inherited from Icon
and probably makes the parser way larger than it has to be.
Potential syntax homework!
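One alternative to a separate nonterminal per precedence level, outside of YACC, is a single recursive routine driven by a table of operator precedences (often called precedence climbing). The sketch below is a hand-written evaluator for a toy arithmetic language, offered only to show the shape of the technique. It is not Unicon's grammar, the operators and precedence numbers are made up, and it computes values instead of building syntax tree nodes.

```c
#include <assert.h>
#include <stdlib.h>

static const char *cur;                 /* next unread input character */

static void skip_ws(void) { while (*cur == ' ') cur++; }

/* Precedence table: 0 means "not a binary operator". */
static int prec(char c) {
    switch (c) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static long parse_expr(int min_prec);

/* A primary is an integer literal or a parenthesized expression. */
static long parse_primary(void) {
    skip_ws();
    if (*cur == '(') {
        cur++;
        long v = parse_expr(1);
        skip_ws();
        if (*cur == ')') cur++;
        return v;
    }
    char *end;
    long v = strtol(cur, &end, 10);
    cur = end;
    return v;
}

/* One routine handles every precedence level: it keeps absorbing operators
 * whose precedence is at least min_prec. */
static long parse_expr(int min_prec) {
    long lhs = parse_primary();
    for (;;) {
        skip_ws();
        char op = *cur;
        int p = prec(op);
        if (p == 0 || p < min_prec) return lhs;
        cur++;
        long rhs = parse_expr(p + 1);   /* p + 1 gives left associativity */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs = (rhs != 0) ? lhs / rhs : 0; break; /* sketch: x/0 is 0 */
        }
    }
}

long evaluate(const char *text) {
    assert(text != NULL);
    cur = text;
    return parse_expr(1);
}
```

Adding a precedence level here means adding a row to prec(), not another nonterminal. Within YACC itself, the comparable mechanism is the precedence declarations that the Unicon grammar chooses not to use.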
lecture #6 began here
Icon employed a relatively clever approach to doing syntax error messages with YACC -- the parse state at the time of error was enough to do fairly good diagnoses. But, every time the grammar changed, the parse state numbers could change wildly. For Unicon I developed the Merr tool, which associates parse error example fragments with the corresponding diagnostic error message, and detects/infers the parse state for you, reducing the maintenance problem when changing the grammar. Merr also considers the current input token in deciding what error message to emit, making it fundamentally more precise than Icon's approach.
proc : prochead SEMICOL locals initial procbody END {
Proc1($1,$2,$3,$4,$5,$6);
} ;
The definitions of macros such as Proc1 are in icont/tgrammar.c and the
analogous cgrammar.c file. For the compilers, they generally do the typical
syntax tree construction, as in
#define Proc1(x1,x2,x3,x4,x5,x6) $$ = tree6(N_Proc,x1,x1,x4,x5,x6)
In icont, there are many exceptions to this general case, as many kinds of work seem to be done during parsing. The macros do not all assign to $$, especially for unary nodes with a leaf/token on the righthand side, where the leaf can be propagated up to the parent, as for IDENT.
For all its generality, the Icon grammar hardwires processing at a global declaration level. It seems to require code generation on a per-procedure basis, since the rule for declarations has no macro for combining them:
decls : ;
| decls decl ;
This seems to make sense for icont but not for iconc, which does substantial analysis on the entire program; iconc is most likely doing its whole program analysis using the global symbol table in which all declarations are stored. In any case there should probably be a macro for decls : decls decl.
For another thing, the Icon grammar used repeatedly a technique which I have scrupulously avoided, putting semantic actions in the middle of a rule.
prochead: PROCEDURE IDENT {Prochead1($1,$2);} LPAREN arglist RPAREN {
Prochead2($1,$2,$3,$4,$5,$6);
} ;
The reason to avoid this is that YACC implements a mid-rule action as a new
nonterminal with an epsilon production, and adding new epsilon productions
frequently introduces reduce-reduce conflicts, which can be a major problem.
When is it safe to use this technique? When there is only one possible parse
at the location of the epsilon. Why do it? Because (in this example) you might
want to create some data structures for later rules to populate.
A more urgent tweak is to replace calls to yyerror() to give the Icon compilers more control over their syntax error messages. A regular call yyerror("syntax error") is replaced by a call that passes the current input token and parse state in (yyerror(yychar, yylval, yy_state)). Other messages (mainly stack overflow) are rerouted as internal parser errors rather than programmer errors.
To get rid of pscript, the best solution would be to switch icont/iconc over to using the Merr syntax error generator tool.
rttgram.y stores three different kinds of things on the YACC value stack: token pointers, treenode pointers, and long integers. It uses many different syntax tree node constructors, with names indicating how many children. node0..node5 all take an int label, a token pointer, and a number of node *'s. The token pointer is a lexical "anchor" for error reporting purposes. The use of 6 functions is avoiding C's ... variable arguments construct for no good reason.
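To make the comparison concrete, here is roughly what a single variadic constructor could look like in place of node0..node5. This is a sketch, not rtt's code: struct node, mknode() and free_tree() are hypothetical, and the token "anchor" argument is left out to keep it short.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

/* Hypothetical tree node with a variable number of children. */
struct node {
    int label;
    int nkids;
    struct node *kids[];          /* C99 flexible array member */
};

/* Build a node with `nkids` children, passed as struct node * arguments. */
struct node *mknode(int label, int nkids, ...) {
    assert(nkids >= 0);
    struct node *n = malloc(sizeof *n + (size_t)nkids * sizeof n->kids[0]);
    if (n == NULL) return NULL;
    n->label = label;
    n->nkids = nkids;
    va_list ap;
    va_start(ap, nkids);
    for (int i = 0; i < nkids; i++)
        n->kids[i] = va_arg(ap, struct node *);
    va_end(ap);
    return n;
}

/* Free a node and, recursively, all of its children. */
void free_tree(struct node *n) {
    if (n == NULL) return;
    for (int i = 0; i < n->nkids; i++)
        free_tree(n->kids[i]);
    free(n);
}
```

A call such as mknode(3, 2, a, b) replaces node2(...). The trade-off is that the compiler cannot check the number or types of the variadic arguments (a bare NULL child would need a cast to struct node *), which may be one argument in favor of fixed-arity constructors.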
There is something to be said for a system that allows an entire large language's grammar to be specified in 107 lines of code. However, the parser generator has no "semantic action" flexibility, so modifying the parser or adapting it for other tools is a painful exploration of C code that lives nowhere near the grammar rules, and it is not easy to customize behavior on a per-rule basis as is done heavily in Unicon/Icon/GCC.
lecture #7 began here
The external public interface of the preprocessor is line-oriented, consisting of a generator preprocessor(filename, predefinedsyms) which suspends each line of the output, one after another. Its invocation from the main() procedure looks like:
yyin := ""
every yyin ||:= preprocessor(fName, uni_predefs) do yyin ||:= "\n"
Since the preprocessor outputs line-by-line, there is a mismatch between it and the lexical analyzer's big-inhale model. The preprocessor could be modified to fit better with the lexical analyzer or vice versa.
The preprocessor function takes the filename to read from, along with a table of predefined symbols which allows the preprocessor to respond to lines like
$ifdef _SQL
based on what libraries are available and how Unicon was built on a given platform.
The preprocessor() function itself starts each call off with initializations:
static nonpunctuation
initial {
nonpunctuation := &letters ++ &digits ++ ' \t\f\r'
}
preproc_new(fname,predefined_syms)
The initialization code opens fname, creates empty stacks to keep track of nested
$ifdef's and $include's, initializes counters to 0 and so forth.
The preprocessor is line-oriented. For each line, it looks for a preprocessor directive, and if it does not find one, it just scans for symbols to replace and returns the line. The main loop looks like
while line := preproc_read() do line ? {
preproc_space() # eat whitespace
if (="#" & match("line")) | (="$" & any(nonpunctuation)) then {
suspend preproc_scan_directive()
}
else {
&pos := 1
suspend preproc_scan_text()
}
}
The procedures preproc_scan_directive() and preproc_scan_text() work on
special and ordinary lines, respectively. The line is not a parameter
because it is held in the current string scanning environment. The
preproc_scan_directive() procedure starts by discarding whitespace and identifying
the first word on the line (which must be a valid preprocessor directive).
A case expression handles the various directives (define, undef, ifdef, etc.).
Defined symbols are stored in a table. $ifdef and $ifndef are handled using
a global variable preproc_if_state to track the boolean conditions.
A count of $ifdef's is maintained, in order to handle matching $endif's.
Include files are handled using a stack, but an additional set of filenames is kept to prevent infinite recursion when files include each other. When a new include directive is encountered it is checked against the preproc_include_set and if OK, it is opened. The including file (and its associated name, line, etc) are pushed onto a list named preproc_file_stack. It is possible to run out of open files under this model, although this is not easy under modern operating systems.
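The include bookkeeping described above can be sketched in C as a small stack of active file names. This is a simplification under stated assumptions: the Unicon preprocessor keeps a separate preproc_include_set next to preproc_file_stack, whereas here the stack itself doubles as the set of files currently being read; no files are opened; and push_include(), pop_include() and MAX_DEPTH are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 16                    /* arbitrary nesting limit */

static const char *file_stack[MAX_DEPTH];
static int depth = 0;

/* Is `name` already being read somewhere on the stack? */
static int is_active(const char *name) {
    for (int i = 0; i < depth; i++)
        if (strcmp(file_stack[i], name) == 0) return 1;
    return 0;
}

/* Enter an include file. Returns 0, without pushing, if the file is already
 * active (a circular include) or the nesting is too deep. */
int push_include(const char *name) {
    assert(name != NULL);
    if (depth == MAX_DEPTH || is_active(name)) return 0;
    file_stack[depth++] = name;
    return 1;
}

/* Leave the current file at end of file. Returns the name of the file being
 * resumed, or NULL when no file is left. */
const char *pop_include(void) {
    if (depth == 0) return NULL;
    depth--;
    return depth > 0 ? file_stack[depth - 1] : NULL;
}
```

With main.icn including a.icn including b.icn, an attempt by b.icn to include a.icn is refused, while b.icn may be included again later once it has been popped.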
Include files are searched on an include file path, consisting of a list of directories given on an optional environment variable (LPATH) followed by a list of standard directories. The standard directories are expected to be found relative to the location of the virtual machine binaries.
The procedure preproc_scan_text has the relatively simple job of replacing any symbols by their definitions within an ordinary source line. Since macros do not have parameters, it is vastly simpler than in a C preprocessor. The main challenges are to avoid macro substitutions when a symbol is in a comment or within quotes (string or cset literals). An additional issue is to handle multiline string literals, which occur in Icon when a string literal is not closed on a line, and instead the line ends with an underscore indicating that it is continued on the next line. Skipping over quoted text sounds simple, but is trickier than it looks. Escape characters mean you can't just look for the closing quote without considering what comes before it, and you can't just look at the preceding character since it might have been escaped, as in "\\". The code looks similar to:
repeat {
while tab(upto('"\\')) do {
case move(1) of {
"\\": move(1)
default: {
break break
}
}
}
# ...
if not match("_",,-1) then
break
&subject := preproc_read() | fail
# ...
}
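The escape-handling problem is the same in any language. A C sketch of just the "find the closing quote" step is shown below for comparison; skip_quoted() is a hypothetical helper, and it works on a single line, leaving the underscore continuation case to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* s[start] must be an opening quote (" or '). Returns the index just past
 * the matching closing quote, or -1 if the literal is not closed in s. */
long skip_quoted(const char *s, size_t start) {
    char quote = s[start];
    assert(quote == '"' || quote == '\'');
    size_t i = start + 1;
    while (s[i] != '\0') {
        if (s[i] == '\\') {
            /* Skip the backslash and whatever it escapes, so that an escaped
             * quote, or the second half of "\\", is never examined. */
            if (s[i + 1] == '\0') return -1;
            i += 2;
        } else if (s[i] == quote) {
            return (long)(i + 1);
        } else {
            i++;
        }
    }
    return -1;
}
```

Consuming the escaped character together with its backslash is what makes a literal such as "\\" come out right: the second backslash is skipped as data, so the quote after it is seen as the terminator.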
The code in preproc_read() for reading a line does a regular Icon read(); end of file causes the preprocessor file_stack to be popped for the previous file's information. Performance has not been perceived as a significant problem, but it would be interesting to convert preproc_read() to use a big-inhale model to see if any statistical difference could be observed. When an include is encountered under a big-inhale, the saved state would contain the string of remaining file contents, instead of the open file value.
typedef struct fstruct { /* input file structure */
struct fstruct *prev; /* previous file */
char *fname; /* file name */
long lno; /* line number */
FILE *fp; /* stdio file pointer */
int m4flag; /* nz if preprocessed by m4 */
int ifdepth; /* $if nesting depth when opened */
} infile;
Compared with Unicon, there is extra code for handling standard in ("-") and
a command-line option to pipe the input through the m4 preprocessor before
the Icon preprocessor works on it! This might be overengineering a bit, and
it is a portability problem (works only on UNIX).
The list of directives is given as an array whose elements include the name (used in an old-fashioned linear search) and a function pointer for handling the various directives. The array of function pointers helps organize the code but one can't help thinking a binary search or a hash table might be good. But the search is performed only once per preprocessor directive encountered, directives are rare in most programs, and there are only 10 directives to search through.
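The name-plus-function-pointer array might look something like this in C. It is a sketch of the organization described, not the code in ipp.c: the handlers are stubs that return made-up codes, only four of the directives are listed, and find_directive() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*directive_fn)(const char *args);

/* Stub handlers; each returns a distinct code so a caller can tell which
 * one ran. Real handlers would act on the rest of the line. */
static int do_define(const char *args) { (void)args; return 1; }
static int do_undef(const char *args)  { (void)args; return 2; }
static int do_ifdef(const char *args)  { (void)args; return 3; }
static int do_endif(const char *args)  { (void)args; return 4; }

struct directive {
    const char *name;
    directive_fn handler;
};

static const struct directive directives[] = {
    { "define", do_define },
    { "undef",  do_undef  },
    { "ifdef",  do_ifdef  },
    { "endif",  do_endif  },
};

/* Old-fashioned linear search; returns NULL for an unknown directive. */
directive_fn find_directive(const char *name) {
    assert(name != NULL);
    size_t n = sizeof directives / sizeof directives[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(directives[i].name, name) == 0)
            return directives[i].handler;
    return NULL;
}
```

With a table this small, the linear scan costs a handful of strcmp() calls per directive line, which is consistent with the point above that a hash table would buy very little.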
The C preprocessor interface is character-oriented, not line-oriented. This allows function ppch() to substitute for whatever character fetching function the lexical analyzer was using prior to the preprocessor's existence. ppch() has some semi-complicated buffering and is a gigantic, messy function (150 lines).
Symbol definitions are stored in yet another implementation of C hash tables. There are several in the compiler, each a little different but mostly redundant.
Since it does a similar job as ipp.c only more, there are a lot of similar features in rtt's preprocessor. As far as I know, no attempt was made to share code. ANSI C has sticky, detailed rules for how macro parameters are applied, which are better than the pre-ANSI C language whose preprocessor semantics were not entirely well-defined. Besides the complex parameterized macros, a C preprocessor has to have a mini-expression evaluation mechanism to handle complex boolean conditional expressions such as
#if (THIS || THAT) && (defined(OTHER) || (ALT1 && ALT2))
This almost amounts to the whole C expression evaluation rules, all over again in the preprocessor (preproc/evaluate.c).
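To give a feel for what such an evaluator involves, here is a deliberately tiny recursive-descent sketch. It is not the code in preproc/evaluate.c. It assumes macro names and defined(...) have already been replaced by integer literals, it handles only !, &&, || and parentheses, and eval_condition() and its helpers are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

static const char *p;                   /* next unread character */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static int parse_or(void);

/* unary : '!' unary | '(' or ')' | integer */
static int parse_unary(void) {
    skip_ws();
    if (*p == '!') { p++; return !parse_unary(); }
    if (*p == '(') {
        p++;
        int v = parse_or();
        skip_ws();
        if (*p == ')') p++;
        return v;
    }
    char *end;
    long v = strtol(p, &end, 10);
    p = end;
    return v != 0;
}

/* and : unary { '&&' unary } */
static int parse_and(void) {
    int v = parse_unary();
    for (;;) {
        skip_ws();
        if (p[0] != '&' || p[1] != '&') return v;
        p += 2;
        int rhs = parse_unary();        /* always parsed, to consume input */
        v = v && rhs;
    }
}

/* or : and { '||' and }, so && binds tighter than || */
static int parse_or(void) {
    int v = parse_and();
    for (;;) {
        skip_ws();
        if (p[0] != '|' || p[1] != '|') return v;
        p += 2;
        int rhs = parse_and();
        v = v || rhs;
    }
}

int eval_condition(const char *expr) {
    assert(expr != NULL);
    p = expr;
    return parse_or();
}
```

With THIS=1, THAT=0, OTHER undefined, ALT1=1 and ALT2=1, the example condition becomes "(1 || 0) && (0 || (1 && 1))" and evaluates to 1. A real preprocessor also needs arithmetic, comparisons, character constants and so on, which is why the evaluator ends up close to full C expression evaluation.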
Because RTT is written in C and Icon runs on many operating systems, there is a fair amount of conditional code in the preprocessor, especially in the handling of the search through standard system directories for include files. UNIX has certain conventions, but many platforms have multiple C compilers installed, and some platforms (anyone use MVS?) have complicated naming for system include's. Some of this #ifdef code probably needs to be removed as it refers to dead compilers on ancient platforms.
lecture #8 began here
Ultimately, semantics means meaning, and the true "meaning" of a program isn't just a property of the source code, but of the semantics of the language and of the underlying platform (the CPU or virtual machine, the runtime system, and the operating system). For this reason we will be studying semantics long after we move beyond syntax trees and the information the compiler is able to add to them. Similarly, the analyses performed by the optimization phases of a compiler revolve around understanding the program's semantics well enough to handle special cases while preserving correct behavior.
In conventional YACC, a %union declaration is necessary to handle the varying types of objects on the value stack including the type used for syntax tree nodes, but iyacc has no need of this awkward mechanism: the value stack like all structure types can hold any type of value in each slot. Similarly, tree nodes can hold children of any type, potentially eliminating any awkwardness of mixing tokens and internal nodes. Of course, you do still have to check what kind of value you are working with.
record treenode(label, children)
holds one node worth of information. For convenience, a procedure node(label, kids[]) takes an arbitrary number of parameters and constructs the list of children for you. Leaves have a null children field.
Earlier we saw that the start symbol of the Unicon grammar had a semantic action that called a procedure Progend(). We will cover most of that procedure next week since it is all about object-orientation, but at the end of Progend(), a call to yyprint() performs the tree traversal for code generation. A classic tree traversal pattern would look like:
procedure traverse(node)
if node is an internal node {
every child := ! node.children do traverse(child)
generate code for this internal node (postfix)
}
else
generate code for this leaf
end
The code generator traversal yyprint() is a lot more complicated than that,
but fits the general pattern. The main work done at various nodes is to
write some text to the output file, yyout. Most ordinary internal nodes are
of type treenode as described above. But because there are several kinds of
internal nodes and several kinds of leaves, the "if node is an internal node"
is implemented as a case expression. Besides a regular treenode, the other
kinds of internal nodes are objects of type declaration, class, and argument
list. For regular treenodes, another case expression on the node's label
field is used to determine
what kind of code to generate, if any, besides visiting children and
generating their code.
The default behavior for an internal node is to just visit the children, generating their code. For ordinary syntax constructs (if, while, etc.) this works great and a copy of the code is written out, token by token. But several exceptions occur, mainly for the pieces of Unicon syntax that extend Icon's repertoire. For example, packages and imports are not in Icon and require special treatment.
procedure yyprint(node)
static lasttok
case type(node) of {
"treenode" : {
case node.label of {
"package": { } # handled by semantic analysis
"import": { print_imports(node.children[2]) }
# implement packages via name mangling
"packageref": {
if *node.children = 2 then
yyprint(node.children[2]) # ::ident
else { # ident :: ident
yyprint(node.children[1])
writes(yyout, "__")
outcol +:= ((* writes(yyout, node.children[3].s)) + 2)
}
}
New syntax constructs such as procedure parameter defaults and type
restrictions, and variable initializers, are other examples where
the default traversal would output things illegal in Icon.
They are implemented by skipping some of
the children (assignment and value) in the regular pass, and
adding extra code elsewhere, discussed below.
"varlist2"|"stalist2": { yyprint(node.children[1]) }
"varlist4"|"stalist4": {
yyprint(node.children[1])
yyprint(node.children[2])
yyprint(node.children[3])
}
Much of this special logic is orchestrated by the code for traversing a
procedure; it can visit its arguments and variable declarations and apply
special rules to them.
"proc": {
yyprint(node.children[1])
every yyprint(node.children[2 to 3])
if exists_statlists(node.children[3]) then {
ini := node.children[4]
yyprint("\ninitial {")
if ini ~=== EmptyNode then { # append into existing initial
yyprint(ini.children[2])
yyprint(";\n")
}
yystalists(node.children[3])
yyprint("\n}\n")
}
else
every yyprint(node.children[4])
(node.children[1].fields).coercions()
yyvarlists(node.children[3])
yyprint(node.children[5])
yyprint(node.children[6])
}
The default behavior of visiting one's children is very simple, as is the
handling of other kinds of internal nodes, which are objects.
For the objects, a method Write() is invoked.
"error": fail
default:
every yyprint(!node.children)
}
"declaration__state" | "Class__state" | "argList__state":
node.Write(yyout)
The outer case expression of yyprint() continues with various kinds of leaf
(token) nodes. These mainly know how to write their lexemes out. But, a
lot of effort is made to try to keep line and column number information
consistent. Variables outline and outcol are maintained as each token is
written out. Integers and string literals found in the syntax tree are
written out as themselves. Since they have no attached lexical attributes,
they are a bit suspect in terms of maintaining debugging consistency.
It turns out the reason they occur at all, and the reason they have no
source lexical attributes, is that artificial syntax subtrees
are generated to handle certain object-oriented constructs, and within
those subtrees strings and integers may be placed, which do not correspond
to anywhere in the source code.
"integer": {
writes(yyout, node); outcol +:= *string(node)
}
"string": {
node ? {
while writes(yyout, tab(find("\n")+1)) do {
outline+:=1; outcol:=1;
}
node := tab(0)
}
writes(yyout, node); outcol +:= *node
}
"Normally", tokens are written out at exactly the line and column they
appear at in the source code. But a myriad of constructs may bump them
around. If the output falls behind (in lines, or columns) extra whitespace
can be inserted to stay in sync. If output gets ahead by lines, a #line
directive can back it up, but if output gets ahead by columns, there is
nothing much one can do, except make sure subsequent tokens don't
accidentally get attached/concatenated onto earlier tokens. This occurs,
for example, when the output code for an object-oriented construct in an
expression is longer than the source expression, perhaps due to name mangling.
Specific token combinations are checked, but the list here may be incomplete
(possible BUG!).
For source tokens, not only might the line and column change, the
filename could be different as well.
"token": {
if outfilename ~== node.filename | outline > node.line then {
write(yyout,"\n#line ", node.line-1," \"", node.filename,"\"")
outline := node.line
outcol := 1
outfilename := node.filename
}
while outline < node.line do {
write(yyout); outline +:= 1; outcol := 1
}
if outcol >= node.column then {
# force space between idents and reserved words, and other
# deadly combinations (need to add some more)
if ((\lasttok).tok = (IDENT|INTLIT|REALLIT) & reswords[node.s][2]~=IDENT)|
(((\lasttok).tok = NMLT) & (node.tok = MINUS)) |
((\lasttok).tok = node.tok = PLUS) |
((\lasttok).tok = node.tok = MINUS) |
((reswords[(\lasttok).s][2]~=IDENT) & (node.tok=(IDENT|INTLIT|REALLIT)))|
((reswords[(\lasttok).s][2]~=IDENT) & (reswords[node.s][2]~=IDENT))
then
writes(yyout, " ")
}
else
while outcol < node.column do { writes(yyout, " "); outcol +:= 1 }
Most tokens' lexemes are finally written out by writing node.s:
writes(yyout, node.s)
outcol +:= *node.s
lasttok := node
}
"null": { }
default: write("its a ", type(node))
}
end
lecture #9 began here
The Unicon OOP facilities were originally prototyped as a semester class project in a "special topics" graduate course. Writing the prototype in a very high-level language like Icon, and developing it as a preprocessor with name mangling, allowed the initial class mechanism to be developed in a single evening, and a fairly full, usable system with working inheritance to be developed in the first weekend. By the end of the semester, the system was robust enough to write it in itself, and it was released to the public shortly afterwards as a package for Icon called "Idol". Many many improvements were made after this point, often at the suggestion of users.
An initial design goal was to make the absolute smallest additions to the language that were necessary to support object-orientation. Classes were viewed as a version of Icon's record data type, retaining its syntax for fields (member variables), but appending a set of associated procedures. Because records have no concept of public and private, neither did classes. Another graduate student criticized this lack of privacy, and for several versions, everything was made private unless an explicit public keyword was used. But eventually support for privacy was dropped on the grounds that it added no positive capabilities and was un-Iconish. The existence of classes with hundreds of "getter" and "setter" methods was considered a direct proof that "private" was idiotic in a rapid prototyping language.
class A(x,y)
method m()
write("hello")
end
end
These five lines generate 25 lines for Icont to translate into virtual
machine code. The first two lines are line directives showing from whence
this source code originated:
#line 0 "/tmp/uni13804206"
#line 0 "a.icn"
Global declarations (including procedures) would be passed through the preprocessor pretty nearly intact, but for the class, we get a bunch of very different code. Methods are written out, with names mangled to a classname_methodname format.
procedure A_m(self)
#line 2 "a.icn"
write("hello");
end
Two record types are defined, one for the class instances and one for the
"methods vector", or "operation record". The methods vector is instantiated
exactly once in a global variable in classname__oprec format.
record A__state(__s,__m,x,y)
record A__methods(m)
global A__oprec
The default constructor for a class takes fields as parameters and uses them directly for initialization purposes. The first time it is called, a methods vector is created. Instances are given a pointer to themselves in an __s field (mainly for historical reasons) and to the methods vector in an __m field. Current NMSU grad student Sumant Tambe did an independent study project to get rid of __s and __m with partial success, but his work is not finished or robust enough to be enabled by default.
procedure A(x,y)
local self,clone
initial {
if /A__oprec then Ainitialize()
}
self := A__state(&null,A__oprec,x,y)
self.__s := self
return self
end
procedure Ainitialize()
initial A__oprec := A__methods(A_m)
end
# Build local_vars from the params and local var expressions.
local_vars := set()
extract_identifiers(node.children[1].fields, local_vars)
extract_identifiers(node.children[3], local_vars)
Eventually, every identifier in every expression is checked against local_vars, and if not found there, against the class variables stored in a variable self_vars:
self_vars := set()
every insert(self_vars, c.foreachmethod().name)
every insert(self_vars, c.foreachfield())
every insert(self_vars, (!c.ifields).ident)
every insert(self_vars, (!c.imethods).ident)
For an IDENT node, the tests boil down to:
if node.tok = IDENT then {
if not member(\local_vars, node.s) then {
if member(\self_vars, node.s) then
node.s := "self." || node.s
else
node.s := mangle_sym(node.s)
}
}
Undeclared locals and globals are mangled to include the current
package name if there is one.
class subclass : super1 : super2 : ... ( ...fields... )
The semantics of inheritance, and particularly of multiple inheritance, are interesting in Unicon; the implementation is relatively simple. An example of inheritance is given by class Class, from uni/unicon/idol.icn
class declaration(name,fields,tag,lptoken,rptoken)
...
end
...
class Class : declaration (supers, methods, text, imethods, ifields, glob, linkfile, dir, unmangled_name, supers_node)
A unique perspective on inheritance in Unicon comes from the actual acquisition of inherited data fields and methods by the subclass. Some object-oriented languages do this inheritance "by aggregation", creating a copy of the superclass in the subclass. This is fine, but it makes "overriding" an anomaly, when overriding the parent with new/different behavior is entirely routine. Unicon instead inherits by the child looking for things in the parent (and the parent's parent, etc.) that it doesn't already have. In the above example, class Class effectively appends 5 fields from class declaration onto the end of its field list. The generated code for instances looks like
record Class__state(__s,__m,
supers,methods,text,imethods,ifields,
glob,linkfile,dir,unmangled_name,supers_node,
name,fields,tag,lptoken,rptoken)
The inheritance semantics is called "closure based" because the process
of looking for things to add from parent superclasses iterates until no
new information can be added, after which the subclass is said to be closed
on its parents. Other forms of closure appear frequently in CS.
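The "iterate until nothing new can be added" idea can be sketched in a few lines of C. This is only an illustration of closure as a fixed point, not a translation of the Unicon translator's code; struct cls, inherit_fields(), and close_all() are names invented for the example, and the fixed array sizes are arbitrary.

```c
#include <string.h>

#define MAXF 32   /* arbitrary limits for the sketch */
#define MAXS 4

struct cls {
    const char *name;
    const char *fields[MAXF];       /* own fields first, inherited appended */
    int nfields;
    const struct cls *supers[MAXS]; /* direct superclasses */
    int nsupers;
};

static int has_field(const struct cls *c, const char *f)
{
    int i;
    for (i = 0; i < c->nfields; i++)
        if (strcmp(c->fields[i], f) == 0)
            return 1;
    return 0;
}

/* Append each superclass field the class lacks; return how many were added.
 * A field the child already has is skipped, which is how overriding works. */
int inherit_fields(struct cls *c)
{
    int added = 0, i, j;
    for (i = 0; i < c->nsupers; i++) {
        const struct cls *s = c->supers[i];
        for (j = 0; j < s->nfields; j++)
            if (!has_field(c, s->fields[j]) && c->nfields < MAXF) {
                c->fields[c->nfields++] = s->fields[j];
                added++;
            }
    }
    return added;
}

/* Repeat over all classes until a full pass adds nothing: the closure. */
void close_all(struct cls **classes, int n)
{
    int changed;
    do {
        int i;
        changed = 0;
        for (i = 0; i < n; i++)
            if (inherit_fields(classes[i]) > 0)
                changed = 1;
    } while (changed);
}
```

Because the loop only stops when a whole pass changes nothing, the order in which classes are visited does not affect which fields end up in each class, only how many passes it takes.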
lecture #10 began here
Java answer: there is no concrete multiple inheritance in Java, only abstract multiple inheritance via interfaces.
Method transitive_closure() is one of the cleaner demonstrations of why Unicon is a fun language in which to write complex algorithms. It walks through a class graph but, by the way, it is not recursive.
method transitive_closure()
count := supers.size()
while count > 0 do {
added := taque()
every sc := supers.foreach() do {
if /(super := classes.lookup(sc)) then
halt("class/transitive_closure: couldn't find superclass ",sc)
every supersuper := super.foreachsuper() do {
if / self.supers.lookup(supersuper) &
/added.lookup(supersuper) then {
added.insert(supersuper)
}
}
}
count := added.size()
every self.supers.insert(added.foreach())
}
end
Now, given what I've said about Unicon providing a depth-first inheritance
hierarchy semantics, what is wrong with this picture? The code is stable
and hasn't needed changes in several years, so I am not fishing for syntax
bugs, or claiming that there is a bug. But there is something odd. A
chocolate "peanut butter cup" is available in my office for the first
correct description of the problem.
The method resolve() within class Class finds the inherited fields and methods from the linearized list of superclasses.
#
# resolve -- primary inheritance resolution utility
#
method resolve()
#
# these are lists of [class , ident] records
#
self.imethods := []
self.ifields := []
ipublics := []
addedfields := table()
addedmethods := table()
every sc := supers.foreach() do {
if /(superclass := classes.lookup(sc)) then
halt("class/resolve: couldn't find superclass ",sc)
every superclassfield := superclass.foreachfield() do {
if /self.fields.lookup(superclassfield) &
/addedfields[superclassfield] then {
addedfields[superclassfield] := superclassfield
put ( self.ifields , classident(sc,superclassfield) )
if superclass.ispublic(superclassfield) then
put( ipublics, classident(sc,superclassfield) )
} else if \strict then {
warn("class/resolve: '",sc,"' field '",superclassfield,
"' is redeclared in subclass ",self.name)
}
}
every superclassmethod := (superclass.foreachmethod()).name() do {
if /self.methods.lookup(superclassmethod) &
/addedmethods[superclassmethod] then {
addedmethods[superclassmethod] := superclassmethod
put ( self.imethods, classident(sc,superclassmethod) )
}
}
every public := (!ipublics) do {
if public.Class == sc then
put (self.imethods, classident(sc,public.ident))
}
}
end
Unicon generates in each source directory an NDBM database (named uniclass.dir and uniclass.pag) that includes a mapping from class name to: what file the class lives in, plus, what superclasses, fields, and methods appear in that class. From these specifications, "link" declarations are generated for superclasses within subclass modules, plus the subclass can perform inheritance resolution. The code to find a class specification is given in idol.icn's fetchspec(). A key fragment looks like
if f := open(dir || "/" || env, "dr") then {
if s := fetch(f, name) then {
close(f)
return db_entry(dir, s)
}
close(f)
}
Unicon searches for "link" declarations in a particular order, given by the current directory followed by directories in an IPATH (Icode path, or perhaps Icon path) environment variable, followed by system library directories such as ipl/lib and uni/lib. This same list of directories is searched for inherited classes.
The string stored in uniclass.dir and returned from fetch() for class Class is:
idol.icn class Class : declaration(supers,methods,text,imethods,ifields,glob,linkfile,dir,unmangled_name,supers_node) ismethod isfield Read ReadBody has_initially ispublic foreachmethod foreachsuper foreachfield isvarg transitive_closure writedecl WriteSpec writemethods Write resolve end
procedure Progend(x1)
package_level_syms := set()
package_level_class_syms := set()
set_package_level_syms(x1)
scopecheck_superclass_decs(x1)
outline := 1
outcol := 1
#
# export specifications for each class
#
native := set()
every cl := classes.foreach_t() do {
cl.WriteSpec()
insert(native, cl)
}
#
# import class specifications, transitively
#
repeat {
added := 0
every super := ((classes.foreach_t()).foreachsuper() | !imports) do {
if /classes.lookup(super) then {
added := 1
readspec(super)
cl := classes.lookup(super)
if /cl then halt("can't inherit class '",super,"'")
iwrite(" inherits ", super, " from ", cl.linkfile)
writelink(cl.dir, cl.linkfile)
outline +:= 1
}
}
if added = 0 then break
}
#
# Compute the transitive closure of the superclass graph. Then
# resolve inheritance for each class, and use it to apply scoping rules.
#
every (classes.foreach_t()).transitive_closure()
every (classes.foreach_t()).resolve()
scopecheck_bodies(x1)
if \thePackage then {
every thePackage.insertsym(!package_level_syms)
}
#
# generate output
#
yyprint(x1)
write(yyout)
Besides classes and packages, Unicon adds to Icon one additional syntax construct in support of this kind of program: type checking and coercion of parameters. Parameters and return values are the points at which type errors usually occur, during an integration phase in a large project where one person's code calls another's. The type checking and coercion syntax was inspired by the type checks done by the Icon runtime system at the boundary where Icon program code calls the C code for a given function or operator.
One additional comment about types is that the lack of types in declarations for ordinary variables such as "local x" does not prevent the Icon compiler iconc from determining the exact types of well over 90% of uses at compile time using type inference. Type checking can generally be done at compile time even if variable declarations do not refer to types... as long as the type information is available across file and module boundaries.
lecture #11 began here
C++ was very large to begin with, but after its initial development, C++ greatly changed and grew almost every mechanism any programmer could want; especially, it seemed to play the doppelganger on Ada, acquiring every feature of Ada necessary to kill it and take its place. A C++ compiler is a very large and complicated undertaking.
AT&T C++ was very buggy in 1985 and it took several years for compilers to stabilize and for the language to commercialize successfully. This would have failed had it originated someplace smaller than AT&T, and it almost failed anyhow. The C++ community owes a great debt to Borland, without whom the computing world would not be what it is today. Microsoft's agenda of killing Borland forced it into the C++ business, late and apparently against its will, but many current students seem to think Microsoft invented C++ and that it is the only systems language on the only platform that matters.
One of the main object-orientation implementation features of C++ that is almost unique is the support for non-virtual methods. C++ experts may disagree but arguably this is a performance hack that adds complexity to the language for the sake of execution speed of the generated code. Basically the symbol table for each class must track virtual and non-virtual properties, and generate C-like function calls for non-virtuals while generating virtuals through a methods vector (aka virtual function table), with an extra memory reference.
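The difference between the two kinds of call is easy to see if you write by hand, in C, roughly what a compiler might generate. This is a sketch of the general technique, not the output of any particular C++ compiler; the shape example and all of its names are invented.

```c
struct shape;

/* Methods vector (virtual function table): one slot per virtual method. */
struct shape_methods {
    int (*area)(const struct shape *self);
};

/* Each instance carries a pointer to its class's methods vector. */
struct shape {
    const struct shape_methods *m;
    int w, h;
};

/* "Non-virtual" method: an ordinary C function, bound at compile time. */
static int shape_perimeter(const struct shape *s)
{
    return 2 * (s->w + s->h);
}

/* Two implementations of the "virtual" method area(). */
static int rect_area(const struct shape *s) { return s->w * s->h; }
static int tri_area(const struct shape *s)  { return s->w * s->h / 2; }

static const struct shape_methods rect_methods = { rect_area };
static const struct shape_methods tri_methods  = { tri_area };

/* Virtual call: fetch the methods vector, then the slot, then call. */
static int shape_area(const struct shape *s)
{
    return s->m->area(s);
}
```

The call through s->m->area costs the extra memory reference mentioned above, while shape_perimeter() is a plain function call that the C compiler is free to inline.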
Another feature whose implementation in C++ or Java raises additional challenges is function overloading. The symbol table must hash not on method names but on method signatures. Signatures can generally be represented by strings. When automatic type conversions come into play, multiple possible interpretations of a call require more semantic checking than in simpler languages' compilers. For example, with methods f(char c) and f(float x), a call like f(32) is ambiguous and produces a semantic error (kind of like the multiple inheritance ambiguity we saw earlier).
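As an illustration of "signatures represented by strings", the following C sketch builds a key such as f(char) from a method name and its parameter type names. The function name make_signature() and the exact string format are assumptions for the example; real compilers use their own mangling schemes.

```c
#include <stdio.h>
#include <string.h>

/* Build "name(type1,type2,...)" in buf.
 * Returns buf on success, or NULL if the signature does not fit. */
char *make_signature(char *buf, size_t size, const char *name,
                     const char *const *types, int ntypes)
{
    size_t used;
    int i, n;

    n = snprintf(buf, size, "%s(", name);
    if (n < 0 || (size_t)n >= size)
        return NULL;
    used = (size_t)n;

    for (i = 0; i < ntypes; i++) {
        n = snprintf(buf + used, size - used, "%s%s",
                     i > 0 ? "," : "", types[i]);
        if (n < 0 || (size_t)n >= size - used)
            return NULL;
        used += (size_t)n;
    }

    if (used + 2 > size)        /* room for ')' and the terminator */
        return NULL;
    buf[used++] = ')';
    buf[used] = '\0';
    return buf;
}
```

The resulting string can be used as the symbol table key in place of the bare method name, so f(char) and f(float) land in different entries. Deciding which entry a call like f(32) should match is the separate, harder semantic check described above.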
Operator overloading adds a similar challenge to the semantic analysis of operators; sometimes + is an ADD instruction and sometimes it is a function call.
The Pascal language used a virtual machine instruction set called p-code to improve the language's portability. For each machine, the compiler would generate the same instructions, but the representation details of those instructions might vary from machine to machine, e.g. little-endian versus big-endian machines. The instruction set was very simple, enabling it to be implemented by single instructions or small sequences of instructions on most typical hardware of that day.
The modern Java virtual machine has similar goals, but with the modern twist of (in theory) true machine-independent VM code. Instruction portability is solved at a tiny cost in performance, but Java programs still have portability problems: they have traded instruction compatibility problems for library version compatibility problems. Microsoft's C# and .Net CLR started out as cheap replacements for Java, but CLR quickly gained a much larger goal, that of supporting many different languages, such as Visual BASIC and C#, with good interoperability and performance.
SmallTalk, Prolog, and Icon/Unicon have Pascal-style virtual machines, with the added twist of higher-level language semantics, semantics complicated enough that they do not map easily down to underlying hardware. For such languages, a natural migration path is to develop a virtual machine that captures the semantics and allows experience and experimentation, followed (optionally, much later) by a compiler that produces native code.
The virtual machine instruction set does not go into the details of data types, the way JVM does. It is sort of like "values" and "memory" are virtual, not just "instructions" are virtual.
Some changes since the early days: itran+ilink merged to become icont. In future, these should just go ahead and jump into the VM itself (iconx).
Appendices B and C of the IC document the virtual machine instruction set. In addition, a crude Ucode Code Generation Guide is available.
| d-word |
|---|
| v-word |
The vword has either: nothing (if dword holds null type), or an integer value, or a pointer to char, or a pointer to a "block".
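A two-word descriptor along these lines can be sketched in C as follows. It is a simplified model of the layout just described, not a copy of the runtime system's header files; the type codes, the flag value, and the helper functions are made up for the example.

```c
#include <limits.h>
#include <string.h>

typedef unsigned long uword;

/* Hypothetical flag: high bit set means "not a string". For strings the
 * d-word is assumed to hold just the length. */
#define F_NQUAL (~(ULONG_MAX >> 1))

/* Hypothetical type codes, stored in the low bits of the d-word. */
enum { T_NULL = 0, T_INTEGER = 1 };

union block;                     /* heap blocks, left opaque here */

struct descrip {
    uword dword;                 /* type code and flags, or string length */
    union {
        long integr;             /* integer value */
        const char *sptr;        /* pointer to string characters */
        union block *bptr;       /* pointer to a block */
    } vword;
};

struct descrip make_null(void)
{
    struct descrip d;
    d.dword = F_NQUAL | T_NULL;
    d.vword.integr = 0;          /* v-word holds nothing useful */
    return d;
}

struct descrip make_int(long n)
{
    struct descrip d;
    d.dword = F_NQUAL | T_INTEGER;
    d.vword.integr = n;
    return d;
}

struct descrip make_str(const char *s)
{
    struct descrip d;
    d.dword = strlen(s);         /* length lives in the d-word */
    d.vword.sptr = s;
    return d;
}

int is_string(const struct descrip *d)
{
    return (d->dword & F_NQUAL) == 0;
}
```

Keeping the string length in the descriptor instead of in the string data is what lets a substring be described by a new descriptor that points into the middle of existing characters, without copying.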
lecture #12 began here
A: Read section 3.1 of the IC for Griswold's answer to this question. To the reasons given there, I would add:
The question you should be asking is: why is it easier to write a VM for a platform than a backend for a compiler which generates C code for that platform?
There is a gigantic union type, union block, to denote a value that is a block that has not been examined yet. Each type of block then has a separate struct type associated with it. Some types require multiple block types to handle their representation.
Most blocks' sizes are statically determined by their type. An exception to this rule would be for records; the sizes of different records vary, so a size in bytes is stored in the block. Note: this is stupid, the size is a property of the record type, why are we wasting 4 bytes per instance to store its size in bytes? Fixing it could easily be a homework problem.
Here's a straw-man one-word descriptor design to allow us to consider the matter. You might come up with a better one-word design; this is kind of an exercise in identifying the minimal changes.
Unicon has 25 type codes, so 5 bits would suffice for them. Four more bits are used in current descriptors for flags NVPT. On a 32-bit platform, that leaves us with 23 bits: only enough to address 8MB of address space. Is this useful? Let's see.
For integers, numbers larger than 4M or smaller than -4M would spill out into "large integer" blocks. One would have to do some statistics to tell how often integers are in between 4M and 2B.
For blocks, if we play tricks, like aligning blocks on 8 byte-boundaries, we get three bits back! 64MB of blocks, hurray. But we have introduced a bit of fragmentation where memory is wasted, and the block pointer is now computed as
BlkLoc(d) = (d & 0x7FFFFF) << 3;
This might be slower than just d.vword.bptr, which is just fetching the memory 4 bytes off from the address of d. The extra instructions we have to fetch may cancel out whatever we save in memory fetches from the smaller descriptor.
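Here is the straw-man spelled out as C bit twiddling, so the costs are visible. The field order (type code in the top 5 bits, then 4 flag bits, then a 23-bit payload) is an assumption; the notes above only fix the field widths. All function names are invented for the sketch.

```c
#include <stdint.h>

#define FLAG_BITS   4
#define VALUE_BITS  23
#define VALUE_MASK  0x7FFFFFu     /* low 23 bits */

typedef uint32_t desc1;           /* the whole descriptor is one word */

/* Pack a 5-bit type code, 4 flag bits and a 23-bit payload. */
static desc1 pack(unsigned type, unsigned flags, uint32_t value)
{
    return ((uint32_t)type << (FLAG_BITS + VALUE_BITS))
         | ((uint32_t)flags << VALUE_BITS)
         | (value & VALUE_MASK);
}

static unsigned desc_type(desc1 d)  { return d >> (FLAG_BITS + VALUE_BITS); }
static unsigned desc_flags(desc1 d) { return (d >> VALUE_BITS) & 0xFu; }

/* Blocks are 8-byte aligned, so the low three offset bits are implied
 * zeros: 23 stored bits cover 64MB of block space. */
static desc1 pack_block(unsigned type, unsigned flags, uint32_t offset)
{
    return pack(type, flags, offset >> 3);
}

static uint32_t blk_offset(desc1 d)
{
    return (d & VALUE_MASK) << 3;
}

/* Small integers are signed 23-bit values: -4M up to 4M-1. */
static int fits_small_int(long n)
{
    return n >= -(1L << 22) && n < (1L << 22);
}

static long small_int(desc1 d)
{
    uint32_t v = d & VALUE_MASK;
    /* sign-extend from bit 22 */
    return (v & 0x400000u) ? (long)v - (1L << 23) : (long)v;
}
```

Every access now costs a mask and a shift (or a sign extension) that the two-word design gets for free, which is the trade against the smaller memory footprint.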
There might be additional tricks we could play to increase the addressable block memory, like allocating a different region for each data type. This would allow types to be aligned on larger boundaries than just 8-bytes, and allow each type its own 64MB or larger region (by adding the type's region address to the pointer). But now, following these pointers is slower, there is another addition along with all that bit twiddling.
And what about strings? They are majorly special-cased for high-performance in the current implementation, and many Icon programs do heavy string processing. Do we redesign strings to use a block, so we can store the length and a pointer? If so, we've added another level of indirection to strings, and we are paying a new 12 byte block for every string. Can we fit the length and pointer into the one-word descriptor? The existing design would allow us 31 bits for string length and string pointer. We'd quickly hit unacceptable limits, such as only allowing 8MB of string space (23 bits) and strings of max length 256 (8 bits). How about storing the length at the beginning of every string like Pascal and some BASIC's? If we use all 31 bits for the char *, substring sectioning becomes a more expensive copy operation, and some string concatenation optimizations are no longer possible.
We might actually want to implement two internal types for strings: short strings (length <= 256, 8MB of them) and long strings (using a type code and an external block), mirroring the integer implementation.
One thing is for sure: one-word descriptors would add some strong limits on memory (on 32-bit machines, that is) and potentially a lot of complexity. If anyone wants to do them for a semester project, it would be a high-risk gamble with a high potential payoff, especially on 64-bit machines.
There are some special cases where assignments need special treatment. Special keyword variables have semantic rules which must be enforced, such as &subject must always be a string. These special cases are handled by trapped variables. A reference to a trapped variable allocates a block with a special typecode so that a subsequent assignment (if there is one) can enforce the semantic rules. The Icont program (and hence Unicon) is lazy about this and allocates the trapped variables just in case, even in syntactic contexts where the variable will be dereferenced rather than assigned.
lecture #13 began here
The core of the Unicon VM lives in runtime/interp.r, which has a gigantic function named interp(). This function does a fetch/decode/execute loop on VM instructions. Some instructions are implemented immediately; most invoke runtime system functions in other modules. The fetch-decode-execute loop looks like
for (;;) {
...
lastop = GetOp; /* Instruction fetch */
...
switch ((int)lastop) {
... something like 100 cases, 25 for binary operators
}
The most exciting thing about this VM loop is that, when an expression
generates a result (and might have more), the way expressions suspend
is by calling interp() recursively.
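For anyone who has not written one before, here is a complete toy fetch/decode/execute loop in C. It is a sketch of the general pattern only: the four opcodes and the function run() are invented, the stack holds plain integers rather than descriptors, and nothing here models suspension or the recursive calls to interp().

```c
/* Hypothetical opcodes for a toy stack machine. */
enum { OP_HALT, OP_PUSH, OP_ADD, OP_MUL };

#define STACK_SIZE 64

/* Run the code until OP_HALT. On success, store the top of stack in
 * *result and return 0; return -1 on a stack error or unknown opcode. */
int run(const long *code, long *result)
{
    long stack[STACK_SIZE];
    int sp = 0;                       /* number of values on the stack */
    const long *ipc = code;           /* interpreter program counter */

    for (;;) {
        long op = *ipc++;             /* instruction fetch */
        switch ((int)op) {            /* decode */
        case OP_PUSH:                 /* operand follows the opcode */
            if (sp >= STACK_SIZE) return -1;
            stack[sp++] = *ipc++;
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            if (sp < 1) return -1;
            *result = stack[sp - 1];
            return 0;
        default:
            return -1;                /* unknown instruction */
        }
    }
}
```

For example, the code sequence PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT leaves 20 on top of the stack. The real loop has the same shape, with many more cases and with most of them calling out into the runtime system.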
The RTL code is illustrated by the implementation of the size operator (unary asterisk, as in *s). Compared with C, RTL has a documentation string, modified function header syntax, an "abstract" clause, a "type_case" operation, and separate code fragments for the different types. When the type of x is unknown, this expands to a switch statement executed at runtime, but when type information is known, the correct branch is selected at compile time.
"*x - return size of string or object x."
operator{1} * size(x)
abstract {
return integer
}
type_case x of {
string: inline {
return C_integer StrLen(x);
}
list: inline {
return C_integer BlkLoc(x)->list.size;
}
table: inline {
return C_integer BlkLoc(x)->table.size;
}
set: inline {
return C_integer BlkLoc(x)->set.size;
}
cset: inline {
register word i;
i = BlkLoc(x)->cset.size;
if (i < 0)
i = cssize(&x);
return C_integer i;
}
record: inline {
return C_integer BlkLoc(x)->record.recdesc->proc.nfields;
}
coexpr: inline {
return C_integer BlkLoc(x)->coexpr.size;
}
file: inline {
int status = BlkLoc(x)->file.status;
#ifdef Dbm
if ((status & Fs_Dbm) == Fs_Dbm) {
int count = 0;
DBM *db = (DBM *)BlkLoc(x)->file.fd;
datum key = dbm_firstkey(db);
while (key.dptr != NULL) {
count++;
key = dbm_nextkey(db);
}
return C_integer count;
}
#endif /* Dbm */
#ifdef ISQL
if ((status & Fs_ODBC) == Fs_ODBC) { /* ODBC file */
struct ISQLFile *fp;
int rc;
#if (ODBCVER >= 0x0351)
SQLLEN numrows;
#else /* ODBCVER >= 0x0351 */
SQLINTEGER numrows;
#endif /* ODBCVER >= 0x0351 */
fp = (struct ISQLFile *) BlkLoc(x)->file.fd;
rc = SQLRowCount(fp->hstmt, &numrows);
return C_integer(numrows);
}
#endif /* ISQL */
runerr(1100, x); /* not ODBC file */
}
default: {
/*
* Try to convert it to a string.
*/
if !cnv:tmp_string(x) then
runerr(112, x); /* no notion of size */
inline {
return C_integer StrLen(x);
}
}
}
end
lecture #14 began here
The icont source code lives in src/icont. The code generator apparently lives in icont/tcode.c, an 1100 line source file. The rest of the t*.c files are related to the compiler front-end (itran); the l*.c files are for the linker (ilink).
Code generation to the .u1 VM code file proceeds on a per-procedure basis, when the grammar action macro Procdcl(x) calls a C function codegen(x) on a parse tree node for a procedure. codegen(t) just resets a label counter and calls traverse(t), the "real" code generator function. The start nonterminal's action macro Progend calls gout(), which writes global (.u2) information out. The .u2 is then concatenated with the .u1 to form a .u file.
Function traverse(t) traverses the syntax (sub)tree rooted at t. It is a simple function with a long switch statement with 35 or so branches for different kinds of tree nodes. The bodies of most branches call helper functions to generate specific instructions, plus call traverse() recursively on child nodes.
static int traverse(t)
register nodeptr t;
{
register int lab, n, i;
struct loopstk loopsave;
static struct loopstk loopstk[LoopDepth]; /* loop stack */
static struct loopstk *loopsp;
static struct casestk casestk[CaseDepth]; /* case stack */
static struct casestk *casesp;
static struct creatstk creatstk[CreatDepth]; /* create stack */
static struct creatstk *creatsp;
n = 1;
switch (TType(t)) {
...
Interestingly, this code has static limits on the maximum lexical depth
of major control structures, which may be nested at most 20 deep (loops),
10 deep (case expressions), and 10 deep (co-expression creations).
These static limits do not appear to be checked. This is very possibly
a cause of (rare) coredumps from icont on especially large, complicated
.icn source files, such as the 475KB uniconlex.icn generated by ulex
by one of your illustrious classmates for an earlier homework. POSSIBLE
HOMEWORK: add checks, raise or better yet eliminate static limits on loopstk,
casestk, and creatstk!
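One way to attack that possible homework is a stack that grows on demand. The sketch below is not icont's code: struct loopent and its two fields are hypothetical stand-ins for the real loop-stack entry, and growstk, push_loop(), and pop_loop() are names invented for illustration. It only shows the general shape of replacing a fixed-size array with a checked, heap-allocated one.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for icont's loop-stack entry; the real fields differ. */
struct loopent {
    int nextlab;    /* label targeted by "next" */
    int breaklab;   /* label targeted by "break" */
};

/* A stack that grows on demand instead of using a fixed LoopDepth array. */
struct growstk {
    struct loopent *base;   /* heap-allocated entries */
    size_t depth;           /* entries currently in use */
    size_t cap;             /* entries currently allocated */
};

/* Push an entry, doubling capacity when full.
 * Returns 0 on success, -1 if memory runs out (stack left unchanged). */
static int push_loop(struct growstk *s, struct loopent e)
{
    if (s->depth == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 4;
        struct loopent *p = realloc(s->base, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        s->base = p;
        s->cap = ncap;
    }
    s->base[s->depth++] = e;
    return 0;
}

/* Pop the innermost entry; popping an empty stack is a translator bug. */
static struct loopent pop_loop(struct growstk *s)
{
    assert(s->depth > 0);
    return s->base[--s->depth];
}
```

A zero-initialized struct growstk is a valid empty stack, so the three static arrays in traverse() could, under these assumptions, each become one such struct, with the -1 return turned into a proper translator error message.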
The case branches for different node types rely on macros to pick out information from the tree nodes, so the actual representation of the tree is entirely hidden from this code! The actual tree representation as well as the definitions of these macros live in src/icont/tree.h. The macros hide it, but in reality, each "child" of a tree node can be either a (long) int, a (char *) string, or a pointer to another tree node:
union field {
long n_val; /* integer-valued fields */
char *n_str; /* string-valued fields */
nodeptr n_ptr; /* subtree pointers */
};
struct node {
int n_type; /* node type */
char *n_file; /* name of file containing source program */
int n_line; /* line number in source program */
int n_col; /* column number in source program */
union field n_field[1]; /* variable-content fields */
};
Tree0(t) picks out child 0,
which had better be a tree node, Val0(t) picks out child 0, which had
better be a (long integer) "immediate value" in the tree, etc.
This allows the code for each kind of tree node to be fairly simple.
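To make the macro idea concrete, here is a minimal sketch of how such accessors could be defined over the union shown above. The macro bodies are my guess at the general shape, not a copy of src/icont/tree.h, and newnode() plus the node type numbers used with it are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node *nodeptr;

union field {
    long n_val;         /* integer-valued fields */
    char *n_str;        /* string-valued fields */
    nodeptr n_ptr;      /* subtree pointers */
};

struct node {
    int n_type;         /* node type */
    char *n_file;       /* name of file containing source program */
    int n_line;         /* line number in source program */
    int n_col;          /* column number in source program */
    union field n_field[1];     /* variable-content fields */
};

/* Assumed accessor shapes: each macro picks one union member of one child. */
#define TType(t)  ((t)->n_type)
#define Tree0(t)  ((t)->n_field[0].n_ptr)
#define Tree1(t)  ((t)->n_field[1].n_ptr)
#define Val0(t)   ((t)->n_field[0].n_val)
#define Str0(t)   ((t)->n_field[0].n_str)

/* Hypothetical allocator: one node with room for nfields children.
 * Indexing past n_field[0] is the traditional "struct hack" that the
 * one-element array declaration above relies on. */
static nodeptr newnode(int type, int nfields)
{
    assert(nfields >= 1);
    nodeptr t = calloc(1, sizeof(struct node)
                          + (size_t)(nfields - 1) * sizeof(union field));
    if (t != NULL)
        t->n_type = type;
    return t;
}
```

Nothing stops a caller from applying Val0() to a child that is really a subtree pointer; the macros hide the representation but do not check it, which is why each child "had better be" the right kind.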
The following example illustrates code generation for perhaps the most common syntax construct, binary operators. The code uses function emit() to generate individual VM instructions, and helper functions setloc() and binop() to do some of the dirty work. The overall template is: push a (null) descriptor to make space for the result, generate code for the left operand, generate code for the right operand, and generate the instruction for the operator. The pnull at the beginning may seem unnecessary; in many stack machines the protocol is: pop two operands and push the result in their place. The pnull is not an accident for Icon and Unicon; see if you can come up with a guess as to why you might not always want to throw away the operands right away and overwrite them with your result. Note that the tree node order for binary operators is "prefix" order: the operator is in child 0, followed by the operands in children 1 and 2.
case N_Augop: /* augmented assignment */
case N_Binop: /* or a binary operator */
emit("pnull");
traverse(Tree1(t));
if (TType(t) == N_Augop)
emit("dup");
traverse(Tree2(t));
setloc(t);
binop((int)Val0(Tree0(t)));
free(Tree0(t));
Working our way backwards through this code:
There are actually several variations on the emit() function: emitlab() for labels, emit(), emitl() for instructions that reference labels, emitn() for instructions with a numeric argument, and emits() for instructions with a string argument. These functions all use good-old fprintf, to a global variable codefile, and every single write is checked for failure to avoid surprises. Strangely enough the writecheck() is a function that should probably be inline or a macro.
static void emit(s)
char *s;
{
writecheck(fprintf(codefile, "\t%s\n", s));
}
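The binary operator template and the checked-write emit functions can be tried out together in miniature. This is a sketch under stated assumptions, not icont's code: struct expr, the kind enum, and gen() are hypothetical, the body of writecheck() is my guess at its job, emitn()'s output format is assumed, and the number printed after "int" is just a placeholder operand.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static FILE *codefile;          /* output stream, as in the text */

/* Guessed behavior: give up on a failed write (fprintf returns < 0). */
static void writecheck(int rc)
{
    if (rc < 0) {
        fputs("cannot write to ucode file\n", stderr);
        exit(EXIT_FAILURE);
    }
}

/* Instruction with no argument. */
static void emit(const char *s)
{
    writecheck(fprintf(codefile, "\t%s\n", s));
}

/* Instruction with a numeric argument (format assumed). */
static void emitn(const char *s, int n)
{
    writecheck(fprintf(codefile, "\t%s\t%d\n", s, n));
}

/* Hypothetical miniature tree: an integer leaf or a binary "+". */
enum kind { K_INT, K_PLUS };
struct expr {
    enum kind kind;
    int val;                    /* operand for K_INT */
    struct expr *l, *r;         /* operands for K_PLUS */
};

/* Template from the text: pnull, left operand, right operand, operator. */
static void gen(const struct expr *e)
{
    assert(e != NULL);
    if (e->kind == K_INT) {
        emitn("int", e->val);
        return;
    }
    emit("pnull");              /* reserve the result slot first */
    gen(e->l);
    gen(e->r);
    emit("plus");
}
```

With codefile pointed at an open stream, calling gen() on a K_PLUS node with two K_INT children writes four lines: pnull, the two int instructions, then plus.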
I have omitted some of the details, such as opening files, but this whole
code generator is one of the simplest "production compiler" arrangements
I have ever seen. One of the most useful possible homework #4's I could
assign would be to rewrite this VM code generator for the Unicon translator,
so it would not need to write out Icon source code and then call the
icont program which re-parses everything. If we got rid of the itran
part of icont, it would eliminate ~6K lines of C code from
our distribution. But, there are enough details that it might be too
large for a homework. However, it would make an excellent semester project.
Function ilink() in link.c performs three passes, one to obtain global information from ucode files, one to omit unreferenced procedures, and one to do the "real" code generation. We will focus on the third pass. Looping through the list of files to link is done by walking through a linked list:
lfls = llfiles;
while ((lf = getlfile(&lfls)) != 0) {
For each file, function gencode() is called. After all files have
been visited, a function gentables() generates several global tables
containing information about record types and fields, globals and
statics.
Function gencode() reads from a global open FILE named infile; it relies on helper functions such as getopc() to handle the details of reading "tokens" from the ucode file. The majority of gencode() is a giant switch statement on the opcodes of each (real and pseudo) instruction:
while ((op = getopc(&name)) != EOF) {
switch (op) {
For many opcodes, the binary instruction is very straightforward.
Sixty-three opcodes share the following switch branch:
case Op_Plus:
newline();
lemit(op, name);
break;
The opcodes that don't share this branch must have special needs or
instruction immediate in-line operands (as opposed to operands pushed as
descriptors onto the interpreter stack), such as: constants, for which the
instruction and the value (or a pointer to it) are both emitted into the
binary bytecode. The procedure call instruction and the record field
operator VM instruction similarly have an integer code operand (specifying
how many parameters, and which field, respectively) rather than a
descriptor. Some of these special cases are mentioned in later sections.
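The split between "plain" opcodes and opcodes with an in-line operand can be sketched as follows. Everything here is hypothetical: the opcode numbers, struct uinstr, and gencode_sketch() are invented for illustration, and the real gencode() reads its input from the ucode file rather than from an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode numbers; the real values live in the Icon sources. */
enum { Op_Plus = 1, Op_Minus, Op_Int, Op_Invoke };

/* One already-parsed ucode instruction (assumed representation). */
struct uinstr {
    int op;
    long operand;       /* ignored for plain opcodes */
};

/* Translate n instructions into out[].
 * Returns the number of words written, or -1 on an unknown opcode
 * or when out[] is too small. */
static long gencode_sketch(const struct uinstr *in, size_t n,
                           long *out, size_t outcap)
{
    size_t pc = 0;      /* simulated program counter, in words */

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int has_operand;

        switch (in[i].op) {
        case Op_Plus:           /* the common case: opcode only */
        case Op_Minus:
            has_operand = 0;
            break;
        case Op_Int:            /* constant: opcode plus in-line value */
        case Op_Invoke:         /* call: opcode plus argument count */
            has_operand = 1;
            break;
        default:
            return -1;
        }
        if (pc + 1 + (size_t)has_operand > outcap)
            return -1;
        out[pc++] = in[i].op;
        if (has_operand)
            out[pc++] = in[i].operand;
    }
    return (long)pc;
}
```

The real linker emits opcodes and operands at possibly different widths, as discussed next; this sketch uses one word size for both to keep the dispatch logic visible.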
The helper functions used in the common case are illustrative and worth
covering in detail. newline() is actually moving the input file past
the next line feed, i.e. to the next instruction. lemit() is the helper
function that is doing the real work, except that it is just a wrapper
(with optional debugging code) around outop(op), which is a macro that
calls intout(n) for opcodes. Instruction operands would instead be
generated by wordout(), whose argument tends to be long-sized/pointer-sized
and so might or might not be the same number of bytes as the int written
by intout().
word pc = 0; /* simulated program counter */
Similar to emit(), at the binary level the actual code is "emitted" by helper functions such as wordout(), which emits a word. This might be trickier to get consistent than it sounds, at least when one is writing for "all possible platforms". Counterintuitively, these helper functions don't actually write to files, but rather, write bytes to a (gigantic) in-memory image of the icode file (named codep), an array of bytes that is grown dynamically.
/*
* wordout(i) outputs i as a word that is used by the runtime system
* WordSize bytes must be moved from &oword[0] to &codep[0].
*/
static void wordout(oword)
word oword;
{
int i;
union {
word i;
char c[WordSize];
} u;
CodeCheck(WordSize);
u.i = oword;
for (i = 0; i < WordSize; i++)
codep[i] = u.c[i];
codep += WordSize;
pc += WordSize;
}
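The "grown dynamically" part is hidden inside CodeCheck(). A minimal sketch of what such a check might do is given below, assuming a doubling strategy; codeb, codeused, codecap, codecheck(), and wordout_sketch() are hypothetical names, the typedef of word as long is an assumption, and the real linker tracks the write position with the codep pointer instead of a byte count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef long word;              /* assumption: word is a long */
#define WordSize sizeof(word)

static char *codeb;             /* start of the in-memory icode image */
static size_t codeused;         /* bytes written so far */
static size_t codecap;          /* bytes allocated */
static word pc;                 /* simulated program counter */

/* Make sure n more bytes fit, doubling the image as needed. */
static void codecheck(size_t n)
{
    if (codeused + n <= codecap)
        return;
    size_t ncap = codecap ? codecap : 64;
    while (ncap < codeused + n)
        ncap *= 2;
    char *p = realloc(codeb, ncap);
    if (p == NULL) {
        fputs("out of memory for icode image\n", stderr);
        exit(EXIT_FAILURE);
    }
    codeb = p;
    codecap = ncap;
}

/* Append one word in the host's native byte order. */
static void wordout_sketch(word w)
{
    codecheck(WordSize);
    assert(codeused + WordSize <= codecap);
    memcpy(codeb + codeused, &w, WordSize);   /* same effect as the union loop */
    codeused += WordSize;
    pc += (word)WordSize;
}
```

Because realloc() may move the image, this sketch stores an offset rather than a raw pointer into the buffer; any code that keeps pointers into the image across a growth step would need to be adjusted.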
In a previous software engineering course, a 25K LOC student project used enough record types that a size problem was identified in the field table. Happily, other researchers in the Smalltalk community had solved a similar problem in Smalltalk. An undergraduate student named Richard Hatch experimented with several techniques and eventually came up with a simple, cheap compression technique, which is described in a UTSA technical report. In a very large program, Field Table Compression has been demonstrated to take a 464KB field table and reduce it to 24KB, without requiring decompression and at an execution cost of around 2%.
lecture #15 began here
lecture #16 began here
write("hello " x)
would print
hello Mervyn
Suggest how this feature might be implemented in the lexical analyzer of a compiler (not just for "write()", but for all expressions). What problems would occur if we added this feature to Unicon? Are there any semantic problems deep enough that we should NOT do it?
t := ["Washington": "Delaware", "Egypt": "Nile", 3.14: 3]
(This creates a table of size 3, with keys "Washington", "Egypt", and 3.14 mapped onto elements "Delaware", "Nile", and integer 3. It makes no sense.)
optionalExprs : exprs* ;
How would you handle YACC's notation for semantic actions to allow
this kind of production rule to build syntax trees?