CS 580: Compiler Construction Lecture Notes

    lecture #1 began here

    Read the Syllabus

    Sleep Hours and Office Hours

    It is my normal practice to work all night after my children are asleep and no more distractions are present, and sleep in. This is conducive to afternoon teaching, but this semester I get to have class in the morning.

    In order to make this scheduled class time more bearable (for me), I propose to hold office hours Tuesday and Thursday after class from 12-1pm in the Corbett Center "food court". You may eat with me, or just ask questions. I am always going to be eating the Mexican "Daily" Special at the food court, which has not changed at NMSU in three years (The answer to official NM State Paradox: one red and one green).

    Of course, I am also available in my office by appointment, and sometimes happy to take drop-in visits. E-mail works best though.

    A "Reality Course"

    Like those "reality TV" shows you may have heard of, this course uses a real compiler family as a teaching tool. My theory is you will learn more useful stuff from looking at a real compiler than from textbooks or toy exercises. Various principles we study from this family of compiler tools will teach you a lot about how other compilers are likely to work.

    On the other hand, if we tried to look at, say, GCC as our compiler of study, it is very unlikely that most of you would enjoy the process. Most real compilers are quite complex, and were not designed to be "read". An exception is [Fraser/Hanson 95], see below.

    For the reasons given above, in this course I propose to describe for you, and provide essential documentation on, my own open source compiler for the Unicon language, and the compilers for the Icon language upon which Unicon builds.

    Why Unicon? Why Icon?

    With this instructor, in this course, there are many obvious reasons. If you just want to do a "toy" C compiler, you should be taking CS 370! If you really must do a C compiler in 580, get "A Retargetable C Compiler" by Christopher Fraser and David Hanson, published by Addison-Wesley, and we'll adapt assignments for you.

    Introduction and Overview

    Compiler construction is a broad field. In the early days, much of the effort was on how to implement high-level constructs. Then, for a long time, the major emphasis was on improving the efficiency of generated code. These topics remain important today, but many new technologies have caused them to become more specialized.

    For example, figuring out how to implement high-level constructs has moved beyond while loops and function calls to more advanced topics such as object-orientation and concurrency. Improving the efficiency of generated code might mean special-purpose transformations that take advantage of novel features of modern hardware, or it might mean customizing a virtual machine, or mixing virtual machine code and native code.

    One big difference between this course and the first compiler course is that the first course focuses on writing a compiler from scratch, starting from the ground up with lexical and syntax analysis. In this course, we will instead study some existing compilers that are part of a language family (the Icon and Unicon languages), and your assignments and semester project will add to or improve those compilers, rather than having you write a whole compiler yourself.

    The Icon Language Family

    The compilers we will be studying are those for the Icon programming language. This language was developed as a by-product of research at the University of Arizona. The principal designer and maintainer of this language over a period of 15+ years was Ralph Griswold. "Uncle Ralph" came to the University of Arizona from AT&T Bell Labs in 1970 or so, where he was one of the principal inventors of the SNOBOL family of languages. As one of the founders of the Bell Labs software culture which spawned UNIX, C, and many other essential contributions to modern software, Ralph Griswold brought to his academic research not only brilliance, but also experience and a value system that demanded that research ideas be tested by fire and proven useful and usable by real users, not just good-looking diagrams in academic papers.

    The first two versions of the Icon language were developed in Ratfor in the late 1970's. Starting around 1980 this implementation was discarded and the language was rewritten from scratch in C on UNIX. A virtual machine was adopted and refined over a decade. Over time, many parts originally written in assembler were rewritten in C as the semantics of the language came to be better understood.

    Around 1990, another radical rewrite was produced in order to support an optimizing compiler with much higher performance than the virtual machine. The new compiler was written from scratch, but its runtime system was produced by ingeniously adapting the code from the virtual machine. At approximately the time of Ralph Griswold's retirement in the mid-1990's, the Icon language was more or less frozen; the culminating work on the subject being the 3rd edition of "The Icon Programming Language", published in 1997.

    But people in the Internet Age wanted applications to do more than they did in the 1980's. Additions to the open-source language were made by various authors. Around 2000, several such additions were merged together by a group led by Clint Jeffery under the name Unicon; the project was placed under the GPL and moved to SourceForge.net.

    The most recent major addition to the language, done at NMSU, has been a set of portable high-level 3D graphics facilities, built on top of but much easier to learn and use than OpenGL. Current and on-going work includes the addition of portable audio and video support, in order to accommodate the needs of Collaborative Virtual Environments. With such extensions, the goal is not to add access to a popular C or C++ or Java API, but rather to design higher level abstractions at the level most programmers are comfortable working, and then figure out how to provide those abstractions.

    lecture #2 began here

    Reading Assignments

    Before we dive into the compilers themselves, you should take a quick look at the language whose compiler you will study.

    Icon and Unicon in 30 Minutes

    This material is from your reading assignment. The class lecture only covered the highlights.

    Overview of the Compilers and their Source Code

    Sources of information:

    Highlights of the Unicon Source Tree

              +-bin------          executable binaries and support files
              +-config--+          configurations
              |         +-unix---+
              |         |        +-intel_linux-
              |         |        +-sun_gcc-----
              |         +-win32---
              |         |        +-gcc---------
              |         |        +-msvc--------
    /-unicon--+-src------          source code (C)
              |         +-common--
              |         +-h-------
              |         +-icont---
              |         +-iconc---
              |         +-rtt-----
              |         +-runtime-
              +-tests----          tests
              |         +-general-
              |         +-posix---
              +-unicon---          source code (Unicon)
                        +-unicon--
    

    Downloading and Installing the Unicon Source

    1. Get http://unicon.org/dist/uni.zip
    2. Unzip it
    3. Configure it (make X-Configure name=intel_linux)
    4. type "make Unicon"
    Windows: get free unzip.exe, free MinGw32 GCC, sh.exe, and make.exe; add them to path and test; then (make W-Configure-GCC) and (make Unicon)

    lecture #3 began here

    Lexical Analysis

    Lexical Analysis is traditionally covered heavily in a first compiler course that is a prerequisite for this one. We will briefly review the main concepts, and then talk about lexical analyzers in the Icon and Unicon compiler family.

    Lexical Analysis Fundamentals

    If you aren't familiar with lexical analysis in general, you may want to take CS 370. In brief, you should be familiar with the following ideas. Note that you do not need to memorize these :-) but if you do not understand them when used in sentences, it's an indicator that you haven't taken the prerequisite course.
    scanner
    a program that does lexical analysis. Also called a lexical analyzer.
    lexeme
    a "word" in the program you are compiling; a string of symbols to be interpreted by the compiler as an indivisible, atomic unit.
    token
    the set of information collected by the compiler about a lexeme. Besides the lexeme string itself, this typically includes a syntactic category and a set of lexical attributes. The rest of the compiler generally works with tokens, not just lexemes.
    lexical attributes
    information about a lexeme that may be needed later in compilation, such as what source file, line, and column the token appeared at, and for constant values, their corresponding binary representation.
    regular expressions
    formal notation used to specify the set of strings (lexemes) which match each syntactic category in a programming language.
    finite automaton
    abstract mathematical machine capable of recognizing the strings described by a regular expression, or of selecting which of several regular expressions a piece of source code matches. Can be made to run in O(n) time.
    lex and flex
    (originally UNIX) tools that take in a set of regular expressions and generate (originally C) source code for a finite automaton.
    int yylex()
    the function, possibly generated by lex, implementing a lexical analyzer; returns an integer category for what type of word is matched. Called repeatedly by the compiler, each time processing and returning one source code lexeme. The token (holding lexical attributes) must be returned in a global variable since yylex() returns only the category.
    Obsolete conventional wisdom was to write the lexical analyzer by hand, because lexical analyzers generated by lex were too buggy and slow for production systems. Flex and Moore's law have largely rendered this moot, but many/most compilers still have handwritten lexical analyzers. They still tend to use lex's public API, though, because it works seamlessly with parsers generated by YACC.
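
    The yylex() contract described above can be sketched in C. This is a minimal illustration, not code from any Icon or Unicon compiler: the category codes (IDENT, INTLIT, OP), the layout of struct token, and the set_input() helper are all assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical category codes; a real compiler gets these from yacc. */
enum { END_OF_INPUT = 0, IDENT = 257, INTLIT = 258, OP = 259 };

/* A token bundles the lexeme with its category and lexical attributes. */
struct token {
    int category;
    char text[64];      /* the lexeme (truncated if longer than 63 chars) */
    int line, column;   /* source location */
};

struct token yytoken;           /* global: attributes of the latest token */
static const char *src = "";    /* remaining input */
static int line = 1, column = 1;

/* Hypothetical helper: point the scanner at a NUL-terminated string. */
void set_input(const char *s) { src = s; line = 1; column = 1; }

/* Returns only the integer category; everything else goes in yytoken. */
int yylex(void)
{
    size_t n = 0, copy;

    /* Skip blanks and newlines, tracking the source location. */
    while (*src == ' ' || *src == '\n') {
        if (*src == '\n') { line++; column = 1; } else column++;
        src++;
    }
    if (*src == '\0') return END_OF_INPUT;

    yytoken.line = line;
    yytoken.column = column;
    if (isalpha((unsigned char)*src) || *src == '_') {
        while (isalnum((unsigned char)src[n]) || src[n] == '_') n++;
        yytoken.category = IDENT;
    } else if (isdigit((unsigned char)*src)) {
        while (isdigit((unsigned char)src[n])) n++;
        yytoken.category = INTLIT;
    } else {
        n = 1;                      /* any other single character */
        yytoken.category = OP;
    }
    copy = n < sizeof yytoken.text ? n : sizeof yytoken.text - 1;
    memcpy(yytoken.text, src, copy);
    yytoken.text[copy] = '\0';
    src += n;
    column += (int)n;
    return yytoken.category;
}
```

    A caller would invoke set_input() once and then call yylex() in a loop until it returns END_OF_INPUT, reading yytoken after each call.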

    String Scanning in Icon and Unicon

       s ? expr 
    
    causes a string s to be the subject of scanning by expr. Within expr, s is a default argument to all string functions, and a scanning position is maintained and used to move through the string looking for specified patterns. move(rel) and tab(abs) move to a position and return the substring between the old and new position. Several functions such as find() and many() return positions within the string that can be used as the destinations for tab(). The unary operator =x is tab(match(x)), moving the position past x if x is the next thing in the string.

    It is reasonable to ask whether Icon string scanning makes processing of strings by hand easier than doing it in C or Java, and whether Icon string scanning is similar to the scanning done by compiler scanners.
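
    For comparison, here is roughly what a scanning environment looks like when built by hand in C. This is a sketch under assumptions, not how the Icon runtime implements scanning: struct scan and the functions scan_many(), scan_tab(), and scan_match() are hypothetical names that mimic many(), tab(), and the unary = operator. They use 0-based offsets rather than Icon's 1-based positions, and scan_tab() only moves forward.

```c
#include <stddef.h>
#include <string.h>

/* Scanning environment: a subject string plus a current position. */
struct scan {
    const char *subject;
    size_t pos;             /* 0-based offset into subject */
};

/* Like many(c): offset just past the longest run of characters drawn from
 * cset starting at pos, or -1 ("failure") if that run is empty. */
long scan_many(const struct scan *s, const char *cset)
{
    size_t n = strspn(s->subject + s->pos, cset);
    return n > 0 ? (long)(s->pos + n) : -1;
}

/* Like tab(i): move to offset newpos and copy the substring between the old
 * and new positions into out. Returns 0 on success, -1 on failure. */
int scan_tab(struct scan *s, long newpos, char *out, size_t outsize)
{
    size_t len;

    if (newpos < (long)s->pos || (size_t)newpos > strlen(s->subject))
        return -1;
    len = (size_t)newpos - s->pos;
    if (len >= outsize)
        return -1;
    memcpy(out, s->subject + s->pos, len);
    out[len] = '\0';
    s->pos = (size_t)newpos;
    return 0;
}

/* Like =x, i.e. tab(match(x)): advance past x if x comes next. */
int scan_match(struct scan *s, const char *x)
{
    size_t n = strlen(x);

    if (strncmp(s->subject + s->pos, x, n) != 0)
        return -1;
    s->pos += n;
    return 0;
}
```

    Because scan_many() reports failure as -1 and scan_tab() rejects that value, the composition scan_tab(&s, scan_many(&s, digits), ...) fails as a unit, which is the behavior Icon gives you for free in tab(many(&digits)).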

    The Unicon Lexical Analyzer

    Unicon's lexical analyzer is written by hand, in Unicon, using a lex-compatible interface. Some of its design is borrowed from the Icon lexical analyzer (which is handwritten C code). It would be interesting to replace Unicon's lexical analyzer with a machine generated lexical analyzer to reduce the amount of compiler source code we have to maintain. The lexical analyzer consists of a function yylex() located in unicon/uni/unicon/unilex.icn, about 500 lines of code.

    Globals Comprising the Lex-compatible Public API

    The global declarations that exist in order to provide a Lex-compatible API include:
    $include "ytab_h.icn"			# yacc's token categories
    global yytext				# lexeme
    global yyin				# source file we are reading
    global yytoken				# token (a record)
    global yylineno, yycolno, yyfilename	# source location
    

    Character Categories

    The lexical analyzer uses several csets for different character categories beyond the built-in ones:
    global O, D, L, H, R, FS, IS, W, idchars
    
    procedure init_csets()
       O  := '01234567'
       D  := &digits
       L  := &letters ++ '_'
       H  := &digits ++ 'abcdefABCDEF'
       R  := &digits ++ &letters
       FS := 'fFlL'
       IS := 'uUlL'
       W  := ' \t\v'
       idchars := L ++ D
    end
    

    The Token Type

    The record type storing each token's information just bundles together the syntactic category (an integer), lexeme (a string), and location at which the token occurred. This is pretty minimalist.
    record token(tok, s, line, column, filename)
    

    Global Variables for Error Handling and Debugging

    Several remaining global variables are mainly used for error handling, and for debugging the lexical analyzer itself.

    Reserved Words

    The procedure reswords() creates a table holding the Unicon reserved words; the lexical analyzer later assigns that table to the global name reswords, so the global "becomes" the table. For each word, a pair of integers [tokenflags, category] is kept. Language design note: tables in this language need a "literal" format.
    procedure reswords()
    static t
    initial {
       t := table([Beginner+Ender, IDENT])
    
       t["abstract"] := [0, ABSTRACT]
       t["break"] := [Beginner+Ender, BREAK]
       t["by"] := [0, BY]
       t["case"] := [Beginner, CASE]
       t["class"] := [0, CLASS]
       t["create"] := [Beginner, CREATE]
       t["default"] := [Beginner, DEFAULT]
       t["do"] := [0, DO]
       t["else"] := [0, ELSE]
       t["end"] := [Beginner, END]
       t["every"] := [Beginner, EVERY]
       t["fail"] := [Beginner+Ender, FAIL]
       t["global"] := [0, GLOBAL]
       t["if"] := [Beginner, IF]
       t["import"] := [0, IMPORT]
       t["initial"] := [Beginner, iconINITIAL]
       t["initially"] := [Ender, INITIALLY]
       t["invocable"] := [0, INVOCABLE]
       t["link"] := [0, LINK]
       t["local"] := [Beginner, LOCAL]
       t["method"] := [0, METHOD]
       t["next"] := [Beginner+Ender, NEXT]
       t["not"] := [Beginner, NOT]
       t["of"] := [0, OF]
       t["package"] := [0, PACKAGE]
       t["procedure"] := [0, PROCEDURE]
       t["record"] := [0, RECORD]
       t["repeat"] := [Beginner, REPEAT]
       t["return"] := [Beginner+Ender, RETURN]
       t["static"] := [Beginner, STATIC]
       t["suspend"] := [Beginner+Ender, SUSPEND]
       t["then"] := [0, THEN]
       t["to"] := [0, TO]
       t["until"] := [Beginner, UNTIL]
       t["while"] := [Beginner, WHILE]
    }
       return t
    end
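
    The default-value behavior of this table (any word not listed is an ordinary identifier that can both begin and end an expression) can be mimicked in C. This is a hedged sketch only: the numeric category codes, the three sample entries, and the names struct resword and lookup_word() are assumptions for illustration, and a linear search stands in for the hashing a real table would use.

```c
#include <string.h>

enum { BEGINNER = 1, ENDER = 2 };                            /* flag bits */
enum { IDENT = 257, IF = 258, THEN = 259, RETURN = 260 };    /* hypothetical codes */

struct resword {
    const char *name;
    int flags;       /* BEGINNER and/or ENDER */
    int category;    /* integer token category */
};

/* A few sample entries; the real table lists every reserved word. */
static const struct resword reswords[] = {
    { "if",     BEGINNER,         IF     },
    { "return", BEGINNER | ENDER, RETURN },
    { "then",   0,                THEN   },
};

/* Default entry, playing the role of table([Beginner+Ender, IDENT]). */
static const struct resword ident_entry = { "", BEGINNER | ENDER, IDENT };

/* Return the entry for lexeme, or the identifier default if it is not reserved. */
const struct resword *lookup_word(const char *lexeme)
{
    size_t i;

    for (i = 0; i < sizeof reswords / sizeof reswords[0]; i++)
        if (strcmp(reswords[i].name, lexeme) == 0)
            return &reswords[i];
    return &ident_entry;
}
```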
    

    Lexical Analyzer Initialization and the Big Inhale

    A function, yylex_reinit(), is called the first time yylex() is called, and again each time the compiler moves on to process a new file named on the command line. Along with initializing the public API variables, this function reads the entire file into a single global string variable named "buffer". This allows extremely fast subsequent processing, which does no file I/O for each token, while avoiding the complex buffering sometimes done to reduce file I/O costs in compilers.

    This "big-inhale" model did not work well on original 128K PDP-11 UNIX computers, but works well in this century. At present, the code assumes Unicon source files are less than a megabyte -- a lazy programmer's error. Although Unicon programs are much shorter than C programs, an upper limit of 1MB is bound to be reached someday. Homework: fix the lexical analyzer so it works correctly on files of arbitrarily large size.

    procedure yylex_reinit()
       yytext := ""
       yylineno := 0
       yycolno := 1
       lastchar := ""
       if type(yyin) == "file" then
          buffer := reads(yyin, 1000000)
       else
          buffer := yyin
       tokflags := 0
    end
    

    Semicolon Insertion

    Icon and Unicon insert semicolons for you automatically. This is an easy lexical analyzer trick. The lexical analyzer requires one token of lookahead. Between each two tokens, it asks: was there a newline? If yes, was the token before the newline one that could conceivably be the end of an expression, and was the token at the start of the new line one that could conceivably start a new expression? If it would be legal to do so, it saves the new token and returns a semicolon instead.

    This little procedure is entirely hidden from the regular lexical analyzer code by writing that regular code in a helper function yylex2(), and writing the semicolon insertion logic in a yylex() function that calls yylex2 when it needs a new token.

    Initialization for the yylex() function shows the static variables used to implement the one token of lookahead. If the global variable buffer doesn't hold a string anymore, /buffer will succeed and it must be that we are at end-of-file and should return 0.

    procedure yylex()
      static saved_tok, saved_yytext
      local rv, ender
      initial {
          if /buffer then
    	  yylex_reinit()
         }
       if /buffer then {
          if \debuglex then
    	 write("yylex() : 0")
          return 0
          }
    
    If we inserted a semicolon last time we were called, the saved_tok will be the first token of the next line; we should return it.
      if \saved_tok then {
        rv := saved_tok
        saved_tok := &null
        yytext := saved_yytext
        yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
        if \debuglex then
          write("yylex() : ",tokenstr(rv), "\t", image(yytext))
        return rv
      }
    
    Otherwise, we should obtain the next token by calling yylex2(). We have to check for end of file, remember if the last token could end an expression, call yylex2(), and update buffer to be the smaller string remaining after the token.
      ender := iand(tokflags, Ender)
      tokflags := 0
      if *buffer=0 then {
          buffer := &null
          if \debuglex then
    	  write("yylex() : EOFX")
          return EOFX
         }
      buffer ? {
          if rv := yylex2() then {
    	  buffer := tab(0)
          }
          else {
             buffer := &null
    	 yytext := ""
    	 if \debuglex then
    	     write("yylex() : EOFX")
             return EOFX
          }
      }
    
    After fetching a new token, we have to decide whether to insert a semicolon or not. This is based on the local variable ender (whether the previous token could end an expression) and the global variable tokflags (which holds both whether the current token could begin an expression, and whether a newline occurred between the last token and the current token). iand() is a bitwise AND, equivalent to the C language & operator, used to pick bits out of a set of boolean flags encoded as bits within an integer.
      if ender~=0 & iand(tokflags, Beginner)~=0 & iand(tokflags, Newline)~=0 then {
        saved_tok := rv
        saved_yytext := yytext
        yytext := ";"
        rv := SEMICOL
        }
    
    Returning a token requires allocation of a token() record instance, which is stored in a global variable.
       yylval := yytoken := token(rv, yytext, yylineno, yycolno, yyfilename)
       if \debuglex then
          write("yylex() : ", tokenstr(rv), "\t", image(yytext))
       return rv
    end
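
    The same wrapper idea can be sketched in C. This is an illustration under assumptions rather than a translation of the Unicon code: the raw tokens come from a pre-built array instead of a real yylex2(), and the names struct raw, lex_init(), and the numeric category codes are hypothetical.

```c
#include <stddef.h>

enum { BEGINNER = 1, ENDER = 2, NEWLINE = 4 };   /* flag bits, like tokflags */
enum { EOFX = 0, IDENT = 257, PLUS = 258, SEMICOL = 259 };

/* One raw token, standing in for what yylex2() would produce. */
struct raw { int category; int flags; };

static const struct raw *input;    /* hypothetical pre-scanned token stream */
static size_t ninput, next_raw;
static int have_saved, saved_tok;  /* one token of lookahead */
static int prev_ender;             /* could the previous token end an expression? */

void lex_init(const struct raw *toks, size_t n)
{
    input = toks; ninput = n; next_raw = 0;
    have_saved = 0; prev_ender = 0;
}

int yylex(void)
{
    const struct raw *t;
    int insert;

    if (have_saved) {              /* a semicolon was returned last time */
        have_saved = 0;
        return saved_tok;
    }
    if (next_raw >= ninput)
        return EOFX;
    t = &input[next_raw++];

    /* Insert only if: previous token can end an expression, this token can
     * begin one, and a newline separated the two. */
    insert = prev_ender && (t->flags & BEGINNER) && (t->flags & NEWLINE);
    prev_ender = (t->flags & ENDER) != 0;
    if (insert) {
        saved_tok = t->category;   /* hand this token out on the next call */
        have_saved = 1;
        return SEMICOL;
    }
    return t->category;
}
```

    With this rule, "a" followed by a newline and "b + c" yields IDENT SEMICOL IDENT PLUS IDENT, while "a +" followed by a newline and "b" yields no semicolon, because PLUS is not an ender.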
    

    The Real Lexical Analyzer Function, yylex2()

    This function maintains a table of functions, calling a helper function depending on what the first character in the token is.
    procedure yylex2()
    static punc_table
    initial {
       init_csets()
       reswords := reswords()
       punc_table := table(uni_error)
       punc_table["'"] := do_literal
       punc_table["\""] := do_literal
       punc_table["!"] := do_bang
       punc_table["%"] := do_mod
       punc_table["&"] := do_and
       punc_table["*"] := do_star
       punc_table["+"] := do_plus
       punc_table["-"] := do_minus
       punc_table["."] := do_dot
       punc_table["/"] := do_slash
       punc_table[":"] := do_colon
       punc_table["<"] := do_less
       punc_table["="] := do_equal
       punc_table[">"] := do_greater
       punc_table["?"] := do_qmark
       punc_table["@"] := do_at
       punc_table["\\"] := do_backslash
       punc_table["^"] := do_caret
       punc_table["|"] := do_or
       punc_table["~"] := do_tilde
       punc_table["("] := do_lparen
       punc_table[")"] := do_rparen
       punc_table["["] := do_lbrack
       punc_table["]"] := do_rbrack
       punc_table["{"] := do_lbrace
       punc_table["}"] := do_rbrace
       punc_table[","] := do_comma
       punc_table[";"] := do_semi
       punc_table["$"] := do_dollar
       every punc_table[!&digits] := do_digits
       every punc_table["_" | !&letters] := do_letters
       }
    
    The main lexical analyzer code strips comments and whitespace, and calls the function table for the first non-whitespace character it finds. Note support for #line directives, and the use of string scanning.
       yycolno +:= *yytext
    
       repeat {
           if pos(0) then fail
           if ="#" then {
    	       if ="line " then {
    		   if yylineno := integer(tab(many(&digits))) then {
    		       =" \""
    		       yyfilename := tab(find("\"")|0)
    		   }
    	       }
    	       tab(find("\n") | 0)
    	       next
    	   }
           if ="\n" then {
    	   yylineno +:= 1
    	   yycolno := 1
    	   if tokflags < Newline then
    	       tokflags +:= Newline
    	   next
           }
           if tab(any(' ')) then { yycolno +:= 1; next }
           if tab(any('\v\^l')) then { next }
           if tab(any('\t')) then {
    	   yycolno +:= 1
    	   while (yycolno-1) % 8 ~= 0 do yycolno +:= 1
    	   next
           }
    
           yytext := move(1)
           return punc_table[yytext]()
       }
    end
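
    The table-of-functions dispatch used by yylex2() has a direct C analogue: an array of function pointers indexed by the first character. The sketch below is an assumption-laden miniature, not code from the Icon sources; the handler bodies are stubs, the category codes are made up, and the letter and digit loops assume an ASCII-like character set where those ranges are contiguous.

```c
#include <stddef.h>

enum { ERROR_TOK = -1, IDENT = 257, INTLIT = 258, COMMA = 259, PLUS = 260 };

typedef int (*handler)(void);

/* Stub handlers; real ones would also consume the rest of the lexeme. */
static int do_error(void)   { return ERROR_TOK; }
static int do_comma(void)   { return COMMA; }
static int do_plus(void)    { return PLUS; }
static int do_digits(void)  { return INTLIT; }
static int do_letters(void) { return IDENT; }

static handler punc_table[256];

void init_punc_table(void)
{
    int c;

    /* Default handler for every character, like table(uni_error). */
    for (c = 0; c < 256; c++) punc_table[c] = do_error;
    punc_table[','] = do_comma;
    punc_table['+'] = do_plus;
    punc_table['_'] = do_letters;
    /* Assumes ASCII: digits and letters occupy contiguous ranges. */
    for (c = '0'; c <= '9'; c++) punc_table[c] = do_digits;
    for (c = 'a'; c <= 'z'; c++) punc_table[c] = do_letters;
    for (c = 'A'; c <= 'Z'; c++) punc_table[c] = do_letters;
}

/* Dispatch on the first character of a token. */
int classify(char first)
{
    return punc_table[(unsigned char)first]();
}
```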
    
    The functions in the punctuation table select integer codes and match the rest of the lexeme. do_comma() illustrates an unambiguous token selection, while do_plus() illustrates a more common case where the "+" character could start any of 5 different tokens depending on the character(s) that follow it. Tokens starting with "letters" are looked up in a reserved words table, which tells whether they are special, or just a variable name.
    procedure do_comma()
       return COMMA
    end
    
    procedure do_plus()
       if yytext ||:= =":" then {
          if yytext ||:= ="=" then { return AUGPLUS }
             return PCOLON
          }
       if yytext ||:= ="+" then {
          if yytext ||:= =":=" then {return AUGUNION}
             return UNION
          }
       tokflags +:= Beginner
       return PLUS
    end
    
    procedure do_letters()
       yytext ||:= tab(many(idchars))
       x := reswords[yytext]
       tokflags +:= x[1]
       return x[2]
    end
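
    The longest-match logic of do_plus() can be written in C over a NUL-terminated string. This is a sketch with assumed names: lex_plus() and the numeric category codes are hypothetical, and the function reports the lexeme length through an out-parameter instead of appending to yytext.

```c
#include <stddef.h>

enum { PLUS = 257, PCOLON, AUGPLUS, UNION, AUGUNION };  /* hypothetical codes */

/* s points at a '+' in a NUL-terminated string. Pick the longest operator
 * that starts there, store its length in *len, and return its category. */
int lex_plus(const char *s, size_t *len)
{
    if (s[1] == ':') {
        if (s[2] == '=') { *len = 3; return AUGPLUS; }    /* +:=  */
        *len = 2; return PCOLON;                          /* +:   */
    }
    if (s[1] == '+') {
        if (s[2] == ':' && s[3] == '=') {
            *len = 4; return AUGUNION;                    /* ++:= */
        }
        *len = 2; return UNION;                           /* ++   */
    }
    *len = 1; return PLUS;                                /* +    */
}
```

    Each index is examined only after the preceding character has been found to be non-NUL, so the function never reads past the terminator.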
    

    lecture #4 began here

    Homework Solutions

    You were charged with fixing this:
          buffer := reads(yyin, 1000000)
    

    While Loop Solution

         buffer := ""; while buffer ||:= reads(yyin, 1000000)
    

    stat() Solution

         buffer := reads(yyin, stat(yyin).size)
    

    Kobayashi Maru Solution

    Modify the definition of reads() so that reads(yyin, -1) reads all of yyin.
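
    For readers coming from C, the while-loop solution corresponds to reading in chunks and growing the buffer as needed. The following is a minimal sketch, not code from the Icon or Unicon sources; read_all() is a hypothetical name and the starting capacity of 4096 bytes is arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything remaining in fp into a malloc'ed, NUL-terminated buffer.
 * Stores the byte count in *len. Returns NULL on allocation or read error.
 * The caller frees the result. */
char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0, n;
    char *buf = malloc(cap), *tmp;

    if (buf == NULL)
        return NULL;
    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + used, 1, cap - used - 1, fp)) > 0) {
        used += n;
        if (cap - used - 1 == 0) {          /* buffer full: double it */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) { free(buf); return NULL; }
    buf[used] = '\0';
    *len = used;
    return buf;
}
```

    Doubling the capacity keeps the total copying cost linear in the file size, which is the C counterpart of letting the Unicon runtime manage the growing string.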

    Other Icon-Family Lexical Analyzers

    The Icont/Iconc Lexical Analyzer

    Icont (the VM translator) and Iconc share their lexical analyzer code, comprising roughly 1400 lines of C code. src/icont/tlex.c (and src/iconc/clex.c) are almost empty files that include the header files where the lexical analyzer resides.

    The lexical analyzer has a header src/h/lexdef.h containing some global declarations and useful macros. src/common/lextab.h contains a "token table" similar to Unicon's reswords table, and an "operator table" encoding for each operator both its beginner/ender flags and whether it is unary, binary, or is used both ways. lextab.h also includes a function getopr() that takes 400 lines of switches and if-statements to examine characters one at a time and calculate the longest operator when several operator characters are adjacent.

    Although the lexical analyzer is mostly handwritten, there are a couple of juicy specification files, src/common/tokens.txt and src/common/op.txt, that define the tokens and operators; these are processed by an Icon program (src/common/mktoktab.icn) to generate certain .h files if the ultrarare event of adding a new token or operator ever occurs. These operations are so rare that the Makefiles do not even consider them by default.

    The main "lexical analyzer" file is src/common/yylex.h. The logic in this lexical analyzer is similar to Unicon's lexical analyzer. When semicolon insertion, whitespaces, and comments are handled and a new actual token is being examined, the code boils down to four possibilities:

       if (isalpha(c) || (c == '_')) {   /* gather ident or reserved word */
          if ((t = getident(c, &cc)) == NULL)
    	 goto loop;
          }
       else if (isdigit(c) || (c == '.')) {	/* gather numeric literal or "." */
          if ((t = getnum(c, &cc)) == NULL)
    	 goto loop;
          }
       else if (c == '"' || c == '\'') {    /* gather string or cset literal */
          if ((t = getstring(c, &cc)) == NULL)
    	 goto loop;
          }
       else {			/* gather longest legal operator */
          if ((n = getopr(c, &cc)) == -1)
    	 goto loop;
          t = &(optab[n].tok);
          yylval = OpNode(n);
          }
    
    Aside from syntax and the larger number of lines of code, there are some other main differences between the C-based lexical analyzer and the Unicon lexical analyzer for this language.

    A Brief Aside on the %union Type and yylval

    CS 370 students have trouble with how the lexical analyzer delivers lexical attributes to YACC, so we should review this concept.
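
    As a reminder of the mechanism, a %union declaration in a yacc grammar becomes a C union type named YYSTYPE, and the lexical analyzer stores each token's attributes in a global yylval of that type before returning the integer category. The sketch below imitates that arrangement by hand. The member names ival and sval, the token codes, and the helper lex_one() are assumptions for illustration only; in a real build the union and the codes come from the header that yacc generates.

```c
#include <stdlib.h>

/* Roughly what yacc emits for:  %union { long ival; const char *sval; }  */
typedef union {
    long ival;           /* semantic value of an integer literal */
    const char *sval;    /* semantic value of an identifier */
} YYSTYPE;

YYSTYPE yylval;          /* the parser reads this after each yylex() call */

enum { INTLIT = 257, IDENT = 258 };   /* hypothetical token codes */

/* Hypothetical helper: classify one already-isolated lexeme, filling in
 * the union member that matches the category being returned. */
int lex_one(const char *lexeme)
{
    if (lexeme[0] >= '0' && lexeme[0] <= '9') {
        yylval.ival = strtol(lexeme, NULL, 10);
        return INTLIT;
    }
    yylval.sval = lexeme;
    return IDENT;
}
```

    The parser must read the member that corresponds to the category it was handed; reading yylval.sval after an INTLIT would reinterpret the integer's bits as a pointer.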

    Some Ways to Classify Languages' Lexical Analyzers

    Rtt's Lexical Analyzer

    Rtt, the Icon runtime translator, is essentially a C preprocessor with some extended syntax, including more reserved words. Its lexical analyzer consumes input that has gone through an extended ANSI C preprocessor, with extra macro capabilities including a nifty "multi-line macro" syntax:
    #begdef foo(x)
       ...body of foo can be as long as you like
       ...more body of foo
       ...
    #enddef
    
    Note that the conventional ANSI C preprocessor has already "tokenized" the input into chunks in order to look for macro names and such; the rtt lexical analyzer gets text chunks one at a time, and categorizes them. src/rtt/rttlex.c is only 430 lines of code, but the C preprocessor that is doing much of the work is another 6500 lines, in the src/preproc/ directory! It also keeps its size down by utilizing the symbol table code (a hash table) to handle the reserved words.

    There is some amount of lexical awkwardness in rtt due to the extended C syntax. For example, p ** q in regular C might mean "multiply p by what q points at"; under rtt C code fragments may use this meaning but the extra syntax (for type inferencing) uses ** as a set intersection operator, and a global variable flag has to keep track and change the lexical analyzer behavior depending on whether one is in "type mode" or "regular C mode".

    The yylex() function, since its preprocessor has already provided it with the sequence of characters that will comprise the token, is mainly looking at the characters to select what integer code to return. For example, for strings that might be identifiers:

       if (yylval.t->tok_id == Identifier) {
          /*
           * See if this is an identifier, a reserved word, or typedef name.
           */
          sym = sym_lkup(yylval.t->image);
          if (sym != NULL)
             yylval.t->tok_id = sym->tok_id;
          }
    

    Ulex

    Since we are doing lexical analysis, it is worth mentioning ulex, by recent NMSU M.S. grad Katie Ray. It is an implementation of lex for Unicon, written from scratch in 2200 lines of C, lex, and yacc source. It is relatively little-tested. See Unicon TR #2a.

    Lexical Analysis - GCC

    It is a bit hard to find anything in GCC's 438,000 lines of code. Code in c_lex() looks like:
       do
          tok = cpp_get_token (parse_in);
       while (tok->type == CPP_PADDING);
       ...
       switch (tok->type) {
          /* 10 branches for different C preprocessor categories,
             that require special handling, including errors and constants
           */
          }
       return tok->type;
    
    Some apparent lexical-analysis information is located more in the parser file, c-parse.y. For example, a reserved words array (reswords) lives in that file, including the actual spelling of all the C reserved words. There are extra complications from the fact that the gcc C front end supports gcc, g++, and GNU Objective C.

    Lexical Analysis - Python

    Python is a scripting language for which the compile step is optional and normally omitted. So does it have a lexical analyzer? You bet.

    At first glance, the lexical analyzer appears to live in Parser/tokenizer.[ch], about 1500 lines of C. The lexical analyzer is interesting in that Python is often used in an interactive interpreter mode; the lexical analyzer has to worry about whether it's reading from a file or a terminal.

    There are about 53 token categories in Python. A lot of lexical analyzer code handles Unicode, and includes tricks for international character handling; for example, the encoding specification is supplied in an optional special comment.

    Because the lexical analyzer is feeding an interpreter, Python tokens are full-blown Python objects, not just C structs.

    The main lexical analysis function is int tok_get(tok, pstart, pend), 400 lines or so, which reads from a function tok_nextc(), calculates an indentation level (used in Python parsing), and handles a lot of weird special cases. One point here, true in most lexical analyzers, is the need to look ahead at characters in order to decide what token a certain character might begin. If you are string scanning, lookahead is very simple, but if you are working with file I/O it can get complicated.

    	do {
    		c = tok_nextc(tok);
    	} while (c == ' ' || c == '\t' || c == '\014');
    	...
    	if (c == '#') { /* skip comment code */... }
    	if (c == EOF) { /* return ENDMARKER or ERRORTOKEN */ }
    	if (isalpha(c) || c == '_') { /* identifier */ }
    	if (c == '\n') { /* newline */ }
    	if (c == '.') { /* period, or number starting with period? */ }
    	if (isdigit(c)) { /* number */ }
    	if (c == '\'' || c == '"') { /* string */ }
    	if (c == '\\') { /* line continuation */ }
    	...
    	/* Check for two-character token */
    	{
    		int c2 = tok_nextc(tok);
    		int token = PyToken_TwoChars(c, c2);
    		if (token != OP) {
    			int c3 = tok_nextc(tok);
    			int token3 = PyToken_ThreeChars(c, c2, c3);
    			if (token3 != OP) {
    				token = token3;
    			} else {
    				tok_backup(tok, c3);
    			}
    			*p_start = tok->start;
    			*p_end = tok->cur;
    			return token;
    		}
    		tok_backup(tok, c2);
    	}
    	...
    	return PyToken_OneChar(c);
    

    Cute code:

    int
    PyToken_OneChar(int c)
    {
    	switch (c) {
    	case '(':	return LPAR;
    	case ')':	return RPAR;
    	...
    	default:	return OP;
    	}
    }
    int
    PyToken_TwoChars(int c1, int c2)
    {
    	switch (c1) {
    	case '=':
    		switch (c2) {
    		case '=':	return EQEQUAL;
    		}
    		break;
    	...
    	return OP;
    }
    
    int
    PyToken_ThreeChars(int c1, int c2, int c3)
    {
    	switch (c1) {
    	case '<':
    		switch (c2) {
    		case '<':
    			switch (c3) {
    			case '=':
    				return LEFTSHIFTEQUAL;
    			}
    			break;
    		}
    		break;
    	...
    	return OP;
    }
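
    To make the lookahead-and-pushback idea concrete, here is a minimal C sketch of the same longest-match strategy over an in-memory string, where backing up is just decrementing an index. The scanner struct, the T_ token codes, and the function names are all hypothetical, invented for this illustration; this is not Python's tokenizer code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes for this sketch only. */
enum { T_EOF, T_LESS, T_LSHIFT, T_LSHIFTEQ, T_LESSEQ, T_OTHER };

/* Hypothetical scanner state: a NUL-terminated buffer and a cursor. */
struct scanner {
    const char *buf;
    size_t pos;
};

/* Fetch the next character, or 0 at end of input. */
static int next_char(struct scanner *s)
{
    int c = (unsigned char)s->buf[s->pos];
    if (c != 0)
        s->pos++;
    return c;
}

/* Undo the most recent next_char(); a no-op for the end marker. */
static void backup(struct scanner *s, int c)
{
    if (c != 0) {
        assert(s->pos > 0);
        s->pos--;
    }
}

/* Return the longest operator starting with '<' at the cursor. */
int scan_operator(struct scanner *s)
{
    int c = next_char(s);
    if (c == 0)
        return T_EOF;
    if (c != '<')
        return T_OTHER;

    int c2 = next_char(s);
    if (c2 == '=')
        return T_LESSEQ;           /* "<=" */
    if (c2 != '<') {
        backup(s, c2);             /* plain "<": give c2 back */
        return T_LESS;
    }
    int c3 = next_char(s);
    if (c3 == '=')
        return T_LSHIFTEQ;         /* "<<=" */
    backup(s, c3);                 /* "<<": give c3 back */
    return T_LSHIFT;
}
```

    With the input <<=<<<x, successive calls yield T_LSHIFTEQ, T_LSHIFT, T_LESS, T_OTHER, and finally T_EOF. With file I/O the backup step needs a pushback buffer (or ungetc) instead of an index decrement, which is where the complication comes from.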
    

    Let's Look at Lexical Analysis Homework

    lecture #5 began here

    Getting Your Free Unicon Book

    Besides the free Icon/Unicon Implementation Book, and the free Graphics Programming in Icon book, if you want a free copy of the main Unicon language reference, bring me a ream (500 sheets) of copy paper (letter or legal size; legal allows for a slightly larger font). I will contribute the toner, the softbound binding and cover, and the labor.

    Missing .txt Files for Lexical Analysis

    These comments apply to homework #1, option #1.

    YACC - highlights

    More students requested this session than I expected. Plus, you need this information right away; we are moving on to syntax analysis this week. How can we resolve this dilemma? The best way is to decide that TODAY's lecture will include material on YACC as needed.

    Syntax Analysis

    Syntax Analysis is traditionally covered heavily in a first compiler course that is a prerequisite for this one. We will briefly review the main concepts, and then talk about syntax analyzers in the Icon and Unicon compiler family.

    Syntax Analysis Fundamentals

    The eerie similarities between this section and last week's section on lexical analysis are no coincidence. The same "you should know 370" comment applies here: if this all does not ring many bells, take CS 370.
    parser
    a program that does syntax analysis. Also called a syntax analyzer.
    context free grammar
    formal notation used to specify the sequences of tokens which match each syntactic category in a programming language. Influenced by the field of linguistics.
    nonterminal (symbol)
    context free grammar element corresponding to a "phrase" of 0 or more words. Corresponds to an internal node in the syntax tree
    terminal (symbol) = leaf = token
    the set of information collected by the scanner about a lexeme. Besides the lexeme string itself, this typically includes a syntactic category and a set of lexical attributes. The parsing task uses an integer category, which constitutes a terminal symbol in the grammar for the source language. The entire token for each terminal symbol is a "leaf" in the syntax tree
    start symbol
    special nonterminal corresponding to a "sentence" or "program"
    production rule
    a CFG production rule specifies that a nonterminal can be replaced by 0 or more terminals or nonterminals
    derivation step
    application of one production rule. Performed repeatedly until no nonterminals are left.
    Parse tree/syntax tree
    A parse tree shows all the derivation steps. A syntax tree has the same general shape but omits internal unary nonterminals.
    pushdown automaton
    abstract mathematical machine (a finite automaton plus a stack) capable of recognizing a context free language, that is, of deciding whether a sequence of tokens can be derived from a grammar. For the deterministic grammars that YACC accepts, it can be made to run in O(n) time.
    yacc/bison/byacc
    (originally UNIX) tool that takes in a context free grammar and generates (originally C) source code for a pushdown automaton
    int yyparse()
    the function, possibly generated by yacc, implementing a parser; returns an integer for whether the parse was successful (0) or not. Called repeatedly by the compiler, each time processing an entire source file. Root (holding entire syntax tree) must be returned in a global variable since yyparse() returns only an integer.
    It is very fair to say that syntax and semantic analyses are often intertwined in YACC. The syntax tree itself is logically part of syntax analysis, but YACC calls the code fragments that are used to construct such trees "semantic actions".
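
    The stack is what separates a pushdown automaton from the finite automata of lexical analysis. As a small illustration in C, here is a recognizer for nested brackets, something no regular expression can handle at arbitrary depth. The function name balanced and the MAX_DEPTH limit are made up for this sketch; a YACC-generated parser stacks parse states rather than characters, but the principle is the same.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 256   /* arbitrary stack limit for this sketch */

/* Return 1 if every bracket in s is properly nested and matched, 0 if not. */
int balanced(const char *s)
{
    char stack[MAX_DEPTH];
    size_t top = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        switch (*s) {
        case '(': case '[': case '{':
            if (top == MAX_DEPTH)
                return 0;               /* too deep for this sketch */
            stack[top++] = *s;          /* push the opener */
            break;
        case ')': case ']': case '}': {
            char want = (*s == ')') ? '(' : (*s == ']') ? '[' : '{';
            if (top == 0 || stack[--top] != want)
                return 0;               /* closer with no matching opener */
            break;
        }
        default:
            break;                      /* other characters are ignored */
        }
    }
    return top == 0;                    /* leftover openers mean failure */
}
```

    Each input character is examined once and causes at most one push or pop, which is the O(n) behavior mentioned in the definition above.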

    The Unicon Parser

    Unicon's parser is written using a YACC grammar; a graduate student (Ray Pereda) modified Berkeley's public domain version of YACC (byacc) to generate Unicon code, following in the footsteps of someone who had earlier modified it to generate Java. The Unicon parser lives in uni/unicon/unigram.y in the source distribution (22kB, 700 lines, 119 terminals, 71 nonterminals). Unicon's YACC grammar was obtained by copying the Icon grammar, and adding Unicon syntax constructs. Prior to this time the object-oriented dialect of Icon was called Idol and really was a line-oriented preprocessor instead of a compiler.

    The start symbol for the grammar is named program, and the semantic action code fragment for this nonterminal calls the rest of the compiler (semantic analysis and code generation) directly on the root of the syntax tree, rather than storing it in a global variable for the main() procedure to examine.

    program	: decls EOFX { Progend($1);} ;
    
    Many context free grammar rules are recursive, with an empty production to terminate the recursion. The rule for declarations is typical:
    decls	: { $$ := EmptyNode }
    	| decls decl {
    	   if yynerrs = 0 then iwrites(&errout,".")
    	   $$ := node("decls", $1, $2)
    	   } ;
    
    The "semantic action" (code fragment) for every production rule builds a syntax tree node and assigns it to $$ for the nonterminal left-hand side of the rule.

    Another common grammar pattern is a production rule that has many different alternatives, such as the one for individual declarations:

    decl	: record
    	| proc
    	| global
    	| link
    	| package
    	| import
    	| invocable
    	| cl
    	;
    
    For such "unary" productions, the child's syntax tree node suffices for the parent; no new tree node is needed.

    Some nonterminals mostly correspond to a specific sequence of terminals, as is the case for package references:

    packageref : IDENT COLONCOLON IDENT { $$ := node("packageref", $1,$2,$3) } 
       | COLONCOLON IDENT { $$ := node("packageref", $1,$2) }  
       ;
    

    The lexical analyzer has already constructed a valid "leaf" for each terminal symbol, so if a production rule has only one terminal symbol in it, for a syntax tree we can simply use the leaf for that nonterminal (for a parse tree, we would need to allocate an extra unary internal node):

    lnkfile	: IDENT ;
    	| STRINGLIT ;
    

    The expressions (which comprise about half of the grammar) use a separate nonterminal for each level of precedence instead of YACC's tricks for resolving precedence. This may be up to around 20 levels of nonterminals. A typical rule looks like:

    expr6	: expr7 ;
    	| expr6 PLUS expr7 { $$ := node("Bplus", $1,$2,$3);} ;
    	| expr6 DIFF expr7 { $$ := node("Bdiff", $1,$2,$3);} ;
    	| expr6 UNION expr7 { $$ := node("Bunion", $1,$2,$3);} ;
    	| expr6 MINUS expr7 { $$ := node("Bminus", $1,$2,$3);} ;
    
    The "B" stands for "binary", to distinguish these operators from their unary brethren. The 20 levels of nonterminals approach is inherited from Icon and probably makes the parser way larger than it has to be. Potential syntax homework!
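
    One way to see why each precedence level gets its own nonterminal is to write the levels as functions. The following C sketch is a recursive-descent evaluator, which is a different technique from the LR parser that YACC generates, and it has three levels rather than 20. All names (evaluate, parse_additive, and so on) are hypothetical. Each function corresponds to one level and calls the next tighter level for its operands, just as expr6 refers to expr7.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Cursor into the input text; hypothetical helper state for this sketch. */
static const char *cur;

static long parse_additive(void);

static void skip_spaces(void)
{
    while (*cur == ' ')
        cur++;
}

/* Highest precedence: an integer literal or a parenthesized expression. */
static long parse_primary(void)
{
    long value = 0;
    skip_spaces();
    if (*cur == '(') {
        cur++;
        value = parse_additive();
        skip_spaces();
        if (*cur == ')')
            cur++;
        return value;
    }
    while (isdigit((unsigned char)*cur))
        value = value * 10 + (*cur++ - '0');
    return value;
}

/* Middle level: '*' binds tighter than '+' and '-'. */
static long parse_multiplicative(void)
{
    long value = parse_primary();
    for (;;) {
        skip_spaces();
        if (*cur != '*')
            return value;
        cur++;
        value *= parse_primary();
    }
}

/* Lowest level: left-associative '+' and '-'. */
static long parse_additive(void)
{
    long value = parse_multiplicative();
    for (;;) {
        skip_spaces();
        if (*cur == '+') {
            cur++;
            value += parse_multiplicative();
        } else if (*cur == '-') {
            cur++;
            value -= parse_multiplicative();
        } else {
            return value;
        }
    }
}

/* Evaluate an expression made of integers, + - *, and parentheses. */
long evaluate(const char *text)
{
    assert(text != NULL);
    cur = text;
    return parse_additive();
}
```

    Adding a precedence level means adding one more function (or, in the grammar, one more nonterminal), which is how a language with many operators ends up with around 20 of them.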

    lecture #6 began here

    Things you should learn from Homework #1

    LR Parsing in a Nutshell

    Last lecture we looked at Context Free Grammars, and mentioned that parsing amounts to determining which production rules are applied in what order. We also mentioned that a pushdown automaton is the underlying model. It is worth reviewing what that looks like.

    Syntax Error Handling

    YACC has primitive error diagnostics and recovery.

    Icon employed a relatively clever approach to doing syntax error messages with YACC -- the parse state at the time of error was enough to do fairly good diagnoses. But, every time the grammar changed, the parse state numbers could change wildly. For Unicon I developed the Merr tool, which associates parse error example fragments with the corresponding diagnostic error message, and detects/infers the parse state for you, reducing the maintenance problem when changing the grammar. Merr also considers the current input token in deciding what error message to emit, making it fundamentally more precise than Icon's approach.
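
    The kind of lookup this implies might be sketched in C as follows. This is only an illustration under assumptions: the errentry struct, the state numbers, the token code 301, and the messages are all invented, and this is not the code that Merr generates. It shows the idea of preferring a match on both parse state and current token, and falling back to the parse state alone.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_TOKEN (-1)   /* wildcard: match on parse state alone */

struct errentry {
    int state;           /* parse state at the time of the error */
    int token;           /* current input token, or ANY_TOKEN */
    const char *msg;
};

/* Hypothetical entries; real state and token numbers come from the
   generated parser and change whenever the grammar changes. */
static const struct errentry errtab[] = {
    { 12, 301,       "missing then" },
    { 12, ANY_TOKEN, "invalid if expression" },
    { 47, ANY_TOKEN, "missing end" },
};

/* Pick the most specific message for a syntax error. */
const char *error_message(int state, int token)
{
    const char *fallback = "syntax error";
    size_t i;

    assert(token != ANY_TOKEN);          /* callers pass real tokens only */
    for (i = 0; i < sizeof errtab / sizeof errtab[0]; i++) {
        if (errtab[i].state != state)
            continue;
        if (errtab[i].token == token)
            return errtab[i].msg;        /* state and token both match */
        if (errtab[i].token == ANY_TOKEN)
            fallback = errtab[i].msg;    /* remember the state-only match */
    }
    return fallback;
}
```

    The maintenance problem described above is visible here: the numbers 12 and 47 are meaningless outside one particular build of the parser, so a table like this should be regenerated by a tool rather than edited by hand.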

    Syntax Analysis in Icont/Iconc

    Yes, the parser is produced by YACC, but because the grammar is used by many different tools (not just icont and iconc) the file organization is very elaborate. For one thing, the semantic actions in the YACC grammar are entirely macro-ized so their body can be defined to be different things for different tools. From grammar.h, a typical case would look like:
    proc	: prochead SEMICOL locals initial procbody END {
    		Proc1($1,$2,$3,$4,$5,$6);
    		} ;
    
    The definitions of macros such as Proc1 are in icont/tgrammar.c and the analogous cgrammar.c file. For the compilers, they generally do the typical syntax tree construction, as in
    #define Proc1(x1,x2,x3,x4,x5,x6) $$ = tree6(N_Proc,x1,x1,x4,x5,x6)
    
    In icont, there are many exceptions to this general case, as many kinds of work seem to be done during parsing. The macros do not all assign to $$, especially for unary nodes with a leaf/token on the righthand side, where the leaf can be propagated up to the parent, as for IDENT.

    For all its generality, the Icon grammar hardwires processing at a global declaration level. It seems to require code generation on a per-procedure basis, since the rule for declarations has no macro for combining them:

    decls	: ;	
    	| decls decl ;
    
    This seems to make sense for icont but not for iconc, which does substantial analysis on the entire program; iconc is most likely doing its whole program analysis using the global symbol table in which all declarations are stored. In any case there should probably be a macro for decls:decls decl.

    For another thing, the Icon grammar repeatedly uses a technique which I have scrupulously avoided: putting semantic actions in the middle of a rule.

    prochead: PROCEDURE IDENT {Prochead1($1,$2);} LPAREN arglist RPAREN {
    		Prochead2($1,$2,$3,$4,$5,$6);
    		} ;
    
    The reason to avoid this is that YACC implements a mid-rule action by introducing a hidden nonterminal with an empty (epsilon) production at that point, and adding new epsilon productions frequently introduces reduce-reduce conflicts, which can be a major problem. When is it safe to use this technique? When there is only one possible parse at the location of the epsilon. Why do it? Because (in this example) you might want to create some data structures for later rules to populate.

    Pscript

    Pscript is a 34-line program that tweaks Yacc output for use in icont/iconc. It does silly stuff, like stripping off the comments on #endif ... lines because some C compilers choke on them. This may no longer be needed on modern platforms.

    A more urgent tweak is to replace calls to yyerror() to give the Icon compilers more control over their syntax error messages. A regular call yyerror("syntax error") is replaced by a call that passes the current input token and parse state in (yyerror(yychar, yylval, yy_state)). Other messages (mainly stack overflow) are rerouted as internal parser errors rather than programmer errors.

    To get rid of pscript, the best solution would be to switch icont/iconc over to using the Merr syntax error generator tool.

    mktoktab.icn

    This 500+ line program builds lextab.h and yacctok.h from tokens.txt and op.txt. It is a totally classic Icon string-processing program, the kind for which the language was originally intended. Its purpose is to write out ugly C code from a lovely human-readable specification. One way to measure its value is to compare the sizes of tokens.txt and op.txt (130 lines) with the sizes of lextab.h and yacctok.h (710 lines). This factor of 5 reduction might justify the extra complications of obscure .txt files and scripts to process them...but mainly when a language is being modified substantially. Once a language is mature it tends to undergo fewer changes and many years might elapse without new tokens or operators.

    fixgram.icn

    This 49 line Icon program "fixes" the grammar after the C preprocessor manipulates it, in preparation to feed the grammar into YACC. The C preprocessor is mainly used to #include YACC headers and CFG's from different files. It is not clear how essential this "fixer" is.

    typespec.icn

    This 480-line program converts typespec.txt (88 lines) into icontype.h (55 lines). In order for the optimizing compiler to know about data types in the language, a fairly sophisticated mechanism for representing types is provided. Someone went to a lot of trouble to make it possible to add new types to this language. We will look at this in coming weeks.

    mkkwd.icn

    This 55-line file reads in runtime/keyword.r and writes keyword.h and kdefs.h. keyword.h is just a bunch of #define symbols, while h/kdefs.h is a bunch of KDef() macros which are used to associate the string names of keywords with integer codes. mkkwd makes it easy to add a new keyword to the language, but perhaps its logic would be better if it were part of rtt.

    trash.icn

    This 36 line program takes AT&T YACC output and removes some duplicated code for the many semantic actions for binary operators. It is probably around due to historical reasons and should be deleted.

    Syntax Analysis in rtt

    rtt uses a classic ANSI C grammar, 1100 lines long, with many extensions. src/rtt/rttgram.y is a considerably more "standard" YACC grammar with typical semantic actions that build syntax tree nodes.

    rttgram.y stores three different kinds of things on the YACC value stack: token pointers, treenode pointers, and long integers. It uses many different syntax tree node constructors, with names indicating how many children. node0..node5 all take an int label, a token pointer, and a number of node *'s. The token pointer is a lexical "anchor" for error reporting purposes. The use of 6 functions is avoiding C's ... variable arguments construct for no good reason.
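
    For comparison, a single variable-argument constructor might look like the following C sketch. The struct layout, the MAX_KIDS limit, and the name mknode are assumptions made for this illustration; they are not rtt's actual declarations.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

#define MAX_KIDS 5

struct token;                       /* opaque here; defined by the scanner */

/* Hypothetical node layout for this sketch, not rtt's actual struct. */
struct node {
    int label;                      /* which grammar construct this is */
    struct token *tok;              /* lexical anchor for error messages */
    int nkids;
    struct node *kids[MAX_KIDS];
};

/* Build a node with nkids children passed as trailing struct node * args.
   Returns NULL if allocation fails. */
struct node *mknode(int label, struct token *tok, int nkids, ...)
{
    struct node *n;
    va_list ap;
    int i;

    assert(nkids >= 0 && nkids <= MAX_KIDS);
    n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->label = label;
    n->tok = tok;
    n->nkids = nkids;
    va_start(ap, nkids);
    for (i = 0; i < nkids; i++)
        n->kids[i] = va_arg(ap, struct node *);
    va_end(ap);
    return n;
}
```

    A call such as mknode(label, tok, 2, left, right) then replaces node2(label, tok, left, right). One trade-off worth noting: with fixed-arity functions the C compiler checks the number and types of the children, while the variadic version trusts the caller to pass a correct count.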

    Syntax Analysis in GCC

    GCC's c-parse.y file is 3,170 lines of pure chewing satisfaction. Like icont/iconc, it has complex semantic actions that do a lot more than build a parse tree, and includes systematic use of macros, albeit less uniformly than Icon does.

    Syntax Analysis in Python

    Python is the only one of our reference languages not based on YACC. Grammar/Grammar gives Python's grammar in a variant of context free grammars with regular expression operators, and Parser/pgen.c is a "parser generator" a la YACC, except hardwired for Python.

    There is something to be said for a system that allows an entire large language's grammar to be specified in 107 lines of code. However, the parser generator has no "semantic action" flexibility, so modifying the parser or adapting it for other tools is a painful exploration of C code that lives nowhere near the grammar rules, and it is not easy to customize behavior on a per-rule basis as is done heavily in Unicon/Icon/GCC.

    Let's Look at Syntax Analysis Homework

    lecture #7 began here

    Preprocessing

    Preprocessing plays an important role in many languages, such as C/C++, while other languages such as Java avoid it entirely so as to maximize readability and maintainability of the source code. Many famous and catastrophic C and C++ bugs have been traced to surprises introduced by extensive use of the preprocessor in macros that were a little bit too tricky.

    Preprocessing in Icon and Unicon

    The Icon language originally did not include any preprocessor, but eventually, a simple one was introduced, with ability to include headers, define symbolic constants (macros without parameters), and handle conditional compilation (ifdef). In addition to the Icon/Unicon preprocessors, there is the question of whether Unicon itself is just a preprocessor which translates Unicon down to Icon, and whether the rtt translator is just an elaborate C preprocessor, since it translates a C superset (RTL) down to C. So, perhaps there are lots of preprocessors to talk about in these compilers.

    The Unicon Preprocessor

    The Unicon preprocessor is the only portion of the Unicon compiler not written by Clinton Jeffery. It was written in Icon by one of our "power users" (Bob Alexander), and came to Unicon by way of Jcon, an Icon-to-JVM translator. This preprocessor is written in a single 600+ line file, uni/unicon/preproce.icn.

    The external public interface of the preprocessor is line-oriented, consisting of a generator preprocessor(filename, predefinedsyms) which suspends each line of the output, one after another. Its invocation from the main() procedure looks like:

       yyin := ""
       every yyin ||:= preprocessor(fName, uni_predefs) do yyin ||:= "\n"
    
    Since the preprocessor outputs line-by-line, there is a mismatch between it and the lexical analyzer's big-inhale model. The preprocessor could be modified to fit better with the lexical analyzer or vice versa.

    The preprocessor function takes the filename to read from, along with a table of predefined symbols which allows the preprocessor to respond to lines like

    $ifdef _SQL
    
    based on what libraries are available and how Unicon was built on a given platform.

    The preprocessor() function itself starts each call off with initializations:

        static nonpunctuation
        initial {
            nonpunctuation := &letters ++ &digits ++ ' \t\f\r'
        }
    
        preproc_new(fname,predefined_syms)
    
    The initialization code opens fname, creates empty stacks to keep track of nested $ifdef's and $include's, initializes counters to 0 and so forth.

    The preprocessor is line-oriented. For each line, it looks for a preprocessor directive, and if it does not find one, it just scans for symbols to replace and returns the line. The main loop looks like

       while line := preproc_read() do line ? {
          preproc_space()       # eat whitespace
          if (="#" & match("line")) | (="$" & any(nonpunctuation)) then {
             suspend preproc_scan_directive()
             }
          else {
             &pos := 1
             suspend preproc_scan_text()
             }
          }
    
    The procedures preproc_scan_directive() and preproc_scan_text() work on special and ordinary lines, respectively. The line is not a parameter because it is held in the current string scanning environment. preproc_scan_directive() starts by discarding whitespace and identifying the first word on the line (which must be a valid preprocessor directive). A case expression handles the various directives (define, undef, ifdef, etc.). Defined symbols are stored in a table. $ifdef and $ifndef are handled using a global variable preproc_if_state to track the boolean conditions. A count of $ifdef's is maintained, in order to handle matching $endif's.

    Include files are handled using a stack, but an additional set of filenames is kept to prevent infinite recursion when files include each other. When a new include directive is encountered it is checked against the preproc_include_set and if OK, it is opened. The including file (and its associated name, line, etc) are pushed onto a list named preproc_file_stack. It is possible to run out of open files under this model, although this is not easy under modern operating systems.
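
    The recursion check can be sketched in C as follows. This is a simplified illustration, not the Unicon code: it keeps only file names, lets the stack itself double as the set by searching it linearly, and assumes that a file may be included again once it has been read to the end. The names push_include, pop_include, and MAX_INCLUDE_DEPTH are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_INCLUDE_DEPTH 16   /* arbitrary limit for this sketch */

/* Names of the files currently open, innermost last. */
static const char *include_stack[MAX_INCLUDE_DEPTH];
static int include_depth = 0;

/* Is name already being read somewhere up the stack? */
static int is_active(const char *name)
{
    int i;
    for (i = 0; i < include_depth; i++)
        if (strcmp(include_stack[i], name) == 0)
            return 1;
    return 0;
}

/* Enter a file; returns 1 on success, 0 if it would recurse or nest too deep. */
int push_include(const char *name)
{
    assert(name != NULL);
    if (include_depth == MAX_INCLUDE_DEPTH || is_active(name))
        return 0;
    include_stack[include_depth++] = name;
    return 1;
}

/* Leave the innermost file at its end of file; returns the resumed file's
   name, or NULL when the outermost file is finished. */
const char *pop_include(void)
{
    assert(include_depth > 0);
    include_depth--;
    return include_depth > 0 ? include_stack[include_depth - 1] : NULL;
}
```

    A real implementation would push the open file and line number along with the name, which is where the limit on simultaneously open files comes in.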

    Include files are searched on an include file path, consisting of a list of directories given on an optional environment variable (LPATH) followed by a list of standard directories. The standard directories are expected to be found relative to the location of the virtual machine binaries.

    The procedure preproc_scan_text has the relatively simple job of replacing any symbols by their definitions within an ordinary source line. Since macros do not have parameters, it is vastly simpler than in a C preprocessor. The main challenge is to avoid macro substitutions when a symbol is in a comment or within quotes (string or cset literals). An additional issue is to handle multiline string literals, which occur in Icon when a string literal is not closed on a line, and instead the line ends with an underscore indicating that it is continued on the next line. Skipping over quoted text sounds simple, but is trickier than it looks. Escape characters mean you can't just look for the closing quote without considering what comes before it, and you can't just look at the preceding character since it might itself have been escaped, as in "\\". The code looks similar to:

    repeat {
       while tab(upto('"\\')) do {
          case move(1) of {
             "\\": move(1)
             default: {
                break break
                }
             }
          }
       # ...
       if not match("_",,-1) then
          break
       &subject := preproc_read() | fail
       # ...
       }
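
    The same escape-handling logic can be sketched in C for comparison with the Icon above. find_closing_quote is a hypothetical helper, not code from either preprocessor; it handles a single line and leaves out the underscore continuation onto the next line.

```c
#include <assert.h>
#include <stddef.h>

/* s[start] must be the opening quote.  Returns the index of the matching
   closing quote, or -1 if the literal is not closed within s. */
long find_closing_quote(const char *s, size_t start)
{
    char quote;
    size_t i;

    assert(s != NULL);
    quote = s[start];
    assert(quote == '"' || quote == '\'');
    for (i = start + 1; s[i] != '\0'; i++) {
        if (s[i] == '\\') {
            if (s[i + 1] == '\0')
                return -1;      /* backslash is the last character */
            i++;                /* skip the escaped character, whatever it is */
        } else if (s[i] == quote) {
            return (long)i;
        }
    }
    return -1;
}
```

    Scanning left to right and consuming each backslash together with the character after it is what makes "\\" come out right: the second backslash is consumed as the escaped character, so the quote that follows is correctly seen as the closing one.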
    

    The code in preproc_read() for reading a line does a regular Icon read(); end of file causes the preprocessor file_stack to be popped for the previous file's information. Performance has not been perceived as a significant problem, but it would be interesting to convert preproc_read() to use a big-inhale model to see if any statistical difference could be observed. When an include is encountered under a big-inhale, the saved state would contain the string of remaining file contents, instead of the open file value.

    Preprocessing in Icont/Iconc

    These compilers use a preprocessor functionally identical to the one in Unicon, except written in C. This preprocessor is a single file named src/common/ipp.c, around 1K lines of code. To handle includes, a stack of the following type is used:
    typedef struct fstruct {		/* input file structure */
       struct fstruct *prev;		/* previous file */
       char *fname;				/* file name */
       long lno;				/* line number */
       FILE *fp;				/* stdio file pointer */
       int m4flag;				/* nz if preprocessed by m4 */
       int ifdepth;				/* $if nesting depth when opened */
       } infile;
    
    Compared with Unicon, there is extra code for handling standard in ("-") and a command-line option to pipe the input through the m4 preprocessor before the Icon preprocessor works on it! This might be overengineering a bit, and it is a portability problem (works only on UNIX).

    The list of directives is given as an array whose elements include the name (used in an old-fashioned linear search) and a function pointer for handling the directive. The array of function pointers helps organize the code, but one can't help thinking a binary search or a hash table might be good. Then again, the search is performed only once per preprocessor directive encountered, directives are rare in most programs, and there are only 10 of them to search through.
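
    A table of this shape might look like the following C sketch. The three handlers and their return codes are invented so that the dispatch is observable; they are hypothetical and are not ipp.c's actual functions or signatures.

```c
#include <assert.h>
#include <string.h>

typedef int (*handler_fn)(const char *args);

/* Hypothetical handlers: each returns a code so the dispatch is observable. */
static int do_define(const char *args)  { (void)args; return 1; }
static int do_include(const char *args) { (void)args; return 2; }
static int do_ifdef(const char *args)   { (void)args; return 3; }

static const struct directive {
    const char *name;
    handler_fn handle;
} directives[] = {
    { "define",  do_define  },
    { "include", do_include },
    { "ifdef",   do_ifdef   },
};

/* Linear search by name; returns the handler's result, or -1 if unknown. */
int dispatch(const char *name, const char *args)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof directives / sizeof directives[0]; i++)
        if (strcmp(directives[i].name, name) == 0)
            return directives[i].handle(args);
    return -1;
}
```

    Adding a directive means adding one handler and one table row, and with so few rows the linear search costs at most a handful of strcmp() calls per directive line.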

    The interface of this C-language preprocessor is character-oriented, not line-oriented. This allows function ppch() to substitute for whatever character fetching function the lexical analyzer was using prior to the preprocessor's existence. ppch() has some semi-complicated buffering and is a gigantic, messy function (150 lines).

    Symbol definitions are stored in yet another implementation of C hash tables. There are several in the compiler, each a little different but mostly redundant.

    The RTT Preprocessor

    The runtime translator uses an extended ANSI C preprocessor, so large (6.5K lines) that it requires its own directory src/preproc/ for 15 source files. It can be compiled as a stand-alone executable, as per UNIX cpp, by linking in preproc/pmain.c which is not linked into the rtt executable.

    Since it does a similar job to ipp.c, only more, there are a lot of similar features in rtt's preprocessor. As far as I know, no attempt was made to share code. ANSI C has sticky, detailed rules for how macro parameters are applied, which are better than the pre-ANSI C language whose preprocessor semantics were not entirely well-defined. Besides the complex parameterized macros, a C preprocessor has to have a mini-expression evaluation mechanism to handle complex boolean conditional expressions such as

       #if (THIS || THAT) && (defined(OTHER) || (ALT1 && ALT2))
    
    This almost amounts to the whole C expression evaluation rules, all over again in the preprocessor (preproc/evaluate.c).

    Because RTT is written in C and Icon runs on many operating systems, there is a fair amount of conditional code in the preprocessor, especially in the handling of the search through standard system directories for include files. UNIX has certain conventions, but many platforms have multiple C compilers installed, and some platforms (anyone use MVS?) have complicated naming for system include's. Some of this #ifdef code probably needs to be removed as it refers to dead compilers on ancient platforms.

    lecture #8 began here

    Semantic Analysis

    This is the first of several lectures that could be considered to fall in the category of semantic analysis. Semantic analysis can be defined different ways. In a compiler course devoted to traditional languages, it mainly means symbol tables and type checking. For our purposes, the analysis of a program's meaning mainly consists of that portion of the compiler that revolves around examination of the syntax tree. Semantic analysis usually adds a lot of information to the tree, besides checking for errors.

    Ultimately, semantics means meaning, and the true "meaning" of a program isn't just a property of the source code, but of the semantics of the language and of the underlying platform (the CPU or virtual machine, the runtime system, and the operating system). For this reason we will be studying semantics long after we move beyond syntax trees and the information the compiler is able to add to them. Similarly, the analyses performed by the optimization phases of a compiler revolve around understanding the program's semantics well enough to handle special cases while preserving correct behavior.

    Semantic Analysis Comments on Unicon

    The Unicon translator's semantic analysis is minimal, and revolves mainly around object-oriented features such as inheritance and package imports. Before we can look at those things, we need to look at the syntax tree structure.

    In conventional YACC, a %union declaration is necessary to handle the varying types of objects on the value stack, including the type used for syntax tree nodes. iyacc has no need of this awkward mechanism: the value stack, like all structure types, can hold any type of value in each slot. Similarly, tree nodes can hold children of any type, potentially eliminating any awkwardness of mixing tokens and internal nodes. Of course, you do still have to check what kind of value you are working with.

    Parse Tree Nodes

    uni/unicon/tree.icn contains procedures to handle the syntax tree node data type, including both the following declaration and the yyprint() traversal function we'll be discussing in today's lecture.
    record treenode(label, children)
    
    holds one node worth of information. For convenience, a procedure node(label, kids[]) takes an arbitrary number of parameters and constructs the list of children for you. Leaves have a null children field.

    "Code Generation" in the Unicon Translator

    In a regular preprocessor, there is no code generation, there is a text-filter model in which the preprocessor writes out (modified) versions of the lines it reads in. In the Unicon translator, the code that is written out is produced by a traversal of the syntax tree. The same technique might be used by a "pretty printer". We will explore this aspect of the Unicon translator as the best available demonstration of working with Unicon syntax trees. Later on we will consider more "real" code generation in the virtual machine and the optimizing compiler.

    Earlier we saw that the start symbol of the Unicon grammar had a semantic action that called a procedure Progend(). We will cover most of that procedure next week since it is all about object-orientation, but at the end of Progend(), a call to yyprint() performs the tree traversal for code generation. A classic tree traversal pattern would look like:

    procedure traverse(node)
       if node is an internal node {
          every child := ! node.children do traverse(child)
          generate code for this internal node (postfix)
          }
       else
          generate code for this leaf
    end
    
    The code generator traversal yyprint() is a lot more complicated than that, but fits the general pattern. The main work done at various nodes is to write some text to the output file, yyout. Most ordinary internal nodes are of type treenode as described above. But because there are several kinds of internal nodes and several kinds of leaves, the "if node is an internal node" is implemented as a case expression. Besides a regular treenode, the other kinds of internal nodes are objects of type declaration, class, and argument list. For regular treenodes, another case expression on the node's label field is used to determine what kind of code to generate, if any, besides visiting children and generating their code.
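
    For readers more at home in C, the classic pattern can be sketched as follows. The tnode struct and the emit function are hypothetical and are not the layout used by any of the compilers discussed here; leaves carry a lexeme, internal nodes carry children, and the traversal writes leaf lexemes into a buffer in source order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KIDS 3

/* Hypothetical node type: a leaf carries a lexeme, an internal node
   carries children.  Not the layout used by any of the compilers here. */
struct tnode {
    const char *lexeme;               /* non-NULL only for leaves */
    int nkids;
    struct tnode *kids[MAX_KIDS];
};

/* Append the lexemes of all leaves under n to out, separated by spaces,
   visiting children left to right.  out must hold at least outsize bytes
   and already contain a NUL-terminated string. */
void emit(const struct tnode *n, char *out, size_t outsize)
{
    int i;

    assert(n != NULL && out != NULL);
    if (n->lexeme != NULL) {                       /* leaf: write the token */
        size_t used = strlen(out);
        size_t need = strlen(n->lexeme) + (used > 0 ? 1 : 0);
        if (used + need < outsize) {               /* silently drop on overflow */
            if (used > 0)
                strcat(out, " ");
            strcat(out, n->lexeme);
        }
        return;
    }
    for (i = 0; i < n->nkids; i++)                 /* internal: visit children */
        emit(n->kids[i], out, outsize);
    /* a real generator would add node-specific output here (postfix) */
}
```

    For a three-child internal node over the leaves x, +, and 1, the buffer ends up holding "x + 1". A generator that only ever took the default branch would reproduce its input token by token, which is the starting point for the special cases that follow.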

    The default behavior for an internal node is to just visit the children, generating their code. For ordinary syntax constructs (if, while, etc.) this works great and a copy of the code is written out, token by token. But several exceptions occur, mainly for the pieces of Unicon syntax that extend Icon's repertoire. For example, packages and imports are not in Icon and require special treatment.

    procedure yyprint(node)
       static lasttok
       case type(node) of {
          "treenode" : {
    	 case node.label of {
    	 "package": { } # handled by semantic analysis
    	 "import": { print_imports(node.children[2]) }
             # implement packages via name mangling
             "packageref": {
    	     if *node.children = 2 then
    		 yyprint(node.children[2]) # ::ident
    	     else { # ident :: ident
    		yyprint(node.children[1])
    		writes(yyout, "__")
    	        outcol +:= ((* writes(yyout, node.children[3].s)) + 2)
    		}
    	    }
    
    New syntax constructs such as procedure parameter defaults and type restrictions, and variable initializers, are other examples where the default traversal would output things illegal in Icon. They are implemented by skipping some of the children (assignment and value) in the regular pass, and adding extra code elsewhere, discussed below.
    	 "varlist2"|"stalist2": { yyprint(node.children[1]) }
    	 "varlist4"|"stalist4": {
    	    yyprint(node.children[1])
    	    yyprint(node.children[2])
    	    yyprint(node.children[3])
    	    }
    
    Much of this special logic is orchestrated by the code for traversing a procedure; it can visit its arguments and variable declarations and apply special rules to them.
    	 "proc": {
    	    yyprint(node.children[1])
    	    every yyprint(node.children[2 to 3])
    	    if exists_statlists(node.children[3]) then {
    	       ini := node.children[4]
    	       yyprint("\ninitial {")
                   if ini ~=== EmptyNode then { # append into existing initial
    		  yyprint(ini.children[2])
    		  yyprint(";\n")
    	          }
    	       yystalists(node.children[3])
    	       yyprint("\n}\n")
    	       }
    	    else
    	       every yyprint(node.children[4])
    	    (node.children[1].fields).coercions()
                yyvarlists(node.children[3])
    	    yyprint(node.children[5])
    	    yyprint(node.children[6])
    	    }
    
    The default behavior of visiting one's children is very simple, as is the handling of other kinds of internal nodes, which are objects. For the objects, a method Write() is invoked.
             "error": fail
    	 default:
                every yyprint(!node.children)
             }
          "declaration__state" | "Class__state" | "argList__state":
    	 node.Write(yyout)
    
    The outer case expression of yyprint() continues with various kinds of leaf (token) nodes. These mainly know how to write their lexemes out. But, a lot of effort is made to try to keep line and column number information consistent. Variables outline and outcol are maintained as each token is written out. Integers and string literals found in the syntax tree are written out as themselves. Since they have no attached lexical attributes, they are a bit suspect in terms of maintaining debugging consistency. It turns out the reason they occur at all, and the reason they have no source lexical attributes, is that artificial syntax subtrees are generated to handle certain object-oriented constructs, and within those subtrees strings and integers may be placed, which do not correspond to anywhere in the source code.
          "integer": {
    	 writes(yyout, node); outcol +:= *string(node)
    	 }
          "string": {
             node ? {
    	    while writes(yyout, tab(find("\n")+1)) do {
    	       outline+:=1; outcol:=1;
    	       }
    	    node := tab(0)
    	    }
    	 writes(yyout, node); outcol +:= *node
             }
    
    "Normally", tokens are written out at exactly the line and column they appear at in the source code. But a myriad of constructs may bump them around. If the output falls behind (in lines, or columns) extra whitespace can be inserted to stay in sync. If output gets ahead by lines, a #line directive can back it up, but if output gets ahead by columns, there is nothing much one can do, except make sure subsequent tokens don't accidentally get attached/concatenated onto earlier tokens. This occurs, for example, when the output code for an object-oriented construct in an expression is longer than the source expression, perhaps due to name mangling. Specific token combinations are checked, but the list here may be incomplete (possible BUG!). For source tokens, not only might the line and column change, the filename could be different as well.
          "token": {
    	 if outfilename ~== node.filename | outline > node.line then {
    	    write(yyout,"\n#line ", node.line-1," \"", node.filename,"\"")
    	    outline := node.line
    	    outcol := 1
    	    outfilename := node.filename
    	    }
    	 while outline < node.line do {
    	    write(yyout); outline +:= 1; outcol := 1
    	    }
    	 if outcol >= node.column then {
                # force space between idents and reserved words, and other
                # deadly combinations (need to add some more)
                if ((\lasttok).tok = (IDENT|INTLIT|REALLIT) & reswords[node.s][2]~=IDENT)|
    		(((\lasttok).tok = NMLT) & (node.tok = MINUS)) |
    		((\lasttok).tok = node.tok = PLUS) |
    		((\lasttok).tok = node.tok = MINUS) |
    		((reswords[(\lasttok).s][2]~=IDENT) & (node.tok=(IDENT|INTLIT|REALLIT)))|
    	        ((reswords[(\lasttok).s][2]~=IDENT) & (reswords[node.s][2]~=IDENT))
    		   then
    	       writes(yyout, " ")
    	    }
    	 else
    	    while outcol < node.column do { writes(yyout, " "); outcol +:= 1 }
    
    Most tokens' lexemes are finally written out by writing node.s:
    
    	 writes(yyout, node.s)
    	 outcol +:= *node.s
    	 lasttok := node
    	 }
          "null": { }
          default: write("its a ", type(node))
          }
    end
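
    The synchronization rules in this part of yyprint() can be sketched outside of Unicon. The Python fragment below is a minimal illustration, not the translator's code: the Emitter class, its field names, and the simplified "needs a space" test are assumptions made for the example, standing in for the reswords-based checks above. It follows the convention visible in the generated code, where a directive "#line N" makes the next line number N+1.

```python
import io

class Emitter:
    """Toy model of the token-writing logic: keeps output line/column in sync."""

    def __init__(self, filename):
        self.out = io.StringIO()
        self.filename = filename
        self.line = 1
        self.col = 1
        self.last = None  # lexeme of the previous token, if any

    def _needs_space(self, lexeme):
        # Simplified stand-in for the translator's list of "deadly combinations".
        if self.last is None:
            return False
        a, b = self.last[-1], lexeme[0]
        if (a.isalnum() or a == "_") and (b.isalnum() or b == "_"):
            return True                   # two words or numbers would fuse
        return a == b and a in "+-"       # keep "+ +" from becoming "++"

    def token(self, lexeme, line, col, filename=None):
        filename = filename or self.filename
        if filename != self.filename or self.line > line:
            # Output is ahead by lines (or in another file): resynchronize.
            self.out.write('\n#line %d "%s"\n' % (line - 1, filename))
            self.line, self.col, self.filename = line, 1, filename
        while self.line < line:           # output is behind by lines: pad
            self.out.write("\n")
            self.line += 1
            self.col = 1
        if self.col >= col:               # output is ahead by columns
            if self._needs_space(lexeme):
                self.out.write(" ")
        else:                             # output is behind by columns: pad
            self.out.write(" " * (col - self.col))
            self.col = col
        self.out.write(lexeme)
        self.col += len(lexeme)
        self.last = lexeme
```

    Feeding it the tokens x, :=, A_m, y with source columns 1, 3, 6, 7 shows the interesting case: the mangled name A_m is longer than the source text it replaced, so y arrives "late" and is separated from A_m by a single forced space.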
    

    Keywords

    Besides the large set of interesting reserved words, Icon and Unicon have another set of predefined special words called keywords. These words are prefixed by an ampersand, for example, &subject holds the current "subject" string being examined by string scanning. A procedure Keyword(x1,x2) semantically checks that an identifier following a unary ampersand is one of the valid keyword names. The valid names are kept in a set data structure.
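
    As a rough illustration of that check, here is a small Python sketch. The function name check_keyword and the error type are hypothetical, and the set holds only a handful of genuine Icon keyword names rather than the full list the translator uses.

```python
# A deliberately partial list of Icon keyword names, for illustration only.
KEYWORDS = {"subject", "pos", "null", "fail", "letters", "digits",
            "input", "output", "errout", "clock", "date", "time"}

def check_keyword(ident):
    """Return the spelling "&ident" if ident names a keyword, else raise."""
    if ident not in KEYWORDS:
        raise SyntaxError("&" + ident + " is not a valid keyword")
    return "&" + ident
```

    A set makes the membership test cheap and keeps the list of valid names in one place.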

    lecture #9 began here

    Object Oriented Facilities

    Unicon features classes, packages, and a novel multiple inheritance mechanism. These items are implemented entirely within the Unicon translator. The Icon virtual machine thus far has only the slightest of extensions for object-orientation; specifically, the dot operator has been extended to handle objects and method invocation.

    The Unicon OOP facilities were originally prototyped as a semester class project in a "special topics" graduate course. Writing the prototype in a very high-level language like Icon, and developing it as a preprocessor with name mangling, allowed the initial class mechanism to be developed in a single evening, and a fairly full, usable system with working inheritance to be developed in the first weekend. By the end of the semester, the system was robust enough to write it in itself, and it was released to the public shortly afterwards as a package for Icon called "Idol". Many many improvements were made after this point, often at the suggestion of users.

    An initial design goal was to make the absolute smallest additions to the language that were necessary to support object-orientation. Classes were viewed as a version of Icon's record data type, retaining its syntax for fields (member variables), but appending a set of associated procedures. Because records have no concept of public and private, neither did classes. Another graduate student criticized this lack of privacy, and for several versions, everything was made private unless an explicit public keyword was used. But eventually support for privacy was dropped on the grounds that it added no positive capabilities and was un-Iconish. The existence of classes with hundreds of "getter" and "setter" methods was considered a direct proof that "private" was idiotic in a rapid prototyping language.

    The Code Generation Model for Classes

    "unicon -E foo" will show you what code is generated for Unicon file foo.icn. If foo.icn contains classes, you can enjoy the code generation model and experiment to see what it does under various circumstances. As a first example, consider
    class A(x,y)
       method m()
          write("hello")
       end
    end
    
    These five lines generate 25 lines for Icont to translate into virtual machine code. The first two lines are line directives showing from whence this source code originated:
    #line 0 "/tmp/uni13804206"
    #line 0 "a.icn"
    
    Global declarations (including procedures) would be passed through the preprocessor pretty nearly intact, but for the class, we get a bunch of very different code. Methods are written out, with names mangled to a classname_methodname format.
    procedure A_m(self)
    
    
    #line 2 "a.icn"
         write("hello");
    end
    
    Two record types are defined, one for the class instances and one for the "methods vector", or "operation record". The methods vector is instantiated exactly once in a global variable in classname__oprec format.
    record A__state(__s,__m,x,y)
    record A__methods(m)
    global A__oprec
    
    The default constructor for a class takes fields as parameters and uses them directly for initialization purposes. The first time it is called, a methods vector is created. Instances are given a pointer to themselves in an __s field (mainly for historical reasons) and to the methods vector in an __m field. Current NMSU grad student Sumant Tambe did an independent study project to get rid of __s and __m with partial success, but his work is not finished or robust enough to be enabled by default.
    procedure A(x,y)
    local self,clone
    initial {
      if /A__oprec then Ainitialize()
      }
      self := A__state(&null,A__oprec,x,y)
      self.__s := self
      return self
    end
    
    procedure Ainitialize()
      initial A__oprec := A__methods(A_m)
    end
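
    To see how the pieces of this model fit together at run time, here is a minimal Python sketch of the same layout. It is an analogy, not generated code: dictionaries stand in for the A__state and A__methods records, the method returns its string instead of writing it, and the helper invoke() is a hypothetical stand-in for what the extended dot operator does.

```python
A__oprec = None   # the single, shared methods vector for class A

def A_m(self):
    # Method m of class A, with its name mangled to classname_methodname.
    return "hello"

def A(x, y):
    """Constructor: builds the methods vector on first use, then an instance."""
    global A__oprec
    if A__oprec is None:
        A__oprec = {"m": A_m}
    self = {"__s": None, "__m": A__oprec, "x": x, "y": y}
    self["__s"] = self            # the instance points to itself
    return self

def invoke(obj, name, *args):
    # What a call obj.name(args) amounts to under this model (an assumption).
    return obj["__m"][name](obj, *args)
```

    Every instance carries the same __m reference, so adding an instance costs one pointer for method lookup no matter how many methods the class has.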
    

    Symbols and Scope Resolution

    One of the basic aspects of semantic analysis is: for each variable, where was it declared, so we can identify its address, etc. Unicon inherits from Icon the curious convenience that variables do not have to be declared: they are local by default. This feature is implemented by deferring the local vs. global decision until link time, so the Unicon translator has no local vs. global issues. Class variables, however, have to be identified, and looked up relative to the implicit "self" variable. A family of procedures in uni/unicon/tree.icn with names starting "scopecheck" go through the syntax tree looking for such class variables. Like most tree traversals, this is a recursive process, and since local and parameter declarations override class variables, there are helper functions to walk through subtrees building mini-symbol tables such as local_vars in scopecheck_proc(node):
       # Build local_vars from the params and local var expressions.
       local_vars := set()
       extract_identifiers(node.children[1].fields, local_vars)
       extract_identifiers(node.children[3], local_vars)
    

    Eventually, every identifier in every expression is checked against local_vars, and if not found there, against the class variables stored in a variable self_vars:

       self_vars := set()
       every insert(self_vars, c.foreachmethod().name)
       every insert(self_vars, c.foreachfield())
       every insert(self_vars, (!c.ifields).ident)
       every insert(self_vars, (!c.imethods).ident)
    
    For an IDENT node, the tests boil down to:
       if node.tok = IDENT then {
          if not member(\local_vars, node.s) then {
             if member(\self_vars, node.s) then
                node.s := "self." || node.s
             else 
                node.s := mangle_sym(node.s)
          }
       }
    
    Undeclared locals and globals are mangled to include the current package name if there is one.
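
    The lookup order can be restated compactly. The Python sketch below mirrors the IDENT test above; the function names and the package__name mangling scheme are assumptions chosen for illustration, not a description of what mangle_sym() actually produces.

```python
def mangle_sym(name, package=None):
    # Hypothetical scheme: prefix the symbol with the package name, if any.
    return package + "__" + name if package else name

def resolve_ident(name, local_vars, self_vars, package=None):
    """Rewrite one identifier according to local > class > package order."""
    if name in local_vars:
        return name                  # locals and parameters win
    if name in self_vars:
        return "self." + name        # class variable or method
    return mangle_sym(name, package)
```

    For example, with local_vars = {"i"} and self_vars = {"i", "x"}, the identifier i stays i (the local overrides the class variable), while x becomes self.x.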

    Inheritance

    Inheritance means: creating a class that is similar to an existing class. In object-oriented literature there is "abstract inheritance" in which a class supports all the same operations with the same signatures, and there is concrete inheritance in which actual code is shared. Early object-oriented languages supported only concrete inheritance, while more recent languages tend to discourage it. Unicon is not typed at compile time, so abstract inheritance is not a big deal. There are abstract methods, and classes whose every method is abstract, but the use of abstract is mainly for documentation: subclass authors must provide certain methods. Anyhow, the syntax of inheritance in Unicon is
    class subclass : super1 : super2 : ... ( ...fields... )
    
    The semantics of inheritance, and particularly of multiple inheritance, are interesting in Unicon; the implementation is relatively simple. An example of inheritance is given by class Class, from uni/unicon/idol.icn
    class declaration(name,fields,tag,lptoken,rptoken)
       ...
    end
    ...
    class Class : declaration (supers,
    			   methods,
    			   text,
    			   imethods,
    			   ifields,
    			   glob,
    			   linkfile,
    			   dir,
    			   unmangled_name,
    			   supers_node)
    
    
    A unique perspective on inheritance in Unicon comes from the actual acquisition of inherited data fields and methods by the subclass. Some object-oriented languages do this inheritance "by aggregation", creating a copy of the superclass in the subclass. This is fine, but it makes "overriding" an anomaly, when overriding the parent with new/different behavior is entirely routine. Unicon instead inherits by the child looking for things in the parent (and the parent's parent, etc.) that it does not already have. In the above example, class Class effectively appends 5 fields from class declaration onto the end of its field list. The generated code for instances looks like
    record Class__state(__s,__m,
                        supers,methods,text,imethods,ifields,
                        glob,linkfile,dir,unmangled_name,supers_node,
                        name,fields,tag,lptoken,rptoken)
    
    The inheritance semantics is called "closure based" because the process of looking for things to add from parent superclasses iterates until no new information can be added, after which the subclass is said to be closed on its parents. Other forms of closure appear frequently in CS.
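
    The field layout of Class__state can be reproduced with a few lines of Python. This is only a sketch of the "add what the child does not already have" rule; the function name and the dictionary of parent fields are made up for the example.

```python
def inherited_fields(own, parents_in_order, fields_of):
    """Child's own fields first, then any parent fields it lacks."""
    result = list(own)
    for parent in parents_in_order:
        for field in fields_of[parent]:
            if field not in result:   # the child's declaration takes priority
                result.append(field)
    return result

fields_of = {"declaration": ["name", "fields", "tag", "lptoken", "rptoken"]}
own = ["supers", "methods", "text", "imethods", "ifields",
       "glob", "linkfile", "dir", "unmangled_name", "supers_node"]
layout = ["__s", "__m"] + inherited_fields(own, ["declaration"], fields_of)
```

    Here layout has the same 17 names, in the same order, as the record declaration above. A child that redeclares a parent field simply keeps its own copy in its own position, which is why overriding needs no special case.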

    lecture #10 began here

    Addendum on Multiple Inheritance in C++

    Ziad found that it is a compile error to call a method inherited from two superclasses without specifying which class to use! Dick Brodine pointed out that C++ requires you to disambiguate which class's method m() you wish to call via explicit name by writing o.super1::m() instead of o.m(). The question being asked wasn't whether there was a way to select a specific class' methods, but what the default behavior was, so Ziad's answer is still correct.

    (Java answer: there is no concrete multiple inheritance in Java, only abstract multiple inheritance via interfaces.)

    Implementing Multiple Inheritance in Unicon

    The actual code in the Unicon translator is, by analogy to transitive closure, looking for things to inherit via a depth-first traversal of the inheritance graph. Multiple inheritance can be separated out into two portions:
    1. a method transitive_closure() that finds all superclasses and provides a linearization of them, flattening the graph into a single ordered list of all superclasses
    2. a method resolve() that walks the list and looks for methods and fields to add.

    Method transitive_closure() is one of the cleaner demonstrations of why Unicon is a fun language in which to write complex algorithms. It walks through a class graph, yet it is not recursive.

      method transitive_closure()
        count := supers.size()
        while count > 0 do {
    	added := taque()
    	every sc := supers.foreach() do {
    	  if /(super := classes.lookup(sc)) then
    	    halt("class/transitive_closure: couldn't find superclass ",sc)
    	  every supersuper := super.foreachsuper() do {
    	    if / self.supers.lookup(supersuper) &
    		 /added.lookup(supersuper) then {
    	      added.insert(supersuper)
    	    }
    	  }
    	}
    	count := added.size()
    	every self.supers.insert(added.foreach())
        }
      end
    
    Now, given what I've said about Unicon providing a depth-first inheritance hierarchy semantics, what is wrong with this picture? The code is stable and hasn't needed changes in several years, so I am not fishing for syntax bugs, or claiming that there is a bug. But there is something odd. A chocolate "peanut butter cup" is available in my office for the first correct description of the problem.

    The method resolve() within class Class finds the inherited fields and methods from the linearized list of superclasses.

      #
      # resolve -- primary inheritance resolution utility
      #
      method resolve()
        #
        # these are lists of [class , ident] records
        #
        self.imethods := []
        self.ifields := []
        ipublics := []
        addedfields := table()
        addedmethods := table()
        every sc := supers.foreach() do {
    	if /(superclass := classes.lookup(sc)) then
    	    halt("class/resolve: couldn't find superclass ",sc)
    	every superclassfield := superclass.foreachfield() do {
    	    if /self.fields.lookup(superclassfield) &
    	       /addedfields[superclassfield] then {
    		addedfields[superclassfield] := superclassfield
    		put ( self.ifields , classident(sc,superclassfield) )
    		if superclass.ispublic(superclassfield) then
    		    put( ipublics, classident(sc,superclassfield) )
    	    } else if \strict then {
    		warn("class/resolve: '",sc,"' field '",superclassfield,
    		     "' is redeclared in subclass ",self.name)
    	    }
    	}
    	every superclassmethod := (superclass.foreachmethod()).name() do {
    	    if /self.methods.lookup(superclassmethod) &
    	       /addedmethods[superclassmethod] then {
    		addedmethods[superclassmethod] := superclassmethod
    		put ( self.imethods, classident(sc,superclassmethod) )
    	    }
    	}
    	every public := (!ipublics) do {
    	    if public.Class == sc then
    		put (self.imethods, classident(sc,public.ident))
    	}
        }
      end
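
    For readers who want to experiment with the two steps away from the translator, here is a Python transliteration of their core loops. It is a sketch under simplifying assumptions: classes is a plain dictionary, only methods are resolved, and the public-field and strict-mode handling of resolve() is left out.

```python
def transitive_closure(cls, classes):
    """Round-by-round collection of all superclasses of cls, as in the loop above."""
    supers = list(classes[cls]["supers"])
    count = len(supers)
    while count > 0:
        added = []
        for sc in supers:
            if sc not in classes:
                raise KeyError("couldn't find superclass " + sc)
            for supersuper in classes[sc]["supers"]:
                if supersuper not in supers and supersuper not in added:
                    added.append(supersuper)
        count = len(added)
        supers.extend(added)
    return supers

def resolve_methods(cls, classes):
    """Walk the linearized list; the first class offering a method wins."""
    own = set(classes[cls]["methods"])
    seen = set()
    inherited = []                    # (superclass, method) pairs
    for sc in transitive_closure(cls, classes):
        for method in classes[sc]["methods"]:
            if method not in own and method not in seen:
                seen.add(method)
                inherited.append((sc, method))
    return inherited
```

    With C inheriting from B and B from A, where A defines m and n, B defines m, and C defines p, resolve_methods("C", classes) yields B's m and A's n: B's m shadows A's because B comes first in the list.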
    

    Class and Package Specifications

    In the "old days" of Unicon's ancestor Idol, you could only inherit from a class that appeared in the same source file. Anything else poses a librarian's problem of identifying from what file to inherit. Java, for instance, takes a brute-force approach of one class per file.

    Unicon generates in each source directory an NDBM database (named uniclass.dir and uniclass.pag) that includes a mapping from class name to: what file the class lives in, plus, what superclasses, fields, and methods appear in that class. From these specifications, "link" declarations are generated for superclasses within subclass modules, plus the subclass can perform inheritance resolution. The code to find a class specification is given in idol.icn's fetchspec(). A key fragment looks like

       if f := open(dir || "/" || env, "dr") then {
          if s := fetch(f, name) then {
    	 close(f)
    	 return db_entry(dir, s)
    	 }
          close(f)
          }
    

    Unicon searches for "link" declarations in a particular order, given by the current directory followed by directories in an IPATH (Icode path, or perhaps Icon path) environment variable, followed by system library directories such as ipl/lib and uni/lib. This same list of directories is searched for inherited classes.

    The string stored in uniclass.dir and returned from fetch() for class Class is:

    idol.icn
    class Class : declaration(supers,methods,text,imethods,ifields,glob,linkfile,dir,unmangled_name,supers_node)
    ismethod
    isfield
    Read
    ReadBody
    has_initially
    ispublic
    foreachmethod
    foreachsuper
    foreachfield
    isvarg
    transitive_closure
    writedecl
    WriteSpec
    writemethods
    Write
    resolve
    end
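
    Judging only from the single entry shown above, a specification consists of a file name, a class header, one method name per line, and a closing "end". The Python sketch below parses that shape; the real format may carry more than this example reveals, and parse_spec is a hypothetical helper, not part of idol.icn.

```python
import re

def parse_spec(entry):
    """Split one class specification string into its parts."""
    lines = entry.strip().splitlines()
    filename, header, body = lines[0], lines[1], lines[2:]
    if body and body[-1] == "end":
        body = body[:-1]              # drop the terminator
    m = re.match(r"class\s+(\w+)((?:\s*:\s*\w+)*)\s*\((.*)\)\s*$", header)
    if m is None:
        raise ValueError("unrecognized class header: " + header)
    supers = [s.strip() for s in m.group(2).split(":") if s.strip()]
    fields = [f.strip() for f in m.group(3).split(",") if f.strip()]
    return {"file": filename, "name": m.group(1),
            "supers": supers, "fields": fields, "methods": body}
```

    This is all a subclass in another file needs in order to perform inheritance resolution: the superclasses to chase, the fields and methods to inherit, and the file to link.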
    

    Unicon's Progend() revisited

    Having presented scope resolution, inheritance, and importing packages and inheriting classes from other files via the uniclass.dir NDBM files, we can finally show the complete semantic analysis in the Unicon compiler, prior to writing out the syntax tree as Icon code:
    procedure Progend(x1)
       
       package_level_syms := set()
       package_level_class_syms := set()
       set_package_level_syms(x1)
       scopecheck_superclass_decs(x1)
    
       outline := 1
       outcol := 1
       #
       # export specifications for each class
       #
       native := set()
       every cl := classes.foreach_t() do {
          cl.WriteSpec()
          insert(native, cl)
          }
       #
       # import class specifications, transitively
       #
       repeat {
          added := 0
          every super := ((classes.foreach_t()).foreachsuper() | !imports) do {
             if /classes.lookup(super) then {
    	    added := 1
    	    readspec(super)
    	    cl := classes.lookup(super)
    	    if /cl then halt("can't inherit class '",super,"'")
    	    iwrite("  inherits ", super, " from ", cl.linkfile)
    	    writelink(cl.dir, cl.linkfile)
    	    outline +:= 1
                }
           }
        if added = 0 then break
      }
      #
      # Compute the transitive closure of the superclass graph. Then
      # resolve inheritance for each class, and use it to apply scoping rules.
      #
      every (classes.foreach_t()).transitive_closure()
      every (classes.foreach_t()).resolve()
    
      scopecheck_bodies(x1)
    
       if \thePackage then {
          every thePackage.insertsym(!package_level_syms)
          }
    
       #
       # generate output
       #
       yyprint(x1)
       write(yyout)
    end
    

    Other Object-oriented Issues in Unicon

    The primary mechanisms for object-oriented programming that we have discussed so far include: classes, method invocation, inheritance. There were certainly a few parts we glossed over (like how a$super.m() is implemented.) The main way to look for additional issues we skipped is to read uni/unicon/idol.icn, which handles all the object-oriented features and comes from the original Idol preprocessor. Here are some thoughts from a scan of idol.icn:

    An Aside on Public Interfaces and Runtime Type Checking

    Object-oriented facilities are usually discussed in the context of large complex applications where software engineering is an issue. We don't usually need OOP for 100 line programs, but for 10,000+ line programs it is often a big help.

    Besides classes and packages, Unicon adds to Icon one additional syntax construct in support of this kind of program: type checking and coercion of parameters. Parameters and return values are the points at which type errors usually occur, during an integration phase in a large project where one person's code calls another. The type checking and coercion syntax was inspired by the type checks done by the Icon runtime system at the boundary where Icon program code calls the C code for a given function or operator.

    One additional comment about types is that the lack of types in declarations for ordinary variables such as "local x" does not prevent the Icon compiler iconc from determining the exact types of well over 90% of uses at compile time using type inference. Type checking can generally be done at compile time even if variable declarations do not refer to types... as long as the type information is available across file and module boundaries.

    Let's Look at Homework #3

    lecture #11 began here

    The Good

    I have put up the current Icon/Unicon Implementation Book draft at http://www.cs.nmsu.edu/~jeffery/courses/580/ib.pdf. An increasing amount of material from this book will become relevant to lectures in the coming weeks.

    The Bad

    The Icon/Unicon Implementation book is far from "all together" in one piece. Chapters 1-12 are there, but I am still pulling in chapters from multiple sources. And, the text of the current draft still needs titanic code updates. So: if you print out the whole book now, you will have to reprint most of what's there, along with printing out all the new chapters as they come online. Feel free to print or read anything you like, but you should probably focus your reading attention on assigned chapters only: I will assign chapters when they are relevant, and update the online book edition, especially for the chapters that I assign.

    The Ugly

    This book production effort is going to make it hard for me to do my grading promptly. I will try to get far enough into the grading to identify major problems (such as: your compiler won't build for me). Please come see me with questions or for more specific feedback on your assignments.

    Object-orientation: Implementation in Other Languages

    For C++ implementation, a good reference is Stroustrup's "The Design and Evolution of C++". Also, see "Inside the C++ Object Model", by Stan Lippman. C++ descends from Simula, and acts in many ways as an allergic reaction to Smalltalk, the leading object-oriented language for many years. The goals were execution speed and safety for systems programming, as opposed to Smalltalk's goals of flexibility and programmer convenience for applications design and prototyping. But the main advantage C++ had initially was the syntax based on C (this is also Java's advantage). The main advantage Smalltalk had was passionate support of the research community, and its incredible graphical IDE.

    C++ was not very large to begin with, but after its initial development, C++ greatly changed and grew almost every mechanism any programmer could want; especially, it seemed to play the doppelganger on Ada, acquiring every feature of Ada necessary to kill it and take its place. A C++ compiler is a very large and complicated undertaking.

    AT&T C++ was very buggy in 1985 and it took several years for compilers to stabilize and for the language to commercialize successfully. This would have failed had it originated someplace smaller than AT&T, and it almost failed anyhow. The C++ community owes a great debt to Borland, without whom the computing world would not be what it is today. Microsoft's agenda of killing Borland forced it into the C++ business, late and apparently against its will, but many current students seem to think Microsoft invented C++ and that it is the only systems language on the only platform that matters.

    One of the main object-orientation implementation features of C++ that is almost unique is the support for non-virtual methods. C++ experts may disagree but arguably this is a performance hack that adds complexity to the language for the sake of execution speed of the generated code. Basically the symbol table for each class must track virtual and non-virtual properties, and generate C-like function calls for non-virtuals while generating virtuals through a methods vector (aka virtual function table), with an extra memory reference.

    Another feature whose implementation in C++ or Java raises additional challenges is function overloading. The symbol table must hash not on method names but on method signatures. Signatures can generally be represented by strings. When automatic type conversions come into play, multiple possible interpretations of a call require more semantic checking than in simpler languages' compilers. For example, with methods f(char c) and f(float x), a call like f(32) is ambiguous and produces a semantic error (kind of like the multiple inheritance ambiguity we saw earlier).
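
    A toy version of such a symbol table is sketched below in Python. The signature strings, the OverloadTable class, and especially the tiny CONVERSIONS table are assumptions for illustration; real C++ overload resolution ranks conversions in far more detail.

```python
# Implicit conversions allowed in this toy model (source type -> target types).
CONVERSIONS = {"int": {"char", "float"}, "char": {"int"}, "float": set()}

class OverloadTable:
    def __init__(self):
        self.by_signature = {}        # "f(char)" -> (name, parameter types)

    @staticmethod
    def signature(name, param_types):
        return name + "(" + ",".join(param_types) + ")"

    def declare(self, name, param_types):
        sig = self.signature(name, param_types)
        if sig in self.by_signature:
            raise ValueError("redeclared: " + sig)
        self.by_signature[sig] = (name, tuple(param_types))

    def resolve(self, name, arg_types):
        exact = self.signature(name, arg_types)
        if exact in self.by_signature:
            return exact              # exact match needs no conversions
        candidates = [
            sig for sig, (n, params) in self.by_signature.items()
            if n == name and len(params) == len(arg_types)
            and all(a == p or p in CONVERSIONS.get(a, ())
                    for a, p in zip(arg_types, params))
        ]
        if len(candidates) == 1:
            return candidates[0]
        if not candidates:
            raise LookupError("no matching overload for " + exact)
        raise LookupError("ambiguous call " + exact + ": " + ", ".join(candidates))
```

    After declaring f(char) and f(float), resolving a call with a char argument finds f(char) directly, while a call with an int argument reaches both candidates through one conversion each and is reported as ambiguous.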

    Operator overloading adds a similar challenge to the semantic analysis of operators; sometimes + is an ADD instruction and sometimes it is a function call.

    Virtual Machines

    Modern programmers mainly think of JVM and Microsoft's .Net CLR when they hear the phrase virtual machine. Hardware folks, especially mainframe people, think of hardware emulators and operating systems which support virtual hardware devices. The facts are that virtual machines are a very old idea and that they mean different things in different contexts.

    The Pascal language used a virtual machine instruction set called p-code to improve the language's portability. For each machine, the compiler would generate the same instructions, but the representation details of those instructions might vary from machine to machine, e.g. little-endian versus big-endian machines. The instruction set was very simple, enabling it to be implemented by single instructions or small sequences of instructions on most typical hardware of that day.

    The modern Java virtual machine has similar goals, but with the modern twist of (in theory) true machine-independent VM code. Instruction portability is solved at a tiny cost in performance, but Java programs still have portability problems: they have traded instruction compatibility problems for library version compatibility problems. Microsoft's C# and .Net CLR started out as cheap replacements for Java, but CLR quickly gained a much larger goal, that of supporting many different languages, such as Visual BASIC and C#, with good interoperability and performance.

    Smalltalk, Prolog, and Icon/Unicon have Pascal-style virtual machines, with the added twist of higher-level language semantics, semantics complicated enough that they do not map easily down to the underlying hardware. For such languages, a natural migration path is to develop a virtual machine that captures the semantics and allows experience and experimentation, followed (optionally, much later) by a compiler that produces native code.

    The Icon/Unicon Virtual Machine

    Read: IC (the Implementation Compendium) chapters 3 and 4.

    The virtual machine instruction set does not go into the details of data types, the way JVM does. It is sort of like "values" and "memory" are virtual, not just "instructions" are virtual.

    Some changes since the early days: itran+ilink merged to become icont. In future, these should just go ahead and jump into the VM itself (iconx).

    Ucode and Icode

    Ucode is the "object" format, similar to .o; icode is the binary format. Ucode is human-readable ASCII, and highly portable. Icode is not: it contains C structs and therefore depends on the C compiler and its representation of types such as "int" or "char *", including issues such as word size and endianness.
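
    The endianness half of that portability problem is easy to demonstrate. The Python snippet below uses the standard struct module to show one 32-bit value laid out both ways; it illustrates the general issue and says nothing about the actual icode file layout.

```python
import struct

word = 0x01020304

little = struct.pack("<I", word)   # byte order used by little-endian machines
big = struct.pack(">I", word)      # byte order used by big-endian machines

# Reading little-endian bytes on a big-endian machine gives the wrong value.
misread = struct.unpack(">I", little)[0]

# A textual form, as in ucode, is the same sequence of characters everywhere.
text = str(word).encode("ascii")
```

    Here little and big contain the same four bytes in opposite orders, and misread comes out as 0x04030201 rather than 0x01020304.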

    Appendices B and C of the IC document the virtual machine instruction set. In addition, a crude Ucode Code Generation Guide is available.

    Values and Descriptors

    Any variable can hold any value, so values are self-describing, i.e. they contain their type information. The C struct representing an Icon/Unicon value therefore looks like:
    d-word
    v-word
    The d-word has several flags in the most significant bits and a type code in the least significant bits; or, if no flags are on, there is no type code because the value is a string, and the d-word is the string length.

    The v-word has either nothing (if the d-word holds the null type), an integer value, a pointer to char, or a pointer to a "block".
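
    The decoding rule can be written down as a small Python sketch for a 32-bit d-word. The exact bit positions of the flags and the numeric type codes used here are assumptions picked for the example; consult the implementation's header files for the real values.

```python
# Assumed flag positions in the top four bits of a 32-bit d-word.
F_NQUAL = 0x80000000    # "not a qualifier", i.e. not a string
F_VAR = 0x40000000
F_TVAR = 0x20000000
F_PTR = 0x10000000
FLAG_MASK = F_NQUAL | F_VAR | F_TVAR | F_PTR
TYPE_MASK = 0x000000FF  # assumed width of the type code field

# Assumed type codes, for illustration only.
T_NULL, T_INTEGER, T_LIST = 0, 1, 8

def describe(dword, vword):
    """Interpret a (d-word, v-word) pair as a self-describing value."""
    if (dword & FLAG_MASK) == 0:
        return ("string", dword, vword)     # length, address of characters
    code = dword & TYPE_MASK
    if code == T_NULL:
        return ("null",)                    # v-word carries nothing
    if code == T_INTEGER:
        return ("integer", vword)           # v-word is the value itself
    return ("block", code, vword)           # v-word points to a block
```

    Note how strings pay no cost for a type code: the absence of flags is itself the type information, which leaves the whole d-word free to hold the length.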

    lecture #12 began here

    Questions

    Q: Why is it easier to write a VM for a platform than a backend for a compiler which generates native code for that platform? (And) why is it easier to come up with a virtual machine that supports noval (sic) language features than it is to write a compiler that implements them?

    A: Read section 3.1 of the IC for Griswold's answer to this question. To the reasons given there, I would add:

    1. Working at compile-time, the compiler has to implement the all-possible-cases solution to a problem. Working at run-time, an interpreter (VM or otherwise) only has to solve the particular case given by the specific program being executed.
    2. Although both compiler and interpreter may be written in a high level language, the interpreter can be almost entirely cast in those higher-level language terms, while a compiler back-end is working with machine code.
    3. It is not one platform, but N platforms we are talking about, where N is gradually growing.
    4. It is harder to debug a compiler than a VM, since there is a "meta" level of removal between the evidence of error and its cause.

    Do not underestimate the level of complexity difference between "virtual" machines and real machines: writing a code generator for a VM is much, much simpler than writing a native code generator. Real processors usually have more instructions, "good" instruction scheduling is more complex, register allocation is non-trivial, memory addressing modes may be complex, and a large number of idiosyncratic quirks must be accommodated. The first compilers took many man-years of effort, and while tools have automated the front-end, even with available tools, compiler back-ends are still a major challenge.

    The question you should be asking is: why is it easier to write a VM for a platform than a backend for a compiler which generates C code for that platform?

    Blocks Representing Values in the Unicon Virtual Machine

    Most types are represented by blocks, starting with a type code in the title, with following bytes' size and field organization determined by the type code. These structs all live in src/h/rstructs.h.

    There is a gigantic union type, union block, to denote a value that is a block that has not been examined yet. Each type of block then has a separate struct type associated with it. Some types require multiple block types to handle their representation.

    Most blocks' sizes are statically determined by their type. An exception to this rule would be for records; the sizes of different records vary, so a size in bytes is stored in the block. Note: this is stupid, the size is a property of the record type, why are we wasting 4 bytes per instance to store its size in bytes? Fixing it could easily be a homework problem.

    A note on the one-word descriptor

    Last lecture we discussed the two-word descriptor Unicon inherits from Icon's implementation. The memory size requirements (potentially a factor of 2 size increase) were noted. To that discussion I would add: the design decision was to keep the code simpler and smaller by not playing tricks with bits. For a file format, or a network protocol, it may pay off more to save every bit, but if it takes several instructions to peel out the compacted individual fields of a one-word descriptor, it may well be a lot slower to try and save space.

    Here's a straw-man one-word descriptor design to allow us to consider the matter. You might come up with a better one-word design; this is kind of an exercise in identifying the minimal changes.

    Unicon has 25 type codes, so 5 bits would suffice for them. Four more bits are used in current descriptors for flags NVPT. On a 32-bit platform, that leaves us with 23 bits: only enough to address 8MB of address space. Is this useful? Let's see.

    For integers, numbers larger than 4M or smaller than -4M would spill out into "large integer" blocks. One would have to do some statistics to tell how often integers are in between 4M and 2B.

    For blocks, if we play tricks, like aligning blocks on 8-byte boundaries, we get three bits back! 64MB of blocks, hurray. But we have introduced a bit of fragmentation where memory is wasted, and the block pointer is now computed as

    BlkLoc(d) = (d & 0x7FFFFF) << 3;
    
    This might be slower than just d.vword.bptr, which just fetches the memory 4 bytes off from the address of d. The extra instructions we have to fetch and execute offset whatever memory fetches we save with the smaller descriptor.

    There might be additional tricks we could play to increase the addressable block memory, like allocating a different region for each data type. This would allow types to be aligned on larger boundaries than just 8-bytes, and allow each type its own 64MB or larger region (by adding the type's region address to the pointer). But now, following these pointers is slower, there is another addition along with all that bit twiddling.

    And what about strings? They are heavily special-cased for high performance in the current implementation, and many Icon programs do heavy string processing. Do we redesign strings to use a block, so we can store the length and a pointer? If so, we've added another level of indirection to strings, and we are paying a new 12 byte block for every string. Can we fit the length and pointer into the one-word descriptor? The existing design would allow us 31 bits for string length and string pointer. We'd quickly hit unacceptable limits, such as only allowing 8MB of string space (23 bits) and strings of max length 256 (8 bits). How about storing the length at the beginning of every string like Pascal and some BASICs? If we use all 31 bits for the char *, substring sectioning becomes a more expensive copy operation, and some string concatenation optimizations are no longer possible.

    We might actually want to implement two internal types for strings: short strings (length <= 256, 8MB of them) and long strings (using a type code and an external block), mirroring the integer implementation.

    One thing is for sure: one-word descriptors would add some strong limits on memory (on 32-bit machines, that is) and potentially a lot of complexity. If anyone wants to do them for a semester project, it would be a high-risk gamble with a high potential payoff, especially on 64-bit machines.

    Variables and Assignments

    A variable is a descriptor with a "v-flag" set; its vword points to another descriptor. The two main operations on variables are dereferencing and assignment. In dereferencing, a variable descriptor is replaced by the descriptor it is pointing at. In assignment, the descriptor pointed at is changed.

    There are some special cases where assignments need special treatment. Special keyword variables have semantic rules which must be enforced, such as &subject must always be a string. These special cases are handled by trapped variables. A reference to a trapped variable allocates a block with a special type code so that a subsequent assignment (if there is one) can enforce the semantic rules. The icont program (and hence Unicon) is lazy about this and allocates the trapped variables just in case, even in syntactic contexts where the variable will be dereferenced rather than assigned.

    lecture #13 began here

    Midterm Exam

    The midterm exam will be Thursday, October 14. It will cover material through this week.

    The Core of the Unicon VM Interpreter

    In general, the Unicon VM source code lives in unicon/src/runtime/, with header files in unicon/src/h/, and a few libraries and additional modules in other subdirectories. Files matching f*.r* deal with built-in functions, o*.r* deal with operators, r*.r* deal with internal runtime system services, and i*.r* deal with the VM interpreter itself.

    The core of the Unicon VM lives in runtime/interp.r, which has a gigantic function named interp(). This function does a fetch/decode/execute loop on VM instructions. Some instructions are implemented immediately; most invoke runtime system functions in other modules. The fetch-decode-execute loop looks like

       for (;;) {
          ...
          lastop = GetOp;		/* Instruction fetch */
          ...
          switch ((int)lastop) {
             ... something like 100 cases, 25 for binary operators
             }
    
    The most exciting thing about this VM loop is that, when an expression generates a result (and might have more), the way expressions suspend is by calling interp() recursively.

    RTL Language

    In order to read very much of the virtual machine source code, we need to talk about the extended C dialect RTL that it is written in. RTL is processed by the program rtt and described in IPD 261.

    The RTL code is illustrated by the implementation of the size operator (unary asterisk, as in *s). Compared with C, RTL has a documentation string, modified function header syntax, an "abstract" clause, a "type_case" operation, and separate code fragments for the different types. When the type of x is unknown, this expands to a switch statement executed at runtime, but when type information is known, the correct branch is selected at compile time.

    
    "*x - return size of string or object x."
    
    operator{1} * size(x)
       abstract {
          return integer
          }
       type_case x of {
          string: inline {
             return C_integer StrLen(x);
             }
          list: inline {
             return C_integer BlkLoc(x)->list.size;
             }
          table: inline {
             return C_integer BlkLoc(x)->table.size;
             }
          set: inline {
             return C_integer BlkLoc(x)->set.size;
             }
          cset: inline {
             register word i;
    
             i = BlkLoc(x)->cset.size;
    	 if (i < 0)
    	    i = cssize(&x);
             return C_integer i;
             }
          record: inline {
             return C_integer BlkLoc(x)->record.recdesc->proc.nfields;
             }
          coexpr: inline {
             return C_integer BlkLoc(x)->coexpr.size;
             }
          file: inline {
    	 int status = BlkLoc(x)->file.status;
    #ifdef Dbm
    	 if ((status & Fs_Dbm) == Fs_Dbm) {
    	    int count = 0;
    	    DBM *db = (DBM *)BlkLoc(x)->file.fd;
    	    datum key = dbm_firstkey(db);
    	    while (key.dptr != NULL) {
    	       count++;
    	       key = dbm_nextkey(db);
    	       }
    	    return C_integer count;
    	    }
    #endif					/* Dbm */
    #ifdef ISQL
    	 if ((status & Fs_ODBC) == Fs_ODBC) { /* ODBC file */
    	    struct ISQLFile *fp;
    	    int rc;
    #if (ODBCVER >= 0x0351)
    	    SQLLEN numrows;
    #else					/* ODBCVER >= 0x0351 */
    	    SQLINTEGER numrows;
    #endif					/* ODBCVER >= 0x0351 */
    	    fp = (struct ISQLFile *) BlkLoc(x)->file.fd;
    	    rc = SQLRowCount(fp->hstmt, &numrows);
    	    return C_integer(numrows);
    	    }
    #endif					/* ISQL */
    	 runerr(1100, x); /* not ODBC file */
    	 }
          default: {
             /*
              * Try to convert it to a string.
              */
             if !cnv:tmp_string(x) then
                runerr(112, x);	/* no notion of size */
             inline {
    	    return C_integer StrLen(x);
                }
             }
          }
    end
    

    lecture #14 began here

    Code Generation for the Icon VM

    Unicon code generation relies on icont (a C program) to translate procedural Icon code into VM instructions. Later this semester we will be studying the analysis and code generation performed by iconc, the translator that generates C code and thereby creates faster native (non-VM) executables.

    The icont source code lives in src/icont. The code generator apparently lives in icont/tcode.c, an 1100 line source file. The rest of the t*.c files are related to the compiler front-end (itran); the l*.c files are for the linker (ilink).

    Code generation to the .u1 VM code file proceeds on a per-procedure basis, when the grammar action macro Procdcl(x) calls a C function codegen(x) on a parse tree node for a procedure. codegen(t) just resets a label counter and calls traverse(t), the "real" code generator function. The start nonterminal's action macro Progend calls gout(), which writes global (.u2) information out. The .u2 is then concatenated with the .u1 to form a .u file.

    Function traverse(t) traverses the syntax (sub)tree rooted at t. It is a simple function with a long switch statement with 35 or so branches for different kinds of tree nodes. The bodies of most branches call helper functions to generate specific instructions, plus call traverse() recursively on child nodes.

    static int traverse(t)
    register nodeptr t;
       {
       register int lab, n, i;
       struct loopstk loopsave;
       static struct loopstk loopstk[LoopDepth];	/* loop stack */
       static struct loopstk *loopsp;
       static struct casestk casestk[CaseDepth];	/* case stack */
       static struct casestk *casesp;
       static struct creatstk creatstk[CreatDepth]; /* create stack */
       static struct creatstk *creatsp;
    
       n = 1;
       switch (TType(t)) {
          ...
    
    Interestingly, this code has static limits on the maximum lexical depth of major control structures, which may be nested at most 20 deep (loops), 10 deep (case expressions), and 10 deep (co-expression creations). These static limits do not appear to be checked. This is very possibly a cause of (rare) coredumps from icont on especially large, complicated .icn source files, such as the 475KB uniconlex.icn generated by ulex by one of your illustrious classmates for an earlier homework. POSSIBLE HOMEWORK: add checks, raise or better yet eliminate static limits on loopstk, casestk, and creatstk!

    The case branches for different node types rely on macros to pick out information from the tree nodes, so the actual representation of the tree is entirely hidden from this code! The actual tree representation as well as the definitions of these macros live in src/icont/tree.h. The macros hide it, but in reality, each "child" of a tree node can be either a (long) int, a (char *) string, or a pointer to another tree node:

    union field {
       long n_val;		/* integer-valued fields */
       char *n_str;		/* string-valued fields */
       nodeptr n_ptr;	/* subtree pointers */
       };
    
    struct node {
       int n_type;			/* node type */
       char *n_file;		/* name of file containing source program */
       int n_line;			/* line number in source program */
       int n_col;			/* column number in source program */
       union field n_field[1];      /* variable-content fields */
       };
    
    Tree0(t) picks out child 0, which had better be a tree node; Val0(t) picks out child 0, which had better be a (long integer) "immediate value" in the tree; and so on. This allows the code for each kind of tree node to be fairly simple.

    The following example illustrates code generation for perhaps the most common syntax construct, binary operators. The code uses function emit() to generate individual VM instructions, and helper functions setloc() and binop() to do some of the dirty work. The overall template is: push a (null) descriptor to make space for the result, generate code for the left operand, generate code for the right operand, and generate the instruction for the operator. The pnull at the beginning may seem unnecessary; in many stack machines the protocol is: pop two operands and push the result in their place. The pnull is not an accident for Icon and Unicon; see if you can come up with a guess as to why you might not always want to throw away the operands right away and overwrite them with your result. Note that the tree node order for binary operators is "prefix" order: the operator in child 0, followed by operands in child 1 and 2.

          case N_Augop:			/* augmented assignment */
          case N_Binop:			/*  or a binary operator */
    	 emit("pnull");
    	 traverse(Tree1(t));
    	 if (TType(t) == N_Augop)
    	    emit("dup");
    	 traverse(Tree2(t));
    	 setloc(t);
    	 binop((int)Val0(Tree0(t)));
    	 free(Tree0(t));
    
    Working our way backwards through this code:

    There are actually several variations on the emit() function: emitlab() for labels, emit(), emitl() for instructions that reference labels, emitn() for instructions with a numeric argument, and emits() for instructions with a string argument. These functions all use good-old fprintf, to a global variable codefile, and every single write is checked for failure to avoid surprises. Strangely enough the writecheck() is a function that should probably be inline or a macro.

    static void emit(s)
    char *s;
       {
       writecheck(fprintf(codefile, "\t%s\n", s));
       }
    
    I have omitted some of the details, such as opening files, but this whole code generator is one of the simplest "production compiler" arrangements I have ever seen. One of the most useful possible homework #4's I could assign would be to rewrite this VM code generator for the Unicon translator, so it would not need to write out Icon source code and then call the icont program which re-parses everything. If we got rid of the itran part of icont, it would eliminate ~6K lines of C code from our distribution. But, there are enough details that it might be too large for a homework. However, it would make an excellent semester project.

    icont's linker, ilink

    The linker is around 4K lines of code in 8 files. It is somewhat "lower level" than the itran portion, since it generates binary VM bytecode. The main file link.c invokes other modules, the primary of which is lcode.c, which contains the primary translation logic from ucode to icode.

    Function ilink() in link.c performs three passes, one to obtain global information from ucode files, one to omit unreferenced procedures, and one to do the "real" code generation. We will focus on the third pass. Looping through the list of files to link is done by walking through a linked list:

       lfls = llfiles;
       while ((lf = getlfile(&lfls)) != 0) {
    
    For each file, function gencode() is called. After all files have been visited, a function gentables() generates several global tables containing information about record types and fields, globals and statics.

    Function gencode() reads from a global open FILE named infile; it relies on helper functions such as getopc() to handle the details of reading "tokens" from the ucode file. The majority of gencode() is a giant switch statement on the opcodes of each (real and pseudo) instruction:

       while ((op = getopc(&name)) != EOF) {
          switch (op) {
    
    For many opcodes, the binary instruction is very straightforward. Sixty-three opcodes share the following switch branch:
             case Op_Plus:
                newline();
                lemit(op, name);
                break;
    
    The opcodes that don't share this branch must have special needs or immediate in-line operands (as opposed to operands pushed as descriptors onto the interpreter stack), such as constants, for which the instruction and the value (or a pointer to it) are both emitted into the binary bytecode. The procedure call instruction and the record field operator VM instruction similarly have an integer code operand (specifying how many parameters, and which field, respectively) rather than a descriptor. Some of these special cases are mentioned in later sections. The helper functions used in the common case are illustrative and worth covering in detail. newline() actually moves the input file past the next line feed, i.e. to the next instruction. lemit() is the helper function that does the real work, except that it is just a wrapper (with optional debugging code) around outop(op), which is a macro that calls intout(n) for opcodes. Instruction operands would instead be generated by wordout(), whose output tends to be long-sized/pointer-sized, which might or might not be the same number of bytes as intout() writes.

    Representing the Binary Image in Memory

    lcode.c keeps track of actual binary locations of instructions represented as offsets, for use as targets of goto and similar instructions. It uses a global variable for this:
    word pc = 0;		/* simulated program counter */
    
    Similar to emit(), at the binary level the actual code is "emitted" by helper functions such as wordout(), which emits a word. This might be trickier to get consistent than it sounds, at least when one is writing for "all possible platforms". Counterintuitively, these helper functions don't actually write to files, but rather, write bytes to a (gigantic) in-memory image of the icode file (named codep), an array of bytes that is grown dynamically.
    /*
     * wordout(i) outputs i as a word that is used by the runtime system
     *  WordSize bytes must be moved from &oword[0] to &codep[0].
     */
    static void wordout(oword)
    word oword;
       {
       int i;
       union {
            word i;
            char c[WordSize];
            } u;
    
       CodeCheck(WordSize);
       u.i = oword;
    
       for (i = 0; i < WordSize; i++)
          codep[i] = u.c[i];
    
       codep += WordSize;
       pc += WordSize;
       }
    

    Global Tables in Icode

    Icode has a header (src/h/header.h) of 60+ bytes which is used by the VM to find the locations of each piece that is written out. Besides the instructions, there are various global tables, including the tables that gentables() generates for record types and fields, globals, and statics.

    Field Table Compression

    All of these tables are linear in the size of the corresponding program elements, except the record field table, which is N x M, where N is # of record types, and M is number of unique field names appearing in all the record types. This matters more in Unicon than in Icon, because classes are records, and large OO programs use lots of classes. For example, every significant GUI application uses lots of GUI classes and will have a large field table.

    In a previous software engineering course, a 25K LOC student project used enough record types that a size problem was identified. Happily, researchers in the Smalltalk community had solved a similar problem in Smalltalk. An undergraduate student named Richard Hatch experimented with several techniques and eventually came up with a simple, cheap compression technique, which is described in a UTSA technical report. In a very large program, Field Table Compression has been demonstrated to take a 464KB field table and reduce it to 24KB, without requiring decompression and at an execution cost of around 2%.

    ilink Summary

    There is enough complicated logic in the linker that it is hard to justify rewriting it and throwing it away. Instead it should probably be merged into the VM binary, and tied closer to the icode loading mechanism. Then again, this lecture did not show that; other than the field table compression it mainly looked like the linker could be translated into Unicon fairly easily.

    lecture #15 began here

    Midterm Exam Review

    We have a nice midterm exam to look forward to on Thursday.

    lecture #16 began here

    Midterm Exam - the answers

    CS 580 Midterm Exam

    Name: __________________________________________

    Answer all questions. Read them ahead of time, and pace yourself to complete in ≤ 75 minutes. You may use the back of each sheet.

    1. Lexical Analysis (20 points)

    The REXX language, along with certain other shell/scripting languages, features "implicit concatenation", in which a concatenation operator is automatically applied between every pair of adjacent tokens where it makes sense to do so. Note that this goes well beyond C/C++ implicitly concatenating adjacent string constants; it applies to variables. Pretending for a moment that it was Unicon we are talking about instead of REXX, for example, if x is "Mervyn",
       write("hello " x)
    
    would print
    hello Mervyn
    
    Suggest how this feature might be implemented in the lexical analyzer of a compiler (not just for "write()", but for all expressions). What problems would occur if we added this feature to Unicon? Are there any semantic problems deep enough that we should NOT do it?


    2. Syntax Analysis (20 points)

    Unicon has a table data type, in which keys are mapped to values. We earlier discussed a lack of a "literal table constructor" syntax for specifying an entire table with its keys and values in a single expression, such as
       t := ["Washington": "Delaware", "Egypt": "Nile", 3.14: 3]
    
    (This creates a table of size 3, with keys "Washington", "Egypt", and 3.14 mapped onto elements "Delaware", "Nile", and integer 3. It makes no sense.)

    2a) Write YACC production rules (i.e. context free grammar rules in YACC syntax) for this "literal table constructor" feature. Consider whether there are any optional or variational elements that are needed.

    2b) Discuss whether this proposed syntax is likely to conflict with other features in Unicon, i.e. whether it would be safe to add this feature. If you are not comfortable discussing this for the Unicon language, you may instead discuss whether such a table type, with such constructors, would fit into Java or C/C++, or whether it would have conflicts there.


    3. Semantic Analysis (20 points)

    Suppose you extended a YACC tool to handle additional regular-expression operators such as * or +, as in the example:
        optionalExprs : exprs* ;
    
    How would you handle YACC's notation for semantic actions to allow this kind of production rule to build syntax trees?

    4. Code Generation (20 points)

    Describe the similarities and differences between code generation in the Unicon translator (unicon) and the icont VM compiler. If you want to make execu