Overview of Compilation Process

Lexical Analysis

Lexical analysis looks at the actual string of characters provided as program, and divides it up into meaningful symbols. For instance, if the string 1376 appears, this step recognizes the presence of a number with the value of 1376. The string if is recognized as the keyword IF. The string iff would be recognized as an identifier; and so forth.

After lexical analysis, the body of the Euclid program would ``look'' to the compiler something like:


IDENT(m) ASSIGN INTEGER(32) SEMICOLON
IDENT(n) ASSIGN INTEGER(28) SEMICOLON
WHILE LPAREN IDENT(n) NE INTEGER(0) RPAREN LBRACE
IF LPAREN IDENT(m) LE IDENT(m) RPAREN
IDENT(m) ASSIGN IDENT(m) MINUS INTEGER(1)
ELSE LBRACE
IDENT(tmp) ASSIGN IDENT(m)
IDENT(m) ASSIGN IDENT(n)
IDENT(n) ASSIGN IDENT(tmp)
RBRACE
RBRACE
IDENT(gcd) ASSIGN IDENT(m)

Actually, the line breaks would be gone at this point, too - I left those in for readability. And, of course, it doesn't really appear as a string like this; that's just a human-readable form of what the internal data structure would look like.

Parsing

The second step, parsing, looks at the tokens provided by the lexical analysis and creates an abstract representation of the structure of the program. This step knows, for instance, that a while loop consists of the keyword WHILE, an expression, and a statement. The abstract representation of a program will be a tree structure; Euclid's Algorithm will look something like this:


Last modified: Wed Jan 28 13:06:47 MST 2004