Lexical analysis looks at the actual string of characters provided as
program, and divides it up into meaningful symbols. For instance, if
the string 1376
appears, this step recognizes the
presence of a number with the value of 1376. The string
if
is recognized as the keyword IF. The string
iff
would be recognized as an identifier; and so forth.
After lexical analysis, the body of the Euclid program would ``look'' to the compiler something like:
IDENT(m) ASSIGN INTEGER(32) SEMICOLON IDENT(n) ASSIGN INTEGER(28) SEMICOLON WHILE LPAREN IDENT(n) NE INTEGER(0) RPAREN LBRACE IF LPAREN IDENT(m) LE IDENT(m) RPAREN IDENT(m) ASSIGN IDENT(m) MINUS INTEGER(1) ELSE LBRACE IDENT(tmp) ASSIGN IDENT(m) IDENT(m) ASSIGN IDENT(n) IDENT(n) ASSIGN IDENT(tmp) RBRACE RBRACE IDENT(gcd) ASSIGN IDENT(m)
Actually, the line breaks would be gone at this point, too - I left those in for readability. And, of course, it doesn't really appear as a string like this; that's just a human-readable form of what the internal data structure would look like.
The second step, parsing, looks at the tokens provided by the lexical
analysis and creates an abstract representation of the structure of
the program. This step knows, for instance, that a while
loop consists of the keyword WHILE
, an expression, and a
statement. The abstract representation of a program will be a tree
structure; Euclid's Algorithm will look something like this: