PL: Syntax of PLs
From old CS 471 notes…
Syntax of Programming Languages
Syntax matters
No web references, you can find a lot but it’s up to you as to which you believe. Certainly some NASA space code somewhere had the problem below.
Fortran DO loop DO 20 I = 1,5
Probably true: Mecury mission’s orbit affected by a Fortran bug where the comma was mistyped as a period.
Probably false: It was long claimed that Mariner I was lost because of above typo, but this is doubtful (but potentially a mistyped hyphen did it in)
Fortran’s syntax: spaces do not matter
Can this happen in C?
Definitions
The grammar of a programming language is the set of rules which describe what syntactically correct programs look like. A program is a sentence in the grammar.
A lexeme is a small syntactic unit (sequence of characters) that become the “words” in the grammar.
Lexemes are keywords, identifiers, operators, numeric literals, etc.
We separate spelling of English words from the proper formation of English sentences (grammar)
Recall in CS 370, identification of lexemes is separate from matching the higher-level grammar (e.g., lex was the tool for specifying lexeme matching, yacc was the tool for specifying grammar)
Each lexeme group is represented in the grammar by a token
In Object Oriented terms, the token is the class name, while individual lexemes in a program are object instances
Many times there is only one lexeme type in a class, so the token is equal to the string of that lexeme type; e.g., the keyword while is the only lexeme type in the “while-keyword” class, so “while” is used as is in the grammar.
Languages
Language and automata theory define several hierarchical classes of languages
Two very important classes that every computer scientist should know: regular and context-free
There is a direct mapping between language classes and types of automata
Regular languages are those that can be represented by finite automata
For our purposes, a finite automaton is a finite state machine with labelled transitions between states, where the label on a transition matches a token in the input stream
Context-free languages are those than can be represented by push-down automata (essentially, finite automata with a stack memory)
Regular expressions, used by many Unix command tools, are expressions that define a regular language
Lexeme identification is typically done using a regular language
But, regular languages are not powerful enough to describe useful programming languages
The canonical example of the limitations of regular languages is that of matching parentheses
A finite automaton has no way of counting the number of left parens, to make sure the number of right parens match
But with a PDA, you can use the stack to count them
So, if anyone ever asks you to write a generic expression parser, do not try to use a finite automata approach!
Backus-Naur Form and Context Free Grammar
The syntax of most “real” programming languages is context-free, or at least close to it (i.e., small exceptions)
(These exceptions can be killers, though!)
A very elegant way to write CFGs is with grammar production rules, where nonterminal rule names are recursively substituted for their rule patterns, which include both terminals and other nonterminals
BNF, EBNF embody a language for writing these production rules
Basically, a bunch of rules of the form
nonterminal :- pattern of terminals and/or nonterminals
A grammar must have one top-level nonterminal which is the start symbol
Sentence Derivation
A grammar can act as a sentence generator or sentence recognizer (when we run the compiler we are using it as a recognizer)
Starting with the start symbol, replacing any one nonterminal with any one of its rule bodies, until all nonterminals are gone, results in a sequence of sentential forms, the last one being one possible sentence in the language defined by the grammar
Example on page 122
Replacement can happen leftmost, rightmost, or in random order
Connecting with and edge the nonterminals of the sentential forms
and the rules into which they derived results in a parse tree
Ambiguity
Sentential form derivation order does not affect the resulting parse tree
But a grammar can be ambiguous, meaning more than one parse tree can represent the same sentence
Example on pages 124-125
Operator Precedence Problem in ambiguous expression grammars
Operator Associativity Problem in ambiguous expression grammars
Matching the else to the correct if