PL: Syntax of PLs • Jonathan Cook

From old CS 471 notes…

Syntax of Programming Languages

Syntax matters

No web references, you can find a lot but it’s up to you as to which you believe. Certainly some NASA space code somewhere had the problem below.

Fortran DO loop DO 20 I = 1,5

Probably true: Mecury mission’s orbit affected by a Fortran bug where the comma was mistyped as a period.

Probably false: It was long claimed that Mariner I was lost because of above typo, but this is doubtful (but potentially a mistyped hyphen did it in)

Fortran’s syntax: spaces do not matter

Can this happen in C?

Definitions

The grammar of a programming language is the set of rules which describe what syntactically correct programs look like. A program is a sentence in the grammar.

A lexeme is a small syntactic unit (sequence of characters) that become the “words” in the grammar.

Lexemes are keywords, identifiers, operators, numeric literals, etc.

We separate spelling of English words from the proper formation of English sentences (grammar)

Recall in CS 370, identification of lexemes is separate from matching the higher-level grammar (e.g., lex was the tool for specifying lexeme matching, yacc was the tool for specifying grammar)

Each lexeme group is represented in the grammar by a token

In Object Oriented terms, the token is the class name, while individual lexemes in a program are object instances

Many times there is only one lexeme type in a class, so the token is equal to the string of that lexeme type; e.g., the keyword while is the only lexeme type in the “while-keyword” class, so “while” is used as is in the grammar.

Languages

Language and automata theory define several hierarchical classes of languages

Two very important classes that every computer scientist should know: regular and context-free

There is a direct mapping between language classes and types of automata

Regular languages are those that can be represented by finite automata

For our purposes, a finite automaton is a finite state machine with labelled transitions between states, where the label on a transition matches a token in the input stream

Context-free languages are those than can be represented by push-down automata (essentially, finite automata with a stack memory)

Regular expressions, used by many Unix command tools, are expressions that define a regular language

Lexeme identification is typically done using a regular language

But, regular languages are not powerful enough to describe useful programming languages

The canonical example of the limitations of regular languages is that of matching parentheses

A finite automaton has no way of counting the number of left parens, to make sure the number of right parens match

But with a PDA, you can use the stack to count them

So, if anyone ever asks you to write a generic expression parser, do not try to use a finite automata approach!

Backus-Naur Form and Context Free Grammar

The syntax of most “real” programming languages is context-free, or at least close to it (i.e., small exceptions)

(These exceptions can be killers, though!)

A very elegant way to write CFGs is with grammar production rules, where nonterminal rule names are recursively substituted for their rule patterns, which include both terminals and other nonterminals

BNF, EBNF embody a language for writing these production rules

Basically, a bunch of rules of the form

nonterminal :- pattern of terminals and/or nonterminals

A grammar must have one top-level nonterminal which is the start symbol

Sentence Derivation

A grammar can act as a sentence generator or sentence recognizer (when we run the compiler we are using it as a recognizer)

Starting with the start symbol, replacing any one nonterminal with any one of its rule bodies, until all nonterminals are gone, results in a sequence of sentential forms, the last one being one possible sentence in the language defined by the grammar

Example on page 122

Replacement can happen leftmost, rightmost, or in random order

Connecting with and edge the nonterminals of the sentential forms
and the rules into which they derived results in a parse tree

Ambiguity

Sentential form derivation order does not affect the resulting parse tree

But a grammar can be ambiguous, meaning more than one parse tree can represent the same sentence

Example on pages 124-125

Operator Precedence Problem in ambiguous expression grammars

Operator Associativity Problem in ambiguous expression grammars

Matching the else to the correct if