C-Flat : Specifications 1 and 2

Updated: November 27th, 2001

Author: Cassandra Bayer

Purpose

C-flat is designed to present the programmer with a very low-level language that is nonetheless visually oriented like C, and that takes care of housekeeping activities such as procedure calling (with parameter passing) and saving registers on the stack. C-flat spec 1 is essentially an extension of the specific architecture of the machine, being just a different way to write assembly for that architecture. C-flat spec 2 is designed to continue the idea of providing very precise, low-level functionality while not being as architecture specific.

C-flat is intended to be used in cases where a piece of code's speed or size must be tightly controlled. (Such a piece of code is hereafter referred to as critical code.) C is often inadequate to address the requirements of critical code, since the programmer has very little actual control over various aspects of the code once compilation has occurred; thus she may not know its exact size, or whether the optimizations produced better code than she might be able to produce herself. C-flat attempts to address these specific areas of critical code, while still remaining visually attractive and (in spec 2) being just high-level enough to ensure that a piece of code can be compiled for as many systems as possible. Thus, the goals of C-flat are to provide the power of Assembly with the readability and portability of C, hence the name: C flat (just below C).

Licences & Legal Claims

The source code for cbc (the C-flat compiler) will be released as Open Source, but will not be under the GNU license. The specific license is that all source code to cbc must be made easily available. When distributing binary formats of the code, you must make the source code just as easily accessible as the binaries. Source code will be centralized and maintained by Cassandra Bayer, but the source code is quite literally free, even to use as a basis for an alternate compiler. The C-flat specifications, though, will be strictly controlled by Cassandra Bayer, and she will not release control over them unless presented with a petition of sufficient size to release the specification to a non-profit council formed for the purpose of maintaining it. Any distribution of a C-flat compiler (either source or binary) must be accompanied by the SPECIFICATION file that it conforms to, and must contain a file COMPLIANCE, which details the exact points where the compiler varies from or extends the SPECIFICATION document.

Source File Extensions

Since spec 1 and spec 2 code will not be source compatible, the extension for spec 1 is to be .cb, and the extension for spec 2 will be .cb2. The C-flat compiler should recognize the spec version by the extension, and compile accordingly. Note that spec 1 source should contain a compiler directive indicating the target architecture.
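As a rough illustration of the rule above, a compiler driver might pick the spec version as in the following Python sketch. The function name spec_version and the decision to reject every other extension are assumptions made for the example; this document only fixes the two extensions themselves.

```python
import os

# Mapping taken from the text: .cb is spec 1, .cb2 is spec 2.
SPEC_BY_EXTENSION = {".cb": 1, ".cb2": 2}


def spec_version(path):
    """Return 1 or 2 for a C-flat source path, based only on its extension."""
    extension = os.path.splitext(path)[1]
    if extension not in SPEC_BY_EXTENSION:
        # Assumption: anything else is simply not treated as C-flat source.
        raise ValueError("not a C-flat source file: " + path)
    return SPEC_BY_EXTENSION[extension]


print(spec_version("boot.cb"))     # 1
print(spec_version("kernel.cb2"))  # 2
```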

Comments

Comments are the most overlooked element of programming, but their use in critical code is vital, since critical code very often performs many "tricks" and twists in order to squeeze out every last bit of performance. C-flat has elected to take two styles of comments in order to provide familiarity to both ANSI C and Assembly programmers. Thus, the format chosen is '/*' and '*/' to enclose block comments, and '#' for line comments.
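To make the two comment styles concrete, the following Python sketch strips both from a piece of source text. It is only an illustration: the name strip_comments is hypothetical, and it assumes that block comments do not nest and that comment markers inside quoted strings or character constants are left alone, neither of which is stated above.

```python
def strip_comments(source):
    """Remove '/* ... */' block comments and '#' line comments from source."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if source.startswith("/*", i):
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated block comment")
            # Keep the newlines so later line numbers stay the same.
            out.append("\n" * source.count("\n", i, end))
            i = end + 2
        elif ch == "#":
            end = source.find("\n", i)
            i = n if end == -1 else end      # the newline itself is kept
        elif ch in "\"'":
            # Copy a quoted string or character constant verbatim (assumption).
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            out.append(source[i:j + 1])
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(strip_comments("a = 1 # set a\nb = 2 /* and b */"))
```

Keeping the newlines of a removed block comment is a design choice of the sketch, so that diagnostics issued after stripping still point at the right line.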

Procedure/Function definition

Procedures and functions are vital to modern programming languages, thus the need was felt to bring standardized procedures and functions into C-flat. The format for stack parameter passing is the standard C format, while provisions remain so that one can pass parameters through registers. Return values default to the C format, though this can be overridden with standardized commands.

The declaration of a function is of this form:

func ( returns ) function_name ( parameters )
{
  function_code
}

The declaration of a procedure is of the form:

func procedure_name ( parameters )
{
  procedure_code
}

While C programmers make do with only ever returning one value, Assembly programmers often exploit returning more than one value at a time. Borrowing C style returns exactly would greatly damage the efforts of C-flat, thus return values are extended from their C style equivalent. See Parameters and Return Values below.

Note: The alternative "proc" exists for "func", but it should be considered deprecated.

Macro definition

Macros are a vital part of assembly. This is because assembly is most often used for critical code, which often wants to declare a function, but actually embed that code in the calling program. The declaration of a macro should thus be very similar to that of a function that accomplishes the same goals. In C these embedded functions are accomplished by optimization routines or by the keyword "inline", but this is ineffective in critical code, since critical code needs functions to be functions and macros to be macros: functions for size, and macros for speed. Thus a different keyword is used to indicate a macro.

The declaration of a function-like macro is of the form:

macro ( returns ) function_macro_name ( parameters )
{
  function_macro_code
}

The declaration of a procedure-like macro is of the form:

macro procedure_macro_name ( parameters )
{
  procedure_macro_code
}

Function-like macros, being similar to their counterparts, would suffer greatly from being restricted to C style return values, thus they have the same return value declaration as functions. See Parameters and Return Values below.

Types

C-flat has been presented with a unique situation, arising at a transition between two major architecture formats, from 32-bit to 64-bit. Thus, many matters can be addressed, and many more pitfalls possibly avoided, by choosing preventative formats for a variety of situations. The most serious one is to ensure compatibility of integers and pointers from architecture to architecture. Since C-flat is being designed specifically for critical code, it is extremely useful to have loose typing, since one will often wish to treat a value in any manner she sees fit. With these ideas in mind, the specifications for types are:

Register Names

All of the ideas of C-flat would be for naught if we didn't allow the programmer direct access to and allocation of registers, but the problem becomes: how? Every processor seems to treat registers differently from all the others. In spec 1 the solution is easy: the registers are referred to by their standard assembly name prefixed with a '%'. The solution for spec 2, though, is much harder. Spec 2 seeks to be as architecture neutral as possible while still being efficient, but how can this be accomplished when processors vary so greatly that the HC11 has 2 8-bit registers along with 2 16-bit address registers, while the IA-64 has 127 64-bit general purpose registers (and soon to be 256)? Also, what if the chip is accumulator based, or stack based (Java ByteCode)? How should all these be treated?

Spec 2 has taken an approach very similar to the idea that Intel had with their first math co-processor (the 8087). The idea was to have a stack based machine, but not quite. Registers will be assigned in a queue based order, thus the first use of a register will supply you with register "A" on the HC11, "EAX" on the IA-32, and register 1 on the IA-64. This will of course skip any non-general purpose registers, such as register 31 on MIPS ($ra = return address). When a register becomes free for use, it will remain unused until the "queue" of allocated registers exceeds the number of available registers; then a search will begin for an unused register. If no register is found, a warning is issued at compile time, and the compiler stuffs a register onto the stack and assigns that register to the name, performing stack swapping as necessary. The compiler should use an efficient algorithm for this, to avoid register thrashing.

Non-general purpose registers will be assigned their own names. Certain registers will become obscured through the housekeeping management of C-flat, though their values will remain important. The register %stack will always be the value pointing to the stack. The value __STACK_DIR__ will specify which direction the stack is expected to grow (most times, down). The value %return will always be the location of the return address, even where this is not actually a register. This value is commonly used in critical code, and thus a definitive access method must be given. Thus, in the x86 architecture, this value will always be a pointer into the stack, and in the MIPS architecture it will always be the $ra register. The %ip will always be the value of the current instruction. Its use is very likely not going to be available, but it is defined by this spec, though it's recommended to avoid its use! (Using it on an IA-32 processor may require a function call to ensure the value is received... this could be costly.) Finally, the value %retval will be the register in which C normally expects return values to be.

As you may have noticed, register names are prefixed with a '%' in C-flat. This convention is taken from the AT&T x86 assembly format, and actually carries little reason except for familiarity to the author. To assign a register in spec 2, you declare a variable like any other, but prefix the name with '%'; thus the code "int %counter, %step" will declare two names, "%counter" and "%step", which will be assigned to two integral general purpose registers for the remainder of their use. The only exception is those names which are also postfixed with a '%', which are the direct assembly names for the registers. Their use is almost exclusively in library definitions, or BIOS access definitions, where specific registers must contain specific values. (Such names will be available in spec 1 and spec 2.)

In spec 1, attempting to declare a register like a variable (such as with the code "int %counter, %step") will cause a compile-time error.

The Register Assignment Queue

If you intend to produce critical code, then it's important to know the number of registers that the processor you're targeting has, and then utilize those registers to the best of your ability. Unfortunately, you may not know specifically how many registers are going to be available. The value __NUM_REGISTERS__ will be provided for determining the number of registers for the compiling architecture, but this is of little use for producing the code you want; it serves more as a way to ensure that a register starved system (such as the HC11) won't try to compile the code (say, if you were expecting some 16 general purpose registers). Thus it's important to understand the dynamics of the assignment queue, and also to specify how this queue will work.

As mentioned above, the allocation of registers works in a queue based method. The allocation of registers occurs in two separate register spaces: data and address. On systems where data and address registers are the same, these will overlap in allocation, but in a system where there are actually two register types, this allocation will NOT overlap. (Note: The specification is that in the x86 architecture, the two registers %edi and %esi (and their 16 bit equivalents) are considered first for address values, and %edx will be avoided till last for address values, since the x86 architecture treats these registers very specifically.)

The first allocation of a value to a register will assign it the first available register in that register set (address or data). As further registers are allocated, they will be allocated to the next available register in that register set. As each register is accessed, it will be moved towards the front of the allocation queue; this ensures that the last register in the queue will be the least recently used (LRU) register. When compiling, the compiler will attempt to recognize which register allocations have become obsolete, and deallocate them; thus no register should become permanently locked into an allocation unless its access is constant throughout. Once a set of registers has been exhausted, the LRU register will be pushed onto the stack, and restored when next accessed. Every attempt will be made by the compiler to ensure that excessive stack operations do not occur, but it's suggested not to rely on this!
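The behaviour of the queue can be pictured with a small Python model of one register set. This is a sketch under assumptions, not the algorithm cbc is required to use: the class and method names are hypothetical, an accessed name is moved all the way to the front of the queue, and the log only records which names get spilled to and reloaded from the stack, not the real stack layout.

```python
class RegisterQueue:
    """Toy model of one register set (data or address)."""

    def __init__(self, physical):
        self.free = list(physical)   # physical registers not bound to a name
        self.queue = []              # bound names: index 0 is the front, last is the LRU
        self.where = {}              # name -> physical register holding it
        self.on_stack = set()        # names whose value was pushed onto the stack
        self.log = []                # spill/reload traffic the compiler would emit

    def access(self, name):
        """Return the register holding name, allocating or spilling as needed."""
        if name in self.where:
            self.queue.remove(name)
        else:
            if self.free:
                reg = self.free.pop(0)        # next available register in the set
            else:
                victim = self.queue.pop()     # back of the queue is the LRU name
                reg = self.where.pop(victim)
                self.on_stack.add(victim)
                self.log.append(("push", victim, reg))
            if name in self.on_stack:         # spilled earlier, so bring it back
                self.on_stack.remove(name)
                self.log.append(("pop", name, reg))
            self.where[name] = reg
        self.queue.insert(0, name)            # an accessed name moves to the front
        return self.where[name]

    def release(self, name):
        """Deallocate a name whose allocation has become obsolete."""
        self.queue.remove(name)
        self.free.append(self.where.pop(name))  # reused after untouched registers


q = RegisterQueue(["A", "B"])   # two data registers, as on the HC11
q.access("%counter")            # takes A
q.access("%step")               # takes B
q.access("%counter")            # %counter moves to the front; %step is now the LRU
q.access("%limit")              # set exhausted: %step is pushed, %limit takes B
print(q.log)                    # [('push', '%step', 'B')]
```

With only two registers and three live names, every further access to the spilled name forces another push, which is the thrashing that keeping allocations to a minimum avoids.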

It's recommended that one keep register allocations to a minimum if producing code that is intended to be compiled cross-system, because the number of allocatable registers will vary with respect to each system.

Parameters and Return Values

Parameters and return values are the most essential features of any functional language. Thus, it's important to specify how these will be treated and detail their features. The basic format for a parameter or return value declaration is the list. Unlike C, C-flat provides for multiple return values from a single function. Thus, the format provided is:

list ::= '(' <list-element> ( ',' <list-element> )* ')'

The format provided for list-element is:

list-element ::= <type> <identifier> ( '=>' <real-register> )?;

The '=>' indicates that the value is "attached" to the real-register; this is really only useful in declarations for BIOS calls, or in spec 1 code.
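The grammar above is small enough to check with a short Python sketch. It is illustrative only and makes several assumptions that this document does not settle: the name parse_list is hypothetical, a type is taken to be a single word, a real register is written in the %name% form, and the trailing ';' of the list-element rule is not expected inside the parentheses.

```python
import re

# <type> <identifier> ( '=>' <real-register> )?
_ELEMENT = re.compile(
    r"\s*(?P<type>[A-Za-z_]\w*)"        # assumption: a type is one word
    r"\s+(?P<name>%?[A-Za-z_]\w*)"      # identifier, optionally a register name
    r"\s*(?:=>\s*(?P<reg>%\w+%))?\s*",  # optional attached real register
    re.ASCII,
)


def parse_list(text):
    """Parse '( type name [=> %reg%], ... )' into (type, name, register) tuples."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("a list must be enclosed in parentheses")
    elements = []
    for part in text[1:-1].split(","):
        match = _ELEMENT.fullmatch(part)
        if match is None:
            raise ValueError(f"bad list element: {part.strip()!r}")
        elements.append((match.group("type"), match.group("name"), match.group("reg")))
    return elements


print(parse_list("( int count, int %fd => %ebx% )"))
# [('int', 'count', None), ('int', '%fd', '%ebx%')]
```

Because the grammar requires at least one list-element, the sketch rejects an empty pair of parentheses.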

Identifiers

Identifiers follow the standard C convention: an identifier must begin with an alpha character or '_', and each successive character may be any alpha-numeric character or '_'. The only deviation is the character '%', which may be prepended to a name to indicate that it be allocated directly to a register, and optionally also postfixed in order to directly specify a real register name. These real register names will generate an "invalid register name" error if compiling for a system that does not have that register name.
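The three kinds of name described above can be told apart with a single pattern, as in the following Python sketch. The function name classify_identifier and the labels it returns are hypothetical, and the sketch assumes "alpha" means the ASCII letters; checking a real register name against the target system is left out.

```python
import re

# optional leading '%', a C style identifier, optional trailing '%'
_IDENTIFIER = re.compile(r"(%?)([A-Za-z_][A-Za-z0-9_]*)(%?)")


def classify_identifier(word):
    """Return 'variable', 'register', 'real-register', or None if word is invalid."""
    match = _IDENTIFIER.fullmatch(word)
    if match is None:
        return None
    prefix, _, postfix = match.groups()
    if postfix and not prefix:
        return None                 # a trailing '%' only follows a leading '%'
    if prefix and postfix:
        return "real-register"      # direct assembly name, e.g. %eax%
    return "register" if prefix else "variable"


for word in ("_tmp1", "%counter", "%eax%", "9lives"):
    print(word, classify_identifier(word))
```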

Numeric and Otherwise Constant Values

Numeric constants are any word that starts with a numeric character. If the first character is not '0', then it is interpreted to be a decimal value. If the first character is a '0', then the following character specifies the base. If that character is 'x' or 'X', the number is interpreted to be a hexadecimal value. If the character is 'b' or 'B', it is interpreted to be a binary value. If the character is any numeric character, the value is interpreted to be octal.
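The base rules above translate almost line for line into code. The Python sketch below is one possible reading, with a hypothetical function name; treating a lone "0" as zero and rejecting digits that do not belong to the chosen base are assumptions, since the text does not cover either case.

```python
_DIGITS = "0123456789abcdef"


def _to_int(digits, base):
    """Convert digits in the given base, rejecting anything outside that base."""
    allowed = _DIGITS[:base]
    if not digits or any(d not in allowed for d in digits.lower()):
        raise ValueError(f"invalid base {base} digits: {digits!r}")
    return int(digits, base)


def parse_numeric_constant(word):
    """Return the value of a C-flat numeric constant written as text."""
    if not word or word[0] not in "0123456789":
        raise ValueError("a numeric constant must start with a numeric character")
    if word[0] != "0":
        return _to_int(word, 10)           # no leading '0': decimal
    if len(word) == 1:
        return 0                           # assumption: a lone '0' is zero
    marker = word[1]
    if marker in "xX":
        return _to_int(word[2:], 16)       # 0x...: hexadecimal
    if marker in "bB":
        return _to_int(word[2:], 2)        # 0b...: binary
    if marker in "0123456789":
        return _to_int(word[1:], 8)        # 0 followed by a digit: octal
    raise ValueError(f"unknown base marker {marker!r}")


for text in ("42", "0x1F", "0b101", "017"):
    print(text, parse_numeric_constant(text))   # 42, 31, 5, 15
```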

Any numeric value may be (and occasionally must be) preceded by a type classification. This type classification ensures that the size of the numeric value is the size of the type declared.

A character constant is produced by putting the value in single quotes ('). Only a single character is expected. The conversion defaults to the one specified at compile time in the compiler, and thus is system specific. This default may be overridden, though, by defining __CHAR_CONST_TYPE__ to __ANSI__ or __UNICODE__. Note that any character may be escaped, with the standard C conversions (\n = newline, etc).

A string is any number of characters enclosed in double quotes ("), any one of which may be escaped. A string can be extended across lines by escaping the newline (with a backslash ending the line) or by closing the quote and then reopening it on the next line; any number of string definitions occurring in a row are concatenated together. Note that unlike C, string constants are highly discouraged in open code. It's suggested that string constants be separately allocated. This is to increase code quality. Use of string constants in open code should only be done when there is absolutely no reason to pollute the name-space with the string.

Type Casting

The type of a value may be changed by preceding the value with a cast to a type. The format for a type cast is:

'(' <type> ')'

Casting a value does _NOT_ change its actual data, and performs no conversion on the value. Thus, a float value cast into an int will produce a value in the int which is the exact bit representation of the float value, zero-extended if necessary. It's important to realize that you must manually convert all values yourself! Note, though, that prebuilt functions are provided for converting, so that these operations need not be re-implemented each time. These functions are defined as:

int float2int ( float value );

float int2float ( int value );

int sign ( <size-specific-type> value );

No conversion between semi-size-specified values (double, single, long, short, etc...) is provided, since most math operations are intended to be done only on int and float values. The sign function is provided to ensure proper sign conversion between a size-specific value and the non-size-specific value.
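The difference between a cast and a conversion is easiest to see side by side. The Python sketch below models both, under assumptions this document does not state: int and float are taken to be 32 bits wide with float in IEEE 754 single format, float2int truncates toward zero, and sign performs two's complement sign extension. The cast_* helpers are hypothetical names standing in for the '(int)' and '(float)' casts, and sign takes the width as an explicit second argument here because Python values carry no size-specific type.

```python
import struct


def cast_float_to_int(value):
    """Model of '(int) f': the same 32 bits, with no conversion performed."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def cast_int_to_float(bits):
    """Model of '(float) i': the same 32 bits, with no conversion performed."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def float2int(value):
    """Model of the prebuilt float2int (truncation is an assumption)."""
    return int(value)


def int2float(value):
    """Model of the prebuilt int2float."""
    return float(value)


def sign(value, bits):
    """Model of the prebuilt sign: sign-extend a bits-wide value."""
    value &= (1 << bits) - 1
    top = 1 << (bits - 1)
    return (value ^ top) - top


print(hex(cast_float_to_int(1.0)))   # 0x3f800000, the bit pattern of 1.0
print(float2int(1.0))                # 1, the converted value
print(sign(0xFF, 8))                 # -1
```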

The lack of automated conversion should allow for easy pointer manipulation without resorting to the C work-around of using UNIONs; this is the reason why C-flat does not provide a type similar to a UNION.