PL: Data Types

From old CS 471 notes…

Data Types

Data Type Basics

A data type defines both the set of representable values and also the allowed operations on the values.

Information about typed variables needs to be maintained statically (at compile time) and possibly dynamically (at runtime).

Primitive Data Types

Primitive types are often closely tied to the data types that CPUs can process natively: signed and unsigned integers of various sizes, and floating point values.

Note: Intel x86 implements quite a few machine instructions that manipulate strings, but these are rarely utilized by compilers.

Integer types

Signed integers use two’s complement representation, while unsigned integers are plain magnitude-only binary representation.

C defined the popular “char/short/int/long” integer size progression, but did not actually specify what those sizes should be!

At least one computer was made with 9-bit bytes, with “char” being 9 bits.

C “int” typically varies from platform to platform to match the CPU’s “native” integer size, but even on 64-bit CPUs, “int” has remained 32 bits for the most part.

Java “byte/short/int/long” have exact size specifications of 8/16/32/64 bits respectively.

Floating point types

CPU designers used to design and implement their own unique floating point representations, but today virtually all computers use the IEEE 754 standard for representing and operating on real values.

IEEE 754 is essentially binary scientific notation, where a value has an exponent and a mantissa. Both pieces need to be able to carry a sign, but two’s complement is not used for either.

The mantissa is stored sign-magnitude, with a separate sign bit indicating the sign of the overall value.

The exponent is stored as an unsigned offset from a base exponent value.

IEEE type   PL type   Total bits   Exponent bits   Mantissa bits   Range               Decimal precision
Single      float     32           8                23              10^-38 to 10^38     ~7 digits
Double      double    64           11               52              10^-308 to 10^308   ~15 digits
Quad        ?         128          15               112             lots                lots

IEEE specifies unique reserved bit patterns for special values: zero, infinity, not-a-number (NaN), etc.

IEEE specifies accuracy and rounding constraints for mathematical operations on floating point values.

Intel’s x87 floating point unit internally computes operations on 80-bit values, then shortens them to float/double as they are stored back to memory. (Newer SSE-based code operates on 32/64-bit values directly.)

Binary-coded decimal (BCD)

Hardly ever seen anymore, but many CPUs still support it.

Usually two decimal digits per byte, 4 bits per digit.

It wastes some bit patterns, but makes converting back and forth (for input/output) very easy.

The Motorola 68HC11 has BCD support, including the H (half-carry) flag, which signals a carry from the lower nibble to the upper nibble of a byte.

Boolean and Character types

Most newer languages have some sort of boolean type (including C99 and C++).

Older C (C89) used integers as boolean: zero is false, nonzero is true.

Having a boolean type inherent in the language makes conditional control flow much cleaner: control expressions must result in a boolean value.

Booleans can be packed into compact representation, but rarely are.

Character types have undergone radical change in recent years.

Old School: all that mattered was 7-bit ASCII (or perhaps 8-bit ISO 8859-1), and so one byte for a character.

New School: Software is used everywhere, needs extensive native language support, so support Unicode with 16-bit characters.

Java was “first” to do this “natively”.

You can read much more on Unicode, C/C++ wide characters (http://en.wikipedia.org/wiki/Wide_character), or even internationalize your program to support Klingon.

Internationalization is the process and support for creating software whose interface can be translated into various languages.

Typically, all strings are either stored externally in a database or file (or in a resource fork or section of an executable), and then the strings can be changed without recompiling the program.

You can read more on internationalization.

Translating the strings is only a very superficial support of internationalization: some languages are written in different directions, and many cultural issues arise in the design/use of the overall interface.

Strings

Computers are useless if they can’t produce output that humans understand (exception: embedded control devices, but even these need occasional user interaction).

IMO, far more cycles are burned composing, formatting, and producing output, or reading and parsing input, than are burned doing “number crunching”.

Thus, language support for “strings” is of utmost importance.

Lots of decisions to make, one being: how much do you put into the language itself, and how much do you put into a runtime library?

C: “raw” strings, lots of library functions, very dangerous.

C++: string class with mutable strings.

Java: String class with immutable strings.

In Java, every string modification operation (function) produces a new string object.

Searching strings and tokenizing strings according to some rules is very common.

Many scripting languages support regular expression pattern matching and tokenizing.

Learning how to use these features is vitally important to using the language effectively.

User-Defined Ordinals

Ordinals, or enumerations, are very useful in programming.

Most programmers underutilize enumerations, myself included.

C/C++ enumerations are fairly limited, but still good to use.

Java enumerations are full-blown classes. Nice.

Namespace for values: Java gives each enum its own class scope; C/C++ dump enumerator names into the enclosing (usually global) scope (bad!).

Arrays and Associative Arrays

You all know what arrays are, so we’ll stick to interesting points.

Storage: row-major (C/C++/Java) or column major (Fortran). In HPC (High Performance Computing), knowing how your arrays are laid out and how they might interact with the memory subsystem (cache, etc.) is very important.

Rectangular or jagged: book says C/C++/Java support jagged arrays. How?

C++ has std::vector and Java has ArrayList, classes that implement dynamic arrays. Very useful for unknown-sized data collections.

Lots of programming tasks require maps of data based on non-integral keys. So scripting languages almost always have associative arrays.

Associative arrays are often called hashes, hashmaps, or dictionaries.

Basically, store a record of data based on a key of any type (often a string).

Records and Unions

While arrays hold aggregations of uniform, homogeneous data, records and structures hold aggregations of disparate, heterogeneous data.

Most real-world data are collections of heterogeneous data.

A record takes up some amount of memory, depending on the size of the fields and how the compiler decides to lay out those fields in memory.

For example, does the compiler pack two or more adjacent “char” fields into the same memory word, or does it align every field on a word boundary for efficient memory access?

Assignment and equality comparison support:

  • does a programming language allow full structure to structure assignment? (most yes);
  • does a PL allow subset of fields assignment? (book: COBOL yes);
  • does a PL allow records to be compared for equality?

Unions are structures that hold different types of data not at the same time (like records), but at different times during program execution.

If there is any type checking on the use of union fields, it must be done dynamically (at runtime).

Languages that support runtime union field type checking implement discriminated unions; languages that don’t implement free unions.

Discriminated unions carry a tag (discriminant) which holds the type of the data that is currently held in the union variable.

Note that in languages that only have free unions, the programmer typically must have some sort of external discriminant in order to properly access a union variable.

In OO languages, subclassing provides a better alternative to discriminated unions.

Pointers and References

Most programs benefit from being able to manage dynamic storage, in order to grow their data structures to the size appropriate for the current execution.

Most programs can also benefit from being able to have variables that indirectly refer to some data; that is, they can be changed to refer to some other data.

I’ve heard it said that indirection is one of the most powerful concepts in computer science; I probably agree.

Pointers are variables that can typically be thought of as holding actual memory addresses of where data is stored.

Pointers assume some memory model of how collections such as arrays are laid out; thus one can do pointer arithmetic to change what a pointer is pointing to.

Warning: if you use pointer arithmetic or need to understand pointer arithmetic, be sure you understand what ‘1’ means!

Pointers are generally typed according to what they point to, but C/C++ allow void pointers, which point to any type of data (not ‘nothing’). Void pointers can be used as an unsafe generic data reference, or for low-level memory-type operations.

References are much stricter in that they allow indirect reference to an object or variable, but do not assume anything about what is around or outside that variable.

References can only be assigned to refer to actual data; “arithmetic” to change a reference makes no sense.

References are typically much safer than pointers.

The C programming language has three ways to dereference a pointer:

  • directly, using the ‘*’ operator;
  • indirectly, using array indexing;
  • indirectly, using the ‘->’ pointer field access operator

In Java, the use of a reference automatically dereferences it. (Q: what does (o1 == o2) mean in Java?)

Pointers, using arithmetic or array indexing, can end up allowing access to invalid data or even illegal addresses (causing a memory violation).

References (and pointers) in languages that require explicit deallocation can end up referring to already-deleted data; the dangling pointer or reference problem.

References in Java cannot be dangling because Java will not delete an object until there are no references to it anymore. But Java references can be incorrectly accessed before being assigned, thus it still has the null reference problem.

If a language runtime tracks all references, it can actually optimize memory usage by compacting and/or shuffling data around and updating references to refer to the correct place.

With pointer arithmetic and pointer-array-indexing, it is impossible for a language to track references and always know exactly how many references are pointing to each piece of data.

This is the pointer aliasing problem; lots of research on alias analysis has been done to try to improve the capabilities of tools to find and report pointer problems in languages like C/C++.

C++ also has references, which act like constant pointers, except that they are automatically dereferenced when accessed.