Computational Thinking: Data Representation
This page builds on our introduction to computational thinking. Recall our five foundations of computational thinking:
- Abstraction
- Decomposition
- Pattern Recognition
- Data Representation
- Algorithms
Data Representation
Many decades ago a famous software developer, Fred Brooks, said, “"Show me your [code] and conceal your [data], and I shall continue to be mystified. Show me your [data], and I won’t usually need your [code]; it’ll be obvious.” (e.g., here and here). Or (on that first link), Linus Torvalds saying, “Bad programmers worry about the code. Good programmers worry about data structures and their relationships.”
Data representation will make or break your large programming project! Never forget this.
So learning about data, how it is represented in a computer, and how we build abstractions on top of that, is important.
The Lowest Level: Bits to Information
Computers process bits. We need those bits to represent meaningful information for us. Thankfully, we don’t have to start from scratch with every program: there are lots of accepted, standardized data representations for simple data pieces:
- bits can represent unsigned integers, where each bit’s value is the power of 2 for its position; with N bits we can represent the values 0 to 2N-1; we usually use 32-bit and 64-bit versions, but still often use just 8 bits (1 byte) or sometimes 16 bits.
- bits can represent signed integers, similar to the first but using Two’s Complement representation, which allows for negative values; with N bits we can represent -(2N-1) to +(2N-1-1);
- bits can represent real values, which we call floating point; this is essentially binary scientific notations, with a mantissa and an exponent; the format is an IEEE standard, and we most often use the 32-bit and 64-bit formats.
- bits can represent characters in strings; in olden days and in many lower level modern software, these characters are 1-byte ASCII or ISO 8859 values, but in modern user-facing software, multi-bytes formats representing Unicode are used, such as UTF-8;
- other common data is built on these, such as hex color codes.
For simple and hobby programming in a specific programming language, we can often ignore how it exactly represents our data from the above choices. But when we get more serious and are going to create a real product, we really need to know what data representations are happening underneath our code.
In really low-level programming we may need to go study and learn the bit-level representations of all kinds of data: image formats, sound formats, data compression formats, encryption standards, and many more!
Organizing Data 1: Attributes of Something Real
Often in our program we will have individual pieces of data that are all attributes of the same concept, or thing. For example, for the Motor Vehicle Department, I have a name, date of birth, home address, height, weight, eye color, license number, and so on. Each of these is an attribute, a piece of data, that contributes to describing me.
A crucial part of programming is figuring out what are the necessary attributes for each of the data concepts that our program needs to use, and then picking a representation for them. For example, my weight for the MVD is an integer number of pounds, but my eye color is a 3-character string (“BRO”, for brown, I guess).
I have a friend who works for a Canadian MVD and they want to figure out how to accomodate names that are written in First Nation (Native American) languages, but their legacy software doesn’t use Unicode; so they are looking at spending lots of money for an upgrade.
Programming languages all have mechanisms to group a set of related attributes into a single data record. C has structs, Python has tuples and dictionaries, and C++ and Java (and Python) have objects. Objects are a bigger idea than just data, but all of these allow use to declare that a collection of data attributes should be packaged together, and that all together they represent a single, larger, conceptual item.
Once again, these larger conceptual items must be named. Naming is everywhere in computer science! But once we have them, then they can become attributes of other concepts. For example, an address is a complicated thing itself with separate pieces of data, but once I describe it then I can use it and just say, “one attribute of a customer is their shipping address.”
As another example, I can create a concept of a point in 3-dimensional space as an (x,y,z) 3-tuple (remember graphing?). Then I can create a concept of a 3D (flat) rectangle as a 2-point tuple, with one point being the lower left point and the other point being the upper right. This is all you need to describe a true rectangle – although a more general 4-corner polygon would need the points of all four corners recorded separately.
Organizing Data 2: Collections of Things
The heart of computer science – dealing with collections of data! In so many places we fundamentally need to deal with collections of the same kind of things. Our database of customers, the list of items in a shopping cart, the set of items in the game background, the set of options available for a pizza topping, and on and on.
With any collection, we have to consider:
- how often do we need to add new items to it? Is it built once and rarely changes, or does it need additions often?
- how often to we need to remove items? Never? Often?
- how do we use the items? do we usually just process the whole list, or do we often need to search and find particular items?
These questions help drive our selection of a data representation. If we have to search, we probably need some sort of sorted, or ordered reprsentation, possibly with some additional search key support.
The most basic collection is a list or array of things. Items are conceptually in a line, with each position generally indexed by their numerical position (often starting at position 0 (zero)). If you know the index, accessing an item is trivially easy, but if you don’t you may need to start from the beginning and search. If you keep items ordered this can be sped up. Adding a new item to the end is easy, but adding into the center, and removing items, is more difficult.
For more complex operations on ordered data, a tree is a useful, basic, data structure. You’ll learn more about trees in a data structures course. It allows fast searching and faster adding and removing of items.
Python, like many scripting and programming languages, support a data collection called a dictionary, and sometimes in other languages a hashmap (and associative array!). These have become so ubiquitous that many programmers think that they don’t need to know how to use anything else. That’s both great (easy programming) and depressing (there’s so much more)! In these, the items in your collection are stored based on a key, and then you can access them based on that key. Searching without a key is expensive, but everything else is generally very fast.
Organizing Data 3: Associations between Things
Collections of the same kind of things are great, but those things are often related to other things of different kinds: an online shopping cart is associated to the customer, items are placed “in” the cart and so are now associated to it, and so on. We need to be able to represent in our data associations between different things. This is a connection from one item to another that our program can use to get to the second item.
We often call the connection a reference. For example, if a customer record (or object) has a reference to a shopping cart, then we can get to the shopping cart record (or object) from the customer by following (or traversing the reference. However, if the shopping cart record does not have a reference to the customer, then we cannot get from the shopping cart to the customer. References are one-way; if we want both way, we must have two references, one on each side.
Essentially, a reference is an attribute of a data record. It is not a characteristic of the data record (like my eye color for the MVD), but it is an attribute that captures the relationship between one thing and another.
References look different in different programming languages. In Python, if you know the dictionary an item is in then its key is a reference to it. Java has object references, C has pointers, and so on. Learning how references are created, used, and act in your programming language is important.
A key decision in programming is the multiplicity question: How many items of some other kind do I need references to? If it is only one, then a simple direct reference can be used; but if might be more, then I need something like a list of references. Not getting this right can lead to costly mistakes!