Advanced Input and Output

Here's some information on input and output that's important to the class, but beyond what we're actually going to program: forms of IO that are used in larger systems than the ones we deal with here. The two I'd like to cover are DMA and IO Processors.

The basic point of both is the same: move more of the processing load for IO away from the CPU and into the peripheral device. There are two reasons to do this. First, it reduces the load on the CPU, so the CPU can spend more time working on user programs. Second, a custom processor can do the special-purpose operations needed for IO more quickly than the CPU can anyway (this may have changed with the Pentium 4, but last I heard, modern graphics chips actually had more transistors than modern CPUs!).

Direct Memory Access (DMA)

If we have a large amount of data to transfer in a stereotyped way, it makes sense to let the device itself take care of it. A perfect example of a device that's appropriate for DMA is a disk drive: data is on the device in nice regular-sized chunks (called sectors), and you always transfer a whole sector at a time between the device and memory. So it makes really good sense to add some more registers to the device: a pointer to the current location in memory it's writing to, and a counter saying how much data is to be sent.

Basics

The basic idea is that the device has two extra control/status registers: a pointer register, and a count register. The pointer register contains the address we're going to write to/read from next, and the count register says how much more data we have to transfer.

So, suppose a device is doing a DMA transfer to memory. We start by loading up the pointer register with the location in memory we want to transfer to. Then, we load up the count register with the amount of data we want to transfer. The ``start'' part of the device driver is done now, and we go back to running our program.
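
As a concrete sketch, here's roughly what that ``start'' routine might look like in C for a made-up memory-mapped disk controller. The register addresses, the names, and the detail that loading a non-zero count starts the transfer are all assumptions for illustration, not any real device's interface.

    #include <stdint.h>

    /* Hypothetical memory-mapped DMA registers; the addresses are invented. */
    #define DMA_POINTER (*(volatile uint32_t *)0xFFFF0010)  /* next memory address */
    #define DMA_COUNT   (*(volatile uint32_t *)0xFFFF0014)  /* bytes left to move  */

    #define SECTOR_SIZE 512

    /* The ``start'' part of the driver: set up the transfer and return.
       The device does the rest and interrupts us when the count hits zero.
       (This sketch assumes a 32-bit machine, so a pointer fits in the register.) */
    void dma_start_read(void *buffer)
    {
        DMA_POINTER = (uint32_t)(uintptr_t)buffer;  /* where in memory the data goes  */
        DMA_COUNT   = SECTOR_SIZE;                  /* loading a non-zero count kicks
                                                       off the transfer on this device */
    }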

Now, as each byte of data becomes available, the device asserts a control line on the bus to request permission to do a transfer. The CPU responds by asserting a control line that grants permission (this is done completely by the hardware - there is no software intervention). The device writes the byte of data to the memory, adds one to the pointer, and subtracts one from the counter. This continues until the transfer is complete.

When the transfer has completed, the device requests an interrupt to let the CPU know it's all done.
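
The hardware's side of this isn't software at all, of course, but a little model in C may make the per-byte cycle easier to follow. Everything below is a description of the behavior, not code that would ever run in a driver.

    #include <stdint.h>
    #include <stddef.h>

    /* A software model of what the DMA hardware does, one byte at a time:
       move a byte, bump the pointer, drop the count, and interrupt at zero. */
    void dma_model(uint8_t *memory, uint32_t *pointer, uint32_t *count,
                   const uint8_t *device_data)
    {
        size_t i = 0;
        while (*count > 0) {
            /* (the bus request/grant handshake happens here, in hardware)   */
            memory[*pointer] = device_data[i++];  /* move one byte            */
            *pointer += 1;                        /* next memory address      */
            *count   -= 1;                        /* one less byte to go      */
        }
        /* count reached zero: the device would now request an interrupt */
    }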

Variations

There are a lot of variations on all of this. Here are a few of them:
  1. GO bit. I described the transfer as starting as soon as you load a non-zero value into the COUNT register. Many devices have a GO bit in one of their control registers; once you've set everything up, you have to set this bit to 1 before anything actually happens (there's a sketch of this after the list).
  2. Transfer size. What I described (one byte at a time) is what you'd do if you had a one-byte-wide bus. If you've got a word-size bus, or a wider bus (current PCI busses are 64 bits wide), that's how much you'd transfer at a time. Frequently (but not always), the Count register is in units of ``transfer size'' rather than bytes as well (so if it were a word-size transfer, you'd specify how many words to transfer instead of how many bytes).
  3. Burst mode. Some DMA devices support a ``burst mode.'' If it transfers in burst mode, it grabs the bus and then does a bunch of transfers (maybe even the whole requested buffer) before giving the bus back. This is appropriate for a device that can either generate or receive data very quickly, but can have a really nasty effect on overall system performance.
  4. First and Third Party DMA. What I've described so far is ``first party DMA,'' which is how most systems other than PCs do it, and how most texts describe it.

    PCs use a slightly different technique called ``third party DMA.'' The idea here is that there is actually a separate controller to do DMA, which maintains the pointer and count registers. When a device is ready to transfer a byte, it asks the DMA controller to do the transfer for it; the DMA controller handles the bus handshaking while the device just reads or writes the bus's data lines. The original PC had a DMA controller that supported four DMA channels (of which users could use three; channel 0 was dedicated to refreshing dynamic RAM); there are now two DMA controllers which can handle eight channels (though 0 and 3 are not usable by other devices).

    Just to make things more confusing, this is too slow to support high-performance disks and the like, so these devices use first-party DMA.
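
To make the first two variations concrete, here's how the earlier sketch changes if the (still hypothetical) device has a GO bit in a control register and counts in 32-bit words rather than bytes; the bit position and register addresses are invented.

    #include <stdint.h>

    #define DMA_POINTER (*(volatile uint32_t *)0xFFFF0010)
    #define DMA_COUNT   (*(volatile uint32_t *)0xFFFF0014)
    #define DMA_CONTROL (*(volatile uint32_t *)0xFFFF0018)
    #define DMA_GO      0x1u                  /* invented bit position        */

    #define SECTOR_SIZE 512

    void dma_start_read(void *buffer)
    {
        DMA_POINTER  = (uint32_t)(uintptr_t)buffer;
        DMA_COUNT    = SECTOR_SIZE / 4;       /* count is in words, not bytes */
        DMA_CONTROL |= DMA_GO;                /* nothing happens until GO = 1 */
    }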

IO Processors

This basic idea of offloading IO processing can be extended; IBM was the first I know of to introduce the idea of IO processors (which IBM called channels) back in the 1950s.

The idea with IO processors is that we can create a whole sequence of IO operations to perform, do them all, and not get an interrupt until after the whole sequence is over. So, we have a brain-damaged little computer whose sole purpose in life is to do IO operations.

The archetypal device for use with channels was always the tape drive: it was quite reasonable to put together a sequence of operations like "skip over the first tape record, read the next three, skip one, write the next two, and then rewind the tape."
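
Just as a sketch of the idea (the command names and encoding here are invented, not the real IBM channel-command format), that tape sequence might be handed to the channel as something like:

    /* An invented encoding of a channel program for the tape example above.
       The point is only that a whole list of IO operations is built in memory
       and given to the IO processor in one shot. */
    enum tape_op { SKIP_RECORD, READ_RECORD, WRITE_RECORD, REWIND };

    struct channel_command {
        enum tape_op op;
        int          count;          /* how many records this applies to */
    };

    static const struct channel_command tape_program[] = {
        { SKIP_RECORD,  1 },         /* skip over the first tape record  */
        { READ_RECORD,  3 },         /* read the next three              */
        { SKIP_RECORD,  1 },         /* skip one                         */
        { WRITE_RECORD, 2 },         /* write the next two               */
        { REWIND,       0 },         /* and rewind the tape              */
    };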

Today, the archetypal device is probably a graphics chip. Think of what's involved in drawing a realistic scene in real-time, as modern games require. First, the objects in the scene have to be broken down into polyhedra in three dimensions for drawing (real curved surfaces take too much processing -- so far!). It turns out that triangles have a lot of advantages for processing: you can't make a mistake and create a non-planar triangle, triangles are inherently convex, and an algorithm to draw a triangle is really easy. So: all the surfaces of the polyhedra are triangles.

A "program" for a graphics chip is a list (really an array) of triangles. Each of them is defined by their location and by a host of parameters defining their appearance. Typical parameters include the color at each of their vertices, their degree of transparency (that's called the alpha channel), and how shiny they are. The list is sorted so that triangles in back can be drawn first. This list of triangles is generated by the program. The graphics card goes through the list, performs some simple culling (don't draw any that aren't actually on the screen. Don't draw any that are facing away from the screen. Etc.) and draws them all. After it's all done, the CPU gets an interrupt; it's been busy constructing a new list of triangles while the graphics chip was rendering the old list, so it should be all set to tell the graphics chip where the new list is, and away we go.

The all-time champion of IO processor use was probably the CDC 6600, back in 1964. This was a hideously expensive machine, designed for and dedicated to scientific programming; it's one of the many computers that people have referred to as "the first supercomputer." It was felt that CPU cycles on this machine were too expensive to waste on IO at all, so it had ten minicomputers which it used for doing the IO (CDC called these "peripheral processing units," or PPUs).

Unlike the IO processors we've talked about so far, these were indeed full-fledged minicomputers. In fact, the OS even ran on one of them, rather than on the CPU!