The most effective way to look at the tradeoff between narrow, wide, and pipelined memory covered in the book seems to be to examine the most common memory in recent PCs, SDRAM.
Most discussions of SDRAM focus on the fact that it is synchronized with the system clock. For our purposes, what matters is how it is accessed, so we can set that detail aside. The information here comes mostly from Texas Instruments' SDRAM Technical Reference.
DRAM (including SDRAM) is always arranged as an array. An address is broken into two parts, a row and a column, before being presented to the array. The row is presented first, activating it, and then the column is presented. Naturally, getting successive columns from within a single row is faster than getting data from several rows, since there is no need to send a new row address.
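To make the row/column split concrete, here is a minimal sketch in Python. The geometry (10 column bits, so 1024 columns per row) and the function names are assumptions chosen for illustration, not taken from any datasheet. It splits an address into a row and a column, then counts how often a new row address would have to be presented for a sequence of accesses.

```python
# Hypothetical geometry: 10 column bits, i.e. 1024 columns per row.
# This value is an illustrative assumption, not from a real part.
COLUMN_BITS = 10


def split_address(addr, column_bits=COLUMN_BITS):
    """Split a linear array address into (row, column)."""
    row = addr >> column_bits                    # high-order bits select the row
    column = addr & ((1 << column_bits) - 1)     # low-order bits select the column
    return row, column


def count_row_activations(addresses, column_bits=COLUMN_BITS):
    """Count how many times a new row address must be presented."""
    activations = 0
    open_row = None
    for addr in addresses:
        row, _ = split_address(addr, column_bits)
        if row != open_row:      # a different row: it must be activated first
            activations += 1
            open_row = row
    return activations


if __name__ == "__main__":
    print(split_address(1025))
    print(count_row_activations(range(8)))               # successive columns
    print(count_row_activations([0, 1024, 2048, 3072]))  # one access per row
```

Under these assumptions, eight successive addresses stay in one row and need a single activation, while four addresses spaced a full row apart need four.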
The three most important features of SDRAM, from our perspective, are
What does this mean for our attempts to balance the cost of the memory bus against the performance we want? There are two extremes. With plain DRAM on a narrow bus, we would incur the full access latency on each eight bytes of a cache block, and the bus would be waiting out that latency most of the time. With DRAM on a wide bus, we would pay more for the bus, and it would still spend most of its time sitting and waiting for data. Instead, we can balance the width of the bus against the latency of the memory, so that we get better bus utilization than the wide version and higher speed than the narrow version.
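A rough cost model can make this comparison concrete. The sketch below is in Python, and every number in it is an illustrative assumption rather than a measured SDRAM figure: 1 bus cycle to send an address, 15 cycles of memory latency, 1 cycle per bus transfer, and a cache block of four bus-width chunks. The pipelined case is modeled in one simple way, assuming the latency is paid once and the chunks then follow one per cycle.

```python
# All timings are illustrative assumptions, in bus clock cycles.
ADDR = 1       # cycles to send the address
LATENCY = 15   # cycles before the memory has data ready
TRANSFER = 1   # cycles to move one bus-width chunk
CHUNKS = 4     # narrow-bus chunks per cache block (hypothetical)


def narrow(chunks=CHUNKS):
    """Narrow bus: every chunk pays the full latency, then one transfer."""
    total = ADDR + chunks * (LATENCY + TRANSFER)
    busy = chunks * TRANSFER
    return total, busy


def wide():
    """Bus as wide as the block: one latency, one (wide) transfer."""
    total = ADDR + LATENCY + TRANSFER
    busy = TRANSFER
    return total, busy


def pipelined(chunks=CHUNKS):
    """Narrow bus, latency paid once, then one chunk per cycle (assumed model)."""
    total = ADDR + LATENCY + chunks * TRANSFER
    busy = chunks * TRANSFER
    return total, busy


def utilization(total, busy):
    """Fraction of the block fetch during which the bus is moving data."""
    return busy / total


if __name__ == "__main__":
    for name, model in (("narrow", narrow), ("wide", wide), ("pipelined", pipelined)):
        total, busy = model()
        print(f"{name:9s} {total:3d} cycles, utilization {utilization(total, busy):.2f}")
```

With these assumed numbers the narrow bus needs 65 cycles per block, the wide bus 17, and the pipelined arrangement 20. The pipelined version is much faster than the narrow one and nearly as fast as the wide one, and its bus is moving data for a larger fraction of the time than either.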
The P4 L2 Cache has the following specifications:
While we're at it, the P4 also has two L1 caches: an 8 KB, 4-way set associative data cache (the specifications I found didn't mention the block size), and an instruction cache with a completely unspecified structure that holds up to 12K micro-ops (so the instruction cache doesn't cache instructions!).