Direct Rendering Infrastructure, Low-Level Design Document

Kevin E. Martin, Rickard E. Faith, Jens Owen, and Allen Akin Precision Insight, Inc.

$Date: 1999/05/11 22:45:24 $, $Revision: 1.23 $

While rendering 2D primitives can be efficiently and effectively handled by a single rendering pipeline in the X server, multiple pipes are required for rendering 3D primitives in a fast and responsive manner. In this low-level design document, we present a data-centric analysis of the interfaces required for the implementation of the direct rendering infrastructure. We examine the interfaces required by the XFree86 X server (with the GLX extension), a client making OpenGL calls, and a kernel-level DMA engine and device driver. We also outline the process by which hardware accelerated 3D video card support should be added.

1. Preamble

1.1 Copyright

Permission is granted to make and distribute verbatim copies of this document provided the copyright notice and this permission notice are preserved on all copies.

1.2 Trademarks

OpenGL is a registered trademark of Silicon Graphics, Inc. Unix is a registered trademark of The Open Group. The `X' device and X Window System are trademarks of The Open Group. XFree86 is a trademark of The XFree86 Project. Linux is a registered trademark of Linus Torvalds. Intel is a registered trademark of Intel Corporation. All other trademarks mentioned are the property of their respective owners.

2. Introduction

In this paper we present the theory of operation for the direct-rendering infrastructure (DRI). The basic architecture of the DRI involves three separate components: the X server, the direct rendering client, and a kernel-level device driver. Each of these components work together to provide direct access to the hardware for 3D rendering (via OpenGL). In the first section, we present the main communication pathways between each of these components.

In the next three sections, we give an overview of possible implementations, analyze their benefits and costs, and propose a specific implementation. Our analysis has been broken down into initialization, steady state, and finalization sections. This framework is meant to guide, but not restrict, future enhancements to the DRI. We conclude with potential enhancements for the DRI.

3. Main communication pathways

On PC-class hardware, there are two basic mechanisms for sending rendering commands to the graphics device: PIO/MMIO (see glossary for specific definitions) and DMA. The architecture described in this document is designed around DMA-style hardware, but can easily be extended to accommodate PIO/MMIO-style hardware. The main communication pathways between the three main components and the graphics device for a DMA-style design is shown below.

--------------------                --------------------
| Client (*)       |                | X server (*)     |
|                  |                |                  |
|                  | <----PROTO---> |                  |
|                  |                |                  |
|                  |                |                  |
|                  | <----SAREA---> |                  |
|                  |        |       |                  |
|                  |        |       |                  |
--------------------        |       --------------------
   |    |    |              |             |    |    |
   |    |    |              |             |    |    |
   |    |    |              |             |    |    |
   |    |  IOCTL            |           IOCTL  |    |
   |    |    |              v             |    |    |
   |    |    |    --------------------    |    |    |
   |    |    |--> | Kernel           | <--|    |    |
   |   DMA        |                  |        DMA   |
   | BUFFERS----> |                  | <----BUFFERS |
   |              |                  |              |
   |              --------------------              |
  MMIO                      ^                   MMIO/PIO
   |                        |                       |
   |                    DMA/MMIO                    |
   |                        |                       |
   v                        v                       v
--------------------------------------------------------
| Graphics Device                                      |
|                                                      |
--------------------------------------------------------

(*) Note: This figure is incomplete. The layering inside the Client and X server still needs to be defined and will be added in a later version of this document.

In this figure:

Client is a user-space X11 client which has been linked with various modules to support hardware-dependent direct rendering. Typical modules may include:
- libGL.so, the standard OpenGL (or Mesa) library with our device and operating system independent acceleration and GLX modifications.
- libDRI.so, our device-independent, operating-system dependent driver.
- libHW3D.so, our device-dependent driver.
X server is a user-space X server which has been modified with device and operating-system independent code to support DRI. It may be linked with other modules to support hardware-dependent direct rendering. Typical modules may include:
- libDRI.so, our device-independent, operating-system dependent driver.
- libH2D.so, our device-dependent driver. This library may provide hardware-specific 2D rendering, and 3D initialization and finalization routines that are not required by the client.
Kernel Driver is a kernel-level device driver that performs the bulk of the DMA operations and provides interfaces for synchronization. [Note: Although the driver functionality is hardware-dependent, the actual implementation of the driver may be done in a generic fashion, allowing all of the hardware-specific details to be abstracted into libH3D.so for loading into the Kernel Driver at DRI initialization time. An implementation of this type is desirable since the Kernel Driver will not then have to be updated for each new graphics device. The details of this implementation are discussed in an accompanying document, but are mentioned here to avoid later confusion.]
PROTO is the standard X protocol transport layer (e.g., a named pipe for a local client).
SAREA is a special shared-memory area that we will implement as part of the DRI. This area will be used to communicate information from the X server to the client, and may also be used to share state information with the kernel. This area should not be confused with DMA buffers. This abstraction may be implemented as several different physical areas.
DMA BUFFERS are memory areas used to buffer graphics device commands which will be sent to the hardware via DMA. These areas are not needed if memory-mapped IO (MMIO) is used exclusively to access the hardware.
IOCTL is a special interface to the kernel device driver. Requests can be initiated by the user-space program, and information can be transfered to and from the kernel. This interface incurs the overhead of a system call and memory copy for the information transfered. This abstract interface also includes the ability of the kernel to signal a listening user-space application (e.g., the X server) via I/O on a device (which may, for example, signal the user-space application with the SIGIO signal).
MMIO is direct memory-mapped access to the graphics device.

4. Initialization Analysis

The X server is the first application to run that is involved with direct rendering. After initializing its own resources, it starts the kernel device driver and waits for clients to connect. Then, when a direct rendering client connects, SAREA is created, the XFree86-GLX protocol is established, and other direct rendering resources are allocated. This section describes the operations necessary to bring the system to a steady state.

4.1 X server initialization

When the X server is started, several resources in both the X server and the kernel must be initialized if the GLX module is loaded.

Obviously, before the X server can do anything with the 3D graphics device, it will load the GLX module if it is specified in the XFree86 configuration file. When the GLX module (which contains the GLX protocol decoding and event handling routines) is loaded, the device-independent DRI module will also be loaded. The DRI module will then call the graphics device-dependent module (containing both the 2D code and the 3D initialization code) to handle the resource allocation outlined below.

X resource allocation initialization

Several global X resources need to be allocated to handle the client's 3D rendering requests. These resources include the frame buffer, texture memory, other ancillary buffers, display list space, and the SAREA.

Frame buffer

There are several approaches to allocating buffers in the frame buffer: static, static with dynamic reallocation of the unused space, and fully dynamic. Static buffer allocation is the approach we are adopting in the sample implementation for several reasons that will be outlined below.

Static allocation. During initialization, the resources supported by the graphics device are statically allocated. For example, if the device supports front, back and depth buffers in the frame buffer, then the frame buffer is divided into four areas. The first three are equal in size to the visible display area and are used for the three buffers (front, back and depth). The remaining frame buffer space remains unallocated and can be used for hardware cursor, font and pixmap caches, textures, pbuffers, etc.

In this approach, when clients create windows, the entire array of buffers available are already pre-allocated and available for use by the client: only the appropriate data structures need to be initialized. One advantage is that many large overlapping windows can be created without using all of the available resources. The main drawback is that the pre-allocated frame buffer space might not be used by a client and is unavailable for other uses (e.g., by another client).

Since statically allocating all of the buffers at server initialization time can limit the screen resolution or other features, a mechanism for selecting the types of buffers that will be allocated is available via X server startup options (e.g., command line options or XF86Config file options). Since the 3D graphics device driver knows the graphics device's capabilities and the X server knows the buffer types selected, the list of available extended visuals is generated by the 3D graphics device driver at this time.

Static allocation with dynamic reallocation of the unused space. In static allocation, some clients do not use all of their statically allocated buffers. When this occurs, an optimization is to use the unused (but pre-allocated) space for other buffers. For example, if a client only uses a front buffer, but a back and depth buffer were statically allocated, then a pbuffer that requested a depth buffer could use this extra space. Since this approach uses static allocation, the same frame buffer allocations as in the purely static approach are required.

How to implement this reallocation approach is a very difficult problem to solve. However, the sample implementation's design allows an approach such as this one to be added at a later date.

Dynamic allocation. With fully dynamic allocation, each of the buffers requested by the client are allocated only when the client associates a GLXContext with a drawable. When an X11 window is created, only the front buffer is normally allocated. Then, when a GLXContext is associated with the drawable, space is allocated in off-screen memory for the other buffers (e.g., back, depth and stencil buffers). During server initialization, no frame buffer space needs to be pre-allocated as in the previous two approaches, since all frame buffer allocations occur only when requested by the client.

A major drawback of this approach is that when a client tries to create a window, the resources that the server claimed possible might not be available. In the worst case, this might happen after the first window is created (i.e., if it grabs all of the available resources by opening a huge window with many ancillary buffers).

Texture memory

Texture memory is shared among all 3D rendering clients. On some types of graphics devices, it can be shared with other buffers, provided that these other buffers can be ``kicked out'' of the memory. On other devices, there is dedicated texture memory, which might or might not be sharable with other resources.

Since memory is a limited resource, it would be best if we could provide a mechanism to limit the memory reserved for textures. However, the format of texture memory on certain graphics devices is organized differently (banked, tiled, etc.) than the simple linear addressing used for most frame buffers. Therefore, the ``size'' of texture memory is device-dependent. This complicates the issue of using a single number for the size of texture memory.

Another complication is that once the X server reports that a texture will fit in the graphics device memory, it must continue to fit for the life of the client (i.e., the total texture memory for a client can never get smaller). Therefore, at initialization time, the maximum texture size and total texture memory available will need to be determined by the device-dependent driver. This driver will also provide a mechanism to determine if a set of textures will fit into texture memory (with a single texture as the degenerate case).

On hardware context swaps, texture memory might need to swapped as well. In the simplest case, all of the memory allocated to hold textures could be swapped at this time. The X server can handle the swapping and a notification from the kernel to the X server can signify that a swap is required. Texture memory swapping will be discussed below (see the Graphics hardware context switch section below).

Other ancillary buffers

All buffers associated with a window (e.g., back, depth, and GID) are preallocated by the static frame-buffer allocation. Pixmap, pbuffers and other ancillary buffers are allocated out of the memory left after this static allocation. During X server initialization, the size off-screen memory available for these buffers will be calculated by the device-dependent driver.

Note that pbuffers can be ``kicked out'' (at least the old style could), and so they don't require virtualization like pixmaps and potentially the new style pbuffers.

Display lists

For graphics devices that support display lists, the display list memory can be managed in the same way as texture memory. Otherwise, display lists will be held in the client virtual-address space.

SAREA

The SAREA is shared between the clients, the X server, and the kernel. It contains four segments that need to be shared: a per-device global hardware lock, per-context information, per-drawable information, and saved device state information.

Hardware lock segment. Only one process can access the graphics device at a time. For atomic operations that require multiple accesses, a global hardware lock for each graphics device is required. Since the number of cards is known at server initialization time, the size of this segment is fixed.
Per-context segment. Each GLXContext is associated with a particular drawable in the per-drawable segment and a particular graphics device state in the saved device state segment. Two pointers, one to the drawable that the GLXContext is currently bound and one to the saved device state is stored in the per-context segment. Since the number of GLXContexts is not known at server start up time, the size of this segment will need to grow. It is a reasonable assumption to limit the number of direct rendering contexts so the size of this segment can be fixed to a maximum. The X server is the only process that writes to this segment and it must maintain a list of available context slots that needs to be allocated and initialized.
Per-drawable segment. Each drawable has certain information that needs to be shared between the X server and the direct rendering client:
1. Buffer identification (e.g., front/back buffer) (int32)
2. Window information changed ID
3. Flags (int32)
  - Swap pending flag (0x00000001)
The window information changed ID signifies that the user has either moved, unmapped or resized the window, or the clipping information has changed and needs to be communicated to the client via the XFree86-GLX protocol.
Since OpenGL clients can create an arbitrary number of GLXDrawables, the size of this segment will need to grow. As with the per-context segment, the size of this segment can be limited to a fixed maximum. Again, the X server is the only process that writes to this segment, and it must maintain a list of available drawable slots that needs to be allocated and initialized.
Saved device state segment. Each GLXContext needs to save the graphics hardware context when another GLXContext has ownership of the graphics device. This information is fixed in size for each graphics device, but will be allocated as needed because it can be quite large. In addition, if the graphics device can read/write its state information via DMA, this segment will need to be locked down during the request.

Kernel initialization

When the X server opens the kernel device driver, the kernel loads and initializes the driver. See the next section for more details of the kernel device driver.

Double buffer optimizations

There are typically three approaches to hardware double buffering:

Video Page Flipping. The video logic is updated to refresh from a different page. This can happen very quickly with no per pixel copying required. This forces the entire screen region to be swapped.
Bitblt Double Buffering. The back buffer is stored in offscreen memory and specific regions of the screen can be swapped by coping data from the offscreen to onscreen. This has a performance penality because of the overhead of copying the swapped data, but allows for fine grain independent control for multiple windows.
Auxillary Per Pixel Control. An additional layer contains information on a per pixel basis that is used to determine which buffer should be displayed. Swapping entire regions is much quicker than Bitblt Double Buffering and fine grain independed control for multiple windows is achieved. However, not all hardware or modes support this method.

If the hardware support Auxillary Per Pixel Control for the given mode, then that is the preferred method for double buffer support. However, if the hardware doesn't support Auxillary Per Pixel Control, then the following combined opproach to Video Page Flipping and Bitblt Double Buffering is a potential optimization.

Initialize in a Bitblt Double Buffering mode. This allows for X Server performance to be optimized while not double buffering is required.
Transition to a Video Page Flipping mode for the first window requiring double buffer support. This allows for the fastest possible double buffer swapping at the expense of requiring the X Server to render to both buffers. Note, for the transition, the contents of the front buffer will need to be copied to the back buffer and all further rendering will need to be duplicated in both buffers for all non-double buffered regions while in this mode.
Transition back to Bitblt Double Buffering mode when additional double buffering windows are created. This will sacrifice performance for the sake of visual accuracy. Now all windows can be independently swapped.

In the initial SI, only the Bitblt Double Buffering mode will be implemented.

4.2 Kernel driver initialization

When the kernel device driver is opened by the X server, the device driver might not be loaded. If not, the module is loaded by kerneld and the initialization routine is called. In either case, the open routine is then called and finishes initializing the driver.

Kernel DMA initialization

Since the 3D graphics device drivers use DMA to communicate with the graphics device, we need to initialize the kernel device driver that will handle these requests. The kernel, in response to this request from the X server, allocates the DMA buffers that will be made available to direct rendering clients.

Kernel interrupt handling initialization

Interrupts are generated in a number of situations including when a DMA buffer has been processed by the graphics device. To acknowledge the interrupt, the driver must know which register to set and to what value to set it. This information could be hard coded into the driver, or possibly a generic interface might be able to be written. If this is possible, the X server must provide information to the kernel as to how to respond to interrupts from the graphics device.

Hardware context switching

Since the kernel device driver must be able to handle multiple 3D clients each with a different GLXContext, there must be a way to save and restore the hardware graphics context for each GLXContext when switching between them. Space for these contexts will need to be allocated when they are created by glXCreateContext(). If the client can use this hardware context (e.g., for software fallbacks or window moves), this information might be stored in the SAREA.

Client DMA wait queues

Each direct rendering context will require a DMA wait queue from which its DMA buffers can be dispatched. These wait queues are allocated by the X server when a new GLXContext is created (glXCreateContext()).

4.3 Client initialization

This section examines what happens before the client enters steady state behavior. The basic sequence for direct-rendering client initialization is that the GL/GLX library is loaded, queries to the X server are made (e.g., to determine the visuals/FBConfigs available and if direct rendering can be used), drawables and GLXContexts are created, and finally a GLXContext is associated with a drawable. This sequence assumes that the X server has already initialized the kernel device driver and has pre-allocated any static buffers requested by the user at server startup (as described above).

Library loading

When a client is loaded, the GL/GLX library will automatically be loaded by the operating system, but the graphics device-specific module cannot be loaded until after the X server has informed the DRI module which driver to load (see below). The DRI module might not be loaded until after a direct rendering GLXContext has been requested.

Client configuration queries

During client initialization code, several configuration queries are commonly made. GLX has queries for its version number and a list of supported extensions. These requests are made through the standard GLX protocol stream. Since the set of supported extensions is device-dependent, similar queries in the device-dependent driver interface (in the X server) are provided that can be called by device-independent code in GLX.

One of the required GLX queries from the client is for the list of supported extended visuals (and FBConfigs in GLX 1.3). The visuals define the types of color and ancillary buffers that are available and are device-dependent. The X server must provide the list of supported visuals (and FBConfigs) via the standard protocol transport layer (e.g., Unix domain or TCP/IP sockets). Again, similar interfaces in the device-dependent driver are provided that can be called by the device-independent code in GLX. All of this information is known at server initialization time (above).

Drawable creation

The client chooses the visual (or FBConfig) it needs and creates a drawable using the selected visual. If the drawable is a window, then, since we use a static resource allocation approach, the buffers are already allocated, and no additional frame buffer allocations are necessary at this time. However, if a dynamic resource allocation approach is added in the future, the buffers requested will need to be allocated.

Not all buffers need to be pre-allocated. For example, accumulation buffers can be emulated in software and might not be pre-allocated. If they are not, then, when the extended visual or FBConfig is associated with the drawable, the client library will need to allocate the accumulation buffer. In GLX 1.3, this can happen with glXCreateWindow(). For earlier versions of GLX, this will happen when a context is made current (below).

Pixmaps and pbuffers

GLXPixmaps are created from an ordinary X11 pixmap, which is then passed to glXCreatePixmap(). GLXPbuffers are created directly by a GLX command. Since we are using a static allocation scheme, we know what ancillary buffers need to be created for these drawables. In the initial SI, these will be handled by indirect rendering or software fallbacks.

GLXContext creation

The client must also create at least one GLXContext. The last flag to glXCreateContext() is a flag to request direct rendering. The first GLXContext created can trigger the library to initialize the direct rendering interface for this client. Several steps are required to setup the DRI. First, the DRI library is loaded and initialized in the client and X server. The DRI library establishes the private communication mechanism between the client and X server (the XFree86-GLX protocol). The X server sends the SAREA shared memory segment ID to the client via this protocol and the client attaches to it. Next, the X server sends the device-dependent client side 3D graphics device driver module name to client via the XFree86-GLX protocol, which is loaded and initialized in the client. The X server calls the kernel module to create a new WaitQueue and hardware graphics context corresponding to the new GLXContext. Finally, the client opens and initializes the kernel driver (including a request for DMA buffers).

Making a GLXContext current

The last stage before entering the steady state behavior occurs when a GLXContext is associated with a GLXDrawable by making the context ``current''. This must occur before any 3D rendering can begin. The first time a GLXDrawable is bound to a direct rendering GLXContext it is registered with the X server and any buffers not already allocated are now allocated. If the GLXDrawable is a window that has not been mapped yet, then the buffers associated with the window are initialized to size zero. When a window is mapped, space in the pre-allocated static buffers are initialized, or in the case of dynamic allocation, buffers are allocated from the available offscreen area (if possible).

For GLX 1.2 (and older versions), some ancillary buffers (e.g., stencil or accumulation), that are not supported by the graphics device, or unavailable due to either resource constraints or their being turned off through X server config options (see above), might need to be allocated.

At this point, the client can enter the steady-state by making OpenGL calls.

5. Steady-state analysis

The initial steady-state analysis presented here assumes that the client(s) and X server have been started and have established all necessary communication channels (e.g., the X, GLX and XFree86-GLX protocol streams and the SAREA segment). In the following analysis, we will impose simplifying assumptions to help direct the analysis towards the main line rendering case. We will then relax our initial assumptions and describe increasingly general cases.

5.1 Single 3D client (1 GLXContext, 1 GLXWindow), X server inactive

Assume: No X server activity (including hardware cursor movement).

This is the optimized main line rendering case. The primary goal is to generate graphics device specific commands and stuff them in a DMA buffer as fast as possible. Since the X server is completely inactive, any overhead due to locking should be minimized.

Processing rendering requests

In the simplest case, rendering commands can be sent to the graphics device by putting them in a DMA buffer. Once a DMA buffer is full and needs to be dispatched to the graphics device, the buffer can be handed immediately to the kernel via an ioctl.

The kernel then schedules the DMA command buffer to be sent to the graphics device. If the graphics device is not busy (or the DMA input queue is not full), it can be immediately sent to the graphics device. Otherwise, it is put on the WaitQueue for the current context.

In hardware that can only process a single DMA buffer at a time, when the DMA buffer has finished processing, an IRQ is generated by the graphics device and handled by the kernel driver. In hardware that has a DMA input FIFO, IRQs can be generated after each buffer, after the input FIFO is empty or (in certain hardware) when a low-water mark has been reached. For both types of hardware, the kernel device driver resets the IRQ and schedules the next DMA buffer(s).

A further optimization for graphics devices that have input FIFOs for DMA requests is that if the FIFO is not full, the DMA request could be initiated directly from client space.

Synchronization

GLX has commands to synchronize direct rendering with indirect rendering or with ordinary X11 operations. These include glFlush(), glFinish(), glXWaitGL() and glXWaitX() synchronization primitives. The kernel driver provides several ioctls to handle each of the synchronization cases. In the simplest case (glFlush()), any partially filled DMA buffer will be sent to the kernel. Since these will eventually be processed by the hardware, the function call can return. With glFinish(), in addition to sending any partially filled DMA buffer to the kernel, the kernel will block the client process until all outstanding DMA requests have been completely processed by the graphics device. glXWaitGL() can be implemented using glFlush(), glXWaitX() can be implemented with XSync().

Buffer swaps

Buffers swaps can be initiated by glXSwapBuffers(). When a client issues this request, any partially filled DMA buffers are sent to the kernel and all outstanding DMA buffers are processed before the buffer swap can take place. All subsequent rendering commands are blocked until the buffer has been swapped, but the client is not blocked and can continue to fill DMA buffers and send them to the kernel.

If multiple threads are rendering to a GLXDrawable, it is the client's responsibility to synchronize the threads. In addition, the idea of the current buffer (e.g., front or back) must be shared by all GLXContexts bound to a given drawable. The X double buffer extension must also agree.

Kernel-driver buffer swap ioctl

When the buffer swap ioctl is called, a special DMA buffer with the swap command is placed into the current GLXContext's WaitQueue. Because of sequentiality of the DMA buffers in the WaitQueue, all DMA buffers behind this are blocked until all DMA buffers in front of this one have been processed. The header information associated with this buffer lets the scheduler know how to handle the request. There are three ways to handle the buffer swap:

No vert sync. Immediately schedule the buffer swap and allow subsequent DMA buffers in the WaitQueue to be scheduled. With this policy there will be tearing. In the initial SI, we will implement this policy.
Wait for vert sync. Wait for the vertical retrace IRQ to schedule the buffer swap command and allow subsequent DMA buffers in the WaitQueue to be scheduled. With this policy, the tearing should be reduced, but there might still be some tearing if a DMA input FIFO is present and relatively full.
No tearing. Wait for vertical retrace IRQ and all DMA buffers in the input FIFO to be processed before scheduling the buffer swap command. Since the buffer swap is a very fast bitblt operation, no tearing should be present with this policy.

Software fallbacks

Not all OpenGL graphics primitives are accelerated in all hardware. For those not supported directly by the graphics device, software fallbacks will be required. Mesa and SGI's OpenGL SI provide a mechanism to implement these fallbacks; however, the hardware graphics context state needs to be translated into the format required by these libraries. The hardware graphics context state can be read from the saved device state segment of SAREA. An implicit glFinish() is issued before the software fallback can be initiated to ensure that the graphics state is up to date before beginning the software fallback. The hardware lock is required to alter any device state.

Image transfer operations

Many image transfer operations are required in the client-side direct rendering library. Initially these will be software routines that read directly from the memory mapped graphics device buffers (e.g., frame buffer and texture buffer). These are device-dependent operations since the format of the transfer might be different, though certain abstractions should be possible (e.g., linear buffers).

An optimization is to allow the client to perform DMA directly to/from the client's address space. Some hardware has support for page table translation and paging. Other hardware will require the ability to lock down pages and have them placed contiguously in physical memory.

The X server will need to manage how the frame and other buffers are allocated at the highest level. The layout of these buffers is determined at X server initialization time.

Texture management

Each GLXContext appears to own the texture memory. In the present case, there is no contention. In subsequent cases, hardware context switching will take care of texture swapping as well (see below).

For a single context, the image transfer operations described above provides the necessary interfaces to transfer textures and subtextures to/from texture memory.

Display list management

Display lists initially will be handled from within the client's virtual address space. For graphics devices that supports display lists, they can be stored and managed the same as texture memory.

Selection and feedback

If there is hardware support for selection and feedback, the rendering commands are sent to the graphics pipeline, which returns the requested data to the client. The amount of data can be quite large and are usually delivered to a collection of locked-down pages via DMA. The kernel should provide a mechanism for locking down pages in the client address space to hold the DMA buffer. If the graphics devices does not do address translation, then the pages should be contiguous physical pages.

Queries

Queries are handled similarly to selection and feedback, but the data returned are usually much smaller. When a query is made, the hardware graphics context state has to be read. If the GLXContext does not currently own the graphics device, the state can be read from the saved device state segment in SAREA. Otherwise, the graphics pipeline is temporarily stalled, so that the state can be read from the graphics device.

Events

GLX has a ``pbuffer clobbered'' event. This can only be generated as a result of reconfiguring a drawable or creating a new one. Since pbuffers will initially be handled by the software, no clobbered events will be generated. However, when they are accelerated, the X server will have to wrap the appropriate routine to determine when the event needs to be generated.

5.2 Single 3D client (1 GLXContext, 1 GLXWindow), X server can draw

Assume: X server can draw (e.g., 2D rendering) into other windows, but does not move the 3D window.

This is a common case and should be optimized if possible. The only significant different between this case and the previous case, is that we must now lock the hardware before accessing the graphics device directly directly from the client, X server or kernel space. The goal is to minimize state transitions and potentially avoid a full hardware graphics context switch by allowing the X server to save and restore 3D state around its access for GUI acceleration.

Hardware lock

Access to graphics device must be locked, either implicitly or explicitly. Each component of the system requires the hardware lock at some point. For the X server, the hardware lock is required when drawing or modifying any state. It is requested around blocks of 2D rendering, minimizing the potential graphics hardware context switches. In the 3D client, the hardware lock is required during the software fallbacks (all other graphics device accesses are handled through DMA buffers). The kernel also must request the lock when it needs to send DMA requests to the graphics device. The hardware lock is contained in the Hardware lock segment of the SAREA which can be accessed by all system components.

A two-tiered locking scheme is used to minimize the process and kernel context switches necessary to grant the lock. The most common case, where a lock is requested by the last process to hold the lock, does not require any context switches. See the accompanying locks.txt file for more information on two-tiered locking (available late February 1999).

In the case of DMA-style graphics devices, there could be one or more DMA buffers currently being processed. If the hardware lock is requested (by the X server or by the client for a software fallback), consideration needs to be given to the assumed state of hardware upon receipt of the lock, and it may be necessary for the DMA operation to be completed before the lock can be given to the requesting process.

Graphics hardware context switching

In addition to locking the graphics device, a graphics hardware context switch between the client and the X server is required. One possible solution is to perform a full context switch by the kernel (see the ``multiple contexts'' section below for a full explanation of how a full graphics hardware context switch is handled). However, the X server is a special case since it knows exactly when a context switch is required and what state needs to be saved and restored.

For the X server, the graphics hardware context switch is required only (a) when directly accessing the graphics device and (b) when the access changes the state of the graphics device. When this occurs, the X server can save the graphics device state (either via a DMA request or by reading the registers directly) before it performs its rendering commands and restore the graphics device state after it finishes.

Three examples will help clarify the situations where this type of optimization can be useful. First, using a cfb/mi routine to draw a line only accesses the frame buffer and does not alter any graphics device state. Second, on many vendor's cards changing the position of the hardware cursor does not affect the graphics device state. Third, certain graphics devices have two completely separate pipelines for 2D and 3D commands. If no 2D and 3D state is shared, then they can proceed independently (but usually not simultaneously, so the hardware lock is still required).

Certain graphics devices have the capability of saving a subset of graphics device state. If the device has this capability, then the X server could optimize its context switch by only saving and restoring the state that it changes.

5.3 Single 3D client (1 GLXContext, 1 GLXWindow), X server active

Assume: X server can move or resize the single 3D window.

When the X server moves or resizes the 3D window, the client needs to stop drawing long enough for the X server to change the window, and it also needs to request the new window location, size and clipping information. Current 3D graphics devices can draw using window relative coordinates, though the window offset might not be able to be updated asynchronously (i.e., it might only be possible to update this information between DMA buffers). Since this is an infrequent operation, it should be designed to have minimal impact on the other, higher priority cases.

X server operations

On the X server side, when a window move is performed, several operations must occur. First, the DMA buffers currently being processed by the graphics device must be completely processed before proceeding since they might associated with the old window position (unless the graphics device allows asynchronous window updates). Next, the X server grabs the hardware lock and waits for the graphics device to become quiescent. It then issues a bitblt to move the window and all of its associated buffers. It updates the window location in all of the contexts associated with the window, and increments the ``Window information changed'' ID in the SAREA to notify all clients rendering to the window of the change. It can then release the hardware lock.

Since the graphics hardware context has been updated with the new window offset, any outstanding DMA buffers for the context associated with the moved window will have the new window offset and thus will render at the correct screen location. The situation is slightly more complicated with window resizes or changes to the clipping information.

When a window is resized or when the clipping information changes due to another window popping up on top of the 3D window, outstanding DMA buffers might draw outside of the new window (if the window was made smaller). If the graphics device supports clipping planes, then this information can be updated in the graphics hardware context between DMA buffers. However, for devices that only support clipping rectangles, the outstanding DMA requests cannot be altered with the new clipping rects. To minimize this effect, the X server can (1) flush the DMA buffers in all contexts' WaitQueues associated with the window, and (2) wait for these DMA buffers to be processed by the graphics device. However, this does not completely solve the problem as there could be a partially filled DMA buffer in the client(s) rendering to the window (see below).

3D client operations

On the client side, during each rendering operation, the client checks to see if it has the most current window information. If it does, then it can proceed as normal. However, if the X server has changed the window location, size or clipping information, the client issues a XFree86-DRI protocol request to get the new information. See the accompanying XFree86-DRI.txt file for more information on the XFree86-DRI protocol implementation. This information will be mainly used for software fallbacks.

Since there could be several outstanding requests in the partially filled ``current'' DMA buffer, the rendering commands already in this buffer might draw outside of the window. The simplest solution to this problem is to send an expose event to the windows that are affected. This could be accomplished as follows: (1) send the partially filled DMA buffer to the kernel, (2) wait for it to be processed, (3) generate a list of screen-relative rectangles for the affected region, and (4) send a request to the X server to generate an expose event in the windows that overlap with that region.

On graphics devices that do not allow the window offset to be updated between DMA buffers, the situation described above will also occur for window moves. The ``generate expose events'' solution also will be used to solve the problem. It is not known at this time if any graphics devices of this type exist.

5.4 Single 3D client (1 GLXContext, N GLXDrawables)

Assume: There are now multiple 3D windows, pixmaps and/or pbuffers all using the same GLXContext.

This is a common and performance critical case that can be optimized by handling context switching within the client side library and registering only one hardware graphics context with the DRI. All window offsets, window sizes and clipping information in the hardware graphics context are updated during glXMakeContextCurrent().

GLXPixmaps and GLXPbuffers

Initially all pixmap and pbuffer rendering will be handled in software.

5.5 Single 3D client (N GLXContexts, N GLXDrawables)

Assume: There are now multiple 3D drawables each with their own GLXContext.

With the addition of multiple GLXContexts, it is necessary to coordinate access to the device and perform a complete hardware graphics context switch when transitioning ownership of the device to a different GLXContext. For example, when rendering commands from a new GLXContext are sent to the kernel to be processed, the old hardware graphics context must be swapped out of and the new one must be swapped into the graphics device before any rendering can take place. The kernel keeps track of which GLXContext was last used so that when the old and new GLXContexts are the same, the context switch can be avoided. The slower tier of the two-tier lock is useful for optimizing when a context switch is necessary.

With multiple GLXContexts, the kernel driver must now manage multiple WaitQueues -- one for each GLXContext. When there are DMA buffers waiting to be processed on two (or more) WaitQueues, the kernel driver recognizes a hardware graphics context switch is necessary.

Hardware graphics context switch

The hardware context switch can be performed by the kernel or by the X server. Initially, the hardware graphics context switch will be performed by the X server (as described below). An optimization will be to allow the kernel to handle the context switches. For the kernel to handle these, it will require knowledge of how to save and restore the hardware graphics context. If it is possible to create a generic context switch interface, this device-specific information can be passed to the kernel driver via an ioctl. Otherwise, a device-specific kernel driver can be written and hooked into the generic kernel driver.

When the X server is involved in hardware graphics context switching (e.g., when swapping textures, see below), the kernel notifies the X server (e.g., via I/O that results in a SIGIO signal). The kernel notifies the X server of the old and new GLXContext. The X server then saves the old graphics device state in the old GLXContext's state buffer and loads the new graphics device state from a previously saved GLXContext's state buffer (or when the first context switch is performed, initializes the graphics device state to its initial values).

This saved graphics device state is very useful to the direct rendering client when it needs to perform software fallbacks. The saved GLXContext's state buffer is placed in the device state segment of the SAREA, and the pointer to the saved device state of the GLXContext in the per-context is updated.

Accessing hardware graphics state

Certain graphics devices have the ability to use DMA to read/write the entire hardware graphics state from/to host memory. For these devices, a DMA request will be sent to the hardware to read/write the state information to a locked down page. The advantage is that the DMA buffer holding these requests (one to save the old state and one to restore the new state) can be inserted in front of the DMA buffer with the rendering commands, and no waiting is required.

For devices that do not save their state via a DMA request, the X server needs to be able to read/write the individual register values when the hardware context switch is performed. The X server must wait until the DMA buffer currently being processed or any DMA buffers are on the hardware input FIFO to be completely processed before it can save the old graphics device state.

Texture memory management

Textures can be shared by several GLXContexts and should not be swapped out if the new GLXContext is in the share list. The same is true of display lists.

The initial SI will provide a simplistic texture swap mechanism for hardware graphics context switches, which will be executed by the X server when triggered by the kernel driver. It is a design goal for the DRI that a more sophisticated two-tiered allocation scheme be implemented at a later date.

5.6 Multiple 3D clients

Assume: There are now multiple 3D clients, each of which has their own GLXContext(s).

As with the previous case, multiple GLXContexts are actively used in rendering, and this case can be handled the same as the previous one.

6. Finalization analysis

This section examines what happens after exiting steady state behavior via destroying a rendering surface or context, or via process termination. Process suspension and switching virtual consoles are special cases and are dealt with in this section.

6.1 Destroying a drawing surface

If the drawing surface is a window, it can be destroyed by the window manager. When this occurs, the X server must notify the direct rendering client that the window was destroyed. However, before the window can be removed, the X server must wait until all outstanding DMA buffer requests associated with the window have been completely processed in order to avoid rendering to the destroyed window after it has been removed. When the client tries to draw to the window again, it recognizes that the window is no longer valid and cleans up its internal state associated with the window (e.g., any local ancillary buffer), and returns an error.

GLX 1.3 uses glXDestroyWindow() to explicitly notify the system that the window is no longer associated with GLX, and that its resources should be freed.

6.2 Destroying a GLXContext

Since there are limited context slots available in the per-context segment of SAREA, a GLXContext's resources can be freed by calling glXDestroyContext() when it is no longer needed.

If the GLXContext is current to any thread, the context cannot be destroyed until it is no longer current. When this happens, the X server marks the GLXContext's per-context slot as free, frees the saved device state, and notifies the kernel that the WaitQueue can be freed.

Destroying shared resources

Texture objects and display lists can be shared by multiple GLXContexts. When a context is destroyed in the share list, the reference count should be decremented. If the reference count of the texture objects and/or display lists is zero, they can be freed as well.

6.3 Process finalization

When a process exits, its direct rendering resources should be freed and returned to the X server.

Graceful termination

If the termination is expected, the resources associated with the process are freed. The kernel reclaims its DMA buffers from the client. The X server frees the GLXDrawables and GLXContexts associated with the client. In the process of freeing the GLXContexts, the X server notifies the kernel that it should free any WaitQueues associated with the GLXContexts it is freeing. The saved device state is freed. The reference count to the SAREA is decremented. Finally, any additional resources used by the GLX and XFree86-GLX protocol streams are freed.

Unexpected termination

Detecting the client death is the hardest part of unexpected process termination. Once detected, the resources are freed as in the graceful termination case outlined above.

The kernel detects when a direct rendering client process dies since it has registered itself with the kernel exit procedure. If the client does not hold the hardware lock, then it can proceed as in the graceful termination case. If the hardware lock is held, the lock is broken. The graphics device might be in an unusable state (e.g., waiting for data during a texture upload), and might need to be reset. After reset, the graceful termination case can proceed.

6.4 Process suspension

Processes can suspend themselves via a signal that cannot be blocked, SIGSTOP. If the process holds the hardware lock during this time, the SIGSTOP signal must be delayed until the lock is freed. This can be handled in the kernel. As an initial approximation, the kernel can turn off SIGSTOP for all direct rendering clients.

6.5 Switching virtual consoles

XFree86 has the ability to switch to a different virtual console when the X server is running. This action causes the X server to draw to a copy of the frame buffer in the X server virtual address space. For direct rendering clients, this solution is not possible. A simple solution to use in the initial SI is to halt all direct access to the graphics device by grabbing the hardware lock.

In addition to switching virtual consoles, XFree86 can be started on multiple consoles (with different displays). Initially, only the first display will support direct rendering.

7. Future enhancements

7.1 MMIO

This architecture has been designed with MMIO based 3D solution in mind, but the initial SI will be optimized for DMA based solutions. A more complete MMIO driven implementation can be added later. Base support in the initial SI that will be useful for an MMIO-only solution is unprivileged mapping of MMIO regions and a fast two-tier lock. Additional optimizations that will be useful are virtualizing the hardware via a page fault mechanism and a mechanism for updating shared library pointers directly.

7.2 Device-specific kernel driver

Several optimizations (mentioned above) can be added by allowing a device-specific kernel driver to hook out certain functions in the generic kernel driver.

7.3 Other enhancements

We should consider additional enhancements including:

Multiple displays and multiple screens
More complex buffer swapping (cushion buffering, swap every N retraces, synchronous window swapping)

8. Glossary

MMIO.

Memory-Mapped Input-Output. In this document, we use the term MMIO to refer to operations that access a region of graphics card memory that has been memory-mapped into the virtual address space, or to operations that access graphics hardware registers via a memory-mapping of the registers into the virtual address space (in contrast to PIO).

Note that graphics hardware ``registers'' may actually be pseudo-registers that provide access to the hardware FIFO command queue.

PIO.

Programmed Input-Output. In this document, we use the term PIO to refer specifically to operations that must use the Intel in and out instructions (or equivalent non-Intel instructions) to access the graphics hardware (in contrast to using memory-mapped graphics hardware registers, which allow for the use of mov instructions).

Note that graphics hardware ``registers'' may actually be pseudo-registers that provide access to the hardware FIFO command queue.

9. References

[AM1997] James H. Anderson and Mark Moir. Universal constructions for large objects. Submitted to IEEE Transactions on Parallel and Distributed Systems, June 1997. Available from http://www.cs.pitt.edu/~moir/Papers/anderson-moir-tpds97.ps. (An earlier version of this paper was published in Proceedings of the Ninth International Workshop on Distributed Algorithms, Lecture Notes in Computer Science 972, Springer-Verlag, pp. 168-182, September 1995.)

[BKMS1998] David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin locks: featherweight synchronization for Java. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (Montreal, Canada, 17-19 June 1998). Published as SIGPLAN Notices 33, 5 (May 1998), 258-268.

[GLX1.2] Phil Karlton and Paula Womack. OpenGL® Graphics with the X Window System® (Version 1.2, 4 Mar. 1997). Available from http://trant.sgi.com/opengl/docs/Specs/GLXspec.ps.

[GLX1.3] Paula Womack and Jon Leech, editors. OpenGL® Graphics with the X Window System® (Version 1.3, 19 Oct. 1998). Available from ftp://sgigate.sgi.com/pub/opengl/doc/opengl1.2/glx1.3.ps.

[M1992] Henry Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services. Ph.D. dissertation, published as Technical Report CUCS-039-92. Graduate School of Arts and Sciences, Columbia University, 1992, 71-91. Available from ftp://ftp.cs.columbia.edu/reports/reports-1992/cucs-039-92.ps.gz.

[MCS1991] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems 9, 1 (Feb. 1991), 21-65.

[OM1998] Jens Owen and Kevin E. Martin. A Multipipe Diret Rendering Architecture for 3D. Precision Insight, Cedar Park, Texas, 15 Sept. 1998. Available from http://www.precisioninsight.com/dr/dr.html.

[OPENGL1.2] Mark Segal and Kurt Akely; edited by Jon Leech. The OpenGL® Graphics System: A Specification (Version 1.2.1, 14 Oct. 1998). Available from ftp://sgigate.sgi.com/pub/opengl/doc/opengl1.2/opengl1.2.ps.