OpenGL Performance Optimization (repost)
SIGGRAPH '97
Course 24: OpenGL and Window System Integration
OpenGL Performance Optimization
Contents
- 1. Hardware vs. Software
- 2. Application Organization
- 3. OpenGL Optimization
- 4. Evaluation and tuning
OpenGL may be implemented by any combination of hardware and software. At the high-end, hardware may implement virtually all of OpenGL while at the low-end, OpenGL may be implemented entirely in software. In between are combination software/hardware implementations. More money buys more hardware and better performance.
Intro-level workstation hardware and the recent PC 3-D hardware typically implement point, line, and polygon rasterization in hardware but implement floating point transformations, lighting, and clipping in software. This is a good strategy since the bottleneck in 3-D rendering is usually rasterization and modern CPUs have sufficient floating point performance to handle the transformation stage.
OpenGL developers must remember that their application may be used on a wide variety of OpenGL implementations. Therefore one should consider using all possible optimizations, even those which have little return on the development system, since other systems may benefit greatly.
From this point of view it may seem wise to develop your application on a low-end system. There is a pitfall, however: some operations which are cheap in software may be expensive in hardware. The moral is: test your application on a variety of systems to be sure the performance is dependable.
One should consider multiprocessing in these situations. By assigning rendering and computation to different threads they may be executed in parallel on multiprocessor computers.
For many applications, supporting multiprocessing is just a matter of partitioning the render and compute operations into separate threads which share common data structures and coordinate with synchronization primitives.
SGI's Performer is an example of a high level toolkit designed for this purpose.
Complexity may refer to the geometric or rendering attributes of a database. Here are a few examples.
Objects which are entirely outside of the field of view may be culled. This type of high level cull testing can be done efficiently with bounding boxes or spheres and can have a major impact on performance. Again, toolkits such as Inventor and Performer have this feature.
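A bounding-sphere cull test of the kind described above can be sketched in a few lines of C. This is a minimal illustration, not the method used by Inventor or Performer: the Plane and Sphere types and the function name sphere_is_culled are hypothetical, and the view volume is assumed to be given as planes with unit-length normals pointing inward.

```c
#include <stddef.h>

/* A plane a*x + b*y + c*z + d = 0 whose unit-length normal (a, b, c)
 * points toward the inside of the view volume (assumed convention). */
typedef struct { float a, b, c, d; } Plane;

/* Bounding sphere of an object (hypothetical layout). */
typedef struct { float x, y, z, radius; } Sphere;

/* Returns 1 if the sphere lies entirely outside at least one plane, so the
 * object cannot be visible and need not be sent to OpenGL; 0 otherwise. */
int sphere_is_culled(const Sphere *s, const Plane *planes, size_t nplanes)
{
    size_t i;
    for (i = 0; i < nplanes; i++) {
        /* signed distance from the sphere center to the plane */
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;
        if (dist < -s->radius) {
            return 1;
        }
    }
    return 0;
}
```

A sphere that straddles a plane is kept, so the test is conservative: it never discards a visible object, but it may pass a few invisible ones on to the pipeline.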
Basically, one wants data structures which can be traversed quickly and passed to the graphics library in an efficient manner. For example, suppose we need to render a triangle strip. The data structure which stores the list of vertices may be implemented with a linked list or an array. Clearly the array can be traversed more quickly than a linked list. The way in which a vertex is stored in the data structure is also significant. High performance hardware can process vertices specified by a pointer more quickly than those specified by three separate parameters.
Our first attempt at rendering this information may be:
We can still do better, however. If we redesign the data structures used to represent the city information we can improve the efficiency of drawing the city points. For example:
In the following sections the techniques for maximizing performance, as seen above, are explained.
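After each of the following techniques look for a bracketed list of symbols which relates the significance of the optimization to your OpenGL system:
- H - beneficial for high-end hardware
- L - beneficial for low-end hardware
- S - beneficial for software implementations
- all - probably beneficial for all implementations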
Note that software implementations of OpenGL may actually perform these operations faster than hardware systems. If you're developing on a low-end system be aware of this fact. [H,L]
It may be worthwhile to experiment with different visuals to determine if there's any advantage of one over another.
Synchronization hurts performance. Therefore, if you need to render with both OpenGL and native window system calls, try to group the rendering calls to minimize synchronization. For example, if you're drawing a 3-D scene with OpenGL and displaying text with X, draw all the 3-D elements first, synchronize once, and then draw all the text.
Also, when responding to mouse motion events you should skip extra motion events in the input queue. Otherwise, if you try to process every motion event and redraw your scene there will be a noticeable delay between mouse input and screen updates.
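Skipping extra motion events amounts to keeping only the newest event in each run of pending motion events. The sketch below shows the idea on a plain array. The Event type and the function coalesce_motion are hypothetical stand-ins; a real program would apply the same logic to its window system's event queue.

```c
#include <stddef.h>

/* Hypothetical event record standing in for the window system's own type. */
typedef enum { EV_MOTION, EV_BUTTON, EV_EXPOSE } EventType;
typedef struct { EventType type; int x, y; } Event;

/* Compacts the pending events in place so that every run of consecutive
 * motion events is replaced by the last event of that run.
 * Returns the new number of events in the queue. */
size_t coalesce_motion(Event *queue, size_t count)
{
    size_t in, out = 0;
    for (in = 0; in < count; in++) {
        if (queue[in].type == EV_MOTION && in + 1 < count
            && queue[in + 1].type == EV_MOTION) {
            continue;   /* a newer motion event follows; drop this one */
        }
        queue[out++] = queue[in];
    }
    return out;
}
```

Only motion events are merged, and only within a run, so button presses and expose events keep their order relative to the pointer positions around them.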
It can be a good idea to put a print statement in your redraw and event loop function so you know exactly what messages are causing your scene to be redrawn, and when.
Performance evaluation is a large subject and only the basics are covered here. For more information see "OpenGL on Silicon Graphics Systems".
After bottlenecks have been identified the techniques outlined in section 3 can be applied. The process of identifying and reducing bottlenecks should be repeated until no further improvements can be made or your minimum performance threshold has been met.
Measure the performance of rendering in single buffer mode to determine how far you really are from your target frame rate.
Posted on 2009-08-25 06:05 by RedLight. Category: 3D rendering technology
1. Hardware vs. Software
2. Application Organization
At first glance it may seem that the performance of interactive OpenGL applications is dominated by the performance of OpenGL itself. This may be true in some circumstances but be aware that the organization of the application is also significant.
2.1 High Level Organization
Multiprocessing
Some graphical applications have a substantial computational component other than 3-D rendering. Virtual reality applications must compute object interactions and collisions. Scientific visualization programs must compute analysis functions and graphical representations of data.
Image quality vs. performance
In general, one wants high-speed animation and high-quality images in an OpenGL application. If you can't have both at once a reasonable compromise may be to render at low complexity during animation and high complexity for static images.
- During animation, consider GL_NEAREST sampling and glHint( GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST ).
- Use glPolygonMode( GL_FRONT_AND_BACK, GL_LINE ) to inspect tessellation granularity and reduce it if possible.
Level of detail management and culling
Objects which are distant from the viewer may be rendered with a reduced complexity model. This strategy reduces the demands on all stages of the graphics pipeline. Toolkits such as Inventor and Performer support this feature automatically.
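Choosing a reduced-complexity model by distance can be as simple as a table lookup. The following is a minimal sketch under stated assumptions: the function name select_lod and the switch-distance table are hypothetical, and real toolkits use more elaborate criteria such as projected screen size.

```c
/* Picks a model index: 0 is the most detailed model and nlevels-1 the
 * coarsest. switch_dist[i] is the distance beyond which level i+1 is
 * used; it must be sorted in increasing order and hold nlevels-1 entries. */
int select_lod(float distance, const float *switch_dist, int nlevels)
{
    int level = 0;
    while (level < nlevels - 1 && distance > switch_dist[level]) {
        level++;
    }
    return level;
}
```

With switch distances of 10 and 50 units, an object 20 units away is drawn with model 1 and an object 100 units away with model 2.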
2.2 Low Level Organization
The objects which are rendered with OpenGL have to be stored in some sort of data structure. Some data structures are more efficient than others with respect to how quickly they can be rendered.
An Example
Suppose we're writing an application which involves drawing a road map. One of the components of the database is a list of cities specified with a latitude, longitude and name. The data structure describing a city may be:
struct city {
   float latitude, longitude;   /* city location */
   char *name;                  /* city's name */
   int large_flag;              /* 0 = small, 1 = large */
};
A list of cities may be stored as an array of city structs.
void draw_cities( int n, struct city citylist[] )
{
   int i;
   for (i=0; i < n; i++) {
      if (citylist[i].large_flag) {
         glPointSize( 4.0 );
      }
      else {
         glPointSize( 2.0 );
      }
      glBegin( GL_POINTS );
      glVertex2f( citylist[i].longitude, citylist[i].latitude );
      glEnd();
      glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
      glCallLists( strlen(citylist[i].name),
                   GL_BYTE,
                   citylist[i].name );
   }
}
This is a poor implementation for a number of reasons:
- glPointSize is called for every loop iteration.
- Only one point is drawn between each glBegin and glEnd.
Here's a better implementation:
void draw_cities( int n, struct city citylist[] )
{
   int i;
   /* draw small dots first */
   glPointSize( 2.0 );
   glBegin( GL_POINTS );
   for (i=0; i < n ;i++) {
      if (citylist[i].large_flag==0) {
         glVertex2f( citylist[i].longitude, citylist[i].latitude );
      }
   }
   glEnd();
   /* draw large dots second */
   glPointSize( 4.0 );
   glBegin( GL_POINTS );
   for (i=0; i < n ;i++) {
      if (citylist[i].large_flag==1) {
         glVertex2f( citylist[i].longitude, citylist[i].latitude );
      }
   }
   glEnd();
   /* draw city labels third */
   for (i=0; i < n ;i++) {
      glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
      glCallLists( strlen(citylist[i].name),
                   GL_BYTE,
                   citylist[i].name );
   }
}
In this implementation we're only calling glPointSize twice and we're maximizing the number of vertices specified between glBegin and glEnd.
struct city_list {
   int num_cities;    /* how many cities in the list */
   float *position;   /* pointer to lat/lon coordinates */
   char **name;       /* pointer to city names */
   float size;        /* size of city points */
};
Now cities of different sizes are stored in separate lists. Positions are stored sequentially in a dynamically allocated array. By reorganizing the data structures we've eliminated the need for a conditional inside the glBegin/glEnd loops. Also, we can render a list of cities using the GL_EXT_vertex_array extension if available, or at least use a more efficient version of glVertex and glRasterPos.
/* indicates if server can do GL_EXT_vertex_array: */
GLboolean varray_available;

void draw_cities( struct city_list *list )
{
   int i;
   /* fall back to glBegin/glEnd unless the vertex array path is taken */
   GLboolean use_begin_end = GL_TRUE;

   /* draw the points */
   glPointSize( list->size );
#ifdef GL_EXT_vertex_array
   if (varray_available) {
      glVertexPointerEXT( 2, GL_FLOAT, 0, list->num_cities, list->position );
      glDrawArraysEXT( GL_POINTS, 0, list->num_cities );
      use_begin_end = GL_FALSE;
   }
#endif
   if (use_begin_end) {
      glBegin(GL_POINTS);
      for (i=0; i < list->num_cities; i++) {
         glVertex2fv( &list->position[i*2] );
      }
      glEnd();
   }
   /* draw city labels */
   for (i=0; i < list->num_cities ;i++) {
      glRasterPos2fv( &list->position[i*2] );
      glCallLists( strlen(list->name[i]),
                   GL_BYTE, list->name[i] );
   }
}
As this example shows, it's better to know something about efficient rendering techniques before designing the data structures. In many cases one has to find a compromise between data structures optimized for rendering and those optimized for clarity and convenience.
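One way to reach such a compromise is to keep the convenient array of city structs as the master copy and derive the packed, per-size lists from it once, outside the drawing code. The sketch below assumes the two structs from the example; the helper build_city_list is hypothetical, and the coordinate pairs are stored in longitude, latitude order because that is the order the drawing code passes to glVertex2f.

```c
#include <stdlib.h>

struct city {
    float latitude, longitude;
    char *name;
    int large_flag;
};

struct city_list {
    int num_cities;
    float *position;   /* packed longitude/latitude pairs */
    char **name;
    float size;
};

/* Collects the cities whose large_flag matches want_large into a packed
 * list. Returns 0 on success, -1 if memory could not be allocated. */
int build_city_list(const struct city *cities, int n, int want_large,
                    float point_size, struct city_list *out)
{
    int i, count = 0;
    int capacity = (n > 0) ? n : 1;   /* avoid zero-sized allocations */

    out->num_cities = 0;
    out->size = point_size;
    out->position = malloc(capacity * 2 * sizeof(float));
    out->name = malloc(capacity * sizeof(char *));
    if (!out->position || !out->name) {
        free(out->position);
        free(out->name);
        out->position = NULL;
        out->name = NULL;
        return -1;
    }
    for (i = 0; i < n; i++) {
        if ((cities[i].large_flag != 0) == (want_large != 0)) {
            out->position[count * 2] = cities[i].longitude;
            out->position[count * 2 + 1] = cities[i].latitude;
            out->name[count] = cities[i].name;   /* shared, not copied */
            count++;
        }
    }
    out->num_cities = count;
    return 0;
}
```

Calling it twice, once with want_large set to 0 and a point size of 2.0 and once with want_large set to 1 and a point size of 4.0, yields the two lists that the reorganized draw_cities expects. The caller frees position and name when the lists are no longer needed.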
3. OpenGL Optimization
There are many ways to improve OpenGL performance. The impact of any single optimization can vary a great deal depending on the OpenGL implementation. Interestingly, items which have a large impact on software renderers may have no effect on hardware renderers, and vice versa! For example, smooth shading can be expensive in software but free in hardware, while glGet* can be cheap in software but expensive in hardware.
3.1 Traversal
Traversal is the sending of data to the graphics system. Specifically, we want to minimize the time taken to specify primitives to OpenGL.
- Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP require fewer vertices to describe an object than individual line, triangle, or polygon primitives. This reduces data transfer and transformation workload. [all]
- Replacing glVertex/glColor/glNormal calls with the vertex array mechanism may be very beneficial. [all]
- The glVertex, glColor, glNormal and glTexCoord functions which take a pointer to their arguments, such as glVertex3fv(v), may be much faster than those which take individual arguments, such as glVertex3f(x,y,z), on systems with DMA-driven graphics hardware. [H,L]
- If lighting is disabled don't call glNormal. If texturing is disabled don't call glTexCoord, etc.
- Avoid extraneous code between glBegin/glEnd. Example:
glBegin( GL_TRIANGLE_STRIP );
for (i=0; i < n; i++) {
   if (lighting) {
      glNormal3fv( norm[i] );
   }
   glVertex3fv( vert[i] );
}
glEnd();
This is a very bad construct. The following is much better:
if (lighting) {
   glBegin( GL_TRIANGLE_STRIP );
   for (i=0; i < n ;i++) {
      glNormal3fv( norm[i] );
      glVertex3fv( vert[i] );
   }
   glEnd();
}
else {
   glBegin( GL_TRIANGLE_STRIP );
   for (i=0; i < n ;i++) {
      glVertex3fv( vert[i] );
   }
   glEnd();
}
Also consider manually unrolling important rendering loops to maximize the function call rate.
3.2 Transformation
Transformation includes the transformation of vertices from glVertex to window coordinates, clipping and lighting.
- Avoid frequent changes to the GL_SHININESS material parameter. [L,S]
- glEnable/Disable(GL_NORMALIZE) controls whether normal vectors are scaled to unit length before lighting. If you do not use glScale you may be able to disable normalization without ill effects. Normalization is disabled by default. [L,S]
- Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP decrease traversal and transformation load.
- glRect usage: if you have to draw many rectangles, consider using glBegin(GL_QUADS) ... glEnd() instead. [all]
3.3 Rasterization
Rasterization is the process of generating the pixels which represent points, lines, polygons, bitmaps and the writing of those pixels to the frame buffer. Rasterization is often the bottleneck in software implementations of OpenGL.
3.4 Texturing
Texture mapping is usually an expensive operation in both hardware and software. Only high-end graphics hardware can offer free to low-cost texturing. In any case there are several ways to maximize texture mapping performance.
- The GL_UNSIGNED_BYTE component format is typically the fastest for specifying texture images. Experiment with the internal texture formats offered by the GL_EXT_texture extension. Some formats are faster than others on some systems (16-bit texels on the Reality Engine, for example). [all]
- If both the minification and magnification filters are GL_NEAREST or GL_LINEAR then there's no reason OpenGL has to compute the lambda value which determines whether to use minification or magnification sampling for each fragment. Avoiding the lambda calculation can be a good performance improvement.
- The GL_DECAL or GL_REPLACE_EXT function for 3-component textures is a simple assignment of texel samples to fragments, while GL_MODULATE multiplies texel samples by the incoming fragment colors. [S,L]
- Don't use glTexImage2D to repeatedly change the texture. Use glTexSubImage2D or glCopyTexSubImage2D. These functions are standard in OpenGL 1.1 and available as extensions to 1.0.
3.5 Clearing
Clearing the color, depth, stencil and accumulation buffers can be time consuming, especially when it has to be done in software. There are a few tricks which can help.
- Use glClear carefully: clear all relevant buffers with one glClear call. [all]
Wrong:
glClear( GL_COLOR_BUFFER_BIT );
if (stenciling) {
   glClear( GL_STENCIL_BUFFER_BIT );
}
Right:
if (stenciling) {
   glClear( GL_COLOR_BUFFER_BIT | GL_STENCIL_BUFFER_BIT );
}
else {
   glClear( GL_COLOR_BUFFER_BIT );
}
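- Use glScissor() to restrict clearing to a smaller area. [L]
- Depth buffer clearing can be eliminated by alternating the depth range and depth function between frames, provided every pixel is covered by depth-tested geometry in each frame. Example: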
int EvenFlag;

/* Call this once during initialization and whenever the window
 * is resized.
 */
void init_depth_buffer( void )
{
   glClearDepth( 1.0 );
   glClear( GL_DEPTH_BUFFER_BIT );
   glDepthRange( 0.0, 0.5 );
   glDepthFunc( GL_LESS );
   EvenFlag = 1;
}

/* Your drawing function */
void display_func( void )
{
   if (EvenFlag) {
      glDepthFunc( GL_LESS );
      glDepthRange( 0.0, 0.5 );
   }
   else {
      glDepthFunc( GL_GREATER );
      glDepthRange( 1.0, 0.5 );
   }
   EvenFlag = !EvenFlag;
   /* draw your scene */
}
3.6 Miscellaneous
- glGetFloatv, glGetIntegerv, glIsEnabled, glGetError, glGetString require a slow, round trip transaction between the application and renderer. Especially avoid them in your main rendering code.
- glPushAttrib: glPushAttrib( GL_ALL_ATTRIB_BITS ) in particular can be very expensive on hardware systems. This call may be faster in software implementations than in hardware. [H,L]
- During development, call glGetError inside your rendering/event loop to catch errors. GL errors raised during rendering can slow down rendering speed. Remove the glGetError call for production code since it's a "round trip" command and can cause delays. [all]
- glColorMaterial instead of glMaterial: glColorMaterial may be faster than glMaterial. [all]
- glDrawPixels: glDrawPixels often performs best with GL_UNSIGNED_BYTE color components. [all]
- Disable all unnecessary raster operations before calling glDrawPixels. [all]
- glPolygonMode: to draw polygon outlines or vertex points, use glBegin with GL_POINTS, GL_LINES, GL_LINE_LOOP or GL_LINE_STRIP instead as it can be much faster. [all]
3.7 Window System Integration
- The glXMakeCurrent call, for example, can be expensive on hardware systems because the context switch may involve moving a large amount of data in and out of the hardware.
- The GLX_EXT_visual_rating extension can help you select visuals based on performance or quality. GLX 1.2's visual caveat attribute can tell you if a visual has a performance penalty associated with it.
- The glXWaitX and glXWaitGL functions synchronize OpenGL rendering with native X rendering. For example, call glXWaitGL to synchronize, then call all the X drawing functions.
3.8 Mesa-specific
Mesa is a free library which implements most of the OpenGL API in a compatible manner. Since it is a software library, performance depends a great deal on the host computer. There are several Mesa-specific features to be aware of which can affect performance.
- The MESA_RGB_VISUAL environment variable can be used to determine the quickest visual by experimentation.
- When all the vertices of a primitive share one color, the glColor command should be put before the glBegin call.
Don't do this:
glBegin(...);
glColor(...);
glVertex(...);
...
glEnd();
Do this:
glColor(...);
glBegin(...);
glVertex(...);
...
glEnd();
- glColor[34]ub[v] are the fastest versions of the glColor command.
4. Evaluation and Tuning
To maximize the performance of an OpenGL application one must be able to evaluate the application to learn what is limiting its speed. Because of the hardware involved it's not sufficient to use ordinary profiling tools. Several different aspects of the graphics system must be evaluated.
4.1 Pipeline tuning
The graphics system can be divided into three subsystems for the purpose of performance evaluation:
- CPU subsystem
- Geometry subsystem
- Rasterization subsystem
At any given time, one of these stages will be the bottleneck. The bottleneck must be reduced to improve performance. The strategy is to isolate each subsystem in turn and evaluate changes in performance. For example, by decreasing the workload of the CPU subsystem one can determine if the CPU or graphics system is limiting performance.
4.1.1 CPU subsystem
To isolate the CPU subsystem one must reduce the graphics workload while preserving the application's execution characteristics. A simple way to do this is to replace glVertex and glNormal calls with glColor calls. If performance does not improve then the CPU stage is the bottleneck.
4.1.2 Geometry subsystem
To isolate the geometry subsystem one wants to reduce the number of primitives processed, or reduce the transformation work per primitive, while producing the same number of pixels during rasterization. This can be done by replacing many small polygons with fewer large ones or by simply disabling lighting or clipping. If performance increases then your application is bound by geometry/transformation speed.
4.1.3 Rasterization subsystem
A simple way to reduce the rasterization workload is to make your window smaller. Another way to reduce rasterization work is to disable per-pixel processing such as texturing, blending, or depth testing. If performance increases, your program is fill limited.
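The three isolation experiments can be combined into a simple decision procedure once the frame times have been measured. The sketch below is one possible reading of the steps above, not a definitive method: the enum, the function names, the order in which the tests are applied, and the tolerance parameter are all assumptions, and the four timings must be measured by the application itself.

```c
typedef enum {
    BOUND_CPU, BOUND_GEOMETRY, BOUND_FILL, BOUND_UNKNOWN
} Bottleneck;

/* A run counts as improved only if it beats the baseline by more than
 * the given fraction (tol), to ignore measurement noise. */
static int improved(double baseline, double t, double tol)
{
    return t < baseline * (1.0 - tol);
}

/* baseline:         frame time of the unmodified application
 * t_no_vertices:    glVertex/glNormal calls replaced with glColor calls
 * t_less_geometry:  fewer, larger polygons or lighting disabled
 * t_small_window:   same scene drawn in a smaller window */
Bottleneck classify_bottleneck(double baseline, double t_no_vertices,
                               double t_less_geometry,
                               double t_small_window, double tol)
{
    if (!improved(baseline, t_no_vertices, tol)) {
        return BOUND_CPU;        /* removing graphics work did not help */
    }
    if (improved(baseline, t_small_window, tol)) {
        return BOUND_FILL;       /* fewer pixels helped */
    }
    if (improved(baseline, t_less_geometry, tol)) {
        return BOUND_GEOMETRY;   /* less transformation work helped */
    }
    return BOUND_UNKNOWN;
}
```

For example, with a baseline of 10 ms and a tolerance of 0.1, a run that still takes 9.9 ms after the glVertex calls are replaced is reported as CPU bound.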
4.2 Double buffering
For smooth animation one must maintain a high, constant frame rate. Double buffering has an important effect on this. Suppose your application needs to render at 60Hz but is only getting 30Hz. It's a mistake to think that you must reduce rendering time by 50% to achieve 60Hz. The reason is that the swap-buffers operation is synchronized to occur during the display's vertical retrace period (at 60Hz for example). It may be that your application is taking only a tiny bit too long to meet the 1/60 second rendering time limit for 60Hz.
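The effect of retrace synchronization can be worked out numerically: a frame occupies a whole number of refresh periods, so the displayed rate is the refresh rate divided by that number. The helper functions below are hypothetical and assume that each buffer swap waits for the next vertical retrace.

```c
/* Number of whole refresh periods one frame occupies.
 * render_us:  time to draw one frame, in microseconds
 * refresh_hz: display refresh rate in Hz */
int retraces_per_frame(long render_us, int refresh_hz)
{
    /* frame time expressed in millionths of a refresh period */
    long scaled = (long)refresh_hz * render_us;
    long periods = (scaled + 999999L) / 1000000L;   /* round up */
    return (periods < 1) ? 1 : (int)periods;
}

/* Frame rate actually seen on screen with synchronized buffer swaps. */
double effective_frame_rate(long render_us, int refresh_hz)
{
    return (double)refresh_hz / retraces_per_frame(render_us, refresh_hz);
}
```

On a 60Hz display, a frame that takes 17 ms is shown at 30Hz because it misses the first retrace, while trimming it to 16 ms brings the rate back to 60Hz.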
4.3 Test on several implementations
The performance of OpenGL implementations varies a lot. One should measure performance and test OpenGL applications on several different systems to be sure there are no unexpected problems.