Persistent mapped buffers

There has been a lot of discussion lately about OpenGL with the advent of next-gen, low-level APIs like Mantle, Metal and DirectX 12. These low-level APIs promise better performance through a design that maps more closely to the hardware (that’s why they are called low-level). I’m not going to talk about that in this article; there are plenty of good resources covering it.

OpenGL’s response to these low-level APIs is what some people call AZDO (Approaching Zero Driver Overhead), a presentation by Cass Everitt, Graham Sellers, John McDonald and Tim Foley. You can watch the presentation online, or check out the slides here: http://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead

What AZDO proposes is to use a group of extensions aimed at reducing driver overhead in modern OpenGL applications.

So, first of all, what is driver overhead? Driver overhead becomes noticeable when the application and the GPU could render more, but the driver cannot keep up with them and becomes the bottleneck.
OpenGL remains the only multiplatform graphics API at the moment, and this will not change in the immediate future: Mantle is not even out yet and will be Windows and AMD only (at least at first), and Metal is aimed at Apple products. For this reason I think we should embrace AZDO to reduce driver overhead and write more efficient programs. That’s why I have decided to write a series of articles explaining some of these extensions. This first article in the series is devoted to persistent mapped buffers.

Persistent-mapped buffers

Use case: updating dynamic buffers faster (dynamic VB/IB data, highly dynamic uniform data, MultiDrawIndirect command buffers).

Normally, if a buffer object is mapped, it cannot be used in a non-mapped fashion: rendering commands that would read from or write to a mapped buffer generate an error.
However, if the buffer is created with immutable storage and the GL_MAP_PERSISTENT_BIT flag, it can remain mapped while the GPU is using it. Immutable storage means you will not be able to reallocate that storage (e.g. orphaning the buffer won’t work). You can modify the data, but not its “memory address”.
To allocate immutable storage you call glBufferStorage( GLenum target, GLsizeiptr size, const GLvoid* data, GLbitfield flags ) with GL_MAP_PERSISTENT_BIT set in flags. Now you can map the buffer and keep it mapped forever.
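In code, the allocation and mapping boil down to something like this (a minimal sketch assuming a buffer is already bound to GL_ARRAY_BUFFER and bufferSize holds its size in bytes; the full sample at the end of the post follows the same pattern):

//Sketch: immutable storage that stays mapped for the application's lifetime
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage( GL_ARRAY_BUFFER, bufferSize, 0, flags );
void* mappedData = glMapBufferRange( GL_ARRAY_BUFFER, 0, bufferSize, flags );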

Obviously, this comes at a cost: now you need to do the synchronization yourself. That is, you will need to use fences to ensure that the GPU is not using the buffer while you are writing data to it. Every time you issue a draw call which may use the persistent-mapped buffer, you place a fence using glFenceSync. The next time you want to update the buffer, you call glClientWaitSync to ensure the fence has been signaled and the GPU is no longer working with your buffer. If you do this naively, chances are your application will spend a lot of time in glClientWaitSync; that’s when you will need double (or even triple) buffering on your buffer object, so you can update one region of the buffer while the GPU is using another region. In fact, AZDO proposes triple buffering: you create an immutable storage three times bigger than needed, so one region is what the GPU is currently using, another region is what the driver is holding ready for the GPU, and the last region is the one you are updating, as sketched below.
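As an illustration, a triple-buffered update loop could look roughly like this (a sketch with illustrative names, not part of the sample program below):

//Sketch: one fence per region of a persistently mapped buffer
//allocated three times bigger than needed.
GLsync regionFence[3] = { 0, 0, 0 };
int    region = 0;   //Region we will write to this frame

void UpdateAndDraw( GLint verticesPerRegion )
{
  //Wait until the GPU has finished reading from this region
  if( regionFence[region] )
  {
    while( glClientWaitSync( regionFence[region], GL_SYNC_FLUSH_COMMANDS_BIT, 1 ) == GL_TIMEOUT_EXPIRED )
    {}
    glDeleteSync( regionFence[region] );
    regionFence[region] = 0;
  }

  GLint offset = region * verticesPerRegion;
  //... write this frame's vertices through the mapped pointer, starting at 'offset' ...

  glDrawArrays( GL_TRIANGLES, offset, verticesPerRegion );

  //Fence will be signaled once the GPU has consumed this region
  regionFence[region] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
  region = ( region + 1 ) % 3;
}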

This is a little program showing the use of this extension. It creates a vertex buffer, maps it at the beginning of the execution, and then uses the pointer to update the data every frame. I don’t use double or triple buffering in this example to keep it simple, but it can be added easily. You can find this sample app and some others in this GitHub repository: https://github.com/fsole/GLSamples

EDIT: I forgot about this and it is actually quite important. As neobrain points out, to make sure the data you have just written to a mapped buffer is visible to the GPU, you need to either place a memory barrier or create and map the buffer with the GL_MAP_COHERENT_BIT. If you use GL_MAP_COHERENT_BIT both when allocating the buffer’s storage and when mapping it, the data you write automatically becomes visible to every OpenGL command issued after the write.
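For reference, the non-coherent variant could look roughly like this (a sketch, not part of the sample below, which uses GL_MAP_COHERENT_BIT instead; bufferSize and vertexCount are illustrative):

//Sketch: persistent mapping without GL_MAP_COHERENT_BIT
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;
glBufferStorage( GL_ARRAY_BUFFER, bufferSize, 0, flags );
void* mappedData = glMapBufferRange( GL_ARRAY_BUFFER, 0, bufferSize, flags );

//... write vertex data through mappedData ...

//Make the client writes visible to subsequent GL commands
glMemoryBarrier( GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT );
glDrawArrays( GL_TRIANGLES, 0, vertexCount );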

#include <iostream>
#include <math.h>
#include <stdlib.h>

#include "GL/glew.h"
#include "GL/freeglut.h"

namespace
{
  struct SVertex2D
  {
    float x;
    float y;
  };
  
  const GLchar* gVertexShaderSource[] = {
                         "#version 440 core\n"
                         "layout (location = 0 ) in vec2 position;\n"
                         "void main(void)\n"
                         "{\n"
                         "  gl_Position = vec4(position,0.0,1.0);\n"
                         "}\n" 
                         };
                         
  const GLchar* gFragmentShaderSource[] = {
                         "#version 440 core\n"
                         "out vec3 color;\n"                     
                         "void main(void)\n"
                         "{\n"
                         "  color = vec3(0.0,1.0,0.0);\n"
                         "}\n" 
                         };
                         
                         
  const SVertex2D gTrianglePosition[] = { {-0.5f,-0.5f}, {0.5f,-0.5f}, {0.0f,0.5f} };
  GLfloat gAngle = 0.0f;
  GLuint gVertexBuffer(0);  
  SVertex2D* gVertexBufferData(0);
  GLuint gProgram(0);
  GLsync gSync;
         
}//Unnamed namespace


GLuint CompileShaders(const GLchar** vertexShaderSource, const GLchar** fragmentShaderSource )
{
  //Compile vertex shader
  GLuint vertexShader( glCreateShader( GL_VERTEX_SHADER ) );
  glShaderSource( vertexShader, 1, vertexShaderSource, NULL );
  glCompileShader( vertexShader );
  
  //Compile fragment shader
  GLuint fragmentShader( glCreateShader( GL_FRAGMENT_SHADER ) );
  glShaderSource( fragmentShader, 1, fragmentShaderSource, NULL );
  glCompileShader( fragmentShader );
  
  //Link vertex and fragment shader together
  GLuint program( glCreateProgram() );
  glAttachShader( program, vertexShader );
  glAttachShader( program, fragmentShader );
  glLinkProgram( program );
  
  //Delete shaders objects
  glDeleteShader( vertexShader );
  glDeleteShader( fragmentShader );   

  return program;  
}

void Init(void)
{
  //Check if Opengl version is at least 4.4
  const GLubyte* glVersion( glGetString(GL_VERSION) );
  int major = glVersion[0] - '0';
  int minor = glVersion[2] - '0';  
  if( major < 4 || ( major == 4 && minor < 4 ) )
  {
    std::cerr<<"ERROR: Minimum OpenGL version required for this demo is 4.4. Your current version is "<<major<<"."<<minor<<std::endl;
    exit(-1);
  }

  //Init glew
  glewInit(); 
    
  //Set clear color
  glClearColor(1.0f, 1.0f, 1.0f, 0.0f);
  
  //Create and bind the shader program
  gProgram = CompileShaders( gVertexShaderSource, gFragmentShaderSource );
  glUseProgram(gProgram);
  glEnableVertexAttribArray(0);

  //Create a vertex buffer object
  glGenBuffers( 1, &gVertexBuffer );
  glBindBuffer( GL_ARRAY_BUFFER, gVertexBuffer );
  glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, 0 );
  
  //Create an immutable data store for the buffer
  size_t bufferSize( sizeof(gTrianglePosition) );  
  GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
 
  glBufferStorage( GL_ARRAY_BUFFER, bufferSize, 0, flags );
  
  //Map the buffer forever
  gVertexBufferData = (SVertex2D*)glMapBufferRange( GL_ARRAY_BUFFER, 0, bufferSize, flags );

}

void LockBuffer()
{
  if( gSync )
  {
    glDeleteSync( gSync );	
  }
  gSync = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
}

void WaitBuffer()
{
  if( gSync )
  {
    while( 1 )
    {
      GLenum waitReturn = glClientWaitSync( gSync, GL_SYNC_FLUSH_COMMANDS_BIT, 1 );
      if( waitReturn == GL_ALREADY_SIGNALED || waitReturn == GL_CONDITION_SATISFIED )
        return;
    }
  }
}

void Display()
{
  glClear( GL_COLOR_BUFFER_BIT );
  gAngle += 0.1f;
  
  //Wait until the gpu is no longer using the buffer
  WaitBuffer();
  
  //Modify vertex buffer data using the persistent mapped address
  for( size_t i(0); i!=3; ++i )
  {
    gVertexBufferData[i].x = gTrianglePosition[i].x * cosf( gAngle ) - gTrianglePosition[i].y * sinf( gAngle );
    gVertexBufferData[i].y = gTrianglePosition[i].x * sinf( gAngle ) + gTrianglePosition[i].y * cosf( gAngle );    
  }  

  //Draw using the vertex buffer
  glDrawArrays( GL_TRIANGLES, 0, 3 );
  
  //Place a fence which will be signaled when the draw command has finished
  LockBuffer();

  glutSwapBuffers();
}

void Quit()
{
  //Clean-up
  glUseProgram(0);
  glDeleteProgram(gProgram);
  glUnmapBuffer( GL_ARRAY_BUFFER );
  glDeleteBuffers( 1, &gVertexBuffer );
  
  //Exit application
  exit(0);
}

void OnKeyPress( unsigned char key, int x, int y )
{
  //'Esc' key
  if( key == 27 )
    Quit();    
}

int main( int argc, char** argv )
{
  glutInit(&argc, argv);
  glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB );
  glutInitWindowSize(400,400);  
  glutCreateWindow("Persistent-mapped buffers example");  
  glutIdleFunc(Display);
  glutKeyboardFunc( OnKeyPress );
  
  Init();
  
  //Enter the GLUT event loop
  glutMainLoop();
}

6 thoughts on “Persistent mapped buffers”

  1. Since you don’t specify GL_MAP_COHERENT_BIT in glBufferStorage, shouldn’t you call glMemoryBarrier after writing to the buffer and before calling glDrawArrays?

  2. thanks for a great post!

    So how can I do triple buffering? The first thing is to obviously allocate 3x of buffer size… but how to manage synchronization? Can your sync code be simplified in that case?

    • Hi fenbf and thank you for your comment.
      In theory you are using triple buffering to avoid the synchronization point, so you might think you don’t need to place fences anymore. The best option, though, is to keep using fences to make sure any rendering command using the range you are about to update has finished before you update that range of the buffer, just in case…
      This pseudo-code may help

      struct Range
      {
        size_t begin;
        size_t count;
        GLsync sync;
      };

      Range BufferRange[3];
      int index(0);

      void Display()
      {
        //WaitBuffer should return immediately
        //( that’s why we are using 3x memory! )
        WaitBuffer( BufferRange[index].sync );

        //Update that region of the buffer
        UpdateBuffer( BufferRange[index] );

        glDrawArrays( GL_TRIANGLES, BufferRange[index].begin, BufferRange[index].count );

        LockBuffer( BufferRange[index].sync );
        index = ( index + 1 ) % 3;
      }

      • Thanks for a quick answer!

        I have a basic app where I used 3x buffer size and persistent mapped buffers… and it worked without any synchronization (at least it’s not crashing :), AMD GPU). But I think it’s better to use sync just to be sure everything works as expected.

        So thanks for the code, I’ll need to use it in my test app.
