I haven't figured out what exactly OpenGL 4.0 changes about this feature, since as far as I understand it existed before as well. I'm not sure if this answers your question, but I'll tell you what I know about the subject anyway.
It refers to a situation where a library other than OpenGL, such as OpenCL or CUDA, writes data directly into graphics card memory, and OpenGL then picks up where that library left off and uses the data as
- a pixel buffer object (PBO) when you want to draw the data to the screen as it is,
- a texture when you want to use the data as part of some other scene,
- a vertex buffer object (VBO) when you want to use the data as arbitrary per-vertex attribute input for a vertex shader (one example of this is a particle system that is simulated with CUDA and rendered with OpenGL).
In a situation like this, it's a very good idea to keep the data on the graphics card the whole time instead of copying it around, and especially not to copy it through the CPU, because the PCIe bus is very slow compared to the memory bus of the graphics card.
Here's some sample code to do the trick with CUDA and OpenGL for VBOs and PBOs:
// in the beginning: create the buffer and allocate storage for it
// (buffer_size is however many bytes your kernel is going to write)
GLuint id;
glGenBuffers(1, &id);
glBindBuffer(GL_ARRAY_BUFFER, id);
glBufferData(GL_ARRAY_BUFFER, buffer_size, NULL, GL_DYNAMIC_DRAW);
// for every frame
cudaGLRegisterBufferObject(id);
void *ptr;
cudaGLMapBufferObject(&ptr, id);
// <launch kernel here, writing through ptr>
cudaGLUnmapBufferObject(id);
// <now use the buffer "id" with OpenGL>
cudaGLUnregisterBufferObject(id);
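For example, if the kernel filled the buffer with tightly packed float4 vertex positions, drawing straight from it could look roughly like this (just a sketch of one possible setup, with num_particles as a placeholder and a suitable shader already bound):
glBindBuffer(GL_ARRAY_BUFFER, id);
glEnableVertexAttribArray(0);  // attribute 0 = position in your vertex shader
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 0, 0);
glDrawArrays(GL_POINTS, 0, num_particles);
glDisableVertexAttribArray(0);
glBindBuffer(GL_ARRAY_BUFFER, 0);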
And here's how you can load the data into a texture:
// the texture must already exist (i.e. it was created earlier with glTexImage2D)
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, id);
glBindTexture(GL_TEXTURE_2D, your_tex_id);
// while a PBO is bound, the last argument is a byte offset into it, not a CPU pointer
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256, GL_RGBA, GL_UNSIGNED_BYTE, 0);
Also note that if you use a more unusual format instead of GL_RGBA, the upload might be slower because the driver has to convert all the values.
I don't know OpenCL as well, but the idea is the same; only the function names are different.
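For the record, my understanding is that the OpenCL version looks roughly like this (a sketch only; it assumes you have already created a cl_context with OpenGL sharing enabled and a cl_command_queue named queue, and it uses the functions from CL/cl_gl.h):
cl_int err;
cl_mem cl_buf = clCreateFromGLBuffer(context, CL_MEM_WRITE_ONLY, id, &err);
// for every frame: let OpenCL take the buffer, run the kernel, then hand it back to OpenGL
clEnqueueAcquireGLObjects(queue, 1, &cl_buf, 0, NULL, NULL);
// <enqueue kernel here, writing into cl_buf>
clEnqueueReleaseGLObjects(queue, 1, &cl_buf, 0, NULL, NULL);
clFinish(queue);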
Another way to do a similar thing is so-called pinned host memory. In that approach you allocate page-locked CPU memory and map it into the graphics card's address space, so the GPU can access that CPU memory directly over the PCIe bus instead of keeping a separate copy in graphics card memory.
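In CUDA that looks something like the sketch below (data_size is a placeholder for however many bytes you need; on older setups you may also have to call cudaSetDeviceFlags(cudaDeviceMapHost) before allocating):
// allocate page-locked host memory that is mapped into the device address space
float *host_ptr;
cudaHostAlloc((void **)&host_ptr, data_size, cudaHostAllocMapped);
// get the device-side pointer that refers to the same memory
float *device_ptr;
cudaHostGetDevicePointer((void **)&device_ptr, host_ptr, 0);
// <launch kernel here with device_ptr>
cudaDeviceSynchronize();
cudaFreeHost(host_ptr);
Keep in mind that every access the kernel makes through that pointer travels over PCIe, so this is usually only a win for data that is read or written once.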