Updating pixel shaders

Unfortunately, it turns out that a bunch of small updates per draw call can create severe bottlenecks. The goal of this post is to revisit constant buffer usage patterns and to take a look at some of the superior new partial constant buffer updates that DirectX 11.1™ in Windows 8™ allows for.

So, from a conceptual point of view, there is no need to store any constant twice. Unfortunately, binding the same constant buffer to several shader stages makes updates more expensive (see Table 1 below), so in practice you may want to avoid this.

If an application wants to change the contents of a constant buffer, the old contents are usually still needed, because the GPU is either still accessing them or has not yet consumed them. As a consequence, to avoid stalling, the graphics driver returns a pointer to a different block of memory for each pair of calls to Map(MAP_WRITE_DISCARD)/Unmap() or each call to UpdateSubresource().

Furthermore, if the same values had to be set for both the pixel and vertex shader stages, they had to be set twice, as constants could not be shared between stages. Constant buffers are objects that preserve the values of the stored shader constants until it becomes necessary to change them.
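To make the renaming behaviour described above more concrete, here is a toy CPU-side model of it. It is only a sketch: the class RenamedConstantBuffer and its methods are hypothetical names invented for illustration, and no real driver is claimed to work this way internally. The one thing it mirrors from the text is that every discard-style map hands back a different block of memory while older blocks stay alive.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy model of driver-side "renaming". Hypothetical; not a real driver or D3D API.
class RenamedConstantBuffer {
public:
    explicit RenamedConstantBuffer(std::size_t sizeInBytes) : size_(sizeInBytes) {}

    // Stands in for Map(MAP_WRITE_DISCARD): the previous contents may still be
    // in use by the GPU, so a fresh block is handed out instead of waiting.
    std::uint8_t* mapDiscard() {
        blocks_.push_back(std::make_unique<std::uint8_t[]>(size_));
        return blocks_.back().get();
    }

    // Memory held by all renamed copies that have not been retired yet.
    std::size_t bytesInFlight() const { return blocks_.size() * size_; }

    // Stands in for the GPU having consumed every copy except the newest one.
    void retireAllButLatest() {
        if (blocks_.size() > 1) {
            blocks_.erase(blocks_.begin(), blocks_.end() - 1);
        }
    }

private:
    std::size_t size_;
    std::vector<std::unique_ptr<std::uint8_t[]>> blocks_;
};
```

In this model, two back-to-back calls to mapDiscard() on a 2048-byte buffer return two distinct pointers and leave 4096 bytes in flight until the older copy is retired, which is why many small per-draw updates add up quickly.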

Game engines were supposed to turn the necessary constant changes into updates of a few small constant buffers.

Figure 1 – Renaming a sequence of constant buffer updates

Example: 4096 bytes of constant buffer data (2048 bytes each for the pixel stage and the vertex stage), updated for 10000 draw calls per frame, get renamed to ~156MB (assuming 4 frames are in flight). This clearly exceeds the above-mentioned limit of 128MB, causing the driver to throttle down.
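The arithmetic behind that figure can be checked with a few lines of C++. This is a sketch; the helper renamedBytes and the constant names are made up for illustration, while the numbers are the ones quoted in the example.

```cpp
#include <cstdint>

// Bytes of renamed constant data kept alive while `framesInFlight` frames are queued.
constexpr std::uint64_t renamedBytes(std::uint64_t bytesPerDraw,
                                     std::uint64_t drawsPerFrame,
                                     std::uint64_t framesInFlight) {
    return bytesPerDraw * drawsPerFrame * framesInFlight;
}

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Example from the text: 2048 bytes for each of two stages, 10000 draws, 4 frames.
constexpr std::uint64_t kExampleBytes = renamedBytes(2 * 2048, 10000, 4);

static_assert(kExampleBytes == 163840000ull, "4096 * 10000 * 4 bytes");
static_assert(kExampleBytes / kMiB == 156, "about 156MB, as quoted");
static_assert(kExampleBytes > 128 * kMiB, "exceeds the 128MB limit");
```

Halving any one factor (bytes per draw, draws per frame, or frames in flight) brings the total down to roughly 78MB, which is below the limit.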

Let’s talk about the cost in CPU cycles on a contemporary CPU (3.3GHz Intel™ Core-i7™) for various constant buffer operations – see Table 1, or Figure 2, which offers a different view of the same data.

To make copying memory to constant buffers fast, it makes sense to use _aligned_malloc() to allocate memory that is aligned to 16-byte boundaries, as this speeds up the necessary memcpy() operation from application memory to the memory returned by Map().

In a similar vein, UpdateSubresource() will be able to perform the copy operation faster too.
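The alignment advice above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names (allocAligned16, freeAligned16, isAligned16, uploadConstants) and the PerDrawConstants layout are hypothetical, the `mapped` pointer merely stands in for the memory returned by Map(), and on non-Microsoft compilers the C++17 function std::aligned_alloc is substituted for _aligned_malloc().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#if defined(_MSC_VER)
#include <malloc.h>  // _aligned_malloc / _aligned_free
#endif

constexpr std::size_t kAlignment = 16;

// Allocate at least `size` bytes on a 16-byte boundary. The size is rounded up
// to a multiple of the alignment, which std::aligned_alloc expects.
inline void* allocAligned16(std::size_t size) {
    const std::size_t rounded = (size + kAlignment - 1) / kAlignment * kAlignment;
#if defined(_MSC_VER)
    return _aligned_malloc(rounded, kAlignment);
#else
    return std::aligned_alloc(kAlignment, rounded);
#endif
}

// Release memory obtained from allocAligned16().
inline void freeAligned16(void* p) {
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    std::free(p);
#endif
}

inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}

// Hypothetical per-draw constants; the layout is illustrative only.
struct PerDrawConstants {
    float worldViewProj[16];
    float tint[4];
};

// `mapped` stands in for the pointer returned by Map().
inline void uploadConstants(void* mapped, const PerDrawConstants* src) {
    std::memcpy(mapped, src, sizeof(PerDrawConstants));
}
```

An application would typically allocate its CPU-side copy of the constants once with allocAligned16(), update it in place, and call uploadConstants() between Map() and Unmap(); whether the copy actually gets faster depends on the memcpy() implementation and the hardware.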