During this last week, I spent some time optimizing my particle system, since it was showing up consistently in my profiles and I was adding more and more particles to my environments.
There are a few basic approaches you can take when trying to optimize code like my particle system.
- The fact that you have a large number of particles all behaving in the same way means that you can easily distribute the work of updating/rendering particles across multiple cores, as long as your data structures and libraries are set up correctly to handle it – so one option is to multithread your particle system.
- The parallel-friendly nature of a particle system also means that it’s possible to offload much of the work involved in rendering particles directly to the GPU, and do it in a shader instead of on the CPU. This is almost always faster.
- In fact, in many cases, you can even update your particles on the GPU, by storing their state in a texture or vertex buffer and having a shader run over all the particles and write their new state to another texture/buffer. You can then take the new state and feed it into another shader as input to render your particles.
- And of course, you can always take the standard approach of brute-force optimization, by making your particle system as efficient as possible with the same basic algorithm.
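The GPU-update approach from the third bullet boils down to a double-buffered ("ping-pong") update: read the old state from one buffer, write the new state to another, then swap. A minimal CPU-side sketch of that pattern follows, with two arrays standing in for the two textures or vertex buffers. `PingPongState`, `Step` and the kernel delegate are hypothetical names for illustration, not part of XNA.

```csharp
using System;

// Hypothetical CPU stand-in for the two state textures/buffers.
public sealed class PingPongState {
    private float[] _read;   // state from the previous step (the shader's input)
    private float[] _write;  // state being produced (the shader's output)

    public PingPongState(float[] initial) {
        _read = (float[])initial.Clone();
        _write = new float[initial.Length];
    }

    // The most recently completed state.
    public float[] Current { get { return _read; } }

    // Runs 'kernel' once per element, the way a shader would run once per particle.
    public void Step(Func<float, float> kernel) {
        for (int i = 0; i < _read.Length; i++)
            _write[i] = kernel(_read[i]);

        // Swap roles: the freshly written buffer becomes the next step's input.
        float[] tmp = _read;
        _read = _write;
        _write = tmp;
    }
}
```

On real hardware the kernel would be a shader and the swap would be a render-target switch, but the read-one, write-other, swap structure is the same.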
While I could have attempted to shift all the work onto the GPU, or make use of multiple threads, for now I decided to simply offload rendering work onto the GPU, because my profiles showed that I was spending a considerable amount of time handing particle system state off to the XNA SpriteBatch class. Reading the source code for SpriteBatch in Reflector shows that there are lots of tiny inefficiencies in its implementation when you’re trying to use it to render particles – it does a lot of work to handle state changes, texture sorting, and other considerations that do not apply when you have large batches of particles with identical parameters.
As an upside, rendering particles yourself using a shader makes it easier to distribute your updating and rendering logic across threads, because you can now generate vertex information for your particles in batches on multiple threads, before handing them to the GPU. When using SpriteBatch, you’re stuck because every SpriteBatch.Draw call requires synchronization.
Since this was my first time writing an HLSL shader, the process of moving from SpriteBatch to my custom shader was an interesting one. I ran into lots of little snags and ended up having to change my design multiple times along the way, and spent a little while experimenting with different rendering techniques to figure out which one performed best. One particularly surprising conclusion was that small batches of vertices were much faster than large batches – initially, I was updating and rendering all my particles in a single batch, and then handing them all to the GPU at once.
I assumed that this would allow the driver and the GPU to crunch away on all the particles in the background while I moved on to doing other work on the CPU, but in practice, generating a small batch and handing it to the GPU while I work on the next batch is consistently faster on both the PC and the Xbox 360. This is one of those cases where you might assume a code change will improve performance, but if you don’t benchmark and tune carefully, it can actually impair your game’s performance – disappointing if you just spent 8 hours hacking on something only to realize it was a bad idea.
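The small-batch loop has roughly the following shape. This is a sketch under assumptions: `BatchedSubmit.Run` is a hypothetical helper, the `submit` delegate stands in for whatever draw call is actually used, and the batch size (the length of `scratch`) is something to tune by measurement rather than a known-good value.

```csharp
using System;

public static class BatchedSubmit {
    // Copies up to scratch.Length items at a time into 'scratch' and hands each
    // filled slice to 'submit' (a stand-in for the real draw call).
    // Assumes 'submit' consumes or copies the data before returning, so the
    // scratch buffer can be reused for the next batch.
    // Returns the number of batches submitted.
    public static int Run<T>(T[] items, T[] scratch, Action<T[], int> submit) {
        if (scratch.Length == 0)
            throw new ArgumentException("scratch must not be empty");

        int batches = 0;
        for (int start = 0; start < items.Length; start += scratch.Length) {
            int count = Math.Min(scratch.Length, items.Length - start);
            Array.Copy(items, start, scratch, 0, count);
            submit(scratch, count);
            batches++;
        }
        return batches;
    }
}
```

Because the scratch buffer is small and reused, it also tends to stay in cache, which may be part of why this pattern benchmarks well.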
The first step when implementing the shader for my particle system was to determine how to convert my particle system’s state into vertices for the shader to consume. There are a few factors that make this a bit of a challenge:
- In general, GPUs operate on collections of values – vectors and matrices – not individual values. This means you can’t just toss 16 uniquely named floats and integers into a vertex and get good performance; for ideal performance you need to pack groups of related values into vectors. This is straightforward for things like positions and velocities.
- With a few exceptions, you need to send the GPU as many vertices as you want it to draw. This is a bit of a pain when dealing with a particle system, since you typically want to map one set of values (a particle) into 6 vertices for a textured quad that represents the particle. In some cases, you can utilize Point Sprite support to get the job done, but hardware point sprites are tremendously limited. This means you need to find an efficient way to transform each particle into 6 vertices.
- Since you have to generate 6 vertices from each particle, you need to minimize redundant calculations – there are lots of calculations that the GPU is capable of doing, so you want to offload as many of them to the GPU as possible, so you can avoid doing them on the main CPU, where they cost significantly more.
For my particle system, the state of a particle looks like this:
public struct Particle {
    public Vector2 Position;
    public Vector2 Velocity;
    public float Opacity;
    public float Scale;
    public float Rotation;
    public Color Color;
}
After some experimenting and thinking, the vertex format I ended up with looks like this:
public struct ParticleVertex {
    public Vector2 Position;
    public Vector3 Params; // Opacity, Scale, Rotation
    public Color Color;
    public short Corner;
    public short Unused;
}
So, to begin with, you’ll notice that I’m packing three unique values (opacity, scale, rotation) into a single vector. This is important because the vertex shader will only need to use one register to hold all three values, instead of needing a register for each individual value. I’m also keeping the opacity value separate from the particle’s color, because combining the two values on the CPU is prohibitively expensive (mostly due to some stupid design decisions in the XNA framework, but I digress…), so I multiply the alpha in on the shader side instead. The ‘Corner’ value is used so the shader can determine which of the particle’s four corners is being shaded – this allows us to duplicate a given particle vertex six times to satisfy the video card’s desire for two triangles, changing only the Corner. There’s also that strange-looking ‘Unused’ value there, which exists for a reason I’ll explain later.
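When describing a vertex format like this to the GPU, each field's byte offset and the total stride are needed. A small sketch for checking those numbers follows, using stand-in structs so it compiles without XNA. `Float2`, `Float3`, `ParticleVertexLayout` and `VertexLayout` are hypothetical names, and the `uint` is an assumption standing in for XNA's 4-byte packed `Color`.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Float2 { public float X, Y; }      // stand-in for Vector2 (8 bytes)

[StructLayout(LayoutKind.Sequential)]
public struct Float3 { public float X, Y, Z; }   // stand-in for Vector3 (12 bytes)

[StructLayout(LayoutKind.Sequential)]
public struct ParticleVertexLayout {
    public Float2 Position;
    public Float3 Params;   // opacity, scale, rotation
    public uint Color;      // stand-in for a 4-byte packed color
    public short Corner;
    public short Unused;
}

public static class VertexLayout {
    // Total size of one vertex in bytes.
    public static int Stride {
        get { return Marshal.SizeOf(typeof(ParticleVertexLayout)); }
    }

    // Byte offset of a named field within the vertex.
    public static int OffsetOf(string field) {
        return (int)Marshal.OffsetOf(typeof(ParticleVertexLayout), field);
    }
}
```

With this layout the fields land at offsets 0, 8, 20, 24 and 26, for a stride of 28 bytes per vertex.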
Given the two formats, it’s relatively simple to write some code to transform from one to the other:
Particle p = particles[i];
vertex.Position = p.Position;
vertex.Params.X = p.Opacity;
vertex.Params.Y = p.Scale;
vertex.Params.Z = p.Rotation;
vertex.Color = p.Color;
Once you have a vertex for a given particle, then all you have to do is emit the vertices for each corner:
// j is the running write index into d, the output vertex array
for (short k = 0; k < 4; j++, k++) {
    vertex.Unused = vertex.Corner = k;
    d[j] = vertex;
}
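Put together, the two snippets above form a loop along these lines. This is a self-contained sketch rather than the original code: the structs are simplified stand-ins that use plain floats and a `uint` color instead of the XNA types, and `VertexExpansion.Expand` is a hypothetical name.

```csharp
using System;

// Simplified stand-ins for the post's structs (plain floats instead of XNA vectors).
public struct Particle {
    public float X, Y, Opacity, Scale, Rotation;
    public uint Color;
}

public struct ParticleVertex {
    public float X, Y, Opacity, Scale, Rotation;
    public uint Color;
    public short Corner, Unused;
}

public static class VertexExpansion {
    // Writes four vertices per particle into dest, starting at dest[0].
    // Returns the number of vertices written.
    public static int Expand(Particle[] particles, int first, int count, ParticleVertex[] dest) {
        if (dest.Length < count * 4)
            throw new ArgumentException("dest needs room for 4 vertices per particle");

        int j = 0;
        for (int i = first; i < first + count; i++) {
            Particle p = particles[i];
            var vertex = new ParticleVertex {
                X = p.X, Y = p.Y,
                Opacity = p.Opacity, Scale = p.Scale, Rotation = p.Rotation,
                Color = p.Color
            };
            // Same data four times over; only the corner index differs.
            for (short k = 0; k < 4; k++, j++) {
                vertex.Corner = vertex.Unused = k;
                dest[j] = vertex;
            }
        }
        return j;
    }
}
```

Taking a `first`/`count` range rather than the whole array makes it easy to call this once per batch, or once per thread on disjoint ranges.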
You may notice that Unused has shown up again. Here’s why: Originally, I only populated the Corner field, and the shader worked perfectly – on my PC. On the Xbox, it mysteriously rendered nothing. I finally realized that the Xbox has a different byte ordering from my PC, since it’s a PowerPC-based chip instead of an x86 one. As a result, my shader was reading from Unused on the 360 instead of from Corner. As a simple solution, I just populate both fields, since I have to send them anyway (there’s no vertex element format for a single byte or a single short, so the pair of shorts is the smallest thing I can send).
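One way this symptom could arise is if the pair of shorts gets byte-swapped as a single 32-bit unit rather than as two separate 16-bit values. That mechanism is an assumption made for illustration; the post does not pin down exactly how the swap happens. The sketch below (with a hypothetical `EndianDemo` helper) shows that under this assumption the two fields trade places, and that writing the same value to both makes the result order-independent.

```csharp
using System;

public static class EndianDemo {
    // Lays out (corner, unused) the way a little-endian PC would, reverses the
    // four bytes as one 32-bit unit, then reads them back as two big-endian shorts.
    // Returns { first field as seen after the swap, second field as seen after the swap }.
    public static short[] SwapAsOneWord(short corner, short unused) {
        byte[] b = {
            (byte)(corner & 0xFF), (byte)((corner >> 8) & 0xFF),
            (byte)(unused & 0xFF), (byte)((unused >> 8) & 0xFF)
        };
        Array.Reverse(b);

        short first  = (short)((b[0] << 8) | b[1]);
        short second = (short)((b[2] << 8) | b[3]);
        return new[] { first, second };
    }
}
```

For example, `SwapAsOneWord(2, 0)` yields a first field of 0 and a second field of 2, while `SwapAsOneWord(2, 2)` yields 2 in both.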
You may also notice that I’m generating four vertices, not six. This is so that I can take advantage of a pre-generated index buffer and only send four vertices per particle to the GPU instead of six. The index buffer is really simple to generate:
for (short i = 0, j = 0; i < numVertices; i += 4, j += 6) {
    indices[j] = i;
    indices[j + 1] = (short)(i + 1);
    indices[j + 2] = (short)(i + 3);
    indices[j + 3] = (short)(i + 1);
    indices[j + 4] = (short)(i + 2);
    indices[j + 5] = (short)(i + 3);
}
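As a quick sanity check on that pattern, here is a standalone sketch of the same loop wrapped in a hypothetical `QuadIndices.Build` helper. It uses `int` loop counters so they cannot overflow, and it makes the 16-bit limit explicit: with signed `short` indices, one buffer can address at most 8192 quads.

```csharp
using System;

public static class QuadIndices {
    // Vertex numbers must fit in a signed 16-bit index (0..32767),
    // so one buffer can address at most 32768 / 4 = 8192 quads.
    public const int MaxQuads = 8192;

    // Builds six indices (two triangles) for each group of four vertices.
    public static short[] Build(int quadCount) {
        if (quadCount < 0 || quadCount > MaxQuads)
            throw new ArgumentOutOfRangeException("quadCount");

        var indices = new short[quadCount * 6];
        for (int q = 0, i = 0, j = 0; q < quadCount; q++, i += 4, j += 6) {
            indices[j]     = (short)i;        // first triangle: corners 0, 1, 3
            indices[j + 1] = (short)(i + 1);
            indices[j + 2] = (short)(i + 3);
            indices[j + 3] = (short)(i + 1);  // second triangle: corners 1, 2, 3
            indices[j + 4] = (short)(i + 2);
            indices[j + 5] = (short)(i + 3);
        }
        return indices;
    }
}
```

For two particles this produces 0, 1, 3, 1, 2, 3, 4, 5, 7, 5, 6, 7. Since the contents depend only on the quad count, the buffer can be built once at startup and reused for every batch.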
Once I have the vertex format set up, and I have code to generate vertices for my particles, the only hard parts remaining are to write a shader and set it up to be used by the game. The shader ends up being relatively simple – the only real complicated part is handling rotation:
float2 TextureSize;
float2 Translation;
texture ParticleTexture;
float4x4 MatrixTransform;

sampler TextureSampler = sampler_state {
    Texture = (ParticleTexture);
    MinFilter = Linear;
    MagFilter = Linear;
    MipFilter = Linear;
};

const float2 Corners[] = {
    {-0.5f, -0.5f},
    { 0.5f, -0.5f},
    { 0.5f,  0.5f},
    {-0.5f,  0.5f}
};

void VertexShader(
    in float2 position : POSITION0, // x, y
    inout float4 color : COLOR0,
    in float3 params : POSITION1, // opacity, scale, rotation
    in int2 cornerIndex : BLENDINDICES0, // 0-3
    out float2 texCoord : TEXCOORD0,
    out float4 result : POSITION0
) {
    float2 corner = Corners[cornerIndex.x];
    texCoord = corner + Corners[2];
    float2 sinCos, rotatedCorner;
    corner *= TextureSize.xy;
    sincos(params.z, sinCos.x, sinCos.y);
    rotatedCorner.x = (sinCos.y * corner.x) - (sinCos.x * corner.y);
    rotatedCorner.y = (sinCos.x * corner.x) + (sinCos.y * corner.y);
    position.xy += (rotatedCorner * params.y) - Translation;
    color *= params.x;
    result = mul(float4(position.xy, 0, 1), MatrixTransform);
}

void PixelShader(
    inout float4 color : COLOR0,
    float2 texCoord : TEXCOORD0
) {
    color *= tex2D(TextureSampler, texCoord);
}

technique ParticleTechnique
{
    pass P0
    {
        vertexShader = compile vs_1_1 VertexShader();
        pixelShader = compile ps_1_1 PixelShader();
    }
}
There are a few things at work here: We define a Sampler in our shader that represents the texture for our particles, and set the parameters that determine how the texture will be scaled and mipmapped. We also define some variables that can be set by the game at runtime to feed into the shader, along with a constant array that contains offsets for all four vertex corners. The array lets us map those integer corner indices to x/y coordinates easily, so we can convert four identical points into the corners of a quad.
The pixel shader doesn’t do anything of note, so I’ll just go over the vertex shader. First, we map the corner index into an xy coordinate pair, by looking it up in the constant array. Then, we read the rotation out of the parameters structure, and use sincos to generate a rotated version of the corner coordinate, so that the resulting quad for the particle is rotated appropriately. (You could do this with a matrix multiply instead of individual arithmetic, but I’m too lazy.)
Finally, we combine the rest of the parameters: scale the rotated corner by the scale parameter, add it to the particle’s center point, and then translate the result by the position of the camera.
Once we’ve done that, we multiply the input color by the input opacity to generate the actual color for the particle, and apply our transform matrix to generate the final position of the particle’s vertex. Note that the camera translation could be folded into this matrix if we wanted, since we’re using a 4×4 transform matrix; the per-particle rotation can’t be folded into a single shared matrix, since it differs from particle to particle. All in all, a relatively simple shader.
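When debugging a vertex shader like this, it can help to have a CPU-side mirror of the same math to compare against. The sketch below reproduces the corner calculation step by step, along with the matrix-based variant mentioned above. It uses `System.Numerics` types as stand-ins for the XNA ones, and `CornerMath` and its methods are hypothetical names.

```csharp
using System;
using System.Numerics;

public static class CornerMath {
    // Same table as the shader's Corners array.
    static readonly Vector2[] Corners = {
        new Vector2(-0.5f, -0.5f), new Vector2(0.5f, -0.5f),
        new Vector2(0.5f, 0.5f),   new Vector2(-0.5f, 0.5f)
    };

    // Mirrors the shader: size the corner, rotate it, scale it,
    // add the particle center, subtract the camera translation.
    public static Vector2 Corner(Vector2 center, Vector2 textureSize, float scale,
                                 float rotation, Vector2 translation, int cornerIndex) {
        Vector2 corner = Corners[cornerIndex] * textureSize;
        float s = (float)Math.Sin(rotation);
        float c = (float)Math.Cos(rotation);
        var rotated = new Vector2(c * corner.X - s * corner.Y,
                                  s * corner.X + c * corner.Y);
        return center + rotated * scale - translation;
    }

    // The same calculation with the rotation expressed as a matrix.
    public static Vector2 CornerViaMatrix(Vector2 center, Vector2 textureSize, float scale,
                                          float rotation, Vector2 translation, int cornerIndex) {
        Matrix3x2 rotate = Matrix3x2.CreateRotation(rotation);
        Vector2 rotated = Vector2.Transform(Corners[cornerIndex] * textureSize, rotate);
        return center + rotated * scale - translation;
    }

    // Texture coordinate for a corner: shifts -0.5..0.5 into 0..1.
    public static Vector2 TexCoord(int cornerIndex) {
        return Corners[cornerIndex] + Corners[2];
    }
}
```

Both methods should agree to within floating-point rounding, which makes them a handy cross-check for each other as well as for the shader.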
After adding the shader to my game’s content project and compiling it, I can load it up at runtime as an Effect, and apply it when I want to render particles. Of course, using it requires filling in the various constants with the right values so that the shader can generate particles at the right coordinates:
Effect.CurrentTechnique = Effect.Techniques["ParticleTechnique"];
Effect.Parameters["TextureSize"].SetValue(new Vector2(Texture.Width, Texture.Height));
Effect.Parameters["ParticleTexture"].SetValue(Texture);
Effect.Parameters["Translation"].SetValue(Camera.ViewportPosition);
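The snippet above doesn't show a value for MatrixTransform. A common choice for 2D rendering, and the assumption made in this sketch, is an orthographic projection that maps pixel coordinates (origin at the top left, y pointing down) to clip space. The sketch uses `System.Numerics.Matrix4x4` as a stand-in for the XNA `Matrix` type, and `ScreenProjection` is a hypothetical helper name.

```csharp
using System.Numerics;

public static class ScreenProjection {
    // Maps (0,0)..(width,height) in pixels, y down, to clip space (-1,1)..(1,-1).
    public static Matrix4x4 Create(float viewportWidth, float viewportHeight) {
        return Matrix4x4.CreateOrthographicOffCenter(
            0f, viewportWidth,    // left, right
            viewportHeight, 0f,   // bottom, top (flipped so y grows downward)
            0f, 1f);              // near, far
    }

    // Applies the matrix the same way the shader's mul(float4(pos, 0, 1), M) does.
    public static Vector4 Project(Vector2 pixel, Matrix4x4 m) {
        return Vector4.Transform(new Vector4(pixel, 0f, 1f), m);
    }
}
```

With an 800×600 viewport, the top-left pixel lands at (-1, 1) and the bottom-right at (1, -1). Direct3D 9 conventionally also wants a half-pixel offset for exact texel alignment, which this sketch leaves out.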
Once we’ve set the constants, we’re ready to render some particles.
At this point, we have a relatively efficient GPU-accelerated particle system. There’s lots of room for improvement, but as-is this system is considerably faster than my previous SpriteBatch-based implementation. The fact that I’m generating vertices in arrays and handing them to the GPU directly also means that if I want to, I can improve my particle system to use multiple threads to do updating and rendering without much hassle, since I won’t need to add any sophisticated synchronization – I can just slice up the particle array into chunks and hand each chunk to a thread.
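The chunking idea could look something like the sketch below. It assumes desktop .NET 4 or later for `Parallel.For`; on platforms without the Task Parallel Library, hand-managed threads would be needed instead. `ParallelChunks.Run` is a hypothetical helper, and the `work` delegate stands in for per-chunk update or vertex generation.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelChunks {
    // Splits [0, total) into contiguous chunks and runs work(start, count) for
    // each chunk in parallel. Chunks never overlap, so no locking is needed as
    // long as 'work' only touches its own slice of the data.
    public static void Run(int total, int chunkSize, Action<int, int> work) {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException("chunkSize");

        int chunks = (total + chunkSize - 1) / chunkSize;  // round up
        Parallel.For(0, chunks, c => {
            int start = c * chunkSize;
            int count = Math.Min(chunkSize, total - start);
            work(start, count);
        });
    }
}
```

Each chunk maps to a fixed range of the output vertex array (four vertices per particle), so threads can write their results directly without coordinating with each other.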



#1 by Markus Ewald on September 14, 2009 - 10:02 am
This very much reminds me of the thoughts I went through when I designed my own particle system.
I didn’t have the problem of SpriteBatch’s overhead because I had previously written my own SpriteBatch-alike class (which efficiently handles mass-data and supports point sprites) – so I ended up taking the CPU-based multi-threaded approach first.
Letting the GPU update the particles sounds interesting, but how do you eliminate dead particles? I can’t think of a way that would avoid reading particles back into system memory, either to shift alive particles over dead ones or to locate particles marked as dead when you want to add new ones – unless all particles had the same lifetime, in which case you could blindly overwrite the texture like a ring buffer (but with quite some limitations on how the particle system can be used).
#2 by Kael on September 14, 2009 - 10:42 am
That’s definitely the biggest challenge, I think – the only reasonable solution I can think of that doesn’t involve doing a readback every frame would be something like this:
Your particle system uses multiple textures as a backing store. You start out by filling one up, and once it’s full, you move on to the next one. Periodically (say, every N frames, or whenever you have a certain number of particles), you pull down one or all of the textures, scan over them, and remove all the dead particles, reordering everything. You could also do this on one texture at a time, in a sort of round-robin fashion.
The GPU isn’t going to pay any additional cost for dead particles, so leaving them in the textures for a few frames won’t kill you, and you only need to reclaim the space used by a dead particle when you’re trying to spawn a new one.
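The periodic clean-up pass described above amounts to an in-place compaction of whatever was read back. A minimal sketch follows, assuming the texture contents have already been copied into an array. `Compaction.CompactInPlace` and the `isAlive` predicate are hypothetical names for illustration.

```csharp
using System;

public static class Compaction {
    // Moves live entries to the front of the array, preserving their order.
    // Only the first 'count' entries are considered.
    // Returns the number of live entries; slots past that point are free for reuse.
    public static int CompactInPlace<T>(T[] items, int count, Predicate<T> isAlive) {
        int write = 0;
        for (int read = 0; read < count; read++) {
            if (isAlive(items[read])) {
                if (write != read)
                    items[write] = items[read];
                write++;
            }
        }
        return write;
    }
}
```

After compaction, the returned count marks where new particles can be appended before the array is uploaded again.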