One of the most common methods for rendering vast quantities of stuff (grass, rocks, particles, etc.) is instanced rendering. Essentially, instanced rendering lets us instruct the GPU to draw the same vertex data (mesh) over and over, each time with a different transform applied. The reason this method of rendering is so efficient boils down to two simple principles which will govern our design decisions: the GPU is very good at parallel tasks (rendering the same thing many times can be an independent operation), but transferring data from the CPU to the GPU is slow. In short, if the GPU already knows where to draw our meshes it will be fast; if it doesn't and needs the CPU to tell it, the operation will be much slower.
Let’s tackle the first part of this issue: how do we render instanced meshes in Unity? There are a few ways as far as I can tell; however, from experimentation I found the DrawMeshInstancedIndirect method to be the most performant. Unfortunately this method can seem somewhat difficult to work with, especially if, like me, you haven’t had a lot of experience with ComputeBuffers. Luckily, once you understand what this function is trying to achieve, it gets a lot simpler. If we take a look at the documentation for DrawMeshInstancedIndirect we can see that there are a bunch of parameters: mesh, submeshIndex, material, bounds, bufferWithArgs, argsOffset, properties, castShadows, receiveShadows, layer, camera and lightProbeUsage. Let’s look at some of these in order.
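To give a sense of where all of these end up, here is a minimal sketch of the call itself; InstancedDrawer, mesh, material, bounds and argsBuffer are placeholder names for things we will build up over the rest of this section, and note that the draw has to be issued every frame.

```csharp
using UnityEngine;

public class InstancedDrawer : MonoBehaviour
{
    public Mesh mesh;          // the mesh to instance
    public Material material;  // a material whose shader supports indirect instancing

    Bounds bounds;             // covered below
    ComputeBuffer argsBuffer;  // covered below

    void Update()
    {
        // Submit the instanced draw for this frame.
        Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);
    }

    void OnDestroy()
    {
        // ComputeBuffers are unmanaged GPU resources and must be released manually.
        argsBuffer?.Release();
    }
}
```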
mesh is straightforward: we’re rendering a mesh, so we need that to... render it.
submeshIndex is maybe less obvious, but since in Unity a Mesh instance can contain multiple graphical meshes called “sub meshes,” it stands to reason that we may need to specify which of those we actually want to render. In actuality this parameter seems somewhat redundant, since the submesh information will appear again later on (in the args buffer); the docs explain that it is only needed for meshes with different topologies. All I can really say is that I still supplied the submesh index I wanted to render here and it seems to work alright.
material is also pretty obvious if you’re used to Unity, since everything you render needs a material (which is basically a way to expose variables of the underlying shader).
bounds probably doesn’t make sense at first glance, and the docs don’t say much about it; as far as I can tell it’s used for culling, with Unity treating the entire instanced draw as a single object occupying this volume. It’s just a big rectangular prism that is large enough to fit everything you want to render inside. I will say this tripped me up a lot in testing, since I would forget to calculate the proper bounding volume for my instances and then they would fail to render (presumably because they were being culled).
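For what it’s worth, a quick way to build that volume is to encapsulate every instance position and pad the result; positions and padding below are placeholder names for your own data, not anything from Unity’s API:

```csharp
// Sketch: grow a Bounds until it contains every instance position,
// then pad it so the meshes' own extents don't poke outside the volume.
Bounds ComputeInstanceBounds(Vector3[] positions, float padding)
{
    var b = new Bounds(positions[0], Vector3.zero);
    for (int i = 1; i < positions.Length; i++)
        b.Encapsulate(positions[i]);
    b.Expand(padding);
    return b;
}
```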
bufferWithArgs is the real meat of this method and the least intuitive. The most important thing to note is that the type of this argument is a ComputeBuffer. ComputeBuffers are essentially used to communicate between the CPU and the GPU, so they are pretty particular. In the documentation we can see that a ComputeBuffer takes 2-3 constructor arguments: a count, a stride, and an optional type. The first confusing thing is that the count claims to be the number of elements in the buffer, but in the DrawMeshInstancedIndirect sample we see a buffer holding 5 values initialized with this count set to 1. The reason for this is the stride, which is just the size of the data chunks we are putting into this buffer. In the sample the stride is set to 5 * sizeof(uint), so we are saying our buffer has 1 element with a size of 5 unsigned integers. It might seem like overkill if you’re not used to GPU programming, but the GPU will move through the buffer in the exact increment you specify, so getting the exact byte length of your data is critical. In my testing I found that (in Unity 2019.2 at least) you could also initialize this sample buffer with a count of 5 and a stride of sizeof(uint) and it still works. There might be a performance difference somewhere down the line, but in rendering the hundreds of thousands of meshes in this demo I didn’t find a practical one, and as I started rendering multiple submeshes and LODs I found it easier to keep in my head to just make a constant DATA_STRIDE variable and change the size of the buffer.
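As a concrete example, both of these describe the same 20 bytes for a single submesh; ComputeBufferType.IndirectArguments is the optional third argument that the official sample passes:

```csharp
// As in the documentation sample: 1 element whose stride is 5 unsigned integers.
var argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);

// Equivalent in practice (same total size): 5 elements of 1 uint each.
// var argsBuffer = new ComputeBuffer(5, sizeof(uint), ComputeBufferType.IndirectArguments);
```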
theActualArgsInTheBufferWithArgs gets a special section (even though it’s not really a parameter), since they are also super important. You can find the specifics of what to put into this buffer in the documentation; essentially you just need to store information about the internal layout of the vertex and index buffers of the Mesh you want to draw. The only real catches are that you need to do this for every submesh, and that the buffer really does want a trailing zero for each one (the docs list this fifth value as the start instance location), so in practice it looks something like this:
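(A sketch: mesh, instanceCount and argsBuffer are placeholder names for your own mesh, instance count and args ComputeBuffer.)

```csharp
// Five uints per submesh: index count, instance count, start index,
// base vertex, and the trailing zero (the docs call it the start instance location).
// argsBuffer must have room for mesh.subMeshCount * 5 uints.
uint[] args = new uint[mesh.subMeshCount * 5];
for (int i = 0; i < mesh.subMeshCount; i++)
{
    args[i * 5 + 0] = mesh.GetIndexCount(i);  // index count per instance
    args[i * 5 + 1] = (uint)instanceCount;    // how many instances to draw
    args[i * 5 + 2] = mesh.GetIndexStart(i);  // start index location
    args[i * 5 + 3] = mesh.GetBaseVertex(i);  // base vertex location
    args[i * 5 + 4] = 0;                      // start instance location
}
argsBuffer.SetData(args);
```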