Procedural Terrain in Unity

By: Brian - UpRoom Games founder

In-app performance: ~180-200fps at a render resolution of 1920x1080.

**Tree and some vegetation assets from TriForge Assets, Fantasy Forest Environment

 Introduction

Over the last year, most of my time has been spent working on the release of A Token War (which you should definitely check out and wishlist on Steam). Like most indie devs, I thought the majority of my time would be spent developing rather than on the approximately one million other things that go into actually finishing a game. When I find myself getting a little itchy for some development I like to work on experimental projects that may or may not make their way into games. One of the projects I keep coming back to is terrain generation. I’ve played massive, open world games like Skyrim as much as the next person, probably more, and while I will probably start a new playthrough sometime in the next few months, eventually it’s impossible not to become so familiar with the terrain and layout of the game that I wish there was some way to enjoy the same experience while keeping the initial sense of exploration. Plenty of better qualified people have taken a stab at procedural terrain and environments for games, with a lot of very impressive success. However, a lot of the available information is difficult to parse and sometimes feels deliberately confusing, at least to me. I decided to make this post to explain what I’ve learned about terrain generation as it pertains to my preferred engine (Unity), although hopefully the principles will be helpful elsewhere. I realized at some point that my first post should probably be about our upcoming game release, but at the end of the day what I love about game development is how many new challenges there are, so I hope that sharing my experience with this one in particular will help other developers out there who might be struggling with a similar problem, or at least be something cool to read.

***Disclaimer: The rest of this post is pretty technical. If that seems like a thing you do not wish to wade into, this is definitely the time to bail out.

 Engine + Platform

For this project I used the Unity game engine, version 2019.2 built for DX11 using the (at the time of writing) experimental High Definition Render Pipeline version 6.9.1. The computer on which the demo video was recorded is running Windows 10 with a GTX 1070 graphics card.

 The Problem

The first thing I do when starting a new project is make sure I actually understand the problem I’m trying to solve. In this case, we are not trying to simulate an entire world accurately, but rather create an immersive world for our player(s) to continually be able to explore. The distinction being that rather than making one very very big world, we can get away with making a smaller world as long as we have the ability to continually regenerate that world with different parameters (e.g. different topography, different types of environment, different tree locations, etc). While this is an easier problem than trying to simulate an entire planet, players like to move around a lot, so our smaller world still needs to be quite big. The one in the video above is about 256km^2 (technically 16,384m x 16,384m), which while not gigantic, is still a pretty decent size. The second part of the problem is that we want our world to be immersive, which (in a wild over-generalization) means we want it to be densely populated with things like vegetation and other props, and we also want it to be interactable (i.e. we want the player to be able to run into stuff and interact with it in some way, not just have a pretty looking flyover of a bunch of polygons). Of course once we have all of that we also need our world to be performant, since the terrain, trees, vegetation, props and other things are only a small portion of the computational work that a game needs to do. In summary, we need to make a terrain generation system that is:

1.) Procedural and modifiable

2.) Very large

3.) Interactable

4.) Full of stuff

5.) Very fast

In this project, the overall problem is loosely broken down into two main components, the World Generator and the World Streamer. As the name implies, the World Generator is responsible for generating whatever data we need to represent our terrain. The World Streamer is responsible for loading in the parts of the generated data that are relevant to our player at any given time, since the entire world is going to be too large to reasonably keep in runtime memory.


 World Generator

Since we are not attempting to make an infinite terrain, we can run some generation the first time a player loads into a world. In the demo video there are only two sets of data that are generated ahead of time: height maps and placement textures. The height maps store vertical displacement data for the terrain, whereas the placement textures encode data about where grass, trees, pebbles and any other objects that need to be placed on our terrain go. The height maps each represent a 2048m x 2048m area of terrain with a resolution of 1024x1024, for a total of 64 individual tiles to cover the 16,384m x 16,384m terrain, and are stored as raw byte files. The placement maps are more dense, covering 1024m x 1024m with a resolution of 1024x1024, and are stored as textures, with each color channel (r, g, b, a) representing placement information for a different type of object. For example, if the data at a single pixel in a placement texture is [1, 0, 200, 0] then the corresponding world space location would get one object from the red channel pool of type 1 and overlap that with one object of type 200 from the blue pool. This lets us place things like grass, bushes and pebbles in the same locations, but not place grass where there are trees, with the limitation of only supporting 256 different types of each object. This is not necessarily the optimal solution but it works well enough and is what the demo video uses. With only empty height and placement maps, our terrain currently looks like this:

Terrain-Flat.png

Not particularly interesting. Obviously we need to actually generate some height data to give our terrain some character. There are a lot of resources about terrain generation that cover this step, so I’ll just give a brief summary of what I did.


The base height data is generated by combining various fractal noise functions. These noise functions have the property of being able to return a deterministic value given any coordinate pair (or triplet depending on the dimensionality of the noise). I used Fast Noise Unity for this generation since I find it very simple and effective but there are many other ways to generate fractal noise. This gives us a nice procedural heightmap which we can regenerate with different values as much as we would like. However, it has a very smooth unnatural look to it.

Terrain-Smooth.png
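
For reference, here is a minimal sketch of the octave-summing idea, using Unity’s built-in Mathf.PerlinNoise as a stand-in for Fast Noise Unity (the parameter names are just for illustration):

// Sums several octaves of 2D noise; each octave adds finer detail with a smaller amplitude.
// The seed offset keeps the result deterministic for a given world seed.
float FractalNoise(float x, float y, int octaves, float frequency, float persistence, Vector2 seedOffset)
{
    float total = 0f, amplitude = 1f, maxValue = 0f;

    for (int i = 0; i < octaves; i++)
    {
        total += Mathf.PerlinNoise(seedOffset.x + x * frequency,
                                   seedOffset.y + y * frequency) * amplitude;
        maxValue += amplitude;

        amplitude *= persistence; // each octave contributes less...
        frequency *= 2f;          // ...but at a finer scale
    }

    return total / maxValue; // normalized back to roughly 0..1
}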

To help with this I added a stamper, which uses a collection of pre-created height maps and combines the noise height with the “stamp” height at random points on the terrain (based on whatever global random seed is used). This gives a better look since it breaks up the repetitive nature of noise based terrain, but it still looks quite smooth and unnatural in places. Finally, I applied a hydraulic erosion simulation to each height map, using Compute Shaders to speed up the crunching. For reference, this is a fantastic tutorial by Sebastian Lague, which is what I adapted my hydraulic erosion implementation from, and this is an excellent tutorial series on Compute Shaders by Claire Blackshaw. As for the hydraulic erosion itself, we just simulate a few million drops of rain hitting our terrain at random points and have those drops “carve out” some height, depositing portions of it somewhere downhill based on gradients and a few other parameters. After running the generator with the noise parameters, stamping and hydraulic erosion we get something that looks like this.

Terrain-Hydrolic-Errosion.png

Much better! The placement textures for the demo video were generated just using fractal noise functions with cutoff values, apart from the grass, which is placed “everywhere” and uses local rejection parameters (such as ground normal and overall terrain height) to determine where it shouldn’t be placed. We’ll go over what exactly the placement textures are for soon, but since they are part of the World Generator they are worth mentioning here.
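
To tie this back to the channel encoding described above (grass ids in the red channel, tree ids in the blue channel, and so on), here is a simplified sketch of baking one channel of a placement tile from noise with a cutoff. SampleFractalNoise and the single hard-coded tree id are stand-ins for the real generator:

// Bakes a single placement tile: grass (id 1) goes in the red channel "everywhere",
// trees go in the blue channel wherever the noise value clears the cutoff.
Texture2D BakePlacementTile(int resolution, float cutoff, byte treeTypeId)
{
    var tex = new Texture2D(resolution, resolution, TextureFormat.RGBA32, false);
    var pixels = new Color32[resolution * resolution];

    for (int y = 0; y < resolution; y++)
    {
        for (int x = 0; x < resolution; x++)
        {
            float n = SampleFractalNoise(x, y); // deterministic, seed-driven noise
            byte tree = n > cutoff ? treeTypeId : (byte)0;

            // grass rejection (slope, altitude, etc.) happens later, at placement time
            pixels[y * resolution + x] = new Color32(1, 0, tree, 0);
        }
    }

    tex.SetPixels32(pixels);
    tex.Apply();
    return tex;
}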

 World Streamer

The idea of world streaming is pretty well understood, by which I mean there seem to be a bunch of people who claim to understand it quite well, and from the results they present I can’t really argue with that. I am not one of those people, but I took a stab at what I thought made up a good world streaming system:

  1. The ability to load a subset of data from some place (in this case a hard drive) localized around a spatial point (for instance a player, or the location of an in game cinematic)

  2. The ability to unload previously loaded data

  3. The ability to “understand” that things closer to the load point are more important than things that are further away

  4. The ability to modify data without interrupting the flow of gameplay (little loading freezes etc)

The underlying data structure that makes the World Streamer work is what people in the know seem to call a “chunk.” When programmers say “chunk” they usually mean a specific memory range that contains some arbitrary amount of data. In this case a chunk feels much more visceral, since it refers to an actual piece of the terrain containing whatever data is deemed necessary. I found a lot of systems which described breaking up a world into evenly sized square pieces of terrain, containing data for height maps, vegetation and texture data as well as prop information. The streaming system then loads whatever number of these chunks it deems necessary around the player to provide an immersive experience. Usually, this means loading a large square of these chunks in some sort of radius around the player, but more elaborate systems can take line of sight information into account and load or unload chunks as they would become visible to the player based on level geometry. In the case of procedural terrain, line of sight information can’t be easily computed since the level geometry can change with each regeneration. While you could compute this type of occlusion information in an offline process after generating your world data, it is more difficult still to determine if unloading a subset of temporarily invisible chunks will be a net benefit to performance without the ability to manually test every configuration. For this reason I think the simpler radius based chunk loading system is probably more appropriate for procedural terrain. Also dealing with squares is easy, so let’s just deal with squares. For this demo the height and placement textures are loaded in chunks of 9 each (to form a square around the player), which means that ~14% of our height data is loaded at any time and ~4% of our placement data.
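
Before moving on, here is a rough sketch of what the square, radius-based loading boils down to. Chunks are addressed by integer grid coordinates, and LoadChunk/UnloadChunk are placeholders for whatever the streamer actually does (kick off a disk read on a worker thread, return data to a pool, and so on):

readonly HashSet<Vector2Int> loadedChunks = new HashSet<Vector2Int>();

void UpdateLoadedChunks(Vector3 playerPosition, float chunkSize, int radiusInChunks)
{
    Vector2Int center = new Vector2Int(
        Mathf.FloorToInt(playerPosition.x / chunkSize),
        Mathf.FloorToInt(playerPosition.z / chunkSize));

    // everything inside the square around the player should be resident
    var wanted = new HashSet<Vector2Int>();
    for (int x = -radiusInChunks; x <= radiusInChunks; x++)
        for (int z = -radiusInChunks; z <= radiusInChunks; z++)
            wanted.Add(new Vector2Int(center.x + x, center.y + z));

    // drop chunks that fell out of range
    loadedChunks.RemoveWhere(chunk =>
    {
        if (wanted.Contains(chunk)) return false;
        UnloadChunk(chunk);
        return true;
    });

    // request anything that just came into range
    foreach (var chunk in wanted)
        if (loadedChunks.Add(chunk))
            LoadChunk(chunk);
}

With the 2048m height chunks and radiusInChunks = 1, this produces exactly the 3x3 block of 9 tiles mentioned above.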

While the underlying data to generate our terrain is loaded in uniform chunks, the meshes used to actually draw the terrain wound up not being uniform. Apparently exponents make numbers get big very fast, which means that if we have a mesh chunk size of 64 (in world units, so let’s just say a 64x64 meter square) and we want to be able to see about 3072m, we would need to render 9216 individual mesh tiles around our player. While a number in the thousands isn’t necessarily large for a computer, remember that just to draw the terrain shape (without vegetation, props, colliders or anything else) at a density of one vertex per meter, we have 4096 vertices to process for each tile. If we maintain the same vertex density we would be processing 37,748,736 vertices and twice as many triangles every time we trigger a full load. Apart from this, maintaining so many independent objects can be a struggle for game engines, and a hassle for developers trying to parse what is actually loaded in a scene. Luckily, keeping a constant mesh density is overkill, since our distant terrain can be much less detailed than the stuff closer to the player, so we can reduce the triangle count quite a lot. We can also reduce our mesh count by using non-uniform mesh tiles, that is to say we can use progressively less dense and larger tiles to render distant terrain. However, when meshes of different densities sit next to each other, the seam between the two can get pretty nasty, since the two meshes are sampling height information at different levels of detail. In computer graphics this is often referred to as a stitching problem, as you are trying to “stitch” meshes together so they have no cracks. There are a couple of ways to handle this in our case: you can either model your tiles with a higher density edge when necessary, or have custom logic to reduce the effective height density of higher res meshes at the seams. I think both can work, but it really depends on your implementation and engine specifics as to which method will be faster (or if a hybrid approach works better).


The demo video uses the hybrid stitching approach and 3 different tile sizes: a high density tile to load around the player, a tile five times as large (and 1/4 as dense) to load further away, and a tile three times as large as that (and 1/2 as dense) to load for distant terrain. The high-res tiles are modelled to avoid the stitching problem by having a higher density edge when necessary, and the mid level of detail tiles reduce their effective height density at the edge where they meet the lowest level of detail mesh. The balance between custom edge modelling (making it important which specific tiles are adjacent to another) and density reduction (tiles can go anywhere but have more per-vertex work) seems pretty finicky, and I ended up on the hybrid approach I described just through brute force, trying all the combinations I could think of. Unfortunately that means I have no real reason why the hybrid approach was faster, but in my implementation it was, so I figured I’d lay my cards on the table here. I imagine that given different implementations any of these methods could be acceptable.

Wireframe view of loaded mesh tiles at 3 levels of detail.

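To make the density-reduction half of that hybrid a little more concrete, here is a minimal sketch of snapping a dense tile’s edge vertices to a coarser neighbour’s sample spacing; SampleHeight stands in for whatever height lookup the tiles already use:

// Height for a vertex on the edge shared with a coarser neighbour: sample the height data
// at the neighbour's spacing (coarseStep metres) and interpolate between those samples,
// so both tiles produce identical heights along the seam and no cracks appear.
float EdgeHeight(float worldX, float worldZ, float coarseStep)
{
    // snap to the coarse grid along the edge (here the edge runs along the x axis)
    float x0 = Mathf.Floor(worldX / coarseStep) * coarseStep;
    float x1 = x0 + coarseStep;
    float t = (worldX - x0) / coarseStep;

    return Mathf.Lerp(SampleHeight(x0, worldZ), SampleHeight(x1, worldZ), t);
}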

For the sake of efficiency, the mesh tiles each parse their respective height data in separate worker threads to convert it into new vertex data. While there is a Job System in Unity, I found that just using traditional Thread Pools worked better for this. Each tile notifies the World Streamer once finished and is added to a queue for composition. The composition queue is handled in a coroutine so only a couple of mesh tiles will be composed (have their new vertex and normal vector data applied to the actual Mesh class instance) per frame. Since each mesh tile can query the world generator for a height value independently, we not only gain a lot by using multithreading, but our mesh tiles don’t even need to align with our height map chunk size, which gives a lot of freedom in deciding on view distance.
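
A simplified sketch of that hand-off is below (it uses System.Threading, System.Collections and System.Collections.Concurrent). MeshTile and BuildVertices are placeholders for the real tile class and its vertex generation:

readonly ConcurrentQueue<(MeshTile tile, Vector3[] vertices, Vector3[] normals)> compositionQueue =
    new ConcurrentQueue<(MeshTile, Vector3[], Vector3[])>();

void RequestTileRebuild(MeshTile tile)
{
    // the heavy work (querying height data, building vertex arrays) happens off the main thread
    ThreadPool.QueueUserWorkItem(_ =>
    {
        var (vertices, normals) = BuildVertices(tile);
        compositionQueue.Enqueue((tile, vertices, normals));
    });
}

IEnumerator CompositionLoop()
{
    while (true)
    {
        // only compose a couple of tiles per frame so a full reload never causes a visible spike
        for (int i = 0; i < 2 && compositionQueue.TryDequeue(out var item); i++)
        {
            item.tile.mesh.vertices = item.vertices;
            item.tile.mesh.normals = item.normals;
        }
        yield return null;
    }
}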

An issue I ran into with this implementation, however, was how obvious it was when the terrain switched from one density of mesh to a different one. This is not an uncommon problem in traditional Level of Detail (LOD) systems, and to solve it I took inspiration from the built-in solution Unity offers, which is crossfading LODs. The basic idea is to fade one model out while you fade another, higher or lower resolution model in. Anyone who has worked with computer graphics or rendering knows that fading isn’t quite as simple as it seems, since rendering transparent objects can be a hassle. A classic solution is to use an effect known as dithering to generate a pattern of pixels which you can “cut out” from your model. There are a few different methods of dithering, but a simple implementation (and one provided in the Shadergraph) lets you cut out progressively more or fewer pixels based on a single parameter: basically, to make something more “transparent” you make a number go up, and to make it more “opaque” you make that number go down, without having to do any real transparent rendering. Applying this to our terrain, we can fade in higher level of detail meshes and, once they are loaded, fade out the lower detail ones, which turns a dramatic “pop” into a much smoother fade. Watch the demo carefully and see if you can spot the fade points for the terrain. Here is an example of using a dither node in the Unity Shadergraph to achieve a fading effect based on an “Opacity” variable.

Shadergraph.png
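
On the C# side, driving that fade is just a matter of animating the value over a short window. A minimal sketch, assuming the Shader Graph exposes the “Opacity” input as a material property named "_Opacity":

// Fades a tile's renderer in over `duration` seconds by ramping the dither opacity from 0 to 1.
IEnumerator FadeIn(MeshRenderer renderer, float duration)
{
    var block = new MaterialPropertyBlock();
    for (float t = 0f; t < duration; t += Time.deltaTime)
    {
        block.SetFloat("_Opacity", t / duration); // 0 = fully dithered out, 1 = fully visible
        renderer.SetPropertyBlock(block);
        yield return null;
    }
    block.SetFloat("_Opacity", 1f);
    renderer.SetPropertyBlock(block);
}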

 Instanced Rendering

Our terrain is currently completely empty since we aren’t using our placement textures, which doesn’t exactly fulfill the “has lots of stuff” portion of our problem statement. So how do we get from the empty terrain above to something like this?

Full-Terrain.png

One of the most common methods for rendering vast quantities of stuff (grass, rocks, particles, etc) is called instanced rendering. Essentially, instanced rendering allows us to instruct the GPU to draw the same vertex data (mesh) with a different transform applied, over and over. The reason this method of rendering is so efficient boils down to two simple principles which will govern our design decisions: the GPU is very good at parallel tasks (rendering the same thing many times can be an independent operation), but transferring data from the CPU to the GPU is slow. Essentially, if the GPU knows where to draw our meshes it will be fast; if it doesn’t and needs the CPU to tell it, the operation will be much slower.

Let’s tackle the first part of the issue: how do we render instanced meshes in Unity? There are a few ways as far as I can tell, however from experimentation I found utilizing the DrawMeshInstancedIndirect method to be the most performant. Unfortunately this method can seem somewhat difficult to work with, especially if, like me, you haven’t had a lot of experience with ComputeBuffers. Luckily, however, once you understand what this function is trying to achieve it gets a lot simpler. If we take a look at the documentation for DrawMeshInstancedIndirect we can see that there are a bunch of parameters: mesh, submeshIndex, material, bounds, bufferWithArgs, argsOffset, properties, castShadows, receiveShadows, layer, camera and lightProbeUsage. Let’s look at some of these in order.

mesh is straightforward, we’re rendering a mesh so we need that to...render it.

submeshIndex is maybe less obvious, but since in Unity a Mesh instance can contain multiple graphical meshes called “sub meshes,” it stands to reason that we may need to specify which of those we actually want to render. In actuality this parameter seems unnecessary since it will appear again later on; the docs explain that it is only needed for meshes with different topologies. All I can really say is that I still supplied the submesh index I wanted to render here and it seems to work alright.

material is also pretty obvious if you’re used to Unity, since everything you render needs a material (which is basically a way to expose variables of the underlying shader)

bounds probably doesn’t make sense at first, and honestly I’m not sure exactly what it’s used for; my understanding is that Unity uses it to cull and sort the entire batch of instances as a whole. It is just a big rectangular prism that is large enough to fit everything you want to render inside. I will say this tripped me up a lot in testing, since I would forget to calculate the proper bounding volume for my instances and then they would fail to render.

bufferWithArgs is the real meat of this method and the least intuitive. The most important thing to note is that the type of this argument is a ComputeBuffer. ComputeBuffers are essentially used to communicate between the CPU and the GPU, so they are pretty particular. In the documentation, we can see that a ComputeBuffer takes 2-3 constructor arguments: a count, a stride, and an optional type. The first confusing thing is that the count claims to be the number of elements in the buffer, but in the DrawMeshInstancedIndirect sample we see a buffer with 5 elements initialized with this value set to 1. The reason for this is the stride, which is just the size of the data chunks we are putting into this buffer. In the example this is set to 5 * sizeof(uint), so we are saying our buffer has 1 element with a size of 5 unsigned integers. It might seem like overkill if you’re not used to GPU programming, but the GPU will move through the buffer in the exact increment you specify, so getting the exact byte length of your data is critical. In my testing I found that (in Unity 2019.2 at least) you could initialize this sample buffer with a count of 5 and a stride of sizeof(uint) and it still works. There might be a performance difference somewhere down the line, but in rendering the hundreds of thousands of meshes in this demo I didn’t find a practical one, and as I started rendering multiple submeshes and LODs I found it easier to just make a constant DATA_STRIDE variable and change the size of the buffer.

theActualArgsInTheBufferWithArgs gets a special section (even though it’s not really a parameter), since it is also super important. You can find the specifics of what to put into this buffer in the documentation; essentially you just need to store information about the internal layout of the vertex and index buffers in the Mesh you want to draw. The only real catch here is that you need to do this for every submesh, and that the buffer really does seem to want a trailing zero (which appears to correspond to the start instance location, so leaving it at zero is fine). In practice it looks something like this:

for (int si = 0; si < instancedMeshes[i].meshes[j].subMeshCount; si++)
{
    args[argIndex] = (uint)instancedMeshes[i].meshes[j].GetIndexCount(si); // index count for this submesh
    args[argIndex + 1] = (uint)instances[i].Length; // the number of instances you are actually rendering
    args[argIndex + 2] = (uint)instancedMeshes[i].meshes[j].GetIndexStart(si); // start index location
    args[argIndex + 3] = (uint)instancedMeshes[i].meshes[j].GetBaseVertex(si); // base vertex location
    args[argIndex + 4] = 0; // the mysterious trailing zero (start instance location)
    argIndex += 5;
}

argsOffset is a sneaky one, since the sample in the documentation doesn’t use it, but in reading the method parameter descriptions it seems super important, which it is. Basically, if we are drawing more than one Mesh (not to be confused with an instance of a mesh, which we are drawing a lot of, but the number of different models we want to render) we can avoid rebuilding the bufferWithArgs for every different mesh by adding the data for all the different meshes to it initially and indicating to the GPU where in the bufferWithArgs we want to look for the current mesh data we are rendering. For example, to render the second of two different meshes in the docs sample, we would have to supply argsOffset = 5 * sizeof(uint).

properties, castShadows, receiveShadows, layer, camera and lightProbeUsage are all straightforward: if you don’t know what a MaterialPropertyBlock is there is no problem setting it to null, and the shadow modes and layer seem to work as described. I found some past examples where shadows cast and received by instanced meshes seemed to be incorrect, however in my testing as of Unity 2019.2 I found no issues, so I can’t speak to this. Just keep in mind that rendering shadows is obviously less performant than not doing that. Also important to note is that if you don’t explicitly use the “current camera” you won’t necessarily see your instanced meshes in the scene view in Unity, only in the game view.

The next issue we encounter after figuring out how to use DrawMeshInstancedIndirect is how to supply the position data to tell the GPU where to draw our many many instances. In the docs we see another ComputeBuffer is used for this purpose, although it’s probably a bit confusing since the DrawMeshInstancedIndirect call doesn’t seem to reference it in any way; it just gets bound to the shader we are using to do the rendering. Effectively, all we are doing is telling the GPU to reuse the same vertex and index buffers a certain number of times, and relying on the shader to transform the vertices to the correct spots. The ComputeBuffer used to store the position data follows the same rules as above; for example, if you are using a Vector4 to store position data, you need to tell the buffer that each entry has a byte size of 4 * sizeof(float), which is 16. In the shader you are using, you just need to redefine the unity_ObjectToWorld and unity_WorldToObject matrices to utilize the position data in the ComputeBuffer, and the docs show this well.
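
Putting those pieces together, here is a stripped-down sketch for a single mesh with one submesh, with per-instance transforms packed as three float4 rows (the same Data layout used further down in this post). The "dataBuffer" property name is just what this example binds the transform buffer to:

ComputeBuffer argsBuffer;
ComputeBuffer transformBuffer;

void SetupInstancing(Mesh mesh, Material material, Data[] instanceData)
{
    // indirect args: index count, instance count, start index, base vertex, start instance
    uint[] args =
    {
        mesh.GetIndexCount(0),
        (uint)instanceData.Length,
        mesh.GetIndexStart(0),
        mesh.GetBaseVertex(0),
        0
    };
    argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
    argsBuffer.SetData(args);

    // one Data entry (3 x float4 = 48 bytes) per instance
    transformBuffer = new ComputeBuffer(instanceData.Length, sizeof(float) * 4 * 3);
    transformBuffer.SetData(instanceData);

    // the instancing shader reads its transforms from this buffer
    material.SetBuffer("dataBuffer", transformBuffer);
}

void RenderInstances(Mesh mesh, Material material, Bounds worldBounds)
{
    // worldBounds must enclose every instance or the whole draw can be culled away
    Graphics.DrawMeshInstancedIndirect(mesh, 0, material, worldBounds, argsBuffer);
}

RenderInstances gets called every frame, and both buffers need to be released when you are done with them.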


However, in the demo video we are using the High Definition Render Pipeline. Unfortunately manually writing a shader for the HDRP is pretty squirrely, since there are a lot of render passes and things to think of. To make our lives easier it would be really nice if we could leverage the Shader Graph to create a shader that supports indirect instancing. In Unity 2019.2 we can accomplish this with a custom function node which defines our ComputeBuffer and redefines the matrices we are interested in; I found the discussion here to be very helpful, as well as the resources provided by AwesomeTechnologies here. In Unity 2019.2 making a custom function node is very simple, since you can just create the node from the context menu in ShaderGraph and point to the .hlsl file you want to run. Following the advice in the above resources I made a passthrough node for the position which also includes the required .hlsl file. I’m not really sure if the passthrough is necessary, but I figured it wouldn’t hurt and it helped me understand how the custom function node worked.

Passthrough.png

I made this whole thing a subgraph which is connected to the position input of every shader that needs to support instancing. I could never get an “undeclared identifier UNITY_MATRIX_M” error to go away, but as far as I can tell it never had any impact beyond breaking the preview of the node in the Shader Graph. I would have to investigate this further if this was ever intended for an actual game.

The InstancedIndirectProxyNode.hlsl file is very simple:

#include "Relative/FilePath/To/InstancedIndirectFunctions.hlsl"
void PassthroughPositionInstanced_float(float3 A, out float3 Out)
{
    Out = A;
}

And the InstancedIndirectFunctions.hlsl file looks something like this.

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    // per-instance transform data: the first three rows of a TRS matrix
    struct Data
    {
        float4 c1;
        float4 c2;
        float4 c3;
    };
    StructuredBuffer<Data> dataBuffer;
#endif

// called once per instance via #pragma instancing_options procedural:makeInstanced
void makeInstanced()
{
    #define unity_ObjectToWorld unity_ObjectToWorld
    #define unity_WorldToObject unity_WorldToObject
    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        float4 c1 = dataBuffer[unity_InstanceID].c1;
        float4 c2 = dataBuffer[unity_InstanceID].c2;
        float4 c3 = dataBuffer[unity_InstanceID].c3;

        // rebuild the object-to-world matrix column by column from the stored rows
        unity_ObjectToWorld._11_21_31_41 = float4(c1.x, c2.x, c3.x, 0);
        unity_ObjectToWorld._12_22_32_42 = float4(c1.y, c2.y, c3.y, 0);
        unity_ObjectToWorld._13_23_33_43 = float4(c1.z, c2.z, c3.z, 0);
        unity_ObjectToWorld._14_24_34_44 = float4(c1.w, c2.w, c3.w, 1);

        // approximate inverse (same trick as the docs sample): flip the translation
        // and invert the scale on the diagonal
        unity_WorldToObject = unity_ObjectToWorld;
        unity_WorldToObject._14_24_34 *= -1;
        unity_WorldToObject._11_22_33 = 1.0f / unity_WorldToObject._11_22_33;
    #endif
}

Which is very similar to the example in the documentation, with a small change: the Data struct is defined to hold all of our transform data (rotation, scale and translation). Since we have 3 times the amount of data as the docs example, we would instantiate this buffer with a stride of sizeof(float) * 4 * 3.

In order to actually have the makeInstanced function have any impact, there is one last somewhat irritating step I had to take. From the Unity docs on GPU instancing we can see that there is a nifty procedural switch we can use with the #pragma instancing_options directive, which is specifically meant for use with DrawMeshInstancedIndirect. Great! Unfortunately this directive shows up in every pass of our generated Shader Graph code, which means we need to copy the shader code (available by right clicking on the output node in the Shader Graph editor) and paste it into a new shader file. Then you just do a good old find and replace (there were usually about 7 instances):

find:

#pragma instancing_options renderinglayer

and replace with:

#pragma instancing_options renderinglayer procedural:makeInstanced

I’m pretty confident there is a better way to do this, possibly in an updated version of Unity, but I couldn’t find one in my searching so here we are. Now that we have a way to make a custom shader which is compatible with DrawMeshInstancedIndirect, we can finally get around to populating a ComputeBuffer with the data we need to draw it. We will have to use the same struct to hold transform data that we defined in our helper hlsl file (it isn’t “the same” per se, but it needs to be structured the same way with the same type of data), so when we want to render our instanced meshes we need to populate a ComputeBuffer with a bunch of these:

// C# mirror of the HLSL Data struct: 3 x Vector4 = 48 bytes per instance
struct Data
{
    Vector4 c1;
    Vector4 c2;
    Vector4 c3;
};

In the documentation, this buffer is populated on the CPU using the SetData function of the ComputeBuffer. Since this only happens once, the performance isn’t really an issue. However, if we remember that one of our core principles is that communicating between the CPU and the GPU is slow, populating a GPU buffer from the CPU seems like a bad idea if we want to do it more than once. In fact, if we could find a way to populate our transform buffer from the GPU, we could do all sorts of interesting things, such as streaming in new position data for instanced meshes dynamically without a massive hitch as we upload a giant buffer to the GPU. To accomplish this we can populate our transform buffer from a custom Compute Shader.

I took a lot of inspiration for the design of the placement system from this excellent GDC talk, titled “GPU-Based Run-Time Procedural Placement in Horizon: Zero Dawn,” although my implementation is obviously a very very watered down interpretation. The idea of GPU placement in combination with instanced rendering is very simple:

  • Generate some sort of input data (could be a texture, a rule set, or any other type of data)

  • Use the input data to populate a ComputeBuffer with transform data

    • This transform data can optionally be used as input to a Compute Shader for calculating Level of Detail and Frustum Culling

  • Use the output ComputeBuffer as the input for instanced rendering for each mesh

Here is a sample of what a placement Compute Shader for instanced rendering could look like:

#pragma kernel CSMain

// This struct needs to match the Data struct we use for DrawMeshInstancedIndirect
struct Data {
    float4 c1;
    float4 c2;
    float4 c3;
};

// The ComputeBuffer we construct our output with. An AppendStructuredBuffer is a type of ComputeBuffer
AppendStructuredBuffer<Data> output;

// Example input data. In the demo this is an array of foliage placement data which is generated during world creation and stored on disk as a texture
Texture2DArray<float4> inputs;

// Sample function, doesn't actually do anything meaningful here
float3 getPositionFromInputs(float3 coords)
{
    // use your input data to extract placement information
    // the id in CSMain is used to identify which "element" we are working on
    return float3(coords.x, coords.y, coords.z);
}

// The worst random number generator ever
float random(float input1, float input2)
{
    return 0.5;
}

// The main function to dispatch with our shader.
[numthreads(16, 16, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // dummy position generation based on whatever custom rules you want
    float3 fragmentWorldPosition = getPositionFromInputs(float3(id.x, id.y, id.z));

    // generate random numbers based on our thread id.
    float r = random(id.x * 0.01, id.z * 0.01);
    float s = random(id.z * 0.1, id.y * 0.1);

    // construct a rotation about the vertical axis from our "random" value
    float angle = r * 6.2831853; // 2 * pi
    float4x4 rotationMatrix = float4x4(
        cos(angle), 0, sin(angle), 0,
        0, 1, 0, 0,
        -sin(angle), 0, cos(angle), 0,
        0, 0, 0, 1
        );

    // construct a random uniform scale for this instance
    float4x4 scaleMatrix = {
        s, 0, 0, 0,
        0, s, 0, 0,
        0, 0, s, 0,
        0, 0, 0, 1
    };

    // translation matrix to move our mesh to the position calculated from our inputs
    float4x4 translationMatrix = {
        1, 0, 0, fragmentWorldPosition.x,
        0, 1, 0, fragmentWorldPosition.y,
        0, 0, 1, fragmentWorldPosition.z,
        0, 0, 0, 1
    };

    // composite matrix
    float4x4 trs = mul(mul(translationMatrix, (rotationMatrix)), scaleMatrix);

    // extract data
    Data d;
    d.c1 = float4(trs[0][0], trs[0][1], trs[0][2], trs[0][3]);
    d.c2 = float4(trs[1][0], trs[1][1], trs[1][2], trs[1][3]);
    d.c3 = float4(trs[2][0], trs[2][1], trs[2][2], trs[2][3]);

    // Our AppendBuffer lets us build up a ComputeBuffer without knowing an exact size.
    // Useful if you want to cull certain objects based on visibility, player interaction, or local rejection rules
    // such as a normal vector threshold or whatever else you want.
    output.Append(d);
}

***It’s important to note that the ComputeBuffer we are using is an AppendBuffer (instantiated in Unity by supplying the ComputeBufferType.Append type argument to the ComputeBuffer constructor). An AppendBuffer is to a StructuredBuffer (the default ComputeBuffer type) what a List is to an Array: basically a type of buffer we can use without knowing its exact size ahead of time. Much like using a List over an Array there are some performance considerations, however the flexibility more than makes up for it in my opinion.
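
One practical wrinkle with append buffers: the instance count only exists on the GPU, but DrawMeshInstancedIndirect needs it in the args buffer. ComputeBuffer.CopyCount can copy the append counter straight into the args buffer without ever reading it back to the CPU, which is one way to handle it. A rough sketch, where maxInstances, placementShader, placementKernel and argsBuffer are whatever your implementation uses:

// transform data is 3 x float4 = 48 bytes per instance; maxInstances is just a safe upper bound
var outputBuffer = new ComputeBuffer(maxInstances, sizeof(float) * 4 * 3, ComputeBufferType.Append);

// reset the internal counter before dispatching a new fill
outputBuffer.SetCounterValue(0);
placementShader.SetBuffer(placementKernel, "output", outputBuffer);
placementShader.Dispatch(placementKernel, 1024 / 16, 1024 / 16, 9);

// copy the append counter into the instance-count slot of the args buffer
// (the second uint in the five-uint layout shown earlier)
ComputeBuffer.CopyCount(outputBuffer, argsBuffer, 1 * sizeof(uint));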

The inputs in this example come from the demo video implementation. The placement data for foliage is stored in textures with a resolution of 1024x1024 and can be loaded by the World Streamer in blocks of 9, which are flattened into a Texture2DArray. The texture data uses color to encode specific object ids to enable a large number of unique foliage elements. In the case of the demo, the output from this ComputeShader is used as the input to a second ComputeShader which performs frustum culling and LOD swapping. Normally frustum culling is handled by Unity just fine, but since we are directly invoking a draw call from the GPU we need to do it ourselves. As for LOD swapping, we can actually just build up multiple output buffers (one for each level of detail we want) and append our placement data to the appropriate output buffer. Since the mesh data is completely separate from the transform data we can make multiple DrawMeshInstancedIndirect calls for each level of detail mesh we have and as long as our buffers are bound correctly we can see the proper levels of detail swapping in and out in real-time. The LOD Compute Shader would look something like this:

#pragma kernel CSMain

// same layout as the Data struct used by the placement shader
struct Data {
    float4 c1;
    float4 c2;
    float4 c3;
};

// can be set like uniforms
float3 LOD_TRANSITION_DISTANCES;
float4x4 _UNITY_MATRIX_MVP;
float3 eye;

AppendStructuredBuffer<Data> outputLOD0;
AppendStructuredBuffer<Data> outputLOD1;
AppendStructuredBuffer<Data> outputLOD2;

StructuredBuffer<Data> input;

[numthreads(128, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    Data d = input[id.x];

    // the translation lives in the last column of the packed TRS rows
    float4 p4 = float4(d.c1.w, d.c2.w, d.c3.w, 1);

    // clip-space position, used by the frustum culling step that is elided here
    float4 p = mul(_UNITY_MATRIX_MVP, p4);

    // calculate the distance from the camera
    float l = length(eye - p4.xyz);

    if (l <= LOD_TRANSITION_DISTANCES.x)
    {
            outputLOD0.Append(d);
    }

    if (l >= (LOD_TRANSITION_DISTANCES.x ) && l <= LOD_TRANSITION_DISTANCES.y)
    {
            outputLOD1.Append(d);
    }

    if (l >= (LOD_TRANSITION_DISTANCES.y) && l <= LOD_TRANSITION_DISTANCES.z)
    {
            outputLOD2.Append(d);
    }
}

*** In the demo, there is an extra step in this shader to calculate if a given position is visible and skip appending it to an output buffer if it is not; this is the frustum culling step, which is elided here for clarity.

As for actually chaining Compute Shaders together, it turns out to be fairly simple: when we bind the buffers (during construction) we just need to link the inputs and the outputs properly:

// the same ComputeBuffer ("input" here) is the placement shader's output and the culling shader's input
gpuFrustumCullingShader.SetBuffer(gpuFrustumCullingKernel, "input", input);
gpuPlacementShader.SetBuffer(gpuPlacementKernel, "output", input);

When we get new placement data (the player moves and the Streamer requests more data etc) we simply dispatch the placement shader:

// specific variable names and thread sizes will vary based on implementation
gpuPlacementShader.SetTexture(gpuPlacementKernel, "inputs", textures);
gpuPlacementShader.Dispatch(gpuPlacementKernel, 1024 / 16, 1024 / 16, 9);

Finally, in our render loop, before we call DrawMeshInstancedIndirect, we need to update and dispatch our LOD shader:

// view-projection matrix; our instance positions are already in world space
gpuFrustumCullingShader.SetMatrix("_UNITY_MATRIX_MVP", camera.projectionMatrix * camera.worldToCameraMatrix);
gpuFrustumCullingShader.SetVector("eye", camera.transform.position);
gpuFrustumCullingShader.Dispatch(gpuFrustumCullingKernel, Mathf.Max(1, instances.Length / 128), 1, 1);

***Since we are using AppendBuffers for our ComputeBuffers, it’s important to remember to reset the counter on the buffer (buffer.SetCounterValue(0)) before dispatching a shader to fill it back up, otherwise things get weird.

 Collision + Interaction

Our terrain looks pretty nice now: we can render hundreds of thousands (if not millions) of grass patches, tens of thousands of trees and pebbles, and our terrain has a nice, eroded look. Unfortunately, as far as making an actual game goes, we’re pretty far off, since we can’t even collide with our terrain, let alone our trees and other objects. Apart from collision, it’s also very difficult to create any other user interaction with our environment, since all our placement data is generated in a Compute Shader and attaching logic to anything in particular isn’t very obvious.

Luckily, since we now have a system for generating placement data for objects we want to render, we can extend that system to also generate data for non-instanced objects, such as colliders and prefabs. Unfortunately, in order to accomplish this we will have to copy the placement data from the GPU to the CPU, although by carefully controlling when new prefab and collider data is requested, the impact on the user can be fairly minimal. To start, we should create some object pools to instantiate colliders and whatever prefabs we need when the game starts. This step is especially important in an engine like Unity, where game object creation is a fairly heavy operation. Once we have our object pools we need a way to extract the correct placement data from our Compute Shader. Luckily this isn’t too difficult. Since we already have transformation matrix information we could simply reuse that data to generate placements. However, since we likely want to have different types of colliders and prefabs based on loaded radius or just mesh type, I found it easier to simply create a second output AppendBuffer in the placement shader that appends data of this type:

struct PlacementData {
    float4x4 transformationMatrix;
    float3 colliderDimensions;
    int colliderType; // set per mesh like a uniform
    int spawnPrefab; // spawn a specific prefab instead
};

We can then just add instances of PlacementData to a new append buffer (when appropriate) while we populate our standard output buffer.


In order to read our placement data back on the CPU, we can utilize an AsyncGPUReadbackRequest. The asynchronous readback allows us to continue on our main thread while we wait for the request to process, as long as we’ve triggered the request in a coroutine. Once the request has finished processing it’s a quick operation to extract the data and send it to our object pools to activate our colliders and prefabs. In the demo video, trees within a few hundred meters have colliders, and trees within about 100m are actually replaced entirely with a prefab of the tree model with more accurate colliders. Completely replacing an instanced mesh is a bit more tricky, since we can’t just skip appending the instance to our standard output: there could be a few frames of delay while extracting our prefab placement data from the GPU, resulting in a nasty flash as the tree blinks out of existence briefly. To solve this I used a ComputeBuffer of culled mesh indices which I could populate on the CPU and push to the GPU when new tiles were requested. This allows the prefabs to become fully activated before culling the instanced version of the mesh. The copying of data to the GPU is unfortunate, but it doesn’t happen often and the buffers are small. For ground collisions, after a lot of trial and error I found that adding mesh colliders to the high resolution terrain tiles worked best, as long as the “cooking options” are set to None. It also helped to make sure the mesh collider re-computation was handled by the composition queue to spread the work out for different tiles over a couple of frames.
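
A sketch of what that readback can look like is below, with a CPU-side mirror of the PlacementData struct and a hypothetical ActivateCollidersAndPrefabs hand-off to the object pools (the readback API lives in UnityEngine.Rendering, and GetData returns a Unity.Collections NativeArray):

// CPU-side mirror of the HLSL PlacementData struct (float4x4 + float3 + 2 ints = 84 bytes)
struct PlacementData
{
    public Matrix4x4 transformationMatrix;
    public Vector3 colliderDimensions;
    public int colliderType;
    public int spawnPrefab;
}

IEnumerator ReadbackPlacementData(ComputeBuffer placementBuffer)
{
    AsyncGPUReadbackRequest request = AsyncGPUReadback.Request(placementBuffer);

    // keep the game running while the copy completes (usually a handful of frames)
    while (!request.done)
        yield return null;

    if (request.hasError)
        yield break;

    // note: for an append buffer this returns the whole allocation, so the valid
    // entry count still has to come from somewhere (e.g. a counter copy)
    NativeArray<PlacementData> data = request.GetData<PlacementData>();
    ActivateCollidersAndPrefabs(data);
}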

 Conclusion

There are a couple things in the demo video that I didn’t cover here, mostly related to terrain blending with the vegetation and some other custom shader trickery. However, since this post is already extremely long I don’t feel too bad leaving that out. If you’ve made it all the way through I hope you found some useful information, or at the very least found a good repository for resources related to procedural generation, instanced rendering and compute shaders. The next step for this tech demo is going to be adding procedurally generated structures using the Wave Function Collapse algorithm. Until then, I hope you’ll check out A Token War, and thanks for reading!