1.4 Optimizing Components
The Essence of the Problem
As I previously wrote here, naively using EntityId as an index into component storage is not very efficient, although it is the fastest access method (excluding resizes, of course). In practice, the share of allocated but unused memory will tend toward 100% as the number of component types grows, and I am not the first to think about this.
The most obvious solution would be to store components in a dictionary: `Dictionary<entityId(int), data(TComponent)>`. Simple - yes. O(1) - yes. Efficient - no: the components will be scattered across the heap, and there will be a lot of "heap jumps" and cache misses. Not very good.
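For illustration only, here is a minimal sketch of what that dictionary-based storage would look like (the class and method names are mine, not from any framework):

```csharp
// Illustrative only: O(1) on average, but every read walks hash buckets,
// and the values are not guaranteed to be contiguous in memory.
class DictionaryStorage<T>
{
    private readonly Dictionary<int, T> data = new();

    public int Count => data.Count;

    public void Add(int entityIid, in T value) => data[entityIid] = value;
    public bool Has(int entityIid) => data.ContainsKey(entityIid);
    public T Get(int entityIid) => data[entityIid]; // no cheap ref access either
    public void Remove(int entityIid) => data.Remove(entityIid);
}
```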
Expected Result
Here, it is probably worth describing what solution we should strive for and what we want to achieve in the end.
So, the expected result:
- Access to elements in O(1) for both reading and writing.
- Linear arrangement of elements in memory.
- Minimal overhead on element insertion/deletion.
Desired API:
Don't forget about CRUD
```csharp
interface IStorage<T>
{
    int Count { get; }

    ReadOnlySpan<T> All();
    ReadOnlySpan<int> AllEntities();

    // [C]reate
    void Add(in int entityIid, in T value);
    // [R]ead
    bool Has(in int entityIid);
    // [R]ead/[U]pdate
    ref T Ref(in int entityIid);
    // [D]elete
    void Remove(in int entityIid);
}
```
Additionally, `Count`, `All()`, and `AllEntities()` are added because iterating over existing data is common practice; just remember that they need to be exposed.
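To make the contract concrete, here is a hypothetical usage sketch (the `Position` component and variable names are mine, purely for illustration):

```csharp
struct Position { public float X, Y; }

static void Example(IStorage<Position> positions, int entityIid)
{
    positions.Add(entityIid, new Position { X = 1f, Y = 2f }); // [C]reate
    if (positions.Has(entityIid))                              // [R]ead
        positions.Ref(entityIid).X += 10f;                     // [U]pdate in place
    positions.Remove(entityIid);                               // [D]elete

    // Iterating over everything currently stored:
    foreach (ref readonly var position in positions.All()) { /* ... */ }
}
```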
In all my examples, `int` is used as the Entity Internal Identifier, or `entityIid`; in other frameworks it could be a different data type, which is not essential. What is essential is that EntityId is not used in its entirety here, and there are no checks for world membership or for whether the entity is alive: those checks are performed at a higher level. This element of the system is a so-called hot spot, or hot path, a very high-load part of the system where even the most innocuous check can significantly degrade performance. The fastest code is the code that isn't there.
As a starting point, let's build on a basic "sparse" linear implementation:
```csharp
class Storage<T> : IStorage<T>
{
    private bool[]? has;
    private T[]? data;

    public int Count { get; private set; }

    public ReadOnlySpan<T> All() => throw new NotImplementedException();
    public ReadOnlySpan<int> AllEntities() => throw new NotImplementedException();

    public Storage() => Resize(32);

    public void Add(in int entityIid, in T value)
    {
        // grow in blocks of 32 so we don't resize on every insertion
        if (entityIid >= has!.Length) Resize((entityIid / 32 + 1) * 32);
        if (!has![entityIid]) Count++; // don't double-count overwrites
        data![entityIid] = value;
        has![entityIid] = true;
    }

    public bool Has(in int entityIid) => entityIid < has!.Length && has[entityIid];

    public ref T Ref(in int entityIid) => ref data![entityIid];

    public void Remove(in int entityIid)
    {
        if (entityIid >= has!.Length || !has[entityIid]) return;
        has[entityIid] = false;
        Count--;
    }

    private void Resize(in int size)
    {
        if (has == null)
        {
            has = new bool[size];
            data = new T[size];
            return;
        }
        var initialSize = has.Length;
        if (initialSize >= size) return;
        Array.Resize(ref has, size);
        Array.Resize(ref data, size);
    }
}
```
As you can see, there is no efficient way to implement `All()` and `AllEntities()` with the current layout: both going through an `IEnumerator` and allocating a new array are unacceptable options, so we won't dwell on them for now.
In the current implementation, `entityIid` itself is the index into the `data` array, which is very convenient but inefficient: the gaps are still there.

To get rid of the gaps, you need to shift the array elements to the left, something like this:
Compact - yes, but something broke - also yes: now `entityIid` no longer points to the correct element in the `data` array. In simple mathematical terms:

Was: `c_e = data[i_e]`

Became: `c_e = data[i_c]`, where `i_c = f(i_e)`, or simply `c_e = data[f(i_e)]`
What is f(), and where did it come from?

f() is a function that transforms `entityIid` into an internal index within the `data` array. This is what was done with the elements in the animation above: we took the index that matched `entityIid` and changed it to another one, ensuring the elements are tightly packed in the array.

In reality, it is not feasible to express this as a pure function, simply because these changes depend on user code and are not deterministic; but it can be expressed through a mapping that stores the pair `entityIid -> index`.
You can use a Dictionary here!

You can, but we won't, because there is a better option.
Sparse Set
Explained here and here, so I won't go into much depth.

In a nutshell, it is a clever mapping with addition, reading, writing, and deletion in O(1), but with certain limitations, namely:
- It can only store values in the range [0 : n), where n is the maximum number of elements.
- The sparsity moves into the mapping: the sparse array will have gaps instead of our data array (which is acceptable in principle).
How it looks schematically:
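In lieu of the original schematic, here is a minimal sketch with example values of my own: suppose components were added for entities 0, 3, and 5, in that order:

```
sparse: [ 0, -1, -1,  1, -1,  2, ...]  // indexed by entityIid, stores an index into dense/data
dense:  [ 0,  3,  5, ...]              // indexed by data slot, stores the owning entityIid
data:   [d0, d3, d5, ...]              // tightly packed components

Ref(3) => data[sparse[3]] => data[1]
Has(4) => sparse[4] != -1 => false
```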
We got rid of the gaps in the component storage, and the components lie linearly. Is that all? No, not all.

Assuming we have figured out how it all fits together and is interconnected, let's look at the code. In general, it will have the following appearance:
```csharp
class Storage<T> : IStorage<T>
{
    private int[]? sparse;
    private int[]? dense;
    private T[]? data;

    public int Count { get; private set; }

    public ReadOnlySpan<T> All() => new(data, 0, Count);

    public Storage() => Resize(32, 32);

    public bool Has(in int entityIid) => sparse![entityIid] != -1;

    public ref T Ref(in int entityIid) => ref data![sparse![entityIid]];

    private void Resize(in int sparseSize, in int dataSize)
    {
        if (sparse == null)
        {
            sparse = new int[sparseSize];
            dense = new int[sparseSize];
            Array.Fill(sparse, -1);
            Array.Fill(dense, -1);
            data = new T[dataSize];
            return;
        }

        var initialSparseSize = sparse!.Length;
        if (initialSparseSize < sparseSize)
        {
            Array.Resize(ref sparse, sparseSize);
            Array.Resize(ref dense, sparseSize);
            Array.Fill(sparse, -1, initialSparseSize, sparseSize - initialSparseSize);
            Array.Fill(dense, -1, initialSparseSize, sparseSize - initialSparseSize);
        }

        var initialDataSize = data!.Length;
        if (initialDataSize < dataSize)
            Array.Resize(ref data, dataSize);
    }

    public ReadOnlySpan<int> AllEntities() => throw new NotImplementedException(); // <-- ignoring for now
    public void Add(in int entityIid, in T value) => throw new NotImplementedException(); // <-- will solve below
    public void Remove(in int entityIid) => throw new NotImplementedException(); // <-- will also solve below
}
```
This code assumes that the components are already in place and tightly packed. The `AllEntities()` method signature is questionable, because identifiers will never be packed tightly in `sparse`. Addition and deletion we will deal with below.
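Note, though, that `dense` already stores the owning `entityIid` for every occupied slot, so once Add/Remove (below) keep it consistent, a sketch of an efficient implementation could be as simple as:

```csharp
// Sketch: dense[0..Count) holds the entityIid of each packed component,
// so a packed view of the entity ids comes for free.
public ReadOnlySpan<int> AllEntities() => new(dense, 0, Count);
```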
I had doubts about the economy and efficiency of this approach, so I armed myself with a calculator and Google Sheets.

There is one theory, and now we will test it: with low component utilization, a large number of entities, and small component sizes, we do not save memory, while also adding extra operations that complicate reading and writing.

Well, this needs to be checked. Let's define the following values and ranges:
- N: total number of entities; let's say 1000.
- C: component size in bytes (sizeof; assuming align = 8, the minimum component size is also 8, with a step of 8).
- U: utilization in [0:1] with a step of 0.1 (10%): the percentage of all entities that have this component, i.e., have a value.
For a simple linear storage, memory consumption is calculated as `(C + 1) * N`: the component itself plus one `bool` per entity. `U` is not taken into account because we allocate memory for potentially every possible entity.

In reality, since we are obsessed with optimization, the linear implementation's memory consumption can be reduced using a bitmask instead of the `bool` array, and the formula becomes `C * N + ceil(N / 8)`, i.e., 1 byte of flags for every 8 entities.
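A sketch of that bitmask trick (my own illustration; the storages in this article keep the plain `bool[]`):

```csharp
// One bit per entity instead of one byte: ceil(N / 64) ulong words.
private ulong[] bits = new ulong[1];

public bool Has(in int entityIid) =>
    entityIid >> 6 < bits.Length
    && (bits[entityIid >> 6] & (1UL << (entityIid & 63))) != 0;

private void SetBit(in int entityIid)   => bits[entityIid >> 6] |= 1UL << (entityIid & 63);
private void ClearBit(in int entityIid) => bits[entityIid >> 6] &= ~(1UL << (entityIid & 63));
```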
For storage with a sparse set, it is `sizeof(int) * 2 * N + C * N * U`: the `sparse` and `dense` arrays plus the tightly packed data.
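As a quick sanity check (my own arithmetic, consistent with the table below): for N = 1000, C = 16, U = 0.3, the linear variant needs 16 * 1000 + ceil(1000 / 8) = 16125 bytes, while the sparse set needs 4 * 2 * 1000 + 16 * 1000 * 0.3 = 12800 bytes. Equating the two formulas, the sparse set wins while `U < (C - 8 + 1/8) / C`, so the larger the component, the wider the range of utilizations where it pays off.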
The table with the calculations and the chart, along with the conclusions, can be viewed here. The vertical axis is component size; the horizontal axis is utilization.
Size | LINEAR | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
---|---|---|---|---|---|---|---|---|---|---|---|
8 | 8125 | 8800 | 9600 | 10400 | 11200 | 12000 | 12800 | 13600 | 14400 | 15200 | 16000 |
16 | 16125 | 9600 | 11200 | 12800 | 14400 | 16000 | 17600 | 19200 | 20800 | 22400 | 24000 |
24 | 24125 | 10400 | 12800 | 15200 | 17600 | 20000 | 22400 | 24800 | 27200 | 29600 | 32000 |
32 | 32125 | 11200 | 14400 | 17600 | 20800 | 24000 | 27200 | 30400 | 33600 | 36800 | 40000 |
40 | 40125 | 12000 | 16000 | 20000 | 24000 | 28000 | 32000 | 36000 | 40000 | 44000 | 48000 |
48 | 48125 | 12800 | 17600 | 22400 | 27200 | 32000 | 36800 | 41600 | 46400 | 51200 | 56000 |
56 | 56125 | 13600 | 19200 | 24800 | 30400 | 36000 | 41600 | 47200 | 52800 | 58400 | 64000 |
64 | 64125 | 14400 | 20800 | 27200 | 33600 | 40000 | 46400 | 52800 | 59200 | 65600 | 72000 |
72 | 72125 | 15200 | 22400 | 29600 | 36800 | 44000 | 51200 | 58400 | 65600 | 72800 | 80000 |
80 | 80125 | 16000 | 24000 | 32000 | 40000 | 48000 | 56000 | 64000 | 72000 | 80000 | 88000 |
88 | 88125 | 16800 | 25600 | 34400 | 43200 | 52000 | 60800 | 69600 | 78400 | 87200 | 96000 |
96 | 96125 | 17600 | 27200 | 36800 | 46400 | 56000 | 65600 | 75200 | 84800 | 94400 | 104000 |
104 | 104125 | 18400 | 28800 | 39200 | 49600 | 60000 | 70400 | 80800 | 91200 | 101600 | 112000 |
112 | 112125 | 19200 | 30400 | 41600 | 52800 | 64000 | 75200 | 86400 | 97600 | 108800 | 120000 |
120 | 120125 | 20000 | 32000 | 44000 | 56000 | 68000 | 80000 | 92000 | 104000 | 116000 | 128000 |
128 | 128125 | 20800 | 33600 | 46400 | 59200 | 72000 | 84800 | 97600 | 110400 | 123200 | 136000 |
One can notice that in terms of memory consumption, sparse sets are not advantageous in two cases:
- Very small components.
- Very dense utilization.

Do not abuse splitting components too finely in sparse-set ECSes; you will just waste memory for the sake of a false optimization.
Addition / Deletion
The methods for addition and deletion are closely related, so, as always, a bit of theory first.

There are several memory utilization strategies for linear storage systems such as arrays, and each of them needs to be understood to choose a suitable one.

Let me clarify right away that there is no ideal strategy: each has its pros and cons.
List Approach
Here, in general, everything is simple:
```csharp
public void Add(in int entityIid, in T value)
{
    // make sure we fit
    Resize((entityIid / 32 + 1) * 32, data!.Length == Count + 1 ? data.Length + 32 : data.Length);
    sparse![entityIid] = Count; // set the indices
    dense![Count] = entityIid;
    data![Count] = value;       // write the data
    Count++;                    // increment the counter
}

public void Remove(in int entityIid)
{
    var arrayIndex = sparse![entityIid];
    // shift the data "tail" one slot to the left
    Array.Copy(data!, arrayIndex + 1, data!, arrayIndex, Count - arrayIndex - 1);
    // fix up the mapping for every element that moved
    for (var i = arrayIndex; i < Count - 1; i++)
    {
        sparse![dense![i + 1]] = i;
        dense![i] = dense![i + 1];
    }
    Count--;
    sparse![entityIid] = -1;
    dense![Count] = -1;
}
```
Let's now see how all this looks schematically:
Addition
Deletion
If `Add()` is more or less okay, then in `Remove()` we have a loop: in the worst case, removing the first added component requires adjusting all the indices. O(n), in short, which does not meet our requirements.
Is it fixable?
Yes, it's fixable, let's move on.
Remove and Swap Back
A very interesting mechanism for removing array elements, charming in its simplicity.
Its essence is that instead of shifting all elements to the left, the removed element is simply replaced by the last one. This allows for much faster removal of elements by index because only one element is copied. As a result, we have removal in constant time (not to be confused with constant complexity).
This is how it looks in code:
Such things are very convenient to implement as extension methods.
```csharp
public static void RemoveAndSwapBack<T>(in T[] array, in int index, ref int count)
    => array[index] = array[--count];
```
Well, that's basically it. The last element should additionally be overwritten if the array elements are reference types, or structs containing reference types, so that the garbage collector can collect them, but that is a topic for another article.
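A sketch of that GC-friendly variant (my own addition, under the same signature):

```csharp
public static void RemoveAndSwapBack<T>(in T[] array, in int index, ref int count)
{
    array[index] = array[--count];
    array[count] = default!; // drop the stale reference so the GC can reclaim it
}
```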
Schematically it looks like this:
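In lieu of the original animation, a small trace with example values of my own:

```
count = 5:  [ A, B, C, D, E ]
RemoveAndSwapBack(array, index: 1, ref count);  // remove B
count = 4:  [ A, E, C, D, · ]                   // E moved into slot 1; the tail slot is stale
```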
Cool and beautiful, but there are also downsides:
- The order of the stored elements is disrupted. In general this is not a big problem, for example when iterating over a single component type. But as we discussed earlier, access goes through the entity ID mapped to an index in the array, which looks more like random access by index and leads to many cache misses (and hurts performance accordingly). Here's a good video on this topic.

Some ECS frameworks have optimizations aimed at ordering memory so that iterations always go from left to right. This significantly speeds up iteration but requires additional time for the reordering.
But let's get back to implementations and see how it will look in the context of component storage.
```csharp
public void Remove(in int entityIid)
{
    var arrayIndex = sparse![entityIid];
    var lastEntityIid = dense![--Count];

    // move the last element into the vacated slot
    data![arrayIndex] = data![Count];
    dense![arrayIndex] = lastEntityIid;
    sparse![lastEntityIid] = arrayIndex;

    // clear the mapping for the removed entity and the stale tail slot
    dense![Count] = -1;
    sparse![entityIid] = -1;
}
```
Compared to what we had before: no loop, and a constant number of operations that depends neither on the size of the storage nor on the index of the deleted element.

But who am I to not measure all of this:
Type | Method | Count | Mean | Error | StdDev | Median |
---|---|---|---|---|---|---|
RemoveAndSwapBack | Reverse | 10000 | 41.68 us | 0.825 us | 1.209 us | 41.79 us |
RemoveAndSwapBack | Linear | 10000 | 42.82 us | 0.851 us | 1.399 us | 42.83 us |
RemoveAndSwapBack | Random | 10000 | 45.59 us | 0.903 us | 1.075 us | 45.50 us |
List | Reverse | 10000 | 61.86 us | 1.353 us | 3.725 us | 61.48 us |
RemoveAndSwapBack | Reverse | 100000 | 290.98 us | 5.730 us | 8.217 us | 286.89 us |
RemoveAndSwapBack | Linear | 100000 | 295.25 us | 5.882 us | 7.002 us | 295.54 us |
RemoveAndSwapBack | Random | 100000 | 409.35 us | 7.702 us | 7.565 us | 405.67 us |
List | Reverse | 100000 | 440.00 us | 8.429 us | 7.472 us | 437.75 us |
RemoveAndSwapBack | Reverse | 250000 | 687.89 us | 3.690 us | 3.081 us | 687.25 us |
RemoveAndSwapBack | Linear | 250000 | 691.51 us | 3.748 us | 3.322 us | 691.50 us |
List | Reverse | 250000 | 1,059.04 us | 16.636 us | 15.561 us | 1,057.12 us |
RemoveAndSwapBack | Random | 250000 | 1,064.65 us | 19.698 us | 36.512 us | 1,048.54 us |
List | Random | 10000 | 28,061.13 us | 490.183 us | 434.534 us | 27,916.44 us |
List | Linear | 10000 | 57,214.52 us | 686.687 us | 642.328 us | 57,218.58 us |
List | Random | 100000 | 2,898,043.15 us | 57,450.454 us | 72,656.430 us | 2,859,386.54 us |
List | Linear | 100000 | 5,824,036.29 us | 73,614.984 us | 61,471.846 us | 5,812,698.46 us |
List | Random | 250000 | 17,908,060.10 us | 249,017.326 us | 232,930.961 us | 17,889,613.81 us |
List | Linear | 250000 | 36,972,363.97 us | 472,965.859 us | 442,412.558 us | 36,850,371.92 us |
As expected, yes, it is much faster, and it is clear, in principle, why. But a number of interesting observations can be made here:

- `Random()` is faster than `Linear()` in the case of `List`, and the fastest was `Reverse()`.

This is because in linear iteration from start to end, the amount and volume of copied memory is at its maximum: we always delete the first element and copy the entire "tail". In the Random case, we copy half the total length on average. And in the Reverse case, deleting the last element copies only that one element (this case can be handled separately by simply decrementing `Count`).

- `Random()` is slower than `Linear()` in the case of `RemoveAndSwapBack`, but `Reverse()` is faster.

Since our operation is atomic and there is no need to manipulate the "tail", everything comes down to the speed of accessing array elements. In linear iteration, the prefetcher is utilized to the maximum (you can read about this here). With random access, cache misses occur more frequently, which is evident in the benchmark. `Reverse()` is faster because there is no need to jump between the current index and the last one, thus avoiding cache misses.
Can it be even faster?

In theory, yes, but only through deferred compaction of the storage, which would preserve the order of the stored components as much as possible.

How to further speed up the in-place version, I do not know =)
Conclusions
Sparse sets are not an ideal solution, at least not for all cases.

In-place compaction is best done via RemoveAndSwapBack.
The benchmark code can be found here, and the storage code here.