Table of contents
- The Naive Pool
- The Free List
- The Packed Array
- The Slot Map (Generational Handles)
- The Performance Showdown: By the Numbers
- Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames
- Conclusion
The Naive Pool
In Part 2, we built the Arena, which is ideal for objects with shared lifetimes. However, systems like particle effects or entity lists often require individual object lifetimes. Using an Arena for these results in fragmentation, as freed slots cannot be reclaimed until the entire arena is reset.
To address this, we need a Pool Allocator. A pool manages objects of a fixed size, allowing us to recycle memory efficiently.
The simplest implementation is an array of slots with a boolean flag indicating active status.
struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};
struct ParticlePool {
    Arena* backing_arena;
    Particle* particles; // first slot; slots are contiguous in the arena
    size_t capacity;
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->particles = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (p) p->is_active = false; // leaves a "hole" in the array
}

Particle* PoolAlloc(ParticlePool* pool) {
    // O(N): scan every slot looking for a hole to recycle
    for (size_t i = 0; i < pool->capacity; ++i) {
        if (!pool->particles[i].is_active) {
            pool->particles[i].is_active = true;
            return &pool->particles[i];
        }
    }
    // No hole found: grow from the backing arena
    Particle* p = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (p) {
        if (!pool->particles) pool->particles = p;
        pool->capacity++;
        p->is_active = true;
    }
    return p;
}
This approach has significant flaws.
- Padding: The bool is_active flag often introduces padding bytes due to alignment, wasting memory per object.
- O(N) Allocation: Finding a free slot requires a linear scan of the array. As the pool fills, allocation performance degrades linearly.
- Fragmentation: Freed slots create “holes” that are expensive to locate and reuse.
The Free List
To address the O(N) allocation and fragmentation issues, we need a mechanism to locate free slots in O(1) time. A common initial approach is to maintain a separate, external container to track available slots.
Approach 1: The External Free List
We could maintain a std::vector<Particle*> containing pointers to every free slot.
// Hypothetical external list
std::vector<Particle*> free_slots;

void Free(Particle* p) {
    p->is_active = false;
    free_slots.push_back(p);
}

Particle* Alloc() {
    if (!free_slots.empty()) {
        Particle* p = free_slots.back();
        free_slots.pop_back();
        return p;
    }
    // ... allocate new ...
}
While this provides O(1) access, it introduces significant overhead:
- Double Allocation: The std::vector itself requires dynamic memory allocation, defeating the purpose of a custom allocator.
- Cache Inefficiency: The vector lives in a different region of memory than the pool, reducing cache locality.
Approach 2: The Intrusive Linked List
We can eliminate this overhead by observing that a “dead” particle’s memory is unused. We can repurpose this memory to store the free list node itself. This is known as an intrusive linked list.
By using a union, we can overlay a next_free pointer on top of the particle’s data fields. This ensures that the free list consumes zero additional memory.
struct Particle {
    // is_active is useful for debugging or iteration,
    // but not strictly necessary for the free list itself
    bool is_active;
    union {
        // State when is_active = true
        struct {
            float position[2];
            float velocity[2];
            float lifetime_remaining;
        } data;
        // State when is_active = false
        Particle* next_free;
    };
};
When a particle is freed, we write the address of the current free_list_head into its next_free field and update the head to point to this particle. This effectively pushes the slot onto a stack threaded through the unused memory.
Implementation
The pool maintains a pointer to the head of the free list.
struct ParticlePool {
    Arena* backing_arena;
    Particle* free_list_head;
    size_t capacity;
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->free_list_head = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (!p) return;
    p->is_active = false;
    // Push onto the free list stack
    p->next_free = pool->free_list_head;
    pool->free_list_head = p;
}

Particle* PoolAlloc(ParticlePool* pool) {
    // 1. Try to recycle from the free list
    if (pool->free_list_head) {
        Particle* recycled = pool->free_list_head;
        pool->free_list_head = recycled->next_free;
        recycled->is_active = true;
        return recycled;
    }
    // 2. Fallback: allocate fresh memory from the arena
    pool->capacity++;
    Particle* new_particle = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (new_particle) {
        new_particle->is_active = true;
    }
    return new_particle;
}
This provides O(1) allocation and deallocation. However, it does not preserve memory locality. Over time, active objects become scattered throughout the array, which can degrade cache performance during iteration.
The Packed Array
The Free List solves allocation speed, but it introduces a new problem: Iteration Performance.
In a Free List, active objects are scattered throughout the memory block. Iterating requires checking every slot’s is_active flag.
- Cache Misses: We load cache lines containing inactive data, wasting bandwidth.
- Branch Misprediction: The random pattern of active/inactive slots confuses the CPU’s branch predictor.
For systems where iteration is the dominant operation (e.g., updating 10,000 particles per frame), we need Contiguity.
The “Swap and Pop” Strategy
To ensure optimal iteration, we must keep all active objects packed in a contiguous block at the start of the array. We achieve this with the “Swap and Pop” removal technique.
The invariant is: active_count always represents the number of contiguous active elements.
- Allocation: Trivial. Use the slot at index = active_count and increment.
- Deallocation: To remove an element at index i:
  - Copy the data from the last active element (index = active_count - 1) into slot i.
  - Decrement active_count.
struct ParticlePool {
    Particle* particles;
    size_t active_count;
};

void RemoveAt(ParticlePool* pool, size_t index) {
    size_t last_index = pool->active_count - 1;
    // Move the last element into the hole
    pool->particles[index] = pool->particles[last_index];
    pool->active_count--;
}
This operation fills the “hole” created by deletion with the last element, preserving the dense packing.
The result is a loop that compiles to a simple pointer increment, maximizing instruction throughput and cache efficiency.
for (size_t i = 0; i < pool.active_count; ++i) {
    Particle* p = &pool.particles[i];
    // ... update ...
}
The trade-off is pointer invalidation. When an object is moved to fill a hole, any external pointers to that object become invalid (dangling) or incorrect (pointing to the moved object). This makes the Packed Array unsuitable for systems where stable references are required.
The Slot Map (Generational Handles)
The Packed Array offers perfect iteration but suffers from Pointer Invalidation. When “Swap and Pop” moves an object, any pointer referring to it becomes invalid (pointing to the old location, which now holds a different object).
To support both fast iteration and stable references, we need a Slot Map. This architecture introduces a layer of indirection: instead of a raw pointer, we give the user a Handle.
A Handle is a composite key containing a stable Index and a Generation ID.
Step 1: Indirection
We decouple the “identity” of a particle from its physical location.
- Dense Array: Stores the actual particle data, packed tightly for iteration.
- Sparse Array: A lookup table mapping a stable ID (the handle index) to the current index in the Dense Array.
- Reverse Map: A map from Dense Index to Sparse Index, needed to update the Sparse Array during a swap.
When we allocate a particle, we assign it a stable ID. This ID never changes, even if the particle’s data moves in the Dense Array.
Example:
- Allocate Particle A (ID 0) at Dense Index 0. sparse[0] = 0.
- Allocate Particle B (ID 1) at Dense Index 1. sparse[1] = 1.
- Allocate Particle C (ID 2) at Dense Index 2. sparse[2] = 2.
The Swap: If we free Particle B (ID 1), we must fill the hole at Dense Index 1 with the last element, Particle C.
- Move C’s data from Dense Index 2 to Dense Index 1.
- Update the Sparse Array: sparse[2] must now point to Dense Index 1.
- Update the Reverse Map: dense_to_sparse[1] is now 2 (the ID of C).
This two-way mapping ensures handles remain valid even as the underlying data moves to maintain cache locality.
Step 2: The ABA Problem
Indirection alone is insufficient. If we free an ID and then immediately reuse it for a new object, an old handle referring to the previous object will now point to the new one. This is the ABA Problem, leading to subtle bugs where code modifies the wrong entity.
Step 3: Generational Handles
To solve this, we add a version number, or Generation, to each slot in the Sparse Array.
- Handle = Index + Generation: The handle given to the user contains both the Sparse Index and the Generation at the time of allocation.
- Validation: When accessing a slot, we compare Handle.Generation with Slot.Generation.
- Increment: When a slot is freed, we increment its Generation.
If the generations match, the handle is valid. If they differ, the slot has been reused, and we can safely reject the access.
Implementation
struct ParticleHandle {
    uint32_t sparse_index;
    uint32_t generation;
};

struct SparseEntry {
    uint32_t dense_index;
    uint32_t generation;
};

struct ParticlePool {
    Particle* particles;        // Dense data
    uint32_t* dense_to_sparse;  // Reverse mapping
    SparseEntry* sparse;        // Lookup table
    uint32_t active_count;
    uint32_t first_free_sparse; // Head of sparse free list

    ParticleHandle Alloc() {
        // 1. Get a free sparse slot
        uint32_t sparse_idx = first_free_sparse;
        first_free_sparse = sparse[sparse_idx].dense_index; // Pop from free list
        // 2. Append to dense array
        uint32_t dense_idx = active_count++;
        // 3. Link them
        sparse[sparse_idx].dense_index = dense_idx;
        dense_to_sparse[dense_idx] = sparse_idx;
        // 4. Return handle with current generation
        return { sparse_idx, sparse[sparse_idx].generation };
    }

    void Free(ParticleHandle handle) {
        if (!IsValid(handle)) return;
        uint32_t dense_idx = sparse[handle.sparse_index].dense_index;
        uint32_t last_dense = --active_count;
        // 1. Swap & pop in dense array
        particles[dense_idx] = particles[last_dense];
        // 2. Update mappings for the moved particle
        uint32_t moved_sparse = dense_to_sparse[last_dense];
        sparse[moved_sparse].dense_index = dense_idx;
        dense_to_sparse[dense_idx] = moved_sparse;
        // 3. Retire the sparse slot
        sparse[handle.sparse_index].generation++;                    // Invalidate old handles
        sparse[handle.sparse_index].dense_index = first_free_sparse; // Push to free list
        first_free_sparse = handle.sparse_index;
    }

    bool IsValid(ParticleHandle handle) {
        return sparse[handle.sparse_index].generation == handle.generation;
    }
};
The Performance Showdown: By the Numbers
Now that we’ve built our arsenal of pool designs, let’s put them to the test. We’ll simulate a particle system under stress, measuring three key phases:
- Alloc/Free Time: The cost of initial allocations and deallocations.
- Churn Time: Simulates dynamic gameplay by replacing 500 particles over 1,000 “frames” (frees followed by allocs), testing real-world object turnover. (Note: Times here are purely the accumulated cost of alloc/free calls, excluding setup like shuffling.)
- Iteration Time: Updating all active particles a set number of times, mimicking per-frame logic.
These benchmarks were run with varying total allocations, free rates, and iteration counts to expose each design’s strengths and weaknesses. We’ll walk through each test set, highlighting what the numbers reveal about trade-offs like cache efficiency, allocation speed, and iteration bottlenecks.
Ran on an M4 Pro 12-core MacBook.
Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
This set models a sparse system with lots of frees (e.g., short-lived particles exploding on impact).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 12.99 | 35.91 | 2.79 |
| Free List Pool | 0.17 | 0.44 | 5.48 |
| Packed Array | 0.27 | 0.39 | 0.31 |
| Slot Map | 0.15 | 1.19 | 0.31 |
Insights:
- The Naive Pool struggles with alloc/free (12.99 ms) due to its O(N) scan for free slots. It’s repeatedly combing through nearly 10,000 elements.
- Yet, its iteration shines (2.79 ms, faster than Free List) thanks to front-filling, which clusters actives at the array’s start for better cache hits despite the if branches.
- Free List dominates churn (0.44 ms total for 500,000 operations) with pure O(1) efficiency, but iteration lags (5.48 ms) from scattered actives causing cache misses.
- Packed Array’s churn is equally impressive (0.39 ms), showing swap-and-pop copies are negligible for small structs.
- Slot Map adds a small churn overhead (1.19 ms) from indirection, but matches Packed’s blazing iteration (0.31 ms) while providing safe handles.
Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
Here, we have a dense system with few frees (e.g., long-lived entities like NPCs).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 12.94 | 613.37 | 6.51 |
| Free List Pool | 0.13 | 0.51 | 7.63 |
| Packed Array | 0.15 | 0.80 | 6.92 |
| Slot Map | 0.15 | 1.69 | 7.00 |
Insights:
- Naive’s churn balloons to 613 ms in this dense setup: each alloc must scan almost the full array to find rare free slots, making it painfully inefficient.
- Its iteration holds up well (6.51 ms) due to clustering, and with 9,500 actives, the if checks become less of a bottleneck.
- Free List and Packed Array breeze through churn (under 1 ms total), as frees quickly create reusable slots without heavy costs.
- Slot Map’s churn is a tad higher (1.69 ms) from handle and generation management.
- Iteration sees packed designs pull ahead slightly, benefiting from high active density, while Free List trails due to minor scattering from churn.
Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
Scaling up to test raw alloc/free scalability (e.g., massive simulations).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 1244.46 | 34.84 | 25.13 |
| Free List Pool | 1.41 | 0.46 | 32.69 |
| Packed Array | 2.73 | 0.39 | 0.31 |
| Slot Map | 1.62 | 1.19 | 0.31 |
Insights:
- Naive buckles under the scale (1,244 ms alloc/free) with its O(N²)-esque scanning, completely unusable for massive pools.
- Free List and Slot Map scale effortlessly with O(1) operations, keeping times low.
- Packed’s initial frees take a hit (2.73 ms) from numerous swaps, but its churn stays minimal (0.39 ms).
- Iteration amplifies the packed advantage (0.31 ms vs. 25-33 ms), as they process only 500 actives while others slog through 100,000 slots.
- With abundant free slots post-initial frees, churn remains tiny for all, but Free List and Packed edge out with sub-0.5 ms totals.
Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames
Cranking iterations to extremes (e.g., stress-testing update loops).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 12.78 | 35.44 | 273.18 |
| Free List Pool | 0.15 | 0.43 | 542.18 |
| Packed Array | 0.26 | 0.38 | 30.79 |
| Slot Map | 0.15 | 1.17 | 30.62 |
Insights:
- Iteration dominates, magnifying gaps: Packed and Slot Map wrap up in ~30 ms (handling 500 actives 100,000 times), while Naive (273 ms) and Free List (542 ms) burn time on empty slots.
- Naive’s clustering pays off again, outpacing Free List in iteration; its naive strategy accidentally optimizes for density.
- Churn aligns with other sets (all under 2 ms except Naive’s 35 ms), underscoring that for iteration-heavy workloads, packed designs deliver 9-18x speedups.
- Free List’s churn is stellar (0.43 ms), but if looping is your hot path, the cache costs make it lag.
What We’ve Learned from the Numbers
The big lesson? There’s no universal “best”; it hinges on your priorities. High churn? Lean toward the Free List for raw efficiency. Iteration-heavy? Packed or Slot Map will transform your frame times. Need safety too? The Slot Map reigns supreme as the balanced powerhouse. Always benchmark in your real app: these tests are a starting point, but your data patterns might flip the script!
Conclusion
We have moved from the raw power of the arena to specialized tools. The Free List, Packed Array, and Slot Map aren’t just algorithms. They are solutions to specific data problems.
Data-oriented design isn’t about abstract rules. It is about understanding hardware reality: cache lines, pointer indirection, and branch prediction. When you align your data structure with your access pattern, you get performance for free. These aren’t optimizations you sprinkle on top later. They are architectural decisions that make your system robust and scalable from the start.