Table of contents
- The Naive Pool
- The Free List
- The Packed Array
- The Slot Map (Generational Handles)
- The Performance Showdown: By the Numbers
- Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
- Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames
- Conclusion
The Naive Pool
In Part 2, we built the Arena, which is ideal for objects with shared lifetimes. However, systems like particle effects or entity lists often require individual object lifetimes. Using an Arena for these results in fragmentation, as freed slots cannot be reclaimed until the entire arena is reset.
To address this, we need a Pool Allocator. A pool manages objects of a fixed size, allowing us to recycle memory efficiently.
The simplest implementation is an array of slots with a boolean flag indicating active status.
struct Particle {
    bool is_active;
    float position[2];
    float velocity[2];
    float lifetime_remaining;
};
struct ParticlePool {
    Arena* backing_arena;
    Particle* particles; // first slot; slots are contiguous in the arena
    size_t capacity;
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->particles = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (p) p->is_active = false; // leaves a "hole" in the array
}

Particle* PoolAlloc(ParticlePool* pool) {
    // O(N): scan every slot looking for a hole to recycle
    for (size_t i = 0; i < pool->capacity; ++i) {
        if (!pool->particles[i].is_active) {
            pool->particles[i].is_active = true;
            return &pool->particles[i];
        }
    }
    // No hole found: grow from the backing arena
    Particle* p = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (p) {
        if (!pool->particles) pool->particles = p;
        pool->capacity++;
        p->is_active = true;
    }
    return p;
}
This approach has significant flaws.
- Padding: The bool is_active flag often introduces padding bytes due to alignment, wasting memory per object.
- O(N) Allocation: Finding a free slot requires a linear scan of the array. As the pool fills, allocation performance degrades linearly.
- Fragmentation: Freed slots create “holes” that are expensive to locate and reuse.
The Free List
To address the O(N) allocation and fragmentation issues, we need a mechanism to locate free slots in O(1) time. A common initial approach is to maintain a separate, external container to track available slots.
Approach 1: The External Free List
We could maintain a std::vector<Particle*> containing pointers to every free slot.
// Hypothetical external list
std::vector<Particle*> free_slots;

void Free(Particle* p) {
    p->is_active = false;
    free_slots.push_back(p);
}

Particle* Alloc() {
    if (!free_slots.empty()) {
        Particle* p = free_slots.back();
        free_slots.pop_back();
        return p;
    }
    // ... allocate new ...
}
While this provides O(1) access, it introduces significant overhead:
- Double Allocation: The std::vector itself requires dynamic memory allocation, defeating the purpose of a custom allocator.
- Cache Inefficiency: The vector lives in a different region of memory than the pool, reducing cache locality.
Approach 2: The Intrusive Linked List
We can eliminate this overhead by observing that a “dead” particle’s memory is unused. We can repurpose this memory to store the free list node itself. This is known as an intrusive linked list.
By using a union, we can overlay a next_free pointer on top of the particle’s data fields. This ensures that the free list consumes zero additional memory.
struct Particle {
    // is_active is useful for debugging or iteration,
    // but not strictly necessary for the free list itself
    bool is_active;
    union {
        // State when is_active = true
        struct {
            float position[2];
            float velocity[2];
            float lifetime_remaining;
        } data;
        // State when is_active = false
        Particle* next_free;
    };
};
When a particle is freed, we write the address of the current free_list_head into its next_free field and update the head to point to this particle. This effectively pushes the slot onto a stack threaded through the unused memory.
Implementation
The pool maintains a pointer to the head of the free list.
struct ParticlePool {
    Arena* backing_arena;
    Particle* free_list_head;
    size_t capacity;
};

void InitPool(ParticlePool* pool, Arena* backing_arena) {
    pool->backing_arena = backing_arena;
    pool->free_list_head = nullptr;
    pool->capacity = 0;
}

void PoolFree(ParticlePool* pool, Particle* p) {
    if (!p) return;
    p->is_active = false;
    // Push onto the free list stack
    p->next_free = pool->free_list_head;
    pool->free_list_head = p;
}

Particle* PoolAlloc(ParticlePool* pool) {
    // 1. Try to recycle from the free list
    if (pool->free_list_head) {
        Particle* recycled = pool->free_list_head;
        pool->free_list_head = recycled->next_free;
        recycled->is_active = true;
        return recycled;
    }
    // 2. Fallback: allocate fresh memory from the arena
    pool->capacity++;
    Particle* new_particle = (Particle*)ArenaAlloc(pool->backing_arena, sizeof(Particle));
    if (new_particle) {
        new_particle->is_active = true;
    }
    return new_particle;
}
This provides O(1) allocation and deallocation. However, it does not preserve memory locality. Over time, active objects become scattered throughout the array, which can degrade cache performance during iteration.
The Packed Array
The Free List solves allocation speed, but it introduces a new problem: Iteration Performance.
In a Free List, active objects are scattered throughout the memory block. Iterating requires checking every slot’s is_active flag.
- Cache Misses: We load cache lines containing inactive data, wasting bandwidth.
- Branch Misprediction: The random pattern of active/inactive slots confuses the CPU’s branch predictor.
For systems where iteration is the dominant operation (e.g., updating 10,000 particles per frame), we need Contiguity.
The “Swap and Pop” Strategy
To ensure optimal iteration, we must keep all active objects packed in a contiguous block at the start of the array. We achieve this with the “Swap and Pop” removal technique.
The invariant is: active_count always represents the number of contiguous active elements.
- Allocation: Trivial. Use the slot at index = active_count and increment.
- Deallocation: To remove an element at index i:
  - Copy the data from the last active element (index = active_count - 1) into slot i.
  - Decrement active_count.
struct ParticlePool {
    Particle* particles;
    size_t active_count;
};

void RemoveAt(ParticlePool* pool, size_t index) {
    size_t last_index = pool->active_count - 1;
    // Move the last element into the hole
    pool->particles[index] = pool->particles[last_index];
    pool->active_count--;
}
This operation fills the “hole” created by deletion with the last element, preserving the dense packing.
The result is a loop that compiles to a simple pointer increment, maximizing instruction throughput and cache efficiency.
for (size_t i = 0; i < pool.active_count; ++i) {
    Particle* p = &pool.particles[i];
    // ... update ...
}
The trade-off is pointer invalidation. When an object is moved to fill a hole, any external pointers to that object become invalid (dangling) or incorrect (pointing to the moved object). This makes the Packed Array unsuitable for systems where stable references are required.
The Slot Map (Generational Handles)
The Packed Array offers perfect iteration but suffers from Pointer Invalidation. When “Swap and Pop” moves an object, any pointer referring to it becomes invalid (pointing to the old location, which now holds a different object).
To support both fast iteration and stable references, we need a Slot Map. This architecture introduces a layer of indirection: instead of a raw pointer, we give the user a Handle.
A Handle is a composite key containing a stable Index and a Generation ID.
Step 1: Indirection
We decouple the “identity” of a particle from its physical location.
- Dense Array: Stores the actual particle data, packed tightly for iteration.
- Sparse Array: A lookup table mapping a stable ID (the handle index) to the current index in the Dense Array.
- Reverse Map: A map from Dense Index to Sparse Index, needed to update the Sparse Array during a swap.
When we allocate a particle, we assign it a stable ID. This ID never changes, even if the particle’s data moves in the Dense Array.
Example:
- Allocate Particle A (ID 0) at Dense Index 0. sparse[0] = 0.
- Allocate Particle B (ID 1) at Dense Index 1. sparse[1] = 1.
- Allocate Particle C (ID 2) at Dense Index 2. sparse[2] = 2.
The Swap: If we free Particle B (ID 1), we must fill the hole at Dense Index 1 with the last element, Particle C.
- Move C’s data from Dense Index 2 to Dense Index 1.
- Update the Sparse Array: sparse[2] must now point to Dense Index 1.
- Update the Reverse Map: dense_to_sparse[1] is now 2 (the ID of C).
This two-way mapping ensures handles remain valid even as the underlying data moves to maintain cache locality.
Step 2: The ABA Problem
Indirection alone is insufficient. If we free an ID and then immediately reuse it for a new object, an old handle referring to the previous object will now point to the new one. This is the ABA Problem, leading to subtle bugs where code modifies the wrong entity.
Step 3: Generational Handles
To solve this, we add a version number, or Generation, to each slot in the Sparse Array.
- Handle = Index + Generation: The handle given to the user contains both the Sparse Index and the Generation at the time of allocation.
- Validation: When accessing a slot, we compare Handle.Generation with Slot.Generation.
- Increment: When a slot is freed, we increment its Generation.
If the generations match, the handle is valid. If they differ, the slot has been reused, and we can safely reject the access.
Implementation
struct ParticleHandle {
    uint32_t sparse_index;
    uint32_t generation;
};

struct SparseEntry {
    uint32_t dense_index;
    uint32_t generation;
};

struct ParticlePool {
    Particle* particles;        // Dense data
    uint32_t* dense_to_sparse;  // Reverse mapping
    SparseEntry* sparse;        // Lookup table
    uint32_t active_count;
    uint32_t first_free_sparse; // Head of sparse free list

    ParticleHandle Alloc() {
        // 1. Get a free sparse slot
        uint32_t sparse_idx = first_free_sparse;
        first_free_sparse = sparse[sparse_idx].dense_index; // Pop from free list
        // 2. Append to dense array
        uint32_t dense_idx = active_count++;
        // 3. Link them
        sparse[sparse_idx].dense_index = dense_idx;
        dense_to_sparse[dense_idx] = sparse_idx;
        // 4. Return handle with current generation
        return { sparse_idx, sparse[sparse_idx].generation };
    }

    void Free(ParticleHandle handle) {
        if (!IsValid(handle)) return;
        uint32_t dense_idx = sparse[handle.sparse_index].dense_index;
        uint32_t last_dense = --active_count;
        // 1. Swap & pop in dense array
        particles[dense_idx] = particles[last_dense];
        // 2. Update mappings for the moved particle
        uint32_t moved_sparse = dense_to_sparse[last_dense];
        sparse[moved_sparse].dense_index = dense_idx;
        dense_to_sparse[dense_idx] = moved_sparse;
        // 3. Retire the sparse slot
        sparse[handle.sparse_index].generation++;                    // Invalidate old handles
        sparse[handle.sparse_index].dense_index = first_free_sparse; // Push to free list
        first_free_sparse = handle.sparse_index;
    }

    bool IsValid(ParticleHandle handle) {
        return sparse[handle.sparse_index].generation == handle.generation;
    }
};
The Performance Showdown: By the Numbers
Now that we’ve built our arsenal of pool designs, let’s put them to the test. We’ll simulate a particle system under stress, measuring three key phases:
- Alloc/Free Time: The cost of initial allocations and deallocations.
- Churn Time: Simulates dynamic gameplay by replacing 500 particles over 1,000 “frames” (frees followed by allocs), testing real-world object turnover. (Note: Times here are purely the accumulated cost of alloc/free calls, excluding setup like shuffling.)
- Iteration Time: Updating all active particles a set number of times, mimicking per-frame logic.
These benchmarks were run with varying total allocations, free rates, and iteration counts to expose each design’s strengths and weaknesses. We’ll walk through each test set, highlighting what the numbers reveal about trade-offs like cache efficiency, allocation speed, and iteration bottlenecks.
Ran on an M4 Pro 12-core MacBook.
Set 1: High Churn (95% Free) - 10,000 Allocs, 9,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
This set models a sparse system with lots of frees (e.g., short-lived particles exploding on impact).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 12.99 | 35.91 | 2.79 |
| Free List Pool | 0.17 | 0.44 | 5.48 |
| Packed Array | 0.27 | 0.39 | 0.31 |
| Slot Map | 0.15 | 1.19 | 0.31 |
Insights:
- The Naive Pool struggles with alloc/free (12.99 ms) due to its O(N) scan for free slots. It’s repeatedly combing through nearly 10,000 elements.
- Yet, its iteration shines (2.79 ms, faster than Free List) thanks to front-filling, which clusters actives at the array’s start for better cache hits despite the if branches.
- Free List dominates churn (0.44 ms total for 500,000 operations) with pure O(1) efficiency, but iteration lags (5.48 ms) from scattered actives causing cache misses.
- Packed Array’s churn is equally impressive (0.39 ms), showing swap-and-pop copies are negligible for small structs.
- Slot Map adds a small churn overhead (1.19 ms) from indirection, but matches Packed’s blazing iteration (0.31 ms) while providing safe handles.
Set 2: Low Churn (5% Free) - 10,000 Allocs, 500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
Here, we have a dense system with few frees (e.g., long-lived entities like NPCs).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 12.94 | 613.37 | 6.51 |
| Free List Pool | 0.13 | 0.51 | 7.63 |
| Packed Array | 0.15 | 0.80 | 6.92 |
| Slot Map | 0.15 | 1.69 | 7.00 |
Insights:
- Naive’s churn balloons to 613 ms in this dense setup: each alloc must scan almost the full array to find rare free slots, making it painfully inefficient.
- Its iteration holds up well (6.51 ms) due to clustering, and with 9,500 actives, the if checks become less of a bottleneck.
- Free List and Packed Array breeze through churn (under 1 ms total), as frees quickly create reusable slots without heavy costs.
- Slot Map’s churn is a tad higher (1.69 ms) from handle and generation management.
- Iteration sees packed designs pull ahead slightly, benefiting from high active density, while Free List trails due to minor scattering from churn.
Set 3: High Alloc/Free Cost - 100,000 Allocs, 99,500 Frees, 1,000 Iterations, 500 Churn/1,000 Frames
Scaling up to test raw alloc/free scalability (e.g., massive simulations).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 1244.46 | 34.84 | 25.13 |
| Free List Pool | 1.41 | 0.46 | 32.69 |
| Packed Array | 2.73 | 0.39 | 0.31 |
| Slot Map | 1.62 | 1.19 | 0.31 |
Insights:
- Naive buckles under the scale (1,244 ms alloc/free) with its O(N²)-esque scanning, completely unusable for massive pools.
- Free List and Slot Map scale effortlessly with O(1) operations, keeping times low.
- Packed’s initial frees take a hit (2.73 ms) from numerous swaps, but its churn stays minimal (0.39 ms).
- Iteration amplifies the packed advantage (0.31 ms vs. 25-33 ms), as they process only 500 actives while others slog through 100,000 slots.
- With abundant free slots post-initial frees, churn remains tiny for all, but Free List and Packed edge out with sub-0.5 ms totals.
Set 4: High Iteration Cost - 10,000 Allocs, 9,500 Frees, 100,000 Iterations, 500 Churn/1,000 Frames
Cranking iterations to extremes (e.g., stress-testing update loops).
| Design | Alloc/Free Time (ms) | Churn Time (ms) | Iteration Time (ms) |
|---|---|---|---|
| Naive Pool | 12.78 | 35.44 | 273.18 |
| Free List Pool | 0.15 | 0.43 | 542.18 |
| Packed Array | 0.26 | 0.38 | 30.79 |
| Slot Map | 0.15 | 1.17 | 30.62 |
Insights:
- Iteration dominates, magnifying gaps: Packed and Slot Map wrap up in ~30 ms (handling 500 actives 100,000 times), while Naive (273 ms) and Free List (542 ms) burn time on empty slots.
- Naive’s clustering pays off again, outpacing Free List in iteration; its naive strategy accidentally optimizes for density.
- Churn aligns with other sets (all under 2 ms except Naive’s 35 ms), underscoring that for iteration-heavy workloads, packed designs deliver 9-18x speedups.
- Free List’s churn is stellar (0.43 ms), but if looping is your hot path, the cache costs make it lag.
What We’ve Learned from the Numbers
The big lesson? There’s no universal “best”; it hinges on your priorities. High churn? Lean toward the Free List for raw efficiency. Iteration-heavy? Packed or Slot Map will transform your frame times. Need safety too? The Slot Map reigns supreme as the balanced powerhouse. Always benchmark in your real app: these tests are a starting point, but your data patterns might flip the script!
Conclusion
We have moved from the raw power of the arena to specialized tools. The Free List, Packed Array, and Slot Map aren’t just algorithms. They are solutions to specific data problems.
Data-oriented design isn’t about abstract rules. It is about understanding hardware reality: cache lines, pointer indirection, and branch prediction. When you align your data structure with your access pattern, you get performance for free. These aren’t optimizations you sprinkle on top later. They are architectural decisions that make your system robust and scalable from the start.