The Performance Trap We All Fall Into
You have a shiny new thread pool. You have a mountain of small tasks to run in parallel. The obvious next step is to create a single, shared queue where all threads can submit work and grab new tasks. It feels clean. It feels fair.
It is a trap.
This central queue becomes a bottleneck, silently strangling your performance. Imagine a stadium with thousands of fans (your tasks) trying to exit through a single turnstile (your queue). It doesn’t matter how fast the fans run; they are all serialized at that one choke point. This is what happens to your threads.
Let’s dissect this turnstile and understand the two invisible monsters that devour your performance.
Monster 1: The Lock
To prevent chaos, a simple shared queue needs a lock (like a std::mutex). Only one thread can hold the lock at a time. When Thread A is pushing a task, Threads B, C, and D are all lined up, waiting. Your parallel system has become a single file line. The very tool you used to enable concurrency is now forcing your threads to wait their turn.
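To make the turnstile concrete, here is a minimal sketch of such a locked shared queue (the class and names are illustrative, not taken from any particular library). Every push and every pop grabs the same mutex, so only one thread makes progress at a time:
#include <mutex>
#include <optional>
#include <queue>

// A naive shared task queue: one lock guards everything.
template<typename Task>
class SharedQueue {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_mtx);   // every producer waits here
        m_tasks.push(std::move(t));
    }
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(m_mtx);   // every consumer waits here too
        if (m_tasks.empty()) return std::nullopt;
        Task t = std::move(m_tasks.front());
        m_tasks.pop();
        return t;
    }
private:
    std::mutex m_mtx;
    std::queue<Task> m_tasks;
};
With, say, eight workers hammering push() and pop(), seven of them are waiting at the turnstile at any given moment.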
Monster 2: The Shared Notebook
This monster is more subtle and far more vicious. To understand it, we need to know one simple thing about modern CPUs: they have their own private, super fast notepads called caches. Instead of working directly with slow main memory (RAM), a CPU core copies the data it needs onto its personal notepad.
But here’s the catch: data is not copied byte by byte. It’s copied in chunks called cache lines, which are typically 64 bytes long.
Now, imagine two coworkers, Alice and Bob, sitting at different desks. They need to keep track of their own, separate to-do lists, but for some reason, both lists are written in the same physical notebook.
- Alice needs to add an item to her list on page 1. She grabs the notebook and starts writing.
- Bob now needs to add an item to his list on page 2. He can’t. The notebook is at Alice’s desk. He has to interrupt her, wait for her to finish, and have the entire notebook passed over to him.
- A moment later, Alice needs to update her list again. The process repeats: she has to interrupt Bob and get the notebook back.
They are constantly fighting over the notebook, even though they are working on completely separate lists! The notebook itself is the bottleneck. This is exactly what happens inside your computer. The notebook is the cache line, and this phenomenon is called false sharing.
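Here is a minimal sketch of the notebook problem (the struct and iteration counts are illustrative, and the 64 bytes is the typical cache line size mentioned above). Two threads increment two different counters, yet because both counters almost certainly share one cache line, that line ping pongs between the cores just like the notebook between Alice and Bob:
#include <atomic>
#include <thread>

// Two logically independent counters that end up in the same cache line.
struct SharedNotebook {
    std::atomic<long> alice{0};   // Alice's "page"
    std::atomic<long> bob{0};     // Bob's "page" -- adjacent, same 64-byte line
};

// The fix: give each counter its own cache line.
struct SeparateNotebooks {
    alignas(64) std::atomic<long> alice{0};
    alignas(64) std::atomic<long> bob{0};
};

int main() {
    SharedNotebook nb;
    std::thread t1([&] { for (int i = 0; i < 50'000'000; ++i) nb.alice.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 50'000'000; ++i) nb.bob.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    // Swap SharedNotebook for SeparateNotebooks and this typically runs several
    // times faster, even though not a single line of logic changed.
}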
A New Way of Thinking: From a Pile of Bricks to a Blueprint
The turnstile problem is often caused by thinking about work as a static “pile of bricks.” The work stealing model uses a more powerful paradigm. Instead of a pile of bricks, you give a worker a blueprint.
Imagine the main task is UpdateSceneGraph(rootNode). You give this single task to Worker 1. It looks at the rootNode and sees it has four children: A, B, C, and D. Instead of processing them sequentially, it realizes these are four large, independent sub problems. It generates four new sub tasks and pushes them onto its own personal to-do list.
This is the core idea: a task generates more, smaller tasks for its own thread. Work is not a static pile; it’s a tree that grows organically as it’s explored.
The Magic of the Steal
Now, what happens when Worker 2 becomes idle? It steals one task from Worker 1’s list. But which one? It steals the oldest one: UpdateSceneGraph(A). This is the “Aha!” moment. Worker 2 didn’t just steal a single brick. It stole the entire blueprint for subtree A. The thief has become an owner, now self sufficient and exploring its own branch of the work tree.
The LIFO/FIFO Magic for the Task Tree
To support this model, we need a special kind of to-do list. A deque (double ended queue) is designed for this exact purpose. The two ends are not treated equally; each is optimized for a different kind of user.
- The Owner works at the bottom (LIFO): When a worker spawns new sub tasks, it pushes them onto the bottom of its own deque. It then immediately pops a task from the bottom to work on. This Last-In, First-Out (LIFO) behavior is like a stack. It has fantastic cache performance, because the thread is always working on the freshest, newest data, which is almost certainly still in its local cache.
- Helpers steal from the top (FIFO): When another worker runs out of tasks, it steals from the top, the oldest task in the deque. This First-In, First-Out (FIFO) behavior is critical. By stealing the oldest task, the thief gets a large, coarse grained chunk of work. This maximizes the chance that the thief can become self sufficient for a long time, and it minimizes contention with the owner, who is busy working at the opposite end of the deque. (A short sketch of both loops follows this list.)
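The operation names push_bottom, pop_bottom, and steal_top below match the deque we are about to build; the surrounding worker logic is only an illustrative skeleton (victim selection, idling, and termination are left out), not a real scheduler:
#include <functional>

using Task = std::function<void()>;

// Illustrative skeletons only. `MyDeque` stands for any deque exposing the
// three operations built below: push_bottom / pop_bottom / steal_top.
template<typename MyDeque>
void owner_loop(MyDeque& my_deque) {
    Task task;
    // Work at the bottom (LIFO): newest, cache-hot tasks first.
    while (my_deque.pop_bottom(&task)) {
        task();   // running a task may push_bottom() more tasks onto this same deque
    }
    // Deque drained: this worker now goes looking for a victim to steal from.
}

template<typename MyDeque>
void thief_loop(MyDeque& victim_deque) {
    Task stolen;
    // Steal from the top (FIFO): the oldest, coarsest-grained task available.
    if (victim_deque.steal_top(&stolen)) {
        stolen();   // the thief now owns a whole subtree of the work tree
    }
}
The owner treats its own deque like a private stack; a thief dips into someone else's deque like a public queue, and only when its own is empty.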
Building Our First Deque: The Training Wheels Version
Before we can build a lock free beast, we must first build a simple, correct, and easy to understand version. This “training wheels” deque will use a single lock to guarantee correctness. It will be slow, but it will teach us the fundamental behavior we need to preserve when we remove the lock later.
A Deliberate Design: The Circular Buffer
A keen C++ developer might ask, “Why are we building a data structure from scratch? The standard library has std::deque, which provides push_back, pop_back, and pop_front. Couldn’t we just wrap one in a std::mutex and be done?”
This is an excellent question, and the answer reveals our strategy. We are not just building a locked queue; we are building a blueprint for our final lock free version. There are two key reasons for this manual approach:
- Performance & Predictability: std::deque is a complex container designed to grow dynamically. This growth can involve memory reallocations, which are unpredictable and can be slow poison for a high performance scheduler. By using a simple, fixed size array that wraps around on itself (a circular buffer), we guarantee zero allocations after initialization and predictable, cache friendly memory access.
- Preparing for the Lock Free Leap: The entire purpose of this journey is to create a data structure where threads can operate simultaneously without waiting. The vision is a parallel dance: the owner thread exclusively manages the bottom of the deque, while multiple "thief" threads compete to take work from the top. To make this possible, the two ends must be controlled by independent variables. The bottom counter is the owner's private territory, while the top counter is the thieves' public one. These counters are the fundamental control levers of the lock free algorithm. This is the primary reason we don't use std::deque. Its internal workings are a black box; it doesn't expose the separate top and bottom levers we need to manipulate. We build our deque with these two counters from day one so we can perfect the logic. In this version, a single std::mutex locks everything. But we are building the correct shape.
Assembling the Pieces
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint64_t, INTPTR_MAX
#include <mutex>     // std::mutex, std::unique_lock
#include <vector>    // std::vector

// A helper to round any number up to the next power of two.
static std::size_t next_pow2(std::size_t x) {
    if (x == 0) return 1;
    --x;
    // Smear the highest set bit into every lower position...
    x |= x >> 1; x |= x >> 2; x |= x >> 4;
    x |= x >> 8; x |= x >> 16;
#if INTPTR_MAX == INT64_MAX
    x |= x >> 32;   // skipped on 32-bit targets, where this shift would be undefined
#endif
    // ...then add one to land on the next power of two.
    return x + 1;
}
template<typename T>
class DequeLocked {
public:
    explicit DequeLocked(std::size_t capacity_hint) {
        m_capacity = next_pow2(capacity_hint);
        m_mask = m_capacity - 1;            // works because m_capacity is a power of two
        m_buffer.resize(m_capacity);
    }

    // Owner only: push a new task onto the bottom (LIFO end).
    bool push_bottom(T item) {
        std::unique_lock<std::mutex> lock(m_mtx);
        if (m_bottom - m_top >= m_capacity) return false;   // deque is full
        m_buffer[m_bottom & m_mask] = item;
        m_bottom++;
        return true;
    }

    // Owner only: pop the most recently pushed task from the bottom.
    bool pop_bottom(T* out) {
        std::unique_lock<std::mutex> lock(m_mtx);
        if (m_bottom == m_top) return false;                // deque is empty
        m_bottom--;
        *out = m_buffer[m_bottom & m_mask];
        return true;
    }

    // Thieves: steal the oldest task from the top (FIFO end).
    bool steal_top(T* out) {
        std::unique_lock<std::mutex> lock(m_mtx);
        if (m_bottom == m_top) return false;                // nothing to steal
        *out = m_buffer[m_top & m_mask];
        m_top++;
        return true;
    }

private:
    std::vector<T> m_buffer;                // fixed-size circular storage
    std::size_t m_capacity = 0, m_mask = 0;
    std::uint64_t m_top = 0, m_bottom = 0;  // odometer-style indices, never reset
    std::mutex m_mtx;                       // the training wheels: one lock for everything
};
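A quick usage sketch of the training wheels deque (the item count and the single thief thread are arbitrary, chosen only to exercise all three operations):
#include <cstdio>
#include <thread>

int main() {
    // A capacity hint of 100 is rounded up to 128 by next_pow2.
    DequeLocked<int> deque(100);

    // The owner pushes at the bottom...
    for (int i = 0; i < 8; ++i) deque.push_bottom(i);

    // ...a thief takes the oldest items from the top...
    std::thread thief([&] {
        int stolen;
        while (deque.steal_top(&stolen))
            std::printf("thief stole %d\n", stolen);   // 0, 1, 2, ... (FIFO)
    });

    // ...while the owner keeps popping the newest items from the bottom.
    int mine;
    while (deque.pop_bottom(&mine))
        std::printf("owner popped %d\n", mine);        // 7, 6, 5, ... (LIFO)

    thief.join();
}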
Dissecting the Code: The Circular Buffer and the Power of Two Trick
The heart of this implementation is the circular buffer, managed by the m_top and m_bottom indices. Let’s break down how this works.
A circular buffer is simply a fixed size array where we pretend the end connects back to the beginning, like a ring. This allows us to store a queue of items without ever needing to shift elements or reallocate memory.
The key insight is how we manage the indices. m_top and m_bottom are unbounded 64-bit integers. They act like odometers: they only ever increase. They do not reset to zero when they pass the end of the array. The “wrapping around” logic is handled separately.
- m_bottom is the index where the next item will be pushed.
- m_top is the index of the first valid item in the deque.
- The number of items in the deque is simply m_bottom - m_top.
To map these ever increasing, unbounded indices to a valid slot in our fixed size m_buffer, we must use the modulo operator. For example, if our buffer has a capacity of 16, and m_bottom is currently 21, the correct physical slot is 21 % 16, which is 5.
This brings us to the curious code in the constructor:
m_capacity = next_pow2(capacity_hint);
m_mask = m_capacity - 1;
And the buffer access: m_buffer[m_bottom & m_mask]
This is a classic and critical performance optimization. On a CPU, the integer division required by the modulo operator (%) is surprisingly slow. It can take many clock cycles. However, there’s a mathematical shortcut: if your capacity is a power of two (like 16, 32, 64…), then index % capacity is mathematically identical to index & (capacity - 1).
Let’s see it in action with a capacity of 16:
- capacity = 16 (binary 10000)
- mask (capacity - 1) = 15 (binary 01111)
Now, let’s map our unbounded index 21 to a physical slot:
- Slow way: 21 % 16 = 5
- Fast way: 21 in binary is 10101. 10101 & 01111 = 00101, which is 5 in decimal.
A bitwise AND (&) is one of the fastest instructions a CPU can execute, often completing in a single clock cycle. By ensuring our buffer’s capacity is a power of two, we replace every expensive modulo with a lightning fast bitwise AND. We trade a small amount of memory (by rounding up the capacity) for a huge gain in throughput, a worthy trade for a core scheduling data structure.
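A tiny standalone sanity check of the trick, separate from the deque itself:
#include <cassert>
#include <cstddef>

int main() {
    const std::size_t capacity = 16;        // must be a power of two
    const std::size_t mask = capacity - 1;  // binary 01111

    // Odometer-style indices keep growing past the buffer size, but the slow
    // modulo and the fast bitwise AND always map them to the same physical slot.
    for (std::size_t index = 0; index < 1000; ++index)
        assert((index % capacity) == (index & mask));

    assert((21 & mask) == 5);   // the worked example above: slot 5
}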
We have successfully built a deque that is correct. Its behavior is perfectly matched to the “Task Tree” paradigm. However, it is also slow. In the next part, we will take the training wheels off by diving into the world of atomic operations and memory ordering.