We have the Owner’s code working. Now we need to implement the Thief’s logic, `steal_top`.
This function is called by other threads when they run out of work. They look at our deque and try to grab the oldest task.
The Thief’s `steal_top`
The logic here mirrors the Owner’s pop_bottom. It also needs to participate in the seq_cst handshake to handle the race for the last item.
A note on the ABA Problem: In many lock-free data structures, if a value changes from A to B and back to A, a thread might think nothing changed. Here, because our `top` and `bottom` indices are strictly increasing 64-bit integers, they will practically never wrap around (it would take centuries). This makes the ABA problem impossible for all practical purposes, simplifying our logic significantly compared to a pointer-based stack.
```cpp
template<typename T>
bool WSDeque<T>::steal_top(T* out) {
    // 1. Load `top`. RELAXED is enough here: this is just
    //    the index we will try to claim with the CAS below.
    std::uint64_t t = m_top.load(std::memory_order_relaxed);

    // 2. FENCE. This participates in the same seq_cst order
    //    as the fence in `pop_bottom`.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // 3. Load `bottom`. ACQUIRE pairs with the RELEASE store
    //    in `push_bottom`, so if we see the published bottom value,
    //    we also see the data written to the buffer.
    std::uint64_t b = m_bottom.load(std::memory_order_acquire);

    if (t < b) {
        // There is at least one item.
        // It is safe to read from the buffer *before* the CAS because
        // we know the item exists at index `t`.
        T item = m_buffer[t & m_mask];

        // 4. CAS. We try to increment `top` to claim the item.
        //    This is the final arbiter. If multiple thieves (or the owner)
        //    are fighting for this item, only one CAS will succeed.
        if (m_top.compare_exchange_strong(t, t + 1,
                std::memory_order_seq_cst,
                std::memory_order_relaxed)) {
            // Success! We stole the item.
            *out = item;
            return true;
        }
    }

    // Failed. Either empty, or we lost the race.
    return false;
}
```
How It All Fits Together
The correctness of this deque relies on the interactions between the fences and atomic operations.
1. Push vs. Steal (Producer-Consumer)
- Owner: Writes data ->
releasestore tobottom. - Thief:
acquireload ofbottom-> Reads data. - Result: Safe. The thief is guaranteed to see the valid data if it sees the updated index.
2. Steal vs. Steal (Thieves Fighting)
- Two thieves see the same
top. - Both try to CAS
top->top + 1. - Result: The hardware guarantees only one CAS succeeds. The winner gets the task. The loser gets nothing.
3. Pop vs. Steal (The Critical Race)
- Owner: Decrements
bottom, runsseq_cstfence, checkstop. - Thief: Runs
seq_cstfence, checksbottom. - Result: The
seq_cstfences protect the decision before the CAS.
The CAS is the final arbiter when the Owner and a Thief are both trying to claim the same last item. Only one top update can win.
The subtle part is that the Owner does not always use that CAS. After decrementing bottom, if the Owner reads top < bottom, it assumes there is still more than one item and takes the bottom slot without touching top. The fence is what makes that no-CAS decision safe. It prevents the Owner from deciding “I am not racing for the last item” from one view of top and bottom while Thieves, from another view, can advance top far enough to claim that same slot.
That no-CAS path matters for performance too: when the Owner can prove it is not racing for the last item, it avoids an unnecessary CAS on the common path.
The CAS protects the actual `top` claim; the fence protects the decision of whether the owner has to enter that CAS race at all.
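For reference, here is a condensed, self-contained sketch of the Owner’s side showing both paths: the no-CAS fast path and the CAS on the last item. The member layout and the `push_bottom` helper are assumptions made so the example compiles on its own (the real versions were developed earlier in the series), and in the real deque the atomics would be padded onto separate cache lines to avoid false sharing. One detail worth flagging: the comparisons are done as signed 64-bit values, because `bottom - 1` transiently underflows when popping from an empty deque.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

template<typename T>
struct WSDeque {
    std::atomic<std::uint64_t> m_top{0};
    std::atomic<std::uint64_t> m_bottom{0};
    std::uint64_t m_mask;           // capacity - 1 (capacity is a power of two)
    std::vector<T> m_buffer;

    explicit WSDeque(std::uint64_t capacity_pow2)
        : m_mask(capacity_pow2 - 1), m_buffer(capacity_pow2) {}

    // Owner only. Write the item first, then publish it with a
    // release store that pairs with the acquire load in steal_top.
    void push_bottom(T item) {
        std::uint64_t b = m_bottom.load(std::memory_order_relaxed);
        m_buffer[b & m_mask] = item;
        m_bottom.store(b + 1, std::memory_order_release);
    }

    // Owner only.
    bool pop_bottom(T* out) {
        std::uint64_t b = m_bottom.load(std::memory_order_relaxed) - 1;
        m_bottom.store(b, std::memory_order_relaxed);
        // Same seq_cst handshake as the fence in steal_top.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        std::uint64_t t = m_top.load(std::memory_order_relaxed);

        // Signed view: b underflows when popping an empty deque.
        auto st = static_cast<std::int64_t>(t);
        auto sb = static_cast<std::int64_t>(b);

        if (st < sb) {
            // More than one item: take the bottom slot, no CAS needed.
            *out = m_buffer[b & m_mask];
            return true;
        }
        if (st == sb) {
            // Exactly one item: race thieves with a CAS on top.
            bool won = m_top.compare_exchange_strong(t, t + 1,
                std::memory_order_seq_cst, std::memory_order_relaxed);
            m_bottom.store(b + 1, std::memory_order_relaxed);
            if (won) *out = m_buffer[b & m_mask];
            return won;
        }
        // Empty: undo the decrement.
        m_bottom.store(b + 1, std::memory_order_relaxed);
        return false;
    }
};
```

Single-threaded, this behaves like a stack for the Owner: pushes of 1, 2, 3 pop back as 3, 2, 1, and a further pop fails.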
Practical Tips for the Scheduler
Now that you have a deque, you need a scheduler to use it. Here are two tips for building a robust work-stealing thread pool.
1. Random Victim Selection
When a thief needs to steal, which deque should it target?
Don’t use a complex heuristic. Just pick a random thread. It’s statistically robust and simple.
Crucial: Do not use a shared stateful random number generator (like `std::mt19937` or `rand()`) protected by a lock. That introduces the same contention we just fought to remove. Instead, use a `thread_local` generator for each thread or a stateless hash function.
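A minimal sketch of the `thread_local` approach. The function name and parameters (`self` is the calling thread’s index, `num_threads` the pool size, assumed to be at least 2) are illustrative, not part of the deque:

```cpp
#include <cstddef>
#include <random>

// Pick a random victim that is not ourselves. Each thread gets its
// own generator, so there is no shared state and no lock.
std::size_t pick_victim(std::size_t self, std::size_t num_threads) {
    thread_local std::mt19937 rng{std::random_device{}()};
    // Draw from [0, num_threads - 2] ...
    std::uniform_int_distribution<std::size_t> dist(0, num_threads - 2);
    std::size_t v = dist(rng);
    // ... and shift past our own index so the result covers the
    // other num_threads - 1 slots uniformly.
    return (v >= self) ? v + 1 : v;
}
```

The skip-self trick keeps the distribution uniform over the other threads without a retry loop.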
2. Exponential Backoff
If a thief tries to steal from everyone and fails, the system is likely under load (or empty). Don’t spin tightly in a loop. You’ll burn CPU cycles and slow down the working threads by contending for cache lines. Implement an exponential backoff:
- Yield (`std::this_thread::yield()`).
- Sleep for 1 microsecond.
- Sleep for 2 microseconds… up to a cap (e.g., 1 ms).

Reset the backoff as soon as you find work.
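The policy above can be packaged as a small helper; this is a sketch (the class name and the `delay_us` accessor are mine, and the 1 µs start / 1 ms cap come from the text). The thief calls `pause()` after each failed round of steal attempts and `reset()` whenever it finds work:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

class Backoff {
public:
    void pause() {
        if (m_delay_us == 0) {
            // First failure: just yield, don't sleep yet.
            std::this_thread::yield();
            m_delay_us = 1;
            return;
        }
        std::this_thread::sleep_for(std::chrono::microseconds(m_delay_us));
        // Double the delay each time, capped at 1 ms.
        m_delay_us = std::min<std::uint64_t>(m_delay_us * 2, 1000);
    }

    void reset() { m_delay_us = 0; }          // call on successful steal/pop

    std::uint64_t delay_us() const { return m_delay_us; }

private:
    std::uint64_t m_delay_us = 0;
};
```

The sequence yield, 1 µs, 2 µs, 4 µs, … converges to the cap quickly, so idle thieves cost almost nothing while busy threads keep their cache lines to themselves.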
Conclusion
We started this series with a problem: a dynamic, unpredictable task tree that couldn’t be efficiently managed by a single shared queue. We saw how locks create bottlenecks and how false sharing silently kills performance.
By building this lock-free deque, we’ve created a structure that allows every thread to work independently on its own branch of that tree, only interacting when absolutely necessary. This pattern powers systems like Intel TBB and Go’s scheduler.
Thanks to Jennifer Chukwu for pointing out that the initial top load does not need acquire ordering, and for the question that helped clarify the CAS/fence distinction.