With the theory of memory ordering from Part 2, we are now ready to begin removing the global lock from our deque. We will start with the “fast path”: the functions used exclusively by the owner thread.
The owner is the only thread that can push new tasks onto the deque (push_bottom) and the primary thread that takes tasks for itself (pop_bottom). Our goal is to make these operations lock-free while preparing for the eventual introduction of thief threads.
The New Blueprint: Preparing for Contention
First, our deque’s structure is updated. The top and bottom counters are promoted to std::atomic, which guarantees that operations on them are indivisible and allows us to specify memory ordering constraints.
More importantly, we must prevent false sharing. The bottom index is primarily written to by the owner. The top index is written to by thieves and, in the special case of popping the last item, by the owner. Since different cores may be writing to these two variables, they must not be on the same cache line. If they were, a write to bottom on Core 1 would invalidate the cache line on Core 2. When Core 2 then needs to write to top, it would cause a cache miss, forcing a slow data fetch from a higher-level cache or main memory. This constant, expensive invalidation cycle is known as cache line ping-ponging. We prevent it by using alignas to force each counter onto its own cache line.
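One practical caveat before the code: some standard libraries still do not define std::hardware_destructive_interference_size, even in C++17 mode. A minimal fallback sketch, guarded by the library feature-test macro (the constant name k_cache_line is our own illustration, not part of the deque):

#include <cstddef>
#include <new>

// Fall back to a common cache-line size on toolchains that do not ship
// the C++17 interference-size constants. `k_cache_line` is an
// illustrative name, not part of the deque's interface.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t k_cache_line = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t k_cache_line = 64; // typical on x86-64 and many ARM cores
#endif

Anywhere the code below uses std::hardware_destructive_interference_size, k_cache_line could be substituted on such toolchains.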
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new> // std::hardware_destructive_interference_size

template<typename T>
class WSDeque {
public:
    WSDeque(std::size_t capacity_hint) { /* ... */ }
    ~WSDeque() { delete[] m_buffer; }

    bool push_bottom(T item);
    bool pop_bottom(T* out);
    bool steal_top(T* out); // To be implemented in Part 4

private:
    std::size_t m_capacity, m_mask; // capacity is a power of two; m_mask = m_capacity - 1
    T* m_buffer;                    // ring buffer backing the deque

    // Each counter lives on its own cache line to prevent false sharing.
    alignas(std::hardware_destructive_interference_size)
    std::atomic<std::uint64_t> m_bottom; // next free slot; written by the owner
    alignas(std::hardware_destructive_interference_size)
    std::atomic<std::uint64_t> m_top;    // oldest item; advanced by thieves
};
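The constructor body is elided above. Because the indexing in push_bottom and pop_bottom uses b & m_mask, the capacity must be a power of two. Under that assumption, a minimal sketch of what the body might look like (not necessarily the final implementation):

WSDeque(std::size_t capacity_hint) {
    // Round the hint up to a power of two so `index & m_mask`
    // wraps correctly around the ring buffer.
    std::size_t cap = 1;
    while (cap < capacity_hint) {
        cap <<= 1;
    }
    m_capacity = cap;
    m_mask = cap - 1;
    m_buffer = new T[cap];
    m_bottom.store(0, std::memory_order_relaxed);
    m_top.store(0, std::memory_order_relaxed);
}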
The Owner’s push_bottom: A Producer-Consumer Operation
The push_bottom function acts as a “producer,” making work available for “consumers” (the thieves). This requires a standard acquire-release handshake to ensure visibility.
template<typename T>
bool WSDeque<T>::push_bottom(T item) {
    // 1. Load `bottom`. Relaxed is sufficient because only this thread,
    //    the owner, ever modifies `bottom`. We are just reading our own
    //    state, so no synchronization with other threads is needed here.
    std::uint64_t b = m_bottom.load(std::memory_order_relaxed);

    // 2. Load `top`. An acquire load synchronizes with the CAS in
    //    `steal_top`, guaranteeing we observe the steals thieves have
    //    published and therefore have an up-to-date view of the deque's
    //    size before proceeding.
    std::uint64_t t = m_top.load(std::memory_order_acquire);

    if (b - t >= m_capacity) {
        return false; // Deque is full.
    }

    // 3. Write the item to the buffer. This is a plain write. Atomics
    //    are not needed because this memory slot is effectively private;
    //    no other thread can access it until we publish the change.
    m_buffer[b & m_mask] = item;

    // 4. Publish the new item. The `release` store guarantees that any
    //    thread that acquire-loads the new `bottom` value will also see
    //    the buffer write above. This makes the new task available for
    //    stealing.
    m_bottom.store(b + 1, std::memory_order_release);
    return true;
}
The Owner’s pop_bottom: Handling a Race Condition
Popping from the bottom is more complex. While only the owner can initiate this operation, a thief could be trying to steal the very same item if the deque only has one element left. This creates a potential race condition that our code must handle robustly.
The strategy, well established in the work-stealing literature, is a defensive, three-step algorithm: Claim, Fence, Check.
template<typename T>
bool WSDeque<T>::pop_bottom(T* out) {
    std::uint64_t b = m_bottom.load(std::memory_order_relaxed);

    // An initial `acquire` load on `top` allows us to fail fast if the
    // deque appears empty. This is a quick check, not the definitive one.
    std::uint64_t t = m_top.load(std::memory_order_acquire);
    if (t >= b) {
        return false; // Deque appears empty.
    }

    // 1. CLAIM: Tentatively claim an item by decrementing `bottom`.
    //    As the formal proof confirms, a `relaxed` store is sufficient
    //    here. We are making a cheap, local claim; the powerful fence
    //    that follows is what orders it against the other threads.
    b--;
    m_bottom.store(b, std::memory_order_relaxed);

    // 2. FENCE: Establish a global ordering point. A `seq_cst`
    //    `atomic_thread_fence` acts as a strong barrier: it prevents the
    //    load of `top` below from being reordered before the store to
    //    `bottom` above, and orders both against the `seq_cst`
    //    operations performed by thieves.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // 3. CHECK: Load `top` again. This second load is ESSENTIAL. The
    //    value of `t` from before the fence may be stale. We need a
    //    fresh value of `top`, consistent with the memory state *after*
    //    our fence, to correctly detect a race with a thief.
    t = m_top.load(std::memory_order_relaxed);

    if (t <= b) {
        // Our claim is valid; there is at least one item. It is now safe
        // to read from the buffer.
        T item = m_buffer[b & m_mask];
        if (t == b) { // This was the last item; we must resolve the race.
            // We use a strong `seq_cst` CAS to definitively win the race.
            // If a thief has advanced `top` since our fence, this
            // operation will fail.
            if (!m_top.compare_exchange_strong(t, t + 1,
                    std::memory_order_seq_cst, std::memory_order_relaxed)) {
                // We lost the race. Restore `bottom` and fail.
                m_bottom.store(b + 1, std::memory_order_relaxed);
                return false;
            }
            // We won. Restore `bottom`; the deque is now empty
            // (top == bottom == b + 1).
            m_bottom.store(b + 1, std::memory_order_relaxed);
        }
        *out = item;
        return true;
    } else {
        // A thief stole the last item after our claim. The deque is
        // now empty. Restore `bottom` and fail.
        m_bottom.store(b + 1, std::memory_order_relaxed);
        return false;
    }
}
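To make the owner's fast path concrete, here is a hedged usage sketch: a single owner thread pushing and then popping plain int tasks. It assumes the elided constructor initializes the buffer and counters as sketched earlier; no thieves exist yet, so the owner has the deque to itself.

#include <cstdio>

int main() {
    WSDeque<int> deque(64); // capacity hint; tasks are plain ints here

    // The owner produces work...
    for (int i = 0; i < 5; ++i) {
        if (!deque.push_bottom(i)) {
            std::printf("deque full, task %d dropped\n", i);
        }
    }

    // ...and consumes it LIFO from the bottom, so the most recently
    // pushed (cache-hot) task runs first: 4, 3, 2, 1, 0.
    int task;
    while (deque.pop_bottom(&task)) {
        std::printf("owner ran task %d\n", task);
    }
    return 0;
}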
This precise sequence of relaxed operations, fences, and a strong CAS makes the algorithm both correct and performant, on strongly ordered architectures like x86 as well as weakly ordered ones like ARM. We have now built the owner’s half of the lock-free deque: a fast path for pushing and a robust, race-aware path for popping. The owner is ready for the final lesson: unleashing the thieves.