Parallelization

TreeBarrier

template<typename CompletionFn = EmptyCompletion, class PhaseType = uint32_t> class TreeBarrier

A conventional combining tree barrier.

It is inspired by GCC 15.2’s __tree_barrier, with some important API differences:

Every thread has a unique thread ID in [0, expected-1]. This eliminates the need for hashing the pthread thread IDs and for the inner search loop to find free slots in the tree.
wait() spins for a given number of iterations (spin_count) before falling back to a futex-based atomic wait.
The barrier phase is exposed to the user.
Custom completion functions can be provided at arrival time.
Reductions and broadcasts on small values are supported.
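The dense thread IDs are what keep the arrival path simple: a thread's slot in the tree follows from its ID by arithmetic alone, with no hashing and no search. The sketch below illustrates that idea under stated assumptions: a fan-in of 2, yield-based spinning in place of the futex fallback, and the hypothetical names `ToyTreeBarrier` and `count_completions`. It is not the implementation used by this library or by GCC.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Toy combining tree barrier with fan-in 2 (illustration only).
class ToyTreeBarrier {
  public:
    ToyTreeBarrier(std::uint32_t expected, std::function<void()> completion)
        : expected_(expected), completion_(std::move(completion)),
          counts_(count_nodes(expected)) {
        assert(expected > 0);
        for (auto &c : counts_)
            c.store(0, std::memory_order_relaxed);
    }

    // thread_id must be unique per thread and lie in [0, expected-1].
    void arrive_and_wait(std::uint32_t thread_id) {
        assert(thread_id < expected_);
        const std::uint32_t my_phase = phase_.load(std::memory_order_acquire);
        std::uint32_t idx = thread_id, width = expected_, offset = 0;
        while (width > 1) {
            const std::uint32_t node = idx / 2;          // slot follows from the ID
            const std::uint32_t nodes = (width + 1) / 2; // nodes on this level
            const std::uint32_t fan_in = (2 * node + 1 < width) ? 2 : 1;
            auto &count = counts_[offset + node];
            if (count.fetch_add(1, std::memory_order_acq_rel) + 1 < fan_in) {
                // Not the last arrival at this node: wait for the phase flip.
                while (phase_.load(std::memory_order_acquire) == my_phase)
                    std::this_thread::yield();
                return;
            }
            count.store(0, std::memory_order_relaxed); // reset for the next phase
            idx = node; // the last arrival continues one level up
            offset += nodes;
            width = nodes;
        }
        completion_(); // last arrival overall runs the completion function
        phase_.store(my_phase + 1, std::memory_order_release);
    }

  private:
    // Total number of tree nodes over all levels.
    static std::uint32_t count_nodes(std::uint32_t width) {
        std::uint32_t total = 0;
        while (width > 1) {
            width = (width + 1) / 2;
            total += width;
        }
        return total;
    }

    std::uint32_t expected_;
    std::function<void()> completion_;
    std::vector<std::atomic<std::uint32_t>> counts_;
    std::atomic<std::uint32_t> phase_{0};
};

// Runs `threads` threads through `phases` barrier phases and returns how
// often the completion function ran.
inline std::uint32_t count_completions(std::uint32_t threads,
                                       std::uint32_t phases) {
    std::uint32_t completions = 0; // only touched by the last arrival of a phase
    ToyTreeBarrier barrier(threads, [&] { ++completions; });
    std::vector<std::thread> pool;
    for (std::uint32_t id = 0; id < threads; ++id)
        pool.emplace_back([&barrier, id, phases] {
            for (std::uint32_t p = 0; p < phases; ++p)
                barrier.arrive_and_wait(id);
        });
    for (auto &t : pool)
        t.join();
    return completions;
}
```

In this sketch, each thread increments at most one counter per tree level, and only the last arrival at a node moves up, so contention on any single counter is bounded by the fan-in.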

Public Types

enum class BarrierPhase : PhaseType

Public Functions

inline TreeBarrier(uint32_t expected, CompletionFn completion): Create a barrier with expected participating threads and a completion function that is called by the last thread that arrives at each phase.

TreeBarrier(const TreeBarrier&) = delete

TreeBarrier(TreeBarrier&&) = default

TreeBarrier &operator=(const TreeBarrier&) = delete

TreeBarrier &operator=(TreeBarrier&&) = default

template<class C> inline arrival_token arrive_with_completion(uint32_t thread_id, C &&custom_completion)

Arrive at the barrier with a custom completion function that is called by the last thread that arrives, before advancing the barrier phase and notifying all waiting threads.

The completion function of the barrier is not called in this case. Each thread should use a unique thread ID in [0, expected-1].

inline arrival_token arrive(uint32_t thread_id)

Arrive at the barrier.

The barrier’s completion function is called by the last thread that arrives, before advancing the barrier phase and notifying all waiting threads. Each thread should use a unique thread ID in [0, expected-1].

inline arrival_token arrive(uint32_t thread_id, int line)

Arrive at the barrier, recording the given line number for sanity checking to make sure that all threads arrive from the same line or statement in the source code.

This is useful for debugging, to detect mismatched barrier calls, and is not intended for other uses. If CYQLONE_SANITY_CHECKS_BARRIER is disabled, the line number is ignored and this function is equivalent to arrive(uint32_t). Each thread should use a unique thread ID in [0, expected-1].

inline BarrierPhase current_phase() const

Query the current barrier phase.

May wrap around on overflow, but all threads will see the same phase values in the same order.
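Because the phase may wrap, code that inspects it is safer comparing phases for equality than for order. A small sketch of the wrap-around, assuming an 8-bit underlying type; `ToyPhase`, `next_phase` and `phase_completed` are hypothetical names for illustration and are not part of the documented API:

```cpp
#include <cstdint>

// Hypothetical stand-in for BarrierPhase with an 8-bit PhaseType.
enum class ToyPhase : std::uint8_t {};

// Advance the phase; the conversion back to 8 bits wraps 255 around to 0.
constexpr ToyPhase next_phase(ToyPhase p) {
    return static_cast<ToyPhase>(
        static_cast<std::uint8_t>(static_cast<std::uint8_t>(p) + 1u));
}

// A phase counts as completed once the current phase differs from the one
// the caller arrived in. An ordering test (current > arrived_in) would fail
// at the wrap-around; an equality test does not.
constexpr bool phase_completed(ToyPhase arrived_in, ToyPhase current) {
    return arrived_in != current;
}
```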

inline bool wait_may_block(const arrival_token &token) const noexcept

Check if wait() may block.

If it returns false, the caller can call wait() and it will return immediately without spinning or sleeping. This is useful if the caller has other non-critical work to do while waiting for other threads. Users should still call wait() before arriving again.

Note

This function does not impose any memory ordering, so even when it returns false, changes made before the arrival of other threads may not be visible yet. In contrast, wait() does ensure proper synchronization.

inline void wait(arrival_token &&token) const

Wait for the barrier to complete after an arrival, using the given token.

Separating the arrival and wait phases allows for overlapping computation with waiting, hiding the synchronization latency. Waiting on the same token multiple times is not allowed.
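The split arrive/wait pattern, together with the non-synchronizing wait_may_block() check, can be sketched with a toy centralized barrier. This is a stand-in written for illustration, not TreeBarrier itself. It assumes a plain integer token (the real arrival_token is move-only), takes no thread ID, and uses a condition variable where the library uses a futex-based atomic wait. `ToySplitBarrier` and `overlap_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Toy centralized barrier with split arrive/wait (not TreeBarrier).
class ToySplitBarrier {
  public:
    using token = std::uint32_t; // phase in which the caller arrived

    explicit ToySplitBarrier(std::uint32_t expected) : expected_(expected) {
        assert(expected > 0);
    }

    // Register the arrival and return immediately.
    token arrive() {
        const token my_phase = phase_.load(std::memory_order_acquire);
        if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == expected_) {
            arrived_.store(0, std::memory_order_relaxed); // reset for next phase
            {
                std::lock_guard<std::mutex> lock(mutex_);
                phase_.store(my_phase + 1, std::memory_order_release);
            }
            cv_.notify_all();
        }
        return my_phase;
    }

    // Relaxed check: says whether wait() could block, but does not synchronize.
    bool wait_may_block(token t) const noexcept {
        return phase_.load(std::memory_order_relaxed) == t;
    }

    // Spin for spin_count iterations, then fall back to a blocking wait.
    void wait(token t) {
        for (std::uint32_t i = 0; i < spin_count; ++i)
            if (phase_.load(std::memory_order_acquire) != t)
                return;
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] {
            return phase_.load(std::memory_order_acquire) != t;
        });
    }

    std::uint32_t spin_count = 1000;

  private:
    std::uint32_t expected_;
    std::atomic<std::uint32_t> arrived_{0};
    std::atomic<std::uint32_t> phase_{0};
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Each thread publishes a value, arrives, does private work while the other
// threads catch up, and only reads the shared values after wait().
inline std::vector<long> overlap_demo(std::uint32_t threads) {
    ToySplitBarrier barrier(threads);
    std::vector<long> published(threads), scratch(threads), totals(threads);
    std::vector<std::thread> pool;
    for (std::uint32_t id = 0; id < threads; ++id)
        pool.emplace_back([&, id] {
            published[id] = id + 1;        // result the other threads need
            auto token = barrier.arrive(); // announce it without blocking
            // Optional private work, bounded so the loop always terminates.
            while (barrier.wait_may_block(token) && scratch[id] < 1000)
                ++scratch[id];
            barrier.wait(token); // synchronizes: all published values visible
            for (long v : published)
                totals[id] += v;
        });
    for (auto &t : pool)
        t.join();
    return totals;
}
```

The loop between arrive() and wait() touches only the thread's own scratch slot. Shared data is read only after wait() returns, since the relaxed check alone gives no visibility guarantee.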

inline void arrive_and_wait(uint32_t thread_id): Convenience function to arrive and wait in a single call.

inline void arrive_and_wait(uint32_t thread_id, int line): Convenience function to arrive and wait in a single call (with optional sanity check).

template<class C> inline void arrive_and_wait_with_completion(uint32_t thread_id, C &&custom_completion): Convenience function to arrive and wait in a single call (with custom completion).

template<class C> inline auto arrive_and_wait_with_completion(uint32_t thread_id, C &&custom_completion)

Convenience function to arrive and wait in a single call (with custom completion).

Broadcasts the return value of the custom completion function to all threads.

template<class T, class F> inline arrival_token_typed<T> arrive_reduce(uint32_t thread_id, T x, F reduce)

Combining tree reduction across all threads.

The order in which reduce is applied is deterministic for a given number of threads.

template<class T> inline T wait_reduce(arrival_token_typed<T> &&token): Wait for the result of an arrive_reduce call and obtain the reduced value.

template<class T, class F> inline T reduce(uint32_t thread_id, T x, F reduce)

Combining tree reduction across all threads.

The order in which reduce is applied is deterministic for a given number of threads.
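A fixed application order matters when the reduction operator is not associative, as with floating-point addition: the result then depends only on the inputs and the thread count, not on which thread arrives first. The sequential model below makes such an order visible. It assumes a fan-in of 2 with neighbouring thread IDs paired at each level; the tree shape the library actually uses may differ, and `tree_reduce` and `show_order` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sequential model of a fan-in-2 combining tree: values[i] is the input of
// thread i. Neighbouring entries are combined level by level.
template <class T, class F>
T tree_reduce(std::vector<T> values, F reduce) {
    assert(!values.empty());
    while (values.size() > 1) {
        std::vector<T> next;
        for (std::size_t i = 0; i < values.size(); i += 2)
            next.push_back(i + 1 < values.size()
                               ? reduce(std::move(values[i]),
                                        std::move(values[i + 1]))
                               : std::move(values[i])); // odd one out moves up
        values = std::move(next);
    }
    return std::move(values.front());
}

// Renders the application order for the given number of threads as a
// parenthesized expression over the thread IDs.
inline std::string show_order(std::size_t threads) {
    std::vector<std::string> ids;
    for (std::size_t i = 0; i < threads; ++i)
        ids.push_back(std::to_string(i));
    return tree_reduce(ids, [](std::string a, std::string b) {
        return "(" + a + "+" + b + ")";
    });
}
```

Under these assumptions, `show_order(5)` yields `(((0+1)+(2+3))+4)`. The grouping changes with the thread count but not between runs with the same count.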

template<class T> inline T broadcast(uint32_t thread_id, T &&x, uint32_t src = 0)

Broadcast a value from the source thread to all other threads.

All threads must call this function with the same source thread ID.
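The semantics of a broadcast can be modelled with a shared slot and two barrier phases: the source publishes, every thread copies, and a second phase makes the slot safe to reuse. This is a sketch under those assumptions. TreeBarrier may well transport the value through the tree itself rather than through a separate slot. `ToyBarrier`, `toy_broadcast` and `broadcast_demo` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking barrier (stand-in, not TreeBarrier).
class ToyBarrier {
  public:
    explicit ToyBarrier(std::uint32_t expected) : expected_(expected) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::uint32_t gen = generation_;
        if (++arrived_ == expected_) {
            arrived_ = 0; // last arrival opens the next generation
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
    }

  private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::uint32_t expected_, arrived_ = 0, generation_ = 0;
};

// Every thread calls this with the same src; only the value passed by the
// source thread is returned to all of them.
template <class T>
T toy_broadcast(ToyBarrier &barrier, T &slot, std::uint32_t thread_id, T x,
                std::uint32_t src = 0) {
    if (thread_id == src)
        slot = std::move(x);   // only the source writes the slot
    barrier.arrive_and_wait(); // the value is now visible to every thread
    T result = slot;           // every thread takes its own copy
    barrier.arrive_and_wait(); // after this, the slot may be reused
    return result;
}

// Each thread offers a different value; all receive the one from src.
inline std::vector<int> broadcast_demo(std::uint32_t threads,
                                       std::uint32_t src) {
    assert(src < threads);
    ToyBarrier barrier(threads);
    int slot = 0;
    std::vector<int> received(threads);
    std::vector<std::thread> pool;
    for (std::uint32_t id = 0; id < threads; ++id)
        pool.emplace_back([&, id] {
            int mine = 100 + static_cast<int>(id);
            received[id] = toy_broadcast(barrier, slot, id, mine, src);
        });
    for (auto &t : pool)
        t.join();
    return received;
}
```

If threads disagreed on `src` in this model, either no thread or several threads would write the slot, which is why all callers must pass the same source ID.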

Public Members

uint32_t spin_count = 1000: Number of spin iterations before falling back to futex-based wait.

Public Static Functions

static inline uint32_t max(): Maximum number of threads supported by this barrier implementation.

class arrival_token

Subclassed by cyqlone::TreeBarrier<CompletionFn, PhaseType>::arrival_token_typed<T>

Public Functions

inline explicit arrival_token(BarrierPhase phase)

arrival_token(const arrival_token &phase) = delete

arrival_token(arrival_token &&phase) = default

arrival_token &operator=(const arrival_token &phase) = delete

arrival_token &operator=(arrival_token &&phase) = default

inline BarrierPhase get() const noexcept

template<class T> class arrival_token_typed : public cyqlone::TreeBarrier<CompletionFn, PhaseType>::arrival_token

Public Functions

inline BarrierPhase get() const noexcept

struct EmptyCompletion

No-op completion function for the TreeBarrier.

Public Functions

inline void operator()() const noexcept: Does nothing.

SharedContext

struct SharedContext

Abstraction for a parallel execution context: a set of threads that can synchronize and communicate with each other using barriers.

Context

template<class SC> struct Context

Thread context for parallel execution.

Each thread has a unique thread index, and can synchronize and communicate with other threads in the same shared context.
