Parallelization

Parallelization#

TreeBarrier#

template<typename CompletionFn = EmptyCompletion, class PhaseType = uint32_t>
class TreeBarrier

Fairly vanilla combining tree barrier.

It is inspired by GCC 15.2’s __tree_barrier, with some important API differences:

  • Every thread has a unique thread ID in [0, expected-1]. This eliminates the need for hashing the pthread thread IDs and for the inner search loop to find free slots in the tree.

  • Wait tries to spin for a given number of iterations before falling back to a futex-based atomic wait.

  • The barrier phase is exposed to the user.

  • Custom completion functions can be provided at arrival time.

  • Reductions and broadcasts on small values are supported.

Public Types

enum class BarrierPhase : PhaseType

Values:

Public Functions

inline TreeBarrier(uint32_t expected, CompletionFn completion)

Create a barrier with expected participating threads and a completion function that is called by the last thread that arrives at each phase.

TreeBarrier(const TreeBarrier&) = delete
TreeBarrier(TreeBarrier&&) = default
TreeBarrier &operator=(const TreeBarrier&) = delete
TreeBarrier &operator=(TreeBarrier&&) = default
template<class C>
inline arrival_token arrive_with_completion(uint32_t thread_id, C &&custom_completion)

Arrive at the barrier with a custom completion function that is called by the last thread that arrives, before advancing the barrier phase and notifying all waiting threads.

The completion function of the barrier is not called in this case. Each thread should use a unique thread ID in [0, expected-1].

inline arrival_token arrive(uint32_t thread_id)

Arrive at the barrier.

The barrier’s completion function is called by the last thread that arrives, before advancing the barrier phase and notifying all waiting threads. Each thread should use a unique thread ID in [0, expected-1].

inline arrival_token arrive(uint32_t thread_id, int line)

Arrive at the barrier, recording the given line number for sanity checking to make sure that all threads arrive from the same line or statement in the source code.

This is useful for debugging purposes to detect mismatched barrier calls, but should not really be used otherwise. If CYQLONE_SANITY_CHECKS_BARRIER is disabled, the line number is ignored and this function is equivalent to arrive(uint32_t). Each thread should use a unique thread ID in [0, expected-1].

inline BarrierPhase current_phase() const

Query the current barrier phase.

May wrap around on overflow, but all threads will see the same phase values in the same order.

inline bool wait_may_block(const arrival_token &token) const noexcept

Check if wait() may block.

If it returns false, the caller can call wait() and it will return immediately without spinning or sleeping. This is useful if the caller has other non-critical work to do while waiting for other threads. Users should still call wait() before arriving again.

Note

This function does not impose any memory ordering, so even when it returns false, changes made before the arrival of other threads may not be visible yet. In contrast, wait() does ensure proper synchronization.

inline void wait(arrival_token &&token) const

Wait for the barrier to complete after an arrival, using the given token.

Separating the arrival and wait phases allows for overlapping computation with waiting, hiding the synchronization latency. Waiting on the same token multiple times is not allowed.

inline void arrive_and_wait(uint32_t thread_id)

Convenience function to arrive and wait in a single call.

inline void arrive_and_wait(uint32_t thread_id, int line)

Convenience function to arrive and wait in a single call (with optional sanity check).

template<class C>
inline void arrive_and_wait_with_completion(uint32_t thread_id, C &&custom_completion)

Convenience function to arrive and wait in a single call (with custom completion).

template<class C>
inline auto arrive_and_wait_with_completion(uint32_t thread_id, C &&custom_completion)

Convenience function to arrive and wait in a single call (with custom completion).

Broadcasts the return value of the custom completion function to all threads.

template<class T, class F>
inline arrival_token_typed<T> arrive_reduce(uint32_t thread_id, T x, F reduce)

Combining tree reduction across all threads.

Deterministic application order for a given number of threads.

template<class T>
inline T wait_reduce(arrival_token_typed<T> &&token)

Wait for the result of an arrive_reduce call and obtain the reduced value.

template<class T, class F>
inline T reduce(uint32_t thread_id, T x, F reduce)

Combining tree reduction across all threads.

Deterministic application order for a given number of threads.

template<class T>
inline T broadcast(uint32_t thread_id, T &&x, uint32_t src = 0)

Broadcast a value from the source thread to all other threads.

All threads must call this function with the same source thread ID.

Public Members

uint32_t spin_count = 1000

Number of spin iterations before falling back to futex-based wait.

Public Static Functions

static inline uint32_t max()

Maximum number of threads supported by this barrier implementation.

class arrival_token

Subclassed by cyqlone::TreeBarrier< CompletionFn, PhaseType >::arrival_token_typed< T >

Public Functions

inline explicit arrival_token(BarrierPhase phase)
arrival_token(const arrival_token &phase) = delete
arrival_token(arrival_token &&phase) = default
arrival_token &operator=(const arrival_token &phase) = delete
arrival_token &operator=(arrival_token &&phase) = default
inline BarrierPhase get() const noexcept
template<class T>
class arrival_token_typed : public cyqlone::TreeBarrier<CompletionFn, PhaseType>::arrival_token

Public Functions

inline BarrierPhase get() const noexcept
struct EmptyCompletion

No-op completion function for the TreeBarrier.

Public Functions

inline void operator()() const noexcept

Does nothing.

SharedContext#

struct SharedContext

Abstraction for a parallel execution context: a set of threads that can synchronize and communicate with each other using barriers.

See also

Context

Public Types

using completion_type = EmptyCompletion
using barrier_type = TreeBarrier<completion_type, uint16_t>

Public Functions

template<class F>
void run(F&&)

Execute the given function in parallel on all threads, blocking until completion.

The function will be called with a Context that contains the thread index, and that can be used to synchronize and communicate between threads.

inline uint32_t set_barrier_spin_count(uint32_t spin_count)

Configure the barrier spin count used in parallel synchronization before falling back to a futex wait.

Public Members

const index_t num_thr
barrier_type barrier = {static_cast<uint32_t>(num_thr), {}}

Context#

template<class SC>
struct Context

Thread context for parallel execution.

Each thread has a unique thread index, and can synchronize and communicate with other threads in the same shared context.

See also

SharedContext

Public Types

using shared_context_type = SC
using arrival_token = typename shared_context_type::barrier_type::arrival_token

Public Functions

inline bool is_master() const

Check if this thread is the master thread (thread index 0).

Useful for determining which thread should perform operations like printing to the console, which should be done by a single thread and does not require synchronization.

inline arrival_token arrive()

Arrive at the barrier and obtain a token that can be used to wait for completion of the current barrier phase.

Note

Token must be awaited before any other call to arrive.

inline void wait(arrival_token &&token)

Await a token returned by arrive(), waiting for the barrier phase to complete.

inline void arrive_and_wait()

Arrive at the barrier and wait for the barrier phase to complete.

This is a convenience wrapper around arrive() and wait() for the common case where the thread does not have other work to do while waiting.

inline void arrive_and_wait(int line)

Debug version of arrive_and_wait() that performs a sanity check to ensure that all threads are arriving at the same line of code.

The line parameter should be the same for all threads arriving at the same barrier. It is only verified in debug builds, and is equivalent to arrive_and_wait() in release builds.

template<class T>
inline T broadcast(T x, index_t src = 0)

Broadcast a value x from the thread with index src to all threads.

template<class F, class ...Args>
inline std::invoke_result_t<F, Args...> call_broadcast(F &&f, Args&&... args)

Call a function f with the given args on a single thread and broadcast the return value to all threads.

template<class T, class F>
inline auto arrive_reduce(T x, F func)

Perform a reduction of x across all threads using the given binary function func.

Returns a token that can be used to wait for the reduction to complete and obtain the reduced value.

template<class T>
inline T wait_reduce(shared_context_type::barrier_type::template arrival_token_typed<T> &&token)

Wait for the reduction initiated by arrive_reduce() to complete and obtain the reduced value.

template<class T, class F>
inline T reduce(T x, F func)

Perform a reduction of x across all threads using the given binary function func, and wait for the result.

template<class T>
inline T reduce(T x)

Reduction with std::plus, i.e., summation across all threads.

See also

reduce(T,F)

template<class F>
inline void run_single_sync(F &&f)

Wait for all threads to reach this point, then run the given function on a single thread before releasing all threads again.

Changes by all threads are visible during the call to f and changes made by f are visible to all threads after this function returns.

Public Members

shared_context_type &shared
const index_t index
const index_t num_thr = shared.num_thr

Friends

inline friend bool operator==(const Context &a, const Context &b)