Parallelization#
TreeBarrier#
-
template<typename CompletionFn = EmptyCompletion, class PhaseType = uint32_t>
class TreeBarrier
A fairly vanilla combining tree barrier.
It is inspired by GCC 15.2’s __tree_barrier, with some important API differences:
Every thread has a unique thread ID in [0, expected-1]. This eliminates the need for hashing the pthread thread IDs and for the inner search loop to find free slots in the tree.
Wait tries to spin for a given number of iterations before falling back to a futex-based atomic wait.
The barrier phase is exposed to the user.
Custom completion functions can be provided at arrival time.
Reductions and broadcasts on small values are supported.
Public Types
-
enum class BarrierPhase : PhaseType
Values:
Public Functions
-
inline TreeBarrier(uint32_t expected, CompletionFn completion)
Create a barrier with expected participating threads and a completion function that is called by the last thread that arrives at each phase.
-
TreeBarrier(const TreeBarrier&) = delete
-
TreeBarrier(TreeBarrier&&) = default
-
TreeBarrier &operator=(const TreeBarrier&) = delete
-
TreeBarrier &operator=(TreeBarrier&&) = default
-
template<class C>
inline arrival_token arrive_with_completion(uint32_t thread_id, C &&custom_completion)
Arrive at the barrier with a custom completion function that is called by the last thread that arrives, before advancing the barrier phase and notifying all waiting threads.
The barrier's own completion function is not called in this case. Each thread should use a unique thread ID in [0, expected-1].
-
inline arrival_token arrive(uint32_t thread_id)
Arrive at the barrier.
The barrier’s completion function is called by the last thread that arrives, before advancing the barrier phase and notifying all waiting threads. Each thread should use a unique thread ID in [0, expected-1].
-
inline arrival_token arrive(uint32_t thread_id, int line)
Arrive at the barrier, recording the given line number for sanity checking to make sure that all threads arrive from the same line or statement in the source code.
This is useful for debugging to detect mismatched barrier calls, but should not be used otherwise. If CYQLONE_SANITY_CHECKS_BARRIER is disabled, the line number is ignored and this function is equivalent to arrive(uint32_t). Each thread should use a unique thread ID in [0, expected-1].
-
inline BarrierPhase current_phase() const
Query the current barrier phase.
May wrap around on overflow, but all threads will see the same phase values in the same order.
-
inline bool wait_may_block(const arrival_token &token) const noexcept
Check if wait() may block.
If it returns false, the caller can call wait() and it will return immediately without spinning or sleeping. This is useful if the caller has other non-critical work to do while waiting for other threads. Users should still call wait() before arriving again.
Note
This function does not impose any memory ordering, so even when it returns false, changes made before the arrival of other threads may not be visible yet. In contrast, wait() does ensure proper synchronization.
-
inline void wait(arrival_token &&token) const
Wait for the barrier to complete after an arrival, using the given token.
Separating the arrival and wait phases allows for overlapping computation with waiting, hiding the synchronization latency. Waiting on the same token multiple times is not allowed.
-
inline void arrive_and_wait(uint32_t thread_id)
Convenience function to arrive and wait in a single call.
-
inline void arrive_and_wait(uint32_t thread_id, int line)
Convenience function to arrive and wait in a single call (with optional sanity check).
-
template<class C>
inline void arrive_and_wait_with_completion(uint32_t thread_id, C &&custom_completion)
Convenience function to arrive and wait in a single call (with custom completion).
-
template<class C>
inline auto arrive_and_wait_with_completion(uint32_t thread_id, C &&custom_completion)
Convenience function to arrive and wait in a single call (with custom completion).
Broadcasts the return value of the custom completion function to all threads.
-
template<class T, class F>
inline arrival_token_typed<T> arrive_reduce(uint32_t thread_id, T x, F reduce)
Combining tree reduction across all threads.
The application order is deterministic for a given number of threads.
-
template<class T>
inline T wait_reduce(arrival_token_typed<T> &&token)
Wait for the result of an arrive_reduce call and obtain the reduced value.
Public Members
-
uint32_t spin_count = 1000
Number of spin iterations before falling back to futex-based wait.
Public Static Functions
-
static inline uint32_t max()
Maximum number of threads supported by this barrier implementation.
-
class arrival_token
Subclassed by cyqlone::TreeBarrier< CompletionFn, PhaseType >::arrival_token_typed< T >
Public Functions
-
inline explicit arrival_token(BarrierPhase phase)
-
arrival_token(const arrival_token &phase) = delete
-
arrival_token(arrival_token &&phase) = default
-
arrival_token &operator=(const arrival_token &phase) = delete
-
arrival_token &operator=(arrival_token &&phase) = default
-
inline BarrierPhase get() const noexcept
-
template<class T>
class arrival_token_typed : public cyqlone::TreeBarrier<CompletionFn, PhaseType>::arrival_token
Public Functions
-
inline BarrierPhase get() const noexcept
-
struct EmptyCompletion
No-op completion function for the TreeBarrier.
Public Functions
-
inline void operator()() const noexcept
Does nothing.
Context#
-
template<class SC>
struct Context
Thread context for parallel execution.
Each thread has a unique thread index, and can synchronize and communicate with other threads in the same shared context.
Public Types
-
using shared_context_type = SC
-
using arrival_token = typename shared_context_type::barrier_type::arrival_token
Public Functions
-
inline bool is_master() const
Check if this thread is the master thread (thread index 0).
Useful for determining which thread should perform operations like printing to the console, which should be done by a single thread and does not require synchronization.
-
inline arrival_token arrive()
Arrive at the barrier and obtain a token that can be used to wait for completion of the current barrier phase.
Note
Token must be awaited before any other call to arrive.
-
inline void wait(arrival_token &&token)
Await a token returned by arrive(), waiting for the barrier phase to complete.
-
inline void arrive_and_wait()
Arrive at the barrier and wait for the barrier phase to complete.
This is a convenience wrapper around arrive() and wait() for the common case where the thread does not have other work to do while waiting.
-
inline void arrive_and_wait(int line)
Debug version of arrive_and_wait() that performs a sanity check to ensure that all threads are arriving at the same line of code.
The line parameter should be the same for all threads arriving at the same barrier. It is only verified in debug builds, and is equivalent to arrive_and_wait() in release builds.
-
template<class T>
inline T broadcast(T x, index_t src = 0)
Broadcast a value x from the thread with index src to all threads.
-
template<class F, class ...Args>
inline std::invoke_result_t<F, Args...> call_broadcast(F &&f, Args&&... args)
Call a function f with the given args on a single thread and broadcast the return value to all threads.
-
template<class T, class F>
inline auto arrive_reduce(T x, F func)
Perform a reduction of x across all threads using the given binary function func.
Returns a token that can be used to wait for the reduction to complete and obtain the reduced value.
-
template<class T>
inline T wait_reduce(shared_context_type::barrier_type::template arrival_token_typed<T> &&token)
Wait for the reduction initiated by arrive_reduce() to complete and obtain the reduced value.
-
template<class T, class F>
inline T reduce(T x, F func)
Perform a reduction of x across all threads using the given binary function func, and wait for the result.
-
template<class T>
inline T reduce(T x)
Reduction with std::plus, i.e., summation across all threads.
-
template<class F>
inline void run_single_sync(F &&f)
Wait for all threads to reach this point, then run the given function on a single thread before releasing all threads again.
Changes by all threads are visible during the call to f, and changes made by f are visible to all threads after this function returns.
Public Members
-
shared_context_type &shared
-
const index_t index
-
const index_t num_thr = shared.num_thr