TricyqleSolver#

template<index_t VL = 4, class T = real_t, StorageOrder DefaultOrder = StorageOrder::ColMajor, class Ctx = parallel::Context<>> struct TricyqleSolver

Solver for block-tridiagonal systems using cyclic reduction (CR), parallel cyclic reduction (PCR), and preconditioned conjugate gradient (PCG) methods.

Template Parameters:

VL – Vector length.
T – Scalar type.
DefaultOrder – Storage order for the matrix workspaces (row/column major).

Parallelization and vectorization

using simd = batmat::datapar::deduced_simd<value_type, v>: Represents a SIMD vector of width v storing values of type value_type.

using vl_t = std::integral_constant<index_t, v>: Integral constant type for the vector length.

using align_t = std::integral_constant<index_t, v * alignof(value_type)>: Integral constant type for the alignment of the batched matrix data structures.

const index_t p = 8: Number of processors/threads.

static index_t v = VL: Vector length.

inline index_t lp() const: log₂(p), logarithm of the number of processors/threads p, rounded up.

inline index_t ceil_p() const: The number of processors p rounded up to the next power of two.

inline index_t ceil_P() const: The number of parallel execution units P = p * v rounded up to the next power of two.

inline std::unique_ptr<SharedContext> create_parallel_context() const: Create a new parallel execution context, storing synchronization primitives and shared data for the parallel algorithms.

static inline index_t lv(): log₂(v), logarithm of the vector length v.

Matrix data structures

template<StorageOrder O = column_major> using matrix = batmat::matrix::Matrix<value_type, index_t, vl_t, index_t, O, align_t>: Owning type for a batch of matrices (with batch size v).

template<StorageOrder O = column_major> using view = batmat::matrix::View<const value_type, index_t, vl_t, index_t, index_t, O>: Non-owning immutable view type for matrix.

template<StorageOrder O = column_major> using mut_view = batmat::matrix::View<value_type, index_t, vl_t, index_t, index_t, O>: Non-owning mutable view type for matrix.

using layer_stride = batmat::matrix::DefaultStride

template<StorageOrder O = column_major> using batch_view = batmat::matrix::View<const value_type, index_t, vl_t, vl_t, layer_stride, O>: Non-owning immutable view type for a single batch of v matrices.

template<StorageOrder O = column_major> using mut_batch_view = batmat::matrix::View<value_type, index_t, vl_t, vl_t, layer_stride, O>: Non-owning mutable view type for a single batch of v matrices.

static auto default_order = DefaultOrder: Default storage order for most matrices.

static auto column_major = StorageOrder::ColMajor: Column-major storage order for column vectors and update matrices.

Problem dimensions

const index_t block_size: Block size of the block-tridiagonal system.

const index_t max_rank = 0: Maximum update rank.

bool circular = false: Whether the block-tridiagonal system is circular (nonzero top-right & bottom-left corners).

Solver parameters

Params params = {}: Solver parameters for Tricyqle-specific settings.

inline Params get_params() const: Get the current solver parameters.

inline void update_params(const Params &new_params): Update the solver parameters.

Cyclic reduction data structures

matrix<default_order> cr_L = [this] {returnmatrix<default_order>{{.depth = p * v, .rows = block_size, .cols = block_size}};}()

Diagonal blocks of the Cholesky factor of the Schur complement (used during CR).

Batch indices correspond to block column indices of the block tridiagonal system.

matrix<default_order> cr_U = [this] {returnmatrix<default_order>{{.depth = p * v, .rows = block_size, .cols = block_size}};}()

Subdiagonal blocks U of the Cholesky factor of the Schur complement (used during CR).

Batch indices correspond to block column indices of the block tridiagonal system.

matrix<default_order> cr_Y = [this] {returnmatrix<default_order>{{.depth = p * v, .rows = block_size, .cols = block_size}};}()

Subdiagonal blocks Y of the Cholesky factor of the Schur complement (used during CR).

Batch indices correspond to block column indices of the block tridiagonal system.

matrix<column_major> work_cr = [this] {returnmatrix<column_major>{{.depth = p * v, .rows = block_size, .cols = 1}};}(): Temporary workspace for the CR solve phase.

Parallel cyclic reduction data structures

matrix<default_order> pcr_L = [this] {returnmatrix<default_order>{{.depth =v * (lv() + 1), .rows = block_size, .cols = block_size}};}(): Diagonal blocks of the PCR Cholesky factorizations.

matrix<default_order> pcr_Y = [this] {returnmatrix<default_order>{{.depth = v * lv(), .rows = block_size, .cols = block_size}};}(): Subdiagonal blocks Y of the PCR Cholesky factorizations.

matrix<default_order> pcr_U = [this] {returnmatrix<default_order>{{.depth = v * lv(), .rows = block_size, .cols = block_size}};}(): Subdiagonal blocks U of the PCR Cholesky factorizations.

matrix<default_order> pcr_M = [this] {returnmatrix<default_order>{{.depth = v, .rows = block_size, .cols = block_size}};}(): Workspace to store the diagonal blocks during the PCR factorization.

matrix<column_major> work_pcg = [this] {returnmatrix<column_major>{{.depth = v, .rows = block_size, .cols = 4}};}(): Temporary workspace for CG vectors.

Factorization update data structures

std::vector<index_t> m_update = std::vector<index_t>(p): Update rank (number of changing constraints) per thread.

index_t m_update_u0 = -1: Update rank from D(0). Negative if D(0) is not handled separately.

matrix< column_major > work_update_Σ = [this] {returnmatrix<column_major>{{.depth = v, .rows = max_rank, .cols = 1}};}(): Compressed reprentation of the nonzero diagonal elements of the matrix Σ.

matrix<column_major> work_update = [this] {returnmatrix<column_major>{{.depth = 4 * v, .rows = block_size, .cols = max_rank}};}(): Workspace to store the update matrices Ξ(Υ) for the factorization update.

matrix<column_major> work_hyh = [this] {using namespacebatmat::linalg;const auto [r, c] =hyhound_size_W(tril(cr_L.batch(0)));returnmatrix<column_major>{{.depth = p * v, .rows = r, .cols = c}};}(): Storage for the hyperbolic Householder transformations.

matrix< column_major > work_update_pcr_Σ = [this] {returnmatrix<column_major>{{.depth = v, .rows = 2 * max_rank, .cols = 1}};}(): Two copies of work_update_Σ for PCR updates.

matrix<column_major> work_update_pcr_L = [this] {returnmatrix<column_major>{{.depth = v, .rows = block_size, .cols = max_rank}};}(): Update matrices to apply to the diagonal blocks L during PCR updates.

matrix<column_major> work_update_pcr_UY = [this] {returnmatrix<column_major>{{.depth = v, .rows = block_size, .cols = 2 * max_rank}};}(): Update matrices to apply to the subdiagonal blocks U and Y during PCR updates.

Low-level PCR factorization and solve routines

static bool merge_last_level_pcr = true

void factor_pcr()

Compute the parallel cyclic reduction factorization of the final block tridiagonal system of size v.

[PCR factor serial]

The function assumes that the Cholesky factors of the diagonal blocks are stored in cr_L.batch(0), and the subdiagonal blocks are stored in cr_Y.batch(0). This variant computes both subdiagonal blocks on a single thread.

template<index_t Level> void factor_pcr_level()

Perform a single level of the PCR factorization.

This variant computes both subdiagonal blocks on a single thread.

void factor_pcr_parallel(Context &ctx)

Compute the parallel cyclic reduction factorization of the final block tridiagonal system of size v.

[PCR factor serial]

This variant computes the subdiagonal blocks in parallel on different threads.

[PCR factor]

template<index_t Level> void factor_pcr_level_parallel(Context &ctx)

Perform a single level of the PCR factorization.

This variant computes the subdiagonal blocks in parallel on different threads.

void solve_pcr (mut_batch_view<> λ, mut_batch_view<> work_pcr) const

Solve a linear system with the final block tridiagonal system of size v using the PCR factorization.

[PCR factor]

[Cyqlone solve PCR]

inline void solve_pcr (mut_batch_view<> λ)

Solve a linear system with the final block tridiagonal system of size v using the PCR factorization.

[PCR factor]

[Cyqlone solve PCR]

template<index_t Level> void solve_pcr_level (mut_batch_view<> λ, mut_batch_view<> work_pcr) const: Perform a single level of the PCR solve.

Indexing utilities

index_t ν2 (index_t i) const: 2-adic valuation ν₂.

index_t ν2p (index_t i) const: 2-adic valuation modulo p, i.e. ν2p(0) = ν2p(p) = lp().

index_t add_wrap_ceil_p(index_t a, index_t b) const: Add b to a modulo ceil_p().

index_t sub_wrap_ceil_p(index_t a, index_t b) const: Subtract b from a modulo ceil_p().

Factorization and solve routines

inline decltype(auto) init_diag(Context &ctx, auto &&func)

Initialize the diagonal blocks M of the block tridiagonal system using a user-provided function.

The function is called with an index i in [0, p) and a mutable view to the compact workspace of depth v where the diagonal blocks M(i+l*p) for l in [0, v) should be stored. Copying non-compact data to this workspace can be achieved using the cyqlone::linalg::pack function.

inline decltype(auto) init_subdiag(Context &ctx, auto &&func)

Initialize the subdiagonal blocks K of the block tridiagonal system using a user-provided function.

The function is called with an index i in [0, p) and a mutable view to the compact workspace of depth v where the subdiagonal blocks K(i+l*p) for l in [0, v) should be stored. Copying non-compact data to this workspace can be achieved using the cyqlone::linalg::pack function.

inline decltype(auto) init_rhs(Context &ctx, mut_view<> b, auto &&func) const

Initialize the right-hand side of the linear system using a user-provided function.

The function is called with an index i in [0, p) and a mutable view of the batch of b of depth v where the right-hand side blocks b(i+l*p) for l in [0, v) should be stored. Copying non-compact data to this batch can be achieved using the cyqlone::linalg::pack function.

inline decltype(auto) get_solution (Context &ctx, view<> λ, auto &&func) const

Get access to the solution computed by this thread using a user-provided function.

The function is called with an index i in [0, p) and a view to the compact batch of λ of depth v where the solution blocks λ(i+l*p) for l in [0, v) are stored. Copying this batch to a non-compact layout can be achieved using the cyqlone::linalg::unpack function.

inline decltype(auto) get_solution (Context &ctx, mut_view<> λ, auto &&func) const

Get access to the solution computed by this thread using a user-provided function.

template<class Λ, class F> inline decltype(auto) get_solution (Context &ctx, Λ &&λ, F &&func) const

Get access to the solution computed by this thread using a user-provided function.

void factor_solve (Context &ctx, mut_view<> λ, index_t stride=1)

Fused factorization and forward solve.

[Cyqlone factor Schur]

Performs CR to reduce the system down to a single block tridiagonal system of size v which is then solved using PCR or PCG.

Note that a backward solve of the final block λ.batch(0) is performed during the forward solve phase, so λ.batch(0) contains (part of) the final solution. Performing the forward and backward solves separately for this block is not possible, because it is solved using PCR or PCG. Both methods solve the full block tridiagonal system at once, rather than using a single explicit CR Cholesky factorization with distinct forward and backward solves.

Parameters:

ctx – Parallel execution context for communication/synchronization between threads.
λ – On entry, the right-hand side of the linear system. On exit, the solution of the forward solve phase.
stride – Stride (in number of batches) between batches of λ. In total, λ contains stride * p batches, but only every stride-th batch is accessed.

Pre:

The workspaces cr_L.batch(i) contain the diagonal blocks M(i:v:p) of the block tridiagonal system (see init_diag).

Pre:

If p > 1, the workspaces cr_Y.batch(i) with odd i contain the subdiagonal blocks K(i:v:p) of the block tridiagonal system (see init_subdiag).

Pre:

If p > 1, the workspaces cr_U.batch(i) with odd i contain the superdiagonal blocks K(i-1:v:p)ᵀ of the block tridiagonal system (see init_subdiag).

Pre:

If p = 1, the workspace cr_Y.batch(0) contains the subdiagonal blocks K(0:v:1) of the block tridiagonal system, and cr_U.batch(0) is not used.

Post:

The workspaces cr_L.batch(i) with i > 0 contain the diagonal blocks of the CR Cholesky factor. cr_Y.batch(i) and cr_U.batch(i) contain the subdiagonal blocks. The final batch cr_L.batch(0) contains the diagonal blocks of the Schur complement of all other blocks, which is the input to the PCR or PCG solver. The batch pcr_L.batch(0) contains its Cholesky factorizations. The subdiagonal blocks of this Schur complement are stored in pcr_Y.batch(0).

Post:

If the PCR solver was selected, the full PCR factorization is stored in pcr_L, pcr_U and pcr_Y.

void factor(Context &ctx): Perform only the factorization as described by factor_solve.

void solve_forward (Context &ctx, mut_view<> λ, index_t stride=1): Perform only the forward solve as described by factor_solve.

void solve_reverse (Context &ctx, mut_view<> λ, mut_view<> work, index_t stride=1) const

Perform the backward solve phase, after the forward solve phase has been performed by factor_solve.

Parameters:

ctx – Parallel execution context for communication/synchronization between threads.
λ – On entry, the solution of the forward solve phase. On exit, the solution of the full linear system.
work – Workspace of p * v column vectors of size block_size.
stride – Stride (in number of batches) between batches of λ. In total, λ contains stride * p batches, but only every stride-th batch is accessed.

inline void solve_reverse (Context &ctx, mut_view<> λ, index_t stride=1)

Low-level factorization and solve routines

template<bool Factor = true, bool Solve = true> void factor_solve_skip_first (Context &ctx, mut_view<> λ, index_t stride=1)

Fused factorization and forward solve.

[Cyqlone factor Schur]

Unlike factor_solve, this function assumes that the odd diagonal blocks in the first level of CR have already been factorized and solved. This allows the user to fuse the evaluation and factorization of these blocks, possibly enabling higher performance by avoiding an additional trip to memory.

Pre:: If p > 1, the workspaces cr_L.batch(i) contain the diagonal blocks M(i) for even i, and the Cholesky factors L(i) of M(i) for odd i.
Pre:: If p = 1, the workspace cr_L.batch(0) contains the diagonal block M(0), and the workspace pcr_L.batch(0) contains its Cholesky factor L(0).
Pre:: If p > 1, the workspaces cr_Y.batch(i) contain the subdiagonal blocks K(i) for odd i (uninitialized for even i).
Pre:: If p = 1, the workspace cr_Y.batch(0) contains the subdiagonal block K(0).
Pre:: If p > 1, the workspace cr_U.batch(i) contains the superdiagonal blocks K(i-1)ᵀ for odd i (uninitialized for even i).
Pre:: If p = 1, the workspace cr_U.batch(0) is not used.
Pre:: If p > 1, the right-hand sides λ.batch(stride * i) contain the right-hand sides of the system for even i, and the right-hand sides multiplied by L(i)⁻¹ for odd i.
Pre:: If p = 1, the right-hand side λ.batch(0) contains the right-hand side of the system.

inline void factor_skip_first(Context &ctx): Factorization-only variant of factor_solve_skip_first.

inline void solve_forward_skip_first (Context &ctx, mut_view<> λ, index_t stride=1): Solution-only variant of factor_solve_skip_first.

template<bool Factor = true, bool Solve = true> void factor_solve_impl (Context &ctx, mut_view<> λ, index_t stride=1)

Implementation of factor_solve.

[Cyqlone factorization and fused forward solve]

Low-level CR factorization and solve routines

index_t cr_thread_assignment(index_t l, index_t c) const

Adjust thread assignment for non-power-of-two p: The diagonal blocks M(⌊p/2⌋2) are usually mapped to increasing thread indices c as the CR level l increases, as can be seen in the functions above, where iY = c + 1 - 2^l, and from the way the path of M nodes curves to the right in the thread assignment diagram in the paper.

However, these large thread indices are not actually present if p is not a power of two, so we need to remap them, undoing the offset 1 - 2^l. We always assign the last M evaluation to the even thread ⌊p/2⌋2, since this thread is present even if p is odd. The odd thread ⌊p/2⌋2+1 is assigned an inactive index, since it never has any work during CR, as there is no coupling between the last and first stages (at least not in the scalar case).

void factor_U(index_t l, index_t iU)

Compute a block U in the Cholesky factor for the given CR level l and column index iU.

[Cyqlone factor CR helper]

void factor_Y(index_t l, index_t iY): Compute a block Y in the Cholesky factor for the given CR level l and column index iY.

void factor_L(index_t l, index_t i): Update and factorize a block L in the Cholesky factor for CR level l+1 and column index i, using the previously computed blocks U and Y in the same row at level l.

void update_K(index_t l, index_t i): Compute a subdiagonal block K of the Schur complement for CR level l+1 and column index i, using the previously computed blocks U and Y in the same row at level l.

void solve_u_forward (index_t l, index_t iU, mut_view<> λ, index_t stride) const

Update the right-hand side λ during the forward solve phase of CR after computing block iU of λ at level l, subtracting the product U(iU) λ(iU) from the block of λ in the same row as U(iU).

[Cyqlone factor CR helper]

The stride parameter specifies the stride between consecutive blocks of λ.

[Cyqlone solve CR helper]

void solve_y_forward (index_t l, index_t iY, mut_view<> λ, mut_view<> w, index_t stride) const

Update the right-hand side λ during the forward solve phase of CR after computing block iY of λ at level l, subtracting the product Y(iY) λ(iY) from the block of λ in the same row as Y(iY).

The product Y(iY) λ(iY) is stored in the workspace w to allow it to be computed concurrently with solve_u_forward, which updates the same block of λ. The stride parameter specifies the stride between consecutive blocks of λ.

void solve_λ_forward (index_t l, index_t iL, mut_view<> λ, view<> w, index_t stride) const

Apply the updates to block iL of the right-hand side from solve_u_forward and solve_y_forward, and then solve with the diagonal block L for the next level l+1.

The stride parameter specifies the stride between consecutive blocks of λ.

Low-level PCG routines

value_type mul_Mv(batch_view<> p, mut_batch_view<> Mp, batch_view<default_order> L, batch_view<default_order> K) const

Multiply a vector by the final block tridiagonal matrix of size v.

The matrix is represented by the Cholesky factors of the diagonal blocks L and the subdiagonal blocks K.

M = [ M(0)  Kᵀ(0)             ]  where M(i) = L(i) L(i)ᵀ
    [ K(0)  M(1)  Kᵀ(1)       ]
    [       K(1)  M(2)  Kᵀ(2) ]
    [             K(2)  M(3)  ]

value_type mul_precond(batch_view<> r, mut_batch_view<> z, mut_batch_view<> w, batch_view<default_order> L, batch_view<default_order> K) const

Multiply a vector by the preconditioner for the final block tridiagonal system of size v.