Data-Aware Tridiagonal Solvers for CFD Applications
Tridiagonal systems frequently arise in the numerical solution of partial differential equations (PDEs) on structured grids, particularly in high-order compact finite difference schemes and alternating direction implicit (ADI) methods. These systems appear as independent batches along each spatial direction and form the computational backbone of many fluid dynamics simulations. However, performance bottlenecks are common, especially when solving x-directional systems, due to memory layout inefficiencies and limited parallelism. This work introduces a direction-agnostic data structure and a new distributed-memory solver (DistD2-TDS) that address these limitations by optimizing memory access patterns and reducing inter-process communication.
Introduction to Tridiagonal Systems in PDE Discretization
Tridiagonal systems appear naturally when discretizing PDEs using high-order compact finite difference methods. These discretizations provide improved accuracy without widening the computational stencil but require solving a linear system at each grid line along a given spatial direction. For 3D problems, this results in large batches of tridiagonal systems in the x-, y-, and z-directions. These systems must be solved at every time step, making their efficiency crucial in large-scale simulations.
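As a concrete illustration, the interior rows of a Lele-type sixth-order compact first-derivative scheme can be assembled as below. The coefficients alpha = 1/3, a = 14/9, b = 1/9 and the helper name `compact_rhs` are assumptions of this sketch, not taken from the solver described here.

```python
# Interior rows of a sixth-order compact first-derivative scheme
# (Lele-type coefficients; an assumption of this illustration):
#   alpha*f'[i-1] + f'[i] + alpha*f'[i+1]
#     = a*(f[i+1] - f[i-1])/(2h) + b*(f[i+2] - f[i-2])/(4h)
alpha, a, b = 1.0 / 3.0, 14.0 / 9.0, 1.0 / 9.0

def compact_rhs(f, h):
    """Right-hand side of the compact scheme at the interior points."""
    n = len(f)
    return [a * (f[i + 1] - f[i - 1]) / (2.0 * h)
            + b * (f[i + 2] - f[i - 2]) / (4.0 * h)
            for i in range(2, n - 2)]

# Every grid line contributes one such system, so an nx*ny*nz grid yields
# ny*nz independent x-directional systems (and similarly for y and z).
ny, nz = 256, 256
num_x_systems = ny * nz
```

As a quick sanity check, for f(x) = x every interior right-hand side evaluates to a + b = 5/3, which matches the left-hand side alpha + 1 + alpha applied to the exact derivative f' = 1.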
Performance Challenges in Traditional Solvers
Standard implementations of the Thomas algorithm are inherently sequential and do not scale well across distributed-memory environments. Moreover, memory layout plays a critical role in determining performance. In structured Cartesian grids, data along the x-direction is typically stored contiguously, while data along the y- and z-directions is scattered in memory. This discrepancy limits vectorization and makes it difficult for both CPU and GPU architectures to utilize their memory hierarchies efficiently. The result is poor cache utilization and suboptimal thread parallelism.
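For reference, a minimal serial Thomas algorithm looks like the following sketch; the forward sweep's row-to-row dependency is exactly what blocks parallelism along a single system.

```python
# Serial Thomas algorithm for a single tridiagonal system A x = d, where
# a, b, c hold the sub-, main- and super-diagonals (a[0] and c[-1] unused).
def thomas(a, b, c, d):
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: row i depends on row i-1, so this loop cannot be
    # parallelized along the system -- parallelism must instead come from
    # solving many independent systems at once.
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Backward substitution, again strictly sequential.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

For batches of systems, the usual strategy is to run many such solves concurrently, one per grid line, which is where memory layout becomes decisive.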
Compact Finite Difference Schemes and Their Properties
Compact finite difference schemes use implicit formulations to achieve high-order accuracy with narrow stencils. These schemes often lead to tridiagonal or narrow banded systems. Their implicit nature makes them suitable for capturing fine-scale features in computational fluid dynamics (CFD) applications, such as turbulence or boundary layers, with fewer grid points. An important property of the resulting matrices is diagonal dominance, which simplifies the stability analysis and reduces communication overhead in distributed algorithms.
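The practical consequence of diagonal dominance can be seen numerically: in a dominant system, a perturbation at one end of the domain decays geometrically with distance and is invisible at the other end. The sketch below (with illustrative coefficients, not a scheme from the paper) demonstrates this decay, which is what permits neighbour-only communication in a distributed algorithm.

```python
# Numerical illustration: influence decays geometrically in a diagonally
# dominant tridiagonal system (coefficients here are illustrative only).
def thomas(a, b, c, d):
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

n = 64
a, b, c = [1.0] * n, [4.0] * n, [1.0] * n   # dominant: |4| > |1| + |1|
base = thomas(a, b, c, [1.0] * n)
d = [1.0] * n
d[-1] += 100.0                               # large perturbation at one end
pert = thomas(a, b, c, d)
far = abs(pert[0] - base[0])     # change at the opposite end of the domain
near = abs(pert[-1] - base[-1])  # change at the perturbed end
```

Here `near` is of order 10 while `far` is far below machine precision, so truncating long-range coupling introduces no observable error.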
Proposed Data Structure and Direction-Agnostic Layout
To overcome directional performance disparities, we propose a novel data layout that aligns memory access patterns for all spatial directions. This direction-agnostic layout improves data locality, enabling efficient vectorization on CPUs and better thread utilization on GPUs. It also reduces the need for costly transpose operations, commonly used in multi-rank FFT and Thomas algorithm implementations. By storing neighboring elements together in memory regardless of direction, the new layout ensures consistent performance across x-, y-, and z-directional tridiagonal solves.
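One way to picture such a layout (a hypothetical sketch, not the paper's exact data structure): interleave small groups of independent grid lines so that element i of every line in a group sits contiguously in memory. SIMD lanes can then advance all the solves in a group in lockstep with unit-stride loads, regardless of which spatial direction the lines came from. The names `pack_lines` and the width `w` are assumptions of this illustration.

```python
w = 4  # hypothetical SIMD width (number of systems advanced per vector op)

def pack_lines(lines):
    """lines: list of equal-length grid lines (one tridiagonal system each,
    assumed a multiple of w). Returns a flat buffer laid out as
    [group][element i][lane], so element i of w systems is contiguous."""
    n = len(lines[0])
    buf = []
    for g in range(0, len(lines), w):
        group = lines[g:g + w]
        for i in range(n):
            for line in group:
                buf.append(line[i])
    return buf
```

With this packing, a Thomas forward sweep reads `buf` with unit stride while carrying `w` independent recurrences, which is what makes the solve direction-agnostic.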
Distributed-Memory Solver: DistD2-TDS Algorithm
The DistD2-TDS algorithm is introduced as a scalable solution for solving batches of tridiagonal systems in large domains across multiple compute nodes. It splits individual tridiagonal systems into subdomains, allowing for a true 3D decomposition without requiring data transpositions. This reduces all-to-all communication and enables better scalability on modern supercomputers. By leveraging diagonal dominance, communication is limited to neighboring subdomains, resulting in significant performance gains. The algorithm is designed to be compatible with both CPUs and GPUs and supports hybrid parallelism using MPI together with OpenMP or CUDA.
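The role of diagonal dominance in limiting communication can be sketched as follows. This is a simplified single-process illustration, not the published DistD2-TDS implementation, and the coefficients are made up: a subdomain that solves its own rows plus a modest halo borrowed from its neighbour reproduces the global solution on its owned rows to high accuracy, so no global reduction or transpose is needed.

```python
# Simplified illustration of neighbour-limited coupling in a split solve.
def thomas(a, b, c, d):
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

n = 32
a, b, c, d = [1.0] * n, [4.0] * n, [1.0] * n, [1.0] * n
a[0] = c[-1] = 0.0
global_x = thomas(a, b, c, d)

own, halo = 16, 8   # this "rank" owns rows [0, 16) and borrows 8 halo rows
m = own + halo
# Local solve over owned rows + halo, implicitly closing the system with
# zeros beyond the halo; dominance makes the truncation error tiny.
local_x = thomas(a[:m], b[:m], c[:m], d[:m])
err_halo = max(abs(local_x[i] - global_x[i]) for i in range(own))

# Without a halo, the truncation error at the subdomain edge is visible.
local_0 = thomas(a[:own], b[:own], c[:own], d[:own])
err_none = max(abs(local_0[i] - global_x[i]) for i in range(own))
```

Because each extra halo row shrinks the truncation error by a constant factor, a short nearest-neighbour exchange suffices to reach the accuracy of a global solve.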
Kernel Fusion and Memory Bandwidth Optimization
Modern GPU architectures are often bandwidth-bound, meaning performance is limited by the speed of memory access rather than computation. To address this, we implement kernel fusion strategies where multiple mathematical operations—such as computing derivatives, applying filters, and solving tridiagonal systems—are combined within a single kernel. This reduces the number of global memory accesses, lowers latency, and increases arithmetic intensity. The proposed data structure supports such fusion by allowing linear, predictable access patterns across operations.
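The effect of fusion can be sketched in scalar code (the stencils below are illustrative, not Xcompact3d's kernels): the unfused version streams the field through memory twice, once for a derivative and once for a smoothing filter, while the fused version produces the same values in a single pass by recomputing the few derivative values it needs from data already resident in cache or registers.

```python
def unfused(f, h):
    # Pass 1: central derivative (one full read + write of the array).
    d = [(f[i + 1] - f[i - 1]) / (2.0 * h) for i in range(1, len(f) - 1)]
    # Pass 2: 1-2-1 smoothing filter (a second full read + write).
    return [0.25 * d[j - 1] + 0.5 * d[j] + 0.25 * d[j + 1]
            for j in range(1, len(d) - 1)]

def fused(f, h):
    # Single pass: recompute the three neighbouring derivatives from values
    # already in cache/registers, then apply the filter immediately.
    out = []
    for i in range(2, len(f) - 2):
        dm = (f[i] - f[i - 2]) / (2.0 * h)
        d0 = (f[i + 1] - f[i - 1]) / (2.0 * h)
        dp = (f[i + 2] - f[i]) / (2.0 * h)
        out.append(0.25 * dm + 0.5 * d0 + 0.25 * dp)
    return out
```

The fused variant trades a few redundant flops for roughly half the global memory traffic, which is the right trade on bandwidth-bound hardware.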
Applications and Real-World Case Study
To validate the approach, we apply the new solver framework to a 3D nonlinear PDE model representative of CFD problems, using sixth-order compact finite difference schemes. The test domain includes up to trillions of degrees of freedom, demonstrating both accuracy and scalability. The new strategy is integrated into Xcompact3d, a CFD solver framework optimized for high-order methods, showing up to 2x improvement in tridiagonal solver performance and significant reductions in total runtime per time step.
Conclusion and Future Work
This study presents a significant step forward in the efficient numerical solution of tridiagonal systems on modern parallel architectures. By redesigning data structures and algorithmic approaches, the solution achieves directional uniformity, improved scalability, and high performance across CPU and GPU platforms. Future work will extend the method to non-diagonally dominant and more general banded systems and explore adaptive precision strategies for further optimization.
#HighPerformanceComputing #NumericalMethods #ScientificComputing #ParallelComputing #TridiagonalSolver #PDEs #CompactFiniteDifference #CFD #DistributedComputing #GPUComputing #CPUSolvers #MatrixAlgorithms #HPC #DataLocality #KernelFusion