Structure-based Optimizations for Sparse Matrix-Vector Multiply

Abstract

    class StackedSolver {
    public:
        // Perform initialization
        void (*init[MAX_STACKDEPTH])();
        // Perform one iteration step and return
        // the number of converged problems
        int (*iterate[MAX_STACKDEPTH])();
        // Learn maximum number of iterations
        int get_max_iterations();
        // Eject converged problems
        void eject_converged_problems();
    };

Figure 3.7: Abstract class StackedSolver, which stacked solver implementations must implement.

    void osf_iterator_engine(StackedSolver *solver, int stackdepth) {
        solver->init[stackdepth]();
        int nr_iters = 0;
        while ((stackdepth > 0) && (nr_iters < solver->get_max_iterations())) {
            nr_iters++;
            int nr_converged = solver->iterate[stackdepth]();
            if (nr_converged > 0) {
                solver->eject_converged_problems();
                stackdepth -= nr_converged;
            }
        }
    }

Figure 3.8: OSF iteration engine.

CHAPTER 3. OPERATION STACKING FRAMEWORK 33

    class CG : StackedSolver {
    private:
        int n, nnz;
        // Interleaved arrays
        interleaved_array<double> x[n * MAX_STACKDEPTH], p[n * MAX_STACKDEPTH];
        // Non-interleaved arrays
        double A[nnz * MAX_STACKDEPTH], r[n * MAX_STACKDEPTH];
        double b[n * MAX_STACKDEPTH], q[n * MAX_STACKDEPTH];
        double alpha[MAX_STACKDEPTH], ro[MAX_STACKDEPTH], beta[MAX_STACKDEPTH];
    };

Figure 3.9: Data structures maintained by a stacked conjugate gradient (CG) solver.

... has been exceeded. Different stacked solvers need to keep track of different algorithm-specific arrays for the intermediate values they compute. For instance, Figure 3.9 shows the variables and arrays comprising the state of a stacked CG implementation, which correspond to the variables used to implement the algorithm in Figure 3.5.

The OSF iteration engine provides the necessary runtime support to keep track of interleaved and non-interleaved vectors. It records which problems have converged and which are still active, and maintains a map of their positions in each stacked vector, which is accessible to the stacked solver code.
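The control flow of the engine in Figure 3.8 can be exercised with a toy solver. The following is a minimal sketch, not the OSF implementation: ToyStackedSolver, EngineResult, and run_engine are hypothetical names, plain member functions stand in for the per-depth function-pointer arrays of Figure 3.7, and each toy problem is assumed to "converge" after a fixed number of iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical toy solver: problem k converges after steps_needed[k] iterations.
class ToyStackedSolver {
public:
    ToyStackedSolver(std::vector<int> steps_needed, int max_iters)
        : remaining_(std::move(steps_needed)), max_iters_(max_iters) {
        for (int r : remaining_) assert(r >= 1);  // every problem needs at least one step
    }

    void init() {}

    // One stacked iteration over all active problems;
    // returns how many problems converged in this step.
    int iterate() {
        int converged = 0;
        for (int &r : remaining_)
            if (--r == 0) ++converged;
        return converged;
    }

    int get_max_iterations() const { return max_iters_; }

    // Drop the problems whose remaining step count has reached zero.
    void eject_converged_problems() {
        remaining_.erase(std::remove_if(remaining_.begin(), remaining_.end(),
                                        [](int r) { return r == 0; }),
                         remaining_.end());
    }

private:
    std::vector<int> remaining_;
    int max_iters_;
};

struct EngineResult {
    int iterations;       // iterations actually performed
    int remaining_depth;  // problems still on the stack at exit
};

// Same loop structure as Figure 3.8: iterate until the stack is empty
// or the iteration limit is reached, shrinking the stack as problems converge.
EngineResult run_engine(ToyStackedSolver &solver, int stackdepth) {
    solver.init();
    int nr_iters = 0;
    while (stackdepth > 0 && nr_iters < solver.get_max_iterations()) {
        ++nr_iters;
        int nr_converged = solver.iterate();
        if (nr_converged > 0) {
            solver.eject_converged_problems();
            stackdepth -= nr_converged;
        }
    }
    return {nr_iters, stackdepth};
}
```

With four toy problems needing 3, 3, 5, and 2 steps, run_engine performs five iterations and returns with an empty stack. With an iteration limit of 4 it stops with one problem still active.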
The stacked solver code can use this map to extract data of converged problems in interleaved vectors and to find the addresses of active segments in non-interleaved vectors. When problems converge and are ejected from the stack, the engine updates this map and automatically performs the necessary compression of all interleaved vectors. Consequently, developers of stacked solvers can focus on the iterative algorithm and do not need to reimplement suitable representations of their data.

3.2 Multi-Process OSF Implementation

Rewriting existing codes to solve multiple problems, for all possible stack depths, would require a significant amount of code modification, including changes to APIs and the representations of internal data structures. To avoid this burden, we implemented OSF as a multi-process framework in which multiple processes, each solving a single problem, combine to solve these problems in a stacked operation. This coordination takes place when each program calls into a stacked solver and is entirely transparent to the application code.

As most existing scientific codes solve a single problem at a time, integration of a stacked version of solvers into existing codes becomes straightforward, requiring only that the call sites be renamed to refer to the stacked version of a function. For instance, a call to a conjugate gradient solver, cg, with signature

    cg(double *AA, double *b, double *x, ...)

must be replaced with a call to its stacked counterpart, osf_cg, with identical signature

    osf_cg(double *AA, double *b, double *x, ...)

and does not require changes to the representations of AA, b, or x.

3.2.1 Collective Operation by Multiple Processes

To initiate operation stacking, users execute a driver program (osfrun), which simultaneously starts stackdepth participating processes, one for each of the stackdepth individual problems. Each process inputs its own data set.
When the participating processes call a stacked solver function, they cooperate and synchronize to perform a collective computation. The data segments contributed by each of the processes must be arranged and combined according to the requirements of the solver implementation. This arrangement requires data to be copied from the private memory of processes to a shared memory location, some in interleaved form. Since we implement stacked solvers, rather than stacked SMVM kernels, the cost of this data copying and synchronization is paid only once for each invocation of a stacked solver.

Though OSF involves multiple processes, all stacked computations are single-threaded. A leader process is elected to perform the stacked computation, e.g., the stacked CG algorithm in Figure 3.2, on behalf of the remaining processes. Before the stacked computation, each process contributes its data into a shared memory area on which the leader operates. Processes that are not the leader wait on a per-process semaphore for the leader to complete the computation of their problem, without consuming any computational resources themselves.

OSF starts operations by assigning ranks to all participating processes. The rank of a process determines the position of its data in each stacked array. When one or more problems converge, the leader wakes up the corresponding processes, which then copy the solutions to their private memory spaces. To ensure that the leader can safely continue with the remaining problems in the stack, the leader and the converged processes synchronize using an n-way barrier when the solutions have been copied out. At this point, the leader ejects converged problems from the stack, compresses the stacked data as necessary, and continues iterations at the reduced stack depth.
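The copy-in by rank and the compression after ejection can be made concrete with a small sketch. It rests on an assumption that this excerpt does not spell out: element i of the problem at stack position pos is stored at index i * depth + pos of the interleaved vector. The function names contribute, extract, and compress are hypothetical, and an ordinary std::vector stands in for the shared memory area.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy one problem's private vector into its slot of an interleaved stacked vector.
// Assumed layout: element i of the problem at position pos lives at i * depth + pos.
void contribute(std::vector<double> &stacked, std::size_t depth, std::size_t pos,
                const std::vector<double> &priv) {
    assert(pos < depth && stacked.size() == priv.size() * depth);
    for (std::size_t i = 0; i < priv.size(); ++i)
        stacked[i * depth + pos] = priv[i];
}

// Copy the data of the problem at position pos back out of the stacked vector.
std::vector<double> extract(const std::vector<double> &stacked, std::size_t depth,
                            std::size_t pos) {
    assert(depth > 0 && pos < depth && stacked.size() % depth == 0);
    std::vector<double> out(stacked.size() / depth);
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = stacked[i * depth + pos];
    return out;
}

// Remove the positions flagged as converged and close the gaps in place.
// Returns the reduced stack depth. The write index never overtakes the
// read index, so the in-place copy is safe.
std::size_t compress(std::vector<double> &stacked, std::size_t depth,
                     const std::vector<bool> &converged) {
    assert(depth > 0 && converged.size() == depth && stacked.size() % depth == 0);
    const std::size_t n = stacked.size() / depth;
    std::size_t write = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < depth; ++k)
            if (!converged[k]) stacked[write++] = stacked[i * depth + k];
    stacked.resize(write);
    std::size_t new_depth = 0;
    for (std::size_t k = 0; k < depth; ++k)
        if (!converged[k]) ++new_depth;
    return new_depth;
}
```

For three problems of length two, ejecting the middle one turns the stacked vector {1, 10, 100, 2, 20, 200} into {1, 100, 2, 200} at depth 2, and the problem that was at position 2 is now found at position 1. This shift is why a position map has to be updated whenever problems are ejected.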
If the leader’s problem is among the converged problems, then a new leader must be selected to continue the stacked iterations before the outgoing leader ejects itself from the stack. The current OSF implementation chooses the non-converged process with the highest rank as the new leader. The outgoing leader signals the semaphore of this new leader, which then wakes up and takes over the stacked operation. OSF keeps problem-specific critical information, such as the iteration step and the current stack depth, in an internal struct, which is passed from one leader to the next.

Our implementation relies on the shared memory and semaphore capabilities described by the POSIX 1003.1b standard [Ame94]. We also use shared pthread barriers, which are an optional part of the same standard. Recent versions of Linux support this feature; on other platforms, we could implement barriers using shared semaphores or condition variables.

3.2.2 Example Scenario

Figure 3.10 illustrates an example OSF scenario for the stacked CG algorithm with four processes. At time step 1, all processes have entered into a stacked function. In this example, the participating processes’ problems converge in the following order: first, processes #1 and #2 converge at the same step, then #4, then #3. The first process to arrive at time step 1 creates shared memory sections that are later filled with data from participating processes. All other processes copy their data into the shared [...]

[Figure: Initialization Phase, Process 2]
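The leader hand-off rule of Section 3.2.1 (the non-converged process with the highest rank takes over) can be sketched as a pure function over a convergence map. This is an illustration only: StackState, choose_new_leader, and eject_and_hand_off are hypothetical names, ranks are assumed to be 0-based, and the semaphore signaling and barrier synchronization of the real implementation are omitted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the internal struct passed from leader to leader.
struct StackState {
    int iteration_step;
    int stack_depth;
    int leader_rank;  // -1 when no active problem remains
};

// Highest-ranked process whose problem has not converged, or -1 if none is left.
int choose_new_leader(const std::vector<bool> &converged) {
    for (int rank = static_cast<int>(converged.size()) - 1; rank >= 0; --rank)
        if (!converged[rank]) return rank;
    return -1;
}

// Mark the given ranks as converged, shrink the stack, and hand leadership
// over if the current leader's own problem is among those that converged.
StackState eject_and_hand_off(StackState state, std::vector<bool> &converged,
                              const std::vector<int> &newly_converged) {
    for (int rank : newly_converged) {
        assert(rank >= 0 && rank < static_cast<int>(converged.size()));
        assert(!converged[rank]);
        converged[rank] = true;
        --state.stack_depth;
    }
    if (state.leader_rank >= 0 && converged[state.leader_rank])
        state.leader_rank = choose_new_leader(converged);
    return state;
}
```

Replaying the convergence order of the example scenario, under the assumptions that processes #1 to #4 hold ranks 0 to 3 and that rank 0 is the initial leader, leadership passes to rank 3 when ranks 0 and 1 converge, then to rank 2 when rank 3 converges, and no leader remains once rank 2 converges.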


Cite this paper

@inproceedings{Belgin2011StructurebasedOF,
  title={Structure-based Optimizations for Sparse Matrix-Vector Multiply},
  author={Mehmet Belgin and Kirk W. Cameron and Serkan Gugercin and Adrian Sandu and M{\"u}zeyyen Erg{\"u}n},
  year={2011}
}