29. PERFORMANCE MODELING TOOLS FOR PARALLEL LINEAR ALGEBRA COMPUTATIONS

Department: Computer Science & Engineering
Faculty Advisor(s): Scott B. Baden

Primary Student
Name: Pietro Cicotti
Email: pcicotti@ucsd.edu
Phone: 858-534-9916
Grad Year: 2009

Abstract
The performance of algorithms for solving irregular numerical problems depends on both the system and the input characteristics. These characteristics are hard to represent with analytical performance models. We consider parallel sparse linear algebra computations which often include both memory bound phases and processor bound phases. In addition, the sparsity pattern is particularly important in determining the amount and the granularity of communication as well as the work load distribution.

Simulation-based performance models offer the potential to take into account input characteristics and to reproduce relevant system behaviors. We developed a library of Performance Modeling Tools (PMT) for simulation-based performance models. In models built using PMT, the computation is represented by a sequence of memory operations, calls to linear algebra routines (or more in general sequential compute intensive routines), and communication primitives. A model simulates the steps of the algorithm as driven by the input and charging for the cost of each operation.

We used PMT to model the parallel sparse LU factorization implemented in the SuperLU_DIST package. The model is based on BLAS calls, memory accesses to retrieve blocks of data, and point-to-point data communication. We validated the model on a test suite of 8 matrices from the University of Florida Sparse Matrix Collection. We run our experiments on a Power5 machine at the National Energy Research Scientific Computing center using up to 128 processors. Results show that in most of the cases the predicted factorization time is accurate within a 10% error.

We used the model to predict the optimal shape of the processor grid when using 16 processors. In most of the cases the model was able to correctly sort the processor grid shapes by execution time. Finally we modeled a variation of the algorithm that uses a different communication strategy. The original algorithm performs a series of broadcasts during each panel factorization. We propose to use asynchronous point-to-point communication instead of the broadcasts. We modeled the different strategy and the model suggested an improvement in the running time. When implemented the optimization resulted in a 10.5% average speedup on our test suite.

« Back to Posters or Search Results