# Invited Speakers

**Christophe Monat**, **Christian Bertin** (STMicroelectronics) & **Claude-Pierre Jeannerod** (INRIA), Grenoble and Lyon, France.

*Techniques and tools for implementing IEEE 754 floating-point arithmetic on VLIW integer processors.*

*Abstract:*
Recently, some high-performance IEEE 754 single-precision floating-point software has been designed that aims at best exploiting features (integer arithmetic, parallelism) of the STMicroelectronics ST200 Very Long Instruction Word (VLIW) processor. We review here the techniques and software tools used or developed for this design and its implementation, and how they allowed very high instruction-level parallelism (ILP) exposure. The key points include a hierarchical description of function evaluation algorithms, the exploitation of the standard encoding of floating-point data, the automatic generation of fast and accurate polynomial evaluation schemes, and some compiler optimizations.
**Erich L. Kaltofen**, North Carolina State University, Raleigh, USA.

*15 years after DSC and WLSS2: what parallel computations I do today.*

*Abstract:* A second wave of parallel and distributed computing research is rolling in.
Today's multicore / multiprocessor computers facilitate everyone's parallel
execution. In the mid 1990s, manufacturers of expensive mainframe parallel
computers faltered and computer science focused on the Internet and the
computing grid. After a ten-year hiatus, the Parallel Symbolic Computation
Conference (PASCO) is awakening with new vigor.

I shall look back on the highlights of my own research on theoretical and
practical aspects of parallel and distributed symbolic computation, and forward
to what is to come by example of several current projects. An important
technique in symbolic computation is the evaluation / interpolation paradigm,
and multivariate sparse polynomial parallel interpolation constitutes a
keystone operation, for which we present a new algorithm. Several
embarrassingly parallel searches for special polynomials and exact
sum-of-squares certificates have exposed issues in even today's multiprocessor
architectures. Solutions are in both software and hardware. Finally, we
propose the paradigm of interactive symbolic supercomputing, a symbolic
computation environment analog of the STAR-P Matlab platform.
**Stephen T. Lewin-Berlin**,
Quanta Research Cambridge, USA.

*Exploiting Multicore Systems with Cilk*

*Abstract:*
Cilk is a language extension to C and C++ designed to simplify
programming shared-memory multiprocessor systems. By maintaining
serial semantics, Cilk programs are relatively easy to test and debug,
and they yield well to formal analysis of performance and scalability.
A novel construct called "reducer hyperobjects" provides an elegant
mechanism for removing race conditions without introducing locks while
preserving serial semantics.

This talk introduces Cilk programming for C++ programmers. I will
describe the three new keywords used to parallelize C/C++ code. I
will explain how to develop efficient Cilk programs and how to analyze
parallel performance using the Cilkview scalability analyzer. I will
describe the Cilkscreen race detector which tests for the presence of
race conditions. I will discuss the various options provided by Cilk
for eliminating races, including using hyperobjects. In addition, I
will overview Cilk's work-stealing scheduler.
# Tutorials

**Jeremy Johnson**, Drexel University, Philadelphia, USA.

*Automatic Performance Tuning.*

*Abstract:* This tutorial presents automated techniques for implementing and
optimizing numeric and symbolic libraries on modern computing platforms
including SSE, multicore, and GPU. Obtaining high performance requires
effective use of the memory hierarchy, short vector instructions, and
multiple cores. Highly tuned implementations are difficult to obtain
and are platform dependent. For example, the Intel Core i7 980 XE has a
peak floating-point performance of over 100 GFLOPS and the NVIDIA
Tesla C870 has a peak floating-point performance of over 500 GFLOPS;
however, achieving close to peak performance on such platforms is
extremely difficult. Consequently, automated techniques are now
being used to tune and adapt high-performance libraries such as ATLAS
(math-atlas.sourceforge.net) for dense linear algebra, OSKI
(bebop.cs.berkeley.edu/oski) for sparse linear algebra, FFTW
(www.fftw.org) for the fast Fourier transform (FFT), and SPIRAL
(www.spiral.net) for a wide class of digital signal processing (DSP)
algorithms. Intel currently uses SPIRAL to generate parts of their
MKL and IPP libraries.

The first part of the tutorial will review SSE, multicore, and GPU
processors and survey techniques for automated performance tuning of
linear algebra and FFT kernels. The second part of the tutorial will
present a detailed case study using the Walsh-Hadamard transform,
a simple DSP transform related to the FFT.
**Daniel Kunkle**,
Northeastern University, Boston, USA.

*Roomy: A System for Space Limited Computations.*

*Abstract:* There are numerous examples of problems in symbolic algebra in which the
required storage grows far beyond the limitations even of the distributed RAM
of a cluster. Often this limitation determines how large a problem one can
solve in practice. Roomy provides a minimally invasive system to modify the
code for such a computation, in order to use the local disks of a cluster or a
SAN as a transparent extension of RAM.

Key to this capability is the observation that 50 parallel disks have
approximately the same bandwidth as a single RAM subsystem. Thus, Roomy uses a
cluster not for the sake of a parallel speedup, but rather to continue
computing at a reasonable speed even when the working set of the computation
has grown far beyond the aggregate RAM of a cluster. To compensate for the
significantly higher latency of disks, Roomy delays random access operations
and performs them efficiently in batch.

Roomy is implemented as a C/C++ library. It provides some simple data
structures (arrays, unordered lists, and hash tables). Some typical programming
constructs that one might employ in Roomy are: map, reduce, duplicate
elimination, chain reduction, pair reduction, and breadth-first search. All
aspects of parallelism and remote I/O are hidden within the Roomy library.

This tutorial will provide: a brief motivation for using disks as main working
memory; the Roomy programming model for parallel disk-based computation; and
some example Roomy programs.
**Sivan Toledo**,
Tel-Aviv University, Tel-Aviv, Israel.

*Combinatorial Scientific Computing*

*Abstract:*
Many large-scale scientific simulations face challenges that are
combinatorial in nature. These include graph and hypergraph partitioning
(e.g., for parallel sparse matrix-vector multiplication and for sparse
matrix factorizations), graph coloring (used in automatic differentiation),
matching (used in static pivoting), and more. Combinatorial Scientific
Computing (CSC) is the study of these problems and of the algorithms that
solve them.

The tutorial will provide a quick overview of CSC, but most of it will be
devoted to three related problems within CSC that I have worked on. In each
problem area, I will describe some easy-to-follow basic results, some
results that I worked on, and some remaining challenges.

The first problem that I will talk about is sparse matrix factorization.
This is a classical problem in which the connections between the numerics
and the combinatorial structures have been studied since the 1970s. I will
explain the basics of sparse elimination and fill-reducing orderings, as
well as topics of current interest: sparse symmetric indefinite
factorizations, static pivoting techniques, and sparse singular-value
decompositions.

The next part of the talk will focus on combinatorial preconditioners. I
will describe the relationships between iterative solvers for sparse systems
of equations and pairs of graphs, the algorithms that exploit these
relationships, and applications of these techniques.

The last part of the talk will focus on recently-discovered uses of
randomization in numerical linear algebra. This part of the talk will again
cover some basic notions but will then move on to state-of-the-art results
and recent solvers that my group has produced.