Colloquium Details

Fault Tolerance for High Performance Computing: Is the Sky Falling?

Author:	Kathryn Mohror, Lawrence Livermore National Laboratory
Date:	December 03, 2015
Time:	15:30
Location:	220 Deschutes

Abstract

High-end supercomputing systems generally achieve higher computing speeds by increasing the number of computing cores in the system. While FLOP goals can be reached with this strategy, a larger number of system components brings a higher failure rate. Today, systems experience failures at intervals on the order of hours or days; on future exascale systems, however, failures could occur at intervals on the order of minutes to several hours.

In this talk, I will give an overview of the problem of fault tolerance on high performance computing systems, including current methods for mitigating failures. Then, I’ll discuss how we expect these failure mitigation methods to evolve and perform on future systems.

Biography

Kathryn Mohror is a computer scientist on the Scalability Team at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory. Kathryn’s research on high-end computing systems is currently focused on scalable fault-tolerant computing and I/O for extreme-scale systems. Her other research interests include scalable performance analysis and tuning, and parallel programming paradigms. Kathryn has been working at LLNL since 2010.

Kathryn’s current research focuses primarily on the Scalable Checkpoint/Restart Library (SCR), a multilevel checkpointing library that has been shown to significantly reduce checkpointing overhead. She also leads the Tools Working Group for the MPI Forum.

Kathryn received her Ph.D. in Computer Science in 2010, an M.S. in Computer Science in 2004, and a B.S. in Chemistry in 1999 from Portland State University (PSU) in Portland, OR.