Online Monitoring for High-Performance Computing Systems
Chad Daniel Wood
Committee: Allen Malony (chair), Boyana Norris, Hank Childs
Area Exam (Feb 2021)
Keywords: online, runtime, monitoring, observability, introspection, high-performance computing, HPC, in situ, scalability, exporting, storage, provenance, logging, ensembles, code-coupling, workflows, interactivity, optimization, tuning

In this work we explore the area of online monitoring systems in high-performance computing. This area of research is increasingly important as software and machines grow in scale and architectural complexity. We begin by outlining the terms of art and the scope of the area being considered. We then provide a high-level overview of online monitoring within the context of high-performance computing, including its various subtopics. Significant features of each subtopic are discussed, as well as the reasoning behind integrating these topics into a holistic area of research. This leads into a deeper discussion of the special constraints imposed by high-performance computing and of how various solutions have evolved along with this unique computational landscape. We then survey current and prior tools and techniques for online monitoring. Finally, we end with a brief discussion of open research areas for significant future efforts in this domain.