SparkGalaxy: Workflow-based Big Data Processing
Sara Riazi
Committee: Boyana Norris (chair), Stephen Fickas, Dejing Dou
Directed Research Project (Mar 2016)
Keywords: Big data processing, workflow system, Spark, Galaxy

We introduce SparkGalaxy, a big data processing toolkit that encodes complex data science experiments as sets of high-level workflows. SparkGalaxy combines the Spark big data processing platform with the Galaxy workflow management system to offer tools for graph processing and machine learning through a novel interaction model for creating and using complex workflows. SparkGalaxy contributes an easy-to-use interface and scalable algorithms for data science. We demonstrate the use of SparkGalaxy in the analysis of large social networks and in other case studies.