FLAMENCO: A Programmatic Labeling and Sharing Framework for Internet Data Science
Yukhe Lavinia
Committee: Ram Durairajan (chair), Reza Rejaie (chair), Daniel Lowd, Thien Nguyen
Directed Research Project (Dec 2021)
Keywords: training labels, labeling function, data programming, internet data science, unlabeled training data, weak supervision, machine learning, weak supervised learning, networking data, ddos ntp, latency measurement

The success of Internet Data Science depends on the availability of high-quality labeled data (e.g., the onset of a DDoS attack in a NetFlow log). Equally critical is the ability to share that data with others while respecting the data owners' privacy concerns. Unfortunately, short of applying the data-to-code paradigm (i.e., actual sharing of data), researchers lack a systematic framework for working with or benefiting from data while being mindful of privacy concerns. As a result, Internet Data Science as practiced today is not amenable to leveraging the collective domain knowledge of the community for important ML-related activities such as (i) high-quality data labeling at scale, (ii) sharing of domain knowledge in a privacy-preserving manner, and (iii) creating a viable roadmap for adoption of ML models by operators, which is currently hampered by the limited ability to interpret trained models. We propose a novel code-to-data approach whose goals are to preserve data ownership, protect privacy in collaborations, facilitate independent validation of each other's findings, and enable the interpretability of trained ML models. Here, code refers to labeling functions, which we view as programmatic representations of operators' domain knowledge for identifying events of interest in network data. The key novelty of our approach is that it entails only the sharing of code, with no sharing of raw data, curated data, or trained ML models. We substantiate our approach by building FLAMENCO, a novel weak supervision-based framework for collaboratively labeling network data at scale while being mindful of data owners' privacy concerns. We demonstrate the efficacy of FLAMENCO by labeling diverse networking data programmatically, enabling privacy-preserving collaboration among researchers using those data, and facilitating the interpretability of models trained on those data.
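
To make the notion of a labeling function concrete, the following is a minimal sketch in Python and not FLAMENCO's actual code. It assumes a flow record is a dictionary with hypothetical field names (proto, src_port, dst_port, bytes, packets), uses an illustrative 400-byte threshold, and combines votes with a simple majority vote as a stand-in for whatever label model a weak supervision framework would use. Only these functions would be shared between collaborators; the flow records stay with the data owner.

```python
from collections import Counter

# Label values; ABSTAIN means the function has no opinion on this flow.
ABSTAIN, BENIGN, ATTACK = -1, 0, 1


def lf_ntp_source_port(flow):
    """NTP amplification replies are sent from UDP source port 123."""
    if flow["proto"] == "udp" and flow["src_port"] == 123:
        return ATTACK
    return ABSTAIN


def lf_large_udp_packets(flow):
    """Flag UDP flows with large average packet size.

    The 400-byte threshold is an illustrative assumption, not a value
    taken from the source.
    """
    if flow["proto"] != "udp" or flow["packets"] <= 0:
        return ABSTAIN
    return ATTACK if flow["bytes"] / flow["packets"] > 400 else ABSTAIN


def lf_tcp_web(flow):
    """Treat TCP traffic to common web ports as benign."""
    if flow["proto"] == "tcp" and flow["dst_port"] in (80, 443):
        return BENIGN
    return ABSTAIN


LABELING_FUNCTIONS = [lf_ntp_source_port, lf_large_udp_packets, lf_tcp_web]


def majority_label(flow, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain on no votes or ties."""
    votes = [v for v in (lf(flow) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Hypothetical flow records, for illustration only.
flows = [
    {"proto": "udp", "src_port": 123, "dst_port": 53211, "bytes": 46800, "packets": 100},
    {"proto": "tcp", "src_port": 51000, "dst_port": 443, "bytes": 12000, "packets": 20},
    {"proto": "udp", "src_port": 5353, "dst_port": 5353, "bytes": 300, "packets": 3},
]
labels = [majority_label(f) for f in flows]
```

In this sketch each function encodes one piece of operator knowledge and may abstain, so coverage and disagreement between functions can be inspected directly, which is also what makes the resulting labels easier to interpret than an opaque classifier.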