A Graph-based Approach for Semantic Data Mining
Haishan Liu
Committee: Dejing Dou (chair), Arthur Farley, Allen Malony, Daniel Lowd, Don Tucker
Dissertation Defense(May 2024)
Keywords:

Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It is widely acknowledged that the role of domain knowledge in the discovery process is essential. However, the synergy between domain knowledge and data mining is still at a rudimentary level. This motivates me to develop a framework for explicit incorporation of domain knowledge in a data mining system so that insights can be drawn from both data and domain knowledge. I call such technology "semantic data mining."

Recent research in knowledge representation has led to mature standards such as the Web Ontology Language (OWL) by the W3C’s Semantic Web initiative. Semantic Web ontologies have become a key technology for knowledge representation and processing. The OWL ontology language is built on the W3C’s Resource Description Framework (RDF) that provides a simple model to describe information resources as a graph. On the other hand, there has been a surge of interest in tackling data mining problems where objects of interest can be best described as a graph of interrelated nodes. I notice that the interface between domain knowledge and data mining can be achieved by using graph representations. Therefore I explore a graph-based approach for modeling both knowledge and data and for analyzing the combined information source from which insight can be drawn systematically.

In summary, I make three main contributions in this dissertation to achieve semantic data mining. First, I develop an information integration solution based on metaheuristic optimization when data mining task require accessing heterogeneous data sources. Second, I describe how a graph interface for both domain knowledge and data can be structured by employing the RDF model and its graph representations. Finally, I describe several graph theoretic analysis approaches for mining the combined information source. I showcase the utility of the proposed methods on finding semantically associated itemsets, a particular case of the frequent pattern mining. I believe these contributions in semantic data mining can provide a novel and useful way to incorporate domain knowledge.

This dissertation includes published and unpublished coauthored material.