Empowering Graph Neural Networks from a Data-Centric View

Many learning tasks in Artificial Intelligence (AI) require dealing with graph data, in domains ranging from biology and chemistry to finance and education. As powerful learning tools for graph inputs, graph neural networks (GNNs) have demonstrated remarkable performance in applications such as recommender systems and drug discovery. Recent research has primarily focused on model-centric approaches, which enhance GNN performance by modifying model architectures while keeping the dataset fixed. However, these approaches have limitations, particularly in robustness and scalability. For example, they often yield suboptimal performance when high-quality data is limited. Moreover, training GNNs on large-scale data is often computationally expensive, and the cost becomes prohibitive when numerous models must be trained on the same dataset, as in hyper-parameter and architecture search. Given these data-related challenges, a crucial question arises: Can we address these problems directly from a data perspective?

This dissertation presents a data-centric view that directly optimizes the given dataset to improve the performance of imperfect GNN models. Instead of modifying model architectures, the data-centric view advocates a set of graph dataset optimization techniques that enhance the effectiveness and efficiency of GNNs. First, we demonstrate that the quality of a graph dataset can be improved, enabling GNNs to remain robust against severe noise and attacks. Second, we show that the size of a graph dataset can be substantially reduced while preserving its information, thereby significantly decreasing the training cost. Unlike model-centric approaches, which are typically specific to a single model, data-centric approaches yield improved datasets that benefit various existing models simultaneously. By embracing this data-centric perspective, this dissertation not only addresses crucial challenges associated with data quality and efficiency but also unlocks new opportunities for next-generation AI systems.
