DATA-CENTRIC PRIVACY-PRESERVING MACHINE LEARNING
With the rapid popularization of machine learning (ML), privacy emerges as a critical obstacle to distilling knowledge from sensitive data into models. Traditional machine learning methods aggregate data from millions of clients (e.g., edge devices) and train models on the pooled data, creating tremendous risks of leaking data providers' sensitive information. In this thesis, we develop learning algorithms that protect data providers' privacy under three different ways of using data, moving from centralized to distributed settings. First, we investigate private centralized learning (CL) under the rigorous notion of differential privacy (DP), where we search for dynamic per-iteration privacy allocations in gradient descent that yield higher model utility. Our theoretical analysis shows that optimal privacy allocation can improve the sample efficiency of DP learning. Although the data can be protected during learning, CL must assume that the data-managing institution is trustworthy, an assumption that lacks technical guarantees.

Rather than aggregating data, federated learning (FL) coordinates clients to periodically share models trained on their local data. Confronting the substantial data and device heterogeneity among clients in FL, we propose novel algorithms that effectively train models across clients with heterogeneous data distributions and device capabilities. One of our methods enhances knowledge transfer from a supervised domain to an unsupervised domain and can reduce the performance gap between clients from different social groups. We also develop a hardware-adaptive learning algorithm that makes FL inclusive for devices with varied capabilities. Beyond training, this algorithm enables models to be customized at test time to accommodate dynamic computation budgets. Although FL mitigates privacy risks by keeping data distributed, locally training large models can still impose a significant burden on resource-limited edge devices.
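The idea of dynamic privacy allocation in DP gradient descent can be illustrated with a minimal sketch: each iteration clips the gradient and adds Gaussian noise whose scale follows a per-step schedule, so some steps spend more of the privacy budget (less noise) than others. The function names and the particular decreasing schedule below are hypothetical illustrations, not the thesis's actual algorithm or analysis.

```python
import numpy as np

def dp_gd_dynamic(grad_fn, w0, noise_schedule, clip=1.0, lr=0.1):
    """Gradient descent with per-step Gaussian noise whose scale follows a
    (hypothetical) dynamic privacy-allocation schedule: steps with smaller
    sigma consume more privacy budget but perturb the update less."""
    w = w0.astype(float)
    rng = np.random.default_rng(0)
    for sigma in noise_schedule:
        g = grad_fn(w)
        g = g / max(1.0, np.linalg.norm(g) / clip)      # clip gradient to norm <= clip
        g = g + rng.normal(0.0, sigma * clip, size=g.shape)  # noise calibrated to the clip bound
        w = w - lr * g
    return w

# toy quadratic: minimize ||w - 1||^2, whose gradient is 2(w - 1)
sched = np.linspace(0.5, 0.1, 50)  # one possible schedule: later steps get more budget
w = dp_gd_dynamic(lambda w: 2 * (w - np.ones(3)), np.zeros(3), sched)
```

Under a fixed total privacy budget, choosing how to distribute the per-step sigmas over iterations is exactly the allocation problem the thesis studies theoretically.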
Finally, observing the limitations of CL in data-storage privacy and of FL in on-device computation, we propose a new computation paradigm: outsourcing training without uploading data (OT). To learn effective knowledge about the private data, we sample a proximal proxy dataset from open-source data for cloud training. Our method efficiently and effectively identifies similar samples in privacy-free open-source data and thereby transfers the computation cost of training to the cloud server.
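A minimal sketch of the proxy-sampling idea: rank open-source samples by how close their features lie to the private data's overall feature statistics, and send only the selected proxy subset for cloud training. The nearest-to-mean criterion and all names here are illustrative assumptions; the thesis's actual selection rule and features may differ.

```python
import numpy as np

def select_proxy(private_feats, open_feats, k):
    """Pick the k open-source samples whose features lie closest to the
    private data's mean embedding (a hypothetical proximity criterion)."""
    center = private_feats.mean(axis=0)
    dists = np.linalg.norm(open_feats - center, axis=1)
    return np.argsort(dists)[:k]  # indices of the k most "proximal" samples

rng = np.random.default_rng(0)
private = rng.normal(loc=2.0, size=(100, 8))      # private features: stay on-device
open_pool = rng.normal(loc=0.0, size=(1000, 8))   # public, privacy-free candidates
idx = select_proxy(private, open_pool, k=50)
proxy = open_pool[idx]  # only this proxy subset is used for cloud training
```

Because only publicly available proxy samples leave the device, the cloud can run the expensive training while the private data never needs to be uploaded.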
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Hong, Junyuan
- Thesis Advisors: Zhou, Jiayu
- Committee Members: Jain, Anil; Wang, Zhangyang; Liu, Sijia
- Date Published: 2023
- Subjects: Computer science
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 139 pages
- Permalink: https://doi.org/doi:10.25335/m5tw-ym70