DATA-CENTRIC PRIVACY-PRESERVING MACHINE LEARNING
With the rapid popularization of machine learning (ML), privacy emerges as a critical obstacle to distilling knowledge from sensitive data into models. Traditional machine learning methods aggregate data from millions of clients (e.g., edge devices) and train models on the pooled data, creating tremendous risks of leaking data providers' sensitive information. In this thesis, we develop learning algorithms that protect data providers' privacy under three different ways of using data, moving from centralized to distributed settings. First, we investigate private centralized learning (CL) under the rigorous notion of differential privacy (DP), where we search for dynamic per-iteration privacy allocations in gradient descent that yield higher model utility. Our theoretical analysis shows that optimal privacy allocation can improve the sample efficiency of DP learning. Although the data can be protected during learning, CL must assume that the data-managing institution is trustworthy, an assumption that lacks technical guarantees.

Rather than aggregating data, federated learning (FL) coordinates clients to periodically share models trained on their local data. Confronting the substantial data and device heterogeneity among clients in FL, we propose novel algorithms that effectively train models across clients with heterogeneous data distributions and device capabilities. One of our methods enhances knowledge transfer from a supervised domain to an unsupervised domain and can reduce the performance gap between clients from different social groups. We also develop a hardware-adaptive learning algorithm that makes FL inclusive for devices with varied capabilities. Beyond training, this algorithm enables models to be customized at test time to accommodate dynamic computation budgets. Although FL mitigates privacy risks by keeping data distributed, locally training large models can still impose a significant burden on resource-limited edge devices.
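The idea of dynamic privacy allocation in DP gradient descent can be illustrated with a minimal sketch: each iteration clips the gradient and adds Gaussian noise whose scale follows a per-step schedule, so some steps spend more of the privacy budget (less noise) than others. The function names and the particular decreasing schedule below are hypothetical illustrations, not the thesis's actual algorithm or analysis.

```python
import numpy as np

def dp_gd_dynamic(grad_fn, w0, noise_schedule, clip=1.0, lr=0.1):
    """Gradient descent with per-step Gaussian noise whose scale follows a
    (hypothetical) dynamic privacy-allocation schedule: steps with smaller
    sigma consume more privacy budget but perturb the update less."""
    w = w0.astype(float)
    rng = np.random.default_rng(0)
    for sigma in noise_schedule:
        g = grad_fn(w)
        g = g / max(1.0, np.linalg.norm(g) / clip)      # clip gradient to norm <= clip
        g = g + rng.normal(0.0, sigma * clip, size=g.shape)  # noise calibrated to the clip bound
        w = w - lr * g
    return w

# toy quadratic: minimize ||w - 1||^2, whose gradient is 2(w - 1)
sched = np.linspace(0.5, 0.1, 50)  # one possible schedule: later steps get more budget
w = dp_gd_dynamic(lambda w: 2 * (w - np.ones(3)), np.zeros(3), sched)
```

Under a fixed total privacy budget, choosing how to distribute the per-step sigmas over iterations is exactly the allocation problem the thesis studies theoretically.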
Finally, observing the limitations of CL in data-storage privacy and of FL in on-device computation, we propose a new computation paradigm: outsourcing training without uploading data (OT). To learn effective knowledge about the private data, we sample a proximal proxy dataset from open-source data for cloud training. Our method efficiently and effectively identifies similar samples in privacy-free open-source data and thereby transfers the computation cost of training to the cloud server.
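A minimal sketch of the proxy-sampling idea: rank open-source samples by how close their features lie to the private data's overall feature statistics, and send only the selected proxy subset for cloud training. The nearest-to-mean criterion and all names here are illustrative assumptions; the thesis's actual selection rule and features may differ.

```python
import numpy as np

def select_proxy(private_feats, open_feats, k):
    """Pick the k open-source samples whose features lie closest to the
    private data's mean embedding (a hypothetical proximity criterion)."""
    center = private_feats.mean(axis=0)
    dists = np.linalg.norm(open_feats - center, axis=1)
    return np.argsort(dists)[:k]  # indices of the k most "proximal" samples

rng = np.random.default_rng(0)
private = rng.normal(loc=2.0, size=(100, 8))      # private features: stay on-device
open_pool = rng.normal(loc=0.0, size=(1000, 8))   # public, privacy-free candidates
idx = select_proxy(private, open_pool, k=50)
proxy = open_pool[idx]  # only this proxy subset is used for cloud training
```

Because only publicly available proxy samples leave the device, the cloud can run the expensive training while the private data never needs to be uploaded.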
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Hong, Junyuan
- Thesis Advisors: Zhou, Jiayu
- Committee Members: Jain, Anil; Wang, Zhangyang; Liu, Sijia
- Date Published: 2023
- Subjects: Computer science
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 139 pages
- Permalink: https://doi.org/doi:10.25335/m5tw-ym70