Privacy Characterization and Quantification in Data Publishing
The increasing interest in collecting and publishing large amounts of individuals' data to the public for purposes such as medical research, market analysis, and economic measures has created major privacy concerns about their sensitive information. To address these concerns, many Privacy-Preserving Data Publishing (PPDP) schemes have been proposed in the literature. However, they lack a proper privacy characterization. As a result, the existing schemes fail to provide reliable privacy loss quantification metrics and thus fail to correctly model the utility-privacy tradeoff. In this thesis, we first present a novel multi-variable privacy characterization model. Based on this model, we analyze the prior and posterior adversarial beliefs about attribute values of individuals. We then show that privacy should not be measured with a single metric and demonstrate how doing so can lead to privacy misjudgment. We propose two distinct metrics for quantifying privacy loss. Using these metrics and the proposed framework, we evaluate some of the most well-known PPDP techniques. The proposed metrics and data publishing framework are then used to build a negotiation-based data disclosure model that jointly addresses the utility requirements of the Data User (DU) and the privacy and, possibly, monetary requirements of the Data Owner (DO). Data utility is redefined from the DU's rather than the DO's perspective. Based on the proposed model, we present two data disclosure scenarios that satisfy a given privacy constraint while achieving the DU's required data utility level; the two scenarios are motivated by whether the DO pursues a flat or a variable monetary rate objective. This model fills the gap between the existing theoretical work and the ultimate goal of practicality.

The data publisher is required to guarantee that users' records cannot be de-identified from the published datasets. This requirement directly determines the levels of data generalization and the techniques by which data is anonymized. Machine Learning (ML), one of today's most transformative technologies, relies mainly on data, and the more generalized the data, the less accurate the resulting ML model. Although this is well understood, we lack a model that quantifies such degradation in ML model accuracy as a consequence of the privacy constraints. To model this tradeoff, we provide the first framework that quantifies not only the privacy losses in data publishing but also the utility losses in machine learning applications that result from meeting the privacy constraints.

To further extend our research and demonstrate its applicability to real industry settings, the proposed tradeoff management framework is then applied to a large-scale employee dataset from Barracuda Networks, a leading cybersecurity company. A privacy-preserving Account Takeover (ATO) detection algorithm is proposed to predict whether email account logins are fraudulent and thus detect possible ATO attacks. The results show how model accuracy in the binary classification of logins varies when models are trained on datasets that satisfy different privacy constraints. The proposed framework enables a data owner to quantitatively manage the utility-privacy tradeoff and provides deeper insight into the value of the released data as well as the potential privacy losses incurred upon publishing.
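As a rough illustration of the prior/posterior adversarial belief analysis and the idea of quantifying privacy loss with more than one metric, the Python sketch below uses assumed, simplified definitions: beliefs are empirical frequencies of a sensitive attribute, and the two loss measures are an additive and a multiplicative belief gain. The attribute names, sample data, and metric formulas are illustrative assumptions, not the thesis's actual model.

```python
# Minimal sketch (hypothetical): prior vs. posterior adversarial belief about a
# sensitive attribute, and two simple privacy-loss metrics. Definitions here are
# illustrative assumptions, not the thesis's formulations.
from collections import Counter

def belief(records, sensitive_value):
    """Adversary's belief = empirical probability of the sensitive value."""
    counts = Counter(r["disease"] for r in records)
    total = sum(counts.values())
    return counts[sensitive_value] / total if total else 0.0

# Prior knowledge: distribution over the full population.
population = [{"disease": d} for d in ["flu"] * 6 + ["cancer"] * 2 + ["none"] * 12]
# Posterior knowledge: the published equivalence class the target is known to fall into.
equivalence_class = [{"disease": d} for d in ["cancer", "cancer", "flu", "none"]]

prior = belief(population, "cancer")             # 0.10 in this toy example
posterior = belief(equivalence_class, "cancer")  # 0.50 in this toy example

# Two illustrative privacy-loss metrics: additive and multiplicative belief gain.
additive_loss = posterior - prior
multiplicative_loss = posterior / prior if prior > 0 else float("inf")

print(f"prior={prior:.2f}, posterior={posterior:.2f}")
print(f"additive loss={additive_loss:.2f}, multiplicative loss={multiplicative_loss:.2f}x")
```

A single number can hide risk: two releases with the same additive gain can differ sharply in multiplicative gain for rare attribute values, which is one way measuring privacy with only one metric can lead to misjudgment.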
- In Collections: Electronic Theses & Dissertations
- Copyright Status: Attribution 4.0 International
- Material Type: Theses
- Authors: Ibrahim, Mohamed Hossam Afifi
- Thesis Advisors: Ren, Jian
- Committee Members: Deb, Kalyanmoy; Li, Tongtong; Enbody, Richard J.
- Date Published: 2021
- Program of Study: Electrical Engineering - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 138
- Permalink: https://doi.org/doi:10.25335/wwzj-sz04