PRIVACY AND INTEGRITY PRESERVING COMPUTATION IN DISTRIBUTED SYSTEMS

By

Fei Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2011

ABSTRACT

PRIVACY AND INTEGRITY PRESERVING COMPUTATION IN DISTRIBUTED SYSTEMS

By Fei Chen

Preserving the privacy and integrity of private data has become a core requirement for many distributed systems that span different parties. In these systems, one party may try to compute or aggregate useful information from the private data of other parties. However, this party may not be fully trusted by the other parties. Therefore, it is important to design security protocols that protect such private data. Furthermore, one party may want to query the useful information computed from such private data. However, query results may be modified by a malicious party. Thus, it is important to design query protocols such that the integrity of query results can be verified. In this dissertation, we study four important privacy and integrity preserving problems for different distributed systems. For two-tiered sensor networks, where storage nodes serve as an intermediate tier between sensors and a sink for storing data and processing queries, we propose SafeQ, a protocol that prevents compromised storage nodes from gaining information from both sensor collected data and sink issued queries, while still allowing storage nodes to process queries over encrypted data and the sink to detect compromised storage nodes when they misbehave. For cloud computing, where a cloud provider hosts the data of an organization and returns query results to the customers of the organization, we propose novel privacy and integrity preserving schemes for multi-dimensional range queries such that the cloud provider can process encoded queries over encoded data without knowing the actual values, and customers can verify the integrity of query results with high probability. For distributed firewall policies, we propose the first privacy-preserving protocol for cross-domain firewall policy optimization. For any two adjacent firewalls belonging to two different administrative domains, our protocol can identify in each firewall the rules that can be removed because of the other firewall. For network reachability, one of the key factors for capturing end-to-end network behavior and detecting violations of security policies, we propose the first cross-domain privacy-preserving protocol for quantifying network reachability.

ACKNOWLEDGMENTS

I am extremely grateful to my advisor Dr. Alex X. Liu. He is not only an excellent advisor and an outstanding researcher, but also a great friend. This thesis would not have been possible without his tremendous help. He always tries his best to guide me through every aspect of my graduate study and gives me tremendous support to succeed in both study and research. He not only teaches me how to identify problems, how to solve problems, and how to build systems, but also helps me improve my writing, speaking, and communication skills. Numerous times, he has set milestones together with me and guided me to make steady progress. When I meet difficult problems in my research, his encouragement helps me gain confidence and come up with solutions.

I would like to thank the other members of my thesis committee, Dr. Eric Torng, Dr. Richard Enbody, and Dr. Hayder Radha.
Dr. Torng not only guided my thesis, but also gave me great help in finding travel funds so that I could present my papers at many conferences. Dr. Enbody and Dr. Radha gave me many valuable suggestions and much feedback on my qualifying report and comprehensive proposal, which helped improve my thesis significantly.

I thank my collaborator and friend Bezawada Bruhadeshwar. He made significant contributions to this thesis work, and working with him has been both enjoyable and fruitful. He actively discussed the work on privacy preserving firewall optimization and privacy preserving network reachability quantification with me and provided significant help in moving the projects forward.

I sincerely thank my wife Xiaoshu Wu and my parents for their great love and tremendous support in every aspect of my life at Michigan State University. They have supported me in pursuing my own goals with confidence and in overcoming obstacles with courage. I am extremely grateful to my wife for taking great care of me when I had serious health problems and stayed in the hospital. This thesis is dedicated to my wife and my parents.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Privacy and Integrity Preserving Queries for Sensor Networks
1.2 Privacy and Integrity Preserving Queries for Cloud Computing
1.3 Privacy Preserving Optimization for Firewall Policies
1.4 Privacy Preserving Quantification of Network Reachability
1.5 Structure of the Dissertation
2 Privacy and Integrity Preserving Range Queries in Sensor Networks
2.1 Introduction
2.1.1 Motivation
2.1.2 Technical Challenges
2.1.3 Limitations of Prior Art
2.1.4 Our Approach and Key Contributions
2.1.5 Summary of Experimental Results
2.2 Models and Problem Statement
2.2.1 System Model
2.2.2 Threat Model
2.2.3 Problem Statement
2.3 Privacy for 1-dimensional Data
2.3.1 Prefix Membership Verification
2.3.2 The Submission Protocol
2.3.3 The Query Protocol
2.3.4 Query Processing
2.4 Integrity for 1-dimensional Data
2.4.1 Integrity Scheme Using Merkle Hash Trees
2.4.2 Integrity Scheme Using Neighborhood Chains
2.5 Queries over Multi-dimensional Data
2.5.1 Privacy for Multi-dimensional Data
2.5.2 Integrity for Multi-dimensional Data
2.6 SafeQ Optimization
2.7 Queries in Event-driven Networks
2.8 Complexity and Security Analysis
2.8.1 Complexity Analysis
2.8.2 Privacy Analysis
2.8.3 Integrity Analysis
2.9 Experimental Results
2.9.1 Evaluation Methodology
2.9.2 Evaluation Setup
2.9.3 Evaluation Results
3 Privacy and Integrity Preserving Range Queries for Cloud Computing
3.1 Introduction
3.1.1 Motivation
3.1.2 Technical Challenges
3.1.3 Limitations of Previous Work
3.1.4 Our Approach
3.1.5 Key Contributions
3.1.6 Summary of Experimental Results
3.2 Models and Problem Statement
3.2.1 System Model
3.2.2 Threat Model
3.2.3 Problem Statement
3.3 Privacy Preserving for 1-dimensional Data
3.3.1 The Order-Preserving Hash-based Function
3.3.2 The Privacy-Preserving Scheme
3.3.3 Optimization of the Order-Preserving Hash-based Function
3.3.4 Analysis of Information Leakage
3.4 Integrity Preserving for 1-dimensional Data
3.4.1 Bit Matrices and Local Bit Matrices
3.4.2 The Integrity-Preserving Scheme
3.5 Finding Optimal Parameters
3.5.1 Detection Probability
3.5.2 Optimal Bucket Partition
3.6 Query Over Multi-dimensional Data
3.6.1 Privacy for Multi-dimensional Data
3.6.2 Integrity for Multi-dimensional Data
3.7 Evaluation
3.7.1 Evaluation Setup
3.7.2 Results for 1-dimensional Data
3.7.3 Results for Multi-dimensional Data
4 Privacy Preserving Cross-Domain Cooperative Firewall Optimization
4.1 Introduction
4.1.1 Background and Motivation
4.1.2 Limitation of Prior Work
4.1.3 Cross-domain Inter-firewall Optimization
4.1.4 Technical Challenges and Our Approach
4.1.5 Key Contributions
4.2 System and Threat Models
4.2.1 System Model
4.2.2 Threat Model
4.3 Privacy-Preserving Inter-Firewall Redundancy Removal
4.3.1 Privacy-Preserving Range Comparison
4.3.2 Processing Firewall FW1
4.3.3 Processing Firewall FW2
4.3.4 Single-Rule Coverage Redundancy Detection
4.3.5 Multi-Rule Coverage Redundancy Detection
4.3.6 Identification and Removal of Redundant Rules
4.4 Firewall Update After Optimization
4.5 Security and Complexity Analysis
4.5.1 Security Analysis
4.5.2 Complexity Analysis
4.6 Experimental Results
4.6.1 Evaluation Setup
4.6.2 Methodology
4.6.3 Effectiveness and Efficiency on Real Policies
4.6.4 Efficiency on Synthetic Policies
5 Privacy Preserving Cross-Domain Network Reachability Quantification
5.1 Introduction
5.1.1 Background and Motivation
5.1.2 Limitation of Prior Art
5.1.3 Cross-Domain Quantification of Network Reachability
5.1.4 Technical Challenges
5.1.5 Our Approach
5.1.6 Summary of Experimental Results
5.1.7 Key Contributions
5.2 Problem Statement and Threat Model
5.2.1 Problem Statement
5.2.2 Threat Model
5.3 Privacy-Preserving Quantification of Network Reachability
5.3.1 Privacy-Preserving Range Intersection
5.3.2 ACL Preprocessing
5.3.3 ACL Encoding and Encryption
5.3.4 ACL Comparison
5.4 Security and Complexity Analysis
5.4.1 Security Analysis
5.4.2 Complexity Analysis
5.5 Optimization
5.6 Experimental Results
5.6.1 Efficiency on Real ACLs
5.6.2 Efficiency on Synthetic ACLs
6 Related Work
6.1 Secure Multiparty Computation
6.2 Privacy and Integrity Preserving in WSNs
6.3 Privacy and Integrity Preserving in DAS
6.4 Firewall Redundancy Removal and Collaborative Firewall Enforcement in VPN
6.5 Network Reachability Quantification
7 Conclusions and Future Work
APPENDICES
A Analysis of SafeQ Optimization
B Properties of f*_k and Their Proof
C Calculation of Detection Probability
D Proof of Theorems 3 and 4
BIBLIOGRAPHY

LIST OF TABLES

2.1 Summary of notation
2.2 Complexity analysis of SafeQ
4.1 Redundancy ratios for 5 real firewall groups

LIST OF FIGURES

2.1 Architecture of two-tiered sensor networks
2.2 The idea of SafeQ for preserving privacy
2.3 Prefix membership verification
2.4 Merkle hash tree for 8 data items
2.5 Data integrity verification
2.6 An example neighborhood chain
2.7 Merkle hash trees for two-dimensional data
2.8 A 2-dimensional neighborhood chain
2.9 An example Bloom filter
2.10 Example idle periods and data submissions
2.11 Average power consumption per submission for a sensor (A)
2.12 Average power consumption per submission for a sensor (B)
2.13 Average power consumption per query response for a storage node (A)
2.14 Average power consumption per query response for a storage node (B)
2.15 Average space consumption for a storage node (A)
2.16 Average space consumption for a storage node (B)
3.1 The DAS model
3.2 Basic idea of privacy-preserving scheme
3.3 Basic idea of integrity-preserving scheme
3.4 Example bit matrix and local bit matrices
3.5 The example 2-dimensional bit matrix and local bit matrices
3.6 Effectiveness of optimal partition algorithm
3.7 Correctness of integrity-preserving scheme
3.8 Data processing time
3.9 Space cost
3.10 Query processing time
4.1 Effect of the number of rules on the throughput with frame size 128 bytes [2]
4.2 Example inter-firewall redundant rules
4.3 Prefix membership verification
4.4 The Conversion of FW1
4.5 The Conversion of FW2
4.6 Comparison of Two Firewalls
4.7 Three largest rules generated from Figure 4.4(d)
4.8 Identification of redundant rules in FW2
4.9 Processing FW1 on real firewalls
4.10 Processing FW2 on real firewalls
4.11 Comparing two real firewalls
4.12 Processing FW1 on synthetic firewalls
4.13 Processing FW2 on synthetic firewalls
4.14 Comparing two synthetic firewalls
5.1 An example of end-to-end network reachability
5.2 Three resulting ACLs converted from Figure 5.1
5.3 Privacy-preserving range intersection
5.4 The Conversion of A1
5.5 The example three adjacent ACLs
5.6 Encoding and encryption of ACL A1
5.7 Encoding and encryption of ACL A3
5.8 Comparison of ACLs A2 and A3
5.9 Decryption process of the comparison result
5.10 Comp. & comm. costs for processing real ACL Ai (1≤i≤n−1)
5.11 Comp. & comm. costs for processing real ACL An
5.12 Comp. & comm. costs for processing synthetic ACL Ai (1≤i≤n−1)
5.13 Comp. & comm. costs for processing synthetic ACL An
5.14 Comparison time of synthetic ACLs Ai and An

CHAPTER 1

Introduction

For distributed systems that span different parties, preserving the privacy and integrity of private data has become a core requirement over the past decade. In these systems, one party may not be fully trusted by the other parties, yet it tries to compute or aggregate useful information from the private data of those parties. Thus, it is very important to design secure communication and storage protocols that prevent this party from revealing such data of the other parties. Otherwise, a party would be reluctant to share its private data.
For example, Google and Facebook collect significant amounts of personal data through the Internet every day, while many people want to preserve the privacy of such data due to security concerns. Furthermore, one party may want to query the useful information computed from the private data of other parties. However, query results may be modified by a malicious party. Thus, it is very important to design query protocols such that the integrity of query results can be verified.

In this dissertation, we study four important, yet under-investigated, privacy and integrity preserving problems for different distributed systems: privacy and integrity preserving range queries in sensor networks, privacy and integrity preserving range queries for cloud computing, privacy preserving cross-domain cooperative firewall optimization, and privacy preserving cross-domain network reachability quantification. Next, for each problem, we first describe the motivation, then present the challenges, and finally propose our solution.

1.1 Privacy and Integrity Preserving Queries for Sensor Networks

Two-tiered sensor networks, where storage nodes gather data from nearby sensors and answer queries from the sink, have been widely adopted due to the power and storage savings they bring to sensors and the efficiency of query processing. However, a compromised storage node could disclose all sensitive data collected from nearby sensors and return forged data to the sink. Therefore, both the data that a storage node receives from sensors and the queries that it receives from the sink should be encrypted. Without decrypting the data and queries, the storage node needs to process encrypted queries over encrypted data correctly and send the encrypted data that satisfy the queries back to the sink. Moreover, the sink needs to be able to verify the integrity of the query results received from storage nodes. Although this may seem impossible at first glance, the problem can be solved; it is, however, very challenging to solve, and even more challenging to solve efficiently. To preserve privacy, I proposed a novel technique to encode both data and queries such that a storage node can correctly process encoded queries over encoded data without knowing their values. To preserve integrity, I proposed two schemes, one using Merkle hash trees and another using a new data structure called neighborhood chains, to generate integrity verification information so that a sink can use this information to verify whether the result of a query contains exactly the data items that satisfy the query. To improve performance, I proposed an optimization technique using Bloom filters to reduce the communication cost between sensors and storage nodes.

1.2 Privacy and Integrity Preserving Queries for Cloud Computing

Outsourced database systems are one of the most important lines of work in cloud computing. In such systems, a cloud provider hosts the private databases of an organization and returns query results to the customers on behalf of the organization. However, the inclusion of outsourced database systems also brings significant security and privacy challenges. As cloud providers cannot be fully trusted and the data of an organization are typically confidential, the organization always encodes the data before storing them in a cloud to prevent the cloud provider from revealing the data. However, it is difficult to process queries over encoded data.
Furthermore, since cloud providers play an important role in answering queries from customers, they may manipulate the query results, for example by returning forged data or by not returning all data that satisfy the query. I proposed the basic idea of a privacy and integrity preserving protocol for processing range queries in outsourced databases. To preserve privacy, I proposed an order-preserving hash-based function to encode the data items from the data owner and the queries from its customers such that the cloud provider can use the encoded queries and encoded data items to find the query results without knowing the actual values. To preserve integrity, I proposed the first probabilistic integrity preserving scheme for range queries in outsourced database systems. This scheme allows a customer to verify the integrity of a query result with high probability.

1.3 Privacy Preserving Optimization for Firewall Policies

Firewalls have been widely deployed on the Internet for securing private networks by checking whether to accept or discard packets based on their policies. Optimizing firewall policies is crucial for improving network performance. Prior work on firewall optimization focuses on either intra-firewall or inter-firewall optimization within one administrative domain, where the privacy of firewall policies is not a concern. I explored inter-firewall optimization across administrative domains for the first time. The key technical challenge is that firewall policies cannot be shared across domains because a firewall policy contains confidential information and even potential security holes, which can be exploited by attackers. In this work, I proposed the first cross-domain privacy-preserving cooperative firewall policy optimization protocol. Specifically, for any two adjacent firewalls belonging to two different administrative domains, the protocol can identify in each firewall the rules that can be removed because of the other firewall. The optimization process involves cooperative computation between the two firewalls without either party disclosing its policy to the other.

1.4 Privacy Preserving Quantification of Network Reachability

Network reachability is one of the key factors for capturing end-to-end network behavior and detecting violations of security policies. Many approaches have been proposed to address the network reachability problem. The main assumption in all these approaches is that the reachability restriction information of each network device and other configuration state is known to a central network analyst, who quantifies the network reachability. However, in reality, it is common that the network devices along a given path belong to different administrative domains whose reachability restriction information cannot be shared with others, including the network analyst. In this work, I designed the first cross-domain privacy-preserving protocol for quantifying network reachability. The protocol enables the network analyst to accurately determine the network reachability along a network path through different administrative domains without knowing the reachability restriction information of the other domains.

1.5 Structure of the Dissertation

In Chapter 2, we present SafeQ, a protocol that prevents attackers from gaining information from both sensor collected data and sink issued queries. We start with the system model, threat model, and problem statement.
Then, we present our schemes for preserving data privacy and query result integrity, respectively. We also propose a solution to adapt SafeQ for event-driven sensor networks. We show that SafeQ outperforms the state-of-the-art scheme in both privacy and performance.

In Chapter 3, we first propose an efficient privacy-preserving scheme that can process multi-dimensional range queries without false positives. Then, we propose the first probabilistic scheme for verifying the integrity of range query results. This scheme employs a new data structure, local bit matrices, which enables customers to verify query result integrity with high probability. We show that our scheme is effective and efficient on both real and synthetic datasets.

In Chapter 4, we first introduce the problem, system model, and threat model. Then, we present our privacy-preserving protocol for detecting inter-firewall redundant rules. Finally, we give the security analysis of our protocol and present our experimental results.

In Chapter 5, we first propose the cross-domain privacy-preserving protocol to quantify network reachability across multiple parties. Then, we propose an optimization technique to reduce computation and communication costs. Finally, we show that our protocol is efficient and suitable for real applications.

In Chapter 6, we first survey related work on secure multiparty computation, one of the fundamental cryptographic primitives for designing privacy-preserving protocols. Then, we discuss related work for each specific problem we investigate. Finally, in Chapter 7, we summarize this dissertation, discuss limitations, and outline future research directions.

CHAPTER 2

Privacy and Integrity Preserving Range Queries in Sensor Networks

2.1 Introduction

2.1.1 Motivation

Wireless sensor networks have been widely deployed for various applications, such as environment sensing, building safety monitoring, and earthquake prediction. In this work, we consider a two-tiered sensor network architecture in which storage nodes gather data from nearby sensors and answer queries from the sink of the network. The storage nodes serve as an intermediate tier between the sensors and the sink for storing data and processing queries. Storage nodes bring three main benefits to sensor networks. First, sensors save power by sending all collected data to their closest storage node instead of sending them to the sink through long routes. Second, sensors can have limited memory because data are mainly stored on storage nodes. Third, query processing becomes more efficient because the sink only communicates with storage nodes for queries. The inclusion of storage nodes in sensor networks was first introduced in [65] and has been widely adopted [26, 82, 70, 71, 69]. Several storage node products, such as StarGate [7] and RISE [6], are commercially available.

However, the inclusion of storage nodes also brings significant security challenges. As storage nodes store data received from sensors and play an important role in answering queries, they are more vulnerable to being compromised, especially in a hostile environment. A compromised storage node poses significant threats to a sensor network. First, the attacker may obtain sensitive data that has been, or will be, stored in the storage node. Second, the compromised storage node may return forged data for a query. Third, this storage node may not include all data items that satisfy the query.
Therefore, we want to design a protocol that prevents attackers from gaining information from both sensor collected data and sink issued queries, which typically can be modeled as range queries, and that allows the sink to detect compromised storage nodes when they misbehave. For privacy, compromising a storage node should not allow the attacker to obtain the sensitive information that has been, and will be, stored in the node, nor the queries that the storage node has received, and will receive. Note that we treat the queries from the sink as confidential because such queries may leak critical information about query issuers' interests, which need to be protected especially in military applications. For integrity, the sink needs to detect whether a query result from a storage node includes forged data items or omits data that satisfy the query.

2.1.2 Technical Challenges

There are two key challenges in solving the privacy and integrity preserving range query problem. First, a storage node needs to correctly process encoded queries over encoded data without knowing their actual values. Second, a sink needs to verify that the result of a query contains all the data items that satisfy the query and does not contain any forged data.

2.1.3 Limitations of Prior Art

Although important, the privacy and integrity preserving range query problem has been under-investigated. The prior art solution to this problem was proposed by Sheng and Li in their recent seminal work [69]. We call it the S&L scheme. This scheme has two main drawbacks: (1) it allows attackers to obtain a reasonable estimation of both sensor collected data and sink issued queries, and (2) the power consumption and storage space for both sensors and storage nodes grow exponentially with the number of dimensions of collected data.

2.1.4 Our Approach and Key Contributions

In this work, we propose SafeQ, a novel privacy and integrity preserving range query protocol for two-tiered sensor networks. The ideas of SafeQ are fundamentally different from those of the S&L scheme. To preserve privacy, SafeQ uses a novel technique to encode both data and queries such that a storage node can correctly process encoded queries over encoded data without knowing their actual values. To preserve integrity, we propose two schemes, one using Merkle hash trees and another using a new data structure called neighborhood chains, to generate integrity verification information such that a sink can use this information to verify whether the result of a query contains exactly the data items that satisfy the query. We also propose an optimization technique using Bloom filters to significantly reduce the communication cost between sensors and storage nodes. Furthermore, we propose a solution to adapt SafeQ for event-driven sensor networks, where a sensor submits data to its nearby storage node only when a certain event happens and the event may occur infrequently.

SafeQ outperforms the state-of-the-art S&L scheme [69] in two aspects. First, SafeQ provides significantly better security and privacy. While prior art allows a compromised storage node to obtain a reasonable estimation of the values of sensor collected data and sink issued queries, SafeQ makes such estimation very difficult. Second, SafeQ delivers orders of magnitude better performance on both power consumption and storage space for multi-dimensional data, which are most common in practice as most sensors are equipped with multiple sensing modules for temperature, humidity, pressure, etc.
2.1.5 Summary of Experimental Results

We performed a side-by-side comparison with prior art over a large real-world data set from Intel Lab [4]. Our results show that the power and space savings of SafeQ over prior art grow exponentially with the number of dimensions. For power consumption on three-dimensional data, SafeQ consumes 184.9 times less power for sensors and 76.8 times less power for storage nodes. For space consumption on storage nodes, for three-dimensional data, SafeQ uses 182.4 times less space. Our experimental results conform with the analysis that the power and space consumption in the S&L scheme grow exponentially with the number of dimensions, whereas those in SafeQ grow linearly with the number of dimensions times the number of data items.

2.2 Models and Problem Statement

2.2.1 System Model

We consider two-tiered sensor networks as illustrated in Figure 2.1. A two-tiered sensor network consists of three types of nodes: sensors, storage nodes, and a sink. Sensors are inexpensive sensing devices with limited storage and computing power. They are often massively distributed in a field for collecting physical or environmental data, e.g., temperature. Storage nodes are powerful mobile devices that are equipped with much more storage capacity and computing power than sensors. Each sensor periodically sends collected data to its nearby storage node. The sink is the point of contact for users of the sensor network. Each time the sink receives a question from a user, it first translates the question into multiple queries and then disseminates the queries to the corresponding storage nodes, which process the queries based on their data and return the query results to the sink. The sink unifies the query results from multiple storage nodes into the final answer and sends it back to the user.

Figure 2.1. Architecture of two-tiered sensor networks. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)

For the above network architecture, we assume that all sensor nodes and storage nodes are loosely synchronized with the sink. With loose synchronization in place, we divide time into fixed-duration intervals, and every sensor collects data once per time interval. From a starting time that all sensors and the sink agree upon, every n time intervals form a time slot. From the same starting time, after a sensor has collected data n times, it sends a message that contains a 3-tuple (i, t, {d1, · · · , dn}), where i is the sensor ID and t is the sequence number of the time slot in which the n data items {d1, · · · , dn} are collected by sensor si. We address privacy and integrity preserving range queries for event-driven sensor networks, where a sensor only submits data to a nearby storage node when a certain event happens, in Section 2.7.

We further assume that the queries from the sink are range queries. A range query "find all the data items that are collected at time slot t and whose values are in the range [a, b]" is denoted as {t, [a, b]}. Note that the queries in most sensor network applications can be easily modeled as range queries.
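To make these message formats concrete, the following sketch spells out the submission 3-tuple and the range query, together with the plaintext query semantics that the rest of this chapter preserves under encoding. The class and function names are illustrative only and are not part of SafeQ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Submission:
    """Message a sensor sends at the end of a time slot: the 3-tuple (i, t, {d1, ..., dn})."""
    sensor_id: int          # i, the sensor ID
    slot: int               # t, the sequence number of the time slot
    items: List[int]        # {d1, ..., dn}, the n data items collected in slot t

@dataclass
class RangeQuery:
    """Sink-issued query {t, [a, b]}."""
    slot: int               # t
    low: int                # a
    high: int               # b

def answer_plaintext(query: RangeQuery, stored: List[Submission]) -> List[int]:
    # Reference semantics with no privacy protection: return every item collected
    # in slot t whose value lies in [a, b].
    return [d for s in stored if s.slot == query.slot
            for d in s.items if query.low <= d <= query.high]
```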
For ease of presentation, Table 2.1 shows the notation used in this chapter.

si : A sensor with ID i
ki : The secret key of sensor si
t : The sequence number of a time slot
d1, · · · , dn : n 1-dimensional data items
D1, · · · , Dn : n z-dimensional data items
H, G, E : Three "magic" functions
F(x) : The prefix family of x
S([d1, d2]) : The minimum set of prefixes converted from [d1, d2]
N : A prefix numericalization function
HMACg : An HMAC function with key g
QR : A query result
VO : An integrity verification object

Table 2.1. Summary of notation

2.2.2 Threat Model

For a two-tiered sensor network, we assume that the sensors and the sink are trusted but the storage nodes are not. In a hostile environment, both sensors and storage nodes can be compromised. If a sensor is compromised, the data subsequently collected by the sensor will be known to the attacker, and the compromised sensor may send forged data to its closest storage node. It is extremely difficult to prevent such attacks without the use of tamper-proof hardware. However, the data from one sensor constitute only a small fraction of the collected data of the whole sensor network. Therefore, we mainly focus on the scenario where a storage node is compromised. Compromising a storage node can cause much greater damage to the sensor network than compromising a sensor. After a storage node is compromised, the large quantity of data stored on the node will be known to the attacker, and upon receiving a query from the sink, the compromised storage node may return a falsified result formed by including forged data or excluding legitimate data. Therefore, attackers are more motivated to compromise storage nodes.

2.2.3 Problem Statement

The fundamental problem for a two-tiered sensor network is: how can we design the storage scheme and the query protocol in a privacy and integrity preserving manner? A satisfactory solution to this problem should meet the following two requirements: (1) Data and query privacy: Data privacy means that a storage node cannot know the actual values of sensor collected data. This ensures that an attacker cannot understand the data stored on a compromised storage node. Query privacy means that a storage node cannot know the actual values of sink issued queries. This ensures that an attacker cannot understand, or deduce useful information from, the queries that a compromised storage node receives. (2) Data integrity: If a query result that a storage node sends to the sink includes forged data or excludes legitimate data, the query result is guaranteed to be detected by the sink as invalid. Besides these two hard requirements, a desirable solution should have low power and space consumption because these devices have limited resources.

2.3 Privacy for 1-dimensional Data

To preserve privacy, it seems natural to have sensors encrypt data and the sink encrypt queries; however, the key challenge is how a storage node processes encrypted queries over encrypted data. The idea of our solution for preserving privacy is illustrated in Figure 2.2. We assume that each sensor si in a network shares a secret key ki with the sink. For the n data items d1, · · · , dn that a sensor si collects in time slot t, si first encrypts the data items using key ki, the results of which are represented as (d1)ki, · · · , (dn)ki. Then, si applies a
When the sink wants to perform query {t, [a, b]} on a storage node, the sink applies another “magic” function G on the range [a, b] and sends {t, G([a, b])} to the storage node. The storage node processes the query {t, G([a, b])} over encrypted data (d1 )ki , · · · , (dn )ki collected at time slot t using another “magic” function E. The three “magic” functions H, G, and E satisfy the following three conditions: (1) A data item dj (1 ≤ j ≤ n) is in range [a, b] if and only if E(j, H(d1 , · · · , dn ), G([a, b])) is true. This condition allows the storage node to decide whether (dj )k should be included in the query i result. (2) Given H(d1 , · · · , dn ) and (dj )ki , it is computationally infeasible for the storage node to compute dj . This condition guarantees data privacy. (3) Given G([a, b]), it is computationally infeasible for the storage node to compute [a, b]. This condition guarantees query privacy. Sensor si Sink Storage Node (d1)ki,…,(dn)ki, H(d1,…,dn) {t, G([a, b])} dj[a, b] iff E(j, H(d1, …, dn), G([a, b])) is true Figure 2.2. The idea of SafeQ for preserving privacy 2.3.1 Prefix Membership Verification The building block of our privacy preserving scheme is the prefix membership verification scheme first introduced in [23] and later formalized in [48]. The idea of this scheme is to convert the verification of whether a number is in a range to several verifications of whether two numbers are equal. A prefix {0, 1}k {∗}w−k with k leading 0s and 1s followed by w − k ∗s is called a k−prefix. For example, 1*** is a 1-prefix and it denotes the range [1000, 1111]. 14 If a value x matches a k−prefix (i.e., x is in the range denoted by the prefix), the first k bits of x and the k−prefix are the same. For example, if x ∈ 1*** (i.e., x ∈ [1000, 1111]), then the first bit of x must be 1. Given a binary number b1 b2 · · · bw of w bits, the prefix family of this number is defined as the set of w + 1 prefixes {b1 b2 · · · bw , b1 b2 · · · bw−1 ∗, · · · , b1 ∗ · · · ∗, ∗ ∗ ...∗}, where the i-th prefix is b1 b2 · · · bw−i+1 ∗ · · · ∗. The prefix family of x is denoted as F (x). For example, the prefix family of number 12 is F (12) = F (1100) ={1100, 110*, 11**, 1***, ****}. Prefix membership verification is based on the fact that for any number x and prefix P , x ∈ P if and only if P ∈ F (x). To verify whether a number a is in a range [d1 , d2 ], we first convert the range [d1 , d2 ] to a minimum set of prefixes, denoted S([d1 , d2 ]), such that the union of the prefixes is equal to [d1 , d2 ]. For example, S([11, 15]) ={1011,11**}. Given a range [d1 , d2 ], where d1 and d2 are two numbers of w bits, the number of prefixes in S([d1 , d2 ]) is at most 2w − 2 [39]. Second, we compute the prefix family F (a) for number a. Thus, a ∈ [d1 , d2 ] if and only if F (a) ∩ S([d1 , d2 ]) = ∅. To verify whether F (a) ∩ S([d1 , d2 ]) = ∅ using only the operations of verifying whether two numbers are equal, we convert each prefix to a corresponding unique number using a prefix numericalization function. A prefix numericalization function N needs to satisfy the following two properties: (1) for any prefix P , N (P ) is a binary string; (2) for any two prefixes P1 and P2 , P1 = P2 if and only if N (P1 ) = N (P2 ). There are many ways to do prefix numericalization. We use the prefix numericalization scheme defined in [20]. Given a prefix b1 b2 · · · bk ∗ · · · ∗ of w bits, we first insert 1 after bk . The bit 1 represents a separator between b1 b2 · · · bk and ∗ · · · ∗. Second, we replace every * by 0. 
2.3.2 The Submission Protocol

The submission protocol concerns how a sensor sends its data to a storage node. Let d1, · · · , dn be the data items that sensor si collects at a time slot. Each item dj (1 ≤ j ≤ n) is in the range (d0, dn+1), where d0 and dn+1 denote the lower and upper bounds, respectively, for all possible data items that a sensor may collect. The values of d0 and dn+1 are known to both the sensors and the sink. After collecting n data items, si performs six steps: 1) Sort the n data items in ascending order. For simplicity, we assume d0 < d1 < · · · < dn < dn+1.

[...]

c + 2k(n + 1)q⌈log2(n + 1)⌉    (2.2)

Based on Formulas 2.1 and 2.2, assuming wh = 128 and n ≥ 3, to achieve a reduction in the communication cost of sensors and a small false positive rate ε < 1%, we choose c = (1/ln 2) · k(n + 1)q and k such that 4 ≤ k < 128/(1.44 + 2⌈log2(n + 1)⌉). Note that such a k fails to exist only when n ≥ 2^15, which is unlikely to happen.

2.7 Queries in Event-driven Networks

So far we have assumed that at each time slot, a sensor sends to a storage node the data that it collected at that time slot. However, this assumption does not hold for event-driven networks, where a sensor only reports data to a storage node when a certain event happens. If we directly apply our solution here, then the sink cannot verify whether a sensor collected data at a time slot. The case where a sensor did not submit any data at time slot t and the case where the storage node discards all the data that the sensor collected at time slot t are not distinguishable to the sink.

We address the above challenge by having sensors report their idle periods to storage nodes, each time they submit data after an idle period or when the idle period is longer than a threshold. Storage nodes can use such idle periods reported by sensors to prove to the sink that a sensor did not submit any data at any time slot in that idle period. Next, we discuss the operations carried out by sensors, storage nodes, and the sink.

Sensors: An idle period for a sensor is a time slot interval [t1, t2], which indicates that the sensor has no data to submit from t1 to t2, including t1 and t2. Let γ be the threshold for a sensor being idle without reporting to a storage node. Suppose the last time that sensor si submitted data or reported an idle period was time slot t1 − 1. At any time slot t ≥ t1, si acts based on three cases (a sketch of this decision logic is given after the list):

1. t = t1: In this case, if si has data to submit, then it just submits the data; otherwise, it takes no action.

2. t1 < t < γ + t1 − 1: In this case, if si has data to submit, then it submits the data along with the encrypted idle period [t1, t − 1]ki; otherwise, it takes no action. We call [t1, t − 1]ki an idle proof.

3. t = γ + t1 − 1: In this case, if si has data to submit, then it submits the data along with the idle proof [t1, t − 1]ki; otherwise, it submits the idle proof [t1, t]ki.
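The sketch below captures this sensor-side decision logic. The encryption of an idle period with key ki is abstracted by a placeholder encrypt function, and the function and message names are illustrative rather than part of SafeQ's specification.

```python
def sensor_action(t, t1, gamma, has_data, data_msg, encrypt):
    """Decide what sensor si submits at time slot t.

    t1: first slot of the current idle period (the last submission or idle report
        happened at slot t1 - 1); gamma: idle-report threshold;
    encrypt(interval) stands in for the encryption [interval]_{ki} with key ki.
    """
    if t == t1:
        # Case 1: no pending idle period to report.
        return data_msg if has_data else None
    if t < gamma + t1 - 1:
        # Case 2: report the idle proof [t1, t-1]_{ki} only when submitting data.
        return (data_msg, encrypt((t1, t - 1))) if has_data else None
    if t == gamma + t1 - 1:
        # Case 3: the idle period reaches the threshold; report it even without data.
        return (data_msg, encrypt((t1, t - 1))) if has_data else encrypt((t1, t))
    raise ValueError("t1 should be advanced whenever the sensor submits or reports")
```

After any submission or idle report at slot t, t1 advances to t + 1, matching the assumption that the last report happened at slot t1 − 1.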
Figure 2.10 illustrates some idle periods for sensor si, where each unit on the time axis is a time slot, a grey unit denotes that si has data to submit at that time slot, and a blank unit denotes that si has no data to submit at that time slot. According to the second case, at time slot t2 + 1, si submits data along with the idle proof [t1, t2]ki. According to the third case, at time slot t4, si submits the idle proof [t3, t4]ki.

Figure 2.10. Example idle periods and data submissions

Storage nodes: When a storage node receives a query {t, G([a, b])} from the sink, it first checks whether si has submitted data at time slot t. If si has, then the storage node sends the query result as discussed in Section 2.3. Otherwise, the storage node checks whether si has submitted an idle proof for an idle period containing time slot t. If so, then it sends the idle proof to the sink as VO. Otherwise, it replies to the sink saying that it does not have the idle proof containing time slot t at this moment, but once the right idle proof is received, it will forward it to the sink. The maximum number of time slots that the sink may need to wait for the right idle proof is γ − 1. Here γ is a system parameter trading off efficiency against the amount of time that the sink may have to wait to verify data integrity. A smaller γ favors the sink for integrity verification, and a larger γ favors sensors for power saving because of the lower communication cost.

The Sink: Changes on the sink side are minimal. In the case that VO lacks the idle proof for verifying the integrity of QR, the sink defers the verification for at most γ − 1 time slots, during which benign storage nodes are guaranteed to send the needed idle proof.

2.8 Complexity and Security Analysis

2.8.1 Complexity Analysis

Assume that a sensor collects n z-dimensional data items in a time slot, each attribute of a data item is a wo-bit number, and the HMAC result of each numericalized prefix is a wh-bit number. The computation cost, communication cost, and storage space of SafeQ are described in the following table. Note that the communication cost denotes the number of bytes sent for each submission or query, and the storage space denotes the number of bytes stored on a storage node for each submission.

Sensor: computation O(wo z n) hash + O(n) encryption; communication O(wo wh z n); space –
Storage node: computation O(wo z) hash; communication O(z n); space O(wo wh z n)
Sink: computation O(wo z) hash; communication O(wo z); space –

Table 2.2. Complexity analysis of SafeQ

2.8.2 Privacy Analysis

In a SafeQ protected two-tiered sensor network, compromising a storage node does not allow the attacker to obtain the actual values of sensor collected data and sink issued queries. The correctness of this claim is based on the fact that the hash functions and encryption algorithms used in SafeQ are secure. In the submission protocol, a storage node only receives encrypted data items and the secure hash values of prefixes converted from the data items. Without knowing the keys used in the encryption and secure hashing, it is computationally infeasible to compute the actual values of sensor collected data and the corresponding prefixes. In the query protocol, a storage node only receives the secure hash values of prefixes converted from a range query. Without knowing the key used in the secure hashing, it is computationally infeasible to compute the actual values of sink issued queries.
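The privacy argument can be made concrete with a small sketch. Consistent with Sections 2.3.1 and 2.8.2, it models the data-side information as the HMAC values of the numericalized prefixes of S([dj−1, dj]) and the query-side information as the HMAC values of the numericalized prefix family of a query bound; the storage node then locates a bound by intersecting hash sets, learning only an index rather than any plaintext value. This is an illustrative reconstruction rather than SafeQ's exact wire format; it reuses the prefix_family, range_to_prefixes, and numericalize helpers from the earlier sketch and assumes a secret key g (bytes) shared by the sensors and the sink.

```python
import hashlib
import hmac

def hmac_set(prefixes, g):
    """Apply HMACg to every numericalized prefix (HMAC-SHA256 here for illustration;
    our evaluation in Section 2.9 uses HMAC-MD5)."""
    return {hmac.new(g, p.encode(), hashlib.sha256).hexdigest() for p in prefixes}

def encode_gaps(sorted_items, d0, dn1, w, g):
    """Sensor side: for each range [d_{j-1}, d_j], publish HMACg(N(S([d_{j-1}, d_j])))."""
    bounds = [d0] + sorted_items + [dn1]
    return [hmac_set(map(numericalize, range_to_prefixes(bounds[j], bounds[j + 1], w)), g)
            for j in range(len(bounds) - 1)]

def locate(bound, w, g, encoded_gaps):
    """Storage node side: find which encoded range(s) a query bound falls into by
    hash-set intersection, without ever seeing the bound's value."""
    q = hmac_set(map(numericalize, prefix_family(bound, w)), g)
    return [j for j, gap in enumerate(encoded_gaps) if q & gap]
```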
Next, we analyze the information leakage if HMACg() does not satisfy the one-wayness property. More formally, given y, where y = HMACg(x) and x is a numericalized prefix, suppose that a storage node takes O(T) steps to compute x. Recall that the number of HMAC hashes sent from a sensor is O(wo z n). To reveal a data item dj, the storage node needs to reveal all the numericalized prefixes in HMACg(N(S([dj−1, dj]))). Thus, to reveal n data items, the storage node would take O(wo z n T) steps. Here T = 2^128 for HMAC.

2.8.3 Integrity Analysis

For our scheme using Merkle hash trees, the integrity guarantee is based on the property that any change to the leaf nodes of a Merkle hash tree changes the root value. Recall that the leaf nodes in a Merkle hash tree are sorted according to their values. In a query response, the left bound of the query result (if it exists), the query result, and the right bound of the query result (if it exists) must be consecutive leaf nodes in the Merkle hash tree. If the storage node includes forged data in the query result or excludes a legitimate data item from the query result, the root value computed at the sink will be different from the root value computed at the corresponding sensor.

For our scheme using neighborhood chains, the correctness is based on the following three properties that QR and VO should satisfy for a query. First, the items in QR ∪ VO form a chain; excluding any item in the middle or changing any item violates the chaining property. Second, the first item in QR ∪ VO contains the value of its left neighbor, which should be outside the query range on the smaller end. Third, the last item in QR ∪ VO contains the value of its right neighbor, which should be outside the query range on the larger end.

2.9 Experimental Results

2.9.1 Evaluation Methodology

To compare SafeQ with the state-of-the-art, which is represented by the S&L scheme, we implemented both schemes and performed a side-by-side comparison on a large real data set. We measured the average power and space consumption for both the submission and query protocols of both schemes.

2.9.2 Evaluation Setup

We implemented both SafeQ and the S&L scheme using TOSSIM [8], a widely used wireless sensor network simulator. We measured the efficiency of SafeQ and the S&L scheme on 1-, 2-, and 3-dimensional data. For better comparison, we conducted our experiments on the same data set that S&L used in their experiments [69]. The data set was chosen from a large real data set from Intel Lab [4] and consists of the temperature, humidity, and voltage data collected by 44 nodes during 03/01/2004-03/10/2004. Each data attribute follows a Gaussian distribution. Note that S&L only conducted experiments on the temperature data, while we experimented with both SafeQ and the S&L scheme on 1-dimensional data (temperature), 2-dimensional data (temperature and humidity), and 3-dimensional data (temperature, humidity, and voltage).

In implementing SafeQ, we used HMAC-MD5 [46] with 128-bit keys as the hash function for hashing prefix numbers. We used the DES encryption algorithm in implementing both SafeQ and the S&L scheme. In implementing our Bloom filter optimization technique, we chose the number of hash functions to be 4 (i.e., k = 4), which guarantees that the false positive rate induced by the Bloom filter is less than 1%.
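Since the Bloom filter used in our optimization (Section 2.6, Figure 2.9) is a standard structure, the following minimal sketch shows what is being tuned here: an m-bit array with k = 4 hash functions, supporting insertion and membership tests with no false negatives and a false positive rate controlled by m and k. The class name and the use of salted SHA-1 as the k hash functions are illustrative choices, not the exact functions of our implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k=4):
        self.m, self.k = m, k               # m bits, k hash functions (k = 4 in our setup)
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item (bytes) with k different salts.
        for i in range(self.k):
            h = hashlib.sha1(b"%d|%s" % (i, item)).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # No false negatives; false positives occur at a rate determined by m and k.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# A sensor can send this compact bit array instead of a full list of hash values.
bf = BloomFilter(m=1024)
bf.add(b"11100")
assert b"11100" in bf
```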
In implementing the S&L scheme, we used the parameter values (i.e., VARp = 0.4 and ENp = 1) that correspond to the minimum false positives of query results in their experiments for computing optimal bucket partitions as in [69], and we used HMAC-MD5 with 128-bit keys as the hash function for computing encoding numbers. For multi-dimensional data, we used their optimal bucket partition algorithm to partition the data along each dimension.

In our experiments, we used time slot sizes ranging from 10 minutes to 80 minutes. For each time slot, we generated 1,000 random range queries of the form ([a1, b1], [a2, b2], [a3, b3]), where a1, b1 are two random values of temperature, a2, b2 are two random values of humidity, and a3, b3 are two random values of voltage.
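A sketch of how such evaluation queries can be generated from the trace: for each dimension, draw two random values of that attribute and order them to form [ai, bi]. The function name and the use of Python's random module are our own illustration of the setup described above, not the exact generator used in the experiments.

```python
import random

def random_range_queries(temps, hums, volts, count=1000, seed=0):
    """Generate `count` random 3-dimensional range queries ([a1,b1],[a2,b2],[a3,b3]),
    where each bound pair is drawn from the observed values of that attribute."""
    rng = random.Random(seed)
    queries = []
    for _ in range(count):
        query = []
        for values in (temps, hums, volts):
            a, b = sorted(rng.sample(values, 2))
            query.append((a, b))
        queries.append(tuple(query))
    return queries
```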
Figures 2.12(b) and 2.14(b) show the average power consumption for a 10-minute slot for a sensor and a storage node, respectively, versus the number of dimensions of the data. We observe an almost linear correlation between the average power consumption of both sensors and storage nodes and the number of dimensions of the data, which also confirms our complexity analysis in Section 4.5.2.

Our experimental results also show that SafeQ is comparable to the S&L scheme for 1-dimensional data in terms of power and space consumption. For power consumption, SafeQ-NC+ consumes about the same power for sensors and 0.7 times less power for storage nodes; SafeQ-MHT+ consumes about the same power for sensors and 0.3 times less power for storage nodes; SafeQ-NC consumes 1.0 times more power for sensors and 0.7 times less power for storage nodes; SafeQ-MHT consumes 1.0 times more power for sensors and 0.3 times less power for storage nodes. For space consumption on storage nodes, SafeQ-NC+ and SafeQ-MHT+ consume about the same space, and SafeQ-NC and SafeQ-MHT consume about 1.0 times more space.

Figures 2.15(a), 2.15(b) and 2.16(a) show the average space consumption of storage nodes for 3-, 2- and 1-dimensional data, respectively. For space consumption on storage nodes, in comparison with the S&L scheme, our experimental results show that for 3-dimensional data, SafeQ-NC+ consumes 182.4 times less space; SafeQ-MHT+ consumes 169.1 times less space; SafeQ-NC consumes 58.5 times less space; SafeQ-MHT consumes 57.2 times less space. For 2-dimensional data, SafeQ-NC+ consumes 10.2 times less space; SafeQ-MHT+ consumes 9.4 times less space; SafeQ-NC consumes 2.7 times less space; SafeQ-MHT consumes 2.6 times less space. The results conform with the theoretical analysis that the space consumption in the S&L scheme grows exponentially with the number of dimensions, whereas in SafeQ it grows linearly with the number of dimensions and the number of data items.

Figure 2.16(b) shows the average space consumption of storage nodes for each data item versus the number of dimensions of the data item. For each 3-dimensional data item, S&L consumes over 10^4 bytes, while SafeQ-NC+ and SafeQ-MHT+ consume only 40 bytes.

Figure 2.11. Average power consumption per submission for a sensor (A): (a) 3-dimensional data; (b) 2-dimensional data. Power consumption (mW) versus time slot size (minutes) for SafeQ-NC+, SafeQ-MHT+, SafeQ-NC, SafeQ-MHT, and S&L.

Figure 2.12. Average power consumption per submission for a sensor (B): (a) 1-dimensional data versus time slot size (minutes); (b) power consumption for a 10-minute slot versus the number of dimensions.
Figure 2.13. Average power consumption per query response for a storage node (A): (a) 3-dimensional data; (b) 2-dimensional data. Power consumption (mW) versus time slot size (minutes) for SafeQ-NC+, SafeQ-MHT+, SafeQ-NC, SafeQ-MHT, and S&L.

Figure 2.14. Average power consumption per query response for a storage node (B): (a) 1-dimensional data versus time slot size (minutes); (b) power consumption for a 10-minute slot versus the number of dimensions.

Figure 2.15. Average space consumption for a storage node (A): (a) 3-dimensional data; (b) 2-dimensional data. Space consumption (kB) versus time slot size (minutes).

Figure 2.16. Average space consumption for a storage node (B): (a) 1-dimensional data, space consumption (kB) versus time slot size (minutes); (b) space consumption (bytes) per data item versus the number of dimensions.

CHAPTER 3

Privacy and Integrity Preserving Range Queries for Cloud Computing

3.1 Introduction

3.1.1 Motivation

Cloud computing has become a new computing paradigm for internet services, in which cloud providers host numerous hardware, software, and network resources to store organizations' data and perform computation over the data on demand of customers' queries. Cloud computing has three major advantages. First, organizations can instantly open a business and provide products or services to their customers without building and maintaining their own computing infrastructure, which significantly reduces costs. Second, the data stored in a cloud are more reliable and can be accessed whenever a customer has an internet connection. Third, cloud providers have powerful computation capacity to process queries, which provides a better experience to customers. Many clouds have been successfully built, e.g., Amazon EC2 and S3 [1], Google App Engine [3], and Microsoft Azure [5].

The database-as-a-service (DAS) model, first introduced by Hacigumus et al. [40], is one of the most important models in cloud computing. In the DAS model, a cloud provider hosts the data of an organization and returns query results to the customers on behalf of the organization. However, the DAS model brings significant security and privacy challenges. As cloud providers cannot be fully trusted and the data of an organization are typically confidential, the organization needs to encrypt the data before storing them in a cloud to prevent the cloud provider from revealing the data. However, it is difficult to process queries over encrypted data. Furthermore, since cloud providers play an important role in answering queries from customers, they may return forged data for a query or may not return all the data items that satisfy the query. Therefore, we want to design a protocol for the DAS model that supports multi-dimensional range queries while preserving the privacy of both data and queries and the integrity of query results. Range queries are among the most important queries for various database systems and have wide applications. For data privacy, cloud providers should not be able to reveal the organization's data or the customers' queries.
Note that the customer queries also need to be kept confidential from cloud providers because such queries may leak critical information about the query results. For query result integrity, customers need to detect whether a query result includes forged data or omits data items that satisfy the query.

3.1.2 Technical Challenges

There are three challenges in solving the secure multi-dimensional range query problem in the DAS model. First, a cloud provider needs to correctly process range queries over encrypted data without knowing the values of either the data or the queries. Second, customers need to verify whether a query result contains all the data items that satisfy the query and does not contain any forged data. Third, supporting multi-dimensional range queries is difficult.

3.1.3 Limitations of Previous Work

Privacy and integrity preserving range queries have received much attention in the database and security communities (e.g., [40, 41, 10, 31, 72, 17, 27, 62, 22]). Four main techniques have been proposed in the privacy-preserving schemes: bucket partitioning (e.g., [40, 41]), order-preserving hash functions (e.g., [31]), order-preserving encryption (e.g., [10]), and public-key encryption (e.g., [17, 72]). However, bucket partitioning leads to false positives in query results, i.e., a query result includes data items that do not satisfy the query. Existing order-preserving hash functions and order-preserving encryption schemes require a large amount of shared secret information between an organization and its customers. Public-key cryptography is too expensive to be applied in realistic applications. Three main techniques have been proposed in the integrity-preserving schemes: Merkle hash trees (e.g., [27, 63]), signature aggregation and chaining (e.g., [62, 59]), and spatial data structures (e.g., [22]). However, Merkle hash trees cannot support multi-dimensional range queries. Signature aggregation and chaining requires a cloud provider to reply to the customer with the boundary data items of the query, which do not satisfy the query. Spatial data structures are computationally expensive because constructing such structures is complicated. Furthermore, it is not clear how to search for query results over such structures in a privacy-preserving manner.

3.1.4 Our Approach

In this work, we propose novel privacy and integrity preserving schemes for the DAS model. To preserve privacy, we propose an order-preserving hash-based function to encode both data and queries such that a cloud provider can correctly process encoded queries over encoded data without knowing their values. To preserve integrity, we present the first probabilistic integrity-preserving scheme for multi-dimensional range queries. In this scheme, we propose a new data structure, called local bit matrices, to encode neighborhood information for each data item from an organization, such that a customer can verify the integrity of a query result with a high probability. Compared with the state-of-the-art, our schemes achieve both security and efficiency. In terms of security, our schemes not only enable a cloud provider to correctly process queries over encrypted data, but also leak only the minimum privacy information, as we will discuss in Section 3.3.4. In terms of efficiency, our schemes are much more efficient due to the use of the hash function and symmetric encryption.

3.1.5 Key Contributions

We make three major contributions.
First, we propose an efficient privacy-preserving scheme that can process multi-dimensional range queries without false positives. Second, we propose the first probabilistic scheme for verifying the integrity of range query results. This scheme employs a new data structure, local bit matrices, which enables customers to verify query result integrity with high probability. Third, we conduct extensive experiments on real and synthetic datasets to evaluate the effectiveness and efficiency of our schemes.

3.1.6 Summary of Experimental Results

We performed extensive experiments on synthetic datasets and the Adult dataset [32]. Our experimental results show that our schemes are efficient for preserving the privacy and integrity of multi-dimensional range queries in cloud computing. For a synthetic dataset with one million 1-dimensional data items, the one-time offline data processing time is about 50 minutes, the space cost is 33MB, and the query processing time is 2 milliseconds. For the Adult dataset with 45222 3-dimensional data items, the data processing time is 104 seconds, the space cost is 1.5MB, and the query processing time is 3.5 milliseconds.

3.2 Models and Problem Statement

3.2.1 System Model

We consider the database-as-a-service (DAS) model as illustrated in Figure 3.1. The DAS model consists of three parties: organizations, a cloud provider, and customers. Organizations outsource their private data to a cloud provider. A cloud provider hosts the outsourced data from organizations and processes the customers' queries on behalf of the organizations. Customers are the clients (of organizations) that query the cloud provider and retrieve query results from the outsourced data hosted by the cloud provider.

Figure 3.1. The DAS model: customers send queries to the cloud provider, which hosts the organizations' outsourced data and returns query results.

3.2.2 Threat Model

In the DAS model, we assume that organizations and their customers are trusted but the cloud provider is not. In a hostile environment, both customers and cloud providers may be untrusted. If a customer is malicious, it may retrieve all of an organization's data and distribute them to unauthorized users. Such attacks are very difficult to prevent and are out of the scope of this work. In this work, we mainly focus on the scenario where a cloud provider is not trusted and may try to reveal organizations' data and falsify query results. In reality, cloud providers and organizations typically belong to different parties, i.e., different companies. The organizations cannot share their private data with untrusted cloud providers. A malicious cloud provider may try to reveal the private data of organizations and return falsified query results that include forged data or exclude legitimate data. In such cases, a cloud provider can disrupt business and cause great losses to organizations. We also assume that there are secure channels between the organization and the cloud provider, and between the cloud provider and each customer, which could be achieved using protocols such as SSL.

3.2.3 Problem Statement

The fundamental problem for the DAS model is: how can we design the storage scheme and the query protocol in a privacy and integrity preserving manner? A satisfactory solution to this problem should meet the following three requirements. (1) Data and query privacy: Data privacy means that a cloud provider cannot reveal any data item from organizations. Query privacy means that a cloud provider cannot reveal any query from customers.
(2) Data integrity: If a cloud provider returns forged data or does not return all the data items that satisfy the query, such misbehavior should be detected by the customer. (3) Range query processing: The encoded data from organizations and the encoded queries from customers should allow a cloud provider to correctly process range queries.

3.3 Privacy Preserving for 1-dimensional Data

In this section, we present our privacy-preserving scheme for 1-dimensional data. To preserve privacy, it is natural to have an organization encrypt its data items. Let d1, · · · , dn denote the n data items from the organization; the encryption results can be denoted as (d1)k, · · · , (dn)k, where k is the shared secret key between the organization and its customers. However, the key challenge is how a cloud provider can process queries over encrypted data without knowing the values of the data items.

The basic idea of our scheme is to design an order-preserving hash-based function to encode the data items from the organization and the queries from its customers such that the cloud provider can use the encoded queries and encoded data items to find out the query results without knowing the actual values. More formally, let fk() denote the order-preserving hash-based function, where k is the shared secret key between the organization and its customers. To compute fk(), the organization and its customers also need to share one more piece of secret information, the domain of the data items [x1, xN]. This function fk() satisfies the following property: the condition fk(xi1) < fk(xi2) holds if x1 ≤ xi1 < xi2 ≤ xN.

To submit n data items d1, · · · , dn to a cloud provider, the organization first encrypts each data item with the secret key k, i.e., (d1)k, · · · , (dn)k. Second, the organization applies the function fk() to each data item, i.e., fk(d1), · · · , fk(dn). Finally, the organization sends the encrypted data items (d1)k, · · · , (dn)k as well as the encoded data items fk(d1), · · · , fk(dn) to the cloud provider. To perform a range query [a, b], the customer applies the order-preserving hash-based function fk() to the lower and upper bounds of the query, i.e., fk(a) and fk(b), and then sends [fk(a), fk(b)] as the query to the cloud provider. Finally, the cloud provider can find out whether a data item dj (1 ≤ j ≤ n) satisfies the query [a, b] by checking whether the condition fk(a) ≤ fk(dj) ≤ fk(b) holds. Figure 3.2 shows the idea of our privacy-preserving scheme.

In this section, we first present our order-preserving hash-based function and then discuss its properties. Second, we propose the privacy-preserving scheme by employing this function. Third, we propose an optimization technique to reduce the size of the results after applying the hash-based function. Fourth, we analyze the minimum information leakage for any precise privacy-preserving scheme of range queries and demonstrate that our scheme leaks only the minimum information.

Figure 3.2. Basic idea of the privacy-preserving scheme: the organization sends (d1)k, …, (dn)k and fk(d1), …, fk(dn) to the cloud provider; the customer sends [fk(a), fk(b)]; dj ∈ [a, b] if and only if fk(a) ≤ fk(dj) ≤ fk(b).

3.3.1 The Order-Preserving Hash-based Function

Without loss of generality, we assume that all possible data items are integers within the domain [x1, xN]. The order-preserving hash-based function fk() has the form

fk(xi) = Σ_{q=1}^{i} hk(xq)    (3.1)

where xi ∈ [x1, xN] and hk() is a keyed hash function, such as keyed HMAC-MD5 or keyed HMAC-SHA1.
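A minimal sketch of Equation 3.1, assuming HMAC-MD5 as hk() and a small integer domain; the function and variable names here are ours, and a practical implementation would precompute the prefix sums over the domain rather than recomputing them per call.

```python
import hmac, hashlib

KEY = b"shared-secret-k"   # hypothetical key k shared by the organization and its customers
X1 = 1                     # lower end x_1 of the shared domain [x_1, x_N]

def h(x: int) -> int:
    """Keyed hash h_k(x): the HMAC-MD5 digest interpreted as a positive integer."""
    return int.from_bytes(hmac.new(KEY, str(x).encode(), hashlib.md5).digest(), "big")

def f(x: int) -> int:
    """Order-preserving hash f_k(x): sum of h_k over all domain values <= x (Eq. 3.1)."""
    return sum(h(q) for q in range(X1, x + 1))

# Order preservation on a toy domain: larger values always get larger encodings.
assert f(5) < f(9) < f(17)
```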
The intuition behind such an order-preserving hash-based function is twofold. First, we leverage a normal hash function hk() as the basic building block such that the one-way property of hk() prevents the cloud provider from revealing the data items. Second, we treat the result of hk() as a positive integer and calculate fk(xi) by summing the hash results of all values that are less than or equal to xi in the domain [x1, xN], so that if x1 ≤ xi1 < xi2 ≤ xN, then fk(xi1) is less than fk(xi2). In other words, fk() is an order-preserving function for the values in [x1, xN]. More formally, the order-preserving hash-based function fk() satisfies two properties.

Order Preserving: Assume that any hk(xq) (xq ∈ [x1, xN]) is a positive integer. The condition fk(xi1) < fk(xi2) holds if and only if xi1 < xi2.

Proof. We first prove that if the condition fk(xi1) < fk(xi2) holds, then xi1 < xi2. We prove it by contradiction. If xi1 ≥ xi2, we have

fk(xi1) = fk(xi2) + Σ_{q=i2+1}^{i1} hk(xq) ≥ fk(xi2)    (3.2)

which contradicts fk(xi1) < fk(xi2). Second, we prove that if the condition xi1 < xi2 holds, then fk(xi1) < fk(xi2). Similar to the proof of the collision resistance property, we have

fk(xi2) = fk(xi1) + Σ_{q=i1+1}^{i2} hk(xq) > fk(xi1)

Collision Resistance: Assume that any hk(xq) (xq ∈ [x1, xN]) is a positive integer. It is impossible to find xi1 and xi2 with xi1 ≠ xi2 such that fk(xi1) = fk(xi2).

Proof. Without loss of generality, we assume i1 < i2. Because any hk(xq) (xq ∈ [x1, xN]) is a positive integer, we have

fk(xi2) = fk(xi1) + Σ_{q=i1+1}^{i2} hk(xq) > fk(xi1)

In fact, the hash-based function fk() can preserve any given arbitrary order of values in the domain [x1, xN], no matter whether the condition x1 < · · · < xN holds. For example, if the order of the 3 data items 3, 5, 7 is defined as 5, 3, 7, then fk(5) < fk(3) < fk(7). This property allows an organization to revise any data item xi (xi ∈ [x1, xN]) arbitrarily while still preserving the order. We will discuss how to leverage this property to prevent the statistical analysis of multi-dimensional data in Section 3.6.1. These two properties and the one-way property of hk() allow the cloud provider to process the encoded range queries over the encoded data without revealing the values of the data and queries.

3.3.2 The Privacy-Preserving Scheme

The privacy-preserving scheme includes three phases: data submission, query submission, and query processing. The data submission phase concerns how an organization sends its data to a cloud provider. Let d1, · · · , dn denote the data items of an attribute in the private data of the organization. Recall that [x1, xN] is the domain of the attribute and is a shared secret between the organization and its customers. For simplicity, we assume d1 < d2 < · · · < dn. If some data items have the same value, the organization can simply represent them as one data item annotated with the number of items that share this value. To preserve data privacy, for each dj (1 ≤ j ≤ n), the organization first encrypts it with its secret key k, i.e., (dj)k, and then applies the order-preserving hash-based function, i.e., fk(dj). Finally, the organization sends the encrypted data (d1)k, · · · , (dn)k as well as the hash results fk(d1), · · · , fk(dn) to the cloud provider.

The query submission phase concerns how a customer sends a range query to the cloud provider. When a customer wants to perform a range query [a, b] on the cloud provider, it first applies the order-preserving hash-based function to the lower and upper bounds of the query, i.e., fk(a) and fk(b). Note that a and b are also two values in [x1, xN]. Finally, the customer sends [fk(a), fk(b)] as a query to the cloud provider. Upon receiving the query [fk(a), fk(b)], the cloud provider processes this query on the n data items (d1)k, · · · , (dn)k by checking which fk(dj) (1 ≤ j ≤ n) satisfies the condition fk(a) ≤ fk(dj) ≤ fk(b). Based on the order-preserving property of the function fk(), dj ∈ [a, b] if and only if fk(a) ≤ fk(dj) ≤ fk(b). Thus, the cloud provider only needs to return all encrypted data items whose hash values fk() are in the range [fk(a), fk(b)].
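To make the three phases concrete, here is a hedged end-to-end sketch of the scheme in this subsection. All names are ours; the first component of each stored pair stands in for the DES ciphertext (d)k, and we keep the plaintext there only so the toy example stays short.

```python
import hmac, hashlib
from typing import List, Tuple

KEY = b"shared-secret-k"   # hypothetical shared key k
X1 = 1                     # lower end x_1 of the shared data domain [x_1, x_N]

def f(x: int) -> int:
    """Order-preserving hash f_k(x) = sum of HMAC-MD5 values over x_1..x (Eq. 3.1)."""
    return sum(int.from_bytes(hmac.new(KEY, str(q).encode(), hashlib.md5).digest(), "big")
               for q in range(X1, x + 1))

def submit(data: List[int]) -> List[Tuple[int, int]]:
    """Data submission: the organization uploads ((d)_k, f_k(d)) pairs, sorted by value."""
    return [(d, f(d)) for d in sorted(data)]   # d stands in for the ciphertext (d)_k

def process(store: List[Tuple[int, int]], fa: int, fb: int) -> List[int]:
    """Query processing: the cloud compares encodings only; it never sees a, b, or any d."""
    return [ct for (ct, fd) in store if fa <= fd <= fb]

store = submit([23, 5, 41, 17])
# Query submission: the customer sends [f_k(a), f_k(b)] for the range [10, 30].
print(process(store, f(10), f(30)))   # -> [17, 23]
```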
3.3.3 Optimization of the Order-Preserving Hash-based Function

We propose an optimization technique to reduce the communication cost of sending the encoded data fk(d1), · · · , fk(dn) to the cloud provider. This cost can be significant for two reasons. First, the result of the hash function hk() is long. For example, the result of keyed HMAC-MD5 is 128 bits, and the result of keyed HMAC-SHA1 is 160 bits. Second, the result of our order-preserving hash-based function fk() is even longer due to the sum of multiple hash values of hk(). Let w denote the bit length of hk(). For any xi (xi ∈ [x1, xN]), the condition hk(xi) ≤ 2^w − 1 holds. Thus, we have

fk(xi) ≤ fk(xN) = Σ_{q=1}^{N} hk(xq) ≤ Σ_{q=1}^{N} (2^w − 1) < 2^w N    (3.3)

Therefore, the bit length of fk(xi) (1 ≤ i ≤ N) is less than log2(2^w N) = w + log2 N. Assume that we use keyed HMAC-MD5 as the hash function hk() and the number of possible values is one million, i.e., N = 10^6. Any fk(xi) can be expressed by a 128 + ⌈log2 10^6⌉ = 148 bit number.

To reduce the bit length of fk(xi) (xi ∈ [x1, xN]), the idea is to divide every fk(xi) by a value 2^{w′} such that 2^{w′} is less than or equal to every hk(xq) (xq ∈ [x1, xN]). Then, our order-preserving hash-based function becomes

f*k(xi) = ⌊ (Σ_{q=1}^{i} hk(xq)) / 2^{w′} ⌋    (3.4)

where 2^{w′} ≤ hk(xq) for any xq ∈ [x1, xN]. Such division can easily be done by deleting the rightmost w′ bits of fk(xi). Similar to the analysis of the bit length of fk(xi), the bit length of any f*k(xi) (xi ∈ [x1, xN]) can be reduced to w − w′ + log2 N by the following calculation.

f*k(xi) ≤ f*k(xN) ≤ ⌊ (Σ_{q=1}^{N} (2^w − 1)) / 2^{w′} ⌋ < 2^{w−w′} N    (3.5)

The order-preserving function f*k() also satisfies the two properties, order preserving and collision resistance; the proof is in Appendix B.

3.3.4 Analysis of Information Leakage

Given n data items d1, · · · , dn and a range query [a, b], any precise privacy-preserving scheme should enable the cloud provider to find out all the data items that satisfy the query [a, b] without revealing the values of the data items from the organization and the query from its customer. According to this requirement, we have the following theorem.

Theorem 2. Given any precise privacy-preserving scheme, if all possible results of range queries have been found during the query processing phase, the cloud provider can reveal the order of the encrypted data items.

Proof. Without loss of generality, we assume d1 < d2 < · · · < dn. First, we show how to reveal the order of three consecutive data items dj−1, dj, dj+1. Among all possible results of range queries, there should be two query results, QR1 = {(dj−1)k, (dj)k} and QR2 = {(dj)k, (dj+1)k}.
Assume that [a1, b1] and [a2, b2] are the two range queries whose results are QR1 and QR2, respectively. Obviously, [a1, b1], [a2, b2], and dj−1, dj, dj+1 satisfy two conditions: (1) dj−2 < a1 ≤ dj−1 < dj ≤ b1 < dj+1; (2) dj−1 < a2 ≤ dj < dj+1 ≤ b2 < dj+2. Based on QR1, the cloud provider knows that (dj−1)k and (dj)k are two consecutive data items. Similarly, based on QR2, the cloud provider knows that (dj)k and (dj+1)k are two consecutive data items. Based on the common encrypted data item (dj)k in QR1 and QR2, the cloud provider knows that dj is between dj−1 and dj+1. Thus, the cloud provider knows that the order of these three encrypted data items is either dj−1 < dj < dj+1 or dj−1 > dj > dj+1. Repeating this argument for the other triples of consecutive items, the cloud provider finally knows that the order of all the encrypted data items is either d1 < · · · < dn or d1 > · · · > dn.

Theorem 2 describes the minimum information leakage for any precise privacy-preserving scheme of range queries. That is, the cloud provider will reveal the order of the data items received from the organization. We argue that our privacy-preserving scheme achieves the minimum information leakage for two reasons. First, the cloud provider cannot reveal the values of the data and queries from the hash results due to the one-way property of hk(). Second, the cloud provider cannot reveal these values by launching statistical analysis because for any two data items dj1 and dj2 (1 ≤ j1 < j2 ≤ n), (dj1)k ≠ (dj2)k and fk(dj1) ≠ fk(dj2). Recall that if some data items have the same value, the organization represents them as one data item annotated with the number of items that share this value.

3.4 Integrity Preserving for 1-dimensional Data

In this section, we present the first probabilistic integrity-preserving scheme for 1-dimensional data. This scheme allows a customer to verify the integrity of a query result with a high probability. The meaning of integrity preserving is twofold. First, a customer can verify whether the cloud provider forged some data items in the query result. Second, a customer can verify whether the cloud provider deleted data items that satisfy the query.

The basic idea of the integrity-preserving scheme is to encrypt neighborhood information for each data item such that the neighborhood information of the data items in a query result can be used to verify the integrity of the query result. More formally, let (M(dj))k denote the encrypted neighborhood information for each data item dj (1 ≤ j ≤ n). To submit n data items d1, · · · , dn to a cloud provider, the organization not only sends the encrypted data items (d1)k, · · · , (dn)k and the encoded data items fk(d1), · · · , fk(dn), but also sends the encrypted neighborhood information (M(d1))k, · · · , (M(dn))k. Upon receiving a query [fk(a), fk(b)] from a customer, the cloud provider first finds the query result based on the privacy-preserving scheme. Suppose that the data items dj1, · · · , dj2 (1 ≤ j1 ≤ j2 ≤ n) satisfy the query. The cloud provider not only returns the query result (dj1)k, · · · , (dj2)k to the customer, but also returns the encrypted neighborhood information (M(dj1))k, · · · , (M(dj2))k.
For ease of presentation, let QR denote the query result, which includes all the encrypted data items that satisfy the query, i.e., QR = {(dj1)k, · · · , (dj2)k}, and let VO denote the verification object, which includes the information for the customer to verify the integrity of QR, i.e., VO = {(M(dj1))k, · · · , (M(dj2))k}. To verify the integrity, the customer first decrypts the query result and the verification object, i.e., computes dj1, · · · , dj2 and M(dj1), · · · , M(dj2). Second, the customer checks whether dj1, · · · , dj2 satisfy the query and whether the overlapping parts of the neighborhood information from every two adjacent data items exactly match. If so, the customer concludes that the query result includes all the data items that satisfy the query. Otherwise, the customer concludes that some data items in the query result were forged or deleted by the cloud provider. Figure 3.3 shows the basic idea of our integrity-preserving scheme.

Figure 3.3. Basic idea of the integrity-preserving scheme: the organization sends (d1)k, …, (dn)k and (M(d1))k, …, (M(dn))k to the cloud provider; for a query [fk(a), fk(b)] with dj1, …, dj2 ∈ [a, b], the cloud provider replies QR = {(dj1)k, …, (dj2)k} and VO = {(M(dj1))k, …, (M(dj2))k}.

Our integrity-preserving scheme guarantees detection of the misbehavior of forging data items, because the cloud provider cannot insert fake data items into a query result without knowing the secret key k. This scheme also allows a customer to detect the misbehavior of deleting data items from a query result with a high probability. Next we present new data structures, called bit matrices and local bit matrices, and then discuss their usage in integrity verification.

3.4.1 Bit Matrices and Local Bit Matrices

To define bit matrices and local bit matrices, we first need to partition the data domain into multiple non-overlapping buckets. For example, in Figure 3.4 we partition the domain [1, 15] into five buckets, B1 = [1, 3], B2 = [4, 6], B3 = [7, 10], B4 = [11, 12], and B5 = [13, 15]. Second, we distribute the data items into the corresponding buckets. Third, we assign a bit value 1 to the buckets that include data items, and a bit value 0 to the buckets that do not include data items. Let B(dj) denote the bucket that includes dj. A bucket is called the left nonempty bucket of data item dj if it is the nearest bucket to the left of B(dj) that includes data items. Similarly, a bucket is called the right nonempty bucket of data item dj if it is the nearest bucket to the right of B(dj) that includes data items. For example, for data item 7 in Figure 3.4, B2 and B5 are the left and right nonempty buckets of data item 7, respectively.

Figure 3.4. Example bit matrix and local bit matrices: data items 5, 7, 9, 13, and 14 fall into buckets B2, B3, B3, B5, and B5, respectively, so M = 01101; the local bit matrices are M(5) = 2|011|1|1, M(7) = 3|1101|2|1, M(9) = 3|1101|2|2, M(13) = 5|101|2|1, and M(14) = 5|101|2|2.

Based on the above concepts, we define bit matrices and local bit matrices as follows. The bit matrix of all data items, M, is formed by the bit values of all buckets. In Figure 3.4, the bit matrix of the five data items is 01101, i.e., M = 01101. The local bit matrix of a data item dj, M(dj), consists of four parts: (1) the bucket id of B(dj); (2) a subset of the bit matrix, formed by the bit values from its left nonempty bucket to its right nonempty bucket; (3) the number of data items in bucket B(dj); (4) a distinct integer to distinguish the local bit matrix of dj from those of other data items in bucket B(dj).
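As a concreteness check of this four-part definition, the sketch below rebuilds the bit matrix and the local bit matrices of Figure 3.4, whose worked example continues in the text. All function names are ours, and the boundary behavior (extending the window to the first or last bucket when no nonempty neighbor exists) is inferred from the values shown in Figure 3.4.

```python
from typing import Dict, List, Tuple

def local_bit_matrices(data: List[int], buckets: List[Tuple[int, int]]):
    """Build the bit matrix M and the local bit matrix of every data item."""
    data = sorted(data)
    per_bucket = [[d for d in data if l <= d <= h] for (l, h) in buckets]
    bits = [1 if b else 0 for b in per_bucket]                 # the bit matrix M
    local: Dict[int, Tuple[int, str, int, int]] = {}
    for i, items in enumerate(per_bucket):
        if not items:
            continue
        # Window from the left nonempty bucket to the right nonempty bucket
        # (falling back to the domain boundary when no such bucket exists).
        left = max((j for j in range(i) if bits[j]), default=0)
        right = min((j for j in range(i + 1, len(bits)) if bits[j]), default=len(bits) - 1)
        window = "".join(str(b) for b in bits[left:right + 1])
        for rank, d in enumerate(items, start=1):
            # (bucket id, bit-matrix window, #items in the bucket, distinct index)
            local[d] = (i + 1, window, len(items), rank)
    return bits, local

buckets = [(1, 3), (4, 6), (7, 10), (11, 12), (13, 15)]
M, local = local_bit_matrices([5, 7, 9, 13, 14], buckets)
print(M)         # [0, 1, 1, 0, 1], i.e. M = 01101
print(local[7])  # (3, '1101', 2, 1), i.e. M(7) = 3|1101|2|1
```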
In Figure 3.4, the local bit matrix of data item 7 is 3|1101|2|1, i.e., M(7) = 3|1101|2|1, where 3 is the bucket id, 1101 is the subset of the bit matrix, 2 is the number of data items in bucket B3, and 1 is the integer to distinguish M(7) from M(9). Intuitively, the bit matrix denotes the abstract information of all the data items, and the local bit matrix of a data item dj denotes the abstract neighborhood information of dj. Note that the usage of bucket partitioning in this work is different from that in previous work (e.g., [40, 41, 10]): they leverage bucket partitioning to achieve privacy-preserving query processing, whereas we use the bit values of buckets to verify the integrity of query results.

3.4.2 The Integrity-Preserving Scheme

Our integrity-preserving scheme includes four phases: data submission, query submission, query processing, and query result verification. Let d1, · · · , dn denote the data items of an attribute from the organization. The organization first partitions the data domain into m non-overlapping buckets B1, · · · , Bm, and then distributes d1, · · · , dn into these buckets. The bucket partition is a shared secret between the organization and its customers. Second, the organization computes the local bit matrix for each data item and then encrypts them with its secret key k, i.e., computes (M(d1))k, · · · , (M(dn))k. Third, the organization sends to the cloud provider the encrypted local bit matrices (M(d1))k, · · · , (M(dn))k as well as the encrypted data items (d1)k, · · · , (dn)k and the encoded data items fk(d1), · · · , fk(dn).

To perform a range query [a, b], a customer sends [fk(a), fk(b)] to the cloud provider. Upon receiving [fk(a), fk(b)], the cloud provider computes QR as in Section 3.3.2. Here we consider how to compute VO. If QR = {(dj1)k, · · · , (dj2)k} (1 ≤ j1 ≤ j2 ≤ n), then VO = {(M(dj1))k, · · · , (M(dj2))k}; if QR = ∅, which means that there is a data item dj1 (1 ≤ j1 ≤ n) such that dj1 < a ≤ b < dj1+1, then VO = {(M(dj1))k, (M(dj1+1))k}. Finally, the cloud provider replies with QR and VO.

Upon receiving the query result QR and the verification object VO, the customer decrypts them, and then verifies the integrity of QR as follows. First, the customer verifies whether each item in QR satisfies the query [a, b]. Second, the customer verifies whether the cloud provider deleted any data item that satisfies the query. Let {(dj1)k, · · · , (dj2)k} be the correct query result and Bg1, · · · , Bgt be the buckets that include at least one data item in the query result. Let QR be the query result from the cloud provider. Suppose the cloud provider deletes a data item (dj)k that satisfies the query, i.e., (dj)k ∈ {(dj1)k, · · · , (dj2)k}, and dj ∈ Bgs (1 ≤ s ≤ t). We consider the following four cases.

Case 1: When QR ≠ ∅, if Bgs ⊆ [a, b], the deletion can be detected for two reasons. First, if Bgs includes only one data item dj, deleting (dj)k can be detected because the local bit matrices of data items in Bgs−1 or Bgs+1 show that Bgs should include at least one data item. Second, if Bgs includes multiple data items, deleting (dj)k can be detected because the local bit matrices of other data items in Bgs contain the number of data items in Bgs. In Figure 3.4, given a range query [4,11], the correct query result is {(5)k, (7)k, (9)k}, and the verification object is {(M(5))k, (M(7))k, (M(9))k}. Deleting (7)k in B3 can be detected because, based on M(9), the customer knows that B3 includes two data items.
Case 2: When QR ≠ ∅, if Bgs ⊄ [a, b], the deletion cannot be detected because the customer does not know whether Bgs ∩ [a, b] includes data items. Considering the same example as in Case 1, deleting (5)k cannot be detected because the customer does not know whether B2 ∩ [4, 11] includes data items.

Case 3: When QR = ∅, if B(dj1) ∩ [a, b] = ∅ and B(dj1+1) ∩ [a, b] = ∅, the deletion can be detected because M(dj1) or M(dj1+1) shows that a bucket Bgs between B(dj1) and B(dj1+1) includes data items, and hence the condition dj1 < a ≤ b < dj1+1 does not hold. In Figure 3.4, given a range query [3,5], the correct query result is {(5)k}. If the cloud provider replies QR = ∅ and VO = {(M(7))k}, deleting (5)k can be detected because the customer knows that B2 includes data items based on M(7), and these data items are closer to the query [3,5] than 7.

Case 4: When QR = ∅, if B(dj1) ∩ [a, b] ≠ ∅ or B(dj1+1) ∩ [a, b] ≠ ∅, the deletion cannot be detected because the customer does not know whether B(dj1) ∩ [a, b] or B(dj1+1) ∩ [a, b] includes data items. In Figure 3.4, given a range query [9,12], the correct query result is {(9)k}. If the cloud provider replies QR = ∅ and VO = {(M(7))k, (M(13))k}, deleting (9)k cannot be detected because the customer does not know whether B3 ∩ [9, 12] includes data items.

The cloud provider can break integrity verification if and only if it can distinguish Cases 2 and 4 from the other two cases. Distinguishing these two cases is equivalent to knowing which data items belong to the same bucket. However, such information cannot be revealed by analyzing the encrypted data items (d1)k, · · · , (dn)k, the encoded data items fk(d1), · · · , fk(dn), and the encrypted local bit matrices (M(d1))k, · · · , (M(dn))k, because for any two data items dj1 and dj2 (1 ≤ j1 < j2 ≤ n), (dj1)k ≠ (dj2)k, fk(dj1) ≠ fk(dj2), and (M(dj1))k ≠ (M(dj2))k. Thus, the cloud provider can only randomly delete data items in a query result and hope that the deletion will not be detected.

3.5 Finding Optimal Parameters

Our integrity-preserving scheme needs to partition the data domain into multiple non-overlapping buckets. However, how to partition the domain remains a question. In this section, we formalize it as an optimization problem and present an algorithm to solve it. To simplify the problem, we assume that queries from customers follow a uniform distribution, i.e., all queries are equi-probable. Considering the N possible values in [x1, xN], there are N(N+1)/2 possible range queries. Thus, the probability of any query from customers is equal to 2/(N(N+1)).

3.5.1 Detection Probability

We first consider the detection probability of the cloud provider randomly deleting data items in a query result. This metric is very important for evaluating the effectiveness of our integrity-preserving scheme. Let B1, · · · , Bm denote the multiple non-overlapping buckets, and [li, hi] denote a bucket Bi (1 ≤ i ≤ m). A bucket Bi is called a single-value bucket if li = hi. Let e(x) denote the frequency of the data item with value x. The probability that a deletion operation of the cloud provider can be detected is

Pr = ( Σ_{i=1}^{m} Σ_{x=li}^{hi} (li − x1 + 1)(xN − hi + 1) e(x) ) / ( Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1) )    (3.6)

The calculation of this probability is in Appendix C.
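A small sketch of Equation 3.6, with names of our own choosing; run on the running example of Figure 3.4 it reproduces the 60.48% detection probability quoted below.

```python
from typing import Dict, List, Tuple

def detection_probability(buckets: List[Tuple[int, int]],
                          freq: Dict[int, int],
                          x1: int, xN: int) -> float:
    """Probability (Eq. 3.6) that deleting one item from the result of a uniformly
    random range query over [x1, xN] is detected, for a given bucket partition."""
    num = sum((l - x1 + 1) * (xN - h + 1) * e
              for (l, h) in buckets
              for x, e in freq.items() if l <= x <= h)
    den = sum((d - x1 + 1) * (xN - d + 1) * e for d, e in freq.items())
    return num / den

# Running example of Figure 3.4: items 5, 7, 9, 13, 14 in the domain [1, 15].
buckets = [(1, 3), (4, 6), (7, 10), (11, 12), (13, 15)]
freq = {5: 1, 7: 1, 9: 1, 13: 1, 14: 1}
print(f"{detection_probability(buckets, freq, 1, 15):.2%}")   # 60.48%
```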
Next, we discuss two theorems regarding Pr; the proofs of these two theorems are in Appendix D.

Theorem 3. Given any n data items d1, · · · , dn, the maximum detection probability of a deletion operation is

Prmax = 100%    (3.7)

if and only if each data item dj (1 ≤ j ≤ n) forms a single-value bucket, i.e., [dj, dj].

Theorem 4. Given n data items d1, · · · , dn, the minimum detection probability of a deletion operation is

Prmin = n / ( Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1) )    (3.8)

if and only if there is only one bucket [x1, xN].

The intuition behind Theorems 3 and 4 is that the more secret information about the data items the organization shares with its customers, the higher the detection probability the customers achieve. For Theorem 3, if each data item forms a single-value bucket, the customers know all the data items before querying the cloud provider. Of course they can then detect any deletion operation, and it becomes meaningless for organizations to outsource their data. For Theorem 4, recall our privacy-preserving scheme in Section 3.3. Customers need to know [x1, xN] to convert a query [a, b] to [fk(a), fk(b)]. Thus, in our context, x1 and xN are the minimum secret information that needs to be shared. Knowing only x1 and xN allows customers to detect deletion operations with the minimum probability. If a cloud provider conducts t deletion operations, the probability that at least one of the t deletion operations is detected is

Prt = 1 − (1 − Pr)^t    (3.9)

In Figure 3.4, the probability that a deletion operation can be detected is 60.48%. If the cloud provider conducts 5 deletion operations over the query results, customers can detect at least one deletion with 99% probability.

3.5.2 Optimal Bucket Partition

We define the optimization problem as follows. Given n data items from the organization and the domain of these items [x1, xN], we want to find the optimal partition with at most m buckets B1, · · · , Bm such that the detection probability Pr is maximized. More formally, this problem can be defined as follows. Input: (1) d1, · · · , dn; (2) [x1, xN]; (3) m. Output: B1, · · · , Bm. Objective: max Pr.

This problem has the optimal substructure property [25]. Therefore, we can express the optimal solution of the original problem as the combination of the optimal solutions of two sub-problems. Let H(N, m) denote the problem of optimally partitioning the domain [x1, xN] using at most m buckets. Let δ(i, j) denote the probability contributed by a bucket [xi, xj]. We can compute δ(i, j) as follows.

δ(i, j) = ( (xi − x1 + 1)(xN − xj + 1) Σ_{x=xi}^{xj} e(x) ) / ( Σ_{t=1}^{n} (dt − x1 + 1)(xN − dt + 1) )    (3.10)

The optimization problem can be expressed as follows.

H(N, m) = max_i [H(N − i, m − 1) + δ(N − i + 1, N)]    (3.11)

Algorithm 1: Optimal Bucket Partition
Input: (1) n data items d1, · · · , dn; (2) the domain [x1, xN]; (3) m.
Output: B1, · · · , Bm.
Initialize each element in matrices H and P to 0;
for i := 2 to N do
    H[i][2] := max_{1≤k≤i−1} [δ(1, k) + δ(k + 1, i)];
    store the left boundary value of the second bucket in P[i][2];
for j := 3 to m do
    for i := j to N do
        H[i][j] := max_{j−1≤k≤i−1} [H[k][j − 1] + δ(k + 1, i)];
        store the left boundary value of the last bucket in P[i][j];
Find the maximum value in H and output the corresponding partition using P;

We use dynamic programming to solve the problem. We first solve and store the solutions of the smaller sub-problems. Then, we employ their optimal solutions to solve the larger problems. Finally, we solve the optimal problem of maximizing H(N, m). All intermediate solutions are stored in an N × m matrix H. The row indices of H are 1, · · · , N and the column indices are 1, · · · , m. Note that H(i, j) = H[i][j]. Both the time and space complexities of the computation of the matrix H are O(Nm). Along with the optimal value H(i, j) (1 ≤ i ≤ N, 1 ≤ j ≤ m), we also store the lower bound of the last bucket of each sub-problem in another N × m matrix P. Finally, we use the matrix P to reconstruct the optimal bucket partition in O(m) time. This procedure is shown in Algorithm 1.
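The following is a compact re-implementation of the dynamic program behind Algorithm 1, written from the recurrence above; the indexing, the exactly-j formulation, and all names are ours, so treat it as a sketch rather than the dissertation's implementation.

```python
from typing import Dict, List, Tuple

def optimal_partition(freq: Dict[int, int], x1: int, xN: int, m: int):
    """Split [x1, xN] into at most m contiguous buckets maximizing Eq. 3.6."""
    N = xN - x1 + 1
    e = [freq.get(x1 + t, 0) for t in range(N)]
    cum = [0] * (N + 1)                                   # prefix sums of e()
    for t in range(N):
        cum[t + 1] = cum[t] + e[t]
    D = sum((d - x1 + 1) * (xN - d + 1) * c for d, c in freq.items())

    def delta(i: int, j: int) -> float:                   # delta(i, j) of Eq. 3.10, 1-based
        return i * (N - j + 1) * (cum[j] - cum[i - 1]) / D

    NEG = float("-inf")
    H = [[NEG] * (m + 1) for _ in range(N + 1)]           # H[i][j]: best split of x_1..x_i into j buckets
    P = [[0] * (m + 1) for _ in range(N + 1)]             # P[i][j]: left index of the last bucket
    H[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(j, N + 1):
            for k in range(j - 1, i):
                v = H[k][j - 1] + delta(k + 1, i)
                if v > H[i][j]:
                    H[i][j], P[i][j] = v, k + 1
    best_j = max(range(1, m + 1), key=lambda jj: H[N][jj])
    parts, i, j = [], N, best_j
    while i > 0:
        l = P[i][j]
        parts.append((x1 + l - 1, x1 + i - 1))
        i, j = l - 1, j - 1
    return H[N][best_j], list(reversed(parts))

pr, parts = optimal_partition({5: 1, 7: 1, 9: 1, 13: 1, 14: 1}, 1, 15, 5)
print(f"{pr:.2%}", parts)
```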
3.6 Query Over Multi-dimensional Data

3.6.1 Privacy for Multi-dimensional Data

Organizations' data and customers' queries are typically multi-dimensional. For example, a medical record typically includes the patient's name, birthday, age, etc. A z-dimensional data item D is a z-tuple (d^1, · · · , d^z) where each d^r (1 ≤ r ≤ z) is the value for the r-th dimension (i.e., attribute). A z-dimensional range query consists of z sub-queries [a1, b1], · · · , [az, bz] where each sub-query [ar, br] (1 ≤ r ≤ z) is a range over the r-th dimension.

We extend our privacy-preserving scheme for one-dimensional data to multi-dimensional data as follows. Let D1, · · · , Dn denote n z-dimensional data items, where Dj = (d^1_j, · · · , d^z_j) (1 ≤ j ≤ n). First, the organization encrypts these data items with its secret key k, i.e., computes (D1)k, · · · , (Dn)k. Second, for each dimension r, it applies our order-preserving hash-based function fkr(), i.e., computes fkr(d^r_1), · · · , fkr(d^r_n), where kr is the secret key of the order-preserving hash-based function for the r-th dimension. Last, it sends the encrypted data items (D1)k, · · · , (Dn)k and the encoded values fk1(d^1_1), · · · , fk1(d^1_n), · · · , fkz(d^z_1), · · · , fkz(d^z_n) to the cloud provider. When a customer wants to perform a query ([a1, b1], · · · , [az, bz]), it applies the order-preserving hash-based function fkr() to the lower and upper bounds of each sub-query [ar, br] and sends [fk1(a1), fk1(b1)], · · · , [fkz(az), fkz(bz)] to the cloud provider. The cloud provider then compares fkr(d^r_1), · · · , fkr(d^r_n) with [fkr(ar), fkr(br)] for each dimension r to find out the query result QR. Considering 5 two-dimensional data items (1,11), (3,5), (6,8), (7,1), and (9,4), given a range query ([2,7],[3,8]), the query result QR is {(3, 5)k, (6, 8)k}.

To prevent the attack of statistical analysis, the data sent from the organization to the cloud provider should satisfy the following two conditions. First, for any 1 ≤ j1 ≠ j2 ≤ n, (Dj1)k ≠ (Dj2)k. To satisfy this condition, if multiple data items have the same value in every dimension, the organization can simply represent them as one data item annotated with the number of these items. Second, along each dimension r, for any 1 ≤ j1 ≠ j2 ≤ n, fkr(d^r_j1) ≠ fkr(d^r_j2). To satisfy this condition, the organization needs to revise the data items that have the same value in dimension r. Recall the arbitrary order-preserving property of fkr(): it allows the organization to arbitrarily revise data items while still preserving the order of these items in the hash results. In our context, if d^r_j1 = d^r_j2, the organization can concatenate a distinct number to each of them, i.e., d^r_j1|0 and d^r_j2|1, and then apply the hash-based function fkr().
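A brief sketch of the per-dimension encoding just described, reusing Equation 3.1 with one key per dimension; the keys, the domain lower bound, and the function names are ours. On the 2-dimensional example above it reproduces QR = {(3, 5)k, (6, 8)k} (shown here in plaintext for readability).

```python
import hmac, hashlib
from typing import List, Tuple

DOMAIN_MIN = 0   # assumed lower bound x^r_1 of every dimension's domain

def f(key: bytes, x: int) -> int:
    """Per-dimension order-preserving hash f_{k_r}(x) (Eq. 3.1)."""
    return sum(int.from_bytes(hmac.new(key, str(q).encode(), hashlib.md5).digest(), "big")
               for q in range(DOMAIN_MIN, x + 1))

def encode_item(item: Tuple[int, ...], keys: List[bytes]) -> Tuple[int, ...]:
    """Encode each dimension of a data item with its own key k_r (ciphertext (D)_k omitted)."""
    return tuple(f(k, v) for k, v in zip(keys, item))

def matches(enc_item, enc_query) -> bool:
    """Cloud-side check: f_{k_r}(a_r) <= f_{k_r}(d_r) <= f_{k_r}(b_r) for every dimension r."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(enc_item, enc_query))

keys = [b"k1", b"k2"]                                   # one secret key per dimension
data = [(1, 11), (3, 5), (6, 8), (7, 1), (9, 4)]
enc_data = [encode_item(d, keys) for d in data]
enc_query = [(f(keys[0], 2), f(keys[0], 7)),            # sub-query [2, 7] on dimension 1
             (f(keys[1], 3), f(keys[1], 8))]            # sub-query [3, 8] on dimension 2
print([d for d, e in zip(data, enc_data) if matches(e, enc_query)])   # [(3, 5), (6, 8)]
```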
3.6.2 Integrity for Multi-dimensional Data

To preserve the integrity of multi-dimensional data, the organization builds multi-dimensional local bit matrices. We first present the data structures, multi-dimensional bit matrices and local bit matrices, and then discuss their usage in integrity verification for multi-dimensional data. Considering the example in Figure 3.5(a), we partition the data domain into 4 × 6 = 24 buckets. Then, we distribute the data items, D1, · · · , D5, into the corresponding buckets. We assign a bit value 1 or 0 to each bucket to indicate whether or not the bucket includes data items. Let B(Dj) denote the bucket that includes Dj. A bucket is called the r-th left nonempty bucket of data item Dj (1 ≤ r ≤ z) if it is the nearest bucket to the left of B(Dj) along the r-th dimension that includes data items. Similarly, a bucket is called the r-th right nonempty bucket of data item Dj if it is the nearest bucket to the right of B(Dj) along the r-th dimension that includes data items. In Figure 3.5(a), B(D2) is the 1st left nonempty bucket of data item D3.

Figure 3.5. An example 2-dimensional bit matrix and local bit matrices: the domain is partitioned into a 4 × 6 grid of buckets holding the data items D1, · · · , D5; panel (a) shows the bucket grid and bit values, panel (b) highlights the local bit matrix window of D3, and panel (c) highlights a range query covering D2 and D3.

Based on the above concepts, we define bit matrices and local bit matrices as follows. The bit matrix of all data items, M, is formed by the bit values of all buckets; in Figure 3.5(a), M is the 4 × 6 0-1 matrix whose entries are 1 exactly for the buckets that contain one of D1, · · · , D5 and 0 elsewhere. The local bit matrix of a data item Dj, M(Dj), consists of four parts: (1) the bucket id of B(Dj); (2) a subset of the bit matrix, formed by the bit values of the buckets within the rectangle spanned by its left and right nonempty buckets in each dimension; (3) the number of data items in bucket B(Dj); (4) a distinct integer to distinguish the local bit matrix of Dj from those of other data items in bucket B(Dj). In Figure 3.5(b), the local bit matrix of D3 is

M(D3) = ID | (0 0 1; 0 1 0; 1 0 1) | 1 | 1

where ID is the bucket id of B(D3).

The integrity-preserving scheme for z-dimensional data (z > 1) is similar to that for 1-dimensional data; here we only show an example. Consider the five data items D1: (d^1_1, d^2_1), · · · , D5: (d^1_5, d^2_5) in Figure 3.5. The organization sends to the cloud provider the encrypted data items (D1)k, · · · , (D5)k, the encrypted local bit matrices (M(D1))k, · · · , (M(D5))k, and the encoded data items fk1(d^1_1), · · · , fk1(d^1_5), fk2(d^2_1), · · · , fk2(d^2_5). Given a range query that includes the two data items D2 and D3 in Figure 3.5(c), the cloud provider replies to the customer with the query result QR = {(D2)k, (D3)k} and the verification object VO = {(M(D2))k, (M(D3))k}.

Next, we analyze the detection probability for multi-dimensional data. Let B1, · · · , Bm denote the multiple non-overlapping buckets, and ([l^1_i, h^1_i], · · · , [l^z_i, h^z_i]) denote a z-dimensional bucket Bi (1 ≤ i ≤ m). A bucket Bi is called a single-value bucket if for each dimension r (1 ≤ r ≤ z), l^r_i = h^r_i. Let [x^r_1, x^r_Nr] denote the domain of each dimension r. Let e(X) denote the frequency of the data item X: (x^1, · · · , x^z). The detection probability of a deletion operation by cloud providers can be computed as

Pr = ( Σ_{i=1}^{m} [ Π_{r=1}^{z} (l^r_i − x^r_1 + 1)(x^r_Nr − h^r_i + 1) ] Σ_{X∈Bi} e(X) ) / ( Σ_{j=1}^{n} Π_{r=1}^{z} (d^r_j − x^r_1 + 1)(x^r_Nr − d^r_j + 1) )    (3.12)
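A sketch of Equation 3.12 under the same uniform-query assumption, with our own names and a toy 2-dimensional example; buckets are given as one (l^r, h^r) interval per dimension, and empty buckets contribute nothing to the numerator.

```python
from typing import Dict, List, Tuple

Box = List[Tuple[int, int]]        # one (l^r, h^r) interval per dimension

def detection_probability_md(buckets: List[Box],
                             freq: Dict[Tuple[int, ...], int],
                             lo: List[int], hi: List[int]) -> float:
    """Multi-dimensional detection probability of a single deletion (Eq. 3.12)."""
    def weight(box: Box) -> int:
        # Number of z-dimensional range queries that fully contain the box.
        p = 1
        for (l, h), x1, xN in zip(box, lo, hi):
            p *= (l - x1 + 1) * (xN - h + 1)
        return p
    num = sum(weight(b) * sum(e for X, e in freq.items()
                              if all(l <= x <= h for x, (l, h) in zip(X, b)))
              for b in buckets)
    den = sum(e * weight([(x, x) for x in X]) for X, e in freq.items())
    return num / den

# Toy 2-D check: one 2x2 bucket holding both items of a [0, 3] x [0, 3] domain.
print(detection_probability_md([[(1, 2), (1, 2)]], {(1, 1): 1, (2, 2): 1}, [0, 0], [3, 3]))
```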
Theorems 3 and 4 can also be extended to multi-dimensional data. We have the following two theorems.

Theorem 5. Given any n z-dimensional data items D1, · · · , Dn, the maximum detection probability of a deletion operation is

Prmax = 100%    (3.13)

if and only if each data item Dj (1 ≤ j ≤ n) forms a single-value bucket, i.e., ([d^1_j, d^1_j], · · · , [d^z_j, d^z_j]).

Theorem 6. Given n z-dimensional data items D1, · · · , Dn, the minimum detection probability of a deletion operation is

Prmin = n / ( Σ_{j=1}^{n} Π_{r=1}^{z} (d^r_j − x^r_1 + 1)(x^r_Nr − d^r_j + 1) )    (3.14)

if and only if there is only one bucket ([x^1_1, x^1_N1], · · · , [x^z_1, x^z_Nz]).

The calculation of the detection probability in Equation 3.12 and the proofs of Theorems 5 and 6 are similar to the 1-dimensional case. Finding the optimal bucket partition for multi-dimensional data is an interesting yet difficult problem and will be discussed in future work.

3.7 Evaluation

We evaluated the efficiency and effectiveness of our privacy and integrity preserving schemes for both 1-dimensional and multi-dimensional data. In terms of efficiency, we measured the data processing time for organizations, and the space cost and query processing time for cloud providers. In terms of effectiveness, we measured whether the experimental detection probability of deletion operations by cloud providers is consistent with the theoretical analysis discussed in Sections 3.5.1 and 3.6.2. Our experiments were implemented in Java 1.6.0 and carried out on a PC running Linux with 2 Intel Xeon cores and 16GB of memory.

3.7.1 Evaluation Setup

We conducted our experiments on a real data set, Adult, and five synthetic datasets. The Adult dataset is from the UCI Machine Learning Repository [32] and has been widely used in previous studies. It contains 45222 records. We chose three attributes in this dataset: Age, Education, and Hours-per-week. Note that Education is a categorical attribute, and we mapped each Education value to a distinct integer. The domains of these three attributes are [17, 90], [1, 16], and [1, 99], respectively. The five synthetic datasets were generated by randomly choosing 10^2, 10^3, · · · , 10^6 data items from the five domains [0, 10^3], [0, 10^4], · · · , [0, 10^7], respectively. For our order-preserving hash-based function fk(), we used HMAC-MD5 with 128-bit keys as the basic hash function hk(). We used the DES encryption algorithm to encrypt both data items and local bit matrices.

3.7.2 Results for 1-dimensional Data

We employed the synthetic datasets to evaluate the efficiency and effectiveness of our schemes for 1-dimensional data. For each synthetic dataset, given different numbers of buckets, we first computed the optimal partition and the maximum detection probability, and then we implemented our schemes using the optimal partition. We also generated 1,000 random range queries to measure the total processing time for cloud providers and to verify query result integrity for customers. To process a query, we used the binary search algorithm to find the query result. Let n denote the number of data items in a dataset and m denote the given number of buckets. According to Theorem 3, if all data items form single-value buckets, the detection probability is 100%. The remaining buckets are empty buckets, and the number of these buckets is n + 1. Thus, the total number of buckets is 2n + 1. In other words, given m = 2n + 1, the output of our optimal algorithm should be these 2n + 1 buckets. Based on this observation, we define the partition ratio as m/(2n + 1).
The partition ratio helps us normalize the results of our optimal partition algorithm across different datasets. Figure 3.6 shows the normalized results for the five synthetic datasets. We observed that the detection probability increases with the partition ratio, and if the partition ratio is equal to 1, i.e., m = 2n+1, the probability becomes 1, which confirms our discussion.

Figure 3.6. Effectiveness of the optimal partition algorithm: detection probability (%) versus partition ratio for n = 10^2 to 10^6.

Figure 3.7. Correctness of the integrity-preserving scheme: experimental detection probability versus theoretical detection probability.

To check whether the experimental detection probability is consistent with the theoretical analysis, for each dataset, we randomly deleted a data item from each query result and then computed the percentage of query results for which the deletion was detected by our integrity-preserving scheme. Note that this percentage is the experimental detection probability. Figure 3.7 shows that the experimental detection probability is close to the theoretical line, which demonstrates the correctness of our analysis.

Figures 3.8 and 3.9 show the data processing time and space cost for the five synthetic datasets, respectively. Note that the horizontal and vertical axes in these figures are in logarithmic scales. In Figure 3.8, we observed that the data processing time is less than 300 seconds for 10^5 data items. For one million data items, the data processing time is about 50 minutes, which is still reasonable for real applications because the data processing is a one-time offline procedure. In Figure 3.9, we observed that the space cost grows linearly with the number of data items in a dataset. A cloud provider needs 33MB to store one million data items from an organization. Figure 3.10 shows the total processing time of 1,000 queries for the five synthetic datasets. Processing 1,000 queries over one million data items takes only 2 seconds.

3.7.3 Results for Multi-dimensional Data

We employed the Adult dataset to evaluate the efficiency and effectiveness of our schemes for multi-dimensional data. The experimental results show that the data processing time for this dataset is 104 seconds, the space cost is 1.5MB, and the total processing time of 1,000 random queries is 3.5 seconds. Due to the absence of an optimal partition algorithm for multi-dimensional data, we arbitrarily partitioned the Adult dataset into different sets of buckets. The results show that the experimental detection probability is consistent with the theoretical analysis for multi-dimensional range queries.

Figure 3.8. Data processing time (s) versus the number of data items.

Figure 3.9. Space cost (KB) versus the number of data items.

Figure 3.10. Query processing time (s) versus the number of data items.

CHAPTER 4

Privacy Preserving Cross-Domain Cooperative Firewall Optimization

4.1 Introduction

4.1.1 Background and Motivation

Firewalls are critical in securing the private networks of businesses, institutions, and homes.
A firewall is often placed at the entrance between a private network and the external network so that it can check each incoming or outgoing packet and decide whether to accept or discard the packet based on its policy. A firewall policy is usually specified as a sequence of rules, called an Access Control List (ACL), and each rule has a predicate over multiple packet header fields (i.e., source IP, destination IP, source port, destination port, and protocol type) and a decision (i.e., accept or discard) for the packets that match the predicate. The rules in a firewall policy typically follow first-match semantics, where the decision for a packet is the decision of the first rule that the packet matches in the policy. Each physical interface of a router/firewall is configured with two ACLs: one for filtering outgoing packets and the other for filtering incoming packets. In this work, we use firewalls, firewall policies, and ACLs interchangeably.

The number of rules in a firewall significantly affects its throughput. Figure 4.1 shows the result of a performance test of iptables conducted by HiPAC [2]. It shows that increasing the number of rules dramatically reduces firewall throughput. Unfortunately, with the explosive growth of services deployed on the Internet, firewall policies are growing rapidly in size. Thus, optimizing firewall policies is crucial for improving network performance.

Figure 4.1. Effect of the number of rules on the throughput with frame size 128 bytes [2]: throughput (percentage) versus the number of rules (from 25 to 6400).
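Since everything that follows builds on the first-match semantics just described, here is a small sketch of first-match evaluation over interval predicates; the two-field rules and all names are our own toy illustration rather than a real five-field ACL.

```python
from typing import List, Optional, Tuple

# A rule: one (low, high) interval per field plus a decision, mirroring the
# predicate-plus-decision rule format described above.
Rule = Tuple[List[Tuple[int, int]], str]

def first_match(policy: List[Rule], packet: List[int]) -> Optional[str]:
    """Return the decision of the first rule the packet matches (first-match semantics)."""
    for intervals, decision in policy:
        if all(lo <= p <= hi for p, (lo, hi) in zip(packet, intervals)):
            return decision
    return None   # real ACLs end with a catch-all default rule, so this case does not arise there

# Toy two-field firewall: discard traffic from the [10, 19] range to port 80, accept the rest.
fw = [
    ([(10, 19), (80, 80)], "discard"),
    ([(0, 255), (0, 65535)], "accept"),
]
print(first_match(fw, [12, 80]))   # discard (the first matching rule wins)
print(first_match(fw, [12, 443]))  # accept
```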
For a rule r in FW2, if all the packets that match r but do not match any rule above r in FW2 are discarded by FW1, rule r can be removed, because such packets never reach FW2. We call such a rule r an inter-firewall redundant rule with respect to FW1. Note that FW1 and FW2 only filter the traffic from FW1 to FW2; the traffic from firewall 2's outgoing interface to firewall 1's incoming interface is guarded by two other separate policies. For simplicity, we assume that FW1 and FW2 have no intra-firewall redundancy, as such redundancy can be removed using existing solutions [50, 52].

Figure 4.2 illustrates inter-firewall redundancy, where two adjacent routers belong to different administrative domains CSE and EE. The physical interfaces connecting the two routers are denoted I1 (on the CSE side) and I2 (on the EE side), respectively. The rules of the two firewall policies FW1 and FW2, which are used to filter the traffic flowing from CSE to EE, are listed in two tables following the format used in Cisco Access Control Lists. Note that SIP, DIP, SP, DP, PR, and Dec denote source IP, destination IP, source port, destination port, protocol type, and decision, respectively.

FW1 (filtering I1's outgoing packets):
      SIP        DIP            SP   DP   PR    Dec
r1'   1.2.*.*    192.168.*.*    *    *    TCP   discard
r2'   2.3.*.*    192.168.*.*    *    *    TCP   accept
r3'   *          *              *    *    *     discard

FW2 (filtering I2's incoming packets):
      SIP        DIP            SP   DP   PR    Dec
r1    1.2.1.*    192.168.1.*    *    25   TCP   accept
r2    1.2.1.*    192.168.*.*    80   *    TCP   discard
r3    *          *              *    *    *     accept

Figure 4.2. Example inter-firewall redundant rules

Clearly, all the packets that match r1 and r2 in FW2 are discarded by r1' in FW1. Thus, r1 and r2 of FW2 are inter-firewall redundant with respect to r1' in FW1.

4.1.4 Technical Challenges and Our Approach

The key challenge is to design a protocol that allows two adjacent firewalls to identify the inter-firewall redundancy with respect to each other without knowing the policy of the other firewall. While intra-firewall redundancy removal is already complex [50, 52], inter-firewall redundancy removal with the privacy-preserving requirement is even harder. To determine whether a rule in FW2 is inter-firewall redundant with respect to FW1, Net2 certainly needs some information about FW1; yet, Net2 must not be able to reconstruct FW1 from that information. A straightforward solution is to perform a privacy-preserving comparison between two rules from the two adjacent firewalls. Particularly, for each rule r in FW2, this solution checks whether all possible packets that match rule r in FW2 match a rule r' with the discard decision in FW1. If such a rule r' exists, r is declared inter-firewall redundant with respect to r' in FW1. However, because firewalls follow first-match semantics and the rules in a firewall typically overlap, this solution is both incorrect and incomplete. Incorrect means that wrong redundant rules could be identified in FW2. Suppose this solution identifies r as a redundant rule in FW2 with respect to r2' in FW1. However, if some packets that match rule r also match rule r1' (where r1' is above r2') with the accept decision in FW1, these packets will pass through FW1, and FW2 then needs to filter them with r. In this case, r is actually not redundant. Incomplete means that only a portion of the redundant rules can be identified in FW2. If all possible packets that match rule r in FW2 are discarded not by one rule but by multiple rules in FW1 together, r is also redundant. However, the direct comparison solution cannot identify such redundancies.
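The failure of direct rule-to-rule comparison suggests reasoning about the set of packets each FW2 rule actually receives, i.e., the packets that match it but no rule above it. The brute-force sketch below makes this concrete on a toy two-field packet space. It is purely illustrative: the rules and domains are made up, exhaustive enumeration does not scale, and it is emphatically not the privacy-preserving protocol developed in this chapter. It does show, however, that a FW2 rule can be redundant even though its packets are discarded by several different FW1 rules jointly.

# Brute-force inter-firewall redundancy check over a toy packet space
# (illustration only; made-up rules, and exhaustive enumeration does not scale).
from itertools import product

SRC = range(4)                    # toy source-address domain {0,1,2,3}
DST = range(4)                    # toy destination-address domain {0,1,2,3}
ALL = set(range(4))

FW1 = [                           # filters traffic leaving Net1 toward Net2
    (({0}, ALL), "discard"),
    (({1}, ALL), "discard"),
    (({2}, ALL), "accept"),
    ((ALL, ALL), "discard"),
]
FW2 = [                           # filters the same traffic as it enters Net2
    (({0, 1}, ALL), "accept"),
    ((ALL,    ALL), "accept"),
]

def first_match(policy, pkt):
    for (srcs, dsts), decision in policy:
        if pkt[0] in srcs and pkt[1] in dsts:
            return decision

def inter_firewall_redundant(fw2, fw1):
    """Indices of FW2 rules all of whose reachable packets are discarded by FW1."""
    redundant = []
    for i in range(len(fw2)):
        # Packets that match rule i of FW2 but no rule above it.
        reachable = [p for p in product(SRC, DST)
                     if first_match(fw2[:i + 1], p) is not None
                     and first_match(fw2[:i], p) is None]
        if reachable and all(first_match(fw1, p) == "discard" for p in reachable):
            redundant.append(i)
    return redundant

# Rule 0 of FW2 is redundant: all its packets are discarded by FW1, but by two
# different FW1 rules, which rule-to-rule comparison would fail to detect.
print(inter_firewall_redundant(FW2, FW1))   # prints [0]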
Our basic idea is as follows. For each rule r in FW2, we first compute a set of compact predicates representing the set of packets that match r but do not match the rules above r in FW2. Then, for each predicate, we check whether all the packets that match the predicate are discarded by FW1. If this condition holds for all the predicates computed from rule r, then rule r is redundant. To compute these predicates efficiently, we convert firewalls to firewall decision diagrams [52]. To allow the two parties to detect the redundant rules in FW2 in a privacy-preserving manner, we develop a protocol by which the two firewalls can detect such redundant rules without disclosing their policies to each other.

4.1.5 Key Contributions

We make two key contributions. First, we propose a novel privacy-preserving protocol for detecting inter-firewall redundant rules in one firewall with respect to another firewall. This work represents the first effort along this unexplored direction. Second, we implemented our protocol and conducted extensive experiments on both real and synthetic firewall policies. The results on real firewall policies show that our protocol can remove as many as 49% of the rules in a firewall, with an average of 19.4%. The communication cost is less than a few hundred KB. Our protocol incurs no extra online packet processing overhead, and the offline processing time is less than a few hundred seconds.

4.2 System and Threat Models

4.2.1 System Model

A firewall FW is an ordered list of rules. Each rule has a predicate over d fields F1, ..., Fd and a decision for the packets that match the predicate. Firewalls usually check five fields: source IP, destination IP, source port, destination port, and protocol type, whose lengths are 32, 32, 16, 16, and 8 bits, respectively. A predicate defines a set of packets over the d fields and is specified as F1 ∈ T1 ∧ · · · ∧ Fd ∈ Td, where each Ti is a subset of Fi's domain D(Fi). A packet over the d fields F1, ..., Fd is a d-tuple (p1, ..., pd) where each pi (1 ≤ i ≤ d) is an element of D(Fi). A packet (p1, ..., pd) matches a rule F1 ∈ T1 ∧ · · · ∧ Fd ∈ Td → decision if and only if the condition p1 ∈ T1 ∧ · · · ∧ pd ∈ Td holds. Typical firewall decisions include accept, discard, accept with logging, and discard with logging. Without loss of generality, we only consider accept and discard in this work. We call a rule with the accept decision an accepting rule and a rule with the discard decision a discarding rule. In a firewall policy, a packet may match multiple rules whose decisions are different. To resolve such conflicts, firewalls typically employ first-match semantics, where the decision for a packet p is the decision of the first rule that p matches. The matching set of ri, M(ri), is the set of all possible packets that match the rule ri [50]. The resolving set of ri, R(ri), is the set of packets that match ri but do not match any rule rj above ri (j < i).

Pr(Z = j2 − 1) = α [1 − (n − j2 + 1)α]^(w_o − 1)

Given a query {HMAC_g(N(F(a))), HMAC_g(N(F(b)))}, if the storage node can find Y = j1 − 1 or Z = j2 − 1, where 0 ≤ j1 < n1 ≤ n2 < j2 ≤ n, the query result has false positives.
Therefore, the average false positive rate can be computed as follows:

ε = Σ_{n1=1}^{n+1} Σ_{n2=n1}^{n+1} { [ Σ_{j1=1}^{n1−1} Σ_{j2=n2+1}^{n+1} ((j2 − j1) − (n2 − n1)) / (n − (n2 − n1)) × Pr(Y = j1 − 1) Pr(Z = j2 − 1) ]
    + [ Σ_{j1=1}^{n1−1} (n1 − j1) / (n − (n2 − n1)) × Pr(Y = j1 − 1) Pr(Z < n2 or Z = null) ]
    + [ Σ_{j2=n2+1}^{n+1} (j2 − n2) / (n − (n2 − n1)) × Pr(Y > n1 − 2 or Y = null) Pr(Z = j2 − 1) ] }        (7.2)

Because [1 − (j1 − 1)α]^(w_o − 1) ≤ 1, [1 − (n − j2 + 1)α]^(w_o − 1) ≤ 1, (n1 − j1)/(n − (n2 − n1)) ≤ 1, (j2 − n2)/(n − (n2 − n1)) ≤ 1, and ((j2 − j1) − (n2 − n1))/(n − (n2 − n1)) ≤ 1, we derive Formula 2.1 from the following calculation:

ε < Σ_{n1=1}^{n+1} Σ_{n2=n1}^{n+1} [n − (n2 − n1)] α = (1/3) (n + 2)(n + 3) (1 − e^(−k(n+1)q/c))^k / (n + 1)^(k−1)

Typically, we choose the value c = (1/ln 2) k(n+1)q ≈ 1.44 k(n+1)q to minimize the false positive probability of the Bloom filters. Thus, Formula 2.1 becomes

ε < (1/3) (1/2)^k (n + 2)(n + 3) / (n + 1)^(k−1)

Next, we discuss under what condition our optimization technique reduces the communication cost between sensors and storage nodes. To represent the data in the n + 1 sets HMAC_g(N(S([d0, d1]))), ..., HMAC_g(N(S([dn, dn+1]))) without Bloom filters, the total number of bits required is w_h (n + 1) q; with Bloom filters, the total number of bits required is at most c + 2k(n + 1)q ⌈log2(n + 1)⌉. Note that the number of bits for representing array A is c, and the number of bits for representing array B is at most 2k(n + 1)q ⌈log2(n + 1)⌉. Therefore, we derive Formula 2.2:

w_h (n + 1) q > c + 2k(n + 1)q ⌈log2(n + 1)⌉

In the case that c = (1/ln 2) k(n + 1)q, Formula 2.2 becomes

k ≤ w_h / (1/ln 2 + 2⌈log2(n + 1)⌉) ≈ w_h / (1.44 + 2⌈log2(n + 1)⌉)

B  Properties of f*_k and Their Proofs

Order Preserving: Assume h_k(x_q) ≥ 2^(w′) for every x_q ∈ [x1, xN]. Then the condition f*_k(x_{i1}) < f*_k(x_{i2}) holds if and only if x_{i1} < x_{i2}.

Proof. We first prove that if the condition f*_k(x_{i1}) < f*_k(x_{i2}) holds, then x_{i1} < x_{i2}. We prove it by contradiction. If x_{i1} ≥ x_{i2}, we have

f*_k(x_{i1}) = f*_k(x_{i2}) + Σ_{q=i2+1}^{i1} h_k(x_q)/2^(w′) ≥ f*_k(x_{i2}),

which contradicts the condition f*_k(x_{i1}) < f*_k(x_{i2}). Second, we prove that if the condition x_{i1} < x_{i2} holds, then f*_k(x_{i1}) < f*_k(x_{i2}). Similarly to the proof of the collision resistance property, we have

f*_k(x_{i2}) = f*_k(x_{i1}) + Σ_{q=i1+1}^{i2} h_k(x_q)/2^(w′) > f*_k(x_{i1}).

Collision Resistance: Assume h_k(x_q) ≥ 2^(w′) for every x_q ∈ [x1, xN]. Then it is impossible to find x_{i1} and x_{i2} with x_{i1} ≠ x_{i2} such that f*_k(x_{i1}) = f*_k(x_{i2}).

Proof. Without loss of generality, assume i1 < i2. Hence, we have

f*_k(x_{i2}) = Σ_{q=1}^{i1} h_k(x_q)/2^(w′) + Σ_{q=i1+1}^{i2} h_k(x_q)/2^(w′) = f*_k(x_{i1}) + Σ_{q=i1+1}^{i2} h_k(x_q)/2^(w′).

Because h_k(x_q) ≥ 2^(w′) for every x_q, each term h_k(x_q)/2^(w′) ≥ 1. Therefore, we have f*_k(x_{i2}) > f*_k(x_{i1}).
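Numerically, both properties follow from the fact that every term h_k(x_q)/2^(w′) is at least 1, so the partial sums f*_k(x_1), f*_k(x_2), ... are strictly increasing. The short sketch below checks this on a toy domain; it is illustrative only: HMAC-SHA1 stands in for the keyed hash h_k, the key, bit widths, and domain values are made up, and the partial-sum form of f*_k is taken from the proofs above.

# Numeric check of the order-preserving and collision-resistance properties of
# f*_k (illustration only; hypothetical key, widths, and domain values).
import hmac, hashlib

KEY = b"toy-key"
W, W_PRIME = 32, 16   # made-up hash width w and scaling width w', with W > W_PRIME

def h_k(x):
    """Keyed hash of x, forced to satisfy the assumption h_k(x) >= 2**W_PRIME."""
    digest = hmac.new(KEY, str(x).encode(), hashlib.sha1).digest()
    value = int.from_bytes(digest[:W // 8], "big")
    return value | (1 << W_PRIME)

def f_star(domain, i):
    """f*_k(x_i) = sum_{q=1}^{i} h_k(x_q) / 2**W_PRIME over the sorted domain."""
    return sum(h_k(x) / 2 ** W_PRIME for x in domain[:i])

domain = [3, 8, 21, 55, 90]                               # x_1 < x_2 < ... < x_N
values = [f_star(domain, i) for i in range(1, len(domain) + 1)]
assert all(a < b for a, b in zip(values, values[1:]))     # strictly increasing
print(values)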
C  Calculation of Detection Probability

We first compute the number of choices for deleting a data item in all query results such that the deletion will be detected by customers. Recall Case 1 in Section 3.4.2. Given a range query [a, b], deleting a data item in bucket Bi will be detected by the customer if [a, b] is a superset of Bi, i.e., Bi ⊆ [a, b]. For ease of presentation, let [li, hi] denote a bucket Bi. A bucket Bi is called a single-value bucket if li = hi. For the bucket Bi = [li, hi], there are li − x1 + 1 distinct values that are less than or equal to li. Similarly, there are xN − hi + 1 distinct values that are larger than or equal to hi. Thus, the total number of queries that are supersets of [li, hi] can be computed as (li − x1 + 1)(xN − hi + 1). Let e(x) denote the frequency of the data item with value x. The number of data items with different values satisfying a query [a, b] can be computed as Σ_{x=a}^{b} e(x). Thus, for deleting a data item in bucket Bi such that the deletion will be detected by customers, the number of choices can be computed as

π(Bi) = (li − x1 + 1)(xN − hi + 1) Σ_{x=li}^{hi} e(x)        (7.3)

Therefore, for deleting a data item in all buckets B1, ..., Bm such that the deletion will be detected by customers, the number of choices can be computed as

π = Σ_{i=1}^{m} π(Bi) = Σ_{i=1}^{m} (li − x1 + 1)(xN − hi + 1) Σ_{x=li}^{hi} e(x)        (7.4)

In our context, e(x) is either equal to 0 or 1, because if multiple data items have the same value, the organization simply represents them as one data item annotated with the number of items that share this value. If there is no data item satisfying the query [a, b], then Σ_{x=a}^{b} e(x) = 0. Similarly, the number of choices for deleting a data item in all query results can be computed as

π* = Σ_{a=x1}^{xN} Σ_{b=a}^{xN} Σ_{x=a}^{b} e(x)        (7.5)

Let Ij(·) (1 ≤ j ≤ n) denote the indicator function

Ij(x) = 1 if x = dj, and Ij(x) = 0 otherwise.

We have e(x) = Σ_{j=1}^{n} Ij(x). Thus, π* can be transformed to

π* = Σ_{a=x1}^{xN} Σ_{b=a}^{xN} Σ_{x=a}^{b} Σ_{j=1}^{n} Ij(x) = Σ_{j=1}^{n} Σ_{a=x1}^{xN} Σ_{b=a}^{xN} Σ_{x=a}^{b} Ij(x) = Σ_{j=1}^{n} Σ_{a=x1}^{dj} Σ_{b=dj}^{xN} 1 = Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1)        (7.6)

Thus, the probability that a deletion operation by the cloud provider can be detected is

Pr = π / π* = [ Σ_{i=1}^{m} (li − x1 + 1)(xN − hi + 1) Σ_{x=li}^{hi} e(x) ] / [ Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1) ]        (7.7)
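To make Equation 7.7 concrete, the following sketch computes π and π* for a toy dataset and compares the closed-form probability against a direct enumeration of detectable deletions. It is illustrative only: the domain, data values, and bucket partition are made up.

# Brute-force illustration of Equations 7.3-7.7 on made-up data.
X1, XN = 1, 10                      # data domain [x1, xN] (hypothetical)
data = [2, 3, 5, 8]                 # distinct data values d_1 < ... < d_n
buckets = [(1, 4), (5, 7), (8, 10)] # a hypothetical bucket partition of [x1, xN]

e = {x: (1 if x in data else 0) for x in range(X1, XN + 1)}

def pi_bucket(l, h):
    """Equation 7.3: detectable deletion choices contributed by bucket [l, h]."""
    return (l - X1 + 1) * (XN - h + 1) * sum(e[x] for x in range(l, h + 1))

pi = sum(pi_bucket(l, h) for l, h in buckets)                  # Equation 7.4
pi_star = sum((d - X1 + 1) * (XN - d + 1) for d in data)       # Equation 7.6
print("closed-form detection probability:", pi / pi_star)      # Equation 7.7

# Sanity check by direct enumeration: a deletion from bucket [l, h] is detected
# by a query [a, b] only if [l, h] is contained in [a, b] (Case 1 in Section 3.4.2).
detected = total = 0
for a in range(X1, XN + 1):
    for b in range(a, XN + 1):
        for l, h in buckets:
            items = sum(e[x] for x in range(l, h + 1) if a <= x <= b)
            total += items                 # all (query, deleted item) choices
            if a <= l and h <= b:
                detected += items          # choices whose deletion is detectable
print("enumerated detection probability:", detected / total)

For this toy input both printed probabilities agree; replacing the partition with single-value buckets around each data value drives the ratio to 1, while a single bucket [x1, xN] drives it down to n/π*, consistent with Theorems 3 and 4 proved next.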
D  Proof of Theorems 3 and 4

Proof of Theorem 3

Proof. First, we prove that if each data item dj (1 ≤ j ≤ n) forms a single-value bucket, then Pr_max = 100%. Obviously, Pr_max = 100% is equivalent to π = π*. Thus, we only need to prove that π = π*. For each bucket [dj, dj] (1 ≤ j ≤ n), because Σ_{x=dj}^{dj} e(x) = 1, we have π([dj, dj]) = (dj − x1 + 1)(xN − dj + 1). For each empty bucket Bi, because Σ_{x=li}^{hi} e(x) = 0, we have π(Bi) = 0. Thus, according to Equation 7.4, we have

π = Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1) = π*.

Second, we prove that if Pr_max = 100%, then each data item dj (1 ≤ j ≤ n) forms a single-value bucket. We prove it by contradiction. If a data item dj does not form a single-value bucket, then Pr = π/π* < 100%, which is equivalent to π < π*. Assume that B(dj) = [l*, h*] is not a single-value bucket. We consider the following two cases. (1) If B(dj) includes only one data item dj, i.e., Σ_{x=l*}^{h*} e(x) = 1, then l* < h*, and hence l* < dj or dj < h* must hold. We have

π(B(dj)) = (l* − x1 + 1)(xN − h* + 1) < (dj − x1 + 1)(xN − dj + 1).

Therefore, π < Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1) = π*. (2) If B(dj) includes n2 − n1 + 1 data items d_{n1}, d_{n1+1}, ..., dj, ..., d_{n2} (1 < n2 − n1 + 1 < n), then the condition l* ≤ d_{n1} < d_{n2} ≤ h* must hold. We have

π(B(dj)) = (l* − x1 + 1)(xN − h* + 1) Σ_{x=l*}^{h*} e(x) < Σ_{j=n1}^{n2} (dj − x1 + 1)(xN − dj + 1).

Thus, π < Σ_{j=1}^{n} (dj − x1 + 1)(xN − dj + 1) = π*.

Proof of Theorem 4

Proof. Obviously, Pr = Pr_min is equivalent to π = n. Thus, we only need to prove that π = n if and only if there is only one bucket [x1, xN]. First, we prove that if there is only one bucket [x1, xN], then π = n:

π = π(B1) = (x1 − x1 + 1)(xN − xN + 1) Σ_{x=x1}^{xN} e(x) = n.

Second, we prove that if π = n, then there is only one bucket [x1, xN]. We prove it by contradiction. If there are multiple buckets B1, ..., Bm (m ≥ 2), then π > n. Let [li, hi] denote a bucket Bi (1 ≤ i ≤ m). All buckets must satisfy the condition x1 = l1 ≤ h1 < l2 ≤ h2 < · · · < lm ≤ hm = xN. Thus, for each bucket Bi (1 ≤ i ≤ m), either li > x1 or hi < xN holds, so (li − x1 + 1)(xN − hi + 1) ≥ 2 and

π(Bi) = (li − x1 + 1)(xN − hi + 1) Σ_{x=li}^{hi} e(x) ≥ 2 Σ_{x=li}^{hi} e(x).

Thus, we have

π = Σ_{i=1}^{m} π(Bi) ≥ 2 Σ_{i=1}^{m} Σ_{x=li}^{hi} e(x) = 2n > n.

BIBLIOGRAPHY

[1] Amazon Web Services. aws.amazon.com.
[2] Firewall throughput test. http://www.hipac.org/performance_tests/results.html.
[3] Google App Engine. code.google.com/appengine.
[4] Intel lab data. http://berkeley.intel-research.net/labdata.
[5] Microsoft Azure. www.microsoft.com/azure.
[6] RISE project. http://www.cs.ucr.edu/~rise.
[7] Stargate gateway (SPB400). http://www.xbow.com.
[8] TOSSIM. http://www.cs.berkeley.edu/~pal/research/tossim.html.
[9] Rakesh Agrawal, Alexandre Evfimievski, and Ramakrishnan Srikant. Information sharing across private databases. In Proc. ACM Inte. Conf. on Management of Data (SIGMOD), pages 86–97, 2003.
[10] Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. Order preserving encryption for numeric data. In Proc. ACM Inte. Conf. on Management of Data (SIGMOD), pages 563–574, 2004.
[11] Ehab Al-Shaer and Hazem Hamed. Discovery of policy anomalies in distributed firewalls. In Proc. IEEE INFOCOM, pages 2605–2616, March 2004.
[12] Ehab Al-Shaer, Will Marrero, Adel El-Atawy, and Khalid ElBadawi. Network configuration in a box: Towards end-to-end verification of network reachability and security. In Proc. IEEE Inte. Conf. on Network Protocols (ICNP), 2009.
[13] Sruthi Bandhakavi, Sandeep Bhatt, Cat Okita, and Prasad Rao. Analyzing end-to-end network reachability. In Proc. IFIP/IEEE Inte. Symposium on Integrated Network Management, 2009.
[14] Theophilus Benson, Aditya Akella, and David Maltz. Unraveling the complexity of network management. In Proc. USENIX Symposium on Networked Systems Design and Implementation, 2009.
[15] Burton Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[16] Alexandra Boldyreva, Nathan Chenette, Younho Lee, and Adam O'Neill. Order-preserving symmetric encryption. In Proc. Inte. Conf. on Advances in Cryptology: the Theory and Applications of Cryptographic Techniques (EUROCRYPT), pages 224–241, 2009.
[17] Dan Boneh and Brent Waters. Conjunctive, subset, and range queries on encrypted data. In Proc. Theory of Cryptography Conference (TCC), pages 535–554, 2007.
[18] Justin Brickell and Vitaly Shmatikov. Privacy-preserving graph algorithms in the semi-honest model. In Proc. Inte. Conf. on the Theory and Application of Cryptology and Information Security (ASIACRYPT), pages 236–252, 2010.
[19] Martin Casado, Tal Garfinkel, Aditya Akella, Michael J. Freedman, Dan Boneh, Nick McKeown, and Scott Shenker. SANE: A protection architecture for enterprise networks. In Proc. USENIX Security Symposium, 2006.
[20] Yeim-Kuan Chang. Fast binary and multiway prefix searches for packet forwarding. Computer Networks, 51(3):588–605, 2007.
[21] Fei Chen, Bezawada Bruhadeshwar, and Alex X. Liu. A cross-domain privacy-preserving protocol for cooperative firewall optimization. In Proc. IEEE Conf. on Computer Communications (INFOCOM), 2011.
[22] Hong Chen, Xiaonan Ma, Windsor Hsu, Ninghui Li, and Qihua Wang. Access control friendly query verification for outsourced data publishing. In Proc. 13th European Symposium on Research in Computer Security (ESORICS), pages 177–191, 2008.
[23] Jerry Cheng, Hao Yang, Starsky H.Y. Wong, and Songwu Lu. Design and implementation of cross-domain cooperative firewall. In Proc. IEEE Inte. Conf. on Network Protocols (ICNP), 2007.
[24] Weiwei Cheng, HweeHwa Pang, and Kian-Lee Tan. Authenticating multi-dimensional query results in data publishing. In Data and Applications Security 2006, pages 60–73, 2006.
[25] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press.
[26] Peter Desnoyers, Deepak Ganesan, Huan Li, and Prashant Shenoy. Presto: A predictive storage architecture for sensor networks. In Proc. 10th Workshop on Hot Topics in Operating Systems (HotOS), 2005.
[27] Premkumar Devanbu, Michael Gertz, Charles Martel, and Stuart G. Stubblebine. Authentic data publication over the internet. Journal of Computer Security, 11(3):291–314, 2003.
[28] Qunfeng Dong, Suman Banerjee, Jia Wang, Dheeraj Agrawal, and Ashutosh Shukla. Packet classifiers in ternary CAMs can be smaller. In Proc. ACM SIGMETRICS, pages 311–322, 2006.
[29] D. Eastlake and P. Jones. US secure hash algorithm 1 (SHA1). RFC 3174, 2001.
[30] E. Allen Emerson. Temporal and modal logic. 1990.
[31] Edward A. Fox, Qi Fan Chen, Amjad M. Daoud, and Lenwood S. Heath. Order-preserving minimal perfect hash functions and information retrieval. ACM Transactions on Information Systems, 9:281–308, 1991.
[32] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[33] Michael Freedman, Kobi Nissim, and Benny Pinkas. Efficient private matching and set intersection. In Proc. Inte. Conf. on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), pages 1–19, 2004.
[34] Oded Goldreich. Secure multi-party computations. Working draft, version 1.4, 2002.
[35] Oded Goldreich. Foundations of Cryptography: Volume II (Basic Applications). Cambridge University Press, 2004.
[36] Mohamed G. Gouda and Alex X. Liu. Firewall design: consistency, completeness and compactness. In Proc. 24th IEEE Inte. Conf. on Distributed Computing Systems (ICDCS), pages 320–327, March 2004.
[37] Mohamed G. Gouda and Alex X. Liu. Structured firewall design. Computer Networks Journal (Elsevier), 51(4):1106–1120, March 2007.
[38] Pankaj Gupta. Algorithms for Routing Lookups and Packet Classification. PhD thesis, Stanford University, 2000.
[39] Pankaj Gupta and Nick McKeown. Algorithms for packet classification. IEEE Network, 15(2):24–32, 2001.
[40] Hakan Hacigümüş, Bala Iyer, Chen Li, and Sharad Mehrotra. Executing SQL over encrypted data in the database-service-provider model. In Proc. ACM Inte. Conf. on Management of Data (SIGMOD), pages 216–227, 2002.
[41] Bijit Hore, Sharad Mehrotra, and Gene Tsudik. A privacy-preserving index for range queries. In Proc. 30th Inte. Conf. on Very Large Data Bases (VLDB), pages 720–731, 2004.
[42] Kyle Ingols, Richard Lippmann, and Keith Piwowarski. Practical attack graph generation for network defense. In Proc. Annual Computer Security Applications Conf. (ACSAC), 2006.
[43] Zeus Kerravala. As the value of enterprise networks escalates, so does the need for configuration management. Enterprise Computing & Networking, The Yankee Group Report, January 2004.
[44] Amir R. Khakpour and Alex X. Liu. Quantifying and querying network reachability. In Proc. Inte. Conf. on Distributed Computing Systems (ICDCS), 2010.
[45] Lea Kissner and Dawn Song. Privacy-preserving set operations. In Advances in Cryptology (CRYPTO), pages 241–257, 2005.
[46] Hugo Krawczyk, Mihir Bellare, and Ran Canetti. HMAC: Keyed-hashing for message authentication. RFC 2104, 1997.
[47] Franck Le, Sihyung Lee, Tina Wong, Hyong S. Kim, and Darrell Newcomb. Detecting network-wide and router-specific misconfigurations through data mining. 2009.
[48] Alex X. Liu and Fei Chen. Collaborative enforcement of firewall policies in virtual private networks. In Proc. Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC), Toronto, Canada, August 2008.
[49] Alex X. Liu and Mohamed G. Gouda. Diverse firewall design. IEEE Transactions on Parallel and Distributed Systems (TPDS), 19(8), 2008.
[50] Alex X. Liu and Mohamed G. Gouda. Complete redundancy removal for packet classifiers in TCAMs. IEEE Transactions on Parallel and Distributed Systems (TPDS), in press.
[51] Alex X. Liu, Chad R. Meiners, and Eric Torng. TCAM Razor: A systematic approach towards minimizing packet classifiers in TCAMs. IEEE/ACM Transactions on Networking, to appear.
[52] Alex X. Liu, Chad R. Meiners, and Yun Zhou. All-match based complete redundancy removal for packet classifiers in TCAMs. In Proc. 27th Annual IEEE Conf. on Computer Communications (INFOCOM), April 2008.
[53] Alex X. Liu, Eric Torng, and Chad Meiners. Firewall compressor: An algorithm for minimizing firewall policies. In Proc. 27th Annual IEEE Conf. on Computer Communications (INFOCOM), Phoenix, Arizona, April 2008.
[54] Petr Matousek, Jaroslav Rab, Ondrej Rysavy, and Miroslav Sveda. A formal model for network-wide security analysis. In Proc. IEEE Inte. Conf. and Workshop on the Engineering of Computer Based Systems, 2008.
[55] Chad R. Meiners, Alex X. Liu, and Eric Torng. TCAM Razor: A systematic approach towards minimizing packet classifiers in TCAMs. In Proc. 15th IEEE Conf. on Network Protocols (ICNP), pages 266–275, October 2007.
[56] Chad R. Meiners, Alex X. Liu, and Eric Torng. Bit weaving: A non-prefix approach to compressing packet classifiers in TCAMs. In Proc. IEEE Conf. on Network Protocols (ICNP), pages 93–102, October 2009.
[57] Chad R. Meiners, Alex X. Liu, and Eric Torng. Topological transformation approaches to optimizing TCAM-based packet processing systems. In Proc. ACM Inte. Conf. on Measurement and Modeling of Computer Systems (SIGMETRICS), pages 73–84, August 2009.
[58] Ralph Merkle. Protocols for public key cryptosystems. In Proc. IEEE Symposium on Security and Privacy, pages 122–134, 1980.
[59] Maithili Narasimha and Gene Tsudik. Authentication of outsourced databases using signature aggregation and chaining. In Proc. Inte. Conf. on Database Systems for Advanced Applications (DASFAA), 2006.
[60] Oded Goldreich, Silvio Micali, and Avi Wigderson. How to play any mental game. In Proc. 19th Annual ACM Symposium on Theory of Computing, May 1987.
[61] David Oppenheimer, Archana Ganapathi, and David A. Patterson. Why do internet services fail, and what can be done about it? In Proc. 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
[62] HweeHwa Pang, Arpit Jain, Krithi Ramamritham, and Kian-Lee Tan. Verifying completeness of relational query results in data publishing. In Proc. ACM Inte. Conf. on Management of Data (SIGMOD), pages 407–418, 2005.
[63] HweeHwa Pang and Kian-Lee Tan. Authenticating query results in edge computing. In Proc. 20th Inte. Conf. on Data Engineering, pages 560–571, 2004.
[64] Stephen C. Pohlig and Martin E. Hellman. An improved algorithm for computing logarithms over GF(p) and its cryptographic significance. IEEE Transactions on Information Theory, IT-24:106–110, 1978.
[65] Sylvia Ratnasamy, Brad Karp, Scott Shenker, Deborah Estrin, Ramesh Govindan, Li Yin, and Fang Yu. Data-centric storage in sensornets with GHT, a geographic hash table. Mobile Networks and Applications, 8(4):427–442, 2003.
[66] R. Rivest. The MD5 message-digest algorithm. RFC 1321, 1992.
[67] David R. Safford, David K. Hess, and Douglas Lee Schales. Secure RPC authentication (SRA) for TELNET and FTP. Technical report, 1993.
[68] Yingpeng Sang and Hong Shen. Efficient and secure protocols for privacy-preserving set operations. ACM Transactions on Information and System Security, 13:9:1–9:35, 2009.
[69] Bo Sheng and Qun Li. Verifiable privacy-preserving range query in two-tiered sensor networks. In Proc. IEEE Inte. Conf. on Computer Communications (INFOCOM), pages 46–50, 2008.
[70] Bo Sheng, Qun Li, and Weizhen Mao. Data storage placement in sensor networks. In Proc. 7th ACM Inte. Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), pages 344–355, 2006.
[71] Bo Sheng, Chiu C. Tan, Qun Li, and Weizhen Mao. An approximation algorithm for data storage placement in sensor networks. In Proc. Inte. Conf. on Wireless Algorithms, Systems and Applications (WASA), pages 71–78, 2007.
[72] Elaine Shi, John Bethencourt, T-H. Hubert Chan, Dawn Song, and Adrian Perrig. Multi-dimensional range query over encrypted data. In Proc. IEEE Symposium on Security and Privacy (S&P), pages 350–364, 2007.
[73] Jing Shi, Rui Zhang, and Yanchao Zhang. Secure range queries in tiered sensor networks. In Proc. IEEE Inte. Conf. on Computer Communications (INFOCOM), 2009.
[74] Sumeet Singh, Florin Baboescu, George Varghese, and Jia Wang. Packet classification using multidimensional cutting. In Proc. ACM SIGCOMM, pages 213–224, 2003.
[75] Yu-Wei Eric Sung, Carsten Lund, Mark Lyn, Sanjay Rao, and Subhabrata Sen. Modeling and understanding end-to-end class of service policies in operational networks. In Proc. ACM SIGCOMM, pages 219–230, 2009.
[76] Avishai Wool. A quantitative study of firewall configuration errors. IEEE Computer, 37(6):62–67, 2004.
[77] Geoffrey G. Xie, Jibin Zhan, David A. Maltz, Hui Zhang, Albert Greenberg, Gísli Hjálmtýsson, and Jennifer Rexford. On static reachability analysis of IP networks. In Proc. Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), 2005.
[78] Zhiqiang Yang, Sheng Zhong, and Rebecca N. Wright. Privacy-preserving classification of customer data without loss of accuracy. In Proc. SIAM Inte. Conf. on Data Mining, 2005.
[79] Andrew C. Yao. Protocols for secure computations. In Proc. 23rd IEEE Symposium on the Foundations of Computer Science (FOCS), pages 160–164, 1982.
[80] Andrew C. Yao. How to generate and exchange secrets. In Proc. 27th IEEE Symposium on Foundations of Computer Science (FOCS), 1986.
[81] Lihua Yuan, Hao Chen, Jianning Mai, Chen-Nee Chuah, Zhendong Su, and Prasant Mohapatra. FIREMAN: A toolkit for firewall modeling and analysis. In Proc. IEEE Symposium on Security and Privacy, May 2006.
[82] Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and Walid A. Najjar. MicroHash: An efficient index structure for flash-based sensor devices. In Proc. 4th USENIX Conf. on File and Storage Technologies (FAST), pages 31–44, 2005.
[83] Bo Zhang, T. S. Eugene Ng, and Guohui Wang. Reachability monitoring and verification in enterprise networks. In Proc. ACM SIGCOMM, 2008.
[84] Rui Zhang, Jing Shi, and Yanchao Zhang. Secure multidimensional range queries in sensor networks. In Proc. ACM Inte. Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), 2009.