FEDERATED LEARNING BENCHMARKS AND FRAMEWORKS FOR ARTIFICIAL INTELLIGENCE OF THINGS

By Samiul Alam

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Master of Science

2023

ABSTRACT

The growing integration of the Internet of Things (IoT) and Artificial Intelligence (AI), commonly referred to as the Artificial Intelligence of Things (AIoT), has amplified the importance of Federated Learning (FL). However, the application of FL in AIoT is challenged by the lack of authentic IoT datasets and the constraints associated with model-homogeneous FL approaches. Addressing these gaps, this thesis introduces two significant contributions: FedAIoT and FedRolex. FedAIoT is a comprehensive FL benchmark designed for AIoT, encompassing eight diverse datasets collected from a wide range of IoT devices. It offers a unified end-to-end FL framework, making it an invaluable tool for standardizing AIoT-based FL applications. The framework is available at https://github.com/AIoT-MLSys-Lab/FedAIoT. FedRolex, in turn, is a novel Partial Training (PT)-based model-heterogeneous FL approach. With an emphasis on the device heterogeneity typical of AIoT applications, FedRolex enables the training of a global server model that is larger than any client model by using a rolling sub-model extraction scheme. This approach mitigates client drift, improves the performance attainable with low-end devices, and narrows the gap between model-heterogeneous and model-homogeneous FL. Benchmark results indicate that FedRolex outperforms existing PT-based model-heterogeneous FL methods, making it a valuable resource for researchers and practitioners in the field of FL for AIoT. Our code is available at https://github.com/AIoT-MLSys-Lab/FedRolex.

ACKNOWLEDGEMENTS

I am deeply indebted to my advisor, Dr. Mi Zhang, whose guidance, patience, and expert counsel were pivotal in the successful completion of this research. The opportunity to conduct research under his supervision within his laboratory was an academically enriching experience and a privilege that I have found both profoundly enjoyable and intellectually stimulating. My earnest appreciation extends to my distinguished thesis committee members - Dr. Zhichao Cao, Dr. Luyang Liu, and Dr. Guan-Hua Tu. Their perceptive feedback, incisive criticisms, and judicious advice significantly shaped this work, imbuing it with the depth and breadth that it possesses. I would also like to formally acknowledge my colleagues, Tuo Zhang and Tiantian Feng, for their collaborative contributions. Our rigorous discussions, intense brainstorming sessions, and invaluable exchange of ideas provided essential refinement to my research methodology and thesis presentation. Lastly, I must express my profound gratitude to my family - my parents, my spouse, and my cherished daughter. Their unwavering support, unyielding faith in my abilities, and unbounded love have consistently served as my beacon during this rigorous academic journey. This work is a testament to their support and I dedicate it to them, with profound respect and heartfelt love.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Background
  1.2 Contributions
  1.3 Thesis Organization
CHAPTER 2 FEDAIOT: A FEDERATED LEARNING BENCHMARK FOR ARTIFICIAL INTELLIGENCE OF THINGS
  2.1 Related Work
  2.2 Design of FedAIoT
  2.3 Experimental Setup
  2.4 Experiments and Analysis
CHAPTER 3 FEDROLEX: MODEL-HETEROGENEOUS FEDERATED LEARNING WITH ROLLING SUB-MODEL EXTRACTION
  3.1 Related Work
  3.2 Methodology
  3.3 Experiments
CHAPTER 4 LIMITATIONS AND FUTURE WORK
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

1.1 Background

The advent and proliferation of the Internet of Things (IoT) has dramatically changed the way we interact with the world. A vast array of IoT devices, such as smartphones, smartwatches, drones, and sensors deployed in homes, collect massive amounts of data daily [1]. These devices, combined with advances in Artificial Intelligence (AI), have driven the integration of AI and IoT, giving rise to the Artificial Intelligence of Things (AIoT). However, IoT-collected data often contain privacy-sensitive information, making federated learning (FL) an increasingly crucial approach for handling AIoT data [2, 3]. Traditionally, FL studies have concentrated on the model-homogeneous setting, where the server model and the client models across all participating client devices are identical [4, 5, 6, 7]. However, given the diversity of client devices in terms of on-device resources and the trend toward ever larger machine learning models [8], this approach has run into constraints for AIoT applications. Additionally, the majority of existing FL works are conducted on well-known datasets such as CIFAR-10 and CIFAR-100. These datasets, however, do not originate from authentic IoT devices and thus fail to capture the unique modalities and inherent challenges associated with real-world IoT data. To assess the effectiveness of FL algorithms on IoT devices, a dedicated benchmark is crucial.

1.2 Contributions

The contributions of this thesis are twofold. First, we address a critical gap in the FL field: the datasets typically used do not originate from authentic IoT devices and therefore fail to capture the unique modalities and inherent challenges associated with real-world IoT data. To fill this gap, we introduce FedAIoT, an FL benchmark for AIoT. FedAIoT comprises eight datasets collected from a diverse range of authentic IoT devices and encapsulates a variety of unique modalities targeting representative AIoT applications. Second, to relax the constraints of device heterogeneity and handle the emerging challenges of large models, we introduce FedRolex, a novel partial training (PT)-based model-heterogeneous FL approach.
By using a rolling sub-model extraction scheme, FedRolex ensures that all parameters of the global server model are evenly trained over the local data of client devices [9, 10, 11]. This approach offers several merits, including mitigation of client drift, reduced communication overhead, and compatibility with secure aggregation protocols that enhance the privacy properties of FL systems [12].

1.3 Thesis Organization

Our objective is to address both the challenges of device and model heterogeneity in federated learning and the lack of authentic IoT datasets in current FL studies, fostering advancement in the rapidly evolving field of FL for AIoT. The thesis treats these two aspects in two separate chapters. Chapter 2 discusses the contributions and results of FedAIoT and how it provides a unified end-to-end FL framework for AIoT, spanning non-IID data partitioning, data preprocessing, AIoT-friendly models, and FL hyperparameters. Chapter 3 revolves around the FedRolex algorithm, its performance, and its relevance. Finally, we conclude the thesis after discussing limitations and future work.

CHAPTER 2
FEDAIOT: A FEDERATED LEARNING BENCHMARK FOR ARTIFICIAL INTELLIGENCE OF THINGS

2.1 Related Work

The importance of data to FL research has pushed the development of FL benchmarks on a variety of data modalities. Existing FL benchmarks, however, predominantly center around curating federated datasets in the domains of computer vision (CV) [13, 14, 15, 16, 17], natural language processing (NLP) [18, 14, 16, 17], medical imaging [19], speech and audio [20, 17], and graph neural networks [21]. For example, LEAF [14] is one of the earliest FL benchmarks and comprises six datasets dedicated to CV and NLP; FedCV [13], FedNLP [18], and FedAudio [20] focus on CV, NLP, and audio-related datasets and tasks respectively; FedScale [16] provides an assortment of 20 federated datasets mainly in CV and NLP applications, placing a distinct emphasis on system-related aspects; FLUTE [17] covers a mix of datasets from CV, NLP, and audio; and FLamby [19] presents seven healthcare-related datasets including five medical imaging datasets. Although these benchmarks have significantly contributed to FL research, a dedicated FL benchmark explicitly tailored to IoT data is absent. FedAIoT is specifically designed to fill this critical gap by providing a dedicated benchmark that focuses on data collected by a wide range of authentic IoT devices.

2.2 Design of FedAIoT

2.2.1 Datasets

Table 2.1 provides an overview of the eight datasets included in FedAIoT. In this section, we provide a brief overview of each included dataset.

WISDM: The widely used Wireless Sensor Data Mining (WISDM) dataset [22, 31] offers accelerometer and gyroscope sensor data collected from smartphones and smartwatches for daily activity recognition. Data was collected from 51 participants who performed 18 daily activities in 3-minute sessions. Activities like eating soup, chips, pasta, and sandwiches were unified into a single category, "eating", while activities such as kicking, catching, or dribbling balls were
eliminated due to their rarity. For our training and test set partition, we selected 45 participants for the training set and the remaining 6 for the test set. To accommodate real-life scenarios where individuals may not always use a smartphone and wear a smartwatch simultaneously, we divided WISDM into WISDM-W (smartwatch data) and WISDM-P (smartphone data). The sample counts for the training and test sets are 16,569 and 4,103 for WISDM-W and 13,714 and 4,073 for WISDM-P respectively. Licensing details are not explicitly mentioned on the dataset homepage.

Table 2.1 Overview of the datasets included in FedAIoT.

Dataset               IoT Platform       Data Modality                  Data Dimension  Dataset Size  # Training Samples  # Clients
WISDM-W [22]          Smartwatch         Accelerometer, Gyroscope       200 × 6         294 MB        16,569              80
WISDM-P [22]          Smartphone         Accelerometer, Gyroscope       200 × 6         253 MB        13,714              80
UT-HAR [23]           Wi-Fi Router       Wireless Signal                3 × 30 × 250    854 MB        3,977               20
Widar [24, 25]        Wi-Fi Router       Wireless Signal                22 × 20 × 20    3.3 GB        11,372              40
VisDrone [26]         Drone              Images                         3 × 224 × 224   1.8 GB        6,471               30
CASAS [27]            Smart Home         Motion, Door, Thermostat       2000 × 1        233 MB        12,190              60
AEP [28]              Smart Home         Energy, Humidity, Temperature  18 × 1          12 MB         15,788              80
EPIC-SOUNDS [29, 30]  Augmented Reality  Acoustics                      400 × 128       34 GB         60,055              210

UT-HAR: The UT-HAR dataset [23] provides Channel State Information (CSI) for contactless activity recognition tasks. The CSI is collected via three pairs of antennas and an Intel 5300 Network Interface Card (NIC), with each antenna pair capturing 30 subcarriers of CSI. The dataset incorporates activities like walking and running performed by various participants. UT-HAR comes with a pre-set training and test set, totaling 3,977 and 500 samples respectively.

Widar: The Widar dataset [24, 25] is a Wi-Fi dataset designed for contactless gesture recognition, and records Wi-Fi signal strength measurements collected via Wi-Fi access points. Data is collected from 17 participants performing 22 distinct gestures. However, to maintain consistency, only those gestures performed by more than three users are included in the experimental dataset. As a result, our balanced dataset encompasses nine gestures with 11,372 training and 5,222 test samples. Widar is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

VisDrone: The VisDrone dataset [26] is an extensive collection dedicated to object detection in aerial images captured by drone cameras. It consists of 263 video clips, containing 179,264 frames and 2,512,357 labeled objects. The objects fall into 12 categories and are recorded in various scenarios like crowded urban areas, highways, and parks. The training and test sets comprise 6,471 and 1,610 samples respectively and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

CASAS: Derived from the CASAS smart home project, the CASAS dataset [27] uses sensor data for recognizing activities of daily living (ADL) to support independent living. Data from three distinct apartments equipped with motion, temperature, and door sensors are collected. We selected five specific datasets - 'Milan', 'Cairo', 'Kyoto2', 'Kyoto3', and 'Kyoto4' - for their uniform sensor data representation. The activities were consolidated into 11 home activity categories. The dataset, split into training and test sets with an 80-20 ratio, includes 12,190 training and 3,048 test samples. No explicit license information is provided for this dataset.
AEP: The Appliances Energy Prediction (AEP) dataset [28] collects data from energy, temperature, and humidity sensors installed in a home for the task of predicting home energy usage. The data, captured every 10 minutes over 4.5 months, includes 15,788 training and 3,947 test samples. Licensing information is not explicitly mentioned.

EPIC-SOUNDS: The EPIC-SOUNDS dataset [29] is a large-scale collection of audio recordings for audio-based human activity recognition in Augmented Reality applications. It offers over 100k categorized segments across 44 distinct classes, captured via a head-mounted microphone. The dataset includes pre-determined training and test sets with 60,055 and 40,175 samples respectively, and is licensed under CC BY 4.0.

2.2.1.1 WISDM

The WISDM dataset comprises raw accelerometer and gyroscope data collected from 51 subjects performing 18 activities for three minutes each. Data were gathered at a 20 Hz sampling rate from both a smartphone (Google Nexus 5/5x or Samsung Galaxy S5) and a smartwatch (LG G Watch). Data for each device and sensor type are stored in different directories, resulting in four directories overall. Each directory contains 51 files, each corresponding to a subject. Each data entry records a subject ID, an activity code, a timestamp, and the corresponding sensor readings.

Figure 2.1 Architecture of the end-to-end FL framework for AIoT incorporated in FedAIoT (non-IID data partitioning, data preprocessing, AIoT-friendly models, FL hyperparameters, and IoT factors).

Separate files for the gyroscope and accelerometer readings are provided and are later combined by matching timestamps. Subject IDs run from 1600 to 1650 and the activity code is an alphabetical character between 'A' and 'S', excluding 'N'. The timestamp is in Unix time. The code to read and partition the data into 10-second segments is provided by our benchmark. The input shape of the processed data is 200 × 6. The original dataset is available at https://rb.gy/xla1i.

2.2.2 Input-Output Formats and Sourcing

2.2.2.1 UT-HAR

The UT-HAR dataset was collected using the Linux 802.11n Channel State Information (CSI) Tool for the task of Human Activity Recognition (HAR). The original data consist of two file types: "input" and "annotation". "input" files contain Wi-Fi CSI data. The first column indicates the timestamp in Unix time. Columns 2-91 represent amplitude data for 30 subcarriers across three antennas, and columns 92-181 contain the corresponding phase information. "annotation" files provide the corresponding activity labels, serving as the ground truth for HAR. In our benchmark, only amplitude is used. The final samples are created by taking a sliding window of size 250, where each sample consists of amplitude information across three antennas and 30 subcarriers and has shape 3 × 30 × 250. The original dataset is available at https://github.com/ermongroup/Wifi_Activity_Recognition/tree/master.

2.2.2.2 Widar

The Widar dataset (Widar3.0) was collected with a system comprising one transmitter and three receivers, all equipped with Intel 5300 wireless NICs. The system uses the Linux CSI Tool to record the Wi-Fi data. Devices operate in monitor mode on channel 165 at 5.825 GHz.
The transmitter broadcasts 1,000 Wi-Fi packets per second while the receivers capture data using their three linearly arranged antennas. In our benchmark, we use the processed body velocity profile (BVP) features extracted from the dataset. The size of each data sample after processing is 22 × 20 × 20, consisting of 22 time steps, each with 20 BVP features in the x direction and 20 in the y direction. The raw dataset is available for download at http://tns.thss.tsinghua.edu.cn/widar3.0/index.html.

2.2.2.3 VisDrone

The VisDrone dataset was collected by the AISKYEYE team at Tianjin University, China. It comprises 288 video clips with 261,908 frames and 10,209 static images captured by cameras mounted on drones in 14 different cities in China under diverse environments, scenarios, weather, and lighting conditions. The frames were manually annotated with over 2.6 million bounding boxes of common targets like pedestrians, cars, and bicycles. Additional attributes like scene visibility, object class, and occlusion are also provided for enhanced data utilization. The dataset is available at https://github.com/VisDrone/VisDrone-Dataset.

2.2.2.4 CASAS

The CASAS dataset is a collection of data generated in smart home environments, where intelligent software uses sensors deployed at homes to monitor resident activities and conditions within the space. The CASAS project considers environments as intelligent agents and employs custom IoT hardware known as Smart Home in a Box (SHiB), which encompasses the necessary sensors, devices, and software. The sensors in SHiB perceive the status of residents and their surroundings, and through controllers, the system acts to enhance living conditions by optimizing comfort, safety, and productivity. The CASAS dataset includes the date (in yyyy-mm-dd format), time (in hh:mm:ss.ms format), sensor name, sensor readings, and an activity label in string format. The data were collected in real time as residents went about their daily activities. The code to extract categorical sensor readings and create input sequences and labels is provided in our benchmark. The CASAS dataset can be downloaded from https://casas.wsu.edu/datasets/.

2.2.2.5 AEP

The AEP dataset, collected over 4.5 months, comprises readings taken every 10 minutes from a ZigBee wireless sensor network monitoring house temperature and humidity. Each wireless node transmitted data around every 3.3 minutes, which were then averaged over 10-minute periods. Additionally, energy data was logged every 10 minutes via m-bus energy meters. The dataset includes attributes such as date and time (in year-month-day hour:minute:second format), the energy usage of appliances and lights (in Wh), temperature and humidity in various rooms including the kitchen (T1, RH1), living room (T2, RH2), laundry room (T3, RH3), office room (T4, RH4), bathroom (T5, RH5), ironing room (T7, RH7), teenager room (T8, RH8), and parents' room (T9, RH9), and temperature and humidity outside the building (T6, RH6) - all with temperatures in Celsius and humidity in percentages. Additionally, weather data from Chievres Airport, Belgium was incorporated, consisting of outside temperature (To, in Celsius), pressure (in mm Hg), humidity (RHout, in %), wind speed (in m/s), visibility (in km), and dew point (Tdewpoint, in °C). The dataset is available at https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction.
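As a concrete illustration of this format, the sketch below loads the public UCI CSV and assembles the 18 × 1 input listed in Table 2.1 from the room temperature and humidity attributes. The file and column names (energydata_complete.csv, T1, RH_1, Appliances) are assumptions based on the public UCI release, not anything specified in this thesis.

```python
import pandas as pd

# Assumed file and column names from the public UCI release of the
# Appliances Energy Prediction dataset; adjust to match a local copy.
df = pd.read_csv("energydata_complete.csv", parse_dates=["date"])

# Target: appliance energy usage (Wh). Inputs: the nine room
# temperatures and nine humidities described above, i.e. 18 features
# per 10-minute reading, matching the 18 x 1 dimension in Table 2.1.
target = df["Appliances"]
feature_cols = [f"T{i}" for i in range(1, 10)] + \
               [f"RH_{i}" for i in range(1, 10)]
features = df[feature_cols]
print(features.shape, target.shape)  # one row per 10-minute reading
```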
2.2.2.6 EPIC-SOUNDS

As an extension of the EPIC-KITCHENS-100 dataset, the EPIC-SOUNDS dataset focuses on annotating distinct audio events in the videos of EPIC-KITCHENS-100. The annotations include the time intervals during which each audio event occurs, along with a text description explaining the nature of the sound. Given the variation in video lengths in the dataset, which range from 30 seconds to 1.5 hours, the videos are segmented into clips of 3-4 minutes each to make the annotation process more manageable. To ensure that annotators concentrate solely on the audio aspects, only the audio stream is provided to them; this decision prevents bias that could be introduced by the visual and contextual elements in the videos. Additionally, annotators are given access to the plotted audio waveforms. These visual representations of the audio data guide the annotators in pinpointing specific sound patterns, making the annotation process more efficient and targeted. The EPIC-SOUNDS dataset can be extracted from the EPIC-KITCHENS-100 dataset with the GitHub repo at https://github.com/epic-kitchens/epic-sounds-annotations. The extracted audio data in HDF5 file format can also be requested.

2.2.3 End-to-End Federated Learning Framework for AIoT

To benchmark the performance of the datasets and facilitate future research in FL for AIoT, we have designed and developed an end-to-end FL framework for AIoT as another key part of FedAIoT. As illustrated in Figure 2.1, our framework covers the complete FL-for-AIoT pipeline, which includes five components: (1) non-IID data partitioning, (2) data preprocessing, (3) AIoT-friendly models, (4) FL hyperparameters, and (5) an IoT-factor emulator. In this section, we describe each component within the framework in detail.

2.2.3.1 Partitioning Non-IID Data

The primary goal of non-IID data partitioning is to split the training set in a manner that results in each client having data that follows a non-IID distribution. The eight datasets incorporated into FedAIoT cover three fundamental tasks: classification, regression, and object detection. Consequently, FedAIoT employs three non-IID data partitioning methods tailored to these tasks.

Scheme#1: Non-IID Partitioning Based on Output Labels. This scheme is applied to classification tasks (WISDM-W, WISDM-P, UT-HAR, Widar, CASAS, EPIC-SOUNDS), which involve C classes. We initially establish a distribution over these classes for each client, utilizing a Dirichlet Distribution with a parameter α [32]. Lower values of α create a skewed distribution favoring a few classes, while higher values lead to a more balanced class distribution. In addition, we generate a distribution over the total number of samples using the same Dirichlet Distribution, which is then used to assign a varying number of samples to each client. This methodology enables us to create non-IID data partitions that more accurately represent real-world scenarios where the class distribution and the number of samples can differ across clients.
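To make Scheme#1 concrete, here is a minimal sketch of one common way to realize a Dirichlet label partition, drawing one Dirichlet(α) share vector per class and splitting that class's samples across clients accordingly; the function name and exact sampling details are illustrative assumptions, not the benchmark's actual API.

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients so that each class is shared
    across clients according to a Dirichlet(alpha) draw (Scheme#1).
    Small alpha -> each class concentrates on a few clients."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(labels.max() + 1):
        idx_c = np.flatnonzero(labels == c)
        rng.shuffle(idx_c)
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the shares into split points over this class's samples.
        cuts = (np.cumsum(shares)[:-1] * len(idx_c)).astype(int)
        for client_id, chunk in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(chunk.tolist())
    return [np.asarray(ix) for ix in client_indices]

# Toy example: 6 classes, 10 clients, highly skewed (alpha = 0.1).
labels = np.random.randint(0, 6, size=1000)
parts = dirichlet_label_partition(labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])  # uneven per-client sample counts
```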
Scheme#2: Non-IID Partitioning Based on Input Features. This scheme is used for object detection tasks (VisDrone), where there are no specific classes. Here, we use the input features to create non-IID partitions. We employ ImageNet features generated from a VGG19 model [33], which capture the essential visual information needed for further analysis. Using these ImageNet features, we conduct clustering in the feature space using K-nearest neighbors to divide the dataset into ten distinct clusters, treating each cluster as a pseudo-class. A Dirichlet Allocation is subsequently used on these pseudo-classes to generate the non-IID distribution across different clients.

Scheme#3: Non-IID Partitioning Based on Output Distribution. For regression tasks (e.g., the AEP dataset) characterized by a continuous output, we use Quantile Binning to convert the continuous variable into a categorical one. This process divides the output variable's range into ten equal groups, or quantiles, ensuring roughly equal sample sizes in each bin. These bins are then treated as pseudo-classes. After converting the continuous output into ten categories, we apply Dirichlet Allocation to generate a non-IID distribution of data across the clients.

2.2.3.2 Data Preprocessing

FedAIoT encompasses eight datasets, each requiring different data preprocessing techniques due to their unique data modalities. Given the diversity in sensor data and data modalities, the preprocessing techniques are tailored to remove outliers and reduce noise.

WISDM: We utilize standard preprocessing techniques used in accelerometer- and gyroscope-based activity recognition. Specifically, we extract samples from the raw accelerometer and gyroscope data sequences using a 10-second sliding window with a 50% overlap.

UT-HAR: We follow the method in [34], applying a sliding window of 250 packets with a 50% overlap.

Widar: We use the body velocity profile (BVP) processing technique, as outlined in [34, 25], to effectively handle and remove environmental variations from the data. We then apply standard scalar normalization for further refinement. This process creates data samples with the shape 22 × 20 × 20, reflecting the time axis and the x and y velocity features respectively.

VisDrone: We first normalize the images to the range of 0 to 1 to standardize pixel values. Data augmentation techniques such as random shifts in Hue, Saturation, and Value color space, image compression, shearing transformations, scaling transformations, horizontal and vertical flipping, and MixUp are then applied to enhance the diversity and generalizability of the dataset.

Table 2.2 Non-IID data partitioning schemes and models used for each dataset.

Dataset      Partition            Model
WISDM-W      Output Labels        LSTM [36]
WISDM-P      Output Labels        LSTM [36]
UT-HAR       Output Labels        ResNet18 [40, 34, 41]
Widar        Output Labels        ResNet18 [40, 34, 41]
VisDrone     Input Features       YOLOv8n [37]
CASAS        Output Labels        BiLSTM [35]
AEP          Output Distribution  MLP [38]
EPIC-SOUNDS  Output Labels        ResNet18 [29]

CASAS: We follow the approach in [35], transforming sensor readings into categorical sequences for semantic encoding. Unique temperature settings, motion sensor activations, and door sensor activations are each assigned distinct categorical values. We extract a sequence of the previous 2000 sensor activations for each recorded activity for modeling and prediction.

AEP: Temperature data are log-transformed to reduce skewness, and 'visibility' is binarized. Outliers below the 10th or above the 90th percentile are replaced with the corresponding percentile values. Central tendency and date features are added to capture time-related patterns. Principal component analysis is used for data reduction, and the output is normalized using a standard scaler.
EPIC-SOUNDS: We first apply the Short-Time Fourier Transform to the raw audio segments, using a Hann window of 10 ms duration with a 5 ms step size to ensure good spectral resolution. We extract 128 Mel Spectrogram features, a popular choice for audio classification tasks due to their ability to mimic the human auditory system. We apply natural logarithm scaling to the Mel Spectrogram output to further refine the data, and each segment is padded to a consistent length of 400.

2.2.3.3 AIoT-friendly Models

Our selection of models is informed by both state-of-the-art results, as referenced in [36, 34, 37, 35, 38, 39, 29], and the resource constraints of IoT platforms. It is unrealistic to expect that IoT platforms could accommodate large Transformer-based models for FL, so we prioritize AIoT-friendly models in FedAIoT. Table 2.2 lists the chosen models, and a detailed breakdown of each model's architecture can be found in Section 2.3.3.

2.2.3.4 Configuring FL Parameters

Degree of Data Heterogeneity. Non-IID data can significantly disrupt FL training due to issues such as gradient skew, which may impair the performance of the resulting model. In Section 2.2.3.1, we explained how FedAIoT can construct various data partitions, enabling researchers to simulate different levels of data heterogeneity as required by their experiments.

FL Optimizer Selection. FedAIoT is compatible with several frequently used FL optimizers. In our experimental section, we demonstrate the benchmark outcomes of two FL optimizers: FedAvg [4] and FedOPT [42].

Client Sampling Ratio. The client sampling ratio refers to the fraction of clients chosen for local training in each FL round. This critical hyperparameter can impact both the computational and communication costs of FL training. With FedAIoT, one can create various client sampling ratios and assess their influence on model performance and the speed of convergence during FL training.

2.2.3.5 Emulation of IoT Conditions

Simulation of Real-world Label Errors. In actual FL deployments on IoT devices, label noise is a common problem due to sources such as annotator bias, varying skill levels, and errors during labeling. To realistically simulate label errors in FL, we modify the original labels of a dataset using a confusion matrix, denoted as Q, where Q_{ij} gives the probability of changing the correct label i to an incorrect label j, that is, P(ŷ = j | y = i). Contrary to previous benchmark studies [20], which randomly built the confusion matrix Q, our strategy is to construct the confusion matrix based on the outcomes of centralized training. More specifically, we ascertain each element Q_{ij} by determining the proportion of samples labeled as j by the centrally trained machine learning model relative to the total number of samples with the correct label i. This method of constructing the confusion matrix ensures that it accurately reflects the labeling patterns seen during centralized training. We then use the confusion matrix Q as a guide to produce erroneous labels: to introduce a given erroneous label ratio ε, we randomly select the necessary number of data samples and alter their labels based on the probabilities given in Q. By integrating such realistic label errors into our FL simulations, we aim to offer a more robust evaluation of FL algorithms under realistic and challenging conditions.
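As a concrete illustration of this emulation step, the sketch below alters a requested fraction of labels by sampling replacements from the rows of Q; the helper name and interface are illustrative assumptions rather than the benchmark's exact code.

```python
import numpy as np

def inject_label_errors(labels, Q, error_ratio, seed=0):
    """Corrupt a fraction of labels using confusion matrix Q, where
    Q[i, j] = P(noisy label = j | true label = i) is estimated from
    centralized training as described above."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    # Randomly pick the samples whose labels will be altered.
    chosen = rng.choice(len(labels), size=int(error_ratio * len(labels)),
                        replace=False)
    for k in chosen:
        # Draw the new label from row Q[i]; the draw may keep the
        # original label, since Q's diagonal is typically nonzero.
        noisy[k] = rng.choice(len(Q), p=Q[labels[k]])
    return noisy

# Toy example: 3 classes with a mostly diagonal confusion matrix.
Q = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
y = np.random.randint(0, 3, size=500)
print((inject_label_errors(y, Q, 0.1) != y).mean())
```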
Training with Quantization. IoT devices frequently have significant resource limitations, making model quantization a necessity. Quantization reduces the numerical precision of computations and data in AI models, improving memory usage and computational efficiency. In FedAIoT, we implement two precision levels: full precision (float32) and half precision (float16). Although most research has primarily focused on applying quantization during the inference stage, it is equally important to understand the impact of training models under quantized conditions in the context of FL. Hence, our models were trained using both precision types. The goal is to explore the trade-off between computational efficiency and model accuracy, which is essential for overcoming the resource limitations of IoT devices and enabling FL for AIoT.

2.3 Experimental Setup

2.3.1 Experimental Hyperparameters

Hyperparameters for Table 2.3. For WISDM-W, the learning rate for centralized training was 0.01 and we trained for 200 epochs with batch size 64. For FedAvg, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.01 and trained for 400 communication rounds with batch size 32. For FedOPT, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.01 and a server learning rate of 0.01, again training for 400 communication rounds.

For WISDM-P, the learning rate for centralized training was 0.01 and we trained for 200 epochs with batch size 128. For FedAvg, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.008 and trained for 400 communication rounds with batch size 32. For FedOPT, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.01 and a server learning rate of 0.01, again training for 400 communication rounds.

For UT-HAR and Widar, the learning rate for centralized training was 0.001 and the number of epochs was 500 and 200 respectively, with a batch size of 32. For both low and high data heterogeneity in both FedAvg and FedOPT, the client learning rate was 0.01, and the server learning rate for FedAvg and FedOPT was 1 and 0.01 respectively. The number of communication rounds was 1200 and 900 for UT-HAR and Widar respectively, with a batch size of 32.

For VisDrone, we used a cosine learning rate scheduler with T_0 = 10 and T_mult = 2 and trained for 200 epochs with a learning rate of 0.1 and batch size 12. For all the experiments on VisDrone, the client learning rate was also 0.1 and the batch size was 12. For FedOPT, the server learning rate was 0.1.

For CASAS, the centralized learning rate was 0.1 with batch size 128. For the federated setting, the client learning rate was 0.005, the batch size was 32, and we trained for 400 rounds. For FedOPT, the server learning rate was 0.01.

For AEP, the learning rate for centralized training was 0.001 with batch size 32, trained for 1200 epochs. For federated experiments, the client learning rate was 0.01 and the batch size was 32. For FedOPT, the server learning rate was 0.1.

For EPIC-SOUNDS, the centralized learning rate was 0.1 with batch size 512, and the number of epochs was 120. For federated settings, we used a client learning rate of 0.1 and batch size 32. For FedOPT, the server learning rate was 0.01.

Hyperparameters for Table 2.4. The setup for all the datasets with a 10% client sampling ratio is the same as that of Table 2.3 under high data heterogeneity.
For the 30% client sampling ratio, the hyperparameters were kept the same as in the 10% client sampling ratio experiments, with the exception of CASAS, where the learning rate was set to 0.15.

Hyperparameters for Table 2.5. The hyperparameters were the same as those of Table 2.3 with a 10% sampling ratio under the high data heterogeneity scenario.

Hyperparameters for Table 2.6. The hyperparameters were the same as those of Table 2.3 with a 10% client sampling ratio under the high data heterogeneity scenario.

2.3.2 Base Data Partition Schemes

We implemented three base data partitioning schemes for simulating data partitioning in a federated setting.

Uniform Partition. Uniform partitioning samples from the main training dataset and assigns data to clients uniformly. This partitioning can be used as a baseline best-case scenario or to debug the functionality of federated algorithms in the benchmark.

Dirichlet Partition. This partition, as explained in Section 2.2, is designed to split a dataset into subsets for simulating a federated learning environment with multiple clients. It is the basis for all the partitioning techniques used in the analyses. It uses the Dirichlet distribution to allocate samples of different classes across clients, attempting to maintain a Dirichlet distribution while ensuring that each client receives at least a minimum number of samples.

Disjoint Label Partition. The Disjoint Label Partition scheme splits a dataset into multiple subsets such that each subset is allocated to a different client for the purpose of simulating federated learning. In this setup, each client is assigned a limited number of unique classes, C_l, from the dataset. Characterized by the maximum number of unique classes that can be assigned to each user, it systematically organizes the dataset entries according to their labels and distributes them among the clients, ensuring that each client gets a disjoint set of labels. Each client receives indices from C_l unique classes, with shards of data being divided among the clients. If there is any leftover data, it is evenly distributed among the shards.

Manual Partition. We also provide ways to induce custom partitions using a data-client mapping. This can be used to test unique partitioning schemes or to use natural partitioning if available in the datasets.

2.3.3 Model Architectures

2.3.3.1 WISDM

For WISDM, we use a custom LSTM model that consists of an LSTM layer followed by a feed-forward neural network. The LSTM layer has an input dimension of 6 and a hidden dimension of 6. After the LSTM layer, the output is flattened and passed through a dropout layer with a rate of 0.2 for regularization. It then goes through a fully connected linear layer with an input size of 1,200 (6 hidden units × 200 timesteps) and an output size of 128, followed by a ReLU activation function. Another dropout layer with a rate of 0.2 is applied before the final fully connected linear layer with an input size of 128 and an output size of 12.
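The description above maps almost line for line onto PyTorch. A minimal sketch is given below; details not fixed by the text (e.g., the batch_first layout and where the activations sit) are our assumptions.

```python
import torch
import torch.nn as nn

class WISDMLSTM(nn.Module):
    """LSTM classifier following Section 2.3.3.1: a 6-unit LSTM over
    200 x 6 windows, flattened into a small feed-forward head."""
    def __init__(self, num_classes=12, timesteps=200):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=6, batch_first=True)
        self.dropout1 = nn.Dropout(0.2)
        self.fc1 = nn.Linear(6 * timesteps, 128)  # 1,200 -> 128
        self.dropout2 = nn.Dropout(0.2)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):          # x: (batch, 200, 6)
        out, _ = self.lstm(x)      # (batch, 200, 6)
        out = self.dropout1(out.flatten(1))
        out = torch.relu(self.fc1(out))
        return self.fc2(self.dropout2(out))

model = WISDMLSTM()
print(model(torch.randn(4, 200, 6)).shape)  # torch.Size([4, 12])
```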
2.3.3.2 UT-HAR

For UT-HAR, we use a ResNet-18 model with a custom architecture designed for the Wi-Fi-based Human Activity Recognition (HAR) task. The model consists of an initial convolutional layer that reshapes the input into a 3-channel tensor, followed by the main ResNet architecture with 18 layers. This main architecture includes a series of convolutional blocks with residual connections, Group Normalization layers, ReLU activations, and max-pooling. Finally, there is an adaptive average pooling layer followed by a fully connected layer that outputs the class probabilities. The model uses 64 output channels in the initial layer and doubles the number of channels as it goes deeper. The last fully connected layer has 7 output units corresponding to the number of classes for the UT-HAR task.

2.3.3.3 Widar

For Widar, we also use a custom ResNet-18 model tailored to the Widar dataset. The model starts by reshaping the 22-channel input to 3 channels using two convolutional transpose layers, followed by a convolutional layer with 64 filters, Group Normalization, ReLU activation, and max-pooling. The core of the model consists of four layers of residual blocks (similar to the standard ResNet-18) with 64, 128, 256, and 512 filters. Each basic block within these layers contains two convolutional layers, Group Normalization, and ReLU activations. Finally, an adaptive average pooling layer reduces the spatial dimensions to 1 × 1, followed by a fully connected layer that outputs the class scores.

2.3.3.4 VisDrone

For VisDrone, we use the default YOLOv8n model from the Ultralytics library. YOLOv8n is the smallest YOLOv8 model variant, with the three scale parameters (depth, width, and the maximum number of channels) set to 0.33, 0.25, and 1024 respectively.

2.3.3.5 CASAS

For CASAS, we use a BiLSTM neural network composed of an embedding layer, a bidirectional LSTM, and a fully connected layer. The embedding layer takes input sequences with dimensions equal to the input dimension and converts them to dense vectors of size 64. The bidirectional LSTM layer has an input size of 64, the same number of hidden units, and processes the embedded sequences in both forward and backward directions. The output of the LSTM layer is connected to a fully connected layer with an input size of 128 (to account for the bidirectional LSTM concatenation) that outputs the logits for the 12 activities in the CASAS dataset.

2.3.3.6 AEP

For AEP, we use a custom multi-layer perceptron (MLP) neural network with five hidden layers and an output layer. The input layer accepts 18 features and passes them through a linear transformation to the first hidden layer with 210 units. The following hidden layers progressively scale the number of units up by factors of 2 and 4 and then scale back down; specifically, the sizes of the hidden layers are 210, 420, 840, 420, and 210 units respectively. Each hidden layer uses a ReLU activation function followed by a dropout layer with a dropout rate of 0.3 for regularization. The output layer has a single unit, and the output of the network is obtained by passing the activations of the last hidden layer through a final linear transformation.

2.3.3.7 EPIC-SOUNDS

For EPIC-SOUNDS, we again use a custom ResNet-18 model, which consists of a stack of convolutional layers followed by batch normalization and ReLU activation. The architecture begins with a 7 × 7 convolutional layer with stride 2, followed by a max-pooling layer. It then contains four blocks, each comprising a sequence of basic blocks with residual connections; specifically, each block contains two basic blocks, with output channel sizes of 64, 128, 256, and 512 respectively. Each basic block comprises two sets of 3 × 3 convolutional layers, each followed by batch normalization and ReLU activation. The first convolutional layer in the basic block has a stride of 2 in the second, third, and fourth blocks.
Finally, the model has an adaptive average pooling layer, which reduces the spatial dimensions to 1 × 1, followed by a fully connected layer with an output size of 44 classes.

2.4 Experiments and Analysis

We implemented FedAIoT using PyTorch [43] and Ray [44], and conducted our experiments on a combination of 8×NVIDIA A6000, 8×NVIDIA RTX8000, 8×NVIDIA RTX3090, and 10×NVIDIA A100 GPU clusters as needed. We ran each experiment three times with three random seeds and report both mean and standard deviation values.

Table 2.3 Overall performance.

                              Centralized      Low Data Heterogeneity (α = 0.5)   High Data Heterogeneity (α = 0.1)
Dataset       Metric                           FedAvg           FedOPT            FedAvg           FedOPT
WISDM-W       Accuracy (%)    74.05 ± 2.47     70.03 ± 0.13     71.50 ± 1.52      68.51 ± 2.21     65.76 ± 2.42
WISDM-P       Accuracy (%)    36.88 ± 1.08     36.21 ± 0.19     34.32 ± 0.84      34.28 ± 3.28     32.99 ± 0.55
UT-HAR        Accuracy (%)    95.24 ± 0.75     94.03 ± 0.63     94.10 ± 0.84      74.24 ± 3.87     87.78 ± 5.48
Widar         Accuracy (%)    61.24 ± 0.56     59.21 ± 1.79     56.26 ± 3.11      54.76 ± 0.42     47.99 ± 3.99
VisDrone      MAP-50 (%)      34.26 ± 1.56     32.70 ± 1.19     32.21 ± 0.28      31.23 ± 0.70     31.51 ± 2.18
CASAS         Accuracy (%)    83.70 ± 2.21     75.93 ± 2.82     76.40 ± 2.20      74.72 ± 1.32     75.36 ± 2.40
AEP           R²              0.586 ± 0.006    0.502 ± 0.024    0.503 ± 0.011     0.407 ± 0.003    0.475 ± 0.016
EPIC-SOUNDS   Accuracy (%)    45.67 ± 0.12     45.51 ± 1.07     42.39 ± 2.01      33.02 ± 5.62     37.21 ± 2.68

2.4.1 Overall Performance

First, we benchmark the FL performance of two FL optimizers, FedAvg and FedOPT, under low (α = 0.5) and high (α = 0.1) data heterogeneity levels, and compare it against centralized training.

Benchmark Results: Table 2.3 summarizes our results. We make three observations. (1) The data heterogeneity level and the FL optimizer have different impacts on different datasets. In particular, the performance of UT-HAR and Widar is very sensitive to the data heterogeneity level. In contrast, WISDM-P does not show a noticeable accuracy difference under FedAvg across data heterogeneity levels. (2) Under low data heterogeneity, FedAvg provides more stable performance than FedOPT and consistently achieves performance closer to centralized training across diverse data modalities. (3) Compared to the other datasets, CASAS, AEP, and WISDM-W have larger accuracy margins between centralized training and FL under low data heterogeneity. This indicates the need for more advanced FL algorithms for the CASAS, AEP, and WISDM-W datasets.

2.4.2 Impact of Client Sampling Ratio

IoT devices usually have significant communication restrictions, and hence the client sampling ratio is a critical hyperparameter for FL systems operating on AIoT devices. In this experiment,
we focus on two client sampling ratios: 10% and 30%. We recorded the maximum accuracy reached after completing 50%, 80%, and 100% of the total training rounds for both ratios under high data heterogeneity, thereby offering empirical evidence of how the model's performance and convergence rate are affected by the client sampling ratio.

Table 2.4 Impact of client sampling ratio.

                               Low Client Sampling Ratio (10%)                  High Client Sampling Ratio (30%)
Dataset      Training Rounds   50% Rounds      80% Rounds      100% Rounds      50% Rounds      80% Rounds      100% Rounds
WISDM-W      400               58.81 ± 1.43    63.82 ± 1.53    68.51 ± 2.21     65.57 ± 2.10    67.23 ± 0.77    69.21 ± 1.13
WISDM-P      400               29.49 ± 3.65    31.65 ± 1.42    34.28 ± 3.28     33.73 ± 2.77    34.01 ± 2.27    36.01 ± 2.23
UT-HAR       2000              61.81 ± 7.01    70.76 ± 2.23    74.24 ± 3.87     86.46 ± 10.90   90.84 ± 4.42    92.51 ± 2.65
Widar        1500              47.55 ± 1.20    50.65 ± 0.24    54.76 ± 0.42     53.93 ± 2.90    55.74 ± 2.15    57.39 ± 3.14
VisDrone     600               27.07 ± 3.09    31.05 ± 1.55    31.23 ± 0.70     30.56 ± 2.71    33.52 ± 2.90    34.85 ± 0.83
CASAS        400               71.68 ± 1.96    74.19 ± 1.26    74.72 ± 1.32     73.89 ± 1.16    74.68 ± 1.50    76.12 ± 2.03
AEP          3000              0.325 ± 0.013   0.371 ± 0.017   0.407 ± 0.003    0.502 ± 0.006   0.523 ± 0.014   0.538 ± 0.005
EPIC-SOUNDS  300               20.99 ± 5.19    25.73 ± 1.99    28.89 ± 2.82     23.70 ± 6.25    31.74 ± 7.83    35.11 ± 1.99

Benchmark Results: Table 2.4 summarizes our results. We make two observations. (1) An increased client sampling ratio is highly correlated with superior model accuracy (i.e., the highest accuracy within 100% of training rounds) across different IoT data modalities. This demonstrates the importance of the client sampling ratio to the final model performance at the end of FL. (2) However, a higher sampling ratio does not inherently guarantee faster model convergence. For example, WISDM-P, Widar, and EPIC-SOUNDS achieve higher model performance with the lower client sampling ratio at 50% of training rounds than with the higher one. This result underscores the complex dynamics between client participation and learning efficiency for different IoT data modalities.

2.4.3 Impact of Erroneous Labels

As elaborated in Section 2.2.3.5, we investigate the implications of erroneous labels. We assess the performance of our models when the label error ratio is set at 10% and 30%, juxtaposing these results with the control scenario that involves no label errors. Note that we only showcase this for WISDM, UT-HAR, Widar, CASAS, and EPIC-SOUNDS, as these are classification tasks and the concept of erroneous labels only applies to classification tasks.

Table 2.5 Impact of erroneous labels.

Erroneous Label Ratio  WISDM-W        WISDM-P        UT-HAR         Widar          CASAS          EPIC-SOUNDS
0%                     68.51 ± 2.21   34.28 ± 3.28   74.24 ± 3.87   54.76 ± 0.42   74.72 ± 1.32   28.89 ± 2.82
10%                    50.63 ± 4.19   28.85 ± 1.44   73.75 ± 5.67   34.03 ± 0.33   65.01 ± 2.98   21.43 ± 3.86
30%                    47.90 ± 3.05   27.68 ± 0.39   70.55 ± 3.27   27.20 ± 0.56   63.16 ± 1.34   13.30 ± 0.42

Benchmark Results: Table 2.5 summarizes our results. We make two observations. (1) As the ratio of erroneous labels increases, the performance of the models decreases across all the datasets, and the impact of erroneous labels varies across datasets. For example, WISDM-W experiences only a small performance drop at a 10% label error ratio, but its performance drops significantly when the label error ratio increases to 30%. In contrast, CASAS exhibits a more gradual decline in performance as the error ratio increases from 0% to 10% and from 10% to 30%.
(2) UT-HAR and EPIC-SOUNDS are very sensitive to label error and show a significant accuracy drop even at a 10% label error ratio.

2.4.4 Performance on Quantized Training

Lastly, we examine the impact of model quantization on federated learning, specifically using half precision (FP16). We assess the models' accuracy and memory usage under this quantization, comparing the results to those of the full-precision (FP32) models. Memory is measured by analyzing the GPU memory usage of a model when trained with the same batch size under a centralized setting.

Benchmark Results: Table 2.6 summarizes the model performance and memory usage at the two precision levels. We make three observations. (1) As expected, memory usage decreases significantly with FP16 precision, with reductions ranging from 57.0% to 63.3% across the datasets. (2) As shown in previous work [45], how model performance responds to the precision level varies depending on the dataset: for UT-HAR, CASAS, AEP, and EPIC-SOUNDS, the FP16 models maintain or even improve performance relative to the FP32 models. (3) In contrast, WISDM-W, WISDM-P, Widar, and VisDrone decline in performance when quantized to FP16 precision, with Widar declining drastically.

Table 2.6 Performance on quantized training.

                              FP32                                FP16
Dataset       Metric          Model Performance  Memory Usage     Model Performance  Memory Usage
WISDM-W       Accuracy (%)    68.51 ± 2.21       1444 MB          60.31 ± 5.38       564 MB (↓ 60.9%)
WISDM-P       Accuracy (%)    34.28 ± 3.28       1444 MB          30.22 ± 2.05       564 MB (↓ 60.9%)
UT-HAR        Accuracy (%)    74.24 ± 3.87       1716 MB          72.86 ± 4.49       639 MB (↓ 62.8%)
Widar         Accuracy (%)    54.76 ± 0.42       1734 MB          34.03 ± 0.33       636 MB (↓ 63.3%)
VisDrone      MAP-50 (%)      31.23 ± 0.70       8369 MB          29.17 ± 4.70       3515 MB (↓ 60.0%)
CASAS         Accuracy (%)    74.72 ± 1.32       1834 MB          72.86 ± 4.49       732 MB (↓ 60.1%)
AEP           R²              0.407 ± 0.003      1201 MB          0.469 ± 0.044      500 MB (↓ 58.4%)
EPIC-SOUNDS   Accuracy (%)    33.02 ± 5.62       2176 MB          35.43 ± 6.61       936 MB (↓ 57.0%)

2.4.5 Insights from Benchmark Results

Need for Resilience to High Data Heterogeneity: As presented in Table 2.3, datasets can exhibit a notable response to changes in data heterogeneity. We observe that CASAS, AEP, and EPIC-SOUNDS show a significant impact even at low data heterogeneity, while UT-HAR and Widar see a drastic decline under high data heterogeneity. These findings emphasize the need to develop advanced FL algorithms for data modalities that are sensitive to high data heterogeneity.

Need for Balancing Client Sampling Ratio and Resource Consumption of IoT Devices: Table 2.4 reveals that a higher sampling ratio can lead to improved performance in the long run. However, higher client sampling ratios generally entail increased communication bandwidth and energy consumption, which may not be desirable for IoT devices. Therefore, it is crucial to identify the sweet spot that balances the client sampling ratio against resource consumption.

Need for Resilience to Erroneous Labels: As demonstrated in Table 2.5, certain datasets exhibit high sensitivity to label errors, resulting in a significant drop in FL performance. Notably, both UT-HAR and EPIC-SOUNDS experience a drastic decrease in accuracy when faced with a 10% erroneous label ratio. Given the inevitability of label errors in real FL deployments, where private data remains unmonitored and uncalibrated except by the respective data owners, the development of label-error-resilient techniques is crucial for achieving reliable FL performance.
Table 2.7 Analysis of quantization demands.

Application           Dataset      IoT Platform         Representative Devices        Hardware RAM Size  Need Quantization
Activity Recognition  WISDM-W      Smartwatch           Apple Watch 8                 512 MB to 1 GB     Yes
                      WISDM-P      Smartphone           iPhone 14                     6 GB               No
Gesture Recognition   UT-HAR       Wi-Fi Router         TP-Link AX1800                64 MB to 1 GB      Yes
                      Widar        Wi-Fi Router         TP-Link AX1800                64 MB to 1 GB      Yes
Independent Living    CASAS        Smart Home           Raspberry Pi 4                1 GB to 8 GB       No
Energy Prediction     AEP          Smart Home           Raspberry Pi 4                1 GB to 8 GB       No
Object Detection      VisDrone     Drone                DJI Mavic 3 + Raspberry Pi 4  1 GB to 8 GB       Yes
Augmented Reality     EPIC-SOUNDS  Head-mounted Device  GoPro / AR Headset            1 GB to 8 GB       No

Need for Quantization: Table 2.7 highlights the importance of quantization in FL for all eight datasets. Notably, certain IoT devices, such as drones, lack sufficient RAM capacity for FL, so external hardware such as a Raspberry Pi 4 has to be incorporated as an assistive computing platform. The analysis in Table 2.6 reveals that the performance of VisDrone drops significantly from FP32 to FP16 precision, and that WISDM-W, UT-HAR, Widar, and VisDrone require computing memory that exceeds the representative hardware RAM limits at FP32 precision, underscoring the necessity of quantized training.
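A minimal sketch of the half-precision training condition, assuming a PyTorch model and data loader, is given below. It simply casts weights and inputs to float16 during local training, which is one straightforward reading of the FP16 setting in Table 2.6 rather than the benchmark's exact implementation.

```python
import torch
import torch.nn as nn

def local_train_fp16(model, loader, lr=0.01, device="cuda"):
    """One local FL training pass with weights and activations in FP16.
    Numerically sensitive models (e.g., Widar's, per Table 2.6) may
    instead need mixed precision via torch.cuda.amp."""
    model = model.to(device).half()               # FP16 weights
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        x, y = x.to(device).half(), y.to(device)  # FP16 activations
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    return model
```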
CHAPTER 3
FEDROLEX: MODEL-HETEROGENEOUS FEDERATED LEARNING WITH ROLLING SUB-MODEL EXTRACTION

3.1 Related Work

Knowledge Distillation (KD) Techniques in Heterogeneous FL. Knowledge distillation (KD) is a significant strategy for implementing model-heterogeneous Federated Learning (FL) across various devices [46]. Specifically, FedDF [47] implements KD by distilling knowledge from multiple classifiers trained with private data on different client devices; the logits from each classifier are applied to an unlabeled public dataset to help the server train a student model via KD. DS-FL [48] took a similar approach but also proposed a semi-supervised FL method that employs pseudo-labeling of public data to enhance performance. Group knowledge transfer, as introduced by FedGKT [49], transmits knowledge from client devices to a substantial model on the server without utilizing public data. Furthermore, Fed-ET [50] formulated a weighted consensus distillation method with diversity regularization, enabling the server to train a larger model with the aid of smaller client models. Nevertheless, KD-based methods present certain challenges: they often necessitate public data to attain competitive accuracy, with performance depending on the size and domain similarity of the public and client data [47, 50, 51]. Moreover, the use of client model weights in KD makes these methods misaligned with secure aggregation protocols, exposing them to potential backdoor attacks [52].

Heterogeneous FL via Partial Training (PT) Methods. To mitigate the limitations associated with KD-based techniques, partial training (PT) has been explored as a viable alternative for model-heterogeneous FL. Current PT-based approaches can typically be sorted into two main types: random and static sub-model extraction. In particular, Federated Dropout [9] employs a random extraction technique inspired by the commonly used dropout method in centralized training [53]. Although this integrates seamlessly into existing FL frameworks, Federated Dropout's effectiveness diminishes with increasing data heterogeneity and a smaller client cohort, as observed in [54]. HeteroFL [10] and FjORD [11], on the other hand, proposed a static extraction method where sub-models are always taken from a fixed portion of the global server model. This strategy, however, encounters two primary drawbacks. First, it restricts the global server model to the same size as the largest client model, thereby limiting its potential due to client resource constraints. Second, it mandates that different sub-models be trained only on clients with corresponding resources, leading to different parts of the global model being trained on different data distributions, which can harm overall performance, especially in high data heterogeneity scenarios. In response to these challenges, our research introduces a rolling sub-model extraction mechanism that effectively addresses the issues found in both random and static sub-model extraction methods.

3.2 Methodology

3.2.1 Formulation of Model-Heterogeneous FL

Let N denote N client devices with non-IID (non-identically and independently distributed) local data D = {D_1, D_2, ..., D_N}. Model-homogeneous FL trains a global model with parameters θ by solving the following optimization problem:

$$\min_{\theta} F(\theta) \triangleq \sum_{n=1}^{N} p_n F_n(\theta) \qquad (3.1)$$

with

$$F_n(\theta) \triangleq \frac{1}{m_n} \sum_{k=1}^{m_n} l(\theta; d_{n,k}), \qquad (3.2)$$

where D_n ≜ {d_{n,1}, d_{n,2}, d_{n,3}, ..., d_{n,m_n}} is the set of local data samples of client n and p_n is its corresponding weight such that p_n ≥ 0 and ∑_{n=1}^{N} p_n = 1. In comparison, in model-heterogeneous FL, clients train local models with heterogeneous capacities β = {β_1, β_2, ..., β_N}, and the local objective function of the nth client becomes

$$F'_n(\theta_n) \triangleq \frac{1}{m_n} \sum_{k=1}^{m_n} l(\theta_n; d_{n,k}). \qquad (3.3)$$

Here, β_n denotes the model capacity of client n, and we define it as the proportion of nodes extracted from each layer of θ for client n. The size of θ_n depends on β_n, and the parameters θ_n are obtained by selecting a sub-model from the global model θ, which can change from one round to another. If θ_n changes, the objective function also changes. For simplicity, we use the same notation l for the loss function for all clients and rounds, though it differs between clients and rounds. The key to model-heterogeneous FL is selecting θ_n from the global model θ given the model capacity β_n.

Figure 3.1 Overview of the rolling sub-model extraction scheme in FedRolex.

3.2.2 FedRolex: Model-Heterogeneous FL with Rolling Sub-Model Extraction

In the context of partial training (PT), FedRolex operates by training a sub-model at each client, extracted from the global server model, and then transmitting the relevant sub-model updates back to the server for aggregation. Figure 3.1 provides an illustrative depiction of how FedRolex functions, showing three cycles of federated training across two heterogeneous clients. In this scenario, one client is responsible for training a larger-capacity sub-model (on the left), while the other trains a smaller-capacity one (on the right). At a broad level, during each round, the server extracts sub-models of varying capacities from the global model and individually sends them to the clients that possess the capabilities necessary to handle them. Each client then trains the received sub-model on its local data and sends the heterogeneous updates back to the server. The server, in turn, compiles these updates, using the aggregated result to refresh the global model in preparation for the next round.
A detailed breakdown of the FedRolex procedure can be found in Algorithm 3.1. Central to the architecture of FedRolex are two critical design decisions, which we describe in detail below.

(1) Which sub-models are extracted for each client across different rounds? On the server, FedRolex employs a rolling window to methodically extract each sub-model from the global model. This rolling window progresses with each round, sequentially traversing all components of the global model across different rounds and looping in a manner that ensures the global model is uniformly trained until it reaches convergence. Consider Figure 3.1 as a reference: during round j, the large-capacity and small-capacity client models extracted from the global model consist of nodes a, b, c, d and c, d, e, respectively. Moving to round j + 1, the rolling window shifts by one step (this step size is a hyperparameter of FedRolex; further insights can be found in Section 3.3.7 as part of our ablation study), transforming the large-capacity and small-capacity client models into b, c, d, e and d, e, a, correspondingly. In a similar manner, when proceeding to round j + 2, the rolling window progresses yet another step, resulting in the large-capacity and small-capacity client models becoming c, d, e, a and e, a, b, respectively. Such a rolling sub-model extraction scheme can be formalized as follows.

Let θ_n^(j) denote the parameters of the sub-model extracted from the global model for client n in round j, K_i denote the total number of nodes in layer i of the global model, and S_{n,i}^(j) denote the node indices of layer i of the global model that belong to the extracted sub-model for client n in round j. Then layer i of the sub-model extracted by the rolling sub-model extraction scheme for client n in round j is given by

    S_{n,i}^{(j)} =
    \begin{cases}
      \{\hat{j}, \hat{j}+1, \ldots, \hat{j}+\lfloor \beta_n K_i \rfloor - 1\} & \text{if } \hat{j}+\lfloor \beta_n K_i \rfloor \le K_i, \\
      \{\hat{j}, \hat{j}+1, \ldots, K_i - 1\} \cup \{0, 1, \ldots, \hat{j}+\lfloor \beta_n K_i \rfloor - 1 - K_i\} & \text{otherwise},
    \end{cases}    (3.4)

where \hat{j} = j \bmod K_i.

Figure 3.2 Illustration of how sub-models are extracted by the random sub-model extraction scheme (left) and the static sub-model extraction scheme (right) over two rounds.

(2) How are heterogeneous sub-model updates aggregated to update the global model? FedRolex employs a straightforward selective averaging scheme with no client weighting to aggregate the heterogeneous sub-model updates sent from the clients. Specifically, it computes the average of the updates for each parameter of the global model separately, based on how many clients in a round updated that parameter; a parameter remains unchanged if no client updated it. Taking Figure 3.1 again as an example: in round j, the updates for a and b are obtained from the large-capacity model only, and the update for e comes from the small-capacity model only. In contrast, since c and d are part of both models, their updates are computed by averaging over both models. A minimal code sketch of both design decisions follows.
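The sketch below is a plain-Python illustration (ours, not the released FedRolex code) of Equation (3.4) and the selective averaging rule just described; layer parameters are modeled as simple lists so that the index arithmetic stays explicit.

import math

def rolling_indices(j, beta_n, K_i):
    # Node indices of layer i extracted for a client with capacity beta_n in
    # round j; the modular shift implements both cases of Equation (3.4).
    j_hat = j % K_i                                  # rolling-window start position
    width = math.floor(beta_n * K_i)                 # number of nodes the client holds
    return [(j_hat + t) % K_i for t in range(width)]

def selective_average(global_layer, client_updates):
    # client_updates: list of (indices, values) pairs from heterogeneous clients.
    # Each parameter is averaged over exactly the clients that updated it;
    # untouched parameters keep their previous global value.
    sums = [0.0] * len(global_layer)
    counts = [0] * len(global_layer)
    for indices, values in client_updates:
        for k, v in zip(indices, values):
            sums[k] += v
            counts[k] += 1
    return [sums[k] / counts[k] if counts[k] else global_layer[k]
            for k in range(len(global_layer))]

# One layer with K_i = 5 nodes and a client with beta_n = 4/5:
print(rolling_indices(0, 4/5, 5))   # [0, 1, 2, 3]
print(rolling_indices(1, 4/5, 5))   # [1, 2, 3, 4]   (window shifted by one)
print(rolling_indices(2, 4/5, 5))   # [2, 3, 4, 0]   (wraps around; second case of Eq. 3.4)

Because the window position depends only on the round index, every node of every layer is visited at the same rate, which is exactly the uniformity property that the statistical analysis in Section 3.2.4 quantifies.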
3.2.3 Comparison with Random and Static Sub-Model Extraction Schemes

Existing sub-model extraction schemes can be grouped into random-based (Federated Dropout) and static-based (HeteroFL, FjORD) methods. In this section, we describe the differences between them and the proposed rolling-based scheme employed in FedRolex. For comparison, the pseudocodes of both Federated Dropout and HeteroFL are included in Section 3.3.11.

Algorithm 3.1 FedRolex
Require: D_n, β_n ∀n ∈ N
Ensure: θ^(J)
1: Initialize θ^(0), N
2: for j = 0 to J − 1 do
3:   Sample subset M from N
4:   Broadcast θ^(j)_{m,S_{m,i}^(j)} to each client m ∈ M, ∀i, with S_{m,i}^(j) from Equation (3.4)
5:   for each client m ∈ M do
6:     ClientStep(θ^(j)_m, D_m)
7:   end for
8:   Aggregate θ^(j+1)_{[i,k]} according to Equation (3.10)
9: end for
10: function ClientStep(θ_n, D_n)
11:   m_n ← len(D_n)
12:   for k = 1 to m_n do
13:     θ_n ← θ_n − η∇l(θ_n; d_{n,k})
14:   end for
15:   return θ_n
16: end function

3.2.3.1 Comparison with Random Sub-Model Extraction Scheme

In a random sub-model extraction scheme, the sub-models are extracted from the global model at random in each round. As such, layer i of the sub-model extracted by the random sub-model extraction scheme for client n in round j is given by

    S_{n,i}^{(j)} = \{k_c \mid \text{integer } k_c \in [0, K_i - 1] \text{ for } 1 \le c \le \lfloor \beta_n K_i \rfloor\},    (3.5)

where a total of ⌊β_n K_i⌋ nodes are randomly chosen from the global model.

Discussion: As shown in Figure 3.2 (left), similar to the proposed rolling-based scheme, the sub-models extracted across different rounds by the random-based scheme have different architectures. However, because sub-models are selected at random in each round, the global model is trained less evenly, making it more vulnerable to client drift. In short, although the expected update frequency is the same for every index, the realized frequencies are not the same due to randomness. Consequently, the random-based scheme cannot balance the update frequencies of different parts of the global model, and it inevitably takes more rounds to update the whole global model. Moreover, as we show in Section 3.2.4, the expected number of rounds for Federated Dropout to select all I sub-models at least m times is on the order of I log(I) + I(m − 1) log log I, which is larger than that of FedRolex, mI.

3.2.3.2 Comparison with Static Sub-Model Extraction Scheme

In a static sub-model extraction scheme, the sub-models are always extracted from a designated part of the global model in each round. As such, layer i of the sub-model extracted by the static sub-model extraction scheme for client n in round j is given by

    S_{n,i}^{(j)} = \{0, 1, 2, \ldots, \lfloor \beta_n K_i \rfloor - 1\}.    (3.6)

Note that S_{n,i}^{(j)} does not depend on j. In other words, as shown in Figure 3.2 (right), the same sub-model is extracted for each client in every round. Moreover, the smaller-capacity and larger-capacity client models are not independent. As shown in Figure 3.2 (right), the small-capacity model {a, b, c} is part of the large-capacity model {a, b, c, d}, which in turn is part of the global-capacity model {a, b, c, d, e}. These are the two key differences between the static-based scheme and both the random-based and the proposed rolling-based schemes.

Discussion: Given these properties, the static-based scheme has two primary drawbacks. First, to cover the whole global model, some clients must train the full-size global model {a, b, c, d, e}. As such, the global model is restricted to the same size as the largest client model.
Second, as shown in Figure 3.2 (right), while a, b, and c will be trained on data from all three types of clients, d will not be trained on data from small-capacity clients, and e will only be trained on data from global-model-capacity clients. As a consequence, different parts of the global model are trained on data with different distributions, which inevitably degrades the global model training quality.

3.2.4 Statistical Analysis

Lemma 1. Given I indices, with one index chosen uniformly at random in each round, the expected number of rounds to choose all indices at least once is

    I \left( \frac{1}{I} + \frac{1}{I-1} + \cdots + \frac{1}{1} \right),

which is the same as

    I \int_0^{\infty} \left( 1 - (1 - e^{-t})^I \right) dt.

Proof. We denote by E(i) the expected number of rounds to choose exactly i indices at least once. Then E(1) = 1, because after the first round one index has been chosen. After the first round, the expected number of rounds to choose a new index is I/(I − 1), because one of the remaining I − 1 out of the total I indices needs to be chosen. That is, E(2) = E(1) + I/(I − 1). Similarly, we have

    E(i) = E(i-1) + \frac{I}{I+1-i}, \quad \forall i = 2, \ldots, I.

Thus, we have

    E(I) = E(I-1) + \frac{I}{1} = E(I-2) + \frac{I}{2} + \frac{I}{1} = \cdots = I \left( \frac{1}{I} + \frac{1}{I-1} + \cdots + \frac{1}{1} \right).

The lemma is proved. It shows that the expected number of rounds to choose all indices at least once is I log(I) as I → ∞. This proof cannot be generalized to the case of choosing all indices at least m times for m ≥ 2. Therefore, we provide an alternative proof [55, Example 5.17].

Alternative proof of Lemma 1. This proof treats the picking of indices as Poisson processes. Assume that the Poisson process for choosing one index has rate λ = 1. Since each index is chosen uniformly at random, choosing the jth index also follows a Poisson process, with rate 1/I, for any j [55, Proposition 5.2]. We let X_j be the time at which index j is first chosen, and

    X = \max_{1 \le j \le I} X_j    (3.7)

is the time by which all indices have been chosen at least once. Since all the X_j are independent with rate 1/I, we have

    P\{X < t\} = P\{\max_{1 \le j \le I} X_j < t\} = P\{X_j < t \text{ for } j = 1, \ldots, I\} = (1 - e^{-t/I})^I.

Therefore, we have

    E[X] = \int_0^{\infty} P\{X > t\}\, dt = \int_0^{\infty} \left( 1 - (1 - e^{-t/I})^I \right) dt.

We let N be the number of rounds to choose all indices at least once, and T_i be the ith interarrival time of the Poisson process for choosing one index. Then we have

    X = \sum_{i=1}^{N} T_i,

and the T_i are independent. Thus we have E[X | N] = N E[T_i] = N, which gives E[X] = E\{E[X | N]\} = E[N]. Thus we have

    E[N] = \int_0^{\infty} \left( 1 - (1 - e^{-t/I})^I \right) dt = I \int_0^{\infty} \left( 1 - (1 - e^{-t})^I \right) dt.

The lemma is proved.
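To make the contrast concrete, the short simulation below (our own numerical illustration, not part of the thesis's derivation) estimates the expected number of rounds random selection needs to cover all I indices, and compares it against the exact harmonic-sum expression from Lemma 1 and the I log(I) asymptote; a rolling window covers the same I indices in exactly I rounds.

import math
import random

def rounds_to_cover(I, m=1, trials=2000, seed=0):
    # Monte Carlo estimate of E[rounds] until every one of I indices has been
    # drawn at least m times under uniform random selection.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * I
        rounds = 0
        while min(counts) < m:
            counts[rng.randrange(I)] += 1
            rounds += 1
        total += rounds
    return total / trials

I = 50
exact = I * sum(1.0 / k for k in range(1, I + 1))   # Lemma 1: I * (1/I + ... + 1/1)
print(rounds_to_cover(I))    # simulated mean, close to the exact value
print(exact)                 # ~224.96
print(I * math.log(I))       # ~195.60, the asymptotic I*log(I) rate

With I = 50, random selection needs roughly 4.5 times as many rounds as the deterministic rolling window; Lemma 2 next quantifies the analogous gap for covering every index at least m times.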
Next, we present the lemma for choosing each index at least m times.

Lemma 2. Given I indices, with one index chosen uniformly at random in each round, the expected number of rounds to choose all indices at least m times is

    I \int_0^{\infty} \left( 1 - \left( 1 - S_m(t) e^{-t} \right)^I \right) dt,

where

    S_m(y) := 1 + y + \frac{y^2}{2!} + \cdots + \frac{y^{m-1}}{(m-1)!} = \sum_{l=0}^{m-1} \frac{y^l}{l!}.    (3.8)

Proof. We again treat the picking of indices as Poisson processes. Assume that the Poisson process for choosing one index has rate λ = 1. Since each index is chosen uniformly at random, choosing the jth index also follows a Poisson process with rate 1/I for any j. We let X_j be the time at which index j is chosen for the mth time, and

    X = \max_{1 \le j \le I} X_j    (3.9)

is the time by which all indices have been chosen at least m times. Since all the X_j are independent with rate 1/I, we have

    P\{X < t\} = P\{\max_{1 \le j \le I} X_j < t\} = P\{X_j < t \text{ for } j = 1, \ldots, I\} = \left( 1 - S_m(t/I) e^{-t/I} \right)^I.

Therefore, we have

    E[X] = \int_0^{\infty} P\{X > t\}\, dt.

We let N be the number of rounds to choose all indices at least m times, and T_i be the ith interarrival time of the Poisson process for choosing one index. Then we have

    X = \sum_{i=1}^{N} T_i,

and the T_i are independent. Thus we have E[X | N] = N E[T_i] = N, which gives E[X] = E\{E[X | N]\} = E[N]. Thus we have

    E[N] = \int_0^{\infty} \left( 1 - \left( 1 - S_m(t/I) e^{-t/I} \right)^I \right) dt = I \int_0^{\infty} \left( 1 - \left( 1 - S_m(t) e^{-t} \right)^I \right) dt.

The lemma is proved. It shows that the expected number of rounds to choose all indices at least m times is I log(I) + I(m − 1) log log I when I → ∞ [56].

3.2.5 Formal Definition of Selective Aggregation Scheme

Formally, let M ⊂ N be the set of clients selected from the client pool whose model parameters the server pulls in round j. Let θ_{[i,k]} be the kth parameter of layer i of the global model and θ_{m,[i,k]} be the kth parameter of layer i of client m. We denote by M_k ⊂ M the set of clients updating the kth parameter. The model parameters are aggregated as follows:

    \theta_{[i,k]} = \frac{1}{\sum_{m \in M_k} p_m} \sum_{m \in M_k} p_m \theta_{m,[i,k]},    (3.10)

where the client weight p_m is assigned based on factors such as the client model capacity and the number of data points the client has. Throughout this thesis, unless otherwise stated, the weight of all clients is assumed to be the same, i.e., p_m = 1/N.

3.3 Experiments

Datasets and Models. We evaluate the performance of FedRolex under two regimes. Under the small-model small-dataset regime, we train pre-activated ResNet18 (PreResNet18) models [57] on CIFAR-10 and CIFAR-100 [58]. We replace the batch normalization in PreResNet18 with static batch normalization [10, 59] and add a scalar module after each convolution layer [10]. Under the large-model large-dataset regime, we use Stack Overflow [60] and follow [3] to train a modified 3-layer Transformer [61] with a vocabulary of 10,000 words, where the dimension of the token embeddings is 128 and the hidden dimension of the feed-forward network (FFN) block is 2048. We use ReLU activation and 8 heads for the multi-head attention, where each head is based on 12-dimensional (query, key, value) vectors. The statistics of the datasets are listed in Table 3.1.

Table 3.1 Dataset statistics.

Dataset         Train Clients  Train Examples  Validation Clients  Validation Examples  Test Clients  Test Examples
CIFAR-10        100            50,000          N/A                 N/A                  N/A           10,000
CIFAR-100       100            50,000          N/A                 N/A                  N/A           10,000
Stack Overflow  342,477        135,818,730     38,758              16,491,230           204,088       16,586,035

Data Heterogeneity. We model non-IID distributions for CIFAR-10 and CIFAR-100 in line with HeteroFL [10], limiting each client to L labels. Two degrees of data heterogeneity are considered. For CIFAR-10, high data heterogeneity is defined as L = 2 and low data heterogeneity as L = 5. Similarly, for CIFAR-100, high and low data heterogeneity correspond to L = 20 and L = 50, respectively. These levels roughly align with a Dirichlet distribution Dir_K(α) with α = 0.1 and α = 0.5. With the Stack Overflow dataset, a non-IID distribution occurs naturally since the data is partitioned by user IDs.

Model Heterogeneity. In our evaluation, we consider five client model capacities, β = {1, 1/2, 1/4, 1/8, 1/16}. Here, for instance, 1/2 indicates that the client model capacity is half the size of the largest client model (the full model). For ResNet18, we alter the number of kernels in the convolution layers while maintaining the nodes in the output layers. In the case of the Transformer, the number of nodes in the hidden layer of the attention heads is varied. A code sketch of this width scaling follows.
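As one concrete illustration of this width scaling — a sketch assuming plain PyTorch and simple channel-wise slicing, not the exact experiment code — a β-capacity convolution can be materialized from the global layer's weights as follows; handling of batch normalization statistics and output layers is omitted.

import math
import torch
import torch.nn as nn

def extract_sub_conv(global_conv: nn.Conv2d, out_idx, in_idx):
    # Build a smaller Conv2d holding only the selected output/input channels.
    sub = nn.Conv2d(len(in_idx), len(out_idx),
                    kernel_size=global_conv.kernel_size,
                    stride=global_conv.stride,
                    padding=global_conv.padding,
                    bias=global_conv.bias is not None)
    with torch.no_grad():
        w = global_conv.weight[out_idx][:, in_idx]   # slice out-channels, then in-channels
        sub.weight.copy_(w)
        if global_conv.bias is not None:
            sub.bias.copy_(global_conv.bias[out_idx])
    return sub

g = nn.Conv2d(64, 128, 3, padding=1)                 # a "global" layer
beta = 1/4
out_idx = list(range(math.floor(beta * 128)))        # indices from the extraction scheme
in_idx = list(range(math.floor(beta * 64)))
client_layer = extract_sub_conv(g, out_idx, in_idx)  # 16 -> 32 channels

The out_idx and in_idx lists would come from the rolling scheme of Equation (3.4), the random scheme of Equation (3.5), or the static scheme of Equation (3.6).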
Baselines. FedRolex is compared with both state-of-the-art PT-based model-heterogeneous FL methods, namely Federated Dropout [9] and HeteroFL [10] (comparison with FjORD was omitted because its code is not open-source and its results could not be replicated following the paper), and KD-based model-heterogeneous FL methods, including FedDF [47], DS-FL [48], and Fed-ET [50] (FedGKT [49] was excluded as it is solely compatible with CNN models). To guarantee fairness, all PT-based baselines were trained using identical learning rates, numbers of communication rounds, and multi-step learning rate decay schedules. Specific details are provided in Section 3.3.10.

Configurations and Platform. We used bounding box crop [62] to augment images for CIFAR-10 and CIFAR-100. During each communication round, a random 10% of clients are selected from a pool of 100 clients. For Stack Overflow, following [3], a 10% dropout rate is applied to prevent overfitting, and 200 clients are randomly chosen from a pool of 342,477 clients in each round. Details on the hyper-parameters for model training are available in Section 3.3.10. Our experiments, conducted on eight NVIDIA A6000 GPUs, were implemented using PyTorch [63] and Ray [64] for FedRolex and the PT-based baselines.

Evaluation Metrics. Global and local model accuracy serve as our assessment metrics. The global model accuracy refers to the server model's performance on the test set, while the local model accuracy measures the server model's performance on each client's individual dataset. For CIFAR-10 and CIFAR-100, classification accuracy is reported. In the case of Stack Overflow, the next-word prediction accuracy is reported, encompassing both out-of-vocabulary (OOV) and end-of-sentence (EOS) tokens. Experiments are carried out with five different seeds for CIFAR-10 and CIFAR-100 and three seeds for Stack Overflow.

3.3.1 Performance Comparison with State-of-the-Art Model-Heterogeneous FL Methods

First, we compare the performance of FedRolex with state-of-the-art PT-based and KD-based model-heterogeneous FL methods. For a fair comparison, we followed the experimental settings used in prior arts, where the distributions of client model capacities are uniform and the global server model is the same size as the largest client model.

Evaluation Results: Table 3.2 summarizes our results. We make two observations. (1) In comparison with state-of-the-art PT-based methods, under the small-model small-dataset regime, FedRolex consistently outperforms HeteroFL and Federated Dropout under both low and the more challenging high data heterogeneity scenarios. In particular, under high data heterogeneity, Federated Dropout, which extracts sub-models randomly, performs worse than FedRolex and HeteroFL, which both extract sub-models in a deterministic manner. Under the large-model large-dataset regime, FedRolex also outperforms both HeteroFL and Federated Dropout. Together, these results demonstrate the superiority of FedRolex under both regimes. (2) In comparison with state-of-the-art KD-based methods, FedRolex only performs worse than Fed-ET and FedDF on CIFAR-10 under high data heterogeneity, but outperforms all the KD-based methods on the more challenging CIFAR-100, which has a larger number of classes than CIFAR-10, under both low and high data heterogeneity scenarios. It is important to note that KD-based methods leverage public data to boost their model accuracy while FedRolex does not.
Table 3.2 Global model accuracy comparison between FedRolex, PT-based and KD-based model-heterogeneous FL methods, and model-homogeneous FL methods. Note that the results of the KD-based methods were obtained from [50]. For Stack Overflow, since KD-based methods cannot be directly used for language modeling tasks, their results are marked as N/A.

                                   High Data Heterogeneity          Low Data Heterogeneity
          Method                   CIFAR-10        CIFAR-100        CIFAR-10        CIFAR-100        Stack Overflow
KD-based  FedDF                    73.81 (± 0.42)  31.87 (± 0.46)   76.55 (± 0.32)  37.87 (± 0.31)   N/A
          DS-FL                    65.27 (± 0.53)  29.12 (± 0.51)   68.44 (± 0.47)  33.56 (± 0.55)   N/A
          Fed-ET                   78.66 (± 0.31)  35.78 (± 0.45)   81.13 (± 0.28)  41.58 (± 0.36)   N/A
PT-based  HeteroFL                 63.90 (± 2.74)  52.38 (± 0.80)   73.19 (± 1.71)  57.44 (± 0.42)   27.21 (± 0.22)
          Federated Dropout        46.64 (± 3.05)  45.07 (± 0.07)   76.20 (± 2.53)  46.40 (± 0.21)   23.46 (± 0.12)
          FedRolex                 69.44 (± 1.50)  56.57 (± 0.15)   84.45 (± 0.36)  58.73 (± 0.33)   29.22 (± 0.24)
          Homogeneous (smallest)   38.82 (± 0.88)  12.69 (± 0.50)   46.86 (± 0.54)  19.70 (± 0.34)   27.32 (± 0.12)
          Homogeneous (largest)    75.74 (± 0.42)  60.89 (± 0.60)   84.48 (± 0.58)  62.51 (± 0.20)   29.79 (± 0.32)

3.3.2 Performance Comparison with Model-Homogeneous FL Methods

We also compare the global model accuracy of FedRolex with two model-homogeneous cases in which all the clients have the largest-capacity model (β = {1}) and the smallest-capacity model (β = {1/16}), representing the upper-bound and lower-bound performance, respectively.

Evaluation Results: As listed in Table 3.2, compared with the other PT-based methods, FedRolex reduces the gap in global model accuracy between the model-heterogeneous and the upper-bound model-homogeneous settings. In particular, FedRolex is on par with the upper-bound model-homogeneous case for Stack Overflow, whereas both HeteroFL and Federated Dropout perform even worse than the model-homogeneous case using the smallest model. This result indicates that with FedRolex, we are not constrained to using only high-end devices to achieve competitive global model accuracy. Note that Fed-ET achieves a higher global model accuracy than the model-homogeneous upper bound on CIFAR-10 under high data heterogeneity, which showcases the advantage of using public data.

3.3.3 Impact of Client Model Heterogeneity Distribution

In our previous experiments, the distributions of model capacities across client devices were set to be uniform. In this experiment, we aim to understand the impact of the client model heterogeneity distribution. To do so, without loss of generality, we use two client model capacities β = {1, 1/16} and vary the distribution ratio between the two (denoted as ρ), where ρ = 1 represents the case in which all the clients have the largest-capacity model (β = {1}) and ρ = 0 the case in which all the clients have the smallest-capacity model (β = {1/16}).

Figure 3.3 Impact of client model heterogeneity distribution on global model accuracy for (i) CIFAR-10, (ii) CIFAR-100, and (iii) Stack Overflow.

Evaluation Results: Figure 3.3 shows how the global model accuracy changes as ρ varies from 0 to 1 for CIFAR-10, CIFAR-100, and Stack Overflow. We make three observations. (1) For CIFAR-10 (Figure 3.3(i)), there is a large gap in global model accuracy between high and low data heterogeneity for a wide range of ρ (from 0.1 to 1). This is because CIFAR-10 is a relatively simple task, and hence the global model accuracy is bottlenecked by the level of data heterogeneity rather than by model capacity.
This result indicates that having more high-capacity models in the cohort contributes only marginally to global model accuracy. (2) For the more challenging CIFAR-100 (Figure 3.3(ii)), the gap in global model accuracy between high and low data heterogeneity is much smaller. In contrast to CIFAR-10, the global model accuracy is bottlenecked by the highest capacity of the models rather than by the level of data heterogeneity. (3) Across both regimes (Figure 3.3(i)(ii) vs. Figure 3.3(iii)), we observe that having a small fraction of large-capacity models significantly boosts the global model accuracy, but continuing to increase the ratio of large-capacity models contributes little additional accuracy.

3.3.4 Performance on Training Larger Server Model

Similar to Federated Dropout, one advantage of FedRolex over static sub-model extraction methods (HeteroFL and FjORD) is that FedRolex is able to train a global model that is larger than the largest client model. In this experiment, we aim to evaluate the performance of FedRolex on training larger server models. We therefore consider the case where the size of the global server model is γ = {2, 4, 8, 16} times the size of the client models. For simplicity, all client models have the same size.

Figure 3.4 Performance on training a larger server model when the server model is γ times the size of the client model for (i) CIFAR-10, (ii) CIFAR-100, and (iii) Stack Overflow.

Evaluation Results: Figure 3.4(i) and Figure 3.4(ii) compare FedRolex with Federated Dropout in terms of global model accuracy across different values of γ for CIFAR-10 and CIFAR-100, respectively. As shown, although the global model accuracy drops for both FedRolex and Federated Dropout as γ increases, especially from 1 to 4, FedRolex consistently achieves higher global model accuracy than Federated Dropout across γ = {2, 4, 8, 16} under both low and high data heterogeneity. For Stack Overflow (Figure 3.4(iii)), the global model accuracy drops much less as γ increases. This demonstrates the advantage of large-scale datasets for training larger server models.

3.3.5 Enhancing Inclusiveness of FL under Real-World Distribution

A primary vision of FedRolex is to enhance the inclusiveness of FL. To demonstrate this, in this experiment we use the real-world household income distribution to emulate a real-world device distribution. Specifically, we retrieve household income distribution information from [65]. We map β_n = 1/16 to the income group earning less than $75,000 and assign the remaining income groups, in $25,000 increments, to increasing values of β_n. The detailed mapping of model capacities to the corresponding income distribution is provided in Figure 3.7 in Section 3.3.10.

Evaluation Results: Table 3.3 shows both the global and local model accuracies of FedRolex for CIFAR-10 and CIFAR-100, as well as the global model accuracy on Stack Overflow, under the emulated real-world device distribution.
Table 3.3 Performance of FedRolex under the emulated real-world device distribution.

                                          High Data Heterogeneity           Low Data Heterogeneity
Dataset         Method                    Local Accuracy   Global Accuracy  Local Accuracy   Global Accuracy
CIFAR-10        Homogeneous (smallest)    85.90 (± 0.46)   38.82 (± 0.88)   66.02 (± 0.52)   46.86 (± 0.54)
                Homogeneous (largest)     95.54 (± 0.26)   75.74 (± 0.41)   93.54 (± 0.44)   84.48 (± 0.58)
                FedRolex                  94.05 (± 1.01)   63.17 (± 1.45)   91.03 (± 0.36)   80.14 (± 0.52)
CIFAR-100       Homogeneous (smallest)    34.51 (± 0.56)   12.69 (± 0.50)   33.22 (± 0.10)   19.70 (± 0.34)
                Homogeneous (largest)     81.99 (± 0.78)   60.89 (± 0.60)   76.43 (± 0.54)   62.51 (± 0.20)
                FedRolex                  73.33 (± 0.96)   45.78 (± 1.71)   66.31 (± 0.34)   48.44 (± 0.51)
Stack Overflow  Homogeneous (smallest)    27.32 (± 0.12)
                Homogeneous (largest)     29.79 (± 0.32)
                FedRolex                  29.55 (± 0.41)

Again, we compare with two model-homogeneous cases in which all clients have the smallest and the largest model capacities, representing the lower-bound and upper-bound accuracy, respectively. We make two observations. (1) Looking at the global model accuracy, FedRolex consistently outperforms the lower-bound model-homogeneous case across CIFAR-10, CIFAR-100, and Stack Overflow. This result indicates that FedRolex enhances the inclusiveness of FL and improves the accuracy of the global model, which could otherwise not be achieved. (2) Looking at the local model accuracy, FedRolex significantly outperforms the lower-bound model-homogeneous case on CIFAR-10 and CIFAR-100 under both low and high data heterogeneity. This result indicates that FedRolex effectively boosts the performance of low-end devices, which would otherwise not benefit from FL. A detailed illustration of how the local model accuracy distribution of individual clients shifts when FedRolex is used, compared to the smallest model-homogeneous case with the same client outreach, is shown in Figure 3.5.

Figure 3.5 Local model accuracy distribution of FedRolex (orange) vs. the smallest model-homogeneous case (blue) for CIFAR-10 and CIFAR-100 under low and high data heterogeneity.

3.3.6 Impact of Different Weighting Schemes

[50] reported that weighting clients is important for improving model accuracy. We therefore conducted an ablation study and evaluated three client weighting schemes: (1) a model size-based weighting scheme, where the client weight is proportional to the number of kernels in the model; (2) a model update-based weighting scheme, where the client weight is proportional to the number of updates; and (3) a hybrid weighting scheme, where the client weight is proportional to both the model size and the number of model updates (see the code sketch at the end of this section).

Table 3.4 Impact of weighting schemes on model accuracy under high data heterogeneity.

           Weighting Scheme    Local Model Accuracy  Global Model Accuracy
CIFAR-10   Non-Weighting       95.95 (± 0.81)        69.44 (± 1.50)
           Model Size-based    95.98 (± 0.67)        69.09 (± 1.42)
           Model Update-based  96.01 (± 0.71)        68.83 (± 0.89)
           Hybrid              96.05 (± 0.96)        68.78 (± 0.89)
CIFAR-100  Non-Weighting       81.58 (± 0.59)        56.57 (± 0.15)
           Model Size-based    81.23 (± 1.56)        56.99 (± 0.27)
           Model Update-based  81.23 (± 1.07)        56.63 (± 0.36)
           Hybrid              81.49 (± 1.07)        56.71 (± 0.20)

Table 3.4 lists the results. As shown, the performance of the three weighting schemes is not significantly better than that of the non-weighting scheme. Therefore, we used the non-weighting scheme in FedRolex.
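For concreteness, the sketch below (our own illustration; the per-client statistics are hypothetical) shows how the four schemes compared in Table 3.4 turn client statistics into the weights p_m used in Equation (3.10).

def client_weights(scheme, model_sizes, update_counts):
    # model_sizes[m]: number of kernels client m trains;
    # update_counts[m]: number of updates client m performed.
    if scheme == "non-weighting":
        raw = [1.0] * len(model_sizes)
    elif scheme == "model-size":
        raw = [float(s) for s in model_sizes]
    elif scheme == "model-update":
        raw = [float(u) for u in update_counts]
    elif scheme == "hybrid":
        raw = [float(s * u) for s, u in zip(model_sizes, update_counts)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    return [r / total for r in raw]   # normalized so the p_m sum to 1

print(client_weights("model-size", [16, 64, 256], [10, 10, 10]))
# -> [0.0476..., 0.1904..., 0.7619...]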
3.3.7 Impact of Overlapping Kernels

We also studied the impact of overlapping kernels between rounds, using ResNet18 on CIFAR-10 and CIFAR-100 as an example. Specifically, we extracted sub-models using a rolling window that advances and loops over all the kernels of each convolution layer in the global model in strides. Let the degree of overlap between consecutive strides of the rolling window be r ∈ [0, 1]. In each round, the window of each convolution layer in the global model is advanced by 1 + ⌊β_n(1 − r)K_i⌋, where ⌊·⌋ is the floor function. In FedRolex, r = 1, i.e., the kernels are advanced by 1 from one round to the next. Figure 3.6 shows the impact of different r on global model accuracy. As shown, the value of r does have some influence on the global model accuracy, but the impact is non-linear and inconsistent.

Figure 3.6 Impact of inter-round kernel overlap on global model accuracy under low and high data heterogeneity for (i) CIFAR-10 and (ii) CIFAR-100.

3.3.8 Impact of Client Participation Rate

In our main experiments, we followed prior arts [10, 53, 11, 66, 49] and used a 10% client participation rate. To examine the effect of the client participation rate, we conducted experiments with both a lower (5%) and a higher (20%) client participation rate using CIFAR-10 as an example for FedRolex, HeteroFL, and Federated Dropout. The results are summarized in Table 3.5. As shown, FedRolex consistently outperforms both Federated Dropout and HeteroFL across the 5%, 10%, and 20% client participation rates.

Table 3.5 Performance of FedRolex, HeteroFL, and Federated Dropout on CIFAR-10 under different client participation rates.

Method             5%              10%             20%
HeteroFL           48.43 (± 1.78)  63.90 (± 2.74)  65.07 (± 2.17)
Federated Dropout  42.06 (± 1.29)  46.64 (± 3.05)  55.20 (± 4.64)
FedRolex           57.90 (± 2.72)  69.44 (± 1.50)  71.85 (± 1.22)

3.3.9 Communication and Computation Costs of FedRolex

To calculate the communication cost, we use the average size of the models sent by all the participating clients per round as the metric. To calculate the computation overhead, we calculate the FLOPs and the number of parameters in the models of all the participating clients per round and take the averages as the metrics. To put these metrics in context, we also calculate the upper and lower bounds of the communication cost and computation overhead (i.e., all clients using the same largest model and the same smallest model, respectively). Table 3.6 lists the results.

Table 3.6 Computation and communication costs of FedRolex compared to the upper and lower bounds represented by homogeneous settings with the largest and smallest models, respectively.

                                                   Homogeneous (largest)  FedRolex   Homogeneous (smallest)
Average Number of Parameters per Client (Million)  11.1722                2.9781232  0.04451
Average FLOPs per Client (Million)                 557.656                149.048384 2.41318
Average Model Size per Client (MB)                 42.62                  11.36      0.17

As shown, compared to the upper bound, FedRolex significantly reduces the communication cost and computation overhead while achieving comparable model accuracy. Compared to the lower bound, although FedRolex has a higher communication cost and computation overhead, it achieves much higher model accuracy. These results indicate that FedRolex achieves comparably high model accuracy to the upper bound at much lower communication and computation cost.
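As a consistency check on Table 3.6 (our own arithmetic, not from the thesis), the reported per-client model sizes follow directly from the average parameter counts at 4 bytes per FP32 parameter:

# Relating the rows of Table 3.6: number of parameters -> model size in MB.
def model_size_mb(num_params, bytes_per_param=4):
    return num_params * bytes_per_param / 2**20   # mebibytes

for name, params in [("Homogeneous (largest)", 11.1722e6),
                     ("FedRolex", 2.9781232e6),
                     ("Homogeneous (smallest)", 0.04451e6)]:
    print(f"{name}: {model_size_mb(params):.2f} MB")
# -> 42.62, 11.36, 0.17, matching the last row of Table 3.6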
3.3.10 Detailed Experimental Setup

Experimental Setup Details for Table 3.2. The experimental setup for the PT-based methods is listed in Table 3.7. The experimental setup for the model-homogeneous baselines was slightly different from that of the PT-based methods and is hence listed separately in Table 3.8.

Experimental Setup Details for Figure 3.3. The experimental setup details are tabulated in Tables 3.9 and 3.10.

Experimental Setup Details for Figure 3.4. The experimental setup details are tabulated in Table 3.11.

3.3.11 Algorithm Pseudocodes

The pseudocodes for HeteroFL and Federated Dropout are given in Algorithms 3.2 and 3.3, respectively. Their difference from FedRolex lies in the sub-model broadcast step (line 4): HeteroFL always broadcasts a fixed (static) portion of each layer, while Federated Dropout broadcasts a randomly chosen portion.

Algorithm 3.2 HeteroFL
Require: D_n, β_n ∀n ∈ N
Ensure: θ^(J)
1: Initialize θ^(0), N
2: for j = 0 to J − 1 do
3:   Sample subset M from N
4:   Broadcast θ^(j)_{m,[i; 0, 1, ..., ⌊β_m K_i⌋−1]} ∀i and m ∈ M
5:   for each client m ∈ M do
6:     ClientStep(θ^(j)_m, D_m)
7:   end for
8:   Aggregate θ^(j+1)_{[i,k]} according to Equation (3.10)
9: end for
10: function ClientStep(θ_n, D_n)
11:   m_n ← len(D_n)
12:   for k = 1 to m_n do
13:     θ_n ← θ_n − η∇l(θ_n; d_{n,k})
14:   end for
15:   return θ_n
16: end function

Algorithm 3.3 Federated Dropout
Require: D_n, β_n ∀n ∈ N
Ensure: θ^(J)
1: Initialize θ^(0), N
2: for j = 0 to J − 1 do
3:   Sample subset M from N
4:   Broadcast θ^(j)_{m,[i; k_1, ..., k_{⌊β_m K_i⌋}]} ∀i and m ∈ M, with the indices k_c chosen at random
5:   for each client m ∈ M do
6:     ClientStep(θ^(j)_m, D_m)
7:   end for
8:   Aggregate θ^(j+1)_{[i,k]} according to Equation (3.10)
9: end for
10: function ClientStep(θ_n, D_n)
11:   m_n ← len(D_n)
12:   for k = 1 to m_n do
13:     θ_n ← θ_n − η∇l(θ_n; d_{n,k})
14:   end for
15:   return θ_n
16: end function

Figure 3.7 Device heterogeneity distribution (1/16×: 55%, 1/8×: 18%, 1/4×: 11%, 1/2×: 10%, 1×: 6%).

Table 3.7 Experimental setup details of the PT-based methods in Table 3.2 on CIFAR-10, CIFAR-100, and Stack Overflow.

                                                 CIFAR-10   CIFAR-100   Stack Overflow
Local Epoch                                      1          1           1
Cohort Size                                      10         10          200
Batch Size                                       10         24          24
Initial Learning Rate                            2.00E-04   1.00E-04    2.00E-04
Decay Schedule (High Data Heterogeneity)         800, 1500  1000, 1500  600, 800
Decay Schedule (Low Data Heterogeneity)          800, 1250  1000, 1500  600, 800
Decay Factor                                     0.1        0.1         0.1
Communication Rounds (High Data Heterogeneity)   2500       3500        1200
Communication Rounds (Low Data Heterogeneity)    2000       3500        1200
Optimizer                                        SGD        SGD         SGD
Momentum                                         0.9        0.9         0.9
Weight Decay                                     5.00E-04   5.00E-04    5.00E-04

Table 3.8 Experimental setup details of the model-homogeneous baselines in Table 3.2 on CIFAR-10, CIFAR-100, and Stack Overflow.

                                                 CIFAR-10   CIFAR-100   Stack Overflow
Local Epoch                                      1          1           1
Cohort Size                                      10         10          200
Batch Size                                       10         24          24
Initial Learning Rate                            2.00E-04   1.00E-04    2.00E-04
Decay Schedule (High Data Heterogeneity)         500, 1000  1000, 1500  300
Decay Schedule (Low Data Heterogeneity)          500, 1000  1000, 1500  300
Decay Factor                                     0.1        0.1         0.1
Communication Rounds (High Data Heterogeneity)   1250       3500        ~1000
Communication Rounds (Low Data Heterogeneity)    1500       3500        ~1000
Optimizer                                        SGD        SGD         SGD
Momentum                                         0.9        0.9         0.9
Weight Decay                                     5.00E-04   5.00E-04    5.00E-04
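For convenience, the PT-baseline settings of Table 3.7 can also be transcribed directly into a configuration mapping. This is a sketch whose key names are our own choice, not identifiers from the released code.

PT_SETUP = {
    "cifar10": {
        "local_epoch": 1, "cohort_size": 10, "batch_size": 10, "initial_lr": 2e-4,
        "decay_schedule": {"high": [800, 1500], "low": [800, 1250]},
        "decay_factor": 0.1,
        "communication_rounds": {"high": 2500, "low": 2000},
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
    },
    "cifar100": {
        "local_epoch": 1, "cohort_size": 10, "batch_size": 24, "initial_lr": 1e-4,
        "decay_schedule": {"high": [1000, 1500], "low": [1000, 1500]},
        "decay_factor": 0.1,
        "communication_rounds": {"high": 3500, "low": 3500},
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
    },
    "stackoverflow": {
        "local_epoch": 1, "cohort_size": 200, "batch_size": 24, "initial_lr": 2e-4,
        "decay_schedule": [600, 800], "decay_factor": 0.1,
        "communication_rounds": 1200,
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
    },
}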
Table 3.9 Experimental setup for the results shown in Figure 3.3, for ρ between 0.0 and 0.5 in 0.1 increments.

CIFAR-10
  ρ                                     0.0        0.1        0.2        0.3        0.4
  Decay Schedule (High Heterogeneity)   500, 1000  500, 1000  500, 1000  700, 1200  700, 1200
  Communication Rounds                  1250       1250       1250       1500       1500
  Decay Schedule (Low Heterogeneity)    500, 1000  500, 1000  500, 1000  700, 1200  700, 1200
  Communication Rounds                  1250       1250       1250       1500       1500
CIFAR-100
  Decay Schedule (High and Low Heterogeneity)   1000, 1500 for all ρ
  Communication Rounds                          2000 for all ρ
Stack Overflow
  Decay Schedule (High and Low Heterogeneity)   800 for all ρ
  Communication Rounds                          1500 for all ρ

Table 3.10 Experimental setup for the results shown in Figure 3.3, for ρ between 0.5 and 1.0 in 0.1 increments.

CIFAR-10
  ρ                                     0.6        0.7        0.8        0.9        1.0
  Decay Schedule (High Heterogeneity)   700, 1200  700, 1200  500, 1000  500, 1000  500, 1000
  Communication Rounds                  1500       1500       1250       1250       1250
  Decay Schedule (Low Heterogeneity)    700, 1200  700, 1200  500, 1000  500, 1000  500, 1000
  Communication Rounds                  1500       1500       1250       1250       1250
CIFAR-100
  Decay Schedule (High and Low Heterogeneity)   1000, 1500 for all ρ
  Communication Rounds                          2000 for all ρ
Stack Overflow
  Decay Schedule (High and Low Heterogeneity)   800 for all ρ
  Communication Rounds                          1500 for all ρ

Table 3.11 Experimental setup for the results shown in Figure 3.4.

CIFAR-10 and CIFAR-100
  γ                                             2          4          8          16
  Decay Schedule (High and Low Heterogeneity)   800, 1200  800, 1200  800, 1200  800, 1200
  Communication Rounds                          1500       1500       1500       1500
Stack Overflow
  Decay Schedule (High and Low Heterogeneity)   800        800        800        800
  Communication Rounds                          1500       1500       1500       1500

Table 3.12 Income distribution.

Model Capacity   Annual Household Income
1/16×            < $75,000
1/8×             $75,000 – $100,000
1/4×             $100,000 – $150,000
1/2×             $150,000 – $200,000
1×               > $200,000

Table 3.13 Experimental setup for Table 3.3 on CIFAR-10, CIFAR-100, and Stack Overflow.

                                                 CIFAR-10   CIFAR-100   Stack Overflow
Local Epoch                                      1          1           1
Cohort Size                                      10         10          200
Batch Size                                       10         24          24
Initial Learning Rate                            2.00E-04   1.00E-04    2.00E-04
Decay Schedule (High Heterogeneity)              800, 1500  1000, 1500  600, 800
Decay Schedule (Low Heterogeneity)               800, 1250  1000, 1500  600, 800
Decay Factor                                     0.1        0.1         0.1
Communication Rounds (High Heterogeneity)        2500       3500        1200
Communication Rounds (Low Heterogeneity)         2000       3500        1200
Optimizer                                        SGD        SGD         SGD
Momentum                                         0.9        0.9         0.9
Weight Decay                                     5.00E-04   5.00E-04    5.00E-04

CHAPTER 4

LIMITATIONS AND FUTURE WORK

While the benchmark presented has been instrumental in elucidating the roles of different factors in model accuracy, the scope of AIoT (Artificial Intelligence of Things) extends further.
A holistic understanding of AIoT demands an examination of its infrastructural aspects, including the computational prowess and energy utilization of IoT platforms, along with the efficiency and security of their communication protocols. These are equally vital dimensions in the AIoT landscape that contribute to the rich complexity of this field. As part of our ongoing commitment to advancing the field, we intend to continually expand the scope of the benchmark, incorporating additional datasets from a more diverse set of applications, integrating new algorithms, and conducting deeper analytical validations. Our aspiration is to build upon our existing work, fostering collaboration and innovation, and thereby contribute to the nuanced and evolving world of AIoT.

Furthermore, our work has also provided a statistical analysis of FedRolex, an approach designed to train a global server model using a federation of heterogeneous client models. However, the full convergence analysis of FedRolex is intricate and will be an area for future investigation. An additional challenge is determining what models to deploy onto each client after the global server model is trained, especially when that model is substantial. This task is separate from our current focus but is something we will pursue in our future work.

CHAPTER 5

CONCLUSION

In this thesis, we have introduced two key contributions: FedAIoT and FedRolex. First, we presented FedAIoT, a Federated Learning (FL) benchmark specifically tailored for AIoT (Artificial Intelligence of Things). This benchmark encompasses eight datasets, harvested from a diverse array of genuine IoT devices, and incorporates a unified end-to-end FL framework for AIoT that spans the full FL-for-AIoT pipeline. Through our benchmarking of these datasets, we have been able to shed light on the unique opportunities and challenges that arise in applying FL within the AIoT context. Second, we introduced FedRolex, a partial training (PT)-based model-heterogeneous FL approach, designed to train a global server model that surpasses the size of the largest client model. By proposing a rolling sub-model extraction scheme, FedRolex facilitates the equitable training of the parameters of the global server model, thereby reducing the client drift resulting from model heterogeneity. We also furnished a theoretical statistical analysis to articulate its advantage over existing techniques like Federated Dropout. The experimental results have confirmed that FedRolex consistently excels over other state-of-the-art PT-based methods across various models and datasets, at both small and large scales. Additionally, we demonstrated its efficacy on an emulated real-world device distribution, underscoring how FedRolex furthers the inclusiveness of FL.

BIBLIOGRAPHY

[1] S. Nižetić, P. Šolić, D. López-de-Ipiña González-de Artaza, and L. Patrono, “Internet of things (iot): Opportunities, issues and challenges towards a smart and sustainable future,” Journal of Cleaner Production, vol. 274, p. 122877, 2020.

[2] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.

[3] J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, B. A. y Arcas, M. Al-Shedivat, G. Andrew, S. Avestimehr, K. Daly, D. Data, S. Diggavi, H. Eichner, A. Gadhikar, Z. Garrett, A. M. Girgis, F. Hanzely, A. Hard, C. He, S.
Horvath, Z. Huo, A. Ingerman, M. Jaggi, T. Javidi, P. Kairouz, S. Kale, S. P. Karimireddy, J. Konečný, S. Koyejo, T. Li, L. Liu, M. Mohri, H. Qi, S. J. Reddi, P. Richtarik, K. Singhal, V. Smith, M. Soltanolkotabi, W. Song, A. T. Suresh, S. U. Stich, A. Talwalkar, H. Wang, B. Woodworth, S. Wu, F. X. Yu, H. Yuan, M. Zaheer, M. Zhang, T. Zhang, C. Zheng, C. Zhu, and W. Zhu, “A field guide to federated optimization,” 2021.

[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.

[5] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.

[6] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.

[7] H.-Y. Chen and W.-L. Chao, “Fedbe: Making bayesian model ensemble applicable to federated learning,” arXiv preprint arXiv:2009.01974, 2020.

[8] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.

[9] S. Caldas, J. Konečný, H. B. McMahan, and A. Talwalkar, “Expanding the reach of federated learning by reducing client resource requirements,” arXiv preprint arXiv:1812.07210, 2018.

[10] E. Diao, J. Ding, and V. Tarokh, “Heterofl: Computation and communication efficient federated learning for heterogeneous clients,” arXiv preprint arXiv:2010.01264, 2020.

[11] S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. Venieris, and N. Lane, “Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout,” Advances in Neural Information Processing Systems, vol. 34, 2021.

[12] Z. Charles, K. Bonawitz, S. Chiknavaryan, B. McMahan et al., “Federated select: A primitive for communication- and memory-efficient federated learning,” arXiv preprint arXiv:2208.09432, 2022.

[13] C. He, A. D. Shah, Z. Tang, D. F. N. Sivashunmugam, K. Bhogaraju, M. Shimpi, L. Shen, X. Chu, M. Soltanolkotabi, and S. Avestimehr, “Fedcv: a federated learning framework for diverse computer vision tasks,” arXiv preprint arXiv:2111.11066, 2021.

[14] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar, “Leaf: A benchmark for federated settings,” arXiv preprint arXiv:1812.01097, 2018.

[15] C. Song, F. Granqvist, and K. Talwar, “Flair: Federated learning annotated image repository,” ArXiv, vol. abs/2207.08869, 2022.

[16] F. Lai, Y. Dai, S. Singapuram, J. Liu, X. Zhu, H. Madhyastha, and M. Chowdhury, “Fedscale: Benchmarking model and system performance of federated learning at scale,” in International Conference on Machine Learning. PMLR, 2022, pp. 11814–11827.

[17] D. Dimitriadis, M. H. Garcia, D. Diaz, A. Manoel, and R. Sim, “Flute: A scalable, extensible framework for high-performance federated learning simulations,” ArXiv, vol. abs/2203.13789, 2022.

[18] B. Y. Lin, C. He, Z. Zeng, H. Wang, Y. Huang, M. Soltanolkotabi, X. Ren, and S. Avestimehr, “Fednlp: A research platform for federated learning in natural language processing,” arXiv preprint arXiv:2104.08815, 2021.

[19] J. O. d. Terrail, S.-S. Ayed, E. Cyffers, F. Grimberg, C. He, R. Loeb, P.
Mangold, T. Marchand, O. Marfoq, E. Mushtaq et al., “Flamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings,” arXiv preprint arXiv:2210.04620, 2022.

[20] T. Zhang, T. Feng, S. Alam, S. Lee, M. Zhang, S. S. Narayanan, and S. Avestimehr, “Fedaudio: A federated learning benchmark for audio tasks,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[21] C. He, K. Balasubramanian, E. Ceyani, C. Yang, H. Xie, L. Sun, L. He, L. Yang, P. S. Yu, Y. Rong et al., “Fedgraphnn: A federated learning system and benchmark for graph neural networks,” arXiv preprint arXiv:2104.07145, 2021.

[22] G. M. Weiss, K. Yoneda, and T. Hayajneh, “Smartphone and smartwatch-based biometrics using activities of daily living,” IEEE Access, vol. 7, pp. 133190–133202, 2019.

[23] S. Yousefi, H. Narui, S. Dayal, S. Ermon, and S. Valaee, “A survey on behavior recognition using WiFi channel state information,” IEEE Communications Magazine, vol. 55, no. 10, pp. 98–104, Oct. 2017. [Online]. Available: https://doi.org/10.1109/mcom.2017.1700082

[24] Z. Yang, “Widar3.0 dataset: Cross-domain gesture recognition with wi-fi,” 2020. [Online]. Available: https://ieee-dataport.org/open-access/widar30-dataset-cross-domain-gesture-recognition-wi-fi

[25] Y. Zheng, Y. Zhang, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, “Zero-effort cross-domain gesture recognition with wi-fi,” in Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services. ACM, Jun. 2019. [Online]. Available: https://doi.org/10.1145/3307334.3326081

[26] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.

[27] M. Schmitter-Edgecombe and D. J. Cook, “Assessing the quality of activities in a smart environment,” Methods of Information in Medicine, vol. 48, no. 5, pp. 480–485, 2009. [Online]. Available: https://doi.org/10.3414/me0592

[28] L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy and Buildings, vol. 140, pp. 81–97, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378778816308970

[29] J. Huh, J. Chalk, E. Kazakos, D. Damen, and A. Zisserman, “EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound,” in IEEE International Conference on Acoustics, Speech, & Signal Processing (ICASSP), 2023.

[30] D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision (IJCV), vol. 130, pp. 33–55, 2022. [Online]. Available: https://doi.org/10.1007/s11263-021-01531-2

[31] J. W. Lockhart, G. M. Weiss, J. C. Xue, S. T. Gallagher, A. B. Grosner, and T. T. Pulickal, “Design considerations for the wisdm smart phone-based sensor mining architecture,” in Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, ser. SensorKDD ’11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 25–33. [Online]. Available: https://doi.org/10.1145/2003653.2003656

[32] T.-M. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” arXiv preprint arXiv:1909.06335, 2019.

[33] S. Liu and W.
Deng, “Very deep convolutional neural network based image classification using small training sample size,” in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 730–734.

[34] J. Yang, X. Chen, H. Zou, C. X. Lu, D. Wang, S. Sun, and L. Xie, “Sensefi: A library and benchmark on deep-learning-empowered wifi human sensing,” Patterns, vol. 4, no. 3, p. 100703, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666389923000405

[35] D. Liciotti, M. Bernardini, L. Romeo, and E. Frontoni, “A sequential deep learning application for recognising human activities in smart homes,” Neurocomputing, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231219304862

[36] C. Reinbothe, “Wisdm—biometric-time-series-data-classification,” https://github.com/Chrissi2802/WISDM---Biometric-time-series-data-classification, 2023.

[37] J. Terven and D. Cordova-Esparza, “A comprehensive review of yolo: From yolov1 and beyond,” 2023.

[38] S. Seyedzadeh, F. P. Rahimian, I. Glesk, and M. Roper, “Machine learning for estimation of building energy consumption and performance: a review,” Visualization in Engineering, vol. 6, no. 1, p. 5, 2018. [Online]. Available: https://doi.org/10.1186/s40327-018-0064-7

[39] Sholahudin, A. G. Alam, C. I. Baek, and H. Han, “Prediction and analysis of building energy efficiency using artificial neural network and design of experiments,” Applied Mechanics and Materials, vol. 819, pp. 541–545, 2016.

[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[41] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons, “The non-iid data quagmire of decentralized machine learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020.

[42] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” arXiv preprint arXiv:2003.00295, 2020.

[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

[44] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging ai applications,” in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’18. USA: USENIX Association, 2018, pp. 561–577.

[45] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” ArXiv, vol. abs/1710.03740, 2017.

[46] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.

[47] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.

[48] S. Itahara, T.
Nishio, Y. Koda, M. Morikura, and K. Yamamoto, “Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data,” arXiv preprint arXiv:2008.06180, 2020.

[49] C. He, M. Annavaram, and S. Avestimehr, “Group knowledge transfer: Federated learning of large cnns at the edge,” Advances in Neural Information Processing Systems, vol. 33, pp. 14068–14080, 2020.

[50] Y. J. Cho, A. Manoel, G. Joshi, R. Sim, and D. Dimitriadis, “Heterogeneous ensemble knowledge transfer for training large models in federated learning,” International Joint Conference on Artificial Intelligence (IJCAI), 2022.

[51] S. D. Stanton, P. Izmailov, P. Kirichenko, A. A. Alemi, and A. G. Wilson, “Does knowledge distillation really work?” in Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021. [Online]. Available: https://openreview.net/forum?id=7J-fKoXiReA

[52] H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J.-y. Sohn, K. Lee, and D. Papailiopoulos, “Attack of the tails: Yes, you really can backdoor federated learning,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 16070–16084. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/b8ffa41d4e492f0fad2f13e29e1762eb-Paper.pdf

[53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[54] G. Cheng, Z. Charles, Z. Garrett, and K. Rush, “Does federated dropout actually work?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3387–3395.

[55] S. M. Ross, Introduction to Probability Models, 11th ed. San Diego, CA, USA: Academic Press, 2014.

[56] D. J. Newman, “The double dixie cup problem,” The American Mathematical Monthly, vol. 67, no. 1, pp. 58–61, 1960.

[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[58] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.

[59] M. Andreux, J. O. d. Terrail, C. Beguier, and E. W. Tramel, “Siloed federated learning for multi-centric histopathology datasets,” in Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. Springer, 2020, pp. 129–139.

[60] TFF, “Tensorflow federated stack overflow dataset,” Online: https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/stackoverflow, 2019.

[61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[62] B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, and Q. V. Le, “Learning data augmentation strategies for object detection,” in European Conference on Computer Vision. Springer, 2020, pp. 566–583.

[63] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S.
Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

[64] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., “Ray: A distributed framework for emerging AI applications,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 561–577.

[65] U. C. Bureau, “Percentage distribution of household income in the U.S. in 2020,” in Statista, September 2021, retrieved May 18, 2022, from https://www.statista.com/statistics/203183/percentage-distribution-of-household-income-in-the-us.

[66] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary, “Federated learning with personalization layers,” arXiv preprint arXiv:1912.00818, 2019.