FEDERATED LEARNING BENCHMARKS AND FRAMEWORKS FOR ARTIFICIAL INTELLIGENCE OF THINGS

By Samiul Alam

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Master of Science

2023

ABSTRACT

The growing integration of the Internet of Things (IoT) and Artificial Intelligence (AI), commonly referred to as the Artificial Intelligence of Things (AIoT), has amplified the importance of Federated Learning (FL). However, the application of FL in AIoT is challenged by the lack of authentic IoT datasets and the constraints associated with model-homogeneous FL approaches. Addressing these gaps, this thesis introduces two significant contributions: FedAIoT and FedRolex. FedAIoT is a comprehensive FL benchmark designed for AIoT, encompassing eight diverse datasets collected from a wide range of IoT devices. It offers a unified end-to-end FL framework, making it an invaluable tool for standardizing AIoT-based FL applications. The framework is available at https://github.com/AIoT-MLSys-Lab/FedAIoT. FedRolex, in turn, is a novel Partial Training (PT)-based model-heterogeneous FL approach. With an emphasis on the device heterogeneity typical of AIoT applications, FedRolex enables the training of a global server model that is larger than any client model by using a rolling sub-model extraction scheme. This approach mitigates client drift, improves the performance attainable with low-end devices, and narrows the gap between model-heterogeneous and model-homogeneous FL. Benchmark results indicate that FedRolex outperforms existing PT-based model-heterogeneous FL methods, making it a valuable resource for researchers and practitioners in the field of FL for AIoT. Our code is available at https://github.com/AIoT-MLSys-Lab/FedRolex.

ACKNOWLEDGEMENTS

I am deeply indebted to my advisor, Dr. Mi Zhang, whose guidance, patience, and expert counsel were pivotal in the successful completion of this research. The opportunity to conduct research under his supervision within his laboratory was an academically enriching experience and a privilege that I have found both profoundly enjoyable and intellectually stimulating. My earnest appreciation extends to my distinguished thesis committee members - Dr. Zhichao Cao, Dr. Luyang Liu, and Dr. Guan-Hua Tu. Their perceptive feedback, incisive criticisms, and judicious advice significantly shaped this work, imbuing it with the depth and breadth that it possesses. I would also like to formally acknowledge my colleagues, Tuo Zhang and Tiantian Feng, for their collaborative contributions. Our rigorous discussions, intense brainstorming sessions, and invaluable exchange of ideas provided essential refinement to my research methodology and thesis presentation. Lastly, I must express my profound gratitude to my family - my parents, my spouse, and my cherished daughter. Their unwavering support, unyielding faith in my abilities, and unbounded love have consistently served as my beacon during this rigorous academic journey. This work is a testament to their support and I dedicate it to them, with profound respect and heartfelt love.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Background
  1.2 Contributions
  1.3 Thesis Organization
CHAPTER 2 FEDAIOT: A FEDERATED LEARNING BENCHMARK FOR ARTIFICIAL INTELLIGENCE OF THINGS
  2.1 Related Work
  2.2 Design of FedAIoT
  2.3 Experimental Setup
  2.4 Experiments and Analysis
CHAPTER 3 FEDROLEX: MODEL-HETEROGENEOUS FEDERATED LEARNING WITH ROLLING SUB-MODEL EXTRACTION
  3.1 Related Work
  3.2 Methodology
  3.3 Experiments
CHAPTER 4 LIMITATIONS AND FUTURE WORK
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

1.1 Background

The advent and proliferation of the Internet of Things (IoT) has dramatically changed the way we interact with the world. A vast array of IoT devices, such as smartphones, smartwatches, drones, and sensors deployed in homes, collect massive amounts of data daily [1]. These devices, combined with advances in Artificial Intelligence (AI), have driven the integration of AI and IoT, giving rise to the Artificial Intelligence of Things (AIoT). However, IoT-collected data often contain privacy-sensitive information, making federated learning (FL) an increasingly crucial approach for handling AIoT data [2, 3]. Traditionally, FL studies have concentrated on the model-homogeneous setting, where the server model and the client models across all participating client devices are identical [4, 5, 6, 7]. However, given the diversity of client devices in terms of on-device resources and the trend toward ever larger machine learning models [8], this approach has run into constraints for AIoT applications. Additionally, the majority of existing FL works are conducted on well-known datasets such as CIFAR-10 and CIFAR-100. These datasets, however, do not originate from authentic IoT devices and thus fail to capture the unique modalities and inherent challenges associated with real-world IoT data. To assess the effectiveness of FL algorithms on IoT devices, a dedicated benchmark is crucial.

1.2 Contributions

The contributions of this thesis are twofold. First, we address a critical gap in the FL field: the datasets typically used do not originate from authentic IoT devices and therefore fail to capture the unique modalities and inherent challenges associated with real-world IoT data. To fill this gap, we introduce FedAIoT, an FL benchmark for AIoT. FedAIoT comprises eight datasets collected from a diverse range of authentic IoT devices and encapsulates a variety of unique modalities targeting representative AIoT applications. Second, to relax the constraints of device heterogeneity and handle the emerging challenges of large models, we introduce FedRolex, a novel partial training (PT)-based model-heterogeneous FL approach.
By using a rolling sub-model extraction scheme, FedRolex ensures that all parameters of the global server model are evenly trained over the local data of client devices [9, 10, 11]. This approach offers several merits, including mitigation of client drift, reduced communication overhead, and compatibility with secure aggregation protocols that enhance the privacy properties of FL systems [12].

1.3 Thesis Organization

Our objective is to address both the challenges of device and model heterogeneity in federated learning and the lack of authentic IoT datasets in current FL studies, fostering advancement in the rapidly evolving field of FL for AIoT. The thesis treats these two aspects in two separate chapters. Chapter 2 discusses the contributions and results of FedAIoT and how it provides a unified end-to-end FL framework for AIoT, spanning non-IID data partitioning, data preprocessing, AIoT-friendly models, and FL hyperparameters. Chapter 3 revolves around the FedRolex algorithm, its performance, and its relevance. Finally, we conclude the thesis after discussing limitations and future work.

CHAPTER 2
FEDAIOT: A FEDERATED LEARNING BENCHMARK FOR ARTIFICIAL INTELLIGENCE OF THINGS

2.1 Related Work

The importance of data to FL research has pushed the development of FL benchmarks on a variety of data modalities. Existing FL benchmarks, however, predominantly center around curating federated datasets in the domains of computer vision (CV) [13, 14, 15, 16, 17], natural language processing (NLP) [18, 14, 16, 17], medical imaging [19], speech and audio [20, 17], and graph neural networks [21]. For example, LEAF [14] is one of the earliest FL benchmarks and comprises six datasets dedicated to CV and NLP; FedCV [13], FedNLP [18], and FedAudio [20] focus on CV, NLP, and audio-related datasets and tasks respectively; FedScale [16] provides an assortment of 20 federated datasets mainly in CV and NLP applications, placing a distinct emphasis on system-related aspects; FLUTE [17] covers a mix of datasets from CV, NLP, and audio; and FLamby [19] presents seven healthcare-related datasets including five medical imaging datasets. Although these benchmarks have significantly contributed to FL research, a dedicated FL benchmark explicitly tailored to IoT data is absent. FedAIoT is specifically designed to fill this critical gap by providing a dedicated benchmark that focuses on data collected by a wide range of authentic IoT devices.

2.2 Design of FedAIoT

2.2.1 Datasets

Table 2.1 provides an overview of the eight datasets included in FedAIoT. In this section, we provide a brief overview of each included dataset.

WISDM: The widely used Wireless Sensor Data Mining (WISDM) dataset [22, 31] offers accelerometer and gyroscope sensor data collected from smartphones and smartwatches for daily activity recognition. Data was collected from 51 participants who performed 18 daily activities in 3-minute sessions. Activities like eating soup, chips, pasta, and sandwiches were unified into a single category, "eating", while activities such as kicking, catching, or dribbling balls were
eliminated due to their rarity. For our training and test set partition, we selected 45 participants for the training set and the remaining 6 for the test set. To accommodate real-life scenarios where individuals may not always use a smartphone and wear a smartwatch simultaneously, we divided WISDM into WISDM-W (smartwatch data) and WISDM-P (smartphone data). The sample counts for the training and test sets are 16,569 and 4,103 for WISDM-W and 13,714 and 4,073 for WISDM-P respectively. Licensing details are not explicitly mentioned on the dataset homepage.

Table 2.1 Overview of the datasets included in FedAIoT.

Dataset               IoT Platform       Data Modality                  Data Dimension  Dataset Size  # Training Samples  # Clients
WISDM-W [22]          Smartwatch         Accelerometer, Gyroscope       200 × 6         294 MB        16,569              80
WISDM-P [22]          Smartphone         Accelerometer, Gyroscope       200 × 6         253 MB        13,714              80
UT-HAR [23]           Wi-Fi Router       Wireless Signal                3 × 30 × 250    854 MB        3,977               20
Widar [24, 25]        Wi-Fi Router       Wireless Signal                22 × 20 × 20    3.3 GB        11,372              40
VisDrone [26]         Drone              Images                         3 × 224 × 224   1.8 GB        6,471               30
CASAS [27]            Smart Home         Motion, Door, Thermostat       2000 × 1        233 MB        12,190              60
AEP [28]              Smart Home         Energy, Humidity, Temperature  18 × 1          12 MB         15,788              80
EPIC-SOUNDS [29, 30]  Augmented Reality  Acoustics                      400 × 128       34 GB         60,055              210

UT-HAR: The UT-HAR dataset [23] provides Channel State Information (CSI) for contactless activity recognition tasks. The CSI is collected via three pairs of antennas and an Intel 5300 Network Interface Card (NIC), with each antenna pair capturing 30 subcarriers of CSI. The dataset incorporates activities like walking and running performed by various participants. UT-HAR comes with a pre-set training and test set, totaling 3,977 and 500 samples respectively.

Widar: The Widar dataset [24, 25] is a Wi-Fi dataset designed for contactless gesture recognition, and records Wi-Fi signal strength measurements collected via Wi-Fi access points. Data is collected from 17 participants performing 22 distinct gestures. However, to maintain consistency, only those gestures performed by more than three users are included in the experimental dataset. As a result, our balanced dataset encompasses nine gestures with 11,372 training and 5,222 test samples. Widar is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

VisDrone: The VisDrone dataset [26] is an extensive collection dedicated to object detection in aerial images captured by drone cameras. It consists of 263 video clips, containing 179,264 frames and 2,512,357 labeled objects. The objects fall into 12 categories and are recorded in various scenarios like crowded urban areas, highways, and parks. The training and test sets comprise 6,471 and 1,610 samples respectively and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

CASAS: Derived from the CASAS smart home project, the CASAS dataset [27] uses sensor data for recognizing activities of daily living (ADL) to support independent living. Data from three distinct apartments equipped with motion, temperature, and door sensors are collected. We selected five specific datasets - 'Milan', 'Cairo', 'Kyoto2', 'Kyoto3', and 'Kyoto4' - for their uniform sensor data representation. The activities were consolidated into 11 home activity categories. The dataset, split into training and test sets with an 80-20 ratio, includes 12,190 training and 3,048 test samples. No explicit license information is provided for this dataset.
AEP: The Appliances Energy Prediction (AEP) dataset [28] collects data from energy, temperature, and humidity sensors installed in a home for the task of predicting home energy usage. The data, captured every 10 minutes over 4.5 months, includes 15,788 training and 3,947 test samples. Licensing information is not explicitly mentioned.

EPIC-SOUNDS: The EPIC-SOUNDS dataset [29] is a large-scale collection of audio recordings for audio-based human activity recognition in Augmented Reality applications. It offers over 100k categorized segments across 44 distinct classes, captured via a head-mounted microphone. The dataset includes pre-determined training and test sets with 60,055 and 40,175 samples respectively, and is licensed under CC BY 4.0.

2.2.1.1 WISDM

The WISDM dataset comprises raw accelerometer and gyroscope data collected from 51 subjects performing 18 activities for three minutes each. Data were gathered at a 20 Hz sampling rate from both a smartphone (Google Nexus 5/5x or Samsung Galaxy S5) and a smartwatch (LG G Watch). Data for each device and sensor type are stored in different directories, resulting in four directories overall. Each directory contains 51 files, each corresponding to a subject. Each data entry records a subject ID, an activity code, a timestamp, and the corresponding sensor readings.

Figure 2.1 Architecture of the end-to-end FL framework for AIoT incorporated in FedAIoT (non-IID data partitioning, data preprocessing, AIoT-friendly models, FL hyperparameters, and IoT factors).

Separate files for the gyroscope and accelerometer readings are provided and are later combined by matching timestamps. Subject IDs run from 1600 to 1650 and the activity code is an alphabetical character between 'A' and 'S', excluding 'N'. The timestamp is in Unix time. The code to read and partition the data into 10-second segments is provided by our benchmark. The input shape of the processed data is 200 × 6. The original dataset is available at https://rb.gy/xla1i.

2.2.2 Input-Output Formats and Sourcing

2.2.2.1 UT-HAR

The UT-HAR dataset was collected using the Linux 802.11n Channel State Information (CSI) Tool for the task of Human Activity Recognition (HAR). The original data consist of two file types: "input" and "annotation". "input" files contain Wi-Fi CSI data. The first column indicates the timestamp in Unix time. Columns 2-91 represent amplitude data for 30 subcarriers across three antennas, and columns 92-181 contain the corresponding phase information. "annotation" files provide the corresponding activity labels, serving as the ground truth for HAR. In our benchmark, only amplitude is used. The final samples are created by taking a sliding window of size 250, where each sample consists of amplitude information across three antennas and 30 subcarriers and has shape 3 × 30 × 250. The original dataset is available at https://github.com/ermongroup/Wifi_Activity_Recognition/tree/master.

2.2.2.2 Widar

The Widar dataset (Widar3.0) was collected with a system comprising one transmitter and three receivers, all equipped with Intel 5300 wireless NICs. The system uses the Linux CSI Tool to record the Wi-Fi data. Devices operate in monitor mode on channel 165 at 5.825 GHz.
The transmitter broadcasts 1,000 Wi-Fi packets per second while the receivers capture data using their three linearly arranged antennas. In our benchmark, we use the processed body velocity profile (BVP) features extracted from the dataset. The size of each data sample after processing is 22 × 20 × 20, consisting of 22 time steps, each with 20 BVP features in the x direction and 20 in the y direction. The raw dataset is available for download at http://tns.thss.tsinghua.edu.cn/widar3.0/index.html.

2.2.2.3 VisDrone

The VisDrone dataset was collected by the AISKYEYE team at Tianjin University, China. It comprises 288 video clips with 261,908 frames and 10,209 static images captured by cameras mounted on drones in 14 different cities in China under diverse environments, scenarios, weather, and lighting conditions. The frames were manually annotated with over 2.6 million bounding boxes of common targets like pedestrians, cars, and bicycles. Additional attributes like scene visibility, object class, and occlusion are also provided for enhanced data utilization. The dataset is available at https://github.com/VisDrone/VisDrone-Dataset.

2.2.2.4 CASAS

The CASAS dataset is a collection of data generated in smart home environments, where intelligent software uses sensors deployed at homes to monitor resident activities and conditions within the space. The CASAS project considers environments as intelligent agents and employs custom IoT hardware known as Smart Home in a Box (SHiB), which encompasses the necessary sensors, devices, and software. The sensors in SHiB perceive the status of residents and their surroundings, and through controllers, the system acts to enhance living conditions by optimizing comfort, safety, and productivity. The CASAS dataset includes the date (in yyyy-mm-dd format), time (in hh:mm:ss.ms format), sensor name, sensor readings, and an activity label in string format. The data were collected in real time as residents went about their daily activities. The code to extract categorical sensor readings and create input sequences and labels is provided in our benchmark. The CASAS dataset can be downloaded from https://casas.wsu.edu/datasets/.

2.2.2.5 AEP

The AEP dataset, collected over 4.5 months, comprises readings taken every 10 minutes from a ZigBee wireless sensor network monitoring house temperature and humidity. Each wireless node transmitted data around every 3.3 minutes, which were then averaged over 10-minute periods. Additionally, energy data was logged every 10 minutes via m-bus energy meters. The dataset includes attributes such as date and time (in year-month-day hour:minute:second format), the energy usage of appliances and lights (in Wh), temperature and humidity in various rooms including the kitchen (T1, RH1), living room (T2, RH2), laundry room (T3, RH3), office room (T4, RH4), bathroom (T5, RH5), ironing room (T7, RH7), teenager room (T8, RH8), and parents' room (T9, RH9), and temperature and humidity outside the building (T6, RH6) - all with temperatures in Celsius and humidity in percentages. Additionally, weather data from Chievres Airport, Belgium was incorporated, consisting of outside temperature (To, in Celsius), pressure (in mm Hg), humidity (RHout, in %), wind speed (in m/s), visibility (in km), and dew point (Tdewpoint, in °C). The dataset is available at https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction.
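As a concrete illustration of this format, the sketch below loads the public UCI CSV and assembles the 18 × 1 input listed in Table 2.1 from the room temperature and humidity attributes. The file and column names (energydata_complete.csv, T1, RH_1, Appliances) are assumptions based on the public UCI release, not anything specified in this thesis.

```python
import pandas as pd

# Assumed file and column names from the public UCI release of the
# Appliances Energy Prediction dataset; adjust to match a local copy.
df = pd.read_csv("energydata_complete.csv", parse_dates=["date"])

# Target: appliance energy usage (Wh). Inputs: the nine room
# temperatures and nine humidities described above, i.e. 18 features
# per 10-minute reading, matching the 18 x 1 dimension in Table 2.1.
target = df["Appliances"]
feature_cols = [f"T{i}" for i in range(1, 10)] + \
               [f"RH_{i}" for i in range(1, 10)]
features = df[feature_cols]
print(features.shape, target.shape)  # one row per 10-minute reading
```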
2.2.2.6 EPIC-SOUNDS

As an extension of the EPIC-KITCHENS-100 dataset, the EPIC-SOUNDS dataset focuses on annotating distinct audio events in the videos of EPIC-KITCHENS-100. The annotations include the time intervals during which each audio event occurs, along with a text description explaining the nature of the sound. Given the variation in video lengths in the dataset, which range from 30 seconds to 1.5 hours, the videos are segmented into clips of 3-4 minutes each to make the annotation process more manageable. To ensure that annotators concentrate solely on the audio aspects, only the audio stream is provided to them; this decision prevents bias that could be introduced by the visual and contextual elements in the videos. Additionally, annotators are given access to the plotted audio waveforms. These visual representations of the audio data guide the annotators in pinpointing specific sound patterns, making the annotation process more efficient and targeted. The EPIC-SOUNDS dataset can be extracted from the EPIC-KITCHENS-100 dataset with the GitHub repo at https://github.com/epic-kitchens/epic-sounds-annotations. The extracted audio data in HDF5 file format can also be requested.

2.2.3 End-to-End Federated Learning Framework for AIoT

To benchmark the performance of the datasets and facilitate future research in FL for AIoT, we have designed and developed an end-to-end FL framework for AIoT as another key part of FedAIoT. As illustrated in Figure 2.1, our framework covers the complete FL-for-AIoT pipeline, which includes five components: (1) non-IID data partitioning, (2) data preprocessing, (3) AIoT-friendly models, (4) FL hyperparameters, and (5) an IoT-factor emulator. In this section, we describe each component within the framework in detail.

2.2.3.1 Partitioning Non-IID Data

The primary goal of non-IID data partitioning is to split the training set in a manner that results in each client having data that follows a non-IID distribution. The eight datasets incorporated into FedAIoT cover three fundamental tasks: classification, regression, and object detection. Consequently, FedAIoT employs three non-IID data partitioning methods tailored to these tasks.

Scheme#1: Non-IID Partitioning Based on Output Labels. This scheme is applied to classification tasks (WISDM-W, WISDM-P, UT-HAR, Widar, CASAS, EPIC-SOUNDS), which involve C classes. We initially establish a distribution over these classes for each client, utilizing a Dirichlet Distribution with a parameter α [32]. Lower values of α create a skewed distribution favoring a few classes, while higher values lead to a more balanced class distribution. In addition, we generate a distribution over the total number of samples using the same Dirichlet Distribution, which is then used to assign a varying number of samples to each client. This methodology enables us to create non-IID data partitions that more accurately represent real-world scenarios where the class distribution and the number of samples can differ across clients.
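To make Scheme#1 concrete, here is a minimal sketch of one common way to realize a Dirichlet label partition, drawing one Dirichlet(α) share vector per class and splitting that class's samples across clients accordingly; the function name and exact sampling details are illustrative assumptions, not the benchmark's actual API.

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients so that each class is shared
    across clients according to a Dirichlet(alpha) draw (Scheme#1).
    Small alpha -> each class concentrates on a few clients."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(labels.max() + 1):
        idx_c = np.flatnonzero(labels == c)
        rng.shuffle(idx_c)
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the shares into split points over this class's samples.
        cuts = (np.cumsum(shares)[:-1] * len(idx_c)).astype(int)
        for client_id, chunk in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(chunk.tolist())
    return [np.asarray(ix) for ix in client_indices]

# Toy example: 6 classes, 10 clients, highly skewed (alpha = 0.1).
labels = np.random.randint(0, 6, size=1000)
parts = dirichlet_label_partition(labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])  # uneven per-client sample counts
```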
Scheme#2: Non-IID Partitioning Based on Input Features. This scheme is used for object detection tasks (VisDrone), where there are no specific classes. Here, we use the input features to create non-IID partitions. We employ ImageNet features generated from a VGG19 model [33], which capture the essential visual information needed for further analysis. Using these ImageNet features, we conduct clustering in the feature space using K-nearest neighbors to divide the dataset into ten distinct clusters, treating each cluster as a pseudo-class. A Dirichlet Allocation is subsequently used on these pseudo-classes to generate the non-IID distribution across different clients.

Scheme#3: Non-IID Partitioning Based on Output Distribution. For regression tasks (e.g., the AEP dataset) characterized by a continuous output, we use Quantile Binning to convert the continuous variable into a categorical one. This process divides the output variable's range into ten equal groups, or quantiles, ensuring roughly equal sample sizes in each bin. These bins are then treated as pseudo-classes. After converting the continuous output into ten categories, we apply Dirichlet Allocation to generate a non-IID distribution of data across the clients.

2.2.3.2 Data Preprocessing

FedAIoT encompasses eight datasets, each requiring different data preprocessing techniques due to their unique data modalities. Given the diversity in sensor data and data modalities, the preprocessing techniques are tailored to remove outliers and reduce noise.

WISDM: We utilize standard preprocessing techniques used in accelerometer- and gyroscope-based activity recognition. Specifically, we extract samples from the raw accelerometer and gyroscope data sequences using a 10-second sliding window with a 50% overlap.

UT-HAR: We follow the method in [34], applying a sliding window of 250 packets with a 50% overlap.

Widar: We use the body velocity profile (BVP) processing technique, as outlined in [34, 25], to effectively handle and remove environmental variations from the data. We then apply standard scalar normalization for further refinement. This process creates data samples with the shape 22 × 20 × 20, reflecting the time axis and the x and y velocity features respectively.

VisDrone: We first normalize the images to the range of 0 to 1 to standardize pixel values. Data augmentation techniques such as random shifts in Hue, Saturation, and Value color space, image compression, shearing transformations, scaling transformations, horizontal and vertical flipping, and MixUp are then applied to enhance the diversity and generalizability of the dataset.

Table 2.2 Non-IID data partitioning schemes and models used for each dataset.

Dataset      Partition            Model
WISDM-W      Output Labels        LSTM [36]
WISDM-P      Output Labels        LSTM [36]
UT-HAR       Output Labels        ResNet18 [40, 34, 41]
Widar        Output Labels        ResNet18 [40, 34, 41]
VisDrone     Input Features       YOLOv8n [37]
CASAS        Output Labels        BiLSTM [35]
AEP          Output Distribution  MLP [38]
EPIC-SOUNDS  Output Labels        ResNet18 [29]

CASAS: We follow the approach in [35], transforming sensor readings into categorical sequences for semantic encoding. Unique temperature settings, motion sensor activations, and door sensor activations are each assigned distinct categorical values. We extract a sequence of the previous 2000 sensor activations for each recorded activity for modeling and prediction.

AEP: Temperature data are log-transformed to reduce skewness, and 'visibility' is binarized. Outliers below the 10th or above the 90th percentile are replaced with the corresponding percentile values. Central tendency and date features are added to capture time-related patterns. Principal component analysis is used for data reduction, and the output is normalized using a standard scaler.
EPIC-SOUNDS: We first apply the Short-Time Fourier Transform to the raw audio segments, using a Hann window of 10 ms duration with a 5 ms step size to ensure good spectral resolution. We extract 128 Mel Spectrogram features, a popular choice for audio classification tasks due to their ability to mimic the human auditory system. We apply natural logarithm scaling to the Mel Spectrogram output to further refine the data, and each segment is padded to a consistent length of 400.

2.2.3.3 AIoT-friendly Models

Our selection of models is informed by both state-of-the-art results, as referenced in [36, 34, 37, 35, 38, 39, 29], and the resource constraints of IoT platforms. It is unrealistic to expect that IoT platforms could accommodate large Transformer-based models for FL, so we prioritize AIoT-friendly models in FedAIoT. Table 2.2 lists the chosen models, and a detailed breakdown of each model's architecture can be found in Section 2.3.3.

2.2.3.4 Configuring FL Parameters

Degree of Data Heterogeneity. Non-IID data can significantly disrupt FL training due to issues such as gradient skew, which may impair the performance of the resulting model. In Section 2.2.3.1, we explained how FedAIoT can construct various data partitions, enabling researchers to simulate different levels of data heterogeneity as required by their experiments.

FL Optimizer Selection. FedAIoT is compatible with several frequently used FL optimizers. In our experimental section, we demonstrate the benchmark outcomes of two FL optimizers: FedAvg [4] and FedOPT [42].

Client Sampling Ratio. The client sampling ratio refers to the fraction of clients chosen for local training in each FL round. This critical hyperparameter can impact both the computational and communication costs of FL training. With FedAIoT, one can create various client sampling ratios and assess their influence on model performance and the speed of convergence during FL training.

2.2.3.5 Emulation of IoT Conditions

Simulation of Real-world Label Errors. In actual FL deployments on IoT devices, label noise is a common problem due to sources such as annotator bias, varying skill levels, and errors during labeling. To realistically simulate label errors in FL, we modify the original labels of a dataset using a confusion matrix, denoted as Q, where Q_{ij} gives the probability of changing the correct label i to an incorrect label j, that is, P(ŷ = j | y = i). Contrary to previous benchmark studies [20], which randomly built the confusion matrix Q, our strategy is to construct the confusion matrix based on the outcomes of centralized training. More specifically, we ascertain each element Q_{ij} by determining the proportion of samples labeled as j by the centrally trained machine learning model relative to the total number of samples with the correct label i. This method of constructing the confusion matrix ensures that it accurately reflects the labeling patterns seen during centralized training. We then use the confusion matrix Q as a guide to produce erroneous labels: to introduce a given erroneous label ratio ε, we randomly select the necessary number of data samples and alter their labels based on the probabilities given in Q. By integrating such realistic label errors into our FL simulations, we aim to offer a more robust evaluation of FL algorithms under realistic and challenging conditions.
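As a concrete illustration of this emulation step, the sketch below alters a requested fraction of labels by sampling replacements from the rows of Q; the helper name and interface are illustrative assumptions rather than the benchmark's exact code.

```python
import numpy as np

def inject_label_errors(labels, Q, error_ratio, seed=0):
    """Corrupt a fraction of labels using confusion matrix Q, where
    Q[i, j] = P(noisy label = j | true label = i) is estimated from
    centralized training as described above."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    # Randomly pick the samples whose labels will be altered.
    chosen = rng.choice(len(labels), size=int(error_ratio * len(labels)),
                        replace=False)
    for k in chosen:
        # Draw the new label from row Q[i]; the draw may keep the
        # original label, since Q's diagonal is typically nonzero.
        noisy[k] = rng.choice(len(Q), p=Q[labels[k]])
    return noisy

# Toy example: 3 classes with a mostly diagonal confusion matrix.
Q = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
y = np.random.randint(0, 3, size=500)
print((inject_label_errors(y, Q, 0.1) != y).mean())
```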
Training with Quantization. IoT devices frequently have significant resource limitations, making model quantization a necessity. Quantization reduces the numerical precision of computations and data in AI models, improving memory usage and computational efficiency. In FedAIoT, we implement two precision levels: full precision (float32) and half precision (float16). Although most research has primarily focused on applying quantization during the inference stage, it is equally important to understand the impact of training models under quantized conditions in the context of FL. Hence, our models were trained using both precision types. The goal is to explore the trade-off between computational efficiency and model accuracy, which is essential for overcoming the resource limitations of IoT devices and enabling FL for AIoT.

2.3 Experimental Setup

2.3.1 Experimental Hyperparameters

Hyperparameters for Table 2.3. For WISDM-W, the learning rate for centralized training was 0.01 and we trained for 200 epochs with batch size 64. For FedAvg, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.01 and trained for 400 communication rounds with batch size 32. For FedOPT, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.01 and a server learning rate of 0.01, again training for 400 communication rounds.

For WISDM-P, the learning rate for centralized training was 0.01 and we trained for 200 epochs with batch size 128. For FedAvg, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.008 and trained for 400 communication rounds with batch size 32. For FedOPT, in both low and high data heterogeneity scenarios, we used a client learning rate of 0.01 and a server learning rate of 0.01, again training for 400 communication rounds.

For UT-HAR and Widar, the learning rate for centralized training was 0.001 and the number of epochs was 500 and 200 respectively, with a batch size of 32. For both low and high data heterogeneity in both FedAvg and FedOPT, the client learning rate was 0.01, and the server learning rate for FedAvg and FedOPT was 1 and 0.01 respectively. The number of communication rounds was 1200 and 900 for UT-HAR and Widar respectively, with a batch size of 32.

For VisDrone, we used a cosine learning rate scheduler with T_0 = 10 and T_mult = 2 and trained for 200 epochs with a learning rate of 0.1 and batch size 12. For all the experiments on VisDrone, the client learning rate was also 0.1 and the batch size was 12. For FedOPT, the server learning rate was 0.1.

For CASAS, the centralized learning rate was 0.1 with batch size 128. For the federated setting, the client learning rate was 0.005, the batch size was 32, and we trained for 400 rounds. For FedOPT, the server learning rate was 0.01.

For AEP, the learning rate for centralized training was 0.001 with batch size 32, trained for 1200 epochs. For federated experiments, the client learning rate was 0.01 and the batch size was 32. For FedOPT, the server learning rate was 0.1.

For EPIC-SOUNDS, the centralized learning rate was 0.1 with batch size 512, and the number of epochs was 120. For federated settings, we used a client learning rate of 0.1 and batch size 32. For FedOPT, the server learning rate was 0.01.

Hyperparameters for Table 2.4. The setup for all the datasets with a 10% client sampling ratio is the same as that of Table 2.3 under high data heterogeneity.
For the 30% client sampling ratio, the hyperparameters were kept the same as in the 10% client sampling ratio experiments, with the exception of CASAS, where the learning rate was set to 0.15.

Hyperparameters for Table 2.5. The hyperparameters were the same as those of Table 2.3 with a 10% sampling ratio under the high data heterogeneity scenario.

Hyperparameters for Table 2.6. The hyperparameters were the same as those of Table 2.3 with a 10% client sampling ratio under the high data heterogeneity scenario.

2.3.2 Base Data Partition Schemes

We implemented three base data partitioning schemes for simulating data partitioning in a federated setting.

Uniform Partition. Uniform partitioning samples from the main training dataset and assigns data to clients uniformly. This partitioning can be used as a baseline best-case scenario or to debug the functionality of federated algorithms in the benchmark.

Dirichlet Partition. This partition, as explained in Section 2.2, is designed to split a dataset into subsets for simulating a federated learning environment with multiple clients. It is the basis for all the partitioning techniques used in the analyses. It uses the Dirichlet distribution to allocate samples of different classes across clients, attempting to maintain a Dirichlet distribution while ensuring that each client receives at least a minimum number of samples.

Disjoint Label Partition. The Disjoint Label Partition scheme splits a dataset into multiple subsets such that each subset is allocated to a different client for the purpose of simulating federated learning. In this setup, each client is assigned a limited number of unique classes, C_l, from the dataset. Characterized by the maximum number of unique classes that can be assigned to each user, it systematically organizes the dataset entries according to their labels and distributes them among the clients, ensuring that each client gets a disjoint set of labels. Each client receives indices from C_l unique classes, with shards of data being divided among the clients. If there is any leftover data, it is evenly distributed among the shards.

Manual Partition. We also provide ways to induce custom partitions using a data-client mapping. This can be used to test unique partitioning schemes or to use natural partitioning if available in the datasets.

2.3.3 Model Architectures

2.3.3.1 WISDM

For WISDM, we use a custom LSTM model that consists of an LSTM layer followed by a feed-forward neural network. The LSTM layer has an input dimension of 6 and a hidden dimension of 6. After the LSTM layer, the output is flattened and passed through a dropout layer with a rate of 0.2 for regularization. It then goes through a fully connected linear layer with an input size of 1,200 (6 hidden units × 200 timesteps) and an output size of 128, followed by a ReLU activation function. Another dropout layer with a rate of 0.2 is applied before the final fully connected linear layer with an input size of 128 and an output size of 12.
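The description above maps almost line for line onto PyTorch. A minimal sketch is given below; details not fixed by the text (e.g., the batch_first layout and where the activations sit) are our assumptions.

```python
import torch
import torch.nn as nn

class WISDMLSTM(nn.Module):
    """LSTM classifier following Section 2.3.3.1: a 6-unit LSTM over
    200 x 6 windows, flattened into a small feed-forward head."""
    def __init__(self, num_classes=12, timesteps=200):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=6, batch_first=True)
        self.dropout1 = nn.Dropout(0.2)
        self.fc1 = nn.Linear(6 * timesteps, 128)  # 1,200 -> 128
        self.dropout2 = nn.Dropout(0.2)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):          # x: (batch, 200, 6)
        out, _ = self.lstm(x)      # (batch, 200, 6)
        out = self.dropout1(out.flatten(1))
        out = torch.relu(self.fc1(out))
        return self.fc2(self.dropout2(out))

model = WISDMLSTM()
print(model(torch.randn(4, 200, 6)).shape)  # torch.Size([4, 12])
```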
2.3.3.2 UT-HAR

For UT-HAR, we use a ResNet-18 model with a custom architecture designed for the Wi-Fi-based Human Activity Recognition (HAR) task. The model consists of an initial convolutional layer that reshapes the input into a 3-channel tensor, followed by the main ResNet architecture with 18 layers. This main architecture includes a series of convolutional blocks with residual connections, Group Normalization layers, ReLU activations, and max-pooling. Finally, there is an adaptive average pooling layer followed by a fully connected layer that outputs the class probabilities. The model uses 64 output channels in the initial layer and doubles the number of channels as it goes deeper. The last fully connected layer has 7 output units corresponding to the number of classes for the UT-HAR task.

2.3.3.3 Widar

For Widar, we also use a custom ResNet-18 model tailored to the Widar dataset. The model starts by reshaping the 22-channel input to 3 channels using two convolutional transpose layers, followed by a convolutional layer with 64 filters, Group Normalization, ReLU activation, and max-pooling. The core of the model consists of four layers of residual blocks (similar to the standard ResNet-18) with 64, 128, 256, and 512 filters. Each basic block within these layers contains two convolutional layers, Group Normalization, and ReLU activations. Finally, an adaptive average pooling layer reduces the spatial dimensions to 1 × 1, followed by a fully connected layer that outputs the class scores.

2.3.3.4 VisDrone

For VisDrone, we use the default YOLOv8n model from the Ultralytics library. YOLOv8n is the smallest YOLOv8 model variant, with the three scale parameters (depth, width, and the maximum number of channels) set to 0.33, 0.25, and 1024 respectively.

2.3.3.5 CASAS

For CASAS, we use a BiLSTM neural network composed of an embedding layer, a bidirectional LSTM, and a fully connected layer. The embedding layer takes input sequences with dimensions equal to the input dimension and converts them to dense vectors of size 64. The bidirectional LSTM layer has an input size of 64, the same number of hidden units, and processes the embedded sequences in both forward and backward directions. The output of the LSTM layer is connected to a fully connected layer with an input size of 128 (to account for the bidirectional LSTM concatenation) that outputs the logits for the 12 activities in the CASAS dataset.

2.3.3.6 AEP

For AEP, we use a custom multi-layer perceptron (MLP) neural network with five hidden layers and an output layer. The input layer accepts 18 features and passes them through a linear transformation to the first hidden layer with 210 units. The following hidden layers progressively scale the number of units up by factors of 2 and 4 and then scale back down; specifically, the sizes of the hidden layers are 210, 420, 840, 420, and 210 units respectively. Each hidden layer uses a ReLU activation function followed by a dropout layer with a dropout rate of 0.3 for regularization. The output layer has a single unit, and the output of the network is obtained by passing the activations of the last hidden layer through a final linear transformation.

2.3.3.7 EPIC-SOUNDS

For EPIC-SOUNDS, we again use a custom ResNet-18 model, which consists of a stack of convolutional layers followed by batch normalization and ReLU activation. The architecture begins with a 7 × 7 convolutional layer with stride 2, followed by a max-pooling layer. It then contains four blocks, each comprising a sequence of basic blocks with residual connections; specifically, each block contains two basic blocks, with output channel sizes of 64, 128, 256, and 512 respectively. Each basic block comprises two sets of 3 × 3 convolutional layers, each followed by batch normalization and ReLU activation. The first convolutional layer in the basic block has a stride of 2 in the second, third, and fourth blocks.
Finally, the model has an adaptive average pooling layer, which reduces the spatial dimensions to 1 × 1, followed by a fully connected layer with an output size of 44 classes.

2.4 Experiments and Analysis

We implemented FedAIoT using PyTorch [43] and Ray [44], and conducted our experiments on a combination of 8×NVIDIA A6000, 8×NVIDIA RTX8000, 8×NVIDIA RTX3090, and 10×NVIDIA A100 GPU clusters as needed. We ran each experiment three times with three random seeds and report both mean and standard deviation values.

Table 2.3 Overall performance.

                              Centralized      Low Data Heterogeneity (α = 0.5)   High Data Heterogeneity (α = 0.1)
Dataset       Metric                           FedAvg           FedOPT            FedAvg           FedOPT
WISDM-W       Accuracy (%)    74.05 ± 2.47     70.03 ± 0.13     71.50 ± 1.52      68.51 ± 2.21     65.76 ± 2.42
WISDM-P       Accuracy (%)    36.88 ± 1.08     36.21 ± 0.19     34.32 ± 0.84      34.28 ± 3.28     32.99 ± 0.55
UT-HAR        Accuracy (%)    95.24 ± 0.75     94.03 ± 0.63     94.10 ± 0.84      74.24 ± 3.87     87.78 ± 5.48
Widar         Accuracy (%)    61.24 ± 0.56     59.21 ± 1.79     56.26 ± 3.11      54.76 ± 0.42     47.99 ± 3.99
VisDrone      MAP-50 (%)      34.26 ± 1.56     32.70 ± 1.19     32.21 ± 0.28      31.23 ± 0.70     31.51 ± 2.18
CASAS         Accuracy (%)    83.70 ± 2.21     75.93 ± 2.82     76.40 ± 2.20      74.72 ± 1.32     75.36 ± 2.40
AEP           R²              0.586 ± 0.006    0.502 ± 0.024    0.503 ± 0.011     0.407 ± 0.003    0.475 ± 0.016
EPIC-SOUNDS   Accuracy (%)    45.67 ± 0.12     45.51 ± 1.07     42.39 ± 2.01      33.02 ± 5.62     37.21 ± 2.68

2.4.1 Overall Performance

First, we benchmark the FL performance of two FL optimizers, FedAvg and FedOPT, under low (α = 0.5) and high (α = 0.1) data heterogeneity levels, and compare it against centralized training.

Benchmark Results: Table 2.3 summarizes our results. We make three observations. (1) The data heterogeneity level and the FL optimizer have different impacts on different datasets. In particular, the performance of UT-HAR and Widar is very sensitive to the data heterogeneity level. In contrast, WISDM-P does not show a noticeable accuracy difference under FedAvg across data heterogeneity levels. (2) Under low data heterogeneity, FedAvg provides more stable performance than FedOPT and consistently achieves performance closer to centralized training across diverse data modalities. (3) Compared to the other datasets, CASAS, AEP, and WISDM-W have larger accuracy margins between centralized training and FL under low data heterogeneity. This indicates the need for more advanced FL algorithms for the CASAS, AEP, and WISDM-W datasets.

2.4.2 Impact of Client Sampling Ratio

IoT devices usually have significant communication restrictions, and hence the client sampling ratio is a critical hyperparameter for FL systems operating on AIoT devices. In this experiment,
we focus on two client sampling ratios: 10% and 30%. We recorded the maximum accuracy reached after completing 50%, 80%, and 100% of the total training rounds for both ratios under high data heterogeneity, thereby offering empirical evidence of how the model's performance and convergence rate are affected by the client sampling ratio.

Table 2.4 Impact of client sampling ratio.

                               Low Client Sampling Ratio (10%)                  High Client Sampling Ratio (30%)
Dataset      Training Rounds   50% Rounds      80% Rounds      100% Rounds      50% Rounds      80% Rounds      100% Rounds
WISDM-W      400               58.81 ± 1.43    63.82 ± 1.53    68.51 ± 2.21     65.57 ± 2.10    67.23 ± 0.77    69.21 ± 1.13
WISDM-P      400               29.49 ± 3.65    31.65 ± 1.42    34.28 ± 3.28     33.73 ± 2.77    34.01 ± 2.27    36.01 ± 2.23
UT-HAR       2000              61.81 ± 7.01    70.76 ± 2.23    74.24 ± 3.87     86.46 ± 10.90   90.84 ± 4.42    92.51 ± 2.65
Widar        1500              47.55 ± 1.20    50.65 ± 0.24    54.76 ± 0.42     53.93 ± 2.90    55.74 ± 2.15    57.39 ± 3.14
VisDrone     600               27.07 ± 3.09    31.05 ± 1.55    31.23 ± 0.70     30.56 ± 2.71    33.52 ± 2.90    34.85 ± 0.83
CASAS        400               71.68 ± 1.96    74.19 ± 1.26    74.72 ± 1.32     73.89 ± 1.16    74.68 ± 1.50    76.12 ± 2.03
AEP          3000              0.325 ± 0.013   0.371 ± 0.017   0.407 ± 0.003    0.502 ± 0.006   0.523 ± 0.014   0.538 ± 0.005
EPIC-SOUNDS  300               20.99 ± 5.19    25.73 ± 1.99    28.89 ± 2.82     23.70 ± 6.25    31.74 ± 7.83    35.11 ± 1.99

Benchmark Results: Table 2.4 summarizes our results. We make two observations. (1) An increased client sampling ratio is highly correlated with superior model accuracy (i.e., the highest accuracy within 100% of training rounds) across different IoT data modalities. This demonstrates the importance of the client sampling ratio to the final model performance at the end of FL. (2) However, a higher sampling ratio does not inherently guarantee faster model convergence. For example, WISDM-P, Widar, and EPIC-SOUNDS achieve higher model performance with the lower client sampling ratio at 50% of training rounds than with the higher one. This result underscores the complex dynamics between client participation and learning efficiency for different IoT data modalities.

2.4.3 Impact of Erroneous Labels

As elaborated in Section 2.2.3.5, we investigate the implications of erroneous labels. We assess the performance of our models when the label error ratio is set at 10% and 30%, juxtaposing these results with the control scenario that involves no label errors. Note that we only showcase this for WISDM, UT-HAR, Widar, CASAS, and EPIC-SOUNDS, as these are classification tasks and the concept of erroneous labels only applies to classification tasks.

Table 2.5 Impact of erroneous labels.

Erroneous Label Ratio  WISDM-W        WISDM-P        UT-HAR         Widar          CASAS          EPIC-SOUNDS
0%                     68.51 ± 2.21   34.28 ± 3.28   74.24 ± 3.87   54.76 ± 0.42   74.72 ± 1.32   28.89 ± 2.82
10%                    50.63 ± 4.19   28.85 ± 1.44   73.75 ± 5.67   34.03 ± 0.33   65.01 ± 2.98   21.43 ± 3.86
30%                    47.90 ± 3.05   27.68 ± 0.39   70.55 ± 3.27   27.20 ± 0.56   63.16 ± 1.34   13.30 ± 0.42

Benchmark Results: Table 2.5 summarizes our results. We make two observations. (1) As the ratio of erroneous labels increases, the performance of the models decreases across all the datasets, and the impact of erroneous labels varies across datasets. For example, WISDM-W experiences only a small performance drop at a 10% label error ratio, but its performance drops significantly when the label error ratio increases to 30%. In contrast, CASAS exhibits a more gradual decline in performance as the error ratio increases from 0% to 10% and from 10% to 30%.
(2) UT-HAR and EPIC-SOUNDS are very sensitive to label error and show a significant accuracy drop even at a 10% label error ratio.

2.4.4 Performance on Quantized Training

Lastly, we examine the impact of model quantization on federated learning, specifically using half precision (FP16). We assess the models' accuracy and memory usage under this quantization, comparing the results to those of the full-precision (FP32) models. Memory is measured by analyzing the GPU memory usage of a model when trained with the same batch size under a centralized setting.

Benchmark Results: Table 2.6 summarizes the model performance and memory usage at the two precision levels. We make three observations. (1) As expected, memory usage decreases significantly with FP16 precision, with reductions ranging from 57.0% to 63.3% across the datasets. (2) As shown in previous work [45], how model performance responds to the precision level varies depending on the dataset: for UT-HAR, CASAS, AEP, and EPIC-SOUNDS, the FP16 models maintain or even improve performance relative to the FP32 models. (3) In contrast, WISDM-W, WISDM-P, Widar, and VisDrone decline in performance when quantized to FP16 precision, with Widar declining drastically.

Table 2.6 Performance on quantized training.

                              FP32                                FP16
Dataset       Metric          Model Performance  Memory Usage     Model Performance  Memory Usage
WISDM-W       Accuracy (%)    68.51 ± 2.21       1444 MB          60.31 ± 5.38       564 MB (↓ 60.9%)
WISDM-P       Accuracy (%)    34.28 ± 3.28       1444 MB          30.22 ± 2.05       564 MB (↓ 60.9%)
UT-HAR        Accuracy (%)    74.24 ± 3.87       1716 MB          72.86 ± 4.49       639 MB (↓ 62.8%)
Widar         Accuracy (%)    54.76 ± 0.42       1734 MB          34.03 ± 0.33       636 MB (↓ 63.3%)
VisDrone      MAP-50 (%)      31.23 ± 0.70       8369 MB          29.17 ± 4.70       3515 MB (↓ 60.0%)
CASAS         Accuracy (%)    74.72 ± 1.32       1834 MB          72.86 ± 4.49       732 MB (↓ 60.1%)
AEP           R²              0.407 ± 0.003      1201 MB          0.469 ± 0.044      500 MB (↓ 58.4%)
EPIC-SOUNDS   Accuracy (%)    33.02 ± 5.62       2176 MB          35.43 ± 6.61       936 MB (↓ 57.0%)

2.4.5 Insights from Benchmark Results

Need for Resilience to High Data Heterogeneity: As presented in Table 2.3, datasets can exhibit a notable response to changes in data heterogeneity. We observe that CASAS, AEP, and EPIC-SOUNDS show a significant impact even at low data heterogeneity, while UT-HAR and Widar see a drastic decline under high data heterogeneity. These findings emphasize the need to develop advanced FL algorithms for data modalities that are sensitive to high data heterogeneity.

Need for Balancing Client Sampling Ratio and Resource Consumption of IoT Devices: Table 2.4 reveals that a higher sampling ratio can lead to improved performance in the long run. However, higher client sampling ratios generally entail increased communication bandwidth and energy consumption, which may not be desirable for IoT devices. Therefore, it is crucial to identify the sweet spot that balances the client sampling ratio against resource consumption.

Need for Resilience to Erroneous Labels: As demonstrated in Table 2.5, certain datasets exhibit high sensitivity to label errors, resulting in a significant drop in FL performance. Notably, both UT-HAR and EPIC-SOUNDS experience a drastic decrease in accuracy when faced with a 10% erroneous label ratio. Given the inevitability of label errors in real FL deployments, where private data remains unmonitored and uncalibrated except by the respective data owners, the development of label-error-resilient techniques is crucial for achieving reliable FL performance.
Table 2.7 Analysis of quantization demands.

Application           Dataset      IoT Platform         Representative Devices        Hardware RAM Size  Need Quantization
Activity Recognition  WISDM-W      Smartwatch           Apple Watch 8                 512 MB to 1 GB     Yes
                      WISDM-P      Smartphone           iPhone 14                     6 GB               No
Gesture Recognition   UT-HAR       Wi-Fi Router         TP-Link AX1800                64 MB to 1 GB      Yes
                      Widar        Wi-Fi Router         TP-Link AX1800                64 MB to 1 GB      Yes
Independent Living    CASAS        Smart Home           Raspberry Pi 4                1 GB to 8 GB       No
Energy Prediction     AEP          Smart Home           Raspberry Pi 4                1 GB to 8 GB       No
Object Detection      VisDrone     Drone                DJI Mavic 3 + Raspberry Pi 4  1 GB to 8 GB       Yes
Augmented Reality     EPIC-SOUNDS  Head-mounted Device  GoPro / AR Headset            1 GB to 8 GB       No

Need for Quantization: Table 2.7 highlights the importance of quantization in FL for all eight datasets. Notably, certain IoT devices, such as drones, lack sufficient RAM capacity for FL, so external hardware such as a Raspberry Pi 4 has to be incorporated as an assistive computing platform. The analysis in Table 2.6 reveals that the performance of VisDrone drops significantly from FP32 to FP16 precision, and that WISDM-W, UT-HAR, Widar, and VisDrone require computing memory that exceeds the representative hardware RAM limits at FP32 precision, underscoring the necessity of quantized training.
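A minimal sketch of the half-precision training condition, assuming a PyTorch model and data loader, is given below. It simply casts weights and inputs to float16 during local training, which is one straightforward reading of the FP16 setting in Table 2.6 rather than the benchmark's exact implementation.

```python
import torch
import torch.nn as nn

def local_train_fp16(model, loader, lr=0.01, device="cuda"):
    """One local FL training pass with weights and activations in FP16.
    Numerically sensitive models (e.g., Widar's, per Table 2.6) may
    instead need mixed precision via torch.cuda.amp."""
    model = model.to(device).half()               # FP16 weights
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        x, y = x.to(device).half(), y.to(device)  # FP16 activations
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    return model
```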
CHAPTER 3
FEDROLEX: MODEL-HETEROGENEOUS FEDERATED LEARNING WITH ROLLING SUB-MODEL EXTRACTION

3.1 Related Work

Knowledge Distillation (KD) Techniques in Heterogeneous FL. Knowledge distillation (KD) is a significant strategy for implementing model-heterogeneous Federated Learning (FL) across various devices [46]. Specifically, FedDF [47] implements KD by distilling knowledge from multiple classifiers trained with private data on different client devices; the logits from each classifier are applied to an unlabeled public dataset to help the server train a student model via KD. DS-FL [48] took a similar approach but also proposed a semi-supervised FL method that employs pseudo-labeling of public data to enhance performance. Group knowledge transfer, as introduced by FedGKT [49], transmits knowledge from client devices to a substantial model on the server without utilizing public data. Furthermore, Fed-ET [50] formulated a weighted consensus distillation method with diversity regularization, enabling the server to train a larger model with the aid of smaller client models. Nevertheless, KD-based methods present certain challenges: they often necessitate public data to attain competitive accuracy, with performance depending on the size and domain similarity of the public and client data [47, 50, 51]. Moreover, the use of client model weights in KD makes these methods misaligned with secure aggregation protocols, exposing them to potential backdoor attacks [52].

Heterogeneous FL via Partial Training (PT) Methods. To mitigate the limitations associated with KD-based techniques, partial training (PT) has been explored as a viable alternative for model-heterogeneous FL. Current PT-based approaches can typically be sorted into two main types: random and static sub-model extraction. In particular, Federated Dropout [9] employs a random extraction technique inspired by the commonly used dropout method in centralized training [53]. Although this integrates seamlessly into existing FL frameworks, Federated Dropout's effectiveness diminishes with increasing data heterogeneity and a smaller client cohort, as observed in [54]. HeteroFL [10] and FjORD [11], on the other hand, proposed a static extraction method where sub-models are always taken from a fixed portion of the global server model. This strategy, however, encounters two primary drawbacks. First, it restricts the global server model to the same size as the largest client model, thereby limiting its potential due to client resource constraints. Second, it mandates that different sub-models be trained only on clients with corresponding resources, leading to different parts of the global model being trained on different data distributions, which can harm overall performance, especially in high data heterogeneity scenarios. In response to these challenges, our research introduces a rolling sub-model extraction mechanism that effectively addresses the issues found in both random and static sub-model extraction methods.

3.2 Methodology

3.2.1 Formulation of Model-Heterogeneous FL

Let N denote N client devices with non-IID (non-identically and independently distributed) local data D = {D_1, D_2, ..., D_N}. Model-homogeneous FL trains a global model with parameters θ by solving the following optimization problem:

$$\min_{\theta} F(\theta) \triangleq \sum_{n=1}^{N} p_n F_n(\theta) \qquad (3.1)$$

with

$$F_n(\theta) \triangleq \frac{1}{m_n} \sum_{k=1}^{m_n} l(\theta; d_{n,k}), \qquad (3.2)$$

where D_n ≜ {d_{n,1}, d_{n,2}, d_{n,3}, ..., d_{n,m_n}} is the set of local data samples of client n and p_n is its corresponding weight such that p_n ≥ 0 and ∑_{n=1}^{N} p_n = 1. In comparison, in model-heterogeneous FL, clients train local models with heterogeneous capacities β = {β_1, β_2, ..., β_N}, and the local objective function of the nth client becomes

$$F'_n(\theta_n) \triangleq \frac{1}{m_n} \sum_{k=1}^{m_n} l(\theta_n; d_{n,k}). \qquad (3.3)$$

Here, β_n denotes the model capacity of client n, and we define it as the proportion of nodes extracted from each layer of θ for client n. The size of θ_n depends on β_n, and the parameters θ_n are obtained by selecting a sub-model from the global model θ, which can change from one round to another. If θ_n changes, the objective function also changes. For simplicity, we use the same notation l for the loss function for all clients and rounds, though it differs between clients and rounds. The key to model-heterogeneous FL is selecting θ_n from the global model θ given the model capacity β_n.

Figure 3.1 Overview of the rolling sub-model extraction scheme in FedRolex.

3.2.2 FedRolex: Model-Heterogeneous FL with Rolling Sub-Model Extraction

In the context of partial training (PT), FedRolex operates by training a sub-model at each client, extracted from the global server model, and then transmitting the relevant sub-model updates back to the server for aggregation. Figure 3.1 provides an illustrative depiction of how FedRolex functions, showing three cycles of federated training across two heterogeneous clients. In this scenario, one client is responsible for training a larger-capacity sub-model (on the left), while the other trains a smaller-capacity one (on the right). At a broad level, during each round, the server extracts sub-models of varying capacities from the global model and individually sends them to the clients that possess the capabilities necessary to handle them. Each client then trains the received sub-model on its local data and sends the heterogeneous updates back to the server. The server, in turn, compiles these updates, using the aggregated result to refresh the global model in preparation for the next round.
A detailed breakdown of the FedRolex procedure can be found in Algorithm 3.1. Central to the architecture of FedRolex are two critical design decisions, which we describe in detail below.

(1) Which sub-models are extracted for each client across different rounds? On the server, FedRolex employs a rolling window to methodically extract each sub-model from the global model. This rolling window progresses with each round, sequentially traversing all components of the global model across different rounds and looping in a manner that ensures the global model is uniformly trained until it reaches convergence. Consider Figure 3.1 as a reference: during round j, the large-capacity and small-capacity client models extracted from the global model consist of nodes a, b, c, d and c, d, e, respectively. Moving to round j + 1, the rolling window shifts by one step (this step size is a hyperparameter of FedRolex; further insights can be found in Section 3.3.7 as part of our ablation study), transforming the large-capacity and small-capacity client models into b, c, d, e and d, e, a, correspondingly. In a similar manner, when proceeding to round j + 2, the rolling window progresses yet another step, resulting in the large-capacity and small-capacity client models becoming c, d, e, a and e, a, b, respectively. Such a rolling sub-model extraction scheme can be formalized as follows.

Let θ_n^(j) denote the parameters of the sub-model extracted from the global model for client n in round j, K_i denote the total number of nodes in layer i of the global model, and S_{n,i}^(j) denote the node indices of layer i of the global model that belong to the extracted sub-model for client n in round j. Then layer i of the sub-model extracted by the rolling sub-model extraction scheme for client n in round j is given by

    S_{n,i}^{(j)} =
    \begin{cases}
      \{\hat{j}, \hat{j}+1, \ldots, \hat{j}+\lfloor \beta_n K_i \rfloor - 1\} & \text{if } \hat{j}+\lfloor \beta_n K_i \rfloor \le K_i, \\
      \{\hat{j}, \hat{j}+1, \ldots, K_i - 1\} \cup \{0, 1, \ldots, \hat{j}+\lfloor \beta_n K_i \rfloor - 1 - K_i\} & \text{otherwise},
    \end{cases}    (3.4)

where \hat{j} = j \bmod K_i.

Figure 3.2 Illustration of how sub-models are extracted by the random sub-model extraction scheme (left) and the static sub-model extraction scheme (right) over two rounds.

(2) How are heterogeneous sub-model updates aggregated to update the global model? FedRolex employs a straightforward selective averaging scheme with no client weighting to aggregate the heterogeneous sub-model updates sent from the clients. Specifically, it computes the average of the updates for each parameter of the global model separately, based on how many clients in a round updated that parameter; a parameter remains unchanged if no client updated it. Taking Figure 3.1 again as an example: in round j, the updates for a and b are obtained from the large-capacity model only, and the update for e comes from the small-capacity model only. In contrast, since c and d are part of both models, their updates are computed by averaging over both models. A minimal code sketch of both design decisions follows.
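The sketch below is a plain-Python illustration (ours, not the released FedRolex code) of Equation (3.4) and the selective averaging rule just described; layer parameters are modeled as simple lists so that the index arithmetic stays explicit.

import math

def rolling_indices(j, beta_n, K_i):
    # Node indices of layer i extracted for a client with capacity beta_n in
    # round j; the modular shift implements both cases of Equation (3.4).
    j_hat = j % K_i                                  # rolling-window start position
    width = math.floor(beta_n * K_i)                 # number of nodes the client holds
    return [(j_hat + t) % K_i for t in range(width)]

def selective_average(global_layer, client_updates):
    # client_updates: list of (indices, values) pairs from heterogeneous clients.
    # Each parameter is averaged over exactly the clients that updated it;
    # untouched parameters keep their previous global value.
    sums = [0.0] * len(global_layer)
    counts = [0] * len(global_layer)
    for indices, values in client_updates:
        for k, v in zip(indices, values):
            sums[k] += v
            counts[k] += 1
    return [sums[k] / counts[k] if counts[k] else global_layer[k]
            for k in range(len(global_layer))]

# One layer with K_i = 5 nodes and a client with beta_n = 4/5:
print(rolling_indices(0, 4/5, 5))   # [0, 1, 2, 3]
print(rolling_indices(1, 4/5, 5))   # [1, 2, 3, 4]   (window shifted by one)
print(rolling_indices(2, 4/5, 5))   # [2, 3, 4, 0]   (wraps around; second case of Eq. 3.4)

Because the window position depends only on the round index, every node of every layer is visited at the same rate, which is exactly the uniformity property that the statistical analysis in Section 3.2.4 quantifies.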
3.2.3 Comparison with Random and Static Sub-Model Extraction Schemes

Existing sub-model extraction schemes can be grouped into random-based (Federated Dropout) and static-based (HeteroFL, FjORD) methods. In this section, we describe the differences between them and the proposed rolling-based scheme employed in FedRolex. For comparison, the pseudocodes of both Federated Dropout and HeteroFL are included in Section 3.3.11.

Algorithm 3.1 FedRolex
Require: D_n, β_n ∀n ∈ N
Ensure: θ^(J)
1: Initialize θ^(0), N
2: for j = 0 to J − 1 do
3:   Sample subset M from N
4:   Broadcast θ^(j)_{m,S_{m,i}^(j)} to each client m ∈ M, ∀i, with S_{m,i}^(j) from Equation (3.4)
5:   for each client m ∈ M do
6:     ClientStep(θ^(j)_m, D_m)
7:   end for
8:   Aggregate θ^(j+1)_{[i,k]} according to Equation (3.10)
9: end for
10: function ClientStep(θ_n, D_n)
11:   m_n ← len(D_n)
12:   for k = 1 to m_n do
13:     θ_n ← θ_n − η∇l(θ_n; d_{n,k})
14:   end for
15:   return θ_n
16: end function

3.2.3.1 Comparison with Random Sub-Model Extraction Scheme

In a random sub-model extraction scheme, the sub-models are extracted from the global model at random in each round. As such, layer i of the sub-model extracted by the random sub-model extraction scheme for client n in round j is given by

    S_{n,i}^{(j)} = \{k_c \mid \text{integer } k_c \in [0, K_i - 1] \text{ for } 1 \le c \le \lfloor \beta_n K_i \rfloor\},    (3.5)

where a total of ⌊β_n K_i⌋ nodes are randomly chosen from the global model.

Discussion: As shown in Figure 3.2 (left), similar to the proposed rolling-based scheme, the sub-models extracted across different rounds by the random-based scheme have different architectures. However, because sub-models are selected at random in each round, the global model is trained less evenly, making it more vulnerable to client drift. In short, although the expected update frequency is the same for every index, the realized frequencies are not the same due to randomness. Consequently, the random-based scheme cannot balance the update frequencies of different parts of the global model, and it inevitably takes more rounds to update the whole global model. Moreover, as we show in Section 3.2.4, the expected number of rounds for Federated Dropout to select all I sub-models at least m times is on the order of I log(I) + I(m − 1) log log I, which is larger than that of FedRolex, mI.

3.2.3.2 Comparison with Static Sub-Model Extraction Scheme

In a static sub-model extraction scheme, the sub-models are always extracted from a designated part of the global model in each round. As such, layer i of the sub-model extracted by the static sub-model extraction scheme for client n in round j is given by

    S_{n,i}^{(j)} = \{0, 1, 2, \ldots, \lfloor \beta_n K_i \rfloor - 1\}.    (3.6)

Note that S_{n,i}^{(j)} does not depend on j. In other words, as shown in Figure 3.2 (right), the same sub-model is extracted for each client in every round. Moreover, the smaller-capacity and larger-capacity client models are not independent. As shown in Figure 3.2 (right), the small-capacity model {a, b, c} is part of the large-capacity model {a, b, c, d}, which in turn is part of the global-capacity model {a, b, c, d, e}. These are the two key differences between the static-based scheme and both the random-based and the proposed rolling-based schemes.

Discussion: Given these properties, the static-based scheme has two primary drawbacks. First, to cover the whole global model, some clients must train the full-size global model {a, b, c, d, e}. As such, the global model is restricted to the same size as the largest client model.
Second, as shown in Figure 3.2 (right), while a, b, and c will be trained on data from all three types of clients, d will not be trained on data from small-capacity clients, and e will only be trained on data from global-model-capacity clients. As a consequence, different parts of the global model are trained on data with different distributions, which inevitably degrades the global model training quality.

3.2.4 Statistical Analysis

Lemma 1. Given I indices, with one index chosen uniformly at random in each round, the expected number of rounds to choose all indices at least once is

    I \left( \frac{1}{I} + \frac{1}{I-1} + \cdots + \frac{1}{1} \right),

which is the same as

    I \int_0^{\infty} \left( 1 - (1 - e^{-t})^I \right) dt.

Proof. We denote by E(i) the expected number of rounds to choose exactly i indices at least once. Then E(1) = 1, because after the first round one index has been chosen. After the first round, the expected number of rounds to choose a new index is I/(I − 1), because one of the remaining I − 1 out of the total I indices needs to be chosen. That is, E(2) = E(1) + I/(I − 1). Similarly, we have

    E(i) = E(i-1) + \frac{I}{I+1-i}, \quad \forall i = 2, \ldots, I.

Thus, we have

    E(I) = E(I-1) + \frac{I}{1} = E(I-2) + \frac{I}{2} + \frac{I}{1} = \cdots = I \left( \frac{1}{I} + \frac{1}{I-1} + \cdots + \frac{1}{1} \right).

The lemma is proved. It shows that the expected number of rounds to choose all indices at least once is I log(I) as I → ∞. This proof cannot be generalized to the case of choosing all indices at least m times for m ≥ 2. Therefore, we provide an alternative proof [55, Example 5.17].

Alternative proof of Lemma 1. This proof treats the picking of indices as Poisson processes. Assume that the Poisson process for choosing one index has rate λ = 1. Since each index is chosen uniformly at random, choosing the jth index also follows a Poisson process, with rate 1/I, for any j [55, Proposition 5.2]. We let X_j be the time at which index j is first chosen, and

    X = \max_{1 \le j \le I} X_j    (3.7)

is the time by which all indices have been chosen at least once. Since all the X_j are independent with rate 1/I, we have

    P\{X < t\} = P\{\max_{1 \le j \le I} X_j < t\} = P\{X_j < t \text{ for } j = 1, \ldots, I\} = (1 - e^{-t/I})^I.

Therefore, we have

    E[X] = \int_0^{\infty} P\{X > t\}\, dt = \int_0^{\infty} \left( 1 - (1 - e^{-t/I})^I \right) dt.

We let N be the number of rounds to choose all indices at least once, and T_i be the ith interarrival time of the Poisson process for choosing one index. Then we have

    X = \sum_{i=1}^{N} T_i,

and the T_i are independent. Thus we have E[X | N] = N E[T_i] = N, which gives E[X] = E\{E[X | N]\} = E[N]. Thus we have

    E[N] = \int_0^{\infty} \left( 1 - (1 - e^{-t/I})^I \right) dt = I \int_0^{\infty} \left( 1 - (1 - e^{-t})^I \right) dt.

The lemma is proved.
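To make the contrast concrete, the short simulation below (our own numerical illustration, not part of the thesis's derivation) estimates the expected number of rounds random selection needs to cover all I indices, and compares it against the exact harmonic-sum expression from Lemma 1 and the I log(I) asymptote; a rolling window covers the same I indices in exactly I rounds.

import math
import random

def rounds_to_cover(I, m=1, trials=2000, seed=0):
    # Monte Carlo estimate of E[rounds] until every one of I indices has been
    # drawn at least m times under uniform random selection.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * I
        rounds = 0
        while min(counts) < m:
            counts[rng.randrange(I)] += 1
            rounds += 1
        total += rounds
    return total / trials

I = 50
exact = I * sum(1.0 / k for k in range(1, I + 1))   # Lemma 1: I * (1/I + ... + 1/1)
print(rounds_to_cover(I))    # simulated mean, close to the exact value
print(exact)                 # ~224.96
print(I * math.log(I))       # ~195.60, the asymptotic I*log(I) rate

With I = 50, random selection needs roughly 4.5 times as many rounds as the deterministic rolling window; Lemma 2 next quantifies the analogous gap for covering every index at least m times.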
Next, we present the lemma for choosing each index at least m times.

Lemma 2. Given I indices, with one index chosen uniformly at random in each round, the expected number of rounds to choose all indices at least m times is

    I \int_0^{\infty} \left( 1 - \left( 1 - S_m(t) e^{-t} \right)^I \right) dt,

where

    S_m(y) := 1 + y + \frac{y^2}{2!} + \cdots + \frac{y^{m-1}}{(m-1)!} = \sum_{l=0}^{m-1} \frac{y^l}{l!}.    (3.8)

Proof. We again treat the picking of indices as Poisson processes. Assume that the Poisson process for choosing one index has rate λ = 1. Since each index is chosen uniformly at random, choosing the jth index also follows a Poisson process with rate 1/I for any j. We let X_j be the time at which index j is chosen for the mth time, and

    X = \max_{1 \le j \le I} X_j    (3.9)

is the time by which all indices have been chosen at least m times. Since all the X_j are independent with rate 1/I, we have

    P\{X < t\} = P\{\max_{1 \le j \le I} X_j < t\} = P\{X_j < t \text{ for } j = 1, \ldots, I\} = \left( 1 - S_m(t/I) e^{-t/I} \right)^I.

Therefore, we have

    E[X] = \int_0^{\infty} P\{X > t\}\, dt.

We let N be the number of rounds to choose all indices at least m times, and T_i be the ith interarrival time of the Poisson process for choosing one index. Then we have

    X = \sum_{i=1}^{N} T_i,

and the T_i are independent. Thus we have E[X | N] = N E[T_i] = N, which gives E[X] = E\{E[X | N]\} = E[N]. Thus we have

    E[N] = \int_0^{\infty} \left( 1 - \left( 1 - S_m(t/I) e^{-t/I} \right)^I \right) dt = I \int_0^{\infty} \left( 1 - \left( 1 - S_m(t) e^{-t} \right)^I \right) dt.

The lemma is proved. It shows that the expected number of rounds to choose all indices at least m times is I log(I) + I(m − 1) log log I when I → ∞ [56].

3.2.5 Formal Definition of Selective Aggregation Scheme

Formally, let M ⊂ N be the set of clients selected from the client pool whose model parameters the server pulls in round j. Let θ_{[i,k]} be the kth parameter of layer i of the global model and θ_{m,[i,k]} be the kth parameter of layer i of client m. We denote by M_k ⊂ M the set of clients updating the kth parameter. The model parameters are aggregated as follows:

    \theta_{[i,k]} = \frac{1}{\sum_{m \in M_k} p_m} \sum_{m \in M_k} p_m \theta_{m,[i,k]},    (3.10)

where the client weight p_m is assigned based on factors such as the client model capacity and the number of data points the client has. Throughout this thesis, unless otherwise stated, the weight of all clients is assumed to be the same, i.e., p_m = 1/N.

3.3 Experiments

Datasets and Models. We evaluate the performance of FedRolex under two regimes. Under the small-model small-dataset regime, we train pre-activated ResNet18 (PreResNet18) models [57] on CIFAR-10 and CIFAR-100 [58]. We replace the batch normalization in PreResNet18 with static batch normalization [10, 59] and add a scalar module after each convolution layer [10]. Under the large-model large-dataset regime, we use Stack Overflow [60] and follow [3] to train a modified 3-layer Transformer [61] with a vocabulary of 10,000 words, where the dimension of the token embeddings is 128 and the hidden dimension of the feed-forward network (FFN) block is 2048. We use ReLU activation and 8 heads for the multi-head attention, where each head is based on 12-dimensional (query, key, value) vectors. The statistics of the datasets are listed in Table 3.1.

Table 3.1 Dataset statistics.

Dataset         Train Clients  Train Examples  Validation Clients  Validation Examples  Test Clients  Test Examples
CIFAR-10        100            50,000          N/A                 N/A                  N/A           10,000
CIFAR-100       100            50,000          N/A                 N/A                  N/A           10,000
Stack Overflow  342,477        135,818,730     38,758              16,491,230           204,088       16,586,035

Data Heterogeneity. We model non-IID distributions for CIFAR-10 and CIFAR-100 in line with HeteroFL [10], limiting each client to L labels. Two degrees of data heterogeneity are considered. For CIFAR-10, high data heterogeneity is defined as L = 2 and low data heterogeneity as L = 5. Similarly, for CIFAR-100, high and low data heterogeneity correspond to L = 20 and L = 50, respectively. These levels roughly align with a Dirichlet distribution Dir_K(α) with α = 0.1 and α = 0.5. With the Stack Overflow dataset, a non-IID distribution occurs naturally since the data is partitioned by user IDs.

Model Heterogeneity. In our evaluation, we consider five client model capacities, β = {1, 1/2, 1/4, 1/8, 1/16}. Here, for instance, 1/2 indicates that the client model capacity is half the size of the largest client model (the full model). For ResNet18, we alter the number of kernels in the convolution layers while maintaining the nodes in the output layers. In the case of the Transformer, the number of nodes in the hidden layer of the attention heads is varied. A code sketch of this width scaling follows.
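As one concrete illustration of this width scaling — a sketch assuming plain PyTorch and simple channel-wise slicing, not the exact experiment code — a β-capacity convolution can be materialized from the global layer's weights as follows; handling of batch normalization statistics and output layers is omitted.

import math
import torch
import torch.nn as nn

def extract_sub_conv(global_conv: nn.Conv2d, out_idx, in_idx):
    # Build a smaller Conv2d holding only the selected output/input channels.
    sub = nn.Conv2d(len(in_idx), len(out_idx),
                    kernel_size=global_conv.kernel_size,
                    stride=global_conv.stride,
                    padding=global_conv.padding,
                    bias=global_conv.bias is not None)
    with torch.no_grad():
        w = global_conv.weight[out_idx][:, in_idx]   # slice out-channels, then in-channels
        sub.weight.copy_(w)
        if global_conv.bias is not None:
            sub.bias.copy_(global_conv.bias[out_idx])
    return sub

g = nn.Conv2d(64, 128, 3, padding=1)                 # a "global" layer
beta = 1/4
out_idx = list(range(math.floor(beta * 128)))        # indices from the extraction scheme
in_idx = list(range(math.floor(beta * 64)))
client_layer = extract_sub_conv(g, out_idx, in_idx)  # 16 -> 32 channels

The out_idx and in_idx lists would come from the rolling scheme of Equation (3.4), the random scheme of Equation (3.5), or the static scheme of Equation (3.6).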
Baselines. FedRolex is compared with both state-of-the-art PT-based model-heterogeneous FL methods, namely Federated Dropout [9] and HeteroFL [10] (comparison with FjORD was omitted because its code is not open-source and its results could not be replicated following the paper), and KD-based model-heterogeneous FL methods, including FedDF [47], DS-FL [48], and Fed-ET [50] (FedGKT [49] was excluded as it is solely compatible with CNN models). To guarantee fairness, all PT-based baselines were trained using identical learning rates, numbers of communication rounds, and multi-step learning rate decay schedules. Specific details are provided in Section 3.3.10.

Configurations and Platform. We used bounding box crop [62] to augment images for CIFAR-10 and CIFAR-100. During each communication round, a random 10% of clients are selected from a pool of 100 clients. For Stack Overflow, following [3], a 10% dropout rate is applied to prevent overfitting, and 200 clients are randomly chosen from a pool of 342,477 clients in each round. Details on the hyper-parameters for model training are available in Section 3.3.10. Our experiments, conducted on eight NVIDIA A6000 GPUs, were implemented using PyTorch [63] and Ray [64] for FedRolex and the PT-based baselines.

Evaluation Metrics. Global and local model accuracy serve as our assessment metrics. The global model accuracy refers to the server model's performance on the test set, while the local model accuracy measures the server model's performance on each client's individual dataset. For CIFAR-10 and CIFAR-100, classification accuracy is reported. In the case of Stack Overflow, the next-word prediction accuracy is reported, encompassing both out-of-vocabulary (OOV) and end-of-sentence (EOS) tokens. Experiments are carried out with five different seeds for CIFAR-10 and CIFAR-100 and three seeds for Stack Overflow.

3.3.1 Performance Comparison with State-of-the-Art Model-Heterogeneous FL Methods

First, we compare the performance of FedRolex with state-of-the-art PT-based and KD-based model-heterogeneous FL methods. For a fair comparison, we followed the experimental settings used in prior arts, where the distributions of client model capacities are uniform and the global server model is the same size as the largest client model.

Evaluation Results: Table 3.2 summarizes our results. We make two observations. (1) In comparison with state-of-the-art PT-based methods, under the small-model small-dataset regime, FedRolex consistently outperforms HeteroFL and Federated Dropout under both low and the more challenging high data heterogeneity scenarios. In particular, under high data heterogeneity, Federated Dropout, which extracts sub-models randomly, performs worse than FedRolex and HeteroFL, which both extract sub-models in a deterministic manner. Under the large-model large-dataset regime, FedRolex also outperforms both HeteroFL and Federated Dropout. Together, these results demonstrate the superiority of FedRolex under both regimes. (2) In comparison with state-of-the-art KD-based methods, FedRolex only performs worse than Fed-ET and FedDF on CIFAR-10 under high data heterogeneity, but outperforms all the KD-based methods on the more challenging CIFAR-100, which has a larger number of classes than CIFAR-10, under both low and high data heterogeneity scenarios. It is important to note that KD-based methods leverage public data to boost their model accuracy while FedRolex does not.
Table 3.2 Global model accuracy comparison between FedRolex, PT-based and KD-based model-heterogeneous FL methods, and model-homogeneous FL methods. Note that the results of the KD-based methods were obtained from [50]. For Stack Overflow, since KD-based methods cannot be directly used for language modeling tasks, their results are marked as N/A.

                                   High Data Heterogeneity          Low Data Heterogeneity
          Method                   CIFAR-10        CIFAR-100        CIFAR-10        CIFAR-100        Stack Overflow
KD-based  FedDF                    73.81 (± 0.42)  31.87 (± 0.46)   76.55 (± 0.32)  37.87 (± 0.31)   N/A
          DS-FL                    65.27 (± 0.53)  29.12 (± 0.51)   68.44 (± 0.47)  33.56 (± 0.55)   N/A
          Fed-ET                   78.66 (± 0.31)  35.78 (± 0.45)   81.13 (± 0.28)  41.58 (± 0.36)   N/A
PT-based  HeteroFL                 63.90 (± 2.74)  52.38 (± 0.80)   73.19 (± 1.71)  57.44 (± 0.42)   27.21 (± 0.22)
          Federated Dropout        46.64 (± 3.05)  45.07 (± 0.07)   76.20 (± 2.53)  46.40 (± 0.21)   23.46 (± 0.12)
          FedRolex                 69.44 (± 1.50)  56.57 (± 0.15)   84.45 (± 0.36)  58.73 (± 0.33)   29.22 (± 0.24)
          Homogeneous (smallest)   38.82 (± 0.88)  12.69 (± 0.50)   46.86 (± 0.54)  19.70 (± 0.34)   27.32 (± 0.12)
          Homogeneous (largest)    75.74 (± 0.42)  60.89 (± 0.60)   84.48 (± 0.58)  62.51 (± 0.20)   29.79 (± 0.32)

3.3.2 Performance Comparison with Model-Homogeneous FL Methods

We also compare the global model accuracy of FedRolex with two model-homogeneous cases in which all the clients have the largest-capacity model (β = {1}) and the smallest-capacity model (β = {1/16}), representing the upper-bound and lower-bound performance, respectively.

Evaluation Results: As listed in Table 3.2, compared with the other PT-based methods, FedRolex reduces the gap in global model accuracy between the model-heterogeneous and the upper-bound model-homogeneous settings. In particular, FedRolex is on par with the upper-bound model-homogeneous case for Stack Overflow, whereas both HeteroFL and Federated Dropout perform even worse than the model-homogeneous case using the smallest model. This result indicates that with FedRolex, we are not constrained to using only high-end devices to achieve competitive global model accuracy. Note that Fed-ET achieves a higher global model accuracy than the model-homogeneous upper bound on CIFAR-10 under high data heterogeneity, which showcases the advantage of using public data.

3.3.3 Impact of Client Model Heterogeneity Distribution

In our previous experiments, the distributions of model capacities across client devices were set to be uniform. In this experiment, we aim to understand the impact of the client model heterogeneity distribution. To do so, without loss of generality, we use two client model capacities β = {1, 1/16} and vary the distribution ratio between the two (denoted as ρ), where ρ = 1 represents the case in which all the clients have the largest-capacity model (β = {1}) and ρ = 0 the case in which all the clients have the smallest-capacity model (β = {1/16}).

Figure 3.3 Impact of client model heterogeneity distribution on global model accuracy for (i) CIFAR-10, (ii) CIFAR-100, and (iii) Stack Overflow.

Evaluation Results: Figure 3.3 shows how the global model accuracy changes as ρ varies from 0 to 1 for CIFAR-10, CIFAR-100, and Stack Overflow. We make three observations. (1) For CIFAR-10 (Figure 3.3(i)), there is a large gap in global model accuracy between high and low data heterogeneity for a wide range of ρ (from 0.1 to 1). This is because CIFAR-10 is a relatively simple task, and hence the global model accuracy is bottlenecked by the level of data heterogeneity rather than by model capacity.
This result indicates that having more high-capacity models in the cohort contributes only marginally to global model accuracy. (2) For the more challenging CIFAR-100 (Figure 3.3(ii)), the gap in global model accuracy between high and low data heterogeneity is much smaller. In contrast to CIFAR-10, the global model accuracy is bottlenecked by the highest capacity of the models rather than by the level of data heterogeneity. (3) Across both regimes (Figure 3.3(i)(ii) vs. Figure 3.3(iii)), we observe that having a small fraction of large-capacity models significantly boosts the global model accuracy, but continuing to increase the ratio of large-capacity models contributes little additional accuracy.

3.3.4 Performance on Training Larger Server Model

Similar to Federated Dropout, one advantage of FedRolex over static sub-model extraction methods (HeteroFL and FjORD) is that FedRolex is able to train a global model that is larger than the largest client model. In this experiment, we aim to evaluate the performance of FedRolex on training larger server models. We therefore consider the case where the size of the global server model is γ = {2, 4, 8, 16} times the size of the client models. For simplicity, all client models have the same size.

Figure 3.4 Performance on training a larger server model when the server model is γ times the size of the client model for (i) CIFAR-10, (ii) CIFAR-100, and (iii) Stack Overflow.

Evaluation Results: Figure 3.4(i) and Figure 3.4(ii) compare FedRolex with Federated Dropout in terms of global model accuracy across different values of γ for CIFAR-10 and CIFAR-100, respectively. As shown, although the global model accuracy drops for both FedRolex and Federated Dropout as γ increases, especially from 1 to 4, FedRolex consistently achieves higher global model accuracy than Federated Dropout across γ = {2, 4, 8, 16} under both low and high data heterogeneity. For Stack Overflow (Figure 3.4(iii)), the global model accuracy drops much less as γ increases. This demonstrates the advantage of large-scale datasets for training larger server models.

3.3.5 Enhancing Inclusiveness of FL under Real-World Distribution

A primary vision of FedRolex is to enhance the inclusiveness of FL. To demonstrate this, in this experiment we use the real-world household income distribution to emulate a real-world device distribution. Specifically, we retrieve household income distribution information from [65]. We map β_n = 1/16 to the income group earning less than $75,000 and assign the remaining income groups, in $25,000 increments, to increasing values of β_n. The detailed mapping of model capacities to the corresponding income distribution is provided in Figure 3.7 in Section 3.3.10.

Evaluation Results: Table 3.3 shows both the global and local model accuracies of FedRolex for CIFAR-10 and CIFAR-100, as well as the global model accuracy on Stack Overflow, under the emulated real-world device distribution.
Table 3.3 Performance of FedRolex under the emulated real-world device distribution.

                                          High Data Heterogeneity           Low Data Heterogeneity
Dataset         Method                    Local Accuracy   Global Accuracy  Local Accuracy   Global Accuracy
CIFAR-10        Homogeneous (smallest)    85.90 (± 0.46)   38.82 (± 0.88)   66.02 (± 0.52)   46.86 (± 0.54)
                Homogeneous (largest)     95.54 (± 0.26)   75.74 (± 0.41)   93.54 (± 0.44)   84.48 (± 0.58)
                FedRolex                  94.05 (± 1.01)   63.17 (± 1.45)   91.03 (± 0.36)   80.14 (± 0.52)
CIFAR-100       Homogeneous (smallest)    34.51 (± 0.56)   12.69 (± 0.50)   33.22 (± 0.10)   19.70 (± 0.34)
                Homogeneous (largest)     81.99 (± 0.78)   60.89 (± 0.60)   76.43 (± 0.54)   62.51 (± 0.20)
                FedRolex                  73.33 (± 0.96)   45.78 (± 1.71)   66.31 (± 0.34)   48.44 (± 0.51)
Stack Overflow  Homogeneous (smallest)    27.32 (± 0.12)
                Homogeneous (largest)     29.79 (± 0.32)
                FedRolex                  29.55 (± 0.41)

Again, we compare with two model-homogeneous cases in which all clients have the smallest and the largest model capacities, representing the lower-bound and upper-bound accuracy, respectively. We make two observations. (1) Looking at the global model accuracy, FedRolex consistently outperforms the lower-bound model-homogeneous case across CIFAR-10, CIFAR-100, and Stack Overflow. This result indicates that FedRolex enhances the inclusiveness of FL and improves the accuracy of the global model, which could otherwise not be achieved. (2) Looking at the local model accuracy, FedRolex significantly outperforms the lower-bound model-homogeneous case on CIFAR-10 and CIFAR-100 under both low and high data heterogeneity. This result indicates that FedRolex effectively boosts the performance of low-end devices, which would otherwise not benefit from FL. A detailed illustration of how the local model accuracy distribution of individual clients shifts when FedRolex is used, compared to the smallest model-homogeneous case with the same client outreach, is shown in Figure 3.5.

Figure 3.5 Local model accuracy distribution of FedRolex (orange) vs. the smallest model-homogeneous case (blue) for CIFAR-10 and CIFAR-100 under low and high data heterogeneity.

3.3.6 Impact of Different Weighting Schemes

[50] reported that weighting clients is important for improving model accuracy. We therefore conducted an ablation study and evaluated three client weighting schemes: (1) a model size-based weighting scheme, where the client weight is proportional to the number of kernels in the model; (2) a model update-based weighting scheme, where the client weight is proportional to the number of updates; and (3) a hybrid weighting scheme, where the client weight is proportional to both the model size and the number of model updates (see the code sketch at the end of this section).

Table 3.4 Impact of weighting schemes on model accuracy under high data heterogeneity.

           Weighting Scheme    Local Model Accuracy  Global Model Accuracy
CIFAR-10   Non-Weighting       95.95 (± 0.81)        69.44 (± 1.50)
           Model Size-based    95.98 (± 0.67)        69.09 (± 1.42)
           Model Update-based  96.01 (± 0.71)        68.83 (± 0.89)
           Hybrid              96.05 (± 0.96)        68.78 (± 0.89)
CIFAR-100  Non-Weighting       81.58 (± 0.59)        56.57 (± 0.15)
           Model Size-based    81.23 (± 1.56)        56.99 (± 0.27)
           Model Update-based  81.23 (± 1.07)        56.63 (± 0.36)
           Hybrid              81.49 (± 1.07)        56.71 (± 0.20)

Table 3.4 lists the results. As shown, the performance of the three weighting schemes is not significantly better than that of the non-weighting scheme. Therefore, we used the non-weighting scheme in FedRolex.
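For concreteness, the sketch below (our own illustration; the per-client statistics are hypothetical) shows how the four schemes compared in Table 3.4 turn client statistics into the weights p_m used in Equation (3.10).

def client_weights(scheme, model_sizes, update_counts):
    # model_sizes[m]: number of kernels client m trains;
    # update_counts[m]: number of updates client m performed.
    if scheme == "non-weighting":
        raw = [1.0] * len(model_sizes)
    elif scheme == "model-size":
        raw = [float(s) for s in model_sizes]
    elif scheme == "model-update":
        raw = [float(u) for u in update_counts]
    elif scheme == "hybrid":
        raw = [float(s * u) for s, u in zip(model_sizes, update_counts)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    return [r / total for r in raw]   # normalized so the p_m sum to 1

print(client_weights("model-size", [16, 64, 256], [10, 10, 10]))
# -> [0.0476..., 0.1904..., 0.7619...]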
3.3.7 Impact of Overlapping Kernels

We also studied the impact of overlapping kernels between rounds, using ResNet18 on CIFAR-10 and CIFAR-100 as an example. Specifically, we extracted sub-models using a rolling window that advances and loops over all the kernels of each convolution layer in the global model in strides. Let the degree of overlap between consecutive strides of the rolling window be r ∈ [0, 1]. In each round, the window of each convolution layer in the global model is advanced by 1 + ⌊β_n(1 − r)K_i⌋, where ⌊·⌋ is the floor function. In FedRolex, r = 1, i.e., the kernels are advanced by 1 from one round to the next. Figure 3.6 shows the impact of different r on global model accuracy. As shown, the value of r does have some influence on the global model accuracy, but the impact is non-linear and inconsistent.

Figure 3.6 Impact of inter-round kernel overlap on global model accuracy under low and high data heterogeneity for (i) CIFAR-10 and (ii) CIFAR-100.

3.3.8 Impact of Client Participation Rate

In our main experiments, we followed prior arts [10, 53, 11, 66, 49] and used a 10% client participation rate. To examine the effect of the client participation rate, we conducted experiments with both a lower (5%) and a higher (20%) client participation rate using CIFAR-10 as an example for FedRolex, HeteroFL, and Federated Dropout. The results are summarized in Table 3.5. As shown, FedRolex consistently outperforms both Federated Dropout and HeteroFL across the 5%, 10%, and 20% client participation rates.

Table 3.5 Performance of FedRolex, HeteroFL, and Federated Dropout on CIFAR-10 under different client participation rates.

Method             5%              10%             20%
HeteroFL           48.43 (± 1.78)  63.90 (± 2.74)  65.07 (± 2.17)
Federated Dropout  42.06 (± 1.29)  46.64 (± 3.05)  55.20 (± 4.64)
FedRolex           57.90 (± 2.72)  69.44 (± 1.50)  71.85 (± 1.22)

3.3.9 Communication and Computation Costs of FedRolex

To calculate the communication cost, we use the average size of the models sent by all the participating clients per round as the metric. To calculate the computation overhead, we calculate the FLOPs and the number of parameters in the models of all the participating clients per round and take the averages as the metrics. To put these metrics in context, we also calculate the upper and lower bounds of the communication cost and computation overhead (i.e., all clients using the same largest model and the same smallest model, respectively). Table 3.6 lists the results.

Table 3.6 Computation and communication costs of FedRolex compared to the upper and lower bounds represented by homogeneous settings with the largest and smallest models, respectively.

                                                   Homogeneous (largest)  FedRolex   Homogeneous (smallest)
Average Number of Parameters per Client (Million)  11.1722                2.9781232  0.04451
Average FLOPs per Client (Million)                 557.656                149.048384 2.41318
Average Model Size per Client (MB)                 42.62                  11.36      0.17

As shown, compared to the upper bound, FedRolex significantly reduces the communication cost and computation overhead while achieving comparable model accuracy. Compared to the lower bound, although FedRolex has a higher communication cost and computation overhead, it achieves much higher model accuracy. These results indicate that FedRolex achieves comparably high model accuracy to the upper bound at much lower communication and computation cost.
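As a consistency check on Table 3.6 (our own arithmetic, not from the thesis), the reported per-client model sizes follow directly from the average parameter counts at 4 bytes per FP32 parameter:

# Relating the rows of Table 3.6: number of parameters -> model size in MB.
def model_size_mb(num_params, bytes_per_param=4):
    return num_params * bytes_per_param / 2**20   # mebibytes

for name, params in [("Homogeneous (largest)", 11.1722e6),
                     ("FedRolex", 2.9781232e6),
                     ("Homogeneous (smallest)", 0.04451e6)]:
    print(f"{name}: {model_size_mb(params):.2f} MB")
# -> 42.62, 11.36, 0.17, matching the last row of Table 3.6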
3.3.10 Detailed Experimental Setup

Experimental Setup Details for Table 3.2. The experimental setup for the PT-based methods is listed in Table 3.7. The experimental setup for the model-homogeneous baselines was slightly different from that of the PT-based methods and is hence listed separately in Table 3.8.

Experimental Setup Details for Figure 3.3. The experimental setup details are tabulated in Tables 3.9 and 3.10.

Experimental Setup Details for Figure 3.4. The experimental setup details are tabulated in Table 3.11.

3.3.11 Algorithm Pseudocodes

The pseudocodes for HeteroFL and Federated Dropout are given in Algorithms 3.2 and 3.3, respectively. Their difference from FedRolex lies in the sub-model broadcast step (line 4): HeteroFL always broadcasts a fixed (static) portion of each layer, while Federated Dropout broadcasts a randomly chosen portion.

Algorithm 3.2 HeteroFL
Require: D_n, β_n ∀n ∈ N
Ensure: θ^(J)
1: Initialize θ^(0), N
2: for j = 0 to J − 1 do
3:   Sample subset M from N
4:   Broadcast θ^(j)_{m,[i; 0, 1, ..., ⌊β_m K_i⌋−1]} ∀i and m ∈ M
5:   for each client m ∈ M do
6:     ClientStep(θ^(j)_m, D_m)
7:   end for
8:   Aggregate θ^(j+1)_{[i,k]} according to Equation (3.10)
9: end for
10: function ClientStep(θ_n, D_n)
11:   m_n ← len(D_n)
12:   for k = 1 to m_n do
13:     θ_n ← θ_n − η∇l(θ_n; d_{n,k})
14:   end for
15:   return θ_n
16: end function

Algorithm 3.3 Federated Dropout
Require: D_n, β_n ∀n ∈ N
Ensure: θ^(J)
1: Initialize θ^(0), N
2: for j = 0 to J − 1 do
3:   Sample subset M from N
4:   Broadcast θ^(j)_{m,[i; k_1, ..., k_{⌊β_m K_i⌋}]} ∀i and m ∈ M, with the indices k_c chosen at random
5:   for each client m ∈ M do
6:     ClientStep(θ^(j)_m, D_m)
7:   end for
8:   Aggregate θ^(j+1)_{[i,k]} according to Equation (3.10)
9: end for
10: function ClientStep(θ_n, D_n)
11:   m_n ← len(D_n)
12:   for k = 1 to m_n do
13:     θ_n ← θ_n − η∇l(θ_n; d_{n,k})
14:   end for
15:   return θ_n
16: end function

Figure 3.7 Device heterogeneity distribution (1/16×: 55%, 1/8×: 18%, 1/4×: 11%, 1/2×: 10%, 1×: 6%).

Table 3.7 Experimental setup details of the PT-based methods in Table 3.2 on CIFAR-10, CIFAR-100, and Stack Overflow.

                                                 CIFAR-10   CIFAR-100   Stack Overflow
Local Epoch                                      1          1           1
Cohort Size                                      10         10          200
Batch Size                                       10         24          24
Initial Learning Rate                            2.00E-04   1.00E-04    2.00E-04
Decay Schedule (High Data Heterogeneity)         800, 1500  1000, 1500  600, 800
Decay Schedule (Low Data Heterogeneity)          800, 1250  1000, 1500  600, 800
Decay Factor                                     0.1        0.1         0.1
Communication Rounds (High Data Heterogeneity)   2500       3500        1200
Communication Rounds (Low Data Heterogeneity)    2000       3500        1200
Optimizer                                        SGD        SGD         SGD
Momentum                                         0.9        0.9         0.9
Weight Decay                                     5.00E-04   5.00E-04    5.00E-04

Table 3.8 Experimental setup details of the model-homogeneous baselines in Table 3.2 on CIFAR-10, CIFAR-100, and Stack Overflow.

                                                 CIFAR-10   CIFAR-100   Stack Overflow
Local Epoch                                      1          1           1
Cohort Size                                      10         10          200
Batch Size                                       10         24          24
Initial Learning Rate                            2.00E-04   1.00E-04    2.00E-04
Decay Schedule (High Data Heterogeneity)         500, 1000  1000, 1500  300
Decay Schedule (Low Data Heterogeneity)          500, 1000  1000, 1500  300
Decay Factor                                     0.1        0.1         0.1
Communication Rounds (High Data Heterogeneity)   1250       3500        ~1000
Communication Rounds (Low Data Heterogeneity)    1500       3500        ~1000
Optimizer                                        SGD        SGD         SGD
Momentum                                         0.9        0.9         0.9
Weight Decay                                     5.00E-04   5.00E-04    5.00E-04
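For convenience, the PT-baseline settings of Table 3.7 can also be transcribed directly into a configuration mapping. This is a sketch whose key names are our own choice, not identifiers from the released code.

PT_SETUP = {
    "cifar10": {
        "local_epoch": 1, "cohort_size": 10, "batch_size": 10, "initial_lr": 2e-4,
        "decay_schedule": {"high": [800, 1500], "low": [800, 1250]},
        "decay_factor": 0.1,
        "communication_rounds": {"high": 2500, "low": 2000},
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
    },
    "cifar100": {
        "local_epoch": 1, "cohort_size": 10, "batch_size": 24, "initial_lr": 1e-4,
        "decay_schedule": {"high": [1000, 1500], "low": [1000, 1500]},
        "decay_factor": 0.1,
        "communication_rounds": {"high": 3500, "low": 3500},
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
    },
    "stackoverflow": {
        "local_epoch": 1, "cohort_size": 200, "batch_size": 24, "initial_lr": 2e-4,
        "decay_schedule": [600, 800], "decay_factor": 0.1,
        "communication_rounds": 1200,
        "optimizer": "SGD", "momentum": 0.9, "weight_decay": 5e-4,
    },
}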
Table 3.9 Experimental setup for the results shown in Figure 3.3, for ρ between 0.0 and 0.5 in 0.1 increments.

CIFAR-10
  ρ                                     0.0        0.1        0.2        0.3        0.4
  Decay Schedule (High Heterogeneity)   500, 1000  500, 1000  500, 1000  700, 1200  700, 1200
  Communication Rounds                  1250       1250       1250       1500       1500
  Decay Schedule (Low Heterogeneity)    500, 1000  500, 1000  500, 1000  700, 1200  700, 1200
  Communication Rounds                  1250       1250       1250       1500       1500
CIFAR-100
  Decay Schedule (High and Low Heterogeneity)   1000, 1500 for all ρ
  Communication Rounds                          2000 for all ρ
Stack Overflow
  Decay Schedule (High and Low Heterogeneity)   800 for all ρ
  Communication Rounds                          1500 for all ρ

Table 3.10 Experimental setup for the results shown in Figure 3.3, for ρ between 0.5 and 1.0 in 0.1 increments.

CIFAR-10
  ρ                                     0.6        0.7        0.8        0.9        1.0
  Decay Schedule (High Heterogeneity)   700, 1200  700, 1200  500, 1000  500, 1000  500, 1000
  Communication Rounds                  1500       1500       1250       1250       1250
  Decay Schedule (Low Heterogeneity)    700, 1200  700, 1200  500, 1000  500, 1000  500, 1000
  Communication Rounds                  1500       1500       1250       1250       1250
CIFAR-100
  Decay Schedule (High and Low Heterogeneity)   1000, 1500 for all ρ
  Communication Rounds                          2000 for all ρ
Stack Overflow
  Decay Schedule (High and Low Heterogeneity)   800 for all ρ
  Communication Rounds                          1500 for all ρ

Table 3.11 Experimental setup for the results shown in Figure 3.4.

CIFAR-10 and CIFAR-100
  γ                                             2          4          8          16
  Decay Schedule (High and Low Heterogeneity)   800, 1200  800, 1200  800, 1200  800, 1200
  Communication Rounds                          1500       1500       1500       1500
Stack Overflow
  Decay Schedule (High and Low Heterogeneity)   800        800        800        800
  Communication Rounds                          1500       1500       1500       1500

Table 3.12 Income distribution.

Model Capacity   Annual Household Income
1/16×            < $75,000
1/8×             $75,000 – $100,000
1/4×             $100,000 – $150,000
1/2×             $150,000 – $200,000
1×               > $200,000

Table 3.13 Experimental setup for Table 3.3 on CIFAR-10, CIFAR-100, and Stack Overflow.

                                                 CIFAR-10   CIFAR-100   Stack Overflow
Local Epoch                                      1          1           1
Cohort Size                                      10         10          200
Batch Size                                       10         24          24
Initial Learning Rate                            2.00E-04   1.00E-04    2.00E-04
Decay Schedule (High Heterogeneity)              800, 1500  1000, 1500  600, 800
Decay Schedule (Low Heterogeneity)               800, 1250  1000, 1500  600, 800
Decay Factor                                     0.1        0.1         0.1
Communication Rounds (High Heterogeneity)        2500       3500        1200
Communication Rounds (Low Heterogeneity)         2000       3500        1200
Optimizer                                        SGD        SGD         SGD
Momentum                                         0.9        0.9         0.9
Weight Decay                                     5.00E-04   5.00E-04    5.00E-04

CHAPTER 4

LIMITATIONS AND FUTURE WORK

While the benchmark presented has been instrumental in elucidating the roles of different factors in model accuracy, the scope of AIoT (Artificial Intelligence of Things) extends further.
A holistic understanding of AIoT demands an examination of its infrastructural aspects, including the computational prowess and energy utilization of IoT platforms, along with the efficiency and security of their communication protocols. These are equally vital dimensions in the AIoT landscape that contribute to the rich complexity of this field. As part of our ongoing commitment to advancing the field, we intend to continually expand the scope of the benchmark, incorporating additional datasets from a more diverse set of applications, integrating new algorithms, and conducting deeper analytical validations. Our aspiration is to build upon our existing work, fostering collaboration and innovation, and thereby contribute to the nuanced and evolving world of AIoT.

Furthermore, our work has also provided a statistical analysis of FedRolex, an approach designed to train a global server model using a federation of heterogeneous client models. However, the full convergence analysis of FedRolex is intricate and will be an area for future investigation. An additional challenge is determining what models to deploy onto each client after the global server model is trained, especially when that model is substantial. This task is separate from our current focus but is something we will pursue in our future work.

CHAPTER 5

CONCLUSION

In this thesis, we have introduced two key contributions: FedAIoT and FedRolex. First, we presented FedAIoT, a Federated Learning (FL) benchmark specifically tailored for AIoT (Artificial Intelligence of Things). This benchmark encompasses eight datasets, harvested from a diverse array of genuine IoT devices, and incorporates a unified end-to-end FL framework for AIoT that spans the full FL-for-AIoT pipeline. Through our benchmarking of these datasets, we have been able to shed light on the unique opportunities and challenges that arise in applying FL within the AIoT context. Second, we introduced FedRolex, a partial training (PT)-based model-heterogeneous FL approach, designed to train a global server model that surpasses the size of the largest client model. By proposing a rolling sub-model extraction scheme, FedRolex facilitates the equitable training of the parameters of the global server model, thereby reducing the client drift resulting from model heterogeneity. We also furnished a theoretical statistical analysis to articulate its advantage over existing techniques like Federated Dropout. The experimental results have confirmed that FedRolex consistently excels over other state-of-the-art PT-based methods across various models and datasets, at both small and large scales. Additionally, we demonstrated its efficacy on an emulated real-world device distribution, underscoring how FedRolex furthers the inclusiveness of FL.

BIBLIOGRAPHY

[1] S. Nižetić, P. Šolić, D. López-de-Ipiña González-de Artaza, and L. Patrono, “Internet of things (iot): Opportunities, issues and challenges towards a smart and sustainable future,” Journal of Cleaner Production, vol. 274, p. 122877, 2020.

[2] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.

[3] J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, B. A. y Arcas, M. Al-Shedivat, G. Andrew, S. Avestimehr, K. Daly, D. Data, S. Diggavi, H. Eichner, A. Gadhikar, Z. Garrett, A. M. Girgis, F. Hanzely, A. Hard, C. He, S.
Horvath, Z. Huo, A. Ingerman, M. Jaggi, T. Javidi, P. Kairouz, S. Kale, S. P. Karimireddy, J. Konečný, S. Koyejo, T. Li, L. Liu, M. Mohri, H. Qi, S. J. Reddi, P. Richtarik, K. Singhal, V. Smith, M. Soltanolkotabi, W. Song, A. T. Suresh, S. U. Stich, A. Talwalkar, H. Wang, B. Woodworth, S. Wu, F. X. Yu, H. Yuan, M. Zaheer, M. Zhang, T. Zhang, C. Zheng, C. Zhu, and W. Zhu, “A field guide to federated optimization,” 2021.

[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.

[5] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.

[6] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.

[7] H.-Y. Chen and W.-L. Chao, “Fedbe: Making bayesian model ensemble applicable to federated learning,” arXiv preprint arXiv:2009.01974, 2020.

[8] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.

[9] S. Caldas, J. Konečný, H. B. McMahan, and A. Talwalkar, “Expanding the reach of federated learning by reducing client resource requirements,” arXiv preprint arXiv:1812.07210, 2018.

[10] E. Diao, J. Ding, and V. Tarokh, “Heterofl: Computation and communication efficient federated learning for heterogeneous clients,” arXiv preprint arXiv:2010.01264, 2020.

[11] S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. Venieris, and N. Lane, “Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout,” Advances in Neural Information Processing Systems, vol. 34, 2021.

[12] Z. Charles, K. Bonawitz, S. Chiknavaryan, B. McMahan et al., “Federated select: A primitive for communication- and memory-efficient federated learning,” arXiv preprint arXiv:2208.09432, 2022.

[13] C. He, A. D. Shah, Z. Tang, D. F. N. Sivashunmugam, K. Bhogaraju, M. Shimpi, L. Shen, X. Chu, M. Soltanolkotabi, and S. Avestimehr, “Fedcv: a federated learning framework for diverse computer vision tasks,” arXiv preprint arXiv:2111.11066, 2021.

[14] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar, “Leaf: A benchmark for federated settings,” arXiv preprint arXiv:1812.01097, 2018.

[15] C. Song, F. Granqvist, and K. Talwar, “Flair: Federated learning annotated image repository,” ArXiv, vol. abs/2207.08869, 2022.

[16] F. Lai, Y. Dai, S. Singapuram, J. Liu, X. Zhu, H. Madhyastha, and M. Chowdhury, “Fedscale: Benchmarking model and system performance of federated learning at scale,” in International Conference on Machine Learning. PMLR, 2022, pp. 11814–11827.

[17] D. Dimitriadis, M. H. Garcia, D. Diaz, A. Manoel, and R. Sim, “Flute: A scalable, extensible framework for high-performance federated learning simulations,” ArXiv, vol. abs/2203.13789, 2022.

[18] B. Y. Lin, C. He, Z. Zeng, H. Wang, Y. Huang, M. Soltanolkotabi, X. Ren, and S. Avestimehr, “Fednlp: A research platform for federated learning in natural language processing,” arXiv preprint arXiv:2104.08815, 2021.

[19] J. O. d. Terrail, S.-S. Ayed, E. Cyffers, F. Grimberg, C. He, R. Loeb, P.
Mangold, T. Marchand, O. Marfoq, E. Mushtaq et al., “Flamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings,” arXiv preprint arXiv:2210.04620, 2022.

[20] T. Zhang, T. Feng, S. Alam, S. Lee, M. Zhang, S. S. Narayanan, and S. Avestimehr, “Fedaudio: A federated learning benchmark for audio tasks,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[21] C. He, K. Balasubramanian, E. Ceyani, C. Yang, H. Xie, L. Sun, L. He, L. Yang, P. S. Yu, Y. Rong et al., “Fedgraphnn: A federated learning system and benchmark for graph neural networks,” arXiv preprint arXiv:2104.07145, 2021.

[22] G. M. Weiss, K. Yoneda, and T. Hayajneh, “Smartphone and smartwatch-based biometrics using activities of daily living,” IEEE Access, vol. 7, pp. 133190–133202, 2019.

[23] S. Yousefi, H. Narui, S. Dayal, S. Ermon, and S. Valaee, “A survey on behavior recognition using WiFi channel state information,” IEEE Communications Magazine, vol. 55, no. 10, pp. 98–104, Oct. 2017. [Online]. Available: https://doi.org/10.1109/mcom.2017.1700082

[24] Z. Yang, “Widar3.0 dataset: Cross-domain gesture recognition with wi-fi,” 2020. [Online]. Available: https://ieee-dataport.org/open-access/widar30-dataset-cross-domain-gesture-recognition-wi-fi

[25] Y. Zheng, Y. Zhang, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, “Zero-effort cross-domain gesture recognition with wi-fi,” in Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services. ACM, Jun. 2019. [Online]. Available: https://doi.org/10.1145/3307334.3326081

[26] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, “Detection and tracking meet drones challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.

[27] M. Schmitter-Edgecombe and D. J. Cook, “Assessing the quality of activities in a smart environment,” Methods of Information in Medicine, vol. 48, no. 5, pp. 480–485, 2009. [Online]. Available: https://doi.org/10.3414/me0592

[28] L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy and Buildings, vol. 140, pp. 81–97, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378778816308970

[29] J. Huh, J. Chalk, E. Kazakos, D. Damen, and A. Zisserman, “EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound,” in IEEE International Conference on Acoustics, Speech, & Signal Processing (ICASSP), 2023.

[30] D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision (IJCV), vol. 130, pp. 33–55, 2022. [Online]. Available: https://doi.org/10.1007/s11263-021-01531-2

[31] J. W. Lockhart, G. M. Weiss, J. C. Xue, S. T. Gallagher, A. B. Grosner, and T. T. Pulickal, “Design considerations for the wisdm smart phone-based sensor mining architecture,” in Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, ser. SensorKDD ’11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 25–33. [Online]. Available: https://doi.org/10.1145/2003653.2003656

[32] T.-M. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” arXiv preprint arXiv:1909.06335, 2019.

[33] S. Liu and W.
Deng, “Very deep convolutional neural network based image classification using small training sample size,” in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 730–734.

[34] J. Yang, X. Chen, H. Zou, C. X. Lu, D. Wang, S. Sun, and L. Xie, “Sensefi: A library and benchmark on deep-learning-empowered wifi human sensing,” Patterns, vol. 4, no. 3, p. 100703, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666389923000405

[35] D. Liciotti, M. Bernardini, L. Romeo, and E. Frontoni, “A sequential deep learning application for recognising human activities in smart homes,” Neurocomputing, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231219304862

[36] C. Reinbothe, “Wisdm—biometric-time-series-data-classification,” https://github.com/Chrissi2802/WISDM---Biometric-time-series-data-classification, 2023.

[37] J. Terven and D. Cordova-Esparza, “A comprehensive review of yolo: From yolov1 and beyond,” 2023.

[38] S. Seyedzadeh, F. P. Rahimian, I. Glesk, and M. Roper, “Machine learning for estimation of building energy consumption and performance: a review,” Visualization in Engineering, vol. 6, no. 1, p. 5, 2018. [Online]. Available: https://doi.org/10.1186/s40327-018-0064-7

[39] Sholahudin, A. G. Alam, C. I. Baek, and H. Han, “Prediction and analysis of building energy efficiency using artificial neural network and design of experiments,” Applied Mechanics and Materials, vol. 819, pp. 541–545, 2016.

[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[41] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons, “The non-iid data quagmire of decentralized machine learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020.

[42] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan, “Adaptive federated optimization,” arXiv preprint arXiv:2003.00295, 2020.

[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

[44] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging ai applications,” in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’18. USA: USENIX Association, 2018, pp. 561–577.

[45] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” ArXiv, vol. abs/1710.03740, 2017.

[46] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.

[47] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.

[48] S. Itahara, T.
Nishio, Y. Koda, M. Morikura, and K. Yamamoto, “Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data,” arXiv preprint arXiv:2008.06180, 2020.

[49] C. He, M. Annavaram, and S. Avestimehr, “Group knowledge transfer: Federated learning of large cnns at the edge,” Advances in Neural Information Processing Systems, vol. 33, pp. 14068–14080, 2020.

[50] Y. J. Cho, A. Manoel, G. Joshi, R. Sim, and D. Dimitriadis, “Heterogeneous ensemble knowledge transfer for training large models in federated learning,” International Joint Conference on Artificial Intelligence (IJCAI), 2022.

[51] S. D. Stanton, P. Izmailov, P. Kirichenko, A. A. Alemi, and A. G. Wilson, “Does knowledge distillation really work?” in Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021. [Online]. Available: https://openreview.net/forum?id=7J-fKoXiReA

[52] H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J.-y. Sohn, K. Lee, and D. Papailiopoulos, “Attack of the tails: Yes, you really can backdoor federated learning,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 16070–16084. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/b8ffa41d4e492f0fad2f13e29e1762eb-Paper.pdf

[53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[54] G. Cheng, Z. Charles, Z. Garrett, and K. Rush, “Does federated dropout actually work?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3387–3395.

[55] S. M. Ross, Introduction to Probability Models, 11th ed. San Diego, CA, USA: Academic Press, 2014.

[56] D. J. Newman, “The double dixie cup problem,” The American Mathematical Monthly, vol. 67, no. 1, pp. 58–61, 1960.

[57] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[58] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.

[59] M. Andreux, J. O. d. Terrail, C. Beguier, and E. W. Tramel, “Siloed federated learning for multi-centric histopathology datasets,” in Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. Springer, 2020, pp. 129–139.

[60] TFF, “Tensorflow federated stack overflow dataset,” Online: https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/stackoverflow, 2019.

[61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[62] B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, and Q. V. Le, “Learning data augmentation strategies for object detection,” in European Conference on Computer Vision. Springer, 2020, pp. 566–583.

[63] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S.
Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

[64] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., “Ray: A distributed framework for emerging AI applications,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 561–577.

[65] U. C. Bureau, “Percentage distribution of household income in the U.S. in 2020,” in Statista, September 2021, retrieved May 18, 2022, from https://www.statista.com/statistics/203183/percentage-distribution-of-household-income-in-the-us.

[66] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary, “Federated learning with personalization layers,” arXiv preprint arXiv:1912.00818, 2019.