TOWARDS POST-HOC HUMAN-INTERPRETABILITY OF MULTIMODAL NEURAL NETWORKS FOR HEALTHCARE APPLICATIONS

By

Muneeza Azmat

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science and Engineering - Doctor of Philosophy

2023

ABSTRACT

The use of artificial intelligence (AI) in healthcare has expanded rapidly in recent years. Multimodal neural networks (MNNs) that analyze diverse data types, such as images, lab reports, and genomic data, often outperform unimodal approaches in healthcare applications. However, owing to their complex architecture, the decision-making logic of these large AI models is often unknown. This raises serious concerns surrounding model reliability, accountability, patient autonomy, and bias. The black-box nature of these models often makes them unsuitable for high-risk healthcare applications. Research in explainable AI (XAI) is therefore critical for the safe implementation of these models. This dissertation develops a two-phase approach to improve the explainability and reliability assessment of MNNs in healthcare.

Phase 1 - Explainability via feature importance: we develop a unified framework that quantifies the relative importance of multimodal inputs using post-hoc, model-agnostic methods. The estimated importances are validated through importance-known-exactly simulations and through agreement between multiple attribution methods. Experiments with multimodal breast tumor and cardiomegaly classifiers demonstrate that the technique explains model behavior across diverse data types, with high agreement scores and alignment with expert intuition.

Phase 2 - Quantifying prediction reliability: we use multimodal input importance to predict the impact of missing inputs on MNN performance. This impact is presented with interpretable performance metrics, including accuracy reduction, providing measures closely tied to model reliability. We also propose an extension of average model reliability to finer-grained, patient-specific reliability estimates using reliability calibration curves.

The methods developed in this dissertation offer promising approaches to improve the interpretability and quantify the reliability of complex MNNs, potentially facilitating their safe adoption in high-risk clinical settings.

Copyright by
MUNEEZA AZMAT
2023

We make our world significant by the courage of our questions and the depth of our answers.
- Carl Sagan

ACKNOWLEDGEMENTS

This has been an exciting, sometimes painful, but mostly rewarding journey that would not have been possible without the support of many. I am especially grateful for my dad, Azmat Ullah, who always supported us in the pursuit of our dreams, and my mom, Shaheen, who taught us not to take life too seriously. My husband, Aseem, is a constant source of inspiration; his unmatched faith in my abilities lifts me up in my most difficult moments. I also want to thank my incredible siblings - Muneeba, Zarmeen, Osama, and Ayesha, who are my biggest hype-squad. They are always rooting for me with unconditional love and occasional reality checks to keep me grounded. Visits and video calls with my niece, Amal, have kept me sane during this final stretch. I am incredibly thankful to my amazing, loving family. I also want to sincerely thank my advisor, Prof. Alessio, for being an incredible mentor in research and in life. Whether it be understanding image reconstruction or learning how to refill windshield fluid, every interaction has been a meaningful learning experience.
His integrity, empathy, and compassion have profoundly shaped who I am, and who I aspire to be, as a researcher and as a human being. I want to thank my committee members, Prof. Arjun Krishnan, Prof. Selin Aviyente, and Prof. Yuying Xie, for their invaluable feedback and support. I am thankful for the privilege of learning from so many wise mentors. I also want to thank the incredible people I have met at MSU and East Lansing. A shout-out to the 'Bridge gang' for providing sanity during COVID, Uswa for putting up with my antics and for introducing me to so many cool things around the area, and Wenjie for being the best pep-talk-giver and my partner-in-cry. Also, a big thank you to Jonathan for enduring my phone-call rants, to Henry for being my brunch and Wharton buddy (the finer things club), and to all the MIDI lab folks for their consistent warmth and support. Lastly, I want to acknowledge and honor my late grandfather, Hakeem Qudrat Ullah Khan. He was a lifelong learner and a visionary, whose passion for knowledge paved the way for my father, and consequently for me, to embark on this journey of lifelong learning.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 BACKGROUND
    1.1 The Black Box Problem in Artificial Intelligence
    1.2 The Growing Drive for Explainable AI
    1.3 Lexicon of Explainable AI
    1.4 Illuminating the Black Box: Strategies in Explainable AI
CHAPTER 2 METHODS FOR POST-HOC FEATURE IMPORTANCE
    2.1 Permutation Feature Importance
    2.2 Gradient Feature Importance
    2.3 Locally Interpretable Model Agnostic Explanations (LIME)
    2.4 SHapley Additive exPlanations (SHAP)
CHAPTER 3 QUANTIFYING THE BENEFIT OF USING PATIENT-SPECIFIC BLOOD FLOW FOR ASSESSMENT OF CORONARY ARTERY DISEASE RISK
    3.1 Introduction
    3.2 Methods
    3.3 Results
    3.4 Limitations and Improvements
    3.5 Summary
CHAPTER 4 UNIFIED FRAMEWORK FOR MULTIMODAL FEATURE IMPORTANCE
    4.1 Introduction
    4.2 Proposed Framework
    4.3 Simulation Platform
    4.4 Results
    4.5 Summary
CHAPTER 5 ESTIMATING MODEL RELIABILITY WITH MISSING DATA THROUGH MULTIMODAL IMPORTANCE
    5.1 Introduction
    5.2 Methods
    5.3 Data Collection and Pre-processing
    5.4 Model Setup and Training
    5.5 Results
    5.6 Conclusion
CHAPTER 6 TOWARDS SAMPLE-LEVEL RELIABILITY ESTIMATION
    6.1 Introduction
    6.2 Methods
    6.3 Results
    6.4 RCCs in the Case of Missing Inputs
    6.5 Conclusion and Future Work
CHAPTER 7 CONCLUSION
BIBLIOGRAPHY
APPENDIX

LIST OF TABLES

Table 2.1: Overview of feature importance and attribution methods in explainable AI. The categorization of methods can vary with implementation. The data types listed are the most common ones for each method; some of these methods can be used with other data types with modifications.
Table 3.1: Description of the FFR estimation models used in the comparative study.
Table 3.2: Geometric and flow parameters used in analytical FFR estimation models.
Table 3.3: Material properties for blood and arterial walls used in the CFD simulation.
Table 3.4: Range of simulated patient-specific values of blood flow through the LAD artery.
Table 3.5: Average feature importance across all methods.
Table 3.6: Pairwise cosine similarity and RMSE between normalized input importance estimated by permutation (PERM), input gradient (GRAD), LIME, SHAP, and the average of the four methods (AVG) for the CAD risk classifier.
Table 4.1: Overview of nomenclature in multimodal neural networks.
Table 4.2: Synthetic decision functions and the corresponding normalized ground truth input importance.
Table 4.3: Description of inputs and encoders used for training the multimodal classifiers for synthetic data.
Table 4.4: Percent relative error and RMSE in feature importance estimates for the proposed methods compared to synthetic ground truth from synthetic decision functions employing four multimodal inputs.
Table 4.5: Pairwise cosine similarity and RMSE between normalized input importance estimated by the proposed methods for the synthetic classification problems.
Table 5.1: Description of inputs used for training the multimodal breast tumor classifier and the corresponding encoding schemes.
Table 5.2: Description of inputs used for training the multimodal cardiomegaly classifier and the corresponding encoding schemes.
Table 5.3: Pairwise cosine similarity and RMSE between normalized input importance estimated by the proposed methods for the multimodal breast tumor classifier.
Table 5.4: Predicted performance of the multimodal breast tumor classifier in the case of a single missing input.
Table 5.5: Predicted and true performance of the multimodal breast tumor classifier in the case of multiple missing inputs. Each row corresponds to a different experiment with a unique subset of missing inputs. The predicted accuracy is obtained using (5.1), and the experimental accuracy is the computed accuracy on an imputed test set.
Table 5.6: Pairwise cosine similarity and RMSE between normalized input importances estimated by different methods for the multimodal cardiomegaly classifier.
Table 5.7: Predicted performance of the multimodal cardiomegaly classifier in the case of a single missing input.
Table 5.8: Predicted and true performance of the multimodal cardiomegaly classifier in the case of multiple missing inputs. Each row corresponds to a different experiment with a unique subset of missing inputs. The predicted accuracy is obtained using (5.1), and the experimental accuracy is the computed accuracy on an imputed test set.
Table 6.1: Validation results of the Mahalanobis distance based reliability calibration curve (M-RCC).
Table 6.2: Validation results of the cosine similarity based calibration curve (C-RCC).
Table 6.3: Validation results of the UMAP-based reliability calibration curve (U-RCC).
Table 6.4: Validation results of the prediction probability based calibration curve (P-RCC).

LIST OF FIGURES

Figure 1.1: Timeline of the evolution of artificial intelligence systems with a focus on applications in healthcare. Key: AI - Artificial Intelligence; DL - Deep Learning; FDA - U.S. Food and Drug Administration; CAD - Computer-Aided Diagnosis. Reprinted from [1] with permission from Elsevier.
Figure 1.2: Plot of the exponential growth in published research related to 'Explainable AI' over the past decade. This surge in research activity is partially fueled by various international and national incentives and regulatory frameworks. The data for this analysis was sourced from the Web of Science.
Figure 1.3: The trade-off between model interpretability and performance is illustrated, showing that highly interpretable models like linear regression tend to have lower performance while high-performing opaque models like deep neural networks have low interpretability. Explainable AI methods and tools have promise to increase the interpretability of high-performing opaque models without significantly sacrificing their performance. Reprinted from [2] with permission from Elsevier.
Figure 1.4: Taxonomy of XAI methods combining the conceptual, functioning, and result approaches. The conceptual dimensions of stage, scope, and applicability form the upper levels. The functioning and result of methods are added as dimensions on the lower level. Additional dimensions like output format can be incorporated. Categories are not assumed to be mutually exclusive. Used under CC 4.0 from [3].
Figure 3.1: Model of a single blunt-plug stenosis in the artery.
Figure 3.2: Illustration of the three-dimensional blunt-plug arterial stenosis modeled using Ansys.
Figure 3.3: Schematic of the virtual clinical trial to quantify the added benefit of using patient-specific blood flow rate for CAD assessment.
Figure 3.4: Velocity profiles of the blood flow along different cross sections in an unobstructed artery.
Figure 3.5: Mean static pressure values along the arterial centerline, as determined by the converged CFD solution. The term curve length refers to distance along the length of the artery.
Figure 3.6: Flow solution of the analytical model $FFR_P$ implemented in Matlab; (a) shows the stenosed geometry, which is a 2D projection of the geometry used for the CFD analysis in Figure 3.5; (b) shows the static pressure along the artery, where the red dots represent the probes at which the proximal and distal pressures were measured for calculating FFR.
Figure 3.7: ROC curves for classification into high risk versus low risk for CAD, for varying levels of noise.
Figure 3.8: Performance summary of the trained CAD risk classifier on the test set.
Figure 3.9: Relative importance of patient-specific features for classification of CAD risk.
Figure 4.1: Model architecture for various multimodal fusion strategies. The left diagram illustrates early fusion, where original or extracted features are merged at the input level. The middle diagram represents hybrid or joint fusion, where original or extracted features are combined at the input level and the model is trained end-to-end. The right diagram shows late fusion, where predictions are consolidated at the decision level. Used under CC 4.0 from [4].
Figure 4.2: Proposed method for multimodal feature importance. A hybrid fusion architecture supporting multimodal inputs is trained in an end-to-end manner. Features in the fusion layer are used to estimate feature importance of the upstream inputs. The post-fusion architecture typically consists of fully connected layers. The feature importance module can be replaced with any post-hoc attribution method.
Figure 4.3: Plots of normalized ground truth versus average feature importance returned by four estimation methods plus an average value across the methods. Each subplot represents a different test case corresponding to the decision functions given in Table 4.2. The predicted feature importance values closely estimate the known ground truth and display a consistent ranking of features.
Figure 5.1: Architecture of the hybrid fusion model used for classifying breast MRIs using multimodal data. ResNet50 is used to extract fusion features from images while tabular inputs are pre-processed using standard scaling and one-hot encoding. All weights are learnable and the model is trained end-to-end.
Figure 5.2: Performance of the trained breast tumor classifier on the test set.
Figure 5.3: Examples of different classification outcomes of the trained breast tumor classifier on the test set.
Figure 5.4: Comparison of normalized feature importance results and associated feature ranks using gradient, permutation, LIME, and Shapley value based methods for the multimodal breast tumor classifier. AVG reports the mean importance across the four methods.
Figure 5.5: Comparison of predicted and true breast tumor classification performance reduction as a function of missing input importance using gradient (GRAD), permutation (PERM), LIME, and Shapley values (SHAP). The Pearson correlation coefficient, $\rho$, is between the model test performance and the aggregated importance of missing inputs.
Figure 5.6: Breast tumor classification performance reduction as a function of missing input importance. This presents predictions, "Pred", using the AVG method. BLUE represents the best linear fit of the true drop in model test accuracy. The Pearson correlation coefficient, $\rho$, is between the drop in model test performance and the summed importance of missing inputs.
Figure 5.7: Performance of the trained cardiomegaly classifier on the test set.
Figure 5.8: Examples of different classification outcomes of the trained cardiomegaly classifier on the test set.
Figure 5.9: Comparison of normalized feature importance results and associated feature ranks using gradient, permutation, LIME, and Shapley value based methods for the multimodal cardiomegaly classifier. AVG reports the mean importance across the four methods.
Figure 5.10: Comparison of predicted and true cardiomegaly classification performance reduction as a function of missing input importance in the case of one or more missing inputs using the proposed methods. $\rho$ is the Pearson correlation coefficient between the model test performance and the aggregated importance of missing inputs.
Figure 5.11: Cardiomegaly classification performance reduction as a function of missing input importance. Presents predictions, "Pred", using AVG importances. BLUE represents the best linear fit of the true drop in model test accuracy. $\rho$ is the Pearson correlation coefficient between the drop in model test performance and the summed importance of missing inputs.
Figure 6.1: Generated M-RCCs for problems 1, 7, and 8 (L-R) in Table 6.1. The mean calibration curve, depicted by the blue line, is constructed using validation data; the blue shaded region represents a 95% confidence interval. The red dots represent pairs of average Mahalanobis distance and local accuracy of test data, calculated over local neighborhoods, repeated for multiple bootstrap iterations. The test data is weighted based on the density of samples in the neighborhood.
Figure 6.2: Generated C-RCCs for problems 1, 7, and 8 (L-R) in Table 6.2. The mean calibration curve, depicted by the blue line, is constructed using validation data; the blue shaded region represents a 95% confidence interval. The red dots represent pairs of average cosine similarity and local accuracy of test data, calculated over local neighborhoods, repeated for multiple bootstrap iterations. The test data is weighted based on the density of samples in the neighborhood.
Figure 6.3: RCCs based on the Euclidean distance of a sample from the mean of the training data in UMAP-projected space.
Figure 6.4: RCCs based on model prediction probability.
Figure 6.5: Comparison of estimated and true P-RCCs for holdout test data in the case of missing multimodal inputs. (a) shows results for problem 1, (b) shows results for problem 7, and (c) shows results for problem 8 in Table 6.4.

CHAPTER 1
BACKGROUND

1.1 The Black Box Problem in Artificial Intelligence

Artificial intelligence (AI) systems based on machine learning and deep neural networks (DNNs) have become increasingly popular in recent years, revolutionizing a wide range of sectors with their ability to learn from vast data. These computational models, inspired by the human brain, are designed to automatically learn and improve from experience. They have been utilized in fields such as finance, marketing, self-driving cars, voice recognition systems, and healthcare.

The use of AI in healthcare has rapidly expanded in recent years, though its origins trace back decades. Early AI systems were developed in the 1960s to replicate aspects of medical reasoning and decision-making [5]. Expert systems that encoded rules to perform diagnostic tasks emerged in the 1970s and 1980s [6]. Machine learning then enabled statistical pattern recognition for tasks like imaging analysis in the 1990s [7]. With modern deep learning, AI now matches or exceeds clinicians on select diagnostic tasks, as seen across sub-specialties such as dermatology, ophthalmology, and radiology [8, 9, 10]. Beyond diagnosis, AI has applications across healthcare, including automated patient monitoring, personalized treatment recommendations, robotic surgery, and drug discovery [11].

Despite their impressive capabilities and growing prevalence, deep neural networks pose significant challenges, particularly when it comes to understanding their decision-making logic. These models are often referred to as "black boxes" because, while they can make highly accurate predictions, the internal workings that lead to these predictions are not easily interpretable or transparent. The intricacy of neural networks with millions of parameters makes them essentially opaque, even to experts [12]. The black-box problem arises from the complex, nonlinear nature of these models. For example, a neural network might consist of hundreds of layers, each containing numerous neurons with different weights and biases. The interactions between these layers and neurons create a highly intricate web of computations that is not human interpretable. This complexity, while contributing to the model's predictive power, makes it challenging to explain why the model made a particular decision.

Figure 1.1: Timeline of the evolution of artificial intelligence systems with a focus on applications in healthcare. Key: AI - Artificial Intelligence; DL - Deep Learning; FDA - U.S. Food and Drug Administration; CAD - Computer-Aided Diagnosis. Reprinted from [1] with permission from Elsevier.
The opacity of these models poses significant challenges for trust, ethics, and accountability. If the reasoning behind a model's outputs and predictions cannot be understood, it becomes difficult to verify that the model is making decisions based on fair, unbiased logic rather than on problematic shortcuts or proxies. This poses a significant hurdle to the broader adoption of these AI systems in healthcare, where decisions can have life-altering consequences.

The use of black box models can lead to erroneous predictions with severe consequences. In their 2019 study, Obermeyer et al. showed that an AI algorithm used to guide healthcare decisions was less likely to recommend additional medical care for Black patients than for white patients with the same health conditions [13]. Further investigation revealed that the algorithm relied on healthcare costs as a proxy for health needs. Because less money is spent on healthcare for Black patients in the U.S., the algorithm incorrectly inferred that they were healthier than equally sick white patients. A different study revealed that classifiers used for computer-aided diagnosis trained on medical imaging datasets performed poorly for underrepresented genders [14]. This problem was consistent across various network architectures and datasets, underscoring the need for data diversity and model fairness.

Röösli et al. also noted performance disparities in the MIMIC benchmarking model. The model demonstrated less accurate predictions for Black patients and those on Medicaid as compared to individuals with private insurance. Furthermore, the model tended to underestimate risk for Medicare patients and those with a higher number of comorbidities, suggesting possible inequities [15]. Another study demonstrated that an image recognition based AI system designed for cancer treatment recommendations was unable to replicate its performance across various healthcare settings. This study underscores the risks and challenges associated with adapting such systems across diverse environments. It cautions against using these models as black boxes without understanding their limitations and recognizing the importance of region-specific data and inclusive training [16]. The risk of AI systems making errors in healthcare, such as recommending the wrong medication or misdiagnosing a condition, is highlighted in [17]. The authors note that reactions to errors made by AI may differ from those made by humans, and the widespread use of an erroneous AI system could potentially lead to injuries in a large number of patients, unlike errors made by individual healthcare providers.

Black box models in healthcare also present a threat to individual autonomy by obstructing meaningful patient involvement in decision-making processes. In order for patients to exercise their agency, they need to understand the processes and potential outcomes of AI recommendations [18]. The opacity of these models impedes shared decision-making, a crucial aspect of ethical healthcare. Without grasping the underlying logic, patients cannot ensure that algorithms are aligned with their values. In extreme cases, this could even compromise a patient's confidence to refuse treatment. Hence, relying on inscrutable AI compromises patients' ability to understand the forces influencing them.
Moreover, healthcare decisions often need to be justified to patients, their families, and, in some cases, to legal entities. If a decision based on a black box model leads to an adverse outcome, it could result in legal issues. The inability of these models to explain their decision-making logic makes it challenging to assign liability [19, 20]. Holzinger and others propose that explainability could be the solution to this problem [21]. If healthcare professionals are provided with human-interpretable explanations of the model's logic, then the model becomes similar to other diagnostic tools already in use.

Safety challenges of machine learning systems in healthcare, particularly their black box nature, are also discussed in [22]. The authors argue that while certain aspects of AI systems, like design decisions and training activities, can be examined and explained, the precise workings of an algorithm often remain inscrutable. They suggest that safety governance of AI in healthcare will require frameworks that can explain broader sociotechnical processes, not just the underlying mechanics of the algorithms. However, they note that the inherent inscrutability of some machine learning approaches may make them unsuitable for safety-critical applications.

1.2 The Growing Drive for Explainable AI

In light of the aforementioned risks, there is a growing global demand for explainability in AI to enable oversight and accountability as these transformative technologies become increasingly integrated across healthcare, government, industry, and other domains. Efforts to promote explainable AI have gained momentum across sectors in recent years. In 2019, the Organisation for Economic Co-operation and Development (OECD) AI Principles, endorsed by over 40 countries, emphasized the need for trustworthy AI systems. These principles highlight transparency and explainability as key enablers of trust in AI systems to support their widespread diffusion and adoption [23].

The National Institute of Standards and Technology (NIST) in the United States is at the forefront of developing standards, metrics, benchmarks, and tools to address explainable AI as a core component of trustworthy AI. NIST held a virtual workshop on Explainable AI in 2021, bringing together stakeholders from industry, academia, and government to discuss technical needs, challenges, and collaborative opportunities related to explainable AI [24]. The National AI Initiative Act, signed into law in January 2021, underscores explainability as an important research priority, calling for coordination across the Federal government to accelerate advances in AI [25]. The Defense Advanced Research Projects Agency's (DARPA) Explainable AI (XAI) program aims to create a suite of machine learning techniques that yield more explainable models without sacrificing learning performance. A core goal is enabling human users to understand, trust, and manage AI partners through model explainability [26].

On the regulatory front, there have been various initiatives to codify requirements for explainable AI. In the United States, the proposed federal Algorithmic Accountability Act would require companies to conduct assessments of high-risk automated systems for bias and discrimination potential. It also mandates that companies take corrective actions based on the assessments. The capacity to explain algorithmic decisions in a meaningful way to affected individuals would be important for compliance [27].
The Federal Trade Commission (FTC) has noted that explainability is critical for evaluating important AI properties like fairness, as opaque models preclude assessing underlying biases [28]. In the United Kingdom, guidelines from the Information Commissioner's Office (ICO) state that organizations must be able to explain the decisions, predictions, or recommendations produced by AI systems to affected individuals upon request, in non-technical language [29]. The European Union's General Data Protection Regulation (GDPR) has frequently been referenced as establishing a broad "right to explanation" for citizens subject to algorithmic decisions [30]. The EU is also working to impose transparency requirements on high-risk AI systems under proposed regulations such as the Artificial Intelligence Act [31].

For AI in healthcare, the Food and Drug Administration (FDA) has proposed guiding principles for good machine learning practice in medical device development [32]. The FDA principles stress that interpretability of outputs is critical for the usability and safety of machine learning-based devices and techniques, particularly those involving collaboration between humans and AI algorithms. Interpretability can be promoted through visualizations that explain the model's predictions, natural language rationales, or other strategies that make the model more understandable for users.

Consequently, a primary focus of ongoing research in AI is to develop methods that can enhance the interpretability of these black box models. The goal is to demystify the black box, making the decision-making process of these powerful tools more comprehensible and accountable, thereby addressing the ethical, legal, trust, and safety issues that currently limit their broader use.

Figure 1.2: Plot of the exponential growth in published research related to 'Explainable AI' over the past decade. This surge in research activity is partially fueled by various international and national incentives and regulatory frameworks. The data for this analysis was sourced from the Web of Science.

1.3 Lexicon of Explainable AI

As evidenced by the global push for AI regulation, the ideas of interpretability and explainability have gained significant attention. However, the nomenclature used in this area of research can be confusing due to the interchangeable use of terms. In this section, we give a brief overview to provide clarity on the key terms and concepts used in the scientific literature on interpretable and explainable AI.

Interpretable AI refers to models whose predictions can be readily understood by humans [33]. The term intrinsic or inherent interpretability is often associated with simpler models, such as linear regression or decision trees, where the relationship between input and output is known and can be easily understood. These models have clear and explicit rules that relate input features to output predictions.

Explainable AI (XAI), on the other hand, is a broader concept that includes not only interpretability but also the techniques and methods that can provide clear, understandable explanations for the decisions made by black box AI models. Post-hoc explanations are methods used to explain model outputs after the model has been trained, without modifying the model itself. Local explanations focus on explaining individual predictions, allowing users to understand why a particular instance was classified or predicted in a certain way. Global explanations, in contrast, aim to provide a comprehensive understanding of the overall functioning and logic of the entire model.
Another term that often appears in the literature is transparency. In the context of AI, transparency refers to the openness and clarity of an AI system's operations. A transparent AI system is one in which all aspects, including the data used for training, the learning algorithm, and the decision-making process, are open and accessible. Transparent models are also referred to as white-box or glass-box models. These models provide visibility into their decision-making process, allowing humans to comprehend the factors that influence their predictions. Lastly, fairness and trust are key concepts in AI research that are closely tied to interpretability and explainability. Fairness refers to the ability of an AI system to make decisions without bias or discrimination, whereas trust in AI refers to the confidence users have in an AI system's reliability and integrity.

1.4 Illuminating the Black Box: Strategies in Explainable AI

In the early days of AI, between the 1980s and 1990s, symbolic or rule-based systems dominated, offering transparency and interpretability in their outputs [34]. As machine learning technologies advanced in the 2000s, popular approaches focused on intrinsic interpretability; strategies developed in this era included sparse linear models, rule-based models, and tree-based models. The rise of deep learning in the 2010s, however, introduced high-performing opaque models, necessitating the emergence of "Explainable AI" (XAI) to interpret these black-box systems [35]. To illustrate these concepts, Figure 1.3 presents the commonly accepted relationship that model performance is often inversely related to interpretability.

Figure 1.3: The trade-off between model interpretability and performance is illustrated, showing that highly interpretable models like linear regression tend to have lower performance while high-performing opaque models like deep neural networks have low interpretability. Explainable AI methods and tools have promise to increase the interpretability of high-performing opaque models without significantly sacrificing their performance. Reprinted from [2] with permission from Elsevier.

The domain of explainable AI encompasses a variety of methodologies, each differing based on specific functional attributes. To organize these diverse concepts, scholars have proposed various taxonomies. Figure 1.4 illustrates a recognized categorization scheme for XAI methods [3].

Figure 1.4: Taxonomy of XAI methods combining the conceptual, functioning, and result approaches. The conceptual dimensions of stage, scope, and applicability form the upper levels. The functioning and result of methods are added as dimensions on the lower level. Additional dimensions like output format can be incorporated. Categories are not assumed to be mutually exclusive. Used under CC 4.0 from [3].

Popular XAI approaches consist of post-hoc interpretation, which can vary from simple visualization methods like partial dependence plots [36] to feature importance or feature attribution methods that quantify the contribution of input features to the model's output. Gradient-based methods are also used to identify important input regions. Saliency maps computed from gradients, such as Grad-CAM, are typically used for interpreting image-based convolutional neural networks [37].
Approaches like occlusion analysis and counterfactual explanations manipulate the inputs and the model itself to facilitate a deeper understanding of their functioning. Occlusion analysis methodically masks parts of the input to determine their importance [38]. Counterfactual explanations, on the other hand, offer insights into model predictions by identifying the minimal change to the input data that would lead to a change in the outcome [39, 40].

Decomposition approaches explain models in terms of their individual components. Methods like testing with concept activation vectors (TCAV) identify concepts that are highly relevant to a prediction [41]. Concept activation vectors link internal neural representations to human-interpretable concepts.

Example-based techniques explain predictions by retrieving similar instances from the training data [42]. These include prototype-based methods like influence functions [43] and activation minimization [44]. Influence functions identify the training samples that contribute the most towards predicting a given test sample. Activation minimization identifies examples that strongly activate the function of interest. Other prototype-based methods learn characteristic prototypes for different classes and identify the closest training inputs to those prototypes as explanations at inference time [45, 46].

Model induction involves generating an entirely new model that approximates the behavior of a black-box model. This new model is typically a simpler, interpretable model such as a decision tree, rule set, or linear model. Model induction creates the new model by learning directly from the inputs and outputs of the black-box model. The new model thus serves as an approximation, or surrogate, of the black-box model, allowing researchers to study its decision-making process more readily than they could with the original. Similarly, knowledge distillation can be used as an explainability method. It involves transferring knowledge from a larger, more complex teacher network to a smaller, less intricate student network [47, 48]. The primary objective is to manage the size and complexity of large-scale neural networks, but it can also be used to impart 'distilled' knowledge to a more interpretable model.

Other approaches focus on the development of inherently interpretable rule-based and symbolic models [49]. Symbolic reasoning emphasizes explicit representation of knowledge using symbolic rules or logical formulas. The recent development of neuro-symbolic AI offers the best of both worlds: the performance of deep learning models and the explainability of symbolic AI. Neuro-symbolic models use symbolic AI for high-level reasoning and neural networks for low-level perception tasks [50].

In medicine, analytical XAI approaches, such as predefined kinetic and linear models, feature extraction via correlations and clustering, and sensitivity analysis through perturbations, are also gaining popularity. These models can reveal patterns in genetics and neuroimaging data. Signal inversion methods are less explored despite their potential to probe neural mechanisms [51]. Verbal rule-based systems have also shown promise for interpretable medical predictions, such as pneumonia risk models [52].

Despite recent advances in XAI, the black-box problem in deep learning remains unsolved. One of the reasons is that model understanding is subjective and, therefore, difficult to formalize [53].
Furthermore, the insights needed from a model to make it transparent are domain-specific [54]. Consequently, it is challenging to design universal methods for model transparency. Emerging directions include building standardized benchmarks and rigorously evaluating explanations [55]. Improving human-centered design and explanations for non-experts are also crucial areas [56]. Developing theories and formal grounding for interpretability remains an open challenge [57]. This thesis contributes to XAI research by presenting methods that offer explanations for black-box models that take multiple modalities as input and output medical decisions.

CHAPTER 2
METHODS FOR POST-HOC FEATURE IMPORTANCE

As mentioned earlier, predictions alone are not enough for medical applications. The model must also provide some insight into the prediction generation process. In particular, it is often necessary to understand the model in order to perform debugging, bias detection, and failure analysis. Furthermore, insights into the model help the user assess if, when, and how much to trust model predictions. This is a crucial requirement for deploying these models in the real world.

Feature importance, or feature attribution, is a widely used and well-studied explainability technique [58, 59, 60]. The term feature in explainable AI research refers to a measurable property of the model input. Understanding the contribution of an input feature towards a particular decision builds trust with users and can lead to novel scientific discoveries.

Based on their interaction with the predictive model, feature importance methods are classified into filter, wrapper, and embedded methods [61]. Filter methods use input data only and are mostly applied as a preprocessing step before training the predictive model. Examples include similarity-based methods, correlation criteria, mutual information, clustering, principal component analysis, and linear discriminant analysis [62]. Wrapper methods such as permutation methods, local model approximations [63], and some gradient-based methods [64] are model agnostic but use model predictions for ranking features. Embedded methods require intricate manipulation of the model. In some direct-objective-optimization-based methods, feature ranks are learned in addition to the model parameters [65]. Some methods propagate feature relevance layer-wise through the DNN [66], while others use special network structures like bijection layers [67] or self-attention layers [68] to rank features. Table 2.1 provides a comprehensive overview of existing feature attribution methods.

While filter, wrapper, and embedded feature importance methods each have their advantages and shortcomings, wrapper methods are often preferred due to their model-agnostic nature. Unlike the other methods, they only examine model inputs and outputs, making them suitable for a wide variety of model architectures. In this context, we focus on exploring several popular and effective wrapper-based methods.

Table 2.1: Overview of feature importance and attribution methods in explainable AI. The categorization of methods can vary with implementation. The data types listed are the most common ones for each method; some of these methods can be used with other data types with modifications.

| Method | Data | Category | References |
|---|---|---|---|
| Permutation Importance | Tabular, Time Series | Filter | Breiman, L. (2001) [69] |
| Partial Dependence Plots (PDP) | Tabular, Time Series | Filter | Friedman, J. H. (2001) [36] |
| Saliency Maps | Image | Embedded | Simonyan et al. (2013) [70] |
| Occlusion | Text, Image | Wrapper | Zeiler, Fergus (2014) [38] |
| Layer-wise Relevance Propagation (LRP) | Tabular, Text, Image | Embedded | Bach et al. (2015) [66] |
| Guided Backpropagation | Image | Embedded | Springenberg et al. (2015) [71] |
| Input Gradients | Tabular, Text, Image | Wrapper | Hechtlinger (2016) [64] |
| LIME (Local Interpretable Model-agnostic Explanations) | Tabular, Text, Image | Wrapper | Ribeiro et al. (2016) [63] |
| Grad-CAM | Image | Embedded | Selvaraju et al. (2016) [37] |
| Quantitative Input Influence (QII) | Tabular, Text, Image | Wrapper | Datta et al. (2016) [72] |
| SHAP (SHapley Additive exPlanations) | Tabular, Text, Image | Wrapper | Lundberg, Lee (2017) [60] |
| Integrated Gradients | Tabular, Text, Image | Embedded | Sundararajan et al. (2017) [73] |
| SmoothGrad | Tabular, Text, Image | Wrapper | Smilkov et al. (2017) [74] |
| DeepLIFT | Tabular, Text, Image | Wrapper | Shrikumar et al. (2017) [58] |
| Influence Functions | Tabular | Wrapper | Koh, Liang (2017) [75] |
| Extremal Perturbations | Text, Image | Wrapper | Fong, Vedaldi (2017) [76] |
| Contextual Decomposition | Text | Wrapper | Murdoch, Szlam (2017) [77] |
| Contrastive Explanations Method (CEM) | Tabular, Image | Wrapper | Dhurandhar et al. (2018) [78] |
| Anchors | Tabular | Filter | Ribeiro et al. (2018) [79] |
| Model Agnostic suPervised Local Explanations (MAPLE) | Tabular | Wrapper | Plumb et al. (2018) [80] |

These methods leverage slightly different definitions of feature importance. To describe them effectively, we first establish a common framework. Consider a binary classification problem:

\[
F : X \in \mathbb{R}^d \longrightarrow y \in \mathbb{R},
\tag{2.1}
\]

where $F$ represents the classifier, $X = [x_1 \cdots x_d]^T$ is an input sample with $d$ features, each $x_i \in \mathbb{R}$, and $y$ is the predicted probability for class 1 (the probability of class 2 is $1 - y$). Note that no assumptions are made about the structure of the classifier $F$. All methods discussed below are model-agnostic, and $F$ represents an arbitrary binary classifier. These methods can also be adapted for multi-class classification with minor modifications.

2.1 Permutation Feature Importance

The primary goal in classification problems is maximizing the performance of the classifier, which is typically quantified by some score. A natural way to estimate feature importance is therefore to study the effect a feature has on the classifier score. One such method is permutation-based feature importance [69]. It defines the importance of the $k$th feature as the average decrease in classifier score as the $k$th feature is permuted $M$ times.

For tabular data, let $D_X = [X_1 \cdots X_n]^T \in \mathbb{R}^{n \times d}$ be the data matrix with $n$ samples of dimension $d$. Each row of $D_X$ represents a sample and each column a feature. Let $\mathrm{Score}(D_X)$ be the average classification performance score of the classifier $F$ on data $D_X$. The importance of the $k$th feature is given by

\[
\mathrm{PERM\,imp}(k) = \mathrm{Score}(D_X) - \frac{1}{M} \sum_{i=1}^{M} \mathrm{Score}\big(D_X^{(k,i)}\big),
\tag{2.2}
\]

where $\mathrm{Score}(D_X^{(k,i)})$ is the performance score on the $i$th permutation of $D_X$. In each permutation, the values in the $k$th column of $D_X$ are randomly shuffled. The permutation procedure is detailed in Algorithm 2.1.

Algorithm 2.1 Permutation Importance
1: Input: Trained model ($F$), Training set ($D_X$), Number of permutations ($M$).
2: Calculate the reference performance $\mathrm{Score}(D_X)$ on the original data.
3: for each input feature $k$ do
4:    for $i \leftarrow 1$ to $M$ do
5:       Randomly shuffle the values of the $k$th feature in $D_X$.
6:       Calculate the model performance $\mathrm{Score}(D_X^{(k,i)})$ on the shuffled data.
7:    end for
8:    Calculate the importance as the average drop in score over permutations of the $k$th feature (2.2).
9: end for
10: Scale importances using $l_1$ normalization.
11: return Normalized permutation importance.
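To make the permutation procedure concrete, the following is a minimal, self-contained sketch in Python. The scoring function, the toy classifier, and the use of accuracy as the score are illustrative assumptions for this sketch, not the dissertation's implementation.

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_permutations=10, rng=None):
    """Estimate (2.2): average drop in score when each feature column is shuffled."""
    rng = np.random.default_rng(rng)
    baseline = score_fn(X, y)                       # Score(D_X) on the original data
    importances = np.zeros(X.shape[1])
    for k in range(X.shape[1]):                     # each input feature
        drops = []
        for _ in range(n_permutations):             # M permutations of column k
            X_perm = X.copy()
            X_perm[:, k] = rng.permutation(X_perm[:, k])
            drops.append(baseline - score_fn(X_perm, y))
        importances[k] = np.mean(drops)
    return importances / np.abs(importances).sum()  # l1 normalization

# Toy usage: a fixed "classifier" whose decision depends mostly on feature 0.
def toy_predict(X):
    return (1 / (1 + np.exp(-(2.0 * X[:, 0] + 0.3 * X[:, 1]))) > 0.5).astype(int)

def accuracy(X, y):
    return np.mean(toy_predict(X) == y)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = toy_predict(X)  # labels consistent with the toy model
print(permutation_importance(accuracy, X, y, n_permutations=20, rng=1))
```

In this toy setup the first feature dominates the decision, so its permutation importance should be the largest, with the third (unused) feature near zero.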
2.2 Gradient Feature Importance

Studying the impact of a change in an input feature on the predicted output can provide insight into feature importance. Hechtlinger [64] uses the gradient to quantify such an impact: the absolute value of the gradient indicates the magnitude of the change in the predicted output for an infinitesimal change in the input feature. The gradient of $F$ with respect to the input $X$ is

\[
\nabla F(X) = \left[ \frac{\partial F(X)}{\partial x_1} \; \cdots \; \frac{\partial F(X)}{\partial x_d} \right]^T.
\tag{2.3}
\]

This method restricts the choice of model $F$ to differentiable classifiers. The differentiability of deep neural networks depends on the choice of activation function; common activation functions like sigmoid, ReLU, and tanh are differentiable almost everywhere (i.e., everywhere except on a set of measure zero). We use a central difference approach to numerically approximate the gradient of $F$ at $X$ and define

\[
\frac{\partial F(X)}{\partial x_k} = \frac{F(X^{(k+)}) - F(X^{(k-)})}{2\,\delta x},
\tag{2.4}
\]

where

\[
X^{(k+)} = X + \delta x \cdot e_k, \qquad X^{(k-)} = X - \delta x \cdot e_k,
\tag{2.5}
\]

$\delta x \in \mathbb{R}$ is the step size, and for all $k = 1, \dots, d$, $e_k \in \mathbb{R}^d$ is the standard basis vector. The terms $F(X^{(k+)})$ and $F(X^{(k-)})$ are obtained from two forward passes of the model.

The importance of the $k$th feature is then defined as the absolute value of the partial derivative with respect to $x_k$. The sample feature importance for a single test sample is

\[
\mathrm{GRAD\,imp}_S(X_i, k) = \left| \frac{\partial F(X)}{\partial x_k} \right|_{X_i}.
\tag{2.6}
\]

To get the global feature importance, we average the sample feature importances over a test set $\{X_i\}_{i=1}^{n}$:

\[
\mathrm{GRAD\,imp}_G(k) = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\partial F(X)}{\partial x_k} \right|_{X_i}.
\tag{2.7}
\]

The input gradient importance procedure is detailed in Algorithm 2.2.

Algorithm 2.2 Input Gradient Feature Importance
1: Input: Trained model ($F$), Training set ($D_X$).
2: for each training sample $X$ in $D_X$ do
3:    for each input feature $k$ do
4:       Compute the gradient of $F(X)$ w.r.t. $x_k$ (2.4).
5:    end for
6:    Take the absolute value of the gradient vector to get the sample feature importance.
7:    Scale importances using $l_1$ normalization.
8: end for
9: Average over all samples to get the global gradient importance (2.7).
10: return Local and global input gradient feature importance.
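Below is a small sketch of the central-difference gradient importance in (2.4)-(2.7), reusing the idea of a toy probabilistic classifier from the previous sketch. It assumes the model exposes a probability output $F(X)$; the function and parameter names are illustrative.

```python
import numpy as np

def gradient_importance(predict_proba, X, delta=1e-3):
    """Central-difference |dF/dx_k| per sample (2.4)-(2.6), averaged to a global score (2.7)."""
    n, d = X.shape
    local_imp = np.zeros((n, d))
    for k in range(d):
        e_k = np.zeros(d)
        e_k[k] = 1.0
        f_plus = predict_proba(X + delta * e_k)    # F(X^{(k+)}), one forward pass
        f_minus = predict_proba(X - delta * e_k)   # F(X^{(k-)}), one forward pass
        local_imp[:, k] = np.abs((f_plus - f_minus) / (2 * delta))
    local_imp /= local_imp.sum(axis=1, keepdims=True)  # per-sample l1 normalization
    return local_imp, local_imp.mean(axis=0)           # local and global importance

# Toy differentiable classifier: sigmoid of a linear score.
def toy_proba(X):
    return 1 / (1 + np.exp(-(2.0 * X[:, 0] + 0.3 * X[:, 1])))

X = np.random.default_rng(0).normal(size=(200, 3))
_, global_imp = gradient_importance(toy_proba, X)
print(global_imp)  # feature 0 should dominate
```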
2.3 Locally Interpretable Model Agnostic Explanations (LIME)

Another way to get feature importance is to use locally interpretable surrogate models. Linear models are a good choice for surrogates due to their interpretability and low complexity [63]. For the classifier in (2.1), we build locally linear surrogate models $G$. For each $X_i \in \{X_i\}_{i=1}^{n}$, we construct

\[
G_i(X) : X \in \mathcal{N}(X_i) \longrightarrow y \in \mathbb{R},
\tag{2.8}
\]

where $\mathcal{N}(X_i)$ is a set containing samples in the neighborhood of $X_i$. To populate $\mathcal{N}(X_i)$, we sample from a Gaussian distribution centered at $X_i$. The surrogate model has the form

\[
G_i(X) = W_i^T X,
\tag{2.9}
\]

where $W_i = [w_{i,1} \cdots w_{i,d}]^T$. To ensure that $G_i$ approximates the actual model $F$ in the neighborhood of $X_i$, we learn the weights $W_i$ by minimizing the weighted least squares loss

\[
\mathcal{L}(W_i) = \sum_{X \in \mathcal{N}(X_i)} e^{-\|X - X_i\|^2} \, \big\| F(X) - W_i^T X \big\|^2.
\tag{2.10}
\]

For the $i$th test sample, the importance of the $k$th feature is the absolute value of its corresponding weight,

\[
\mathrm{LIME\,imp}_S(X_i, k) = |w_{i,k}|.
\tag{2.11}
\]

In order to estimate global feature importance using local surrogate models, we propose an extension to LIME. The global importance of the $k$th feature is defined as the average of the local importances estimated by each surrogate model,

\[
\mathrm{LIME\,imp}_G(k) = \frac{1}{n} \sum_{i=1}^{n} |w_{i,k}|,
\tag{2.12}
\]

where $w_{i,k}$ is the weight corresponding to the $k$th feature in the $i$th surrogate model. The LIME procedure is detailed in Algorithm 2.3.

Algorithm 2.3 LIME
1: Input: Trained model ($F$), Input samples ($X$), Size of the neighborhood set ($M$).
2: for each sample $X$ do
3:    Generate $M$ random samples around $X$ to get the neighborhood set $\mathcal{N}(X)$.
4:    for each sample $X'$ in $\mathcal{N}(X)$ do
5:       Generate the model prediction $F(X')$.
6:       Compute the distance $\|X - X'\|^2$.
7:    end for
8:    Fit a linear model $G$ (2.9) using the predictions and weights from the previous step (2.10).
9:    Take the absolute values of the coefficients of $G$ to get feature importance (2.11).
10:   Scale importances using $l_1$ normalization.
11: end for
12: Average over all samples to get the global LIME importance (2.12).
13: return Local and global LIME feature importance.
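The following is a simplified sketch of the local surrogate fit in (2.9)-(2.10). It solves the weighted least squares problem in closed form, omits practical details such as an intercept term and feature scaling, and uses illustrative names and a toy model rather than the dissertation's implementation.

```python
import numpy as np

def lime_importance(predict_proba, X, n_neighbors=200, sigma=1.0, rng=None):
    """Local linear surrogates (2.9)-(2.10); returns per-sample |w| (2.11) and their mean (2.12)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    local_imp = np.zeros((n, d))
    for i, x in enumerate(X):
        # Neighborhood N(x): Gaussian perturbations centered at x.
        Z = x + sigma * rng.normal(size=(n_neighbors, d))
        f = predict_proba(Z)                          # black-box predictions F(X')
        w = np.exp(-np.sum((Z - x) ** 2, axis=1))     # proximity weights from (2.10)
        # Weighted least squares: solve (Z^T diag(w) Z) W_i = Z^T diag(w) f
        A = Z.T @ (Z * w[:, None])
        b = Z.T @ (w * f)
        W_i = np.linalg.lstsq(A, b, rcond=None)[0]
        imp = np.abs(W_i)
        local_imp[i] = imp / imp.sum()                # l1 normalization
    return local_imp, local_imp.mean(axis=0)

def toy_proba(X):
    return 1 / (1 + np.exp(-(2.0 * X[:, 0] + 0.3 * X[:, 1])))

X = np.random.default_rng(0).normal(size=(50, 3))
_, global_imp = lime_importance(toy_proba, X, rng=1)
print(global_imp)
```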
2.4 SHapley Additive exPlanations (SHAP)

The Shapley value, a concept from cooperative game theory, forms the basis of SHAP: the classification task is the 'game' and the input features are the 'players'. The Shapley value of a feature is the average marginal contribution of that feature across all possible combinations of features. The Shapley value for a feature $k$ can be calculated as

\[
\phi_k(v) = \sum_{C \subseteq \{1,\dots,d\} \setminus \{k\}} \frac{|C|!\,(d - |C| - 1)!}{d!} \, \big[ v(C \cup \{k\}) - v(C) \big],
\tag{2.13}
\]

where $d$ is the total number of features and $C$ is a subset of features with $|C|$ elements. Here $v(C \cup \{k\})$ is the value function for the coalition $C$ including feature $k$, and $v(C)$ is the prediction for the features present in set $C$, marginalized over the features that are not included in $C$:

\[
v_x(C) = \int F(X)\, d\mathbb{P}_{x \notin C} - E_X\big[F(X)\big].
\tag{2.14}
\]

The exact computation of Shapley values can be computationally expensive, as it involves summing over all possible subsets of features and evaluating the value function for each; this becomes infeasible for a large number of features. To overcome this, a Monte Carlo estimate of the Shapley values can be used [81], which involves randomly sampling coalitions of features. The Monte Carlo estimate of the Shapley value for a feature $k$ is

\[
\phi_k(X) = \frac{1}{M} \sum_{z \in Z'} F\big(h_x(z_{+k})\big) - F\big(h_x(z_{-k})\big),
\tag{2.15}
\]

where $M = |Z'|$ is the number of sampled coalitions. $Z'$ represents a subset of all possible coalitions of features; the prime symbol denotes that these are not actual feature values but binary vectors indicating the presence or absence of each feature in a particular set. Each element $z$ of $Z'$ is an instance with a subset of its features missing; $z_{-k}$ is constructed from $z$ by setting the $k$th indicator off ($z_k = 0$), and $z_{+k}$ is constructed from $z$ by setting the $k$th indicator on ($z_k = 1$). The mapping function $h_x(z)$ creates a synthetic instance by replacing the values indicated by $z$ with the corresponding feature values from the original instance $X$, defined componentwise as

\[
h_x(z)_i =
\begin{cases}
X_i & \text{if } z_i = 1, \\
0 \text{ or the mean of feature } i & \text{if } z_i = 0,
\end{cases}
\qquad i = 1, \dots, d.
\tag{2.16}
\]

As outlined in Algorithm 2.4, the resulting instance $h_x(z)$ is fed into the machine learning model $F$ to get a prediction. The difference between the feature-present prediction $F(h_x(z_{+k}))$ and the feature-absent prediction $F(h_x(z_{-k}))$ is then used to compute the SHAP value for the $k$th feature.

Algorithm 2.4 SHAP
1: Input: Model ($F$), Training set ($D_X$), Instance ($X$), Number of Monte Carlo samples ($M$).
2: for each input feature $k$ do
3:    Sample $M$ random coalitions from all possible combinations of features.
4:    for $z$ in $Z'$ do
5:       Generate the synthetic instance $h_x(z)$ by replacing missing values with expected values from $D_X$ (2.16).
6:       Keep the original value of the $k$th feature to obtain $h_x(z_{+k})$.
7:       Mask the value of the $k$th feature to obtain $h_x(z_{-k})$.
8:       Compute the marginal contribution of the $k$th feature in this coalition, $F(h_x(z_{+k})) - F(h_x(z_{-k}))$.
9:    end for
10:   Average over the set $Z'$ to get the SHAP importance (2.15).
11: end for
12: Scale importances using $l_1$ normalization.
13: return Normalized SHAP importance.
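Here is a simplified Monte Carlo sketch in the spirit of (2.15)-(2.16), replacing absent features with the background (training-set) mean. It is not the KernelSHAP implementation from the shap library; the function names, the toy model, and the choice of background replacement are assumptions made for illustration.

```python
import numpy as np

def shap_monte_carlo(predict_proba, X_background, x, n_samples=200, rng=None):
    """Monte Carlo Shapley estimate (2.15): random coalitions z, with absent
    features replaced by the background mean as in (2.16)."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    background = X_background.mean(axis=0)       # replacement values for "missing" features
    phi = np.zeros(d)
    for k in range(d):
        z = rng.integers(0, 2, size=(n_samples, d))     # random coalitions Z'
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[:, k], z_minus[:, k] = 1, 0              # force feature k present / absent
        h_plus = np.where(z_plus == 1, x, background)   # h_x(z_{+k})
        h_minus = np.where(z_minus == 1, x, background) # h_x(z_{-k})
        phi[k] = np.mean(predict_proba(h_plus) - predict_proba(h_minus))
    return np.abs(phi) / np.abs(phi).sum()              # l1-normalized importance

def toy_proba(X):
    return 1 / (1 + np.exp(-(2.0 * X[:, 0] + 0.3 * X[:, 1])))

rng = np.random.default_rng(0)
X_bg = rng.normal(size=(500, 3))
x0 = np.array([1.0, -0.5, 0.2])
print(shap_monte_carlo(toy_proba, X_bg, x0, rng=1))
```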
However, most of these models rely on estimated normal values for flow variables, which may not accurately reflect patient-specific hemodynamic conditions. This limitation can affect the accuracy of FFR estimation and, consequently, the assessment of CAD risk.

We hypothesize that incorporating patient-specific values for hemodynamic parameters, specifically patient-specific blood flow, could improve the estimation accuracy of non-invasive FFR models. Patient-specific blood flow can be particularly valuable in patients with microvascular dysfunction, a condition that can affect the blood flow in the coronary arteries and is not captured by traditional FFR measurements [86]. To test this hypothesis, we perform a comparative study in which we evaluate the estimation accuracy of a variety of FFR estimation models against known ground-truth FFR values. We construct models with varying levels of patient-specific information, both with and without patient-specific blood flow.

However, obtaining patient-specific blood flow information requires additional imaging studies, specifically CT perfusion imaging. CT perfusion imaging provides detailed information about the blood flow in the coronary arteries, which can be used to estimate the patient-specific flow rate. Since the additional imaging carries radiation and monetary costs, it is crucial to quantify the added benefit of using patient-specific blood flow relative to other inputs. This quantification can help justify the use of additional imaging studies and guide the development of cost-effective strategies for non-invasive FFR estimation. To quantify the relative importance of using patient-specific blood flow, we use a machine learning based FFR estimator and perform feature importance analysis using a variety of feature importance estimation methods.

3.2 Methods

To assess the benefit of using patient-specific blood flow for assessment of CAD risk, we set up a virtual clinical trial. We simulated a section of the left anterior descending (LAD) artery with stenosis for 60 reference patients with varying rates of arterial blood flow at stress and different stenosis geometries. We then constructed three FFR estimation models relying on different levels of patient-specific information: 1) FFR_G, the most primitive model, relies on just the geometric data; 2) FFR_N predicts FFR using normal values for the flow parameters in the Navier-Stokes equations; and 3) FFR_P uses patient-specific values for the flow parameters in the Navier-Stokes equations to estimate FFR. The ground truth values of fractional flow reserve, FFR_GT, were calculated using high-fidelity computational fluid dynamics simulations. For the analytical FFR models, we approximated the stenosis geometry with a blunt plug in a constant-diameter artery as shown in Figure 3.1. Table 3.2 lists the parameters that define the patient stenosed LAD model.

Table 3.1: Description of the FFR estimation models used in the comparative study.
Symbol | Name | Description
FFR_GT | Ground truth | 3D computational fluid dynamics simulation.
FFR_G | Geometric only | Model relying only on the geometry of the stenosis.
FFR_N | Normal flow | Flow-based model using a constant normal blood flow rate.
FFR_P | Patient-specific flow | Flow-based model using patient-specific blood flow rates.

Figure 3.1: Model of a single blunt-plug stenosis in the artery.

3.2.1 Geometric Model (FFR_G)

The geometry-only model relied solely on the stenosis geometry to obtain a rough estimate of the FFR across the stenosis.
FFR_G was calculated as the ratio of the minimum stenosis diameter to the unobstructed artery diameter,

$$FFR_G \triangleq \frac{D_1}{D_0}. \qquad (3.1)$$

Table 3.2: Geometric and flow parameters used in analytical FFR estimation models.
Symbol | Description | Units
U_0 | Average velocity in unobstructed artery | m/s
P_0 | Upstream flow pressure in unobstructed artery | Pa
D_0 | Diameter of unobstructed artery | m
L_0 | Length of upstream unobstructed artery | m
P_1 | Flow pressure proximal to the stenosis | Pa
D_1 | Minimum stenosis diameter | m
L_1 | Length of stenosis | m
P_2 | Flow pressure distal to the stenosis | Pa

3.2.2 Flow-Based Model (FFR_N, FFR_P)

The flow-based FFR estimation models used a simplified Navier-Stokes equation for incompressible Newtonian blood flow in a rigid artery of constant circular cross-section, as illustrated in Figure 3.1. The flow was assumed to be fully developed and steady. The pressure drop ΔP across a single blunt-plug stenosis, calculated as the sum of the viscous and expansion losses in the flow [87], is given by

$$\frac{\Delta P}{\rho U_0^2} = \frac{K_\nu}{Re_0} + \frac{K_t}{2}\left(\frac{A_0}{A_1} - 1\right), \qquad (3.2)$$

$$Re_0 = \frac{\rho U_0 D_0}{\mu}, \qquad (3.3)$$

where ρ is the density of blood, μ is the dynamic viscosity of blood, K_ν and K_t are the viscous and expansion loss coefficients, respectively, Re_0 is the unobstructed Reynolds number, and A_0 and A_1 are the unobstructed and stenosed cross-sectional areas, respectively. The loss coefficients were computed using the empirical relationships

$$K_\nu = \frac{32\,(0.83\,L_1 + 1.64\,D_1)}{D_0}\left(\frac{A_0}{A_1}\right)^2, \qquad (3.4)$$

$$K_t = 1.52, \qquad (3.5)$$

given in [88]. Finally, the flow-informed FFR was defined as the ratio of distal pressure to proximal pressure,

$$FFR_{N/P} \triangleq \frac{P_2}{P_1}, \qquad (3.6)$$

$$FFR_{N/P} = \frac{\Delta P + P_1}{P_1} = \frac{1}{P_1}\left[\left(\frac{K_\nu}{Re_0} + \frac{K_t}{2}\left(\frac{A_0}{A_1} - 1\right)\right)\rho U_0^2 + P_1\right]. \qquad (3.7)$$

In the case where ΔP was computed using the patient-specific blood flow rate (U_0P), equation (3.7) returned FFR_P, and in the case where ΔP was computed using a constant normal blood flow rate (U_0N), equation (3.7) returned FFR_N. P_1 was calculated from a reference normal aortic pressure P_0 using the Hagen-Poiseuille equation for fully developed flow,

$$P_1 = P_0 - \frac{32\,\mu L_0 U_0}{D_0^2}. \qquad (3.8)$$
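The analytical model in (3.2)–(3.8) can be written in a few lines of code. The sketch below is illustrative only: the function name and the example geometry and pressure values are assumptions, and the pressure drop is treated as a positive loss that is subtracted from the proximal pressure when forming the distal-to-proximal ratio.

```python
import math

def analytical_ffr(U0, D0, D1, L0, L1, P0,
                   rho=1050.0,    # blood density [kg/m^3], Table 3.3
                   mu=0.0028):    # blood dynamic viscosity [kg/(m s)], Table 3.3
    """Analytical FFR across a single blunt-plug stenosis, Eqs. (3.2)-(3.8).

    Pass the patient-specific flow velocity U0 to obtain FFR_P, or a fixed
    normal velocity to obtain FFR_N. All quantities are in SI units.
    """
    A0 = math.pi * D0**2 / 4.0                 # unobstructed cross-sectional area
    A1 = math.pi * D1**2 / 4.0                 # stenosed cross-sectional area
    Re0 = rho * U0 * D0 / mu                   # Eq. (3.3)

    Kv = 32.0 * (0.83 * L1 + 1.64 * D1) / D0 * (A0 / A1) ** 2   # Eq. (3.4)
    Kt = 1.52                                                    # Eq. (3.5)

    dP = (Kv / Re0 + Kt / 2.0 * (A0 / A1 - 1.0)) * rho * U0**2   # Eq. (3.2), pressure loss
    P1 = P0 - 32.0 * mu * L0 * U0 / D0**2                        # Eq. (3.8), Hagen-Poiseuille
    P2 = P1 - dP                                                  # distal pressure (sign convention assumed)
    return P2 / P1                                                # Eq. (3.6)

# Example: resting LAD flow (U0 = 0.040 m/s, see Table 3.4) with an assumed
# aortic reference pressure of ~13.3 kPa (100 mmHg) and an illustrative geometry.
print(analytical_ffr(U0=0.040, D0=4.6e-3, D1=2.0e-3, L0=0.02, L1=0.01, P0=13_300.0))
```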
3.2.3 Computational Fluid Dynamics Model (FFR_GT)

To generate reference ground truth values of FFR, we relied on high-fidelity computational fluid dynamics simulations of Newtonian blood flow in a non-rigid artery. The simulation was conducted using Fluent software, version 20.1.0, in 3D space with a steady-state time setup.

3.2.3.1 Geometry

We restricted our analysis to stenosed sections of the LAD artery. For the sake of simplicity, we fixed the geometric parameters of the LAD artery (D_0 = 4.6 mm) and varied the geometric parameters L_1, D_1 of the stenosis across different patients.

Figure 3.2: Illustration of the three-dimensional blunt-plug arterial stenosis modeled using Ansys.

3.2.3.2 Materials

The blood flow through the artery was assumed to be laminar, and structural and thermal considerations were not taken into account in this study. Material properties used for blood and the arterial wall were taken from [89] and are detailed in Table 3.3.

Table 3.3: Material properties for blood and arterial walls used in the CFD simulation.
Property | Blood | Artery wall | Units
Density | 1050 | 1075 | kg/m^3
Specific heat | 3490 | 3490 | J/(kg K)
Thermal conductivity | 0.549 | 0.476 | W/(m K)
Viscosity | 0.0028 | - | kg/(m s)
Molecular weight | 28.966 | - | kg/kmol

3.2.3.3 Solver Settings

The computational fluid dynamics simulation was conducted using Fluent version 20.1.0 to analyze blood flow behavior within a solid environment. The process involved solving the governing equations for fluid flow, and the absolute velocity formulation was activated in the numerical setup. The pressure-velocity coupling used the SIMPLE algorithm, and a V-cycle solver was implemented for the pressure variable. Discretization involved a second-order scheme for pressure and a second-order upwind scheme for the momentum equations. The simulation assumed laminar flow and did not take into account heat transfer, solidification, melting, species transport, pollutants, or structural effects. Relaxation factors were applied to various simulation variables.

3.2.3.4 Boundary Conditions

Boundary conditions were assigned to different zones within the computational domain. The arterial wall was assigned a no-slip condition, with the velocity and shear stresses set to zero at the wall. For inflow, we specified the inlet velocity profile based on a fully developed flow assumption. The inlet velocity at position (x, y) on the cross-sectional inlet plane was defined as

$$U(x, y) = U_{maxP}\left(1 - \frac{4\,(x^2 + y^2)}{D_0^2}\right), \qquad (3.9)$$

where U_maxP is the patient-specific maximum value of the velocity at the midline of the cross-section, calculated from the patient-specific flow rate inputs detailed in Table 3.4. The outflow boundary was a pressure outlet with a target mass flow rate. The target outlet mass flow rate was set equal to the inlet mass flow rate given in Table 3.4.

Table 3.4: Range of simulated patient-specific values of blood flow through the LAD artery.
Condition | Aortic blood flow Q_Aorta (mL/min) | LAD blood flow Q_LAD ≈ Q_Aorta/3 (m^3/s) | LAD mass flow (kg/s) | U_0P (m/s) | U_maxP (m/s)
At rest | 120.0 | 6.67e-07 | 7.07e-04 | 0.040 | 0.080
At stress 1 | 240.0 | 1.33e-06 | 1.41e-03 | 0.080 | 0.161
At stress 2 | 360.0 | 2.00e-06 | 2.12e-03 | 0.120 | 0.241
At stress 3 | 480.0 | 2.67e-06 | 2.83e-03 | 0.161 | 0.321

3.2.4 Comparative Analysis

We used the CFD solutions of blood flow in stenosed LAD arteries for 60 patients to obtain the reference FFR_GT. For assessment of CAD risk, the patients were classified into high versus low risk of CAD by thresholding FFR_GT using

$$\text{CAD risk} = \begin{cases} \text{High risk}, & FFR < 0.8, \\ \text{Low risk}, & FFR \geq 0.8. \end{cases} \qquad (3.10)$$

To conduct a comparative analysis of the different analytical FFR estimation models, we generated a simulated population of 10,000 patients based on the original group of 60 patients by incorporating measurement noise into the patient-specific flow and geometry parameters. This was done to introduce realistic variations. The sampling was performed with prevalence weighting, ensuring that the resulting population reflected a realistic patient population with a prevalence of high-risk CAD set at 63%. Subsequently, the three models, namely FFR_G, FFR_P, and FFR_N, estimated the FFR values for each sample. Next, these FFR values were used to predict the risk of CAD by classifying the samples into high and low risk using (3.10).

Figure 3.3: Schematic of the virtual clinical trial to quantify the added benefit of using patient-specific blood flow rate for CAD assessment.
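As a complement to the schematic in Figure 3.3, the following is a minimal sketch of how the comparative evaluation could be scripted. The noise level, the helper name simulate_population, and the use of scikit-learn's roc_auc_score are illustrative assumptions rather than the exact pipeline used in this study, and the FFR arrays shown are toy values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cad_risk_label(ffr, threshold=0.8):
    """Eq. (3.10): 1 = high risk (FFR < 0.8), 0 = low risk."""
    return (np.asarray(ffr) < threshold).astype(int)

def simulate_population(base_params, n_samples=10_000, noise_level=0.05, rng=None):
    """Resample the reference patients and add multiplicative measurement noise
    to the flow and geometry parameters (illustrative stand-in for the actual sampling)."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(base_params), size=n_samples)
    sampled = base_params[idx]
    noise = rng.normal(1.0, noise_level, size=sampled.shape)
    return sampled * noise

# Example evaluation: compare an FFR estimate against ground-truth CAD risk labels.
# `ffr_gt` and `ffr_est` would come from the CFD solutions and Eq. (3.7), respectively.
ffr_gt = np.array([0.65, 0.85, 0.92, 0.74])        # toy ground-truth FFR values
ffr_est = np.array([0.60, 0.80, 0.95, 0.78])       # toy model estimates
y_true = cad_risk_label(ffr_gt)
auc = roc_auc_score(y_true, -ffr_est)              # lower FFR -> higher risk score
print(f"AUC = {auc:.2f}")
```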
By comparing the classification performance of these models, we aim to demonstrate the added benefit of incorporating patient-specific blood flow information in assessing CAD risk. Figure 3.3 shows the schematic of this comparative study.

3.3 Results

The computational fluid dynamics and analytical models were used to simulate blood flow and estimate FFR across the stenosis in all simulated patients. Figure 3.4 shows the velocity profiles of the blood flow in the stenosed coronary artery after convergence of the CFD solution. To estimate FFR, the average static pressure along different axial cross-sections was plotted, which resulted in a static pressure profile along the centerline of the stenosed artery, as depicted in Figure 3.5. This figure shows a significant drop in pressure, indicative of the presence of the stenosis.

Figure 3.4: Velocity profiles of the blood flow along different cross sections in an unobstructed artery.

Figure 3.5: Mean static pressure values along the arterial centerline, as determined by the converged CFD solution. The term curve length refers to distance along the length of the artery.

Figure 3.6: Flow solution of the analytical model FFR_P implemented in Matlab; (a) shows the stenosed geometry, which is a 2D projection of the geometry used for the CFD analysis in Figure 3.5; (b) shows the static pressure along the artery, where the red dots represent the probes at which the proximal and distal pressures were measured for calculating FFR.

Figure 3.6 displays the pressure solution along the stenosed artery, obtained using the simplified Navier-Stokes model FFR_P. For identical geometric and flow inputs, the static pressure estimates from FFR_P align closely with the CFD results, as seen in Figures 3.5 and 3.6.

The estimated FFR values from all models were used to categorize patients into high and low risk of CAD, and the resulting classifications were compared with the ground truth values. We conducted the experiments at varying levels of measurement noise, plotting the receiver operating characteristic (ROC) curves for patient classification into high and low CAD risk categories using the three proposed models at each noise level; Figure 3.7 shows these curves. Of the three models, the patient-specific flow-informed FFR_P demonstrates the best classification performance, followed by the normal flow-informed FFR_N, which holds a slight advantage over the purely geometric FFR_G, which lacks any information on flow dynamics. The marked difference in the area under the curve (AUC) for the patient-specific model underscores the importance of using patient-specific flow data when determining the risk of CAD.

3.4 Limitations and Improvements

We were able to demonstrate that adding patient-specific blood flow information improves classification accuracy. However, this analysis was limited in its ability to quantify the importance of patient-specific blood flow in comparison to other anatomical features. In order to quantify the relative contribution of patient-specific blood flow and other geometric inputs towards assessment of CAD risk, we performed a feature importance analysis, using the feature ranking tools described earlier to estimate the importance of each feature separately.
We con- structed and trained a multilayered perceptron (MLP) based binary classifier that classified patients into high versus low risk of CAD. The classifier was trained on ground truth CAD risk labels and the following features from the 28 (a) 5% noise (b) 10 % noise (c) 20 % noise Figure 3.7: ROC curves for classification into high risk versus low risk for CAD, for varying levels of noise. simulated population: 𝑈0𝑃, 𝐷1, 𝐷0, 𝐿1, and 𝐷 𝑠𝑡𝑛 △ = 𝐷1 𝐷0 . Note that the constant normal flow rate 𝑈0𝑁 was not used because it gets scaled to zero during data pre-processing. The trained model had an average test AUC of 0.98 with 0.003 standard deviation. (a) Confusion matrix (b) Receiver operating characteristic curve Figure 3.8: Performance summary of the trained CAD risk classifier on test set. 29 We performed feature importance analysis using permutation, input gradient, LIME, and SHAP to quantify the relative importance of the flow and geometric variables. We also computed the average importance generated from the four estimation methods as an additional measure of im- portance. Figure 3.9: Relative importance of patient-specific features for classification of CAD risk. Table 3.5: Average feature importance across all methods. Feature Length of stenosis Stenosed diameter Patient-specific blood flow Upstream unobstructed diameter Percent stenosis Symbol Average Importance 𝐿1 𝐷1 𝑈0𝑃 𝐷0 %𝑠𝑡𝑛 0.159 0.065 0.451 0.034 0.291 To validate the estimated importances, we used ’validation by agreement’: measuring the concordance among the importances generated by a variety of importance estimation methods. The level of agreement between these methods was quantified using cosine similarity and root- mean-square Error (RMSE). Figure 3.9 shows the relative importance of the patient-specific features returned by different methods. 30 Table 3.6: Pairwise cosine similarity and RMSE between normalized input importance estimated by permutation (PERM), input gradient (GRAD), LIME, SHAP, and average of the four methods (AVG) for the CAD risk classifier. Cosine Similarity GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG 1.00 1.00 0.99 1.00 1.00 0.99 1.00 0.98 1.00 1.00 0.97 0.98 1.00 0.98 0.99 1.00 1.00 0.98 1.00 1.00 1.00 0.99 0.97 1.00 1.00 RMSE GRAD PERM LIME SHAP AVG 0.02 0.02 0.04 0.01 0.00 0.00 0.03 0.07 0.02 0.02 0.02 0.02 0.05 0.00 0.01 0.03 0.00 0.06 0.02 0.02 0.07 0.06 0.00 0.05 0.04 The results in Table 3.6 demonstrate a high level of agreement among the feature importances generated by the different methods, thereby affirming the validity of our findings. Using the feature ranking methods we were able to successfully quantify the relative importance of patient- specific features for classification of CAD and illustrate the impact of using various patient-specific anatomical and flow features. 3.5 Summary This chapter presents a medically relevant classification problem, CAD detection, and also a virtual clinical trial to evaluate different machine learned classification models. This problem also provides a test platform for evaluating the four conventional feature importance estimation approaches advanced in the following chapters. Results demonstrate successful identification of CAD risk with different models and, moreover, successful ranking of input feature importance that aligns with expert intuition. 31 CHAPTER 4 UNIFIED FRAMEWORK FOR MULTIMODAL FEATURE IMPORTANCE The work in this chapter contributed towards the following: • M. Azmat, A. Alessio. 
’Feature Importance Estimation Using Gradient Based Method for Multimodal Fused Neural Networks’. Presented at IEEE NSS-MIC-RTSD, 2022. • M. Azmat, A. Alessio. ‘Adaptable Feature Importance Estimation Framework for Fusion- based Multimodal Deep Neural Networks’. Presented at SNMMI Annual Meeting, 2023. 4.1 Introduction Multimodal neural networks (MNNs) are machine learning models that analyze data from multiple modalities, such as images, text, and audio. By combining multiple sources of information, multimodal models can often achieve higher performance than models that rely on a single modality [90, 91, 92]. In recent years, deep learning-based multimodal NNs have shown great potential as decision support systems by generating predictions that offer a comprehensive view, enhancing decision-making accuracy [93, 94, 95, 96]. Popular architectures of multimodal models are based on additive approaches. That is, input or learned features from different modalities are aggregated to make a decision such as in ensemble- based models [97], joint training models [98], and fusion-type models [99]. Fusion-type models are the most popular choice for multimodal architecture. Depending on where the modalities are fused, these models are classified as early-fusion, late-fusion, or joint- fusion models. Early fusion directly concatenates raw input features before passing them to a single neural network [100, 101]. This approach is simple but lacks explicit modeling of interactions between modalities [102]. Late fusion uses separate models for each modality and combines their outputs. Voting, stacking, and mixture of experts are common late fusion techniques. However, these models often do not effectively leverage cross-modal relationships during training [103]. Joint 32 Figure 4.1: Model architecture for various multimodal fusion strategies. The left diagram illustrates early fusion, where original or extracted features are merged at the input level. The middle diagram represents hybrid or joint fusion, where original or extracted features are combined at the input level and the model is trained end-to-end. The right diagram shows late fusion, where predictions are consolidated at the decision level. Used under CC 4.0 from [4]. or hybrid fusion models incorporate both early and late fusion. For example, separate encoders can extract features from each modality, followed by fusion layers and joint training [104]. In the healthcare domain, the ability of MNNs to integrate different types of data such as electronic health records, medical images, and genetic information provides substantial advantages. MNNs have demonstrated their efficacy in medical image analysis, where images supply visual context and text contributes descriptive insights. The fusion of data from MRI, PET scans, and electronic health records can enhance brain tumor diagnosis and prognosis predictions [4]. Similarly, MNNs that incorporate fundus images, OCT scans, and clinical data have proven useful in facilitating diabetic retinopathy screening [105]. In patient monitoring scenarios, data streams from bedside equipment, wearable devices, and medical records can be synthesized to predict adverse events or disease trajectories [106]. 33 Despite the promising advancements of MNNs, their complex architectures present unique challenges for explainability. 
While techniques like attention mechanisms and feature attribution can help explain predictions generated by unimodal neural networks, they often fall short with architectures that integrate cross-modal interactions. The complexity introduced by heterogeneous input types, feature blending, and high-dimensional latent spaces in multimodal networks can further obfuscate the decision-making process. The development of explainability methods capable of addressing the specific challenges of multimodal learning continues to be an active research area. Recent studies have investigated both intrinsic and post-hoc explanation techniques suited to these models. Intrinsic methods aim to construct intrinsically interpretable model architectures. For instance, using attention scores as a proxy for feature importance in late fusion models for detecting hate speech from multimodal textual, cultural, and social data [107]. Post-hoc techniques based on occlusion, gradients, and perturbations have been explored for multimodal visual question answering (VQA) models. Authors of [108] propose perceptual score: a perturbation based metric that can be used to probe multimodal models and understand their reliance on different input data types for VQA data. Multimodal explainability methods for medical applications are limited and often rely on fixed, pre-trained, modality-specific models, acting as feature extractors [109]. Alternately, they employ naive early fusion schemes [94, 110]. These approaches fail to fully leverage the capabilities of multimodal learning as they are heavily reliant on fixed pre-trained encoders, rather than learning cross-modal representations from scratch. As a result, the potential for explainability is limited, leaving interactions between modalities largely unexplored. Other multimodal implementations use medical images from different clinical modalities such as T1-weighted and T2-weighted MRIs. However, given the homogeneous nature of the input data structure, these approaches are not truly multimodal [111]. In summary, the development of explainable AI for multimodal learning in medical applications continues to pose a significant challenge. Existing methods are often restricted to single modalities or lack a comprehensive evaluation. There is a need for new techniques that can explain interactions 34 Table 4.1: Overview of Nomenclature in Multimodal Neural Networks. Term Definition Modality Refers to the different types of data sources used in a multimodal network. Input Refers to raw data that characterizes different attributes of the patient. Inputs can originate from various sources and have different data types. Example Images, Tabular data, Text. MRI image, Age, Blood pressure. Features Represent the measurable properties or characteristics of the input data that are relevant to the task at hand. These can be obtained from inputs using models or preprocessing techniques. Radiomic features, One- hot encoded categorical features. Deep Features Fusion Features Denote the high-level learned representations obtained after passing the feature through multiple layers of a neural network. Convolutional features. Specifically refers to deep features in the fusion layer where features from more than one modality are being fused. Fusion of clinical data and MRI: Concatenated deep feature vector. between modalities. Adapting these solutions to meet clinical needs and applications is a crucial direction for the future of multimodal explainable AI in healthcare. 
Our proposed solution is a framework for explaining the functionality of multimodal neural networks. Our framework adapts concepts from unimodal feature importance and modifies them for a multimodal model. We employ a hybrid fusion architecture where the model is trained end-to-end, enabling us to learn task-relevant features and cross-modal relationships. Our approach uses truly multimodal and heterogeneous input data, such as images, tabular data, and categorical features. The details of our proposed framework are described in the following section.

4.2 Proposed Framework

Our framework relies on a hybrid fusion architecture for multimodal learning, as shown in Figure 4.2. Each input uses a modality-specific module that can be treated as a feature extractor. Deep features from the feature extractors are concatenated in the fusion layer. The fusion layer is used as an input to a fully connected network that generates model predictions. All weights are learned through end-to-end training. Table 4.1 gives definitions and examples of the nomenclature used in this thesis and in the literature in general.

Figure 4.2: Proposed method for multimodal feature importance. A hybrid fusion architecture supporting multimodal inputs is trained in an end-to-end manner. Features in the fusion layer are used to estimate feature importance of the upstream inputs. The post-fusion architecture typically consists of fully connected layers. The feature importance module can be replaced with any post-hoc attribution method.

Since the post-fusion neural network block in Figure 4.2 takes in homogeneous fusion features from the shared representation space and returns model predictions, we can treat it as a unimodal model. As a result, we can modify the unimodal feature importance methods, described in Chapter 2, to quantify multimodal input importance. Unlike prior use, we apply these methods at the fusion layer to estimate the importance of the fusion features, then aggregate all of the fusion feature importances from a contributing input to estimate the importance of each input.

We will now formalize the proposed framework. All methods discussed are model-agnostic and can be adopted for multi-class classification with minor modifications. Consider a binary multimodal classification problem:

$$\Theta : \mathbf{X} \longrightarrow y \in \mathbb{R}, \qquad (4.1)$$

where Θ represents the multimodal classifier and X = [X_1 ··· X_m]^T is the multimodal input consisting of m sub-inputs X_i, which can have different dimensions R^{d_i} depending on their modality.

Using the hybrid fusion architecture in Figure 4.2, each input is passed through a modality-specific feature extractor f_i to generate the features Z_i corresponding to that input,

$$Z_i = f_i(X_i). \qquad (4.2)$$

These features are then concatenated in the fusion layer to generate the fusion features

$$\mathbf{Z} \triangleq [Z_1 \cdots Z_m]^T, \qquad (4.3)$$

which are passed as inputs to the classifier F to generate predictions y,

$$y = F(\mathbf{Z}), \qquad (4.4)$$

$$\Theta(\mathbf{X}) = F\!\left([f_1(X_1) \cdots f_m(X_m)]^T\right). \qquad (4.5)$$
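To make the notation in (4.1)–(4.5) concrete, the snippet below sketches a hybrid fusion model in PyTorch with one image encoder and one tabular encoder. The class name, layer sizes, and two-modality setup are illustrative assumptions rather than the exact architectures used in later chapters.

```python
import torch
import torch.nn as nn

class HybridFusionClassifier(nn.Module):
    """Minimal sketch of Eq. (4.5): y = F([f_1(X_1), ..., f_m(X_m)])."""

    def __init__(self, tab_dim=4, fused_per_modality=10):
        super().__init__()
        # f_1: small CNN encoder for a 1-channel 28x28 image input
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(fused_per_modality),
        )
        # f_2: linear encoder for tabular features
        self.tab_encoder = nn.Linear(tab_dim, fused_per_modality)
        # Normalize each modality's fusion features before concatenation
        # (as discussed later in this chapter)
        self.norm_img = nn.BatchNorm1d(fused_per_modality)
        self.norm_tab = nn.BatchNorm1d(fused_per_modality)
        # F: post-fusion fully connected classifier operating on Z
        self.post_fusion = nn.Sequential(
            nn.Linear(2 * fused_per_modality, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x_img, x_tab):
        z1 = self.norm_img(self.image_encoder(x_img))   # Z_1 = f_1(X_1), Eq. (4.2)
        z2 = self.norm_tab(self.tab_encoder(x_tab))     # Z_2 = f_2(X_2)
        z = torch.cat([z1, z2], dim=1)                  # Z = [Z_1, Z_2], Eq. (4.3)
        return torch.sigmoid(self.post_fusion(z))       # y = F(Z), Eq. (4.4)

# Example forward pass on a random batch.
model = HybridFusionClassifier()
y = model(torch.randn(8, 1, 28, 28), torch.randn(8, 4))
print(y.shape)  # torch.Size([8, 1])
```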
4.2.1 Multimodal Permutation Importance

Let Score(·) denote a function that returns the average classification performance score of the classifier F on a set of inputs, and let S_Z = {Z : ∀ X ∈ S_N} be the set of fusion features generated from a multimodal input set S_N with N samples. The permutation importance of the jth fusion feature is given by

$$\text{PERM imp}(j) = \text{Score}(\mathcal{S}_Z) - \frac{1}{N_p} \sum_{i=1}^{N_p} \text{Score}\big(\mathcal{S}_{Z(j,i)}\big), \qquad (4.6)$$

where S_{Z(j,i)} is the ith permutation of S_Z in the jth fusion feature, and N_p is a hyperparameter that controls the number of permutations. In each permutation, values of the jth feature are randomly shuffled across the set. The importance of the kth multimodal input is computed by aggregating the PERM importances of the fusion features from input k,

$$\text{MM-PERM imp}(k) = \sum_{j \,\text{from input}\, k} \big|\text{PERM imp}(j)\big|. \qquad (4.7)$$

4.2.2 Multimodal Input Gradient Importance

The importance of input k can be approximated by aggregating the gradient-based importances of the fusion features from input k, averaged over the set S_Z,

$$\text{MM-GRAD imp}(k) = \frac{1}{N} \sum_{\mathbf{Z} \in \mathcal{S}_Z} \left\| \frac{\partial F(\mathbf{Z})}{\partial Z_k} \right\|_1, \qquad (4.8)$$

where Z_k is the vector of deep features from input k. Note that the l_1 norm sums the absolute values of the gradients with respect to all fusion features coming from input k.

4.2.3 Multimodal LIME

Let G be a surrogate model for the classifier F in the local neighborhood of the sample Z, given by

$$G(\mathbf{V}) = \mathbf{W}^T \mathbf{V}, \qquad (4.9)$$

where V is sampled from a multivariate Gaussian centered at Z. To ensure that G approximates the actual model F locally, we learn the weights W by minimizing the weighted least squares error

$$\mathcal{L}(\mathbf{W}) = \sum_{\mathbf{V}} e^{-\|\mathbf{V} - \mathbf{Z}\|^2} \left\| F(\mathbf{V}) - \mathbf{W}^T \mathbf{V} \right\|^2, \qquad (4.10)$$

where W = [W_1 ··· W_m]^T is the vector of learned coefficients and W_k is the sub-vector containing the LIME coefficients for the fusion features Z_k extracted from input X_k. The importance of the kth input is computed by aggregating the coefficients corresponding to features from input k,

$$\text{MM-LIME imp}(k) \triangleq \frac{1}{N} \sum_{\mathbf{Z} \in \mathcal{S}_Z} \|W_k\|_1. \qquad (4.11)$$

4.2.4 Multimodal SHAP

Shapley values use the net contribution of a feature across all combinations of feature interactions to quantify its importance. The marginal contribution of a feature for a particular combination is calculated as the difference in model predictions when the feature is included or excluded. To avoid exploring all possible combinations, a Monte Carlo approximation can be used to sample N_s combinations. The Shapley value based importance of the jth fusion feature is given by

$$\phi_j(\mathbf{Z}) = \frac{1}{N_s} \sum_{i=1}^{N_s} \left[ F\big(mask_i(\mathbf{Z})_{+j}\big) - F\big(mask_i(\mathbf{Z})_{-j}\big) \right], \qquad (4.12)$$

where mask_i(·)_{+j} generates a masked instance by replacing random feature values (selected by the ith Monte Carlo sample) in Z with expected values from the background dataset S_Z while keeping the original value for the jth feature, whereas mask_i(·)_{-j} also replaces the actual value of the jth feature with the masked value from the background dataset. The SHAP importance of the kth multimodal input is estimated by aggregating the Shapley values of the fusion features extracted from the kth input,

$$\text{MM-SHAP imp}(k) = \frac{1}{N} \sum_{\mathbf{Z} \in \mathcal{S}_Z} \;\sum_{j \,\text{from input}\, k} \big|\phi_j(\mathbf{Z})\big|. \qquad (4.13)$$

4.2.5 Multimodal Average Importance

We propose an additional approach, MM-AVG imp, that takes the average of the importances returned by the input gradient, permutation, LIME, and SHAP methods. The key motivation is that each technique approximates model behavior and defines importance slightly differently. In real-world cases without ground truth feature importance, it may be preferable to use a more comprehensive metric that combines multiple notions of importance, rather than relying solely on one definition.
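As an illustration of the aggregation step shared by (4.7), (4.8), (4.11), and (4.13), the sketch below computes the multimodal gradient importance of (4.8) for a post-fusion network like the one sketched earlier. The slicing of the fusion vector into per-input blocks and the helper name mm_grad_importance are assumptions for this example.

```python
import torch

def mm_grad_importance(post_fusion, fusion_features, input_slices):
    """Eq. (4.8): average l1 norm of dF/dZ_k over a batch of fusion features.

    post_fusion     : the post-fusion network F, mapping Z -> prediction.
    fusion_features : tensor of shape (N, total_fusion_dim), already normalized.
    input_slices    : dict mapping input name -> slice of columns in Z from that input.
    """
    z = fusion_features.clone().requires_grad_(True)
    preds = post_fusion(z).sum()             # sum over the batch so grad has the shape of z
    grads = torch.autograd.grad(preds, z)[0].abs()

    raw = {name: grads[:, sl].sum(dim=1).mean().item()   # l1 over features, mean over N
           for name, sl in input_slices.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}  # normalize to sum to one

# Example with a toy post-fusion network and two 10-dimensional fusion blocks.
post_fusion = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU(),
                                  torch.nn.Linear(16, 1), torch.nn.Sigmoid())
Z = torch.randn(128, 20)
print(mm_grad_importance(post_fusion, Z, {"image": slice(0, 10), "tabular": slice(10, 20)}))
```

The same per-input grouping of fusion-feature scores is reused for the permutation, LIME, and SHAP variants; only the underlying per-feature attribution changes.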
4.3 Simulation Platform For real-world problems, true values for feature importance are often unknown. As a solution, we used synthetic classification tasks with predefined decision functions, which allowed us to generate ground truth for feature importances and thereby validating our multimodal input importance methodology. We explored a variety of synthetic classification problems using controlled decision Table 4.2: Synthetic decision functions and the corresponding normalized ground truth input importance. id Decision function Ground truth feature importance 2 1 3 ||𝑍1||1 ||𝑍2||1 ||𝑍3||1 ||𝑍4||1 4 5 (cid:205)𝑖=1,2 ||𝑍𝑖 ||1 6 (cid:205)𝑖=3,4 ||𝑍𝑖 ||1 7 (cid:205)4 𝑖=1 ||𝑍𝑖 ||1 8 𝑒𝑍7,2𝑙𝑛 (cid:0)𝑍1,1 + 𝑍2,1 X1 1 0 0 0 0.5 0 X2 0 1 0 0 0.5 0 (cid:1) 2 0.25 𝑒𝑍 2,7 (𝑍1,1+𝑍1,2) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 4 0.25 (cid:12) (cid:12)𝑒𝑍2,7𝑙𝑛 (cid:0)𝑍1,1 + 𝑍1,2 (cid:1)(cid:12) (cid:12) X3 0 0 1 0 0 X4 0 0 0 1 0 0.5 0.5 0.25 0.25 0 0 Where, 𝑍𝑖,𝑘 is the 𝑖th fusion feature from the 𝑘th input used during ground truth feature importance generation. 39 functions to generate ground truth labels and feature importances for our multimodal data. The decision functions, detailed in Table 4.2, combined encoded multimodal inputs and noise to generate ground truth labels and ground truth input importances. The multimodal inputs were encoded using pre-trained, modality-specific feature extractors tailored to each input data type. It’s important to note that these pre-trained encoders were utilized solely for ground truth generation in our analysis. To ensure fairness and prevent leakage, we used a different architecture in our multimodal classifier. With the analytical form of the decision functions known, ground truth input importances were given by absolute value of gradient of the function with respect to encoded inputs, which were then summed and normalized to get input level importances. This simulation environment allowed us to precisely define the importance of different inputs and thoroughly test the approach across diverse multimodal use cases. Given that true feature importances are not known for real-world datasets, the use of real data with synthetic decision function allowed us to quantitatively validate the proposed methodology for estimating multimodal feature importance against known true values. 4.3.1 Multimodal Data We simulated a variety of synthetic multimodal data sets each containing four inputs for a binary classification task, where X1, X2 were images and X3, X4 were tabular. For image inputs, we used 28 × 28 pixel abdominal CT scan images from OrganA and OrganC medMNIST dataset [112]. Whereas tabular inputs were sampled from a multivariate Gaussian distribution, with one set of inputs drawn from a cross-correlated distribution to model dependent features. Ground truth class labels were generated by thresholding the decision function for a class-balance classification problem. In total, 10 different decision functions simulated 10 different train/test data sets containing 10000 samples each. 4.3.2 Model Architecture We used the same model architecture for all classification problems and retrained the model from scratch each time. We constructed a CNN-based encoder to process image inputs and a 40 standard scaler for tabular data, as summarized in Table 4.3. The image encoder was composed of CNN layers with 6, 16, and 64 filters of size 3×3 and each layer was followed with max pooling using a 2×2 window. 
The convolutional layers were then followed by a fully connected layer that flattened the convolution outputs to a vector of encoded features. The post-fusion architecture was comprised of three fully connected layers that processed the encoded images and standard-scaled tabular inputs, providing the class probabilities. Table 4.3: Description of inputs and encoders used for training the multimodal classifiers for synthetic data. Input Modality Image Image X1 Abdominal CT X2 Abdominal CT X3 X4 Correlated features Independent features Tabular Tabular Encoder (f) CNN CNN Standard Scaler Standard Scaler Encoded Dimension 1 × 10 1 × 10 1 × 10 1 × 10 To guarantee the models’ reliance on the input was not influenced by the scale of fusion features, we implemented batch normalization within the fusion layer. An important implementation detail to note is that the normalization must occur before concatenating the features, so that the gradients are evaluated on normalized fusion features. This ensures that scale does not impact the importance estimates returned by the gradient method. The models used an Adam optimizer with a fixed learning rate 1 × 10−3 for approximately 20 epochs or when the validation accuracy went over 95%. The weights were initialized using default PyTorch initialization. Training was done on a CPU with an average epoch time of approximately 70 seconds. 4.4 Results We used classification accuracy on the test set as a metric to assess if the learned model is a good approximation of the ground truth decision function. This is critical because performance of feature importance methods is limited by the performance of the learned predictive model. All trained models achieve ≥ 92% classification accuracy on independent test data. The side by side comparison of input importances generated by the different methods in Figure 4.3 shows that the 41 importance estimates are consistent across different estimation methods and our approach is able to identify the top contributing inputs in most cases. Figure 4.3: Plots of normalized ground truth versus average feature importance returned by four estimation methods plus an average value across the methods. Each subplot represents a different test case corresponding to decision functions given in Table. 4.2. The predicted feature important values closely estimate known ground truth and display a consistent ranking of features. 42 Table 4.4 shows the percent relative error of predicted and ground truth importances estimated by the different methods averaged over the different decision functions. The percent relative error for all inputs averaged over all decision functions lies within 9% of the ground truth importance. These experiments repeated for multiple decision functions and data establish the validity of our approach for estimating importance of multimodal inputs. In Chapter 5 we demonstrate performance in real data sets where we don’t have access to ground truth importances. Table 4.4: Percent relative error and RMSE in feature importance estimates for the proposed methods compared to synthetic ground truth from synthetic decisions functions employing four multimodal inputs. 
Feature Importance Method Gradient Permutation LIME Shapely value Average Percent relative error RMSE X1 8.67 4.02 10.1 5.07 5.84 X2 X3 7.95 6.50 4.38 8.94 8.18 5.29 5.71 X4 Mean 7.47 6.76 3.00 1.87 1.74 8.83 8.00 5.23 5.46 5.10 5.35 4.87 4.98 0.122 0.088 0.135 0.100 0.084 Average method averages the importances from aforementioned methods into a single unified importance metric. Table 4.5: Pairwise cosine similarity and RMSE between normalized input importance estimated by proposed methods for the synthetic classification problems. Cosine Similarity GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG 0.99 0.98 0.99 0.98 1.00 0.99 0.96 1.00 0.97 0.99 0.97 1.00 0.96 0.94 0.98 1.00 0.97 0.99 0.97 0.99 0.97 0.94 0.97 1.00 0.98 RMSE GRAD PERM LIME SHAP AVG 0.04 0.10 0.06 0.08 0.00 0.00 0.13 0.04 0.10 0.04 0.04 0.15 0.00 0.11 0.06 0.13 0.00 0.15 0.15 0.10 0.10 0.15 0.11 0.00 0.08 43 The input importance methods used in our analysis rely on slightly different definitions of importance, therefore, we want to validate that the importance values returned by these methods are consistent and agree well with each other. To quantify this agreement, we calculated the cosine similarity and root mean squared error (RMSE) between the normalized importances from each method. The cosine similarity results in Table 4.5 are computed across all decision functions, with pairwise similarities between all methods. The values show strong agreement in the normalized importances returned by the different techniques across a variety of decision functions. Additionally, Table 4.5 reports the RMSE of normalized importances between pairs of methods, with the values representing RMSE across all decision functions. The RMSE values indicate consistentcy across our methods. Despite the different theoretical formulations of the methods, they produce aligned rankings and similar normalized importances, as evidenced by the high similarity and low error across methods. 4.5 Summary This chapter proposes methods to apply feature importance estimation at the deep fusion layer and then aggregate those values to estimate input importance for multimodal neural networks. Through controlled simulation experiments, we demonstrate effective and consistent input impor- tance estimation using four different importance estimation techniques. These methods provide information on the relative importance of different inputs in a multimodal model decision. 44 CHAPTER 5 ESTIMATING MODEL RELIABILITY WITH MISSING DATA THROUGH MULTIMODAL IMPORTANCE M. Azmat, H. Fessler, G. Holste, A.Alessio, ‘Predicting Impact of Missing Modalities on Classifica- tion Performance in Multimodal Models: A Unified Framework for Multimodal Input Importance’, Submitted to IEEE Journal of Biomedical and Health Informatics. 5.1 Introduction In recent years, deep learning-based MNNs have shown great potential as decision support systems in healthcare [93, 94, 95, 96]. However, despite the substantial potential of MNNs for medical and clinical applications, the use of multiple modalities also introduces challenges. One leading challenge is how to handle and interpret the impact of missing or incomplete data. In medical settings, it is common to encounter scenarios where some modalities are missing or incomplete, either from incomplete medical records, excessive costs, or potential risks to the patient. 
Previous studies have demonstrated that missing data can result in varying levels of performance reduction for multimodal models [113, 114], but there is a lack of research on methods for predicting the impact of missing modalities in multimodal learning. In this work we develop a method to predict the performance degradation of multimodal deep NNs in cases of one or more missing input modalities. Since model performance reflects its reliability, predicting how performance declines can provide indirect insights about model reliability when inputs are missing. Predicting performance degradation resulting from missing data could have an invaluable role in increasing the interpretablity of MNNs. Additionally, it has the potential to inform decisions about which modalities and tests are critical for specific patients. This could ultimately lead to a reduction in the number of unnecessary imaging and lab procedures, potentially having a significant impact on healthcare costs and patient safety. To demonstrate significance of this approach, consider a MNN trained and deployed to use 5 multimodal inputs (labeled A-E) to perform classification. If this model encounters a patient with only 4 of the inputs (A-D), it would be beneficial to know a priori the potential value of obtaining 45 input (E). If the performance gain of adding input (E) is insignificant, the costs and risks of getting that input can be avoided. Conversely, if the performance gain is significant, the collection of input (E) can be prioritized for the patient. In short, this approach has the potential to enable clinical decision makers to better interpret the performance of machine learned models and ultimately make better informed decisions about any additional information needed for specific patients. Furthermore, knowing the performance of the model in the presence of noisy or missing data is also crucial for the change control plan portion of the FDA proposed regulatory framework for Artificial Intelligence/Machine Learning (AI/ML)-based Software as a Medical Device (SaMD) [32]. The change control plan is intended to ensure that changes to the software do not negatively impact its safety, efficacy, or performance [115]. A priori knowledge of the model performance in the presence of missing data allows for a more accurate assessment of the potential risks associated with changes to the software and enables more accurate post-market surveillance, essential for identifying any problems or issues that may arise after the software has been released. This can help to ensure that the software remains safe and effective for use in clinical settings. This work is based on the hypothesis that the performance degradation of MNNs due to a missing input is correlated with the importance of the missing input. To predict the effects of multimodal mean imputation on classification performance, we propose a two-step method reliant on 1) multimodal feature importance estimation and 2) performance degradation estimation for multimodal missing data. Step 1, multimodal input importance: We use methods developed in Chapter 4 to estimate importance of multimodal inputs. In the absence of ground truth importance values for real data, we employ a suite of distinct feature importance methods and establish a consensus across these approaches. Step 2, performance degradation estimation for multimodal missing data: During inference with a MNN, missing inputs are generally treated with imputation methods to fill in gaps in input data. 
There are several popular methods of input imputation, including mean and median imputation [116], K-nearest neighbors imputation [117], multiple imputation, data augmentation [118], and machine learning based approaches [119]. These methods can be applied to medical datasets where information comes in various forms such as medical images, patient records, and genetic data. The choice of imputation method depends on the specific characteristics of the dataset and the goals of the machine learning model. While there has been substantial research in developing variations of data imputation methods to deal with missing data, there has been limited research on predicting the impact of data imputation on model performance. In this work, we implement and evaluate a method that linearly relates performance degradation to missing input importance.

To summarize, we propose an approach to enhance the interpretability of multimodal models. The modality-level importance estimation step provides insight into the model's dependence on its input data. The second step leverages the importance metrics generated in the first step, using them to shed light on the model's performance limitations and behavior in the absence of certain inputs. Our primary contributions lie in offering a deeper understanding of the model's decision-making process and quantifying its reliability under these circumstances.

5.2 Methods

To predict classification performance in the case of missing or imputed data, we propose a linear model that relates input importance to model performance. This model is based on the hypothesis that input importance is, and should be, proportional to model performance. The predicted classification performance, Score_{X_k}, when the inputs X_k in the set of missing inputs K are imputed (missing) is

$$\text{Score}_{X_k} = \text{Score}_\Phi - \sum_{k \in \mathcal{K}} \text{imp}(\mathbf{X}_k)\,\big(\text{Score}_\Phi - \text{Score}_{\mathbf{X}_1:\mathbf{X}_m}\big). \qquad (5.1)$$

In this relationship, imp(X_k) is the normalized aggregated importance of the kth input estimated from one of the four methods discussed in Chapter 4, Score_Φ is the reference score without input imputation, and Score_{X_1:X_m} is the baseline score when all m modalities are imputed. It should be stressed that the imp(X_k) terms are normalized to sum to one for a given classifier model. The relationship in (5.1) is intuitive: the difference between the two score terms gives the full range of possible performance degradation, which is tempered by the relative importance of the missing inputs. In the extreme case where all modalities are missing, the predicted classification performance reverts to Score_{X_1:X_m}.

A key advantage of this approach is its ability to predict the model's performance degradation with little added cost. For a classification problem with m inputs and N samples, the linear approximation (5.1) yields estimates of model performance with O(m) complexity. Alternately, if we calculate the predictions by performing imputation on the data, the complexity would be

$$\underbrace{O(N)}_{\text{Imputation}} \; + \; \underbrace{O\big(N\,(m l_1 + l_2 l_3 + \dots)\big)}_{\text{Forward pass}}, \qquad (5.2)$$

where l_i is the dimension of the weights in the ith layer. For machine learning models N >> m, so this added benefit becomes more significant for larger datasets with more inputs.
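A minimal sketch of the predictor in (5.1) is shown below. The function name and dictionary-based interface are illustrative assumptions; the importance values passed in are expected to be the normalized multimodal importances from Chapter 4, and the numbers in the example are toy values rather than results from this study.

```python
def predict_missing_input_score(score_full, score_all_imputed, importances, missing):
    """Eq. (5.1): predict classification performance when a set of inputs is imputed.

    score_full        : Score_Phi, performance with no imputation.
    score_all_imputed : Score_{X_1:X_m}, performance with all m inputs imputed.
    importances       : dict of normalized input importances (values sum to one).
    missing           : iterable of input names that are missing / imputed.
    """
    degradation_range = score_full - score_all_imputed
    missing_importance = sum(importances[name] for name in missing)
    return score_full - missing_importance * degradation_range

# Toy example: a model with reference accuracy 0.90 that drops to 0.60 when every
# input is imputed, and a missing image input carrying 40% of the total importance.
imp = {"image": 0.40, "age": 0.20, "labs": 0.25, "history": 0.15}
print(predict_missing_input_score(0.90, 0.60, imp, missing=["image"]))  # 0.78
```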
5.3 Data Collection and Pre-processing We performed our analysis for two real world problems: 1) multimodal breast tumor classifica- tion problem, and 2) multimodal cardiomegaly classification problem. 5.3.1 Breast Tumor Data For the breast tumor classification study, we used fully anonymized data from an IRB-approved study of 5,248 women who had 10,185 breast cancer examinations between July 2005 and November 2015 [96]. Each patient received a dynamic contrast-enhanced magnetic resonance imaging (DCE- MRI) exam, and a subset of patients received a mammogram (76.5%) or underwent breast tissue biopsy (26.8%). We considered each breast as a separate case and cropped the DCE-MRI images to store them as single breast images. We included MRI images without artifacts that had been scored using the Breast Imaging- Reporting and Data System (BI-RADS) and had a known 12-month post-MRI cancer status. Breasts labeled ‘Malignant’ had been diagnosed with breast cancer, confirmed by pathology, either at the time of examination or within 12 months after MRI. All other breasts were labeled as ‘Benign’. We also included additional features, such as patient age, clinical indication for MRI, and background parenchymal enhancement from MRI. To overcome potential biases from artifacts or signal changes caused by biopsy, we transformed 48 the DCE-MRI images to 2D maximum intensity projection (MIP) images, which retain only high contrast enhancement information and effectively remove any artifacts. The MIP images were resized to 224 × 224 pixels, and pixel intensities in the top 0.5% were removed. The remaining intensity values were normalized, and basic information from the images and tabular clinical features were added to the dataset. Based on the unimodal feature importance results from an earlier study [96], we selected a subset of four most significant tabular inputs. Incorporating additional non-imaging features beyond this subset did not yield any notable improvements in the model performance. Consequently, we chose (1) Age: Patient’s age at the time of the MRI study, (2) Max intensity: maximum pixel intensity in the MIP image, (3) Breast Density: Mammographic breast density via BI-RADS assessment (Fibroglandular, Dense, or Extremely Dense), and (4) MRI Indication: Clinical indication for MRI study (Screening, Diagnostic, or Known Cancer). Finally, we created a balanced subset of the dataset consisting of 6,842 breast images and their associated non-image features. The non-image features were normalized, and the dataset was randomly split into three balanced sets: 4,180 cases for training (61.1%), 650 cases for validation (9.5%), and 2,012 cases for testing (29.4%). To avoid possible data leakage, samples from the sample patient were not shared across these sets. 5.3.2 MIMIC Data for Cardiomegaly Classification For the cardiomegaly classification problem, we leveraged the open source, multimodal MIMIC- IV and MIMIC-CXR datasets [120, 121, 122] available through credentialed access. The MIMIC- CXR dataset contains over 377,110 chest radiographs, each affiliated with radiology reports, corre- sponding to 227,835 radiographic studies involving around 65,000 unique patients. MIMIC-CXR contains pre-generated ground truth labels for around 12 diseases derived from the radiology reports using the CheXpert labeler [123]. We include data for studies with definite labels of cardiomegaly present or absent. 
We only used the Posterior-Anterior views of the radiographs that were preprocessed using a center crop on the longer dimension of the image, generating a square image. This was followed by a 49 resizing to 224×224 pixels and normalization to range between [−1024, 1024]. This pre-processing was chosen to generate images that are compatible for use with the pre-trained TorchXrayVision models [124]. Additionally, tabular demographic information matching the radiographs was extracted from MIMIC-IV. The demographic inputs include age, gender (identified as male or female), type of insurance (grouped as Medicaid, Medicare, or others), marital status (noted as divorced, mar- ried, single, or widowed), and ethnicity (categorized into American Indian/Alaska Native, Asian, Black/African American, Hispanic/Latino, White, and other demographics). The final multimodal dataset for cardiomegaly classification contained 13,786 images of 8,940 unique patients, each matched with corresponding demographic data. This dataset was then partitioned into training, testing, and validation sets using patient identifiers to prevent data leakage. Each set had approximately 65% prevalence of cardiomegaly. 5.4 Model Setup and Training 5.4.1 Multimodal Breast Tumor Classifier Figure 5.1 provides an overview of the architecture for the multimodal breast cancer classifica- tion model. For the image encoder, ResNet50 was adapted to accommodate a single-channel input. The classification head was removed, an average pooling followed by a fully connected layer was added after layer 4 bottleneck 2 in the PyTorch implementation of ResNet50, effectively encoding the 2048 convolution features from the image to 10 deep features for fusion. As discussed in [96], the choice of encoded dimension is arbitrary and has no significant impact on model performance. For the tabular inputs we use standard scaling and one-hot-encoding. Table 5.1 shows an overview of the input modalities and their corresponding encoding schemes. 5.4.2 Multimodal Cardiomegaly Classifier For the cardiomegaly classifier, we modified the pre-trained Densenet121-res224-chex (DenseNet 224×224 model trained on CheXpert dataset) to generate deep features for the image modality. The 1024-dimensional vector from the final dense-layer was passed through a fully connected layer that 50 Figure 5.1: Architecture of the hybrid fusion model used for classifying breast MRI’s using multimodal data. Resnet50 is used to extract fusion features from images while tabular inputs are pre-processed using standard scalar and one hot encoding. All weights are learnable and the model is trained end-to-end. Table 5.1: Description of Inputs used for training the multimodal breast tumor classifier and the corresponding encoding schemes. Input Modality X1 MRI Image X2 Age Tabular X3 Max Intensity Tabular X4 Breast Density Tabular X5 MRI Indication Tabular Encoder (f) ResNet50 Standard Scaler Standard Scaler One-hot-encoder One-hot-encoder Encoded Dimension 1 × 10 1 × 1 1 × 1 1 × 3 1 × 3 returned a 10-dimensional encoded image vector. The choice of encoding dimension was arbitrary and did not impact model performance significantly. Weights of the pre-trained DenseNet were frozen, whereas weights for last fully connected layer were learned during training of the multimodal classifier. Table 5.2 lists model inputs and their corresponding encoding methodologies. Table 5.2: Description of Inputs used for training the multimodal cardiomegaly classifier and the corresponding encoding schemes. 
Input Modality X1 Radiograph Image X2 Age Tabular X3 Gender Tabular X4 Tabular X5 Marital Status Tabular X6 Ethnicity Tabular Insurance Encoder (f) DenseNet Standard Scaler One-hot-encoder One-hot-encoder One-hot-encoder One-hot-encoder Encoded Dimension 1 × 10 1 × 1 1 × 2 1 × 3 1 × 4 1 × 6 51 The breast tumor and cardiomegaly models were trained using the Adam optimizer [125], with a fixed learning rate of 1 × 10−4 and without any learning rate scheduling. The models were trained for a total of approximately 200 epochs or until the validation area under the curve (AUC) failed to demonstrate any improvement in the last 20 epochs, whichever occurred first. The PyTorch [126] default settings were utilized for model weight initialization except for where pre-trained weights were used. The training process was executed on a single NVIDIA Tesla V100S GPU, and the average epoch time was approximately 17 seconds for the breast tumor classifier and approximately 40 seconds for the cardiomegaly classifier. Weights from the epoch with the highest validation AUC were selected and used in further analysis of the trained model. 5.5 Results 5.5.1 Breast Tumor Classification The trained breast tumor classification model has an AUC of 0.868 on a hold-out test dataset. Model sensitivity and specificity on the test dataset are 0.795 and 0.796 respectively, using the optimal threshold of 0.52. This performance is summarized in Figure 5.2. Figure 5.3 displays correctly and incorrectly classified samples from the test set and provides examples of a true positive, true negative, false positive, and false negative instance. (a) Confusion matrix (b) Receiver operating characteristic curve Figure 5.2: Performance of the the trained breast tumor classifier on test set. Figure 5.4 presents the input importance estimates. In the absence of ground truth feature importance, we validated our results by assessing the agreement between the different importance 52 (a) True positive (b) True Negative (c) False Positive (d) False Negative Figure 5.3: Examples of different classification outcomes of the trained breast tumor classifier on the test set. methods. Table 5.3 displays the cosine similarity and RMSE between methods. The average (AVG) calculates the mean importance across all these methods and renormalizes the features to sum to one. The results demonstrate a high level of agreement, with cosine similarity ≥ 0.97 and RMSE ≤ 0.11, among the importance values generated by the different estimation methods. Figure 5.4: Comparison of normalized feature importance results and associated feature ranks using gradient, permutation, LIME, and shapely values based methods for the multimodal breast tumor classifier. AVG reports the mean importance across the four methods. Another approach to validate these importance estimates is via expert opinion. For breast tumor classification, the importance estimation aligns well with expert intuition, as majority of the tumor-related information is contained in the MRI image. Furthermore, the MRI indication, which 53 Table 5.3: Pairwise cosine similarity and RMSE between normalized input importance estimated by proposed methods for the multimodal breast tumor classifier. 
Cosine Similarity GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG 1.00 0.99 1.00 0.99 1.00 0.97 1.00 0.99 0.97 0.99 0.99 0.99 1.00 0.99 1.00 1.00 0.97 0.99 0.99 1.00 0.99 0.97 0.99 1.00 0.99 RMSE GRAD PERM LIME SHAP AVG 0.02 0.05 0.03 0.07 0.00 0.00 0.06 0.03 0.07 0.02 0.06 0.00 0.04 0.11 0.05 0.03 0.04 0.00 0.10 0.03 0.07 0.11 0.10 0.00 0.07 includes a ‘known cancer’ category, can provide significant insights to aid in the classification decision. Table 5.4: Predicted performance of multimodal breast tumor classifier in the case of a single missing input. Imputed AVG True Accuracy Predicted Accuracy Input MRI Age Max Intensity Breast Density Indication importance Mean 0.708 0.790 0.793 0.792 0.724 0.368 0.068 0.050 0.069 0.446 STD Mean 0.685 0.010 0.772 0.009 0.778 0.009 0.772 0.009 0.662 0.010 STD RMSE 0.025 0.007 0.018 0.008 0.016 0.008 0.020 0.008 0.063 0.007 Equation (5.1) is used to predict the performance using AVG importance of imputed inputs. Once we had estimates for input importance of our model, we then used them to predict the model’s performance for missing inputs. For non-categorical data, missing inputs were replaced with their mean value, while for categorical data, the most frequent category from the training set was used. We used accuracy as the Score(.) function in (5.1) during evaluation. For missing inputs, the imputation was done at the fusion layer. Non-categorical inputs were imputed with the mean value, whereas categorical inputs were replaced with the most frequent category value. To assess 54 the effectiveness of our approach, we measured the predicted accuracy using (5.1) and compared it with the computed accuracy value after imputation, referred to as true accuracy. We used 200 bootstrap realizations of the test set to obtain the mean and standard deviation of the prediction and true accuracy. Table 5.4 presents the mean predicted and mean experimental accuracies for cases of test data with one missing input using the AVG estimated importances in (5.1). For the case of a single missing input, our proposed linear relationship is able to predict the model performance within less than 3% for most features and within 7% when Indication was missing. The analysis was then applied to cases where more than one input was absent. We designed experiments that encompassed all possible permutations of present and absent inputs. For each experiment, we predicted the missing input model performance using our proposed linear relation- ship in (5.1), and compared the predicted performance with the actual performance on a test set, where the corresponding inputs were replaced with their mean values. Figure 5.5 illustrates these comparisons, contrasting the predictive and experimental missing input model performance for four distinct importance estimation methods. Each datum point on the plot represents an experiment with a unique combination of present and absent inputs. Figure 5.6 shows the proportionality between AVG importance of missing inputs and degrada- tion in model performance, supporting our hypothesis. We show results from our proposed linear relationship (5.1) and the best linear unbiased estimator (BLUE) [127] of the true drop in model accuracy. Our proposed linear relationship predicts that for missing inputs with cumulative impor- tance of 0.1 normalized units (n.u.) the model’s accuracy decreases from its reference value by 2.89%. This is similar to the BLUE prediction of a 3.22% drop in model accuracy. 
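As a concrete illustration of this prediction step, the sketch below assumes that (5.1) takes the same form as its sample-level counterpart (6.11) introduced later: the drop from the reference score is the aggregated importance of the missing inputs times the gap between the reference score and the score with every input imputed. The AVG importances are as read from Table 5.4, while the reference (0.792) and all-imputed (roughly 0.50) accuracies are illustrative reads from Table 5.5; the function and variable names are placeholders.

```python
import numpy as np

def predicted_score(ref_score, floor_score, importances, missing):
    """Linear relationship of Eq. (5.1): the drop from the reference score is
    proportional to the aggregated importance of the missing inputs.
    floor_score is the score with every input imputed."""
    agg_importance = sum(importances[k] for k in missing)
    return ref_score - agg_importance * (ref_score - floor_score)

def bootstrap_accuracy(y_true, y_pred, n_boot=200, seed=0):
    """Mean and std of accuracy over bootstrap resamples of an (imputed) test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        accs.append(np.mean(y_true[idx] == y_pred[idx]))
    return float(np.mean(accs)), float(np.std(accs))

# AVG importances for the breast tumor classifier, as read from Table 5.4.
importances = {"mri": 0.368, "age": 0.068, "max_intensity": 0.050,
               "breast_density": 0.069, "indication": 0.446}

# Predicted accuracy when the MRI image is mean-imputed at the fusion layer;
# this evaluates to roughly 0.68, close to the predicted value in Table 5.4.
print(predicted_score(0.792, 0.50, importances, missing=["mri"]))

# Illustrative "true" accuracy estimate on a synthetic imputed test set.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 300)
y_pred = np.where(rng.random(300) < 0.69, y_true, 1 - y_true)
print(bootstrap_accuracy(y_true, y_pred))
```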
In approximately 70% of the experiments, the prediction of performance loss due to missing input falls within a 5% error margin as detailed in Table 5.5. For all experiments the predicted drop in model performance is highly correlated (𝜌=0.92) with the missing input importance. 55 Figure 5.5: Comparison of predicted and true breast tumor classification performance reduction as a function of missing input importance using gradient (GRAD), permutation (PERM), LIME, and shapely values (SHAP). The Pearson correlation coefficient, 𝜌, is between the model test performance and aggregated importance of missing inputs. Figure 5.6: Breast tumor classification performance reduction as a function of missing input importance. This presents predictions, "Pred", using AVG method. BLUE represents the best linear fit of the true drop in model test accuracy. The Pearson correlation coefficient, 𝜌, between the drop in model test performance and the sum importance of missing inputs. 56 Table 5.5: Predicted and true performance of multimodal breast tumor classifier in the case of multiple missing input. Each row corresponds to a different experiment with a unique subset of missing inputs. The predicted accuracy is obtained using (5.1), and the experimental accuracy is the computed accuracy on an imputed test set. Imputed input Aggregated True Accuracy Predicted Accuracy Img. Age MIP B.Den. Ind. ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ Importance Mean 0.792 0.725 0.792 0.724 0.790 0.725 0.791 0.725 0.789 0.716 0.790 0.717 0.781 0.715 0.782 0.716 0.689 0.549 0.689 0.549 0.708 0.540 0.709 0.540 0.723 0.506 0.724 0.506 0.728 0.501 0.731 0.499 0.000 0.446 0.069 0.514 0.050 0.496 0.119 0.565 0.068 0.513 0.136 0.582 0.118 0.564 0.186 0.632 0.368 0.814 0.436 0.882 0.418 0.864 0.487 0.932 0.435 0.881 0.504 0.950 0.486 0.931 0.554 1.000 STD Mean 0.792 0.009 0.666 0.010 0.772 0.009 0.645 0.010 0.778 0.009 0.647 0.010 0.758 0.009 0.627 0.010 0.772 0.009 0.646 0.010 0.753 0.009 0.626 0.010 0.758 0.009 0.628 0.010 0.737 0.010 0.609 0.011 0.689 0.010 0.559 0.011 0.669 0.011 0.540 0.011 0.669 0.010 0.539 0.011 0.650 0.010 0.520 0.011 0.667 0.013 0.540 0.010 0.648 0.012 0.520 0.010 0.650 0.009 0.521 0.011 0.631 0.010 0.499 0.011 STD RMSE 0.004 0.009 0.060 0.006 0.020 0.008 0.080 0.008 0.014 0.008 0.078 0.007 0.034 0.008 0.098 0.007 0.018 0.009 0.070 0.008 0.037 0.008 0.092 0.007 0.025 0.008 0.087 0.007 0.045 0.008 0.107 0.008 0.011 0.007 0.015 0.009 0.023 0.007 0.015 0.010 0.040 0.007 0.009 0.010 0.060 0.007 0.022 0.011 0.057 0.007 0.034 0.009 0.077 0.007 0.015 0.010 0.079 0.007 0.020 0.010 0.100 0.007 0.000 0.011 Key: Img=MRI image, MIP=MIP maximum intensity, B.Den=Breast density, Ind=Indication. 57 5.5.2 Cardiomegaly Classification The trained model for cardiomegaly classification model shows good classification performance with an AUC of 0.896, a sensitivity of 0.84, and specificity of 0.83 on a hold-out test dataset, with an optimal threshold of 0.32. Figure 5.7 illustrates model performance on test data and Figure 5.8 shows sample images corresponding to classification outcomes of true positive, true negative, false positive, and false negative for the trained cardiomegaly classifier from the test set. (a) Confusion matrix (b) Receiver operating characteristic curve Figure 5.7: Performance of the the trained cardiomegaly classifier on test set. 
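The same validation-by-agreement check is applied to the cardiomegaly importance estimates in what follows. A minimal sketch of the pairwise agreement computation is given below; the importance vectors are illustrative placeholders, not the reported estimates.

```python
import numpy as np
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity between two normalized importance vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rmse(u, v):
    """Root-mean-square error between two importance vectors."""
    return float(np.sqrt(np.mean((np.asarray(u) - np.asarray(v)) ** 2)))

# Illustrative normalized importance vectors (one entry per input modality).
estimates = {
    "GRAD": np.array([0.37, 0.07, 0.05, 0.07, 0.44]),
    "PERM": np.array([0.35, 0.08, 0.05, 0.08, 0.44]),
    "LIME": np.array([0.40, 0.06, 0.04, 0.06, 0.44]),
    "SHAP": np.array([0.36, 0.07, 0.06, 0.07, 0.44]),
}
# AVG: mean importance across methods, renormalized to sum to one.
avg = np.mean(list(estimates.values()), axis=0)
estimates["AVG"] = avg / avg.sum()

for a, b in combinations(estimates, 2):
    print(f"{a} vs {b}: cos = {cosine_similarity(estimates[a], estimates[b]):.2f}, "
          f"RMSE = {rmse(estimates[a], estimates[b]):.2f}")
```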
For the input importance estimates in Figure 5.9, the cosine similarity and RMSE values between the importance estimation methods are shown in Tables 5.6, demonstrating a strong consensus among the importance values estimated by the four distinct methods. The importance estimation, validated by agreement, also aligns well with our intuition. The majority of the information regarding heart size is present in the radiographs, and while there (a) True positive (b) True Negative (c) False Positive (d) False Negative Figure 5.8: Examples of different classification outcomes of the trained cardiomegaly classifier on the test set. 58 Figure 5.9: Comparison of normalized feature importance results and associated feature ranks using gradient, permutation, LIME, and shapely values based methods for the multimodal cardiomegaly classifier. AVG reports the mean importance across the four methods. Table 5.6: Pairwise Cosine similarity and RMSE between normalized input importances estimated by different methods for the multimodal cardiomegaly classifier. Cosine Similarity GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG GRAD PERM LIME SHAP AVG 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 0.99 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 RMSE GRAD PERM LIME SHAP AVG 0.03 0.03 0.04 0.04 0.00 0.00 0.06 0.01 0.06 0.03 0.06 0.01 0.08 0.00 0.04 0.06 0.00 0.07 0.01 0.03 0.01 0.07 0.00 0.08 0.04 59 is not a causal relationship, certain ethnicities have been demonstrated to have elevated risks for cardiovascular diseases and hypertension [128, 129], which can be primary contributors to cardiomegaly [130, 131]. Consequently, these findings not only offer us an understanding of the inputs on which the models depend, but also help us calibrate our confidence in the model’s prediction based on domain knowledge. Next, we used the estimated importances to provide an additional layer of interpretability by understanding the model limitations under missing inputs. Table 5.7 illustrates the comparison between the actual and predicted model performance in cases with a single missing input. The predicted missing input model performance lies within 3% of the true value. Table 5.7: Predicted performance of multimodal cardiomegaly classifier in the case of a single missing input. Imputed AVG True Accuracy Predicted Accuracy Input Radiograph Age Gender Insurance Marital st. Ethnicity importance Mean 0.646 0.832 0.834 0.831 0.834 0.829 0.836 0.028 0.011 0.018 0.033 0.078 STD Mean 0.671 0.004 0.827 0.004 0.831 0.004 0.829 0.004 0.827 0.004 0.818 0.004 STD RMSE 0.025 0.004 0.005 0.004 0.003 0.004 0.003 0.004 0.007 0.004 0.011 0.003 Equation (5.1) is used to predict the performance using AVG importance of imputed inputs. For cases with multiple missing inputs, the experiments, each of which is represented by a point on the plots in Figure 5.10, generate model performance predictions that lie within 5% error margin. Comparison of predicted performance drop from (5.1) with predicted performance drop from BLUE of the true drop in model accuracy are illustrated in Figure 5.11. Equation (5.1) predicts that for missing inputs with cumulative importance of 0.1 normalized units (n.u.) the accuracy of cardiomegaly classifier will drop from its reference value by 1.93%. Compared to the BLUE prediction that for missing inputs with cumulative importance of 0.1 n.u. the models accuracy will drop from its reference value by 2.24%. 
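To make the comparison with the BLUE explicit, the sketch below fits the drop in test accuracy against the aggregated importance of the missing inputs. Under the usual homoskedastic-error assumptions the best linear unbiased estimator of such a linear model reduces to an ordinary least-squares fit; the arrays here are illustrative placeholders for the per-experiment results, not the reported data.

```python
import numpy as np

# One entry per missing-input experiment (illustrative values).
agg_importance = np.array([0.00, 0.07, 0.12, 0.19, 0.44, 0.55, 0.88, 1.00])
true_accuracy  = np.array([0.79, 0.78, 0.78, 0.76, 0.72, 0.70, 0.57, 0.50])

# Drop in accuracy relative to the reference (no inputs missing).
drop = true_accuracy[0] - true_accuracy

# Ordinary least-squares line through the (importance, drop) pairs.
slope, intercept = np.polyfit(agg_importance, drop, deg=1)
print(f"BLUE-style fit: drop ~ {slope:.3f} * importance + {intercept:.3f}")
print(f"Predicted drop at 0.1 n.u. of missing importance: "
      f"{100 * (slope * 0.1 + intercept):.2f}%")

# Pearson correlation between missing-input importance and accuracy drop.
rho = np.corrcoef(agg_importance, drop)[0, 1]
print(f"Pearson rho = {rho:.2f}")
```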
In contrast to the breast tumor classification, the cardiomegaly dataset contains a single dominant input that governs the primary trend in model 60 performance, resulting in a high correlation between input importance and true missing input model performance. However, upon further examination of the two clusters of experiments in Figure 5.11 (those with and without radiographs), we find that the importance remains highly correlated with model performance even within the clusters. This further demonstrates that performance reduction is correlated with input importance. Figure 5.10: Comparison of predicted and true cardiomegaly classification performance reduction as a function of missing input importance in the case of one or more missing inputs using proposed methods. 𝜌 is the Pearson correlation coefficient between the model test performance and aggre- gated importance of missing inputs. 61 Figure 5.11: Cardiomegaly classification performance reduction as a function of missing input importance. Presents predictions, "Pred", using AVG importances. BLUE represents the best linear fit of the true drop in model test accuracy. 𝜌 is the Pearson correlation coefficient between the drop in model test performance and sum importance of missing inputs. 5.6 Conclusion In this study, we introduced a unified framework for estimating the importance of multimodal inputs in fusion-based multimodal neural networks. Previous interpretability work involving mul- timodal data had employed fixed feature extractors to obtain deep features from each modality [94, 132]. A novelty of our approach lay in the fact that the fusion model, including the feature extractors, was trained end-to-end for a specific task. Consequently, the features extracted were those fine-tuned, likely most pertinent, to a particular classification task. Our unified multimodal input importance framework was agnostic to the type of estimation methods used, allowing us to utilize a range of importance estimation methods. Another strength of our proposed framework was that the importance estimates did not rely on the input data dimension, allowing us to compare, for example, the importance of a 224 × 224 image input to a 1 × 3 categorical input. We addressed the challenge of validating the importance estimates by testing the framework in a controlled environment with synthetic data, custom decision functions, and complete control over the ground truth feature importance values. Our framework was then applied to provide insights into the decision-making logic of two multimodal classifiers trained to classify breast tumors and cardiomegaly from multimodal data. With this real data, we did not have ground truth 62 Table 5.8: Predicted and true performance of multimodal cardiomegaly classifier in the case of multiple missing input. Each row corresponds to a different experiment with a unique subset of missing inputs. The predicted accuracy is obtained using (5.1), and the experimental accuracy is the computed accuracy on an imputed test set. Imputed input Agg. True Accuracy Predicted Accuracy Img. Age Gen. Ins. Mar. Eth. 
✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ STD Mean 0.833 0.004 0.817 0.004 0.827 0.004 0.811 0.004 0.829 0.004 0.814 0.004 0.823 0.004 0.808 0.003 0.831 0.004 0.816 0.004 0.824 0.004 0.809 0.004 0.827 0.004 0.812 0.004 0.821 0.004 0.806 0.004 0.827 0.004 0.813 0.004 0.821 0.004 0.806 0.004 0.824 0.004 0.809 0.004 0.818 0.003 0.803 0.004 0.825 0.004 0.810 0.004 0.819 0.004 0.804 0.003 0.821 0.004 0.806 0.004 0.815 0.004 0.801 0.004 0.671 0.004 0.657 0.004 0.665 0.005 0.649 0.004 0.668 0.004 STD RMSE 0.001 0.004 0.011 0.003 0.008 0.004 0.019 0.003 0.003 0.004 0.013 0.004 0.010 0.004 0.021 0.003 0.003 0.004 0.014 0.003 0.010 0.004 0.021 0.003 0.005 0.004 0.016 0.003 0.013 0.003 0.023 0.004 0.005 0.004 0.016 0.003 0.012 0.003 0.024 0.003 0.008 0.004 0.019 0.003 0.015 0.003 0.026 0.003 0.008 0.004 0.019 0.004 0.015 0.004 0.026 0.003 0.010 0.004 0.021 0.003 0.017 0.004 0.029 0.003 0.025 0.004 0.011 0.005 0.023 0.004 0.007 0.005 0.021 0.004 Imp. Mean 0.833 0.000 0.829 0.078 0.834 0.033 0.830 0.111 0.831 0.018 0.828 0.096 0.833 0.051 0.829 0.129 0.833 0.011 0.829 0.089 0.834 0.044 0.830 0.122 0.832 0.029 0.828 0.107 0.833 0.062 0.829 0.140 0.832 0.028 0.829 0.106 0.833 0.061 0.830 0.139 0.832 0.046 0.827 0.124 0.833 0.079 0.828 0.157 0.833 0.039 0.829 0.117 0.834 0.072 0.830 0.150 0.832 0.057 0.827 0.135 0.832 0.090 0.830 0.168 0.646 0.836 0.646 0.913 0.642 0.869 0.642 0.947 0.647 0.854 63 Table 5.8 (cont’d.). Imputed input Agg. True Accuracy Predicted Accuracy Img. Age Gen. Ins. Mar. Eth. ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ Imp. Mean 0.646 0.932 0.642 0.887 0.643 0.965 0.645 0.846 0.645 0.924 0.642 0.880 0.642 0.957 0.645 0.864 0.645 0.942 0.642 0.898 0.642 0.975 0.639 0.864 0.640 0.942 0.640 0.897 0.640 0.975 0.639 0.882 0.640 0.960 0.640 0.915 0.640 0.993 0.640 0.875 0.639 0.953 0.639 0.908 0.639 0.986 0.640 0.893 0.639 0.971 0.640 0.926 0.639 1.000 STD Mean 0.652 0.004 0.661 0.004 0.646 0.005 0.669 0.004 0.654 0.004 0.663 0.004 0.648 0.004 0.665 0.004 0.650 0.004 0.660 0.004 0.644 0.004 0.665 0.005 0.651 0.005 0.660 0.005 0.645 0.005 0.662 0.005 0.648 0.005 0.656 0.005 0.641 0.005 0.664 0.005 0.649 0.005 0.657 0.005 0.642 0.005 0.660 0.004 0.645 0.005 0.654 0.005 0.638 0.005 STD RMSE 0.007 0.005 0.019 0.004 0.004 0.005 0.024 0.004 0.009 0.005 0.021 0.004 0.006 0.005 0.020 0.004 0.005 0.005 0.018 0.004 0.003 0.005 0.026 0.004 0.011 0.005 0.020 0.004 0.005 0.005 0.023 0.004 0.008 0.005 0.016 0.004 0.001 0.005 0.024 0.004 0.009 0.004 0.018 0.004 0.003 0.005 0.021 0.004 0.006 0.005 0.014 0.005 0.001 0.005 ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ Table continued. Key : Agg. Imp.=Aggregrated Importance, Img=Chest radiograph, Gen=Gender, Ins=Insurance, Mar=Marital status, Eth=Ethnicity feature importance knowledge and therefore validated our importance estimates by quantifying the agreement across estimates returned by different methods. Furthermore, the estimated AVG importances aligned well with expert intuition and passed the validation by agreement test. We further enhanced the model’s interpretability by using the estimated importances to predict the model’s performance in the special case of missing inputs. Our goal was to provide non- technical users with an understanding of the model’s prediction reliability in terms of accuracy. 
64 We hypothesized that the degradation in model performance in the absence of certain inputs was proportional to the importance of those inputs. We designed numerous experiments to test how closely our prediction of the model performance aligned with the true model performance in the absence of inputs. Our results across two different multimodal datasets and two different fusion- based classifiers showed a high correlation between the importance of missing inputs and model performance, supporting our hypothesis. A limitation of our approach was the use of a linear relationship between input importance and missing input model performance, which might not adequately capture the combined importance of inputs. Despite this limitation, we consistently ob- served a high correlation between input importance and missing input model performance. Future work could explore different non-linear relationships. This study represented a step towards pro- viding an additional layer of understanding of the model’s limitations and operational capabilities. It also aided in answering questions related to cost-benefit analysis, such as the value of acquiring additional input data on a patient when the performance degradation might be minimal. 65 CHAPTER 6 TOWARDS SAMPLE-LEVEL RELIABILITY ESTIMATION 6.1 Introduction In Chapter 5, we discussed the use of model accuracy as a human-interpretable measure of model reliability. However, this analysis was conducted on set-level, wherein these metrics were estimated across the entire validation dataset, failing to offer individual sample-specific insights. Traditional deep network designs yield sample-specific predictions without accompanying reliability measures. Ideally, we would like to obtain sample-specific reliability estimates to quantify confidence in each individual prediction. Various approaches have been proposed to obtain sample-wise measures of prediction reliability that align with accuracy. Bayesian neural networks induce probabilistic outputs by placing prior distributions over network weights and propagating this uncertainty through to predictions [133]. Dropout sampling at test time enables uncertainty approximation through Monte Carlo simulation by running predictions on multiple dropout masked versions of the model [134]. Conformal prediction provides a distribution-free framework to derive prediction intervals guaranteed to contain new samples at a specified confidence level based on a calibration set [135]. Existing measures of sample-level reliability often rely on specific architectural modifications in the model or generate variance estimates that are less intuitive and interpretable, especially for non-experts. Therefore, in this analysis, we aim to develop a sample-level reliability metric that is straightforward and understandable. To this end, we use local accuracy as an interpretable and tangible measure of reliability at the sample level. Local accuracy simply conveys the empirical performance of the model in the vicinity of a given sample, providing an accessible reliability quantification. The methods discussed in this chapter aim to map properties of the input sample to local accuracy through calibration techniques. The key insight is that model performance depends heavily on the data sample and can vary greatly across different regions of the input space. While performance metrics like accuracy, AUC, and cross-entropy provide a set-level average view, they fail to account 66 for this variability. 
A model may produce reliable predictions in some areas of the input space while faltering in others. Intuitively, if a model makes accurate predictions in the region around a given sample, then it is likely to be accurate on that individual sample as well. In this analysis we generate reliability estimates using local accuracy about a sample, that are transparent and meaningful to all users. We analyze model performance across different sample populations segmented based on sample properties. Calibrating these sample-specific attributes could produce granular reliability estimates tied to local accuracy. In summary, while the set- level performance prediction offers useful insights into model reliability, sample-specific analysis is needed. We aim to transition from set-level reliability to sample-level reliability which could increase model transparency, evaluation rigor, and safety for real-world deep learning systems. 6.2 Methods We defined the reliability of a model’s prediction for a single sample as the local accuracy of the model in a small neighborhood around that sample. The key intuition being that samples surrounded by other samples on which the model performs accurately are likely to also be classified correctly. To construct neighborhoods, we used samples that were similar to the target sample based on a pre-defined property of the input for example distance in the feature space, cosine similarity, or predicted probability. We then took samples within a radius threshold on that metric to form the neighborhood. By averaging model accuracy on those neighborhood samples, we generated a local reliability estimate for the target sample. Through this process, we constructed a reliability calibration curve (RCC) that relates the pre- defined sample property to local accuracy offering a nuanced insight into model performance by providing fine-grained reliability estimates for individual samples based on model performance in local neighborhoods. It is important to note that these RCCs can be constructed using any performance metric. However, we specifically chose accuracy due to its intuitive and accessible nature, making it easily interpretable not just for experts, but for laypeople as well. We explored four main approaches for constructing RCCs, each depending on different proper- ties of the input sample. 67 1. Mahalanobis distance: we calculated the Mahalanobis distance of each sample from the distribution of the training set in fusion feature space. The Mahalanobis-based RCC maps this distance to local accuracy. 2. Cosine similarity: we computed the cosine similarity between the fusion features for each sample and the training set. The cosine similarity-based RCC maps this similarity metric to local accuracy. 3. UMAP dimensionality reduction: we used the Uniform Manifold Approximation and Pro- jection (UMAP) method, where the encoded features of the multimodal input from the fusion layer were projected down to low dimensional space. The resultant calibration curve maps distance of the sample from the training data in the UMAP-reduced feature space to local accuracy. 4. Prediction probabilities: we used the model’s prediction probabilities to construct neighbor- hoods and build a RCC that maps the prediction probability to local accuracy. The RCCs were learned using validation data by regressing local accuracy against the chosen sample property across varied neighborhood sizes. Their performance was evaluated using a hold out test set. 
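All four variants rest on the same notion of local accuracy over a neighborhood defined by a one-dimensional sample property. A minimal sketch of that computation is given below; the property array could be any of the four quantities listed above, and the data are randomly generated placeholders.

```python
import numpy as np

def local_accuracy(prop, correct, center, radius):
    """Empirical accuracy over the neighborhood of samples whose property
    value lies within `radius` of `center` (the target sample's property)."""
    mask = np.abs(prop - center) <= radius
    if not mask.any():
        return np.nan
    return float(np.mean(correct[mask]))

# Illustrative validation-set quantities.
rng = np.random.default_rng(0)
prop = rng.normal(size=500)              # e.g. Mahalanobis distance per sample
correct = rng.random(500) < 0.9          # whether each prediction was correct

# Reliability estimate for a single test sample with property value 0.3.
print(local_accuracy(prop, correct, center=0.3, radius=0.25))
```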
We evaluated the RCCs on the following criteria:

• Granularity: The curve should account for a wide range of local accuracy values, providing more fine-grained reliability estimates.

• Convergence: As neighborhood size increases, the predicted local accuracy should approach the global accuracy on the full validation set.

• Generalization: We quantified generalization via the RMSE between the predicted local accuracy from the RCC and the true local accuracy on a holdout test set.

6.2.1 Mahalanobis Distance-based Reliability Calibration Curve (M-RCC)

Mahalanobis distance is a multivariate generalization of measuring the number of standard deviations a point is away from the mean of a distribution [136]. It equals zero when a point lies at the distribution mean and grows as the point moves away along the principal component axes. Mahalanobis distance has been used to successfully identify out-of-distribution or distribution-shifted samples by quantifying distance in the input or feature space [137]. The key intuition is that larger Mahalanobis distances indicate dissimilarity from the training distribution. Continuing with the notation introduced in Chapter 4, we define the Mahalanobis distance of a test sample Z from the set of encoded multimodal training inputs S_Z, in the fusion layer, as

D_M(Z, S_Z) = \sqrt{(Z - \bar{Z})^T \Sigma_Z^{-1} (Z - \bar{Z})},   (6.1)

where \bar{Z} is the mean of samples in the set S_Z, given by

\bar{Z} = \frac{1}{N} \sum_{Z \in S_Z} Z,   (6.2)

and \Sigma_Z is the covariance matrix for samples in the set S_Z, given by

\Sigma_Z = \frac{1}{N-1} \sum_{Z \in S_Z} (Z - \bar{Z})(Z - \bar{Z})^T.   (6.3)

We computed the Mahalanobis distance between the multimodal fusion features of a test sample and the distribution of the training set fusion features, providing a sample-specific measure of how well the model's internal representation aligns with the training data. We used these distances to construct neighborhoods around the test sample for estimating local model accuracy.

6.2.2 Cosine Similarity-based Reliability Calibration Curve (C-RCC)

The cosine similarity between a test sample Z and the mean \bar{Z} of the encoded training set S_Z, in the fusion layer, is given by

D_C(Z, S_Z) = \frac{Z \cdot \bar{Z}}{\|Z\|_2 \, \|\bar{Z}\|_2},   (6.4)

where the numerator denotes the dot product between the test sample and the mean of the training set, and \|\cdot\|_2 denotes the 2-norm of a vector. We used the cosine similarity between the input sample at inference time and the training data representations in the fusion layer of the neural network. The basic motivation is that test points located in sparse regions of the input space, far from the bulk of training data, will likely yield less reliable predictions [138]. This indicates that a functional relationship exists between the sample similarity and local model performance. The cosine similarity was therefore used to build the C-RCC.

6.2.3 UMAP-based Reliability Calibration Curve (U-RCC)

UMAP is a non-linear dimensionality reduction method [139]. Its goal is to find a low-dimensional embedding of the data that best preserves the global topological structure of the high-dimensional input data. We define the sets S_Z: the set of high-dimensional input data points, and S_z: the set of low-dimensional embeddings. UMAP finds a low-dimensional representation z ∈ S_z of the high-dimensional data Z ∈ S_Z that preserves global data structure. This is achieved by first constructing a graph in the high-dimensional space. For each sample, the nearest neighbors are computed and weights are assigned to the graph edges connecting the sample and its neighbors.
Weights w^{high}_{ij} between two samples Z_i and Z_j are calculated as

w^{high}_{ij} = \exp\left( -\frac{d(Z_i, Z_j) - \rho_i}{\sigma_i} \right),   (6.5)

where d(Z_i, Z_j) is the distance between points, \rho_i controls the local neighborhood size, and \sigma_i controls the fuzziness of neighborhoods. Next, a graph in the low-dimensional space is constructed. For the low-dimensional graph, the weights are computed as

w^{low}_{ij} = \frac{1}{1 + a \cdot d(z_i, z_j)^{2b}},   (6.6)

where a and b are hyperparameters of UMAP, d(·,·) is a distance function, and z_i and z_j are low-dimensional representations of the i-th and j-th sample, respectively. In order to learn the low-dimensional representation, the cross-entropy loss between the high-dimensional and low-dimensional graphs, given by

\mathcal{L} = \sum_{i,j} w^{high}_{ij} \log\left( \frac{w^{high}_{ij}}{w^{low}_{ij}} \right) + \left(1 - w^{high}_{ij}\right) \log\left( \frac{1 - w^{high}_{ij}}{1 - w^{low}_{ij}} \right),   (6.7)

is minimized using stochastic gradient descent. Rather than quantifying distances in the original fusion feature space, we first projected the fusion features into a lower-dimensional space that preserved the local structure of the data. We used UMAP to project the fusion features into a lower-dimensional space and then computed distances between the input sample and training data in this UMAP-reduced space to construct the UMAP-based reliability calibration curve (U-RCC). The UMAP-induced distance between a test sample Z and the mean \bar{Z} of the encoded training set S_Z, in the fusion layer, is given by

D_U(Z, S_Z) = \|Z - \bar{Z}\|_2,   (6.8)

where \|\cdot\|_2 denotes the 2-norm of a vector.

6.2.4 Prediction Probability-based Reliability Calibration Curve (P-RCC)

Most neural network-based classifiers use the softmax function in the final layer to generate probabilities of class labels. Given a vector V ∈ R^n from the last layer of an n-class classifier, the softmax function generates a vector P ∈ R^n representing a probability distribution over the model outputs, where for all i = 1, ..., n the entries of P are given by

p_i = \mathrm{Softmax}(V_i) = \frac{e^{V_i}}{\sum_{j=1}^{n} e^{V_j}},   (6.9)

where P = [p_1, ..., p_n]^T, all elements of the resultant vector lie in the range (0, 1), and \sum_{i=1}^{n} p_i = 1. We used the prediction probabilities generated by the model as the sample-specific attribute for generating the P-RCC.

6.2.5 Model Setup and Data

For evaluating our RCCs, we utilized the same data, model setup, and classification problems as described in Chapter 4 and detailed in Tables 4.2 and 4.3. This provides a controlled environment covering a variety of classification problems with existing trained models that can be readily used to test our new approach.
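Before an RCC can be fitted, the chosen sample property has to be computed from the fusion-layer features for every sample. The sketch below illustrates the four properties defined in (6.1), (6.4), (6.8), and (6.9); it assumes the fusion features are available as NumPy arrays and uses the third-party umap-learn package for the projection, so it is an approximation of the procedure rather than the exact implementation.

```python
import numpy as np
import umap  # third-party package: pip install umap-learn

def mahalanobis_distance(z, Z_train):
    """Eq. (6.1): distance of a fusion-feature vector from the training distribution."""
    mean = Z_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(Z_train, rowvar=False))
    diff = z - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

def cosine_similarity_property(z, Z_train):
    """Eq. (6.4): cosine similarity between a sample and the training-set mean."""
    mean = Z_train.mean(axis=0)
    return float(z @ mean / (np.linalg.norm(z) * np.linalg.norm(mean)))

def umap_distance(z, Z_train, n_components=2):
    """Eq. (6.8)-style distance, computed here in the UMAP-reduced feature space."""
    reducer = umap.UMAP(n_components=n_components).fit(Z_train)
    z_low = reducer.transform(z.reshape(1, -1))[0]
    center = reducer.embedding_.mean(axis=0)   # center of the embedded training data
    return float(np.linalg.norm(z_low - center))

def softmax_probability(logits):
    """Eq. (6.9): predicted probability of the positive class from final-layer logits."""
    e = np.exp(logits - logits.max())
    return float((e / e.sum())[1])

# Illustrative fusion features (training set and one test sample).
rng = np.random.default_rng(0)
Z_train = rng.normal(size=(200, 16))
z_test = rng.normal(size=16)
print(mahalanobis_distance(z_test, Z_train),
      cosine_similarity_property(z_test, Z_train),
      softmax_probability(np.array([0.2, 1.3])))
```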
The ordered pair data along with the weights were then used to fit a calibration curve h. This process was repeated for N_p bootstrap iterations of the validation set over a variety of bin sizes N_bin, generating N_p × N_bin calibration curves. The RCC was then generated as the mean of the calibration curves (h_{i,j}), where i denotes the i-th bootstrap iteration and j denotes the j-th bin size. The RCC mapping can be represented as

RCC : X → \mathrm{Accuracy}\left( \mathcal{N}_{Prop}(X) \right),   (6.10)

where \mathcal{N}_{Prop} is a neighborhood in Prop about Prop(X). The bootstrapping also provides confidence intervals around the generated RCC. An overview of the RCC generation approach is given in Algorithm 6.1.

Algorithm 6.1 RCC
1: Input: Validation set V_X, sample property Prop, number of bootstrap iterations (N_p), range of neighborhood sizes (N_bins).
2: for i ← 1 to N_p do
3:   Take a random subset V^i_X of the validation set.
4:   for j ← 1 to N_bins do
5:     Generate histogram of Prop values with j bins for V^i_X.
6:     Calculate local accuracy (Acc_l) for samples in each bin.
7:     for k ← 1 to j do
8:       Generate ordered pairs of average Prop value and local accuracy in the k-th bin, (Prop, Acc_l)_k.
9:       Calculate weights (w_k) corresponding to each ordered pair as the density of samples in the bin.
10:    end for
11:    Use the ordered pairs (Prop, Acc_l)_k weighted by w_k to fit a curve h_{i,j} for the i-th bootstrap with j bins.
12:  end for
13: end for
14: Use the curves h_{i,j} to compute the mean RCC and the 95% confidence interval.
15: return RCC

6.3 Results

As described earlier, we used three main metrics for evaluating the generated RCCs: granularity, convergence, and generalizability. All the RCCs satisfy the convergence property because for a single bin (all neighboring samples that share a similar sample property), local accuracy is equal to the set accuracy. Therefore, we focus the following discussion on assessing the generalization and granularity of the RCCs. Figure 6.1 illustrates the calibration curves generated for the classification problems highlighted in Table 6.1. The red dots, representing local neighborhoods, are ordered pairs of property value and local accuracy derived from bootstrapping the test set. These pairs are weighted according to the sample density in each neighborhood. The solid blue line depicts the RCC generated from the validation set using Algorithm 6.1, and the shaded area corresponds to a 95% confidence interval. Generalization of the RCC is evaluated by calculating the RMSE between the ordered pairs derived from the test data and the mean calibration curve.

Figure 6.1: Generated M-RCCs for problems 1, 7, and 8 (L-R) in Table 6.1. The mean calibration curve, depicted by the blue line, is constructed using validation data; the blue shaded region represents a 95% confidence interval. The red dots represent pairs of average Mahalanobis distance and local accuracy of test data, calculated over local neighborhoods, repeated for multiple bootstrap iterations. The test data is weighted based on the density of samples in the neighborhood.

Table 6.1: Validation results of the Mahalanobis distance based reliability calibration curve (M-RCC). The minimum and maximum local accuracies are reported ± standard deviation.
id | Classification problem | Average test accuracy | Granularity: min local acc. | Granularity: max local acc. | Generalization: RMSE
1 | ||Z_1||_1 | 0.95 | 0.90 ± 0.117 | 1.00 ± 0.043 | 0.0379
7 | \sum_{i=1}^{4} ||Z_i||_1 | 0.91 | 0.91 ± 0.018 | 1.00 ± 0.167 | 0.0496
8 | e^{Z_{2,7}} \ln(Z_{1,1} + Z_{1,2})^2 | 0.90 | 0.90 ± 0.008 | 1.00 ± 0.253 | 0.0532
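A minimal sketch of Algorithm 6.1 is given below, assuming a one-dimensional property array and per-sample correctness indicators from the validation set. The weighted polynomial used for the curve family h is an assumption for illustration, since the functional form of h is not fixed above, and the data are randomly generated placeholders.

```python
import numpy as np

def fit_rcc(prop, correct, n_boot=50, bin_range=(5, 20), degree=2, seed=0):
    """Algorithm 6.1 sketch: bin a sample property, compute local accuracy per bin,
    fit one weighted curve per (bootstrap, bin-count) pair, and average them."""
    rng = np.random.default_rng(seed)
    curves = []
    grid = np.linspace(prop.min(), prop.max(), 100)   # where the mean RCC is evaluated

    for _ in range(n_boot):                           # bootstrap the validation set
        idx = rng.integers(0, len(prop), size=len(prop))
        p, c = prop[idx], correct[idx]
        for n_bins in range(bin_range[0], bin_range[1] + 1):
            edges = np.histogram_bin_edges(p, bins=n_bins)
            centers, acc_local, weights = [], [], []
            for lo, hi in zip(edges[:-1], edges[1:]):
                mask = (p >= lo) & (p <= hi)
                if mask.sum() == 0:
                    continue
                centers.append(p[mask].mean())        # average property value in the bin
                acc_local.append(c[mask].mean())      # local accuracy in the bin
                weights.append(mask.mean())           # bin weight = sample density
            coeffs = np.polyfit(centers, acc_local, deg=degree, w=weights)
            curves.append(np.polyval(coeffs, grid))

    curves = np.array(curves)
    mean_rcc = curves.mean(axis=0)
    lo_band, hi_band = np.percentile(curves, [2.5, 97.5], axis=0)   # 95% band
    return grid, mean_rcc, (lo_band, hi_band)

# Illustrative validation data: property values and per-sample correctness.
rng = np.random.default_rng(1)
prop = rng.random(1000)
correct = (rng.random(1000) < (0.6 + 0.35 * prop)).astype(float)
grid, rcc, band = fit_rcc(prop, correct)
```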
Table 6.1 shows results for the Mahalanobis- based RCC. While the curve generalizes reasonably to the holdout test set, its granularity is limited. This means it does not reveal a wide range of local accuracy values and cannot provide fine-grained reliability measures. This suggests the Mahalanobis distance of a sample from the training set is not highly representative of the model’s local performance trends. While Mahalanobis distance has been successfully used to detect distribution shifts [137], it does not reveal local performance trends well. Similarly, the cosine similarity-based curves fail to capture informative local trends, with most curves centered around the set-level accuracy in Figure 6.2. These similarity-based approaches are often better suited for detecting out-of-distribution samples. The UMAP-based RCCs demonstrate good generalization in Table 6.3 but, like other distance- based methods, suffer from lack of granularity. The UMAP-projected fusion features in Figure 6.3 show that UMAP preserves the discriminative power of the fusion features, as the class-labeled plots remain separated. Since we construct U-RCC using the sample distance from the center of training distribution, we see the local accuracy increase with greater distances. 74 Figure 6.2: Generated C-RCCs for problems 1,7, and 8 (L-R) in Table 6.2 The mean calibration curve, depicted by the blue line, is constructed using validation data, the blue shaded region represents a 95% confidence interval. The red dots represent pairs of average cosine similarity and local accuracy of test data, calculated over local neighborhoods, repeated for multiple bootstrap iterations. The test data is weighted based on the density of samples in the neighborhood. Table 6.2: Validation results of the cosine similarity based calibration curve (C-RCC). id Classification Average test Granularity Generalization problem ||𝑍1||1 𝑖=1 ||𝑍𝑖 ||1 1 7 (cid:205)4 8 𝑒𝑍2,7 ln (cid:0)𝑍1,1 + 𝑍1,2 accuracy min local acc. max local acc. 0.95 0.91 0.90 0.93 ± 0.007 0.89 ± 0.051 0.87 ± 0.020 1.00 ± 0.013 0.95 ± 0.020 0.93 ± 0.013 (cid:1) 2 RMSE 0.0347 0.0542 0.0637 Figure 6.4 shows the generated RCCs using the model predicted probability for each sample. Our model outputs class probabilities from the softmax layer. The results demonstrate that in addition to reasonable generalization, the P-RCC based on softmax probabilities exhibits substantially more granularity compared to the previously discussed distance and similarity-based methods. This can be attributed to the fact that those approaches measured distances and similarities in the fusion layer, while the probability-based PRCC leverages the outputs from the final layer of the model. Therefore, it takes advantage of the full discriminative power of the end-to-end architecture trained specifically for this task. Since P-RCC significantly outperforms the previous metrics, we present full results for all explored cases in Table 6.4. The probability-based RCCs is a valuable tool that can enable fine-grained reliability quantification and interpretability compared to distance/similarity-based approaches in the fusion layer. 75 (a) Visualization of UMAP-reduced fusion features for problems 1,7, and 8 (L-R) in Table 6.3. Samples from all four input modalities are reduced to two discriminative features 𝑧1 and 𝑧2, colored by class label. The UMAP preserves global and local structure, maintaining class separation. 
The distance-based calibration curve below uses the UMAP distance of a test sample from the training distribution center. As distance increases towards the extremes, class separation improves and local accuracy increases. (b) Generated U-RCCs for problems 1,7, and 8 (L-R) in Table 6.3. The mean calibration curve, depicted by the blue line, is constructed using validation data, the blue shaded region represents a 95% confidence interval. The red dots represent pairs of average UMAP distance and local accuracy of test data, calculated over local neighborhoods, repeated for multiple bootstrap iterations. The test data is weighted based on the density of samples in the neighborhood. Figure 6.3: RCCs based on the euclidean distance of a sample from mean of training data in UMAP-projected space. Table 6.3: Validation results of the UMAP-based reliability calibration curve (U-RCC). id Classification Average test Granularity Generalization problem ||𝑍1||1 𝑖=1 ||𝑍𝑖 ||1 1 7 (cid:205)4 8 𝑒𝑍2,7 ln (cid:0)𝑍1,1 + 𝑍1,2 accuracy min local acc. max local acc. 0.95 0.91 0.90 0.81 ± 0.026 0.91 ± 0.008 0.76 ± 0.024 1.00 ± 0.003 0.99 ± 0.033 0.99 ± 0.013 (cid:1) 2 RMSE 0.0461 0.0496 0.0539 76 (a) Histogram of softmax probabilities generated by models for problems 1,7, and 8 (L-R) in Table 6.4. The bins of the histogram represent local neighborhoods in the validation set. (b) Generated P-RCCs for for problems 1,7, and 8 (L-R) in Table 6.4. The mean calibration curve, depicted by the blue line, is constructed using validation data, the blue shaded region represents a 95% confidence interval. The red dots represent pairs of average softmax probability and local accuracy of test data, calculated over local neighborhoods, repeated for multiple bootstrap iterations. The test data is weighted based on the density of samples in the neighborhood. Figure 6.4: RCCs based on model prediction probability. Table 6.4: Validation results of the prediction probability based calibration curve (P-RCC). id Classification Average test Granularity Generalization 3 1 2 problem ||𝑍1||1 ||𝑍2||1 ||𝑍3||1 ||𝑍4||1 4 5 (cid:205)𝑖=1,2 ||𝑍𝑖 ||1 6 (cid:205)𝑖=3,4 ||𝑍𝑖 ||1 7 (cid:205)4 𝑖=1 ||𝑍𝑖 ||1 8 𝑒𝑍2,7 ln (cid:0)𝑍1,1 + 𝑍1,2 accuracy min local acc. max local acc. RMSE 0.95 0.93 0.99 0.99 0.92 0.99 0.91 0.90 0.52 ± 0.044 0.59 ± 0.034 0.68 ± 0.065 0.74 ± 0.066 0.53 ± 0.034 0.68 ± 0.074 0.25 ± 0.052 0.46 ± 0.047 1.00 ± 0.062 1.00 ± 0.052 1.00 ± 0.048 1.00 ± 0.042 1.00 ± 0.057 1.00 ± 0.056 1.00 ± 0.177 1.00 ± 0.049 0.0882 0.0844 0.0667 0.0670 0.1040 0.0669 0.1163 0.0992 (cid:1) 2 77 6.4 RCCs in the Case of Missing Inputs After constructing and evaluating the reliability curves, we used them to extend the analysis to missing data scenarios. Specifically, we proposed a modified version of (5.1) to generate sample-level reliability estimates in the case of missing inputs: Local ScoreX𝑘 = Local ScoreΦ − ∑︁ 𝑘∈K imp(X𝑘 ) (Local ScoreΦ − ScoreX1:X𝑚). (6.11) Where the reference local score without missing data came from the RCC. In this formulation, the local accuracy with missing input 𝑋1 is proportional to the importance of 𝑋1 scaled by the reference local accuracy. This allowed us to estimate the impact of missing data on a sample at inference time using the multimodal input importance and RCC. Using the same framework as before, we estimated P-RCCs for missing input cases across the classifiers trained on problems in Table 6.4. To estimate the P-RCC with missing inputs, we used (6.11), relating local accuracy to the importance of the missing input. 
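A minimal sketch of how (6.11) shifts a fitted curve is shown below; reference_rcc stands in for the reference P-RCC evaluated on a probability grid, and the importance values and all-imputed floor score are illustrative placeholders.

```python
import numpy as np

def estimated_local_score(rcc_local_score, importances, missing, floor_score):
    """Eq. (6.11): scale the reference local score down in proportion to the
    aggregated importance of the missing inputs.
    rcc_local_score : local score from the RCC with no missing data (scalar or array)
    floor_score     : score with all inputs imputed."""
    agg = sum(importances[k] for k in missing)
    return rcc_local_score - agg * (rcc_local_score - floor_score)

# Illustrative reference curve: local accuracy as a function of softmax probability.
prob_grid = np.linspace(0.5, 1.0, 6)
reference_rcc = np.array([0.55, 0.65, 0.75, 0.85, 0.92, 0.98])

# Illustrative importances with one dominant input.
importances = {"X1": 0.97, "X2": 0.02, "X3": 0.01, "X4": 0.00}

# Estimated P-RCC when the dominant input X1 is mean-imputed at the fusion layer.
shifted = estimated_local_score(reference_rcc, importances,
                                missing=["X1"], floor_score=0.5)
for p, est in zip(prob_grid, shifted):
    print(f"prob {p:.1f}: estimated local accuracy {est:.2f}")
```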
We used this to predict how the calibration curve would change for different missing inputs. We compared the estimated P-RCC to the true P-RCC generated on modified validation data where the missing input was mean- imputed. We perform this analysis for each classifier, removing one input at a time and imputing it with the mean value. Figure 6.5 shows results for a subset of classification problems, clearly demonstrating that the model prediction probability distribution and the calibration curve behavior change drastically when an important input is missing, while remaining relatively unchanged for less important inputs. Our estimated P-RCC aligns well with the true P-RCC. This supports our initial intuition that the drop in local model performance, caused by missing inputs, is proportional to the importance of the missing input. However, since we use a simple linear relationship, and testing is on controlled classification tasks, further ablation experiments are needed to robustly estimate sample-level reliability under missing data. 78 (a) Average feature importances from left to right: 𝑋1 = 1.00, 𝑋2 = 0.00, 𝑋3 = 0.00, 𝑋4 = 0.00. Top row: softmax probability distributions on the validation set for each missing input case. Bottom row: True P-RCC generated on data with missing input (blue line) compared to predicted PRCC estimated from (6.11) (red line). When the most important input 𝑋1 is missing, the softmax probability distribution shifts lower and the true P-RCC drops steeply, aligned with the prediction. With less important inputs 𝑋2, · · · , 𝑋4 missing, the probability and PRCC remain relatively unchanged, also matched by the estimate. (b) Average feature importances from left to right: 𝑋1 = 0.25, 𝑋2 = 0.25, 𝑋3 = 0.25, 𝑋4 = 0.25. Top row: softmax probability distributions on the validation set for each missing input case. Bottom row: True P-RCC generated on data with missing input (blue line) compared to predicted PRCC estimated from (6.11) (red line). Since all inputs are equally important, shifts in the softmax probability distribution and P-RCC are consistent across missing input cases. Figure 6.5: Comparison of estimated and true P-RCCs for holdout test data in the case of missing multimodal inputs. (a) shows results for problem 1, (b) shows results for problem 7, and (c) shows results for problem 8 in Table 6.4. 6.5 Conclusion and Future Work This work explored RCCs for multimodal neural networks to provide sample-level reliability quantification. While promising, more work needs to be done to address granularity limitations for the distance-based methods. Representative sample properties need to be explored to construct bet- 79 Figure 6.5 (cont’d.). (c) Average feature importances from left to right: 𝑋1 = 0.97, 𝑋2 = 0.02, 𝑋3 = 0.00, 𝑋4 = 0.00. Top row: softmax probability distributions on the validation set for each missing input case. Bottom row: True P-RCC generated on data with missing input (blue line) compared to predicted PRCC estimated from (6.11) (red line). shifts in the softmax probability distribution and P-RCC are aligned with estimates and are proportional to the importance of missing inputs. ter calibration curves. For example, this current work relies on one dimensional sample properties; future work could look for RCC relationships between model performance and multi-dimensional representations of a sample. 
As the current distance and similarity metrics lack sufficient granular- ity, potential future work may achieve better localization by looking at class-wise distances. Additionally, the simplicity of the importance-based RCC estimation provided reasonable but preliminary reliability estimates for missing inputs. More extensive validation across diverse problems is needed to fully develop a robust methodology for missing data scenarios. In conclusion, this preliminary work highlighted several opportunities to refine the RCC ap- proach, including improving localization, validating on more complex examples, identifying opti- mal model layers for distance based methods, and applying to real classification problems. Address- ing these limitations is an important next step in improving sample-level reliability quantification in multimodal models. The methods developed here provide a foundation to build on, through improvements in granularity, generalization, and missing data handling. 80 CHAPTER 7 CONCLUSION This dissertation explored methods to improve the human-interpretability of multimodal deep learning models for healthcare applications. We proposed a unified framework for estimating input feature importance in multimodal classifiers. Validation on synthetic data with ground truth importances showed our approach could accurately recover true feature importance. Analysis of real multimodal tumor classification and cardiomegaly detection models provided intuitive explanations of the black-box models. Our framework was agnostic to the underlying importance estimation technique, providing flexibility. By comparing importance across multimodal inputs, we gained insight into how different data types like images, text, and lab tests contributed to predictions. To further enhance interpretability, we used the estimated importances to predict how model performance degraded with missing inputs. Across two clinical tasks, we showed input importance was strongly correlated with a drop in accuracy when that input was removed. This will enable understanding of model limitations and cost-benefit analysis for acquiring additional patient data. While our results demonstrated a strong correlation between performance degradation and input importance, future work could explore other functional relationships between the two. Additionally, our analysis has centered on binary classification problems using accuracy as the evaluation metric. An important next step would be extending the techniques to handle multi-class tasks, addressing extreme class imbalance, and leveraging metrics such as balanced accuracy to improve robustness. With these kinds of expansions, the foundations established in this dissertation can continue to mature. We also constructed reliability calibration curves to quantify model reliability at the sample level. Initial results demonstrated the promise of this approach for per-sample reliability estimation. At the same time, they revealed opportunities to enhance the granularity and expand validation across diverse real-world tasks. Our missing data importance model provided reasonable prelimi- nary reliability estimates, establishing a foundation to build upon through further refinement and extensive real-world testing. 81 In conclusion, this dissertation makes significant strides in advancing the interpretability of multimodal deep learning for healthcare applications through input importance estimations and novel reliability calibration techniques. 
While work remains in improving localization, expanding validation, and clinical translation, the critical foundations have been laid. The opportunities uncovered to refine the methodology highlight the fruitful research directions ahead. Overall, this dissertation establishes a robust framework and springboard for increasing interpretability of powerful multimodal AI systems poised to transform medicine. 82 BIBLIOGRAPHY [1] [2] [3] [4] [5] [6] [7] [8] [9] V. Kaul, S. Enslin, and S. A. Gross, “History of artificial intelligence in medicine,” Gas- trointestinal endoscopy, vol. 92, no. 4, pp. 807–812, 2020. A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., “Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai,” Information fusion, vol. 58, pp. 82–115, 2020. T. Speith, “A review of taxonomies of explainable artificial intelligence (xai) methods,” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 2239–2250, 2022. S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, and M. P. Lungren, “Fusion of medi- cal imaging and electronic health records using deep learning: a systematic review and implementation guidelines,” NPJ digital medicine, vol. 3, no. 1, p. 136, 2020. S. M. Wraith, J. S. Aikins, W. J. Clancey, L. M. Fagan, W. J. van Melle, B. G. Buchanan, R. Davis, A. C. Scott, E. H. Shortliffe, S. G. Axline, et al., “Computerized consultation system for selection of antimicrobial therapy,” American Journal of Hospital Pharmacy, vol. 33, no. 12, pp. 1304–1308, 1976. P. Szolovits, R. S. Patil, and W. B. Schwartz, “Artificial intelligence in medical diagnosis,” Annals of internal medicine, vol. 108, no. 1, pp. 80–87, 1988. S.-C. B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M. T. Freedman, and S. K. Mun, “Artificial convolution neural network for medical image pattern recognition,” Neural networks, vol. 8, no. 7-8, pp. 1201–1214, 1995. A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” nature, vol. 542, no. 7639, pp. 115–118, 2017. J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O’Donoghue, D. Visentin, et al., “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature medicine, vol. 24, no. 9, pp. 1342–1350, 2018. [10] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al., “Chexnet radiologist-level pneumonia detection on chest x-rays with deep learning,” arXiv preprint arXiv:1711.05225, 2017. [11] K.-H. Yu, A. L. Beam, and I. S. Kohane, “Artificial intelligence in healthcare,” Nature biomedical engineering, vol. 2, no. 10, pp. 719–731, 2018. [12] J. Burrell, “How the machine ‘thinks’: Understanding opacity in machine learning algo- rithms,” Big data & society, vol. 3, no. 1, p. 2053951715622512, 2016. 83 [13] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan, “Dissecting racial bias in an algorithm used to manage the health of populations,” Science, vol. 366, no. 6464, pp. 447– 453, 2019. [14] A. J. Larrazabal, N. Nieto, V. Peterson, D. H. Milone, and E. Ferrante, “Gender imbal- ance in medical imaging datasets produces biased classifiers for computer-aided diagnosis,” Proceedings of the National Academy of Sciences, vol. 117, no. 23, pp. 
12592–12594, 2020. [15] E. Röösli, S. Bozkurt, and T. Hernandez-Boussard, “Peeking into a black box, the fairness and generalizability of a mimic-iii benchmarking model,” Scientific Data, vol. 9, no. 1, p. 24, 2022. [16] C. Liu, X. Liu, F. Wu, M. Xie, Y. Feng, and C. Hu, “Using artificial intelligence (watson for oncology) for treatment recommendations amongst chinese patients with lung cancer: feasibility study,” Journal of medical Internet research, vol. 20, no. 9, p. e11087, 2018. [17] B. Institution, “Risks and remedies for artificial intelligence in health care,” 2021. Accessed: 2023-06-14. [18] D. S. Watson, J. Krutzinna, I. N. Bruce, C. E. Griffiths, I. B. McInnes, M. R. Barnes, and L. Floridi, “Clinical applications of machine learning algorithms: beyond the black box,” Bmj, vol. 364, 2019. [19] E. Vayena, A. Blasimme, and I. G. Cohen, “Machine learning in medicine: addressing ethical challenges,” PLoS medicine, vol. 15, no. 11, p. e1002689, 2018. [20] L. Hoffman, E. Benedetto, H. Huang, E. Grossman, D. Kaluma, Z. Mann, and J. Torous, “Augmenting mental health in primary care: a 1-year study of deploying smartphone apps in a multi-site primary care/behavioral health integration program,” Frontiers in psychiatry, p. 94, 2019. [21] A. Holzinger, B. Haibe-Kains, and I. Jurisica, “Why imaging data alone is not enough: Ai-based integration of imaging, omics, and clinical data,” European Journal of Nuclear Medicine and Molecular Imaging, vol. 46, pp. 2722–2730, 2019. [22] C. Macrae, “Governing the safety of artificial intelligence in healthcare,” BMJ quality & safety, vol. 28, no. 6, pp. 495–498, 2019. [23] OECD, “Recommendation of the council on artificial intelligence.” https://legalinstruments. oecd.org/en/instruments/OECD-LEGAL-0449, 2019. Accessed: 2023-07-18. [24] N. I. of Standards and T. (NIST), “Nist explainable ai workshop summary.” https://www.nist. gov/publications/nist-explainable-ai-workshop-summary, 2020. Accessed: 2023-07-18. [25] U. Congress, “H.r.6216 - national artificial intelligence initiative act of 2020.” https://www. congress.gov/bill/116th-congress/house-bill/6216, 2020. Accessed: 2023-07-18. [26] D. Gunning and D. Aha, “Darpa’s explainable artificial intelligence (xai) program,” AI magazine, vol. 40, no. 2, pp. 44–58, 2019. 84 [27] U. Congress, “H.r.2231 - algorithmic accountability act of 2019.” https://www.congress. gov/bill/116th-congress/house-bill/2231, 2020. Accessed: 2023-07-18. [28] F. T. Commission, un- https://www.ftc.gov/system/files/documents/reports/ derstanding big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf, 2016. Accessed: 2023-07-18. “Big data: issues.” inclusion or exclusion? A tool the for [29] “Explaining C. Office, I. https://ico.org.uk/media/for-organisations/guide-to-data-protection/key-dp-themes/ explaining-decisions-made-with-artificial-intelligence-1-0.pdf, 2020. Accessed: 2023-07- 18. decisions made with intelligence.” artificial [30] E. Union, “Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (general data protection regulation).” https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016. Accessed: 2023-07-18. [31] E. 
Union, “Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending cer- tain union legislative acts.” https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX: 52021PC0206, 2021. Accessed: 2023-07-18. [32] Food, D. Administration, et al., “Proposed regulatory framework for modifications to ar- tificial intelligence/machine learning (ai/ml)-based software as a medical device (samd),” 2019. [33] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, “A survey of methods for explaining black box models,” ACM computing surveys (CSUR), vol. 51, no. 5, pp. 1–42, 2018. [34] E. Shortliffe, Computer-based medical consultations: MYCIN, vol. 2. Elsevier, 2012. [35] C. Molnar, Interpretable machine learning. Lulu. com, 2020. [36] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001. [37] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra, “Grad-cam: Why did you say that?,” arXiv preprint arXiv:1611.07450, 2016. [38] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp. 818–833, Springer, 2014. [39] R. K. Mothilal, A. Sharma, and C. Tan, “Explaining machine learning classifiers through diverse counterfactual explanations,” in Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 607–617, 2020. 85 [40] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee, “Counterfactual visual explana- tions,” 2019. [41] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al., “Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav),” in International conference on machine learning, pp. 2668–2677, PMLR, 2018. [42] R. Caruana, H. Kangarloo, J. D. Dionisio, U. Sinha, and D. Johnson, “Case-based explanation of non-case-based learning methods.,” in Proceedings of the AMIA Symposium, p. 212, American Medical Informatics Association, 1999. [43] P. W. Koh and P. Liang, “Understanding black-box predictions via influence functions,” 2020. [44] A. Nguyen, J. Yosinski, and J. Clune, “Understanding neural networks via feature visualiza- tion: A survey,” 2019. [45] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: deep learning for interpretable image recognition,” Advances in neural information processing systems, vol. 32, 2019. [46] E. A. Barnes, R. J. Barnes, Z. K. Martin, and J. K. Rader, “This looks like that there: Interpretable neural networks for image tasks when location matters,” Artificial Intelligence for the Earth Systems, vol. 1, no. 3, p. e220001, 2022. [47] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015. [48] J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” International Journal of Computer Vision, vol. 129, pp. 1789–1819, 2021. [49] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature machine intelligence, vol. 1, no. 5, pp. 206–215, 2019. [50] A. d. Garcez and L. C. Lamb, “Neurosymbolic ai: The 3 rd wave,” Artificial Intelligence Review, pp. 1–20, 2023. [51] A. Mahendran and A. 
[52] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730, 2015.
[53] S. Rüping et al., “Learning interpretable models,” 2006.
[54] A. A. Freitas, “Comprehensible classification models: A position paper,” SIGKDD Explor. Newsl., vol. 15, pp. 1–10, Mar. 2014.
[55] A. Jacovi, A. Marasović, T. Miller, and Y. Goldberg, “Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 624–635, 2021.
[56] D. Wang, Q. Yang, A. Abdul, and B. Y. Lim, “Designing theory-driven user-centric explainable AI,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15, 2019.
[57] Z. C. Lipton, “The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery,” Queue, vol. 16, no. 3, pp. 31–57, 2018.
[58] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” 2019.
[59] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” CoRR, vol. abs/1312.6034, 2014.
[60] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[61] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, Mar. 2003.
[62] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, “Feature selection: A data perspective,” ACM Computing Surveys, vol. 50, pp. 1–45, Jan. 2018.
[63] M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should I trust you? Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, ACM, 2016.
[64] Y. Hechtlinger, “Interpretation of prediction models using the input gradient,” 2016.
[65] M. Wojtas and K. Chen, “Feature importance ranking for deep learning,” CoRR, vol. abs/2010.08973, 2020.
[66] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLOS ONE, vol. 10, p. e0130140, July 2015.
[67] Y. Li, C.-Y. Chen, and W. W. Wasserman, “Deep feature selection: Theory and application to identify enhancers and promoters,” Journal of Computational Biology, vol. 23, pp. 322–336, May 2016.
[68] B. Skrlj, S. Dzeroski, N. Lavrac, and M. Petkovic, “Feature importance estimation with self-attention networks,” in Proceedings of the 24th European Conference on Artificial Intelligence, 2020.
[69] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[70] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
[71] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
[72] A. Datta, S. Sen, and Y. Zick, “Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems,” in 2016 IEEE Symposium on Security and Privacy (SP), pp. 598–617, IEEE, 2016.
[73] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International Conference on Machine Learning, pp. 3319–3328, PMLR, 2017.
[74] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, “SmoothGrad: removing noise by adding noise,” arXiv preprint arXiv:1706.03825, 2017.
[75] P. W. Koh and P. Liang, “Understanding black-box predictions via influence functions,” in International Conference on Machine Learning, pp. 1885–1894, PMLR, 2017.
[76] R. Fong, M. Patrick, and A. Vedaldi, “Understanding deep networks via extremal perturbations and smooth masks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2950–2958, 2019.
[77] W. J. Murdoch and A. Szlam, “Automatic rule extraction from long short term memory networks,” arXiv preprint arXiv:1702.02540, 2017.
[78] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das, “Explanations based on the missing: Towards contrastive explanations with pertinent negatives,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[79] M. T. Ribeiro, S. Singh, and C. Guestrin, “Anchors: High-precision model-agnostic explanations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
[80] G. Plumb, D. Molitor, and A. S. Talwalkar, “Model agnostic supervised local explanations,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[81] E. Štrumbelj and I. Kononenko, “Explaining prediction models and individual predictions with feature contributions,” Knowledge and Information Systems, vol. 41, no. 3, pp. 647–665, 2014.
[82] N. H. Pijls, B. de Bruyne, K. Peels, P. H. van der Voort, H. J. Bonnier, J. Bartunek, and J. J. Koolen, “Measurement of fractional flow reserve to assess the functional severity of coronary-artery stenoses,” New England Journal of Medicine, vol. 334, no. 26, pp. 1703–1708, 1996.
[83] P. A. Tonino, B. De Bruyne, N. H. Pijls, U. Siebert, F. Ikeno, M. van’t Veer, V. Klauss, G. Manoharan, T. Engstrøm, K. G. Oldroyd, et al., “Fractional flow reserve versus angiography for guiding percutaneous coronary intervention,” New England Journal of Medicine, vol. 360, no. 3, pp. 213–224, 2009.
[84] P. D. Morris, D. Ryan, A. C. Morton, R. Lycett, P. V. Lawford, D. R. Hose, and J. P. Gunn, “Virtual fractional flow reserve from coronary angiography: modeling the significance of coronary lesions: results from the VIRTU-1 (virtual fractional flow reserve from coronary angiography) study,” JACC: Cardiovascular Interventions, vol. 6, no. 2, pp. 149–157, 2013.
[85] A. Coenen, Y.-H. Kim, M. Kruk, C. Tesche, J. De Geer, A. Kurata, M. L. Lubbers, J. Daemen, L. Itu, S. Rapaka, et al., “Diagnostic accuracy of a machine-learning approach to coronary computed tomographic angiography-based fractional flow reserve: result from the MACHINE consortium,” Circulation: Cardiovascular Imaging, vol. 11, no. 6, p. e007217, 2018.
[86] V. R. Taqueti, L. J. Shaw, N. R. Cook, V. L. Murthy, N. R. Shah, C. R. Foster, J. Hainer, R. Blankstein, S. Dorbala, and M. F. Di Carli, “Excess cardiovascular risk in women relative to men referred for coronary angiography is associated with severely impaired coronary flow reserve, not obstructive disease,” Circulation, vol. 135, no. 6, pp. 566–577, 2017.
[87] D. F. Young and F. Y. Tsai, “Flow characteristics in models of arterial stenoses—I. Steady flow,” Journal of Biomechanics, vol. 6, no. 4, pp. 395–410, 1973.
[88] B. Seeley and D. Young, “Effect of geometry on pressure losses across models of arterial stenoses,” Journal of Biomechanics, vol. 9, pp. 439–448, Jan. 1976.
[89] F. A. Duck, Physical properties of tissues: a comprehensive reference book. Academic Press, 2013.
[90] Z. Guo, X. Li, H. Huang, N. Guo, and Q. Li, “Deep learning-based image segmentation on multimodal medical imaging,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 3, no. 2, pp. 162–169, 2019.
[91] T. Xu, H. Zhang, X. Huang, S. Zhang, and D. N. Metaxas, “Multimodal deep learning for cervical dysplasia diagnosis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 115–123, Springer, 2016.
[92] A. Akselrod-Ballin, M. Chorev, Y. Shoshan, A. Spiro, A. Hazan, R. Melamed, E. Barkan, E. Herzel, S. Naor, E. Karavani, et al., “Predicting breast cancer by applying deep learning to linked health records and mammograms,” Radiology, vol. 292, no. 2, pp. 331–342, 2019.
[93] X. Ma and F. Jia, “Brain tumor classification with multimodal MR and pathology images,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Revised Selected Papers, Part II, pp. 343–352, Springer, 2020.
[94] L. R. Soenksen, Y. Ma, C. Zeng, L. Boussioux, K. Villalobos Carballo, L. Na, H. M. Wiberg, M. L. Li, I. Fuentes, and D. Bertsimas, “Integrated multimodal artificial intelligence framework for healthcare applications,” NPJ Digital Medicine, vol. 5, no. 1, p. 149, 2022.
[95] R. Yan, F. Zhang, X. Rao, Z. Lv, J. Li, L. Zhang, S. Liang, Y. Li, F. Ren, C. Zheng, et al., “Richer fusion network for breast cancer classification based on multimodal data,” BMC Medical Informatics and Decision Making, vol. 21, no. 1, pp. 1–15, 2021.
[96] G. Holste, S. C. Partridge, H. Rahbar, D. Biswas, C. I. Lee, and A. M. Alessio, “End-to-end learning of fused image and non-image features for improved breast cancer classification from MRI,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3294–3303, 2021.
[97] T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems, pp. 1–15, Springer Berlin Heidelberg, 2000.
[98] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “ModDrop: Adaptive multi-modal gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, pp. 1692–1706, Aug. 2016.
[99] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” 2011.
[100] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders, “Early versus late fusion in semantic video analysis,” in ACM International Conference on Multimedia, pp. 399–402, 2005.
[101] H. Gunes and M. Piccardi, “Affect recognition from face and body: early fusion vs. late fusion,” in 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3437–3443, 2005.
[102] D. Ramachandram and G. W. Taylor, “Deep multimodal learning: A survey on recent advances and trends,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 96–108, 2017.
[103] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
[104] P. Atrey, M. Hossain, A. El Saddik, and M. Kankanhalli, “Multimodal fusion for multimedia analysis: A survey,” Multimedia Systems, vol. 16, pp. 345–379, Nov. 2010.
[105] Y. Li, M. El Habib Daho, P.-H. Conze, H. Al Hajj, S. Bonnin, H. Ren, N. Manivannan, S. Magazzeni, R. Tadayoni, B. Cochener, et al., “Multimodal information fusion for glaucoma and diabetic retinopathy classification,” in International Workshop on Ophthalmic Medical Image Analysis, pp. 53–62, Springer, 2022.
[106] S. Niu, Q. Yin, Y. Song, Y. Guo, and X. Yang, “Label dependent attention model for disease risk prediction using multimodal electronic health records,” in 2021 IEEE International Conference on Data Mining (ICDM), pp. 449–458, IEEE, 2021.
[107] P. Vijayaraghavan, H. Larochelle, and D. Roy, “Interpretable multi-modal hate speech detection,” arXiv preprint arXiv:2103.01616, 2021.
[108] I. Gat, I. Schwartz, and A. Schwing, “Perceptual score: What data modalities does your model perceive?,” Advances in Neural Information Processing Systems, vol. 34, pp. 21630–21643, 2021.
[109] H. Suresh, N. Hunt, A. Johnson, L. A. Celi, P. Szolovits, and M. Ghassemi, “Clinical intervention prediction and understanding using deep networks,” arXiv preprint arXiv:1705.08498, 2017.
[110] S. El-Sappagh, J. M. Alonso, S. R. Islam, A. M. Sultan, and K. S. Kwak, “A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease,” Scientific Reports, vol. 11, no. 1, p. 2660, 2021.
[111] W. Jin, X. Li, and G. Hamarneh, “Evaluating explainable AI on a multi-modal medical imaging task: Can existing algorithms fulfill clinical requirements?,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 11945–11953, 2022.
[112] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, and B. Ni, “MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification,” arXiv preprint arXiv:2110.14795, 2021.
[113] N. Z. Abidin, A. R. Ismail, and N. A. Emran, “Performance analysis of machine learning algorithms for missing value imputation,” International Journal of Advanced Computer Science and Applications, vol. 9, no. 6, 2018.
[114] E. T. Capariño, A. M. Sison, and R. P. Medina, “Application of the modified imputation method to missing data to increase classification performance,” in 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), pp. 134–139, IEEE, 2019.
[115] K. N. Vokinger, S. Feuerriegel, and A. S. Kesselheim, “Continual learning in medical devices: FDA’s action plan and beyond,” The Lancet Digital Health, vol. 3, no. 6, pp. e337–e338, 2021.
[116] A. R. T. Donders, G. J. Van Der Heijden, T. Stijnen, and K. G. Moons, “A gentle introduction to imputation of missing values,” Journal of Clinical Epidemiology, vol. 59, no. 10, pp. 1087–1091, 2006.
[117] P. Jonsson and C. Wohlin, “An evaluation of k-nearest neighbour imputation using Likert data,” in 10th International Symposium on Software Metrics, 2004. Proceedings., pp. 108–118, IEEE, 2004.
[118] D. B. Rubin, “The calculation of posterior distributions by data augmentation: Comment: A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: The SIR algorithm,” Journal of the American Statistical Association, vol. 82, no. 398, pp. 543–546, 1987.
[119] L. Tran, X. Liu, J. Zhou, and R. Jin, “Missing modalities imputation via cascaded residual autoencoder,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1405–1414, 2017.
[120] A. Johnson, M. Lungren, Y. Peng, Z. Lu, R. Mark, S. Berkowitz, and S. Horng, “MIMIC-CXR-JPG - chest radiographs with structured labels,” PhysioNet, 2019.
[121] A. E. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. J. Pollard, B. Moody, B. Gow, L.-w. H. Lehman, et al., “MIMIC-IV, a freely accessible electronic health record dataset,” Scientific Data, vol. 10, no. 1, p. 1, 2023.
[122] A. Johnson, L. Bulgarelli, T. Pollard, L. A. Celi, R. Mark, and S. Horng, “MIMIC-IV-ED,” PhysioNet, 2021.
[123] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al., “CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597, 2019.
[124] J. P. Cohen, J. D. Viviano, P. Bertin, P. Morrison, P. Torabian, M. Guarrera, M. P. Lungren, A. Chaudhari, R. Brooks, M. Hashir, et al., “TorchXRayVision: A library of chest X-ray datasets and models,” in International Conference on Medical Imaging with Deep Learning, pp. 231–249, PMLR, 2022.
[125] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015.
[126] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[127] C. R. Henderson, “Best linear unbiased estimation and prediction under a selection model,” Biometrics, pp. 423–447, 1975.
[128] N. Chaturvedi, “Ethnic differences in cardiovascular disease,” Heart, vol. 89, no. 6, pp. 681–686, 2003.
[129] A. K. Kurian and K. M. Cardarelli, “Racial and ethnic differences in cardiovascular disease risk factors,” Ethnicity & Disease, vol. 17, no. 1, pp. 143–152, 2007.
[130] “Cardiomegaly.” https://www.ncbi.nlm.nih.gov/books/NBK542296/. Accessed: 2023-06-12.
[131] “Enlarged heart - symptoms and causes.” https://www.mayoclinic.org/diseases-conditions/enlarged-heart/symptoms-causes/syc-20355436. Accessed: 2023-06-12.
[132] Q. McNamara, A. De La Vega, and T. Yarkoni, “Developing a comprehensive framework for multimodal feature extraction,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1567–1574, 2017.
[133] Y. Kwon, J.-H. Won, B. J. Kim, and M. C. Paik, “Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation,” Computational Statistics & Data Analysis, vol. 142, p. 106816, 2020.
[134] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of The 33rd International Conference on Machine Learning (M. F. Balcan and K. Q. Weinberger, eds.), vol. 48 of Proceedings of Machine Learning Research, (New York, New York, USA), pp. 1050–1059, PMLR, 20–22 Jun 2016.
[135] G. Shafer and V. Vovk, “A tutorial on conformal prediction,” Journal of Machine Learning Research, vol. 9, no. 3, 2008.
[136] P. C. Mahalanobis, “On the generalized distance in statistics,” Proceedings of the National Institute of Sciences (Calcutta), vol. 2, pp. 49–55, 1936.
[137] J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan, “A simple fix to Mahalanobis distance for improving near-OOD detection,” arXiv preprint arXiv:2106.09022, 2021.
[138] K. Lee, H. Lee, K. Lee, and J. Shin, “Training confidence-calibrated classifiers for detecting out-of-distribution samples,” arXiv preprint arXiv:1711.09325, 2017.
[139] L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018.

APPENDIX

Open Source Code

Code developed for Chapter 3 is available at https://github.com/MA/FFR.
Code developed for Chapter 4 is available at https://github.com/MA/SYN.
Code developed for Chapter 5 is available at https://github.com/MA/MII.